Scrapy is a free, open-source web crawling framework written in Python. It was originally designed for web scraping, and it can also be used to extract data returned by APIs. Scrapy started at Mydeco, an e-commerce company based in London; the first public release came in 2008, and maintenance was handed over to Scrapinghub in 2011. This post shows how to use Scrapy to scrape the newest questions on Stack Overflow (each question's title and URL) and how to save that data to MongoDB.
Installing the required packages
MongoDB
My machine runs Mac OS X. The first step is to install and configure MongoDB; for the detailed steps, see my other post: Installing MongoDB with Homebrew on Mac OS X.
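For quick reference, the Homebrew route looked roughly like the following at the time of writing; the exact formula name can vary between Homebrew versions, so treat this as a sketch rather than the authoritative procedure:

$ brew update
$ brew install mongodb
$ sudo mkdir -p /data/db    # default data directory used by mongod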
Scrapy
Install Scrapy with pip inside a virtualenv-managed Python virtual environment (for setting up the virtual environment, see my other post: Managing Python Project Environments with virtualenv):
$ pip install Scrapy
$ pip freeze > requirements.txt
PyMongo
Install PyMongo with pip:
$ pip install pymongo
$ pip freeze > requirements.txt
On success, the terminal prints something like the following:
Collecting pymongo
Downloading pymongo-3.2.2-cp27-none-macosx_10_11_intel.whl (262kB)
100% |████████████████████████████████| 266kB 411kB/s
Installing collected packages: pymongo
Successfully installed pymongo-3.2.2
Creating the Scrapy project
According to the official Scrapy documentation, you run scrapy startproject in whichever folder you want the project code to live in. For example, my Python virtual environment is engchen (located under /Users/chenxin/ProjectsEnv), and I create a project with the same name, engchen, saved under /Users/chenxin/PycharmProjects, so the command is:
(engchen) MacBookPro:PycharmProjects chenxin$ scrapy startproject engchen
On success, the terminal prints:
New Scrapy project 'engchen', using template directory '/Users/chenxin/ProjectsEnv/engchen/lib/python2.7/site-packages/scrapy/templates/project', created in:
/Users/chenxin/PycharmProjects/engchen
You can start your first spider with:
cd engchen
scrapy genspider example example.com
This command creates a project root directory named engchen, containing the files and folders of the basic project template (an alternative spider bootstrap using scrapy genspider is sketched after the tree):
├── scrapy.cfg
└── engchen
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
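As the startproject output hints, you could also let Scrapy generate a spider skeleton with scrapy genspider instead of writing the file by hand as we do below; the spider name used here (questions) is only an illustration:

$ cd engchen
$ scrapy genspider questions stackoverflow.com

This creates a spider module under engchen/spiders/ with the name and domain pre-filled; the rest of this post creates the spider file manually instead.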
Specifying the data to scrape
The items.py module defines the storage containers for the data we are about to scrape.
Open items.py: it already contains a class named EngchenItem, which, as you can see, inherits from Scrapy's Item base class.
Add the fields we want to collect and update items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class EngchenItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    url = Field()
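An Item behaves much like a Python dict with a fixed set of fields, which is what lets the pipeline later convert it with dict(item). A small, hypothetical interactive check (not part of the project code):

from engchen.items import EngchenItem

item = EngchenItem(title="Example question", url="/questions/1/example")
print(item['title'])   # fields are read like dict keys
print(dict(item))      # the whole item converts cleanly to a plain dict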
Create a file named engchen_spider.py inside the spiders folder. This is where we direct the spider to the data we want; the file is specific to one site and cannot be reused to scrape data from other sites.
Define a class that inherits from Scrapy's Spider base class:
# -*- coding: utf-8 -*-
from scrapy import Spider


class EngchenSpider(Spider):
    name = "engchen"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions?sort=newest"]
A few notes on these attributes:
- name defines the spider's name.
- allowed_domains lists the base domains the spider is allowed to crawl.
- start_urls is the list of URLs the spider starts crawling from.
XPath selectors
Scrapy uses XPath selectors to extract data from web pages; in other words, a given XPath expression picks out a specific part of the HTML. The Selectors section of the Scrapy documentation introduces XPath as follows:
XPath is a language for selecting nodes in XML documents, which can also be used with HTML
Now open Stack Overflow in Chrome and work out the XPath we need. Hover over the title of the first question, then right-click -> Inspect:
Find the XPath of the corresponding [div class="summary"] element: //*[@id="question-summary-37872090"]/div[2]. You can check in the Console tab of Chrome DevTools whether this XPath selects the first question by using the $x syntax; run in the Console:
As you can see, it selects exactly the markup of the first question, and the h3 inside it is the question's title.
How do we adjust the XPath so that it selects every question on the page? Simple: use //div[@class="summary"]/h3. What does that mean? This XPath selects the h3 child of every div whose class is summary.
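Before touching the spider, you can verify the expression in the Scrapy shell; a minimal sketch (the exact results depend on whatever the page contains when you run it):

$ scrapy shell "http://stackoverflow.com/questions?sort=newest"
>>> len(response.xpath('//div[@class="summary"]/h3'))        # should be 50 on a full page
>>> response.xpath('//div[@class="summary"]/h3').extract_first()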
Now update engchen_spider.py:
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector


class EngchenSpider(Spider):
    name = "engchen"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions?sort=newest"]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
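Wrapping the response in Selector works, but the response object already exposes the same selector API, so an equivalent, slightly shorter form of parse (just a variation, not what the rest of this post uses) would be:

    def parse(self, response):
        questions = response.xpath('//div[@class="summary"]/h3')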
Parsing the data
Selecting the questions alone is not enough; we need to extract each question's title and link. Update engchen_spider.py:
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector

from engchen.items import EngchenItem


class EngchenSpider(Spider):
    name = "engchen"
    allowed_domains = ["stackoverflow.com"]
    start_urls = ["http://stackoverflow.com/questions?sort=newest"]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = EngchenItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract()[0]
            yield item
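One caveat: extract()[0] raises an IndexError if a summary block ever lacks the expected link, which would abort the parse. A slightly more defensive sketch (an optional tweak; the log output below was produced with the version above) uses extract_first() and skips incomplete entries:

    def parse(self, response):
        for question in response.xpath('//div[@class="summary"]/h3'):
            item = EngchenItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract_first()
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract_first()
            # only yield items that actually have both fields
            if item['title'] and item['url']:
                yield item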
Testing the spider
Run the following inside the engchen project folder:
$ scrapy crawl engchen
The terminal prints:
2016-06-17 12:51:03 [scrapy] INFO: Scrapy 1.1.0 started (bot: engchen)
2016-06-17 12:51:03 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'engchen.spiders', 'SPIDER_MODULES': ['engchen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'engchen'}
2016-06-17 12:51:03 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-06-17 12:51:03 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-17 12:51:03 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-17 12:51:03 [scrapy] INFO: Enabled item pipelines:
[]
2016-06-17 12:51:03 [scrapy] INFO: Spider opened
2016-06-17 12:51:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-17 12:51:03 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-17 12:51:15 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/robots.txt> (referer: None)
2016-06-17 12:51:26 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/questions?sort=newest> (referer: None)
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'how to display picture in picture object using path in crystal report C#.net',
'url': u'/questions/37873352/how-to-display-picture-in-picture-object-using-path-in-crystal-report-c-net'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u"C language paese file's detail",
'url': u'/questions/37873351/c-language-paese-files-detail'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Shibboleth custom password flow',
'url': u'/questions/37873350/shibboleth-custom-password-flow'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Httpurlconnection getresponsecode throws eof exception. Tried the max try method',
'url': u'/questions/37873348/httpurlconnection-getresponsecode-throws-eof-exception-tried-the-max-try-method'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Is a Spring MVC App with Thymeleaf RESTful?',
'url': u'/questions/37873347/is-a-spring-mvc-app-with-thymeleaf-restful'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'How to bind an arbitrary number of values with procedural-style MySQLi prepared statement?',
'url': u'/questions/37873346/how-to-bind-an-arbitrary-number-of-values-with-procedural-style-mysqli-prepared'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Use Ffmpeg on android with linux commands',
'url': u'/questions/37873345/use-ffmpeg-on-android-with-linux-commands'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Play Framework 2.3.x: Wrap Request object using Scala Oauth in Play Framework',
'url': u'/questions/37873343/play-framework-2-3-x-wrap-request-object-using-scala-oauth-in-play-framework'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'FTP Support ChromeCast',
'url': u'/questions/37873341/ftp-support-chromecast'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u"Regex to find matches words that is in a line that doesn't start with",
'url': u'/questions/37873339/regex-to-find-matches-words-that-is-in-a-line-that-doesnt-start-with'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'ParseEroor in android Volley',
'url': u'/questions/37873337/parseeroor-in-android-volley'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Rails: maintaining application-dependent data',
'url': u'/questions/37873334/rails-maintaining-application-dependent-data'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Cannot Enable Autoexposure via V4L2',
'url': u'/questions/37873331/cannot-enable-autoexposure-via-v4l2'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'how do i use dct to extract features from image?',
'url': u'/questions/37873327/how-do-i-use-dct-to-extract-features-from-image'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'How to disable/ lock one page in viewpager?',
'url': u'/questions/37873326/how-to-disable-lock-one-page-in-viewpager'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Need Text finder and convert to PDF file for .DWG files',
'url': u'/questions/37873324/need-text-finder-and-convert-to-pdf-file-for-dwg-files'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'spring mvc: the difference between DeferredResult and ListenableFuture?',
'url': u'/questions/37873322/spring-mvc-the-difference-between-deferredresult-and-listenablefuture'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'.AsReadOnly() not included PCL despite it being listed as supported in MSDN',
'url': u'/questions/37873317/asreadonly-not-included-pcl-despite-it-being-listed-as-supported-in-msdn'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'j-query function to update value onchange',
'url': u'/questions/37873314/j-query-function-to-update-value-onchange'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Why Are format specifiers used in C',
'url': u'/questions/37873312/why-are-format-specifiers-used-in-c'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'H5py store list of list of strings',
'url': u'/questions/37873311/h5py-store-list-of-list-of-strings'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'how to get calculated values(totalPriceAmt) from js to html in by using angular js',
'url': u'/questions/37873310/how-to-get-calculated-valuestotalpriceamt-from-js-to-html-in-by-using-angular'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Jquery length not working with elements with variable in name',
'url': u'/questions/37873306/jquery-length-not-working-with-elements-with-variable-in-name'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'get succesor Binary Search Tree c++ Data structure',
'url': u'/questions/37873303/get-succesor-binary-search-tree-c-data-structure'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u"i have added fragments in android studio but on running it , it's just loading and not showing fragments",
'url': u'/questions/37873295/i-have-added-fragments-in-android-studio-but-on-running-it-its-just-loading-a'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Easy Share Application - Possibly Null Error?',
'url': u'/questions/37873293/easy-share-application-possibly-null-error'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'How to retrieve back signed value',
'url': u'/questions/37873292/how-to-retrieve-back-signed-value'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Uploading the app to AppStore when DataBase is changed(iOS)',
'url': u'/questions/37873291/uploading-the-app-to-appstore-when-database-is-changedios'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Getting properties object in spring',
'url': u'/questions/37873289/getting-properties-object-in-spring'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'SSIS make LastModifiedProductVersion from 10.50.1600.1 to 10.50.6000.34',
'url': u'/questions/37873286/ssis-make-lastmodifiedproductversion-from-10-50-1600-1-to-10-50-6000-34'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Servlet not loading on startup in webphere 8.5.5',
'url': u'/questions/37873285/servlet-not-loading-on-startup-in-webphere-8-5-5'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'will conversion of timestamp to date in diiferent timezones returns different date and time?',
'url': u'/questions/37873284/will-conversion-of-timestamp-to-date-in-diiferent-timezones-returns-different-da'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Structure of a Node.Js API with MySQL',
'url': u'/questions/37873280/structure-of-a-node-js-api-with-mysql'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'WordPress: Too many taxonomies slow down the site',
'url': u'/questions/37873279/wordpress-too-many-taxonomies-slow-down-the-site'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Team Coding Client + Server',
'url': u'/questions/37873277/team-coding-client-server'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'separate characters and numbers from a string',
'url': u'/questions/37873276/separate-characters-and-numbers-from-a-string'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Throttle function for 2 seconds',
'url': u'/questions/37873275/throttle-function-for-2-seconds'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'How to set customer _id by find other model By strongloop and mongodb',
'url': u'/questions/37873274/how-to-set-customer-id-by-find-other-model-by-strongloop-and-mongodb'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'com.phonegap.www is already in use by an app owned by another developer',
'url': u'/questions/37873273/com-phonegap-www-is-already-in-use-by-an-app-owned-by-another-developer'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'PHP :- How to handle Doc files (preview and edit)',
'url': u'/questions/37873271/php-how-to-handle-doc-files-preview-and-edit'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Laravel ORM Relationship',
'url': u'/questions/37873270/laravel-orm-relationship'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'reading a json in spark',
'url': u'/questions/37873269/reading-a-json-in-spark'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Where to find the Android app native code on my test server and how to decompile it to Java?',
'url': u'/questions/37873267/where-to-find-the-android-app-native-code-on-my-test-server-and-how-to-decompile'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'401\. That\u2019s an error. Error: invalid_client,no registered origin',
'url': u'/questions/37873266/401-that-s-an-error-error-invalid-client-no-registered-origin'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u"addAllowedApplication(String packageName) method of VPNService.Buildr class doesn't work on api level 14",
'url': u'/questions/37873265/addallowedapplicationstring-packagename-method-of-vpnservice-buildr-class-does'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'I cant remove an entity JPA, JEE7',
'url': u'/questions/37873262/i-cant-remove-an-entity-jpa-jee7'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'How to convert this MySQL query to Yii2 ActiveQuery format?',
'url': u'/questions/37873259/how-to-convert-this-mysql-query-to-yii2-activequery-format'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'CSS style checkbox inside td',
'url': u'/questions/37873256/css-style-checkbox-inside-td'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u"How do I read Class paths in Java's API documentation?",
'url': u'/questions/37873254/how-do-i-read-class-paths-in-javas-api-documentation'}
2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'IPBoard hook for IPSMember',
'url': u'/questions/37873253/ipboard-hook-for-ipsmember'}
2016-06-17 12:51:26 [scrapy] INFO: Closing spider (finished)
2016-06-17 12:51:26 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 512,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 31240,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 6, 17, 4, 51, 26, 211130),
'item_scraped_count': 50,
'log_count/DEBUG': 53,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 6, 17, 4, 51, 3, 602022)}
2016-06-17 12:51:26 [scrapy] INFO: Spider closed (finished)
You can of course also save the scraped data to a JSON file named question.json:
$ scrapy crawl engchen -o question.json -t json
When the crawl finishes, a new question.json file appears in the project folder; open it and you will find 50 records:
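A quick, hypothetical way to double-check the record count from Python (Scrapy's JSON exporter writes a single JSON array):

import json

with open('question.json') as f:
    questions = json.load(f)

print(len(questions))         # expected: 50
print(questions[0]['title'])  # spot-check the first record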
Storing the data in MongoDB
Every time an item is scraped, we want to store it in MongoDB.
The first step is to tell Scrapy where the scraped data should go. Open settings.py, register the pipeline, and add the database settings:
ITEM_PIPELINES = {'engchen.pipelines.MongoDBPipeline':100}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"
Configuring the item pipeline
We have a spider that crawls and parses the data, and we have the database settings; now we need pipelines.py to connect the two.
First, set up the database connection in the pipeline's constructor:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy.conf import settings


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings.get('MONGODB_SERVER'),
            settings.get('MONGODB_PORT')
        )
        db = connection[settings.get('MONGODB_DB')]
        self.collection = db[settings.get('MONGODB_COLLECTION')]
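A side note: importing the settings singleton from scrapy.conf is deprecated in newer Scrapy releases. An alternative sketch (not what the rest of this post uses) would let the pipeline receive the settings through the standard from_crawler hook:

import pymongo


class MongoDBPipeline(object):

    def __init__(self, server, port, db, collection):
        connection = pymongo.MongoClient(server, port)
        self.collection = connection[db][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the values defined in settings.py
        s = crawler.settings
        return cls(
            s.get('MONGODB_SERVER'),
            s.get('MONGODB_PORT'),
            s.get('MONGODB_DB'),
            s.get('MONGODB_COLLECTION'),
        )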
Then define the method that processes each item:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings.get('MONGODB_SERVER'),
            settings.get('MONGODB_PORT')
        )
        db = connection[settings.get('MONGODB_DB')]
        self.collection = db[settings.get('MONGODB_COLLECTION')]

    def process_item(self, item, spider):
        valid = True
        for data in item:
            # drop the item if any of its fields is empty
            if not item[data]:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Question Added To MongoDB Successfully!",level=log.DEBUG,spider=spider)
        return item
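One caveat about insert(): it adds a new document on every run, so re-crawling the same page creates duplicates in the collection. A hedged variant of process_item (assuming PyMongo 3's update_one API, and reusing the imports already at the top of pipelines.py) that de-duplicates on the question URL:

    # inside MongoDBPipeline, as a drop-in replacement for process_item above
    def process_item(self, item, spider):
        for data in item:
            if not item[data]:
                raise DropItem("Missing {0}!".format(data))
        # upsert keyed on the URL so re-running the crawl updates instead of duplicating
        self.collection.update_one(
            {'url': item['url']},
            {'$set': dict(item)},
            upsert=True,
        )
        return item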
Testing that the scraped data is saved to MongoDB
First start MongoDB:
$ mongod
Then run the following in the engchen project folder:
$ scrapy crawl engchen
The terminal prints:
2016-06-17 14:20:23 [scrapy] INFO: Scrapy 1.1.0 started (bot: engchen)
2016-06-17 14:20:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'engchen.spiders', 'SPIDER_MODULES': ['engchen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'engchen'}
2016-06-17 14:20:23 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-06-17 14:20:23 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-06-17 14:20:23 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-06-17 14:20:23 [py.warnings] WARNING: /Users/chenxin/PycharmProjects/engchen/engchen/pipelines.py:11: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
from scrapy import log
2016-06-17 14:20:23 [scrapy] INFO: Enabled item pipelines:
['engchen.pipelines.MongoDBPipeline']
2016-06-17 14:20:23 [scrapy] INFO: Spider opened
2016-06-17 14:20:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-17 14:20:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-17 14:20:24 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/robots.txt> (referer: None)
2016-06-17 14:20:25 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/questions?sort=newest> (referer: None)
2016-06-17 14:20:25 [py.warnings] WARNING: /Users/chenxin/PycharmProjects/engchen/engchen/pipelines.py:28: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
log.msg("Question Added To MongoDB Successfully!",level=log.DEBUG,spider=spider)
2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
2016-06-17 14:20:25 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'how to remove all website addresses in bulk using regex',
'url': u'/questions/37874402/how-to-remove-all-website-addresses-in-bulk-using-regex'}
2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
2016-06-17 14:20:25 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
{'title': u'Dynamic subdomain creation in wp',
'url': u'/questions/37874401/dynamic-subdomain-creation-in-wp'}
2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
......
Then use Robomongo, a MongoDB GUI tool, to inspect the data we just stored:
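If you prefer the command line to a GUI, a quick hypothetical check with PyMongo:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
collection = client['stackoverflow']['questions']

print(collection.count())                     # number of questions stored
print(collection.find_one({}, {'title': 1}))  # peek at one stored document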
Conclusion
This post has given a brief introduction to using Scrapy; the project's source code is hosted on GitHub.