Scraping Web Data with Scrapy and Saving It to MongoDB

Author: nextliving | Published 2018-04-22 14:34

    Scrapy is a free, open-source web crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data returned by APIs. It was created at Mydeco, a London-based e-commerce company, first released publicly in 2008, and has been maintained by Scrapinghub since 2011. This article demonstrates how to use Scrapy to scrape the newest questions on Stack Overflow (each question's title and URL) and how to save the data to a MongoDB database.

    Installing the required packages

    MongoDB

    My machine runs Mac OS X. The first step is installing and configuring MongoDB; for the details, see my other article: Installing MongoDB on Mac OS X with Homebrew.

    Scrapy

    Install with pip inside a virtualenv-managed Python virtual environment (for setting one up, see my other article: Managing Python Development Environments with virtualenv):

    
    $ pip install Scrapy
    $ pip freeze > requirements.txt
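    You can confirm the install by printing the version (the crawl logs later in this article show Scrapy 1.1.0):

    $ scrapy version
    Scrapy 1.1.0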
    
    

    PyMongo

    Install PyMongo with pip:

    
    $ pip install pymongo
    $ pip freeze > requirements.txt
    
    

    On success, the terminal prints something like:

    
    Collecting pymongo
      Downloading pymongo-3.2.2-cp27-none-macosx_10_11_intel.whl (262kB)
        100% |████████████████████████████████| 266kB 411kB/s 
    Installing collected packages: pymongo
    Successfully installed pymongo-3.2.2
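    A quick import check confirms that the driver is available inside the virtual environment:

    $ python -c "import pymongo; print(pymongo.version)"
    3.2.2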
    
    

    Creating the Scrapy project

    According to the Scrapy documentation, you run the scrapy startproject command inside the folder where you want the project code to live. My Python virtual environment is named engchen (it lives under /Users/chenxin/ProjectsEnv), and I give the project the same name, engchen (to be created under /Users/chenxin/PycharmProjects). The command is:

    (engchen) MacBookPro:PycharmProjects chenxin$ scrapy startproject engchen

    On success, the terminal prints:

    
    New Scrapy project 'engchen', using template directory '/Users/chenxin/ProjectsEnv/engchen/lib/python2.7/site-packages/scrapy/templates/project', created in:
        /Users/chenxin/PycharmProjects/engchen

    You can start your first spider with:
        cd engchen
        scrapy genspider example example.com
    
    

    The command creates a project root folder named engchen containing the files and folders of the default project template:

    
    ├── scrapy.cfg
    └── engchen
        ├── __init__.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            └── __init__.py
    
    

    Specifying the data to scrape

    The items.py module defines storage containers for the data we plan to scrape.

    Open items.py; it already contains a class named EngchenItem, which inherits from Scrapy's Item base class.

    Add the fields we want to collect by updating items.py:

    
    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    from scrapy.item import Item, Field


    class EngchenItem(Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = Field()
        url = Field()
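    Items behave much like Python dictionaries. A quick illustration (throwaway code, not part of the project; the values are made up):

    item = EngchenItem(title="Example question", url="/questions/1/example")
    print(item["title"])   # -> Example question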
    
    

    Next, create a file named engchen_spider.py inside the "spiders" folder. This is where we steer the spider toward the data we want. A spider written this way is tied to one specific site; it is not meant to be reused for scraping other sites.

    Define a class that inherits from Scrapy's Spider base class:

    
    # -*- coding: utf-8 -*-

    from scrapy import Spider


    class EngchenSpider(Spider):
        name = "engchen"
        allowed_domains = ["stackoverflow.com"]
        start_urls = ["http://stackoverflow.com/questions?sort=newest"]
    
    

    A few words about these attributes:

    • name is the name that identifies the spider (it is what scrapy crawl is invoked with)

    • allowed_domains holds the base domains the spider is allowed to crawl

    • start_urls is the list of URLs the spider starts crawling from (a sketch of how Scrapy consumes this list follows the bullets)
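    For reference, Scrapy's default start_requests() simply turns each entry of start_urls into a request whose response is handed to parse(). A simplified sketch of that built-in behavior (you do not need to write this yourself; the real implementation in scrapy.Spider also sets dont_filter=True on each request):

    from scrapy import Request

    def start_requests(self):
        # one request per start URL; responses arrive at self.parse
        for url in self.start_urls:
            yield Request(url, callback=self.parse)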

    XPath选择器

    Scrapy uses XPath selectors to extract data from a web page; in other words, an XPath expression picks out a specific part of the HTML. The Selectors section of the Scrapy documentation introduces XPath as follows:

    XPath is a language for selecting nodes in XML documents, which can also be used with HTML

    Now open Stack Overflow in Chrome and find the XPath we need. Hover over the title of the first question, then right-click -> Inspect:

    [Screenshot: Chrome DevTools inspecting the first question's title]

    Locate the XPath that corresponds to the <div class="summary"> element: //*[@id="question-summary-37872090"]/div[2]. You can verify that this XPath selects the first question using the $x syntax in the Console tab of Chrome's developer tools:

    [Screenshot: testing the XPath with $x in the Chrome console]

    As you can see, it selects exactly the first question's markup, and the h3 inside it holds the question title.

    What if we want to adjust the XPath so that it selects every question on the page? Simple: use //div[@class="summary"]/h3. What does it mean?

    This XPath grabs the h3 child element of every div whose class is summary. You can also try the expression in Scrapy's interactive shell before committing it to code, as shown below.
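    A quick way to experiment (output abridged; the exact results depend on whatever is on the live page at the time):

    $ scrapy shell "http://stackoverflow.com/questions?sort=newest"
    >>> len(response.xpath('//div[@class="summary"]/h3'))
    50
    >>> response.xpath('//div[@class="summary"]/h3/a/text()').extract_first()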

    Now update engchen_spider.py:

    
    # -*- coding: utf-8 -*-

    from scrapy import Spider
    from scrapy.selector import Selector


    class EngchenSpider(Spider):
        name = "engchen"
        allowed_domains = ["stackoverflow.com"]
        start_urls = ["http://stackoverflow.com/questions?sort=newest"]

        def parse(self, response):
            questions = Selector(response).xpath('//div[@class="summary"]/h3')
    
    

    Extracting the data

    Getting the question blocks alone is not enough; we need to pull out each question's title and link. Update engchen_spider.py:

    
    # -*- coding: utf-8 -*-

    from scrapy import Spider
    from scrapy.selector import Selector

    from engchen.items import EngchenItem


    class EngchenSpider(Spider):
        name = "engchen"
        allowed_domains = ["stackoverflow.com"]
        start_urls = ["http://stackoverflow.com/questions?sort=newest"]

        def parse(self, response):
            questions = Selector(response).xpath('//div[@class="summary"]/h3')
            for question in questions:
                item = EngchenItem()
                item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract()[0]
                item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract()[0]
                yield item
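    A caveat on extract()[0]: it raises an IndexError whenever the XPath matches nothing. If you prefer a quieter failure mode, Scrapy (1.0 and later) also offers extract_first(), which returns None for an empty match; the two assignments could instead be written as:

    item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract_first()
    item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract_first()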
    
    

    Testing the spider

    In the project folder engchen, run:

    $ scrapy crawl engchen

    The terminal prints the following:

    
    2016-06-17 12:51:03 [scrapy] INFO: Scrapy 1.1.0 started (bot: engchen)
    2016-06-17 12:51:03 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'engchen.spiders', 'SPIDER_MODULES': ['engchen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'engchen'}
    2016-06-17 12:51:03 [scrapy] INFO: Enabled extensions:
    ['scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats']
    2016-06-17 12:51:03 [scrapy] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2016-06-17 12:51:03 [scrapy] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2016-06-17 12:51:03 [scrapy] INFO: Enabled item pipelines:
    []
    2016-06-17 12:51:03 [scrapy] INFO: Spider opened
    2016-06-17 12:51:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-06-17 12:51:03 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-06-17 12:51:15 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/robots.txt> (referer: None)
    2016-06-17 12:51:26 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/questions?sort=newest> (referer: None)
    2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
    {'title': u'how to display picture in picture object using path in crystal report C#.net',
     'url': u'/questions/37873352/how-to-display-picture-in-picture-object-using-path-in-crystal-report-c-net'}
    2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
    {'title': u"C language paese file's detail",
     'url': u'/questions/37873351/c-language-paese-files-detail'}
    2016-06-17 12:51:26 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
    {'title': u'Shibboleth custom password flow',
     'url': u'/questions/37873350/shibboleth-custom-password-flow'}
    ......
    2016-06-17 12:51:26 [scrapy] INFO: Closing spider (finished)
    2016-06-17 12:51:26 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 512,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 31240,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2016, 6, 17, 4, 51, 26, 211130),
     'item_scraped_count': 50,
     'log_count/DEBUG': 53,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2016, 6, 17, 4, 51, 3, 602022)}
    2016-06-17 12:51:26 [scrapy] INFO: Spider closed (finished)
    
    

    Of course, you can also dump the scraped data into a JSON file named question.json:

    $ scrapy crawl engchen -o question.json -t json

    When the crawl finishes, a new question.json appears in the project folder; opening it reveals 50 records:

    [Screenshot: question.json containing the 50 scraped questions]

    Storing the data in MongoDB

    The goal now: each time an item is scraped, store it in MongoDB.

    The first step is to tell the project where scraped data should go. Open settings.py, register the pipeline, and add the database settings (100 is the pipeline's order value; Scrapy runs pipelines in ascending order of these integers, conventionally chosen in the 0-1000 range):

    
    ITEM_PIPELINES = {'engchen.pipelines.MongoDBPipeline': 100}

    MONGODB_SERVER = "localhost"
    MONGODB_PORT = 27017
    MONGODB_DB = "stackoverflow"
    MONGODB_COLLECTION = "questions"
    
    

    Setting up the pipeline

    The spider now scrapes and parses the data, and the database settings are in place; what connects the two is the pipeline module, pipelines.py.

    First, set up the database connection in the pipeline's constructor:

    
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import pymongo

    from scrapy.conf import settings


    class MongoDBPipeline(object):

        def __init__(self):
            connection = pymongo.MongoClient(
                settings.get('MONGODB_SERVER'),
                settings.get('MONGODB_PORT'))
            db = connection[settings.get('MONGODB_DB')]
            self.collection = db[settings.get('MONGODB_COLLECTION')]
    
    

    Next, define the method that processes each scraped item:

    
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import pymongo

    from scrapy.conf import settings
    from scrapy.exceptions import DropItem
    from scrapy import log


    class MongoDBPipeline(object):

        def __init__(self):
            connection = pymongo.MongoClient(
                settings.get('MONGODB_SERVER'),
                settings.get('MONGODB_PORT'))
            db = connection[settings.get('MONGODB_DB')]
            self.collection = db[settings.get('MONGODB_COLLECTION')]

        def process_item(self, item, spider):
            valid = True
            for data in item:
                if not item[data]:  # an empty field value makes the item invalid
                    valid = False
                    raise DropItem("Missing {0}!".format(data))
            if valid:
                self.collection.insert(dict(item))
                log.msg("Question Added To MongoDB Successfully!",
                        level=log.DEBUG, spider=spider)
            return item
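    As the crawl log below shows, scrapy.log is deprecated in Scrapy 1.1 and triggers ScrapyDeprecationWarnings. If you want to silence them, the spider's built-in logger is a drop-in replacement for the log.msg(...) line above:

    # instead of log.msg(...):
    spider.logger.debug("Question Added To MongoDB Successfully!")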
    
    

    Testing that scraped data reaches MongoDB

    First start MongoDB:

    $ mongod

    Then run the spider again from the project folder engchen:

    $ scrapy crawl engchen

    The terminal prints the following:

    
    2016-06-17 14:20:23 [scrapy] INFO: Scrapy 1.1.0 started (bot: engchen)
    2016-06-17 14:20:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'engchen.spiders', 'SPIDER_MODULES': ['engchen.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'engchen'}
    2016-06-17 14:20:23 [scrapy] INFO: Enabled extensions:
    ['scrapy.extensions.logstats.LogStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats']
    2016-06-17 14:20:23 [scrapy] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2016-06-17 14:20:23 [scrapy] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2016-06-17 14:20:23 [py.warnings] WARNING: /Users/chenxin/PycharmProjects/engchen/engchen/pipelines.py:11: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
      from scrapy import log
    2016-06-17 14:20:23 [scrapy] INFO: Enabled item pipelines:
    ['engchen.pipelines.MongoDBPipeline']
    2016-06-17 14:20:23 [scrapy] INFO: Spider opened
    2016-06-17 14:20:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-06-17 14:20:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-06-17 14:20:24 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/robots.txt> (referer: None)
    2016-06-17 14:20:25 [scrapy] DEBUG: Crawled (200) <GET http://stackoverflow.com/questions?sort=newest> (referer: None)
    2016-06-17 14:20:25 [py.warnings] WARNING: /Users/chenxin/PycharmProjects/engchen/engchen/pipelines.py:28: ScrapyDeprecationWarning: log.msg has been deprecated, create a python logger and log through it instead
      log.msg("Question Added To MongoDB Successfully!",level=log.DEBUG,spider=spider)
    2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
    2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
    2016-06-17 14:20:25 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
    {'title': u'how to remove all website addresses in bulk using regex',
     'url': u'/questions/37874402/how-to-remove-all-website-addresses-in-bulk-using-regex'}
    2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
    2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
    2016-06-17 14:20:25 [scrapy] DEBUG: Scraped from <200 http://stackoverflow.com/questions?sort=newest>
    {'title': u'Dynamic subdomain creation in wp',
     'url': u'/questions/37874401/dynamic-subdomain-creation-in-wp'}
    2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
    2016-06-17 14:20:25 [scrapy] DEBUG: Question Added To MongoDB Successfully!
    ......
    
    

    Then inspect the stored data with Robomongo, a MongoDB GUI client:

    [Screenshot: Robomongo showing documents in the questions collection]
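    If you prefer to verify from Python rather than a GUI, a quick PyMongo check works too (this assumes the settings.py values above and a completed crawl of one page):

    import pymongo

    client = pymongo.MongoClient("localhost", 27017)
    questions = client["stackoverflow"]["questions"]
    print(questions.count())     # expect 50 after a single crawl
    print(questions.find_one())  # one stored question document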

    Closing remarks

    This walkthrough covers the basics of using Scrapy; the project's source code is hosted on GitHub.
