Python 3 Web Scraping Notes 13: A First Look at Scrapy

Author: X_xxieRiemann | Published: 2019-04-07 16:03

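    1. Installing Scrapy

    On Windows with an Anaconda environment, run conda install scrapy in a command prompt, enter y when asked to confirm, and wait for the installation to finish. Afterwards, run scrapy version; if a version number is printed, Scrapy is ready to use.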


    2. Scrapy commands

    Run scrapy -h to list the available commands. The command line itself will be summarized in a later note.

    Scrapy 1.3.3 - project: quotetutorial
    
    Usage:
      scrapy <command> [options] [args]
    
    Available commands:
      bench         Run quick benchmark test
      check         Check spider contracts
      commands
      crawl         Run a spider
      edit          Edit spider
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      list          List available spiders
      parse         Parse URL (using its spider) and print the results
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy
    
    Use "scrapy <command> -h" to see more info about a command
    

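    3. Creating a project

    The target is a site built for practicing with Scrapy: http://quotes.toscrape.com/
    Goal: extract each quote together with its author and tags.

    [Screenshot: what the site looks like]

    1. In a command prompt, use cd to move to the folder where the project should be stored.
    2. Run scrapy startproject <project name>. Here the project is named quotetutorial, so the command is scrapy startproject quotetutorial.
    The output ends with two suggested commands, cd quotetutorial and scrapy genspider example example.com (that is, cd <project folder> and scrapy genspider <spider name> <target domain>). The next steps follow these suggestions.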

    C:\Users\m1812>scrapy startproject quotetutorial
    New Scrapy project 'quotetutorial', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
        C:\Users\m1812\quotetutorial
    
    You can start your first spider with:
        cd quotetutorial
        scrapy genspider example example.com
    

    3. Run cd quotetutorial to move into the newly created folder.

    C:\Users\m1812>cd quotetutorial
    

    4. Run scrapy genspider quotes quotes.toscrape.com. This generates a spider file named quotes.py whose target domain is quotes.toscrape.com.

    C:\Users\m1812\quotetutorial>scrapy genspider quotes quotes.toscrape.com
    Created spider 'quotes' using template 'basic' in module:
      quotetutorial.spiders.quotes
    
    Open the project in PyCharm to see its structure. [Screenshot: project structure in PyCharm]

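    4. A first look at Scrapy

    1. Modify the parse function in quotes.py so that it prints the page's HTML. For this page, calling print(response.text) directly raises an encoding error, so the run shown here prints the encoded bytes instead. parse is the default callback: Scrapy calls it once the response for a URL in start_urls has been downloaded, and its response argument holds the result of that request.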
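The modified parse function is not reproduced in the text. As a rough sketch of why a plain print can fail, assume the Windows console uses the GBK code page (an assumption; the console encoding is not stated here). The page footer contains a heart symbol (U+2764), which GBK cannot represent, while GB18030 can. The helper names can_encode and safe_bytes are hypothetical.

```python
# Sketch: why printing the page text can fail on a GBK console, and a workaround.
page_text = 'Made with \u2764 by Scrapinghub'  # sample text; U+2764 is a heart symbol


def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def safe_bytes(text, encoding='gbk'):
    """Encode text, substituting '?' for characters the encoding lacks."""
    return text.encode(encoding, errors='replace')


print(can_encode(page_text, 'gbk'))      # GBK has no heart symbol
print(can_encode(page_text, 'gb18030'))  # GB18030 covers all of Unicode
print(safe_bytes(page_text))             # bytes print safely on any console
```

Inside parse, printing response.text.encode('gb18030') or the raw response.body should sidestep the console's code page in the same way, since a bytes object is displayed using ASCII escapes.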


    2. Run the spider from the command prompt with scrapy crawl quotes. Besides the page's HTML, Scrapy prints a good deal of log output.
    C:\Users\m1812\quotetutorial>scrapy crawl quotes
    2019-04-05 19:50:11 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
    2019-04-05 19:50:11 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'SPIDER_MODULES': ['quotetutorial.spi
    ders'], 'BOT_NAME': 'quotetutorial', 'ROBOTSTXT_OBEY': True}
    2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.logstats.LogStats']
    2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2019-04-05 19:50:11 [scrapy.core.engine] INFO: Spider opened
    2019-04-05 19:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2019-04-05 19:50:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
    b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bo
    otstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-
    box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n
                    </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/l
    ogin">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\
    n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0The world as we have
     created it is a process of our thinking. It cannot be changed without changing our thinking.\xa1\xb1</span>\n        <span>by <small class="author"
    itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n
          Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" /    > \n            \n
    <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n
               \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/"
    >world</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span cl
    ass="text" itemprop="text">\xa1\xb0It is our choices, Harry, that show what we truly are, far more than our abilities.\xa1\xb1</span>\n        <span>
    by <small class="author" itemprop="author">J.K. Rowling</small>\n        <a href="/author/J-K-Rowling">(about)</a>\n        </span>\n        <div cla
    ss="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="abilities,choices" /    > \n            \n
    <a class="tag" href="/tag/abilities/page/1/">abilities</a>\n            \n            <a class="tag" href="/tag/choices/page/1/">choices</a>\n
         \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop
    ="text">\xa1\xb0There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\xa1
    \xb1</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\
    n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="inspirational,life,l
    ive,miracle,miracles" /    > \n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n
      <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/live/page/1/">live</a>\n            \n
         <a class="tag" href="/tag/miracle/page/1/">miracle</a>\n            \n            <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n
            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemp
    rop="text">\xa1\xb0The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xa1\xb1</span>\n        <sp
    an>by <small class="author" itemprop="author">Jane Austen</small>\n        <a href="/author/Jane-Austen">(about)</a>\n        </span>\n        <div c
    lass="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" /    > \n
    \n            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n            \n            <a class="tag" href="/tag/books/page/1/">books</a
    >\n            \n            <a class="tag" href="/tag/classic/page/1/">classic</a>\n            \n            <a class="tag" href="/tag/humor/page/1
    /">humor</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span
    class="text" itemprop="text">\xa1\xb0Imperfection is beauty, madness is genius and it&#39;s better to be absolutely ridiculous than absolutely boring
    .\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n        <a href="/author/Marilyn-Monroe">(about)</
    a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="be-yourself,inspi
    rational" /    > \n            \n            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n            \n            <a class="tag"
     href="/tag/inspirational/page/1/">inspirational</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://s
    chema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0Try not to become a man of success. Rather become a man of value.\xa1\xb
    1</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n
          </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="adulthood,success,value
    " /    > \n            \n            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n            \n            <a class="tag" href="/tag/
    success/page/1/">success</a>\n            \n            <a class="tag" href="/tag/value/page/1/">value</a>\n            \n        </div>\n    </div>\
    n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0It is better to be
     hated for what you are than to be loved for what you are not.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Andr\xa8\xa6
    Gide</small>\n        <a href="/author/Andre-Gide">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta cla
    ss="keywords" itemprop="keywords" content="life,love" /    > \n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n
      \n            <a class="tag" href="/tag/love/page/1/">love</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemty
    pe="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0I have not failed. I&#39;ve just found 10,000 ways that won&
    #39;t work.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>\n        <a href="/author/Thomas-A-Edis
    on">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="edis
    on,failure,inspirational,paraphrased" /    > \n            \n            <a class="tag" href="/tag/edison/page/1/">edison</a>\n            \n
        <a class="tag" href="/tag/failure/page/1/">failure</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational<
    /a>\n            \n            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n            \n        </div>\n    </div>\n\n    <div c
    lass="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0A woman is like a tea bag; you
    never know how strong it is until it&#39;s in hot water.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</
    small>\n        <a href="/author/Eleanor-Roosevelt">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta cl
    ass="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" /    > \n            \n            <a class="tag" href="/tag/misattribut
    ed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemt
    ype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0A day without sunshine is like, you know, night.\xa1\xb1</s
    pan>\n        <span>by <small class="author" itemprop="author">Steve Martin</small>\n        <a href="/author/Steve-Martin">(about)</a>\n        </sp
    an>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" /    > \n
              \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n            <a class="tag" href="/tag/obvious/page/1/">obvi
    ous</a>\n            \n            <a class="tag" href="/tag/simile/page/1/">simile</a>\n            \n        </div>\n    </div>\n\n    <nav>\n
       <ul class="pager">\n            \n            \n            <li class="next">\n                <a href="/page/2/">Next <span aria-hidden="true">&r
    arr;</span></a>\n            </li>\n            \n        </ul>\n    </nav>\n    </div>\n    <div class="col-md-4 tags-box">\n        \n            <
    h2>Top Ten tags</h2>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a
    >\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/inspirationa
    l/">inspirational</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" hre
    f="/tag/life/">life</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 24px" h
    ref="/tag/humor/">humor</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 22p
    x" href="/tag/books/">books</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size:
     14px" href="/tag/reading/">reading</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="fo
    nt-size: 10px" href="/tag/friendship/">friendship</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="
    tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n            </span>\n            \n            <span class="tag-item">\n            <a
    class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n            </span>\n            \n            <span class="tag-item">\n
    <a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n            </span>\n            \n        \n    </div>\n</div>\n\n    </div>\n
        <footer class="footer">\n        <div class="container">\n            <p class="text-muted">\n                Quotes by: <a href="https://www.goo
    dreads.com/quotes">GoodReads.com</a>\n            </p>\n            <p class="copyright">\n                Made with <span class=\'sh-red\'>\x817\xc5
    8</span> by <a href="https://scrapinghub.com">Scrapinghub</a>\n            </p>\n        </div>\n    </footer>\n</body>\n</html>'
    2019-04-05 19:50:12 [scrapy.core.engine] INFO: Closing spider (finished)
    2019-04-05 19:50:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 444,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 2701,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 4, 5, 11, 50, 12, 560342),
     'log_count/DEBUG': 3,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2019, 4, 5, 11, 50, 11, 713697)}
    2019-04-05 19:50:12 [scrapy.core.engine] INFO: Spider closed (finished)
    
    

    5. Scraping the data

    First, look at the HTML structure of the target information: [Screenshot: HTML structure of a quote]


    1. Edit items.py and declare the three pieces of information to extract as fields, in the format Scrapy expects:


    2. Edit quotes.py to add the extraction rules, mapping each extracted value to the fields declared in items.py in step 1.
    # -*- coding: utf-8 -*-
    import scrapy
    from quotetutorial.items import QuotetutorialItem
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ['http://quotes.toscrape.com/']
    
        def parse(self, response):
    
            quotes = response.css('.quote')
            for quote in quotes:
                item = QuotetutorialItem()
                text = quote.css('.text::text').extract_first()
                author = quote.css('.author::text').extract_first()
                tags = quote.css('.tags .tag::text').extract()
                item['text'] = text
                item['author'] = author
                item['tags'] = tags
                yield item
    

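    This uses Scrapy's built-in CSS selectors.
    The shell command allows interactive testing from the command prompt: scrapy shell "quotes.toscrape.com" (note the double quotes). The session below shows what each CSS selector above picks out and how extract_first() differs from extract().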

    C:\Users\m1812\quotetutorial>scrapy shell "quotes.toscrape.com"
    
    2019-04-05 20:08:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
    2019-04-05 20:08:39 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilte
    r', 'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'quotetutorial.spiders'}
    2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats']
    2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2019-04-05 20:08:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2019-04-05 20:08:39 [scrapy.core.engine] INFO: Spider opened
    2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
    2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
    2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x000002B3C7410748>
    [s]   item       {}
    [s]   request    <GET http://quotes.toscrape.com>
    [s]   response   <200 http://quotes.toscrape.com>
    [s]   settings   <scrapy.settings.Settings object at 0x000002B3C74D97B8>
    [s]   spider     <DefaultSpider 'default' at 0x2b3c90e5ba8>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()           Shell help (print this help)
    [s]   view(response)    View response in a browser
    In [1]: 
    
    
    In [1]: response
    Out[1]: <200 http://quotes.toscrape.com>
    
    In [2]: quotes = response.css('.quote')
    
    In [3]: quotes
    Out[3]: 
    [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>,
     <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>,
     <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>,
     <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>,
     <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>,
     <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>,
     <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>,
     <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>,
     <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>,
     <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
    e itemtype="h'>]
    
    In [4]: quotes[0]
    Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" i
    temscope itemtype="h'>
    
    In [5]: quotes[0].css('.text')
    Out[5]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" i
    temprop="text">“The '>]
    
    In [6]:  quotes[0].css('.text::text')
    Out[6]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world a
    s we have created it is a pr'>]
    
    In [7]:  quotes[0].css('.text::text').extract()
    Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
    
    In [8]: quotes[0].css('.text').extract()
    Out[8]: ['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing ou
    r thinking.”</span>']
    
    In [9]:  quotes[0].css('.text::text').extract_first()
    Out[9]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    
    In [10]:  quotes[0].css('.tags .tag::text').extract()
    Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']
    
    In [11]: exit()
    

    Now run the spider again with scrapy crawl quotes.
    A lot of output, including the scraped items, appears in the terminal.


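    3. Single-page scraping is done; next comes pagination. The URL changes from page to page (on the first page, the Next link points to /page/2/), so the address of the next page can be read from the href attribute of the Next button.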


    Update the code in quotes.py:
    # -*- coding: utf-8 -*-
    import scrapy
    from quotetutorial.items import QuotetutorialItem
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        allowed_domains = ["quotes.toscrape.com"]
        start_urls = ['http://quotes.toscrape.com/']
    
        def parse(self, response):
    
            quotes = response.css('.quote')
            for quote in quotes:
                item = QuotetutorialItem()   # dict-like item object
                text = quote.css('.text::text').extract_first()
                author = quote.css('.author::text').extract_first()
                tags = quote.css('.tags .tag::text').extract()
                item['text'] = text
                item['author'] = author
                item['tags'] = tags
                yield item
    
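            # Follow the Next link; on the last page there is none and extract_first() returns None
            next_page = response.css('.pager .next a::attr(href)').extract_first()
            if next_page is not None:
                url = response.urljoin(next_page)  # build the absolute URL
                yield scrapy.Request(url=url, callback=self.parse)  # handle the next page with this same callback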
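response.urljoin resolves a relative href against the URL of the current response, much like urllib.parse.urljoin from the standard library. The standalone sketch below uses that standard-library function to show how href values of the kind seen on the site become absolute URLs. The helper name next_page_url and the example page numbers are hypothetical.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None when there is no Next link."""
    if href is None:  # extract_first() yields None when the selector matches nothing
        return None
    return urljoin(current_url, href)


print(next_page_url('http://quotes.toscrape.com/', '/page/2/'))
print(next_page_url('http://quotes.toscrape.com/page/2/', '/page/3/'))
print(next_page_url('http://quotes.toscrape.com/page/3/', None))
```

Returning None for a missing link makes the stopping condition explicit instead of relying on Scrapy's duplicate-request filter to end the crawl.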
    

    4. Saving the data
    Run scrapy crawl quotes -o quotes.json to save the items in JSON format.


    Or save them as JSON Lines: scrapy crawl quotes -o quotes.jl

    Or as CSV: scrapy crawl quotes -o quotes.csv
    Or as XML: scrapy crawl quotes -o quotes.xml
    Or as pickle: scrapy crawl quotes -o quotes.pickle
    Or as marshal: scrapy crawl quotes -o quotes.marshal
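The .json and .jl outputs differ in layout: the JSON export is one array holding every item, while JSON Lines writes one standalone JSON object per line, which is easier to append to and to process as a stream. The sketch below imitates both layouts with the standard json module; the two sample items are made-up stand-ins, not real export output.

```python
import json

items = [
    {'text': 'First sample quote', 'author': 'Author A', 'tags': ['one', 'two']},
    {'text': 'Second sample quote', 'author': 'Author B', 'tags': []},
]

# JSON: a single document containing a list of all items
as_json = json.dumps(items)

# JSON Lines: one JSON object per line
as_jl = '\n'.join(json.dumps(item) for item in items)

# A .jl file can be read back line by line without loading everything at once
restored = [json.loads(line) for line in as_jl.splitlines()]

print(as_json)
print(as_jl)
```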

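    5. Filtering out unwanted items or saving to a database
    Both are done by editing pipelines.py.


    The quote text is limited to at most 50 characters; anything longer is cut off and ... is appended.
    A pipeline that saves the items to MongoDB is defined as well.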
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import pymongo
    
    from scrapy.exceptions import DropItem
    
    # Truncate the quote text to at most 50 characters and append an ellipsis
    class QuotetutorialPipeline(object):
    
        def __init__(self):
            self.limit = 50
    
        # process_item must either return the item or raise DropItem
        def process_item(self, item, spider):
            if item['text']:
                if len(item['text']) > self.limit:
                    item['text'] = item['text'][0:self.limit].rstrip() + '...'
                # Return the item whether or not it was truncated
                return item
            else:
                raise DropItem('Missing Text')
    
    class MongoPipeline(object):
    
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DB')
            )
    
        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
    
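        def process_item(self, item, spider):
            # Use the item class name as the collection name
            name = item.__class__.__name__
            # insert_one replaces the deprecated Collection.insert
            self.db[name].insert_one(dict(item))
            return item
    
        def close_spider(self, spider):
            # Close the MongoDB connection
            self.client.close()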
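The truncation rule in QuotetutorialPipeline can be tried on its own, outside Scrapy. This sketch isolates it in a hypothetical helper named truncate that mirrors the slicing used in process_item:

```python
def truncate(text, limit=50):
    """Cut text to at most limit characters and mark the cut with '...'."""
    if len(text) > limit:
        # rstrip() avoids leaving a dangling space before the ellipsis
        return text[:limit].rstrip() + '...'
    return text


print(truncate('short quote'))
print(truncate('x' * 49 + ' and more text'))
```

Text at or under the limit passes through unchanged, so only long quotes gain the trailing dots.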
    

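    The relevant settings in settings.py must be updated as well.

    Uncomment the ITEM_PIPELINES block in settings.py. The number next to each pipeline is its priority: pipelines with smaller numbers run first.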
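The settings themselves are not reproduced in the text. A sketch of the relevant part of settings.py might look as follows. The setting names MONGO_URI and MONGO_DB come from the pipeline code above; their values and the two priority numbers are assumptions chosen for illustration.

```python
# settings.py (excerpt, sketch)

# Lower numbers run first: items are truncated before they are stored
ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}

# Read by MongoPipeline.from_crawler; the values here are placeholders
MONGO_URI = 'localhost'
MONGO_DB = 'quotetutorial'
```

Putting the truncating pipeline ahead of the MongoDB one means the database receives the shortened text and never sees items that were dropped.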

    Run scrapy crawl quotes again from the command line.

    The saved data can now be seen in MongoDB as well.


    Based on the Scrapy tutorial by the blogger Cui Qingcai (崔庆才).
