Scrapy Basic Notes 1: Creating and Running a Project

By BigBigTang | Published 2019-02-27 22:05

    1. Create a Scrapy project

    scrapy startproject quotetutorial
    

    2. Go into the quotetutorial folder just created and generate a spider for the project (genspider takes the spider's name and the domain it is allowed to crawl):

    scrapy genspider quotes quotes.toscrape.com
    

    This generates a quotes.py file under quotetutorial/quotetutorial/spiders with the following contents:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'  # the spider's name, used by scrapy crawl
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']  # seeded from the URL given to genspider

        def parse(self, response):
            pass
    
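    As a quick check, running scrapy list from the project directory prints the names of all registered spiders; it should output quotes here.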

    The file structure at this point:

    quotetutorial/
        scrapy.cfg
        quotetutorial/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                quotes.py

    scrapy.cfg points at the settings module and holds the deployment configuration:

    [settings]
    default = quotetutorial.settings
    [deploy]
    #url = http://localhost:6800/
    project = quotetutorial
    
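    The [deploy] section comes into play when deploying the project to a Scrapyd server; the commented-out url is Scrapyd's default address (port 6800 on localhost).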

    1. items.py - defines the data structures that hold scraped data (see the sketch after this list)
    2. middlewares.py - spider middlewares
    3. pipelines.py - defines item pipelines
    4. settings.py - project configuration
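    As an illustration of items.py, a minimal Item for this tutorial could look like the following (QuoteItem and its fields are our own naming for the quotes site's data, not code that Scrapy generates):

    import scrapy

    class QuoteItem(scrapy.Item):
        # one Field per piece of data we plan to scrape from quotes.toscrape.com
        text = scrapy.Field()
        author = scrapy.Field()
        tags = scrapy.Field()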

    All spiders live in the spiders folder.

    Let's make the parse method print the response body:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            print(response.text)
    

    Scrapy calls parse with the downloaded response after each page is fetched; here we changed it to print(response.text). Now run the spider as follows.


    3. Run the spider

    startproject created a nested layout, so there is another quotetutorial folder inside quotetutorial; run the following from the outer quotetutorial directory (the one containing scrapy.cfg):

    scrapy crawl quotes
    

    You should then see log output like the following. It shows what the framework did: the Scrapy and library versions, platform info, overridden settings, the enabled middlewares, and the pages crawled; the output of our print(response.text) appears in the middle as well. Note that the first request fetches robots.txt (a 404 on this site) because ROBOTSTXT_OBEY is True:

    D:\study\bandwagon\repository\spider\scrapy\quotetutorial>scrapy crawl quotes
    2019-02-27 21:58:22 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: quotetutorial)
    2019-02-27 21:58:22 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
    2019-02-27 21:58:22 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotetutorial.spiders']}
    2019-02-27 21:58:22 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2019-02-27 21:58:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2019-02-27 21:58:23 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2019-02-27 21:58:23 [scrapy.core.engine] INFO: Spider opened
    2019-02-27 21:58:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2019-02-27 21:58:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2019-02-27 21:58:28 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2019-02-27 21:58:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
    <!DOCTYPE html>
    <html lang="en">
    **(response.text is printed here in full; omitted for brevity)**
    </html>
    2019-02-27 21:58:31 [scrapy.core.engine] INFO: Closing spider (finished)
    2019-02-27 21:58:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 446,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 2701,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 2, 27, 13, 58, 31, 73758),
     'log_count/DEBUG': 3,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2019, 2, 27, 13, 58, 23, 304498)}
    2019-02-27 21:58:31 [scrapy.core.engine] INFO: Spider closed (finished)
    
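    The stats at the end confirm the run: two GET requests (robots.txt and the page itself), one 404 and one 200, and the spider finished cleanly.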

    4. Export the crawl results to files in various formats or to an FTP server

    Pass -o <filename>; Scrapy picks the feed exporter from the file extension:

    scrapy crawl quotes -o quotes.json

    The same command works with quotes.csv, quotes.xml, quotes.pickle, quotes.jl, quotes.marshal, or an FTP target such as ftp://user:passwd@ftp.xxx.com/path/quotes.json
    
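    Two caveats: the feed only receives items that parse yields, so with the print-only parse above the output file stays empty; and in this Scrapy version -o appends to an existing file rather than overwriting it (an overwrite flag only arrived in later releases), so delete quotes.json between runs to keep the JSON valid.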

    5. Interactive mode with scrapy shell

    scrapy shell quotes.toscrape.com

    This fetches the page and opens an interactive session (IPython, if installed) with the response object ready to use:
    
    In [1]: quotes = response.css('.quote')
    In [4]: quotes[0]
    Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>
    In [5]: quotes[0].css('.text')
    Out[5]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“The '>]
    In [6]: quotes[0].css('.text::text')
    Out[6]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a pr'>]
    

    In Scrapy, a CSS selector can end in ::text to select text nodes; notice in the output above how Scrapy translates the CSS to XPath internally, appending /text().

    In [7]: quotes[0].css('.text::text').extract()
    Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
    In [8]: quotes[0].css('.text::text').extract_first()
    Out[8]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    In [9]: quotes[0].css('.tags .tag::text').extract_first()
    Out[9]: 'change'
    In [10]: quotes[0].css('.tags .tag::text').extract()
    Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']
    

    These four inputs and outputs show the difference: extract_first() returns the first match as a string, while extract() returns every match as a list. So use extract_first() when you expect a single result and extract() when there may be several.
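    Putting these selectors back into the spider from section 2, parse can now yield one record per quote instead of printing; a minimal sketch (the .author selector is an assumption about the site's markup, as it was not shown in the shell session above):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # one .quote block per quotation on the page
            for quote in response.css('.quote'):
                yield {
                    'text': quote.css('.text::text').extract_first(),
                    'author': quote.css('.author::text').extract_first(),
                    'tags': quote.css('.tags .tag::text').extract(),
                }

    Yielded dicts are exactly what the -o option from section 4 serializes, so scrapy crawl quotes -o quotes.json now produces a non-empty file.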
