刘硕's Scrapy Notes (Part 1: A Basic Introductory Example)

Author: 费云帆 | Published 2018-11-26 11:52
    • First, the commands to create the project and the spider:
    # open cmd and change into the desired directory
    # create a project named books
    scrapy startproject books
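    For orientation, startproject lays out a project skeleton roughly like this (a sketch of Scrapy's default template; the exact set of files varies slightly by version):

    books/
        scrapy.cfg           # deploy/configuration file
        books/               # the project's Python module
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py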
    

    Create the spider:

    # cd into the spiders directory
    # generate a spider file named book_info
    scrapy genspider book_info books.toscrape.com
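    For reference, genspider fills in a minimal spider template, roughly like the sketch below (this is the stock Scrapy skeleton; note that the code that follows renames the spider from book_info to books):

    import scrapy

    class BookInfoSpider(scrapy.Spider):
        name = 'book_info'
        allowed_domains = ['books.toscrape.com']
        start_urls = ['http://books.toscrape.com/']

        def parse(self, response):
            pass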
    
    • For extracting data I prefer XPath, while the book uses CSS selectors (a CSS version is sketched after this block):
    import scrapy

    class BooksSpider(scrapy.Spider):
        name = 'books'
        allowed_domains = ['books.toscrape.com']
        start_urls = ['http://books.toscrape.com/']

        def parse(self, response):
            # each book sits in an <article> inside a grid <li>
            path = response.xpath('//li[@class="col-xs-6 col-sm-4 col-md-3 col-lg-3"]/article')
            for book in path:
                name = book.xpath('./h3/a/text()').extract()
                price = book.xpath('./div[2]/p[1]/text()').extract()
                # return could only hand back one result;
                # yield lets the method emit every item
                yield {
                    'name': name,
                    'price': price
                }

            next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if next_page:
                # this is the address we actually want:
                # response.urljoin() resolves the relative link against the current page
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

    Handy commands for testing and exporting:

    # experiment with selectors interactively
    scrapy shell http://books.toscrape.com/
    # run the spider and export the items to CSV
    scrapy crawl books -o first_scrapy.csv
    # skip the header row and print to screen with line numbers
    sed -n '2,$p' first_scrapy.csv | cat -n
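    For comparison with the book, here is a minimal sketch of the same parse logic using CSS selectors. The selectors (article.product_pod, p.price_color, li.next a) are my own reading of the page markup, not taken from the book:

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                # extract_first() returns a single string instead of a list
                'name': book.css('h3 a::text').extract_first(),
                'price': book.css('p.price_color::text').extract_first(),
            }
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)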
    
    • Stats after crawling only the first page:
    2019-01-02 16:38:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 502,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 6204,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 1,
 # the single 404 is most likely Scrapy's initial robots.txt probe
 'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 1, 2, 8, 38, 4, 995871),
 # 20 items, which is right: the page lists 20 books
     'item_scraped_count': 20,
     'log_count/DEBUG': 23,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2019, 1, 2, 8, 38, 3, 948658)}
    2019-01-02 16:38:05 [scrapy.core.engine] INFO: Spider closed (finished)
    
    • After adding the next-page request (full crawl):
    2019-01-02 16:56:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 16487,
     'downloader/request_count': 51,
     'downloader/request_method_count/GET': 51,
     'downloader/response_bytes': 299924,
     'downloader/response_count': 51,
     'downloader/response_status_count/200': 50,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2019, 1, 2, 8, 56, 20, 625895),
 # 1000 items in total: 50 pages × 20 books per page
     'item_scraped_count': 1000,
     'log_count/DEBUG': 1052,
     'log_count/INFO': 7,
     'request_depth_max': 49,
     'response_received_count': 51,
     'scheduler/dequeued': 50,
     'scheduler/dequeued/memory': 50,
     'scheduler/enqueued': 50,
     'scheduler/enqueued/memory': 50,
     'start_time': datetime.datetime(2019, 1, 2, 8, 55, 59, 803472)}
    2019-01-02 16:56:20 [scrapy.core.engine] INFO: Spider closed (finished)
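    To double-check the export, a quick sketch that counts the rows in the CSV (this assumes the crawl was run with -o first_scrapy.csv, as in the commands above):

    import csv

    with open('first_scrapy.csv', newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))
    print(len(rows))  # expect 1000, matching item_scraped_count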
    
