11. Hands-on project: scraping the famous quotes on toscrape

Author: 橄榄的世界 | Published: 2018-03-14 01:22

    Target URL: http://quotes.toscrape.com/js/
    Data scraped: famous quotes
    Method: the Scrapy framework + Splash
    Storage format: CSV

    1. Install scrapy-splash, which is needed to drive Splash from Scrapy:
    py -3 -m pip install scrapy-splash
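    To confirm the install, a quick check that the package imports cleanly (assuming the same py -3 interpreter as above):

    py -3 -c "import scrapy_splash"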

    2. Create the project:
    scrapy startproject splash_examples

    3. Configure settings.py

    SPLASH_URL = 'http://192.168.99.100:8050' #address of the Splash service
    
    #Enable the two scrapy_splash downloader middlewares and reorder HttpCompressionMiddleware
    DOWNLOADER_MIDDLEWARES = {
       'scrapy_splash.SplashCookiesMiddleware': 723,
       'scrapy_splash.SplashMiddleware':725,
       'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware':810,
    }
    
    #Use the Splash-aware dedup filter
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    
    #Support cache_args (optional)
    SPIDER_MIDDLEWARES = {
       'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
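
    scrapy-splash also ships a Splash-aware HTTP cache storage. A minimal optional addition to settings.py, only needed if you turn on Scrapy's HTTP cache (HTTPCACHE_ENABLED):

    #Splash-aware HTTP cache (optional)
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'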
    

    4. Implement the spider, starting from the generated skeleton:
    scrapy genspider quotes quotes.toscrape.com

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy_splash import SplashRequest
    
    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/js/']
    
        def start_requests(self):
            for url in self.start_urls:
                # Use SplashRequest instead of Request so Splash renders the page;
                # images=0 skips image loading, timeout=3 caps the render time (seconds)
                yield SplashRequest(url, args={'images': 0, 'timeout': 3})
    
        def parse(self, response):
            infos = response.xpath('//div[@class="quote"]')
            for info in infos:
                quote = info.xpath('span[@class="text"]/text()').extract_first()
                # The path must be relative (.//); a leading // searches the whole
                # document and would always return the first author on the page
                author = info.xpath('.//small[@class="author"]/text()').extract_first()
                yield {'quote': quote, 'author': author}
    
            # Follow the "Next" link, again through Splash
            href = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if href:
                full_url = response.urljoin(href)
                yield SplashRequest(full_url, args={'images': 0, 'timeout': 3})
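
    Under the hood, the scrapy_splash middleware rewrites each SplashRequest into a request against the Splash service itself. A simplified sketch of the equivalent plain request (the real middleware POSTs to the endpoint and also handles cookies and dedup; the splash_url below assumes the SPLASH_URL configured earlier):

    import scrapy
    from urllib.parse import urlencode

    # Roughly what SplashRequest(url, args={'images': 0, 'timeout': 3}) becomes:
    # a call to Splash's render.html endpoint, with the target URL and the
    # render arguments passed as parameters.
    splash_url = 'http://192.168.99.100:8050/render.html'
    params = {'url': 'http://quotes.toscrape.com/js/', 'images': 0, 'timeout': 3}
    equivalent = scrapy.Request(splash_url + '?' + urlencode(params))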
    

    5. Run scrapy crawl quotes -o quotes.csv; each quote/author pair is written as one row of quotes.csv.
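
    The crawl can also be launched from a script instead of the command line. A minimal sketch using Scrapy's CrawlerProcess (FEED_URI/FEED_FORMAT were the CSV-export settings in Scrapy versions current when this was written; newer releases use the FEEDS setting instead):

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    settings.set('FEED_URI', 'quotes.csv')   # mirrors the -o quotes.csv option
    settings.set('FEED_FORMAT', 'csv')

    process = CrawlerProcess(settings)
    process.crawl('quotes')    # spider name from QuotesSpider.name
    process.start()            # blocks until the crawl finishes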

    6. Note that the Splash service must be reachable. If you are not sure it is running, open http://192.168.99.100:8050/ in a browser, type any URL (e.g. www.baidu.com) into the page, and check whether a rendered page comes back.
    If it does not connect, on Windows 7 you can start "Docker Quickstart Terminal" first, then open "SecureCRT", connect to 192.168.99.100, and run:
    sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
    Map the ports you need: 8050 serves HTTP, 8051 serves HTTPS, and 5023 serves telnet.
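
    The same connectivity check can be scripted. A minimal sketch with the requests library (assuming Splash listens at 192.168.99.100:8050 as configured above):

    import requests

    SPLASH = 'http://192.168.99.100:8050'

    try:
        # render.html is the Splash endpoint that returns the rendered page
        r = requests.get(SPLASH + '/render.html',
                         params={'url': 'http://quotes.toscrape.com/js/', 'timeout': 3},
                         timeout=10)
        print('Splash OK' if r.ok else 'Splash returned HTTP %d' % r.status_code)
    except requests.ConnectionError:
        print('Cannot reach Splash at ' + SPLASH)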
