11. Project in practice: scraping famous quotes from toscrape

Author: 橄榄的世界 | Published 2018-03-14 01:22

Target URL: http://quotes.toscrape.com/js/
Data scraped: famous quotes
Approach: Scrapy framework + Splash
Storage format: CSV

1. Install scrapy-splash:
py -3 -m pip install scrapy-splash

2. Create the project:
scrapy startproject splash_examples
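
This generates the standard Scrapy project skeleton; for a recent Scrapy release it should look roughly like this (file names may vary slightly between versions):

splash_examples/
    scrapy.cfg
    splash_examples/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py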

3. Configure settings.py

SPLASH_URL = 'http://192.168.99.100:8050'  # address of the Splash service

# Enable the two scrapy_splash downloader middlewares and adjust the
# priority of HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Use the Splash-aware duplicate request filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Support cache_args (optional)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
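
If the project also uses Scrapy's HTTP cache, the scrapy-splash documentation additionally suggests a Splash-aware cache storage; this is optional and only needed in that case:

# Optional: Splash-aware HTTP cache storage (only if HTTPCACHE is enabled)
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'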

4. Implement the spider
scrapy genspider quotes quotes.toscrape.com

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, args={'images': 0, 'timeout': 3})  # use SplashRequest instead of scrapy.Request

    def parse(self, response):
        infos = response.xpath('//div[@class="quote"]')
        for info in infos:
            quote = info.xpath('span[@class="text"]/text()').extract_first()
            # relative XPath (.//) is required here; '//small' would always match the first author on the page
            author = info.xpath('.//small[@class="author"]/text()').extract_first()
            yield {'quote':quote,'author':author}

        href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if href:
            full_url = response.urljoin(href)
            yield SplashRequest(full_url,args={'images':0,'timeout':3})
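
If the page's JavaScript needs a moment to finish before the HTML is captured, Splash also accepts a wait render argument. A minimal sketch of the same request with a wait added (the 0.5 second value is an arbitrary example, not from the original post):

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' tells Splash to pause before returning the rendered page
            yield SplashRequest(url, args={'images': 0, 'timeout': 30, 'wait': 0.5})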

5. Run scrapy crawl quotes -o quotes.csv; the scraped quotes and authors are written to quotes.csv.
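
If you want the CSV columns exported in a fixed order, Scrapy's FEED_EXPORT_FIELDS setting can be added to settings.py (optional):

FEED_EXPORT_FIELDS = ['quote', 'author']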

6. Note that the Splash service must be reachable. If you are not sure whether it is running, open http://192.168.99.100:8050/ in a browser, enter any URL (for example www.baidu.com), and check that a rendered page comes back.
If the connection fails, on Windows 7 first start "Docker Quickstart Terminal", then open "SecureCRT", connect to 192.168.99.100 and run:
sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
Map only the ports you need: 8050 is HTTP, 8051 is HTTPS, and 5023 is telnet.
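
The connectivity check can also be scripted. A minimal sketch, assuming the requests library is installed and Splash is listening at 192.168.99.100:8050 (render.html is Splash's standard HTML rendering endpoint):

import requests

# Ask Splash to render the target page and return the HTML; a 200 response
# with a non-empty body means the service is up and rendering correctly.
resp = requests.get(
    'http://192.168.99.100:8050/render.html',
    params={'url': 'http://quotes.toscrape.com/js/', 'timeout': 30},
    timeout=35,
)
print(resp.status_code, len(resp.text))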
