Target URL: http://quotes.toscrape.com/js/
Data to scrape: famous quotes
Approach: the Scrapy framework + Splash
Storage: CSV
1. Install scrapy-splash:
py -3 -m pip install scrapy-splash
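To confirm the package landed in the right interpreter, pip can report it (standard pip usage):
py -3 -m pip show scrapy-splash  # prints name, version, and install location if the install succeeded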
2. Create the project:
scrapy startproject splash_examples
3. Configure settings.py:
SPLASH_URL = 'http://192.168.99.100:8050'  # address of the Splash service

# Enable the two scrapy-splash downloader middlewares and adjust the order of HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Use the Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Support cache_args (optional)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
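One more optional line: if the project also turns on Scrapy's HTTP cache, the scrapy-splash README recommends a Splash-aware cache storage backend. It is only needed when HTTPCACHE_ENABLED is set:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'  # optional; only when the HTTP cache is enabled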
4. Write the Spider:
scrapy genspider quotes quotes.toscrape.com
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        for url in self.start_urls:
            # Use SplashRequest instead of scrapy.Request so Splash renders the page
            yield SplashRequest(url, args={'images': 0, 'timeout': 3})

    def parse(self, response):
        infos = response.xpath('//div[@class="quote"]')
        for info in infos:
            quote = info.xpath('span[@class="text"]/text()').extract_first()
            # Must be a relative path ('.//'): '//small[...]' would return the
            # first author on the whole page for every quote
            author = info.xpath('.//small[@class="author"]/text()').extract_first()
            yield {'quote': quote, 'author': author}
        # Follow the "Next" link, again through Splash
        href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if href:
            full_url = response.urljoin(href)
            yield SplashRequest(full_url, args={'images': 0, 'timeout': 3})
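The args dict is forwarded to Splash's render.html endpoint, so any of that endpoint's parameters can be passed here. For pages whose JavaScript needs a moment to finish, adding a short wait is common (a sketch; the 0.5-second value is an assumption to tune per site):
yield SplashRequest(url, args={'images': 0, 'timeout': 3, 'wait': 0.5})  # 'wait' pauses rendering; 0.5 s is an assumed value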
5. Run the crawl and export the results to CSV:
scrapy crawl quotes -o quotes.csv
Each row of quotes.csv holds one quote and its author, matching the keys of the dicts yielded in parse().
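The quote text contains curly Unicode quotation marks, so if the CSV looks garbled when opened in Excel, setting an explicit export encoding in settings.py can help (an optional suggestion; 'utf-8-sig' prefixes a BOM that Excel recognizes):
FEED_EXPORT_ENCODING = 'utf-8-sig'  # optional; BOM-prefixed UTF-8 so Excel detects the encoding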
6. Note that the Splash service must be reachable. If you are not sure whether Splash is running, open http://192.168.99.100:8050/ in a browser, enter any URL (e.g. www.baidu.com) in the form, and check whether a rendered page comes back.
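The same check can be scripted against Splash's render.html endpoint (a minimal sketch, assuming Splash listens on 192.168.99.100:8050 as configured above):
import requests

# Ask Splash to render the target page; HTTP 200 with HTML in resp.text means the service is up
resp = requests.get('http://192.168.99.100:8050/render.html',
                    params={'url': 'http://quotes.toscrape.com/js/', 'timeout': 10})
print(resp.status_code, len(resp.text))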
If the connection fails, on Windows 7 you can first start "Docker Quickstart Terminal", then open "SecureCRT", connect to 192.168.99.100, and run:
sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
Map only the ports you need: 8050 serves HTTP, 8051 serves HTTPS, and 5023 is the telnet console.
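To confirm the container actually came up afterwards, listing the running containers is a quick check (standard Docker usage):
sudo docker ps  # the scrapinghub/splash container should appear with ports 5023/8050/8051 mapped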