I. Environment Setup
1. Install scrapy-splash
pip install scrapy-splash
2. Install Docker
apt install docker.io
3. Run Splash in Docker
Download the scrapy-splash source code (for the examples and docs):
git clone https://github.com/scrapy-plugins/scrapy-splash.git
cd scrapy-splash
Then run:
docker run -p 8050:8050 scrapinghub/splash
Or specify a maximum timeout:
docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300
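To verify the container is up, you can query Splash's render.html endpoint (see the API reference below). A minimal sketch, assuming Splash listens on localhost:8050 (use the docker0 address 172.17.0.1 if that is how you reach it); the target URL is just a placeholder:

import requests

# Ask Splash to render a page and return the resulting HTML.
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://example.com', 'wait': 0.5, 'timeout': 10},
)
print(resp.status_code)  # 200 means the page was rendered
print(resp.text[:200])   # beginning of the rendered HTML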
4. In settings.py, set SPLASH_URL = 'http://172.17.0.1:8050/'
5. Start the spider: scrapy crawl getdata
References (API docs and tutorials):
https://splash-cn-doc.readthedocs.io/zh_CN/latest/scrapy-splash-toturial.html
https://splash-cn-doc.readthedocs.io/zh_CN/latest/api.html#render-html
https://github.com/scrapy-plugins/scrapy-splash
https://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.touch_actions
II. Creating a scrapy-splash Project
1. Configure settings.py
Point SPLASH_URL at the docker0 bridge address; ifconfig docker0 reports inet addr:172.17.0.1.
DOWNLOADER_MIDDLEWARES = {
    # Engine side: lower order values run closer to the engine
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # Downloader side: higher order values run closer to the downloader
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
SPLASH_URL = 'http://172.17.0.1:8050/'
# SPLASH_URL = 'http://192.168.59.103:8050/'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
2. Use yield SplashRequest() instead of yield scrapy.Request, as sketched below.
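A minimal sketch of such a spider; the name getdata matches the crawl command in step I.5, while the start URL and the selector are illustrative placeholders:

import scrapy
from scrapy_splash import SplashRequest

class GetDataSpider(scrapy.Spider):
    name = 'getdata'  # matches the "scrapy crawl getdata" command above

    def start_requests(self):
        # SplashRequest routes the request through the Splash container;
        # args are forwarded to the render endpoint (here: wait 0.5 s for JS).
        yield SplashRequest(
            'https://example.com',
            callback=self.parse,
            args={'wait': 0.5},
        )

    def parse(self, response):
        # response.text is the JavaScript-rendered HTML returned by Splash.
        yield {'title': response.css('title::text').get()}

Everything else (items, pipelines, scheduling) works as in a plain Scrapy project.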