CentOS scrapy-splash Quick Tutorial


Author: AlastairYuan | Published 2018-11-17 15:44

    I. Environment Setup

    1. Install scrapy-splash

    pip install scrapy-splash

    2. Install Docker

    On CentOS, install Docker with yum (the command below, `apt install docker.io`, is the Debian/Ubuntu equivalent):

    yum install docker

    3. Run Splash in Docker

    Download the scrapy-splash code: https://github.com/scrapy-plugins/scrapy-splash.git

    cd scrapy-splash

    Run

    docker run -p 8050:8050 scrapinghub/splash

    or, to specify a longer maximum timeout:

    docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300

    4. In settings.py, configure SPLASH_URL = 'http://172.17.0.1:8050/'

    5. Start the spider: scrapy crawl getdata
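Before starting the spider it is worth confirming that the Splash container is reachable. A minimal sketch of how the render.html endpoint URL is built (the SPLASH_URL value comes from the settings above; the helper function name is my own, not part of scrapy-splash):

```python
from urllib.parse import urlencode

SPLASH_URL = "http://172.17.0.1:8050"  # docker0 bridge address from the setup above

def render_html_url(target_url, wait=0.5):
    """Build a URL for Splash's render.html endpoint, e.g. for a curl smoke test."""
    qs = urlencode({"url": target_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{qs}"

print(render_html_url("http://example.com"))
```

Opening the printed URL in a browser (or fetching it with curl) should return the rendered HTML of the target page if Splash is running.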


    References: API docs and tutorials

    https://splash-cn-doc.readthedocs.io/zh_CN/latest/scrapy-splash-toturial.html

    https://splash-cn-doc.readthedocs.io/zh_CN/latest/api.html#render-html

    https://github.com/scrapy-plugins/scrapy-splash

    https://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.touch_actions


    II. Creating a scrapy-splash Project

    1. Configure settings.py

    Set SPLASH_URL to the docker0 bridge address: run `ifconfig docker0` and use its inet addr (e.g. 172.17.0.1).

    DOWNLOADER_MIDDLEWARES = {

        # Engine side

        'scrapy_splash.SplashCookiesMiddleware': 723,

        'scrapy_splash.SplashMiddleware': 725,

        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,

        # Downloader side

    }

    SPIDER_MIDDLEWARES = {

        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,

    }

    SPLASH_URL = 'http://172.17.0.1:8050/'

    # SPLASH_URL = 'http://192.168.59.103:8050/'

    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

    2. Use yield SplashRequest() instead of yield scrapy.Request


    Original article: https://www.haomeiwen.com/subject/flfifqtx.html