27. A First Look at scrapy-splash

Author: starrymusic | Published 2019-04-02 11:04

    Before reaching for scrapy-splash, it's worth creating a plain Scrapy project and printing the page it fetches, to make clear what scrapy-splash adds.
    Our practice target looks like this in a browser:

    [Screenshot: the rendered procurement notice list]
    With Scrapy alone, printing the fetched page gives this instead:

    [Screenshot: the raw response, with none of the listing data present]
    So plain Scrapy gets us nowhere here: the listings are filled in by JavaScript, which Scrapy's downloader never executes.

    Time for today's star, Splash, to take the stage.

    Start (or restart) the Docker daemon:

    sudo service docker start
    

    Run the Splash container in detached mode, so that disconnecting from the remote server does not stop the Splash service:

    docker run -d -p 8050:8050 scrapinghub/splash
    

    If the command exits without errors you are generally fine; you can also open localhost:8050 in a browser, which serves the Splash web console.
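
Before wiring Splash into Scrapy, you can exercise its HTTP API directly. The sketch below (an illustration, not part of the original project) builds the GET URL for Splash's `render.html` endpoint; fetching that URL with curl or requests returns the JavaScript-rendered HTML, which is the same endpoint the spider below uses through SplashRequest:

```python
from urllib.parse import urlencode

def splash_render_url(splash_base, target_url, wait=10):
    """Build a GET URL for Splash's render.html endpoint.

    Fetching the returned URL gives back the JavaScript-rendered
    HTML of target_url, waiting `wait` seconds for scripts to run.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return "{}/render.html?{}".format(splash_base.rstrip("/"), query)

url = splash_render_url(
    "http://localhost:8050",
    "http://www.zjzfcg.gov.cn/purchaseNotice/index.html?categoryId=3001",
)
```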



    Next, add the following to the project's settings.py:

    # scrapy-splash configuration:

    # URL of the Splash rendering service
    SPLASH_URL = 'http://localhost:8050'

    # Downloader middlewares
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    # Spider middlewares
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    # Splash-aware duplicate request filter
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    # Splash-aware HTTP cache storage
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
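
The integer values in those middleware dicts are Scrapy priorities: lower numbers run closer to the engine, so requests pass through SplashCookiesMiddleware (723) before SplashMiddleware (725), with HttpCompressionMiddleware (810) after both. A small sketch (for illustration only) of how the ordering falls out of the values:

```python
# The same dict as in settings.py above.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Scrapy sorts middleware by priority value; requests flow through
# them in ascending order (responses come back in descending order).
request_order = [
    name.rsplit('.', 1)[-1]
    for name, prio in sorted(DOWNLOADER_MIDDLEWARES.items(),
                             key=lambda kv: kv[1])
]
```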
    

    The spider file zfcaigou.py:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy_splash import SplashRequest
    from caigou.items import CaigouItem
    # from caigou.items import ZfcaigouItemLoad, CaigouItem
    
    class ZfcaigouSpider(scrapy.Spider):
        name = 'zfcaigou'
        allowed_domains = ['www.zjzfcg.gov.cn']
        start_urls = ['http://www.zjzfcg.gov.cn/purchaseNotice/index.html?categoryId=3001']
    
        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url=url, callback=self.parse,
                                    args={'wait': 10}, endpoint='render.html')
    
        def parse(self, response):
            # print(response.body.decode("utf-8"))
            infodata = response.css(".items p")
            for infoline in infodata:
                caigouitem = CaigouItem()
                caigouitem['city'] = infoline.css(".warning::text").extract()[0].replace("[", "").replace("·", "").strip()
                caigouitem['issuescate'] = infoline.css(".warning .limit::text").extract()[0]
                caigouitem['title'] = infoline.css("a .underline::text").extract()[0].replace("]", "")
                caigouitem['publish_date'] = infoline.css(".time::text").extract()[0].replace("[", "").replace("]", "")
                yield caigouitem
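
The chained .replace()/.strip() calls in parse() are plain string clean-up on the text Splash rendered. Pulled out as a standalone helper (the sample input below is hypothetical; the real text comes from the page's .warning node), the city field's clean-up looks like:

```python
def clean_city(raw):
    """Mirror the clean-up applied to the city field in parse():
    drop the leading bracket and the dot separator, then trim
    surrounding whitespace."""
    return raw.replace("[", "").replace("·", "").strip()

# Hypothetical sample of what .warning::text might yield:
city = clean_city("[杭州市 · ")  # "杭州市"
```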
    

    The items.py code:

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class CaigouItem(scrapy.Item):
        city = scrapy.Field()
        issuescate = scrapy.Field()
        title = scrapy.Field()
        publish_date = scrapy.Field()
    

    Apart from the network being a little slow, the spider did pull the content down.


    Project repository: https://github.com/hfxjd9527/caigou


    Original post: https://www.haomeiwen.com/subject/dmywbqtx.html