Using Selenium and Chrome in Scrapy

Author: 喵帕斯0_0 | Published 2018-06-21 00:03

    This article combines Scrapy, Selenium, and headless Chrome to crawl pages that require JavaScript rendering, using JD.com's search results for 手机 (mobile phones) as the example.

    Page Analysis

    [Figure: animated screenshot of the JD search results page for 手机, showing 100 pages of results and new items appearing as the page scrolls]

    For the phone category there are 100 pages of results in total. The animated screenshot also shows that the page is not loaded all at once: new search results are only rendered after the mouse wheel has scrolled down a certain distance, which is implemented through JavaScript rendering.
    We can simulate scrolling all the way to the bottom with Selenium's execute_script("window.scrollTo(0, document.body.scrollHeight);").
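    As a minimal standalone sketch of that idea (outside Scrapy; the helper name scroll_to_bottom and the pause/round values are my own assumptions, not from the original code), scrolling until the page height stops growing looks roughly like this:

    import time

    def scroll_to_bottom(browser, pause=1, max_rounds=10):
        # Scroll down repeatedly until document.body.scrollHeight stops growing,
        # pausing after each scroll so the lazily loaded results have time to render.
        # 'browser' is any Selenium WebDriver instance, e.g. webdriver.Chrome(...).
        last_height = browser.execute_script("return document.body.scrollHeight")
        for _ in range(max_rounds):
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)
            new_height = browser.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height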

    Looking at the page again, the figure shows that when we click "next page" to go to page 2 of the results, the page value in the URL is 3, and when we click again to go to page 3, the page value is 5. From this we can infer the relationship between the URL parameter and the displayed page: page = 2 * display_page - 1. With this rule we can construct the URLs for every result page.
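    A quick sanity check of that rule (total_page = 100 comes from the screenshot; the URL pattern is the same one the spider below uses):

    search_page_url_pattern = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page={page}&enc=utf-8"
    total_page = 100
    # display pages 1..100 map to page=1, 3, 5, ..., 199 in the URL
    urls = [search_page_url_pattern.format(page=2 * n - 1) for n in range(1, total_page + 1)]
    print(urls[0])   # ...&page=1  -> display page 1
    print(urls[1])   # ...&page=3  -> display page 2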

    Implementation

    Only the key source code is shown here; other files such as settings.py are omitted. The full project is available on my GitHub.

    # search.py
    # -*- coding: utf-8 -*-
    import scrapy
    from selenium import webdriver
    import time


    class SearchSpider(scrapy.Spider):
        name = 'search'
        search_page_url_pattern = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page={page}&enc=utf-8"
        start_urls = ['https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8']
    
        def __init__(self):
            chrome_options = webdriver.ChromeOptions()
            chrome_options.add_argument('--headless')
            chrome_options.add_argument('--no-sandbox')
            self.browser = webdriver.Chrome(chrome_options=chrome_options, executable_path='/usr/local/bin/chromedriver')
            super(SearchSpider, self).__init__()

        def closed(self, reason):
            self.browser.close()        # remember to close the browser when the spider closes
    
        def parse(self, response):
            total_page = response.css('span.p-skip em b::text').extract_first()
            if total_page:
                for i in range(int(total_page)):
                    next_page_url = self.search_page_url_pattern.format(page=2*i + 1)
                    yield scrapy.Request(next_page_url, callback = self.parse_page)
                    time.sleep(1)
    
        def parse_page(self, response):
            phone_info_list = response.css('div.p-name a')
            for item in phone_info_list:
                phone_name = item.css('a::attr(title)').extract_first()
                phone_href = item.css('a::attr(href)').extract_first()
    
                yield dict(name=phone_name, href=phone_href)
    

    Defining the webdriver on the spider means we avoid opening a brand-new browser for every request.
    The browser must be closed in closed().
    In parse() we first read the total number of result pages, then generate the remaining URLs according to the rule above and continue crawling.
    In parse_page() we extract the fields we want according to the page structure; this needs no further explanation.

    # middlewares.py
    import time

    from scrapy import signals
    from scrapy.http import HtmlResponse
    
    class JdDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            spider.browser.get(request.url)
            for i in range(5):
                # scroll to the bottom, then pause briefly so the lazily
                # loaded results have time to render before the next scroll
                spider.browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(1)
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source, encoding='utf8', request=request)
    
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
    
            # Must either;
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response
    
        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.
    
            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    

    Here we take advantage of the downloader middleware mechanism: in process_request() we use the webdriver to simulate scrolling and grab the fully rendered page source, then return an HtmlResponse directly. By Scrapy's rules, when process_request() returns a Response object, the process_request() methods of any remaining downloader middlewares are skipped and the response is returned to the engine.
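    The article does not show settings.py, but for this flow to work the middleware has to be enabled there. A sketch of the relevant entry (the module path jd.middlewares and the priority 543 are assumptions, adjust them to the actual project):

    # settings.py (excerpt)
    DOWNLOADER_MIDDLEWARES = {
        'jd.middlewares.JdDownloaderMiddleware': 543,   # assumed module path
    }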

    Run scrapy crawl search -o result.csv --nolog and the scraped results will be written to result.csv.

    Summary

    This article showed how to use Selenium and headless Chrome together with Scrapy to crawl information from dynamically rendered pages. With this approach, pages that need JavaScript rendering are no longer out of reach.
    Now that crawling dynamic pages is solved, the next problem is the scale of the crawl; in the next article we will look at how to use scrapy-redis for distributed crawling.
