This post combines Scrapy, Selenium, and Headless Chrome to crawl pages that require JavaScript rendering, using JD's phone search pages as the example.
Page Analysis

You can see that the phone category has 100 pages of results in total. The animated GIF shows that each page is not loaded all at once: new results only appear once you scroll the mouse wheel down far enough, which means they are rendered by JavaScript.
We can simulate scrolling to the bottom with Selenium's execute_script("window.scrollTo(0, document.body.scrollHeight);").
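For reference, here is a minimal standalone sketch of that idea; the function name scroll_to_bottom, the pause length, and the round cap are illustrative assumptions rather than project code. It keeps scrolling until the page height stops growing, which is more robust than a fixed number of scrolls:

import time

def scroll_to_bottom(browser, pause=1.0, max_rounds=10):
    # Keep scrolling until document.body.scrollHeight stops growing,
    # i.e. until no more lazy-loaded results appear.
    last_height = browser.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the newly loaded results time to render
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height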
Looking at the page again, the screenshots show that clicking Next to reach display page 2 sets the page value in the URL to 3, and clicking Next again to reach display page 3 sets page to 5. From this we can infer the mapping between the URL parameter and the displayed page: page = 2 * display_page - 1. With that, we can generate the URLs for every page.
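As a quick sanity check of that mapping, the snippet below uses the same URL pattern as the spider to generate the URLs for all 100 display pages:

search_page_url_pattern = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page={page}&enc=utf-8"
# Display pages 1..100 map to page=1,3,5,...,199 in the URL
urls = [search_page_url_pattern.format(page=2 * p - 1) for p in range(1, 101)]
print(urls[0])  # page=1 -> display page 1
print(urls[1])  # page=3 -> display page 2
print(urls[2])  # page=5 -> display page 3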
Implementation
Only the key source code is shown here; settings.py and the other files are omitted (the full project is on my Github).
# search.py
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
import time


class SearchSpider(scrapy.Spider):
    name = 'search'
    search_page_url_pattern = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page={page}&enc=utf-8"
    start_urls = ['https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8']

    def __init__(self):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        # One shared browser instance for the whole crawl
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path='/usr/local/bin/chromedriver')
        super(SearchSpider, self).__init__()

    def closed(self, reason):
        self.browser.close()  # remember to close the browser

    def parse(self, response):
        # Total number of result pages, read from the pager widget
        total_page = response.css('span.p-skip em b::text').extract_first()
        if total_page:
            for i in range(int(total_page)):
                # Display page i+1 maps to URL parameter page=2*i+1
                next_page_url = self.search_page_url_pattern.format(page=2 * i + 1)
                yield scrapy.Request(next_page_url, callback=self.parse_page)
                time.sleep(1)  # crude throttle between generated requests

    def parse_page(self, response):
        phone_info_list = response.css('div.p-name a')
        for item in phone_info_list:
            phone_name = item.css('a::attr(title)').extract_first()
            phone_href = item.css('a::attr(href)').extract_first()
            yield dict(name=phone_name, href=phone_href)
The webdriver is created once inside the spider, so we avoid opening a fresh browser for every request. It must be closed in closed().
In parse() we first read the total number of result pages, then generate the page URLs according to the mapping above and continue crawling.
In parse_page() we extract the target fields with CSS selectors following the page structure; nothing more to add there.
# middlewares.py
import time

from scrapy import signals
from scrapy.http import HtmlResponse


class JdDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        spider.browser.get(request.url)
        for i in range(5):
            spider.browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)  # give the lazy-loaded results time to render
        return HtmlResponse(url=spider.browser.current_url,
                            body=spider.browser.page_source,
                            encoding='utf8', request=request)

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Here we exploit a property of DownloaderMiddleware: in process_request() we use the webdriver to simulate the scrolling and grab the fully rendered page source, then return an HtmlResponse directly. By Scrapy's rules, once process_request() returns a Response object, the remaining downloader middlewares are skipped and the request is never sent to the downloader; the response is handed straight back (only the process_response() chain still runs).
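Since settings.py is omitted above, note that the middleware only takes effect once it is registered there. Below is a minimal sketch; the module path jd.middlewares and the priority 543 are assumptions, so adjust them to your own project layout:

# settings.py (sketch)
# Register the middleware so Scrapy routes requests through it.
# The module path and priority below are assumptions, not project code.
DOWNLOADER_MIDDLEWARES = {
    'jd.middlewares.JdDownloaderMiddleware': 543,
}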
Running scrapy crawl search -o result.csv --nolog produces the crawl results as a CSV file.
Summary
This post covered using Selenium and Headless Chrome together with Scrapy to crawl information from dynamic pages; with this approach, pages that require dynamic rendering are no longer a problem.
Now that dynamic pages are handled, the next issue is crawl scale: in the following post we will learn how to use scrapy-redis for distributed crawling.