Scrapy Basics: CrawlSpider in Detail

Author: xuzhougeng | Published 2016-06-15 19:43

    Foreword

    In Scrapy Basics: Spider I gave a brief introduction to the Spider class. A plain Spider can already do quite a lot, but if you want to crawl an entire site such as Zhihu or Jianshu, you need a more powerful weapon.
    CrawlSpider builds on Spider and, one could say, was born for whole-site crawling.

    Overview

    CrawlSpider is the go-to spider for crawling sites whose URLs follow regular patterns. It is based on Spider and adds a few attributes of its own:

    • rules: a collection of Rule objects used to match the links we want on the target site and filter out the noise.
    • parse_start_url: handles the responses of the start URLs and must return an Item, a Request, or an iterable of them (a minimal sketch follows below).
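
    A minimal sketch of the second point, with a made-up spider name and XPath just to show the shape of parse_start_url:

    from scrapy.spiders import CrawlSpider

    class StartPageSpider(CrawlSpider):   # hypothetical name, for illustration only
        name = 'startpage'
        start_urls = ['http://www.example.com']
        rules = ()   # no link-following rules in this sketch

        def parse_start_url(self, response):
            # Called for the responses of start_urls; return an Item/Request,
            # an iterable of them, or an empty iterable.
            # A plain dict also works as an item in Scrapy 1.0 and later.
            return [{'title': response.xpath('//title/text()').extract_first()}]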

    Since rules is a collection of Rule objects, Rule itself needs a short introduction. It takes several parameters: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
    The link_extractor can be something you define yourself, or the ready-made LinkExtractor class, whose main parameters are listed below (a standalone sketch follows the list):

    • allow: URLs matching this regular expression (or list of regular expressions) are extracted; if empty, every link matches.
    • deny: URLs matching this regular expression (or list of regular expressions) are excluded and never extracted.
    • allow_domains: the domains from which links may be extracted.
    • deny_domains: the domains from which links are never extracted.
    • restrict_xpaths: XPath expressions that, together with allow, restrict the regions of the page links are extracted from. There is a similar restrict_css.
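
    To check what a LinkExtractor would actually match, it can also be used directly on a response; a small sketch (the patterns and the XPath are only illustrative):

    from scrapy.linkextractors import LinkExtractor

    # Extract links to category.php pages inside the main content area,
    # but never links to subsection.php or to other domains.
    link_extractor = LinkExtractor(
        allow=(r'category\.php',),
        deny=(r'subsection\.php',),
        allow_domains=('example.com',),
        restrict_xpaths=('//div[@id="content"]',),   # illustrative container
    )

    def debug_links(response):
        # extract_links() returns Link objects with .url and .text attributes
        for link in link_extractor.extract_links(response):
            print(link.url, link.text)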

    Below is the example from the official documentation; starting from it, I will read the source code to answer a few common questions:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']
    
        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
    
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
        )
    
        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            item = scrapy.Item()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item
    

    Question: how does CrawlSpider work?

    Because CrawlSpider inherits from Spider, it has all of Spider's methods.
    First, start_requests issues a request (via make_requests_from_url) for every url in start_urls, and those responses are received by parse. In a plain Spider we have to define parse ourselves, but CrawlSpider already defines parse to dispatch the response: self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True).
    _parse_response then does different things depending on whether there is a callback, and on follow and self._follow_links:

        def _parse_response(self, response, callback, cb_kwargs, follow=True):
            ## If a callback was passed in, use it to parse the page and collect the requests or items it produces
            if callback:
                cb_res = callback(response, **cb_kwargs) or ()
                cb_res = self.process_results(response, cb_res)
                for requests_or_item in iterate_spider_output(cb_res):
                    yield requests_or_item
            ## Then, if follow is enabled, use _requests_to_follow to pull any matching links out of the response.
            if follow and self._follow_links:
                for request_or_item in self._requests_to_follow(response):
                    yield request_or_item
    

    _requests_to_follow in turn asks the link_extractor (the LinkExtractor we passed in) to extract the links from the page (link_extractor.extract_links(response)), optionally reworks the URLs with process_links (user-defined), and issues a Request for every matching link. Each of those requests is finally passed through process_request (user-defined) before being yielded.
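
    To make that concrete, here is a hedged sketch of what user-defined process_links and process_request hooks might look like in a Rule (the spider name, URL filter, and meta key are invented for the example):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class HookedSpider(CrawlSpider):   # hypothetical name
        name = 'hooked'
        start_urls = ['http://www.example.com']

        rules = (
            Rule(LinkExtractor(allow=(r'item\.php',)),
                 callback='parse_item',
                 process_links='drop_print_links',   # gets the list of extracted Link objects
                 process_request='tag_request'),     # gets each Request before it is yielded
        )

        def drop_print_links(self, links):
            # process_links: filter or rewrite the Link objects before requests are built
            return [link for link in links if 'print=1' not in link.url]

        def tag_request(self, request):
            # process_request: return the (possibly modified) Request to be scheduled
            request.meta['from_rule'] = True
            return request

        def parse_item(self, response):
            self.logger.info('parsed %s', response.url)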

    Question: how does CrawlSpider pick up the rules?

    CrawlSpider calls the _compile_rules method from its __init__. There it makes a shallow copy of each Rule in rules and resolves, for each copy, the callback, the process_links hook, and the process_request hook:

        def _compile_rules(self):
            def get_method(method):
                if callable(method):
                    return method
                elif isinstance(method, six.string_types):
                    return getattr(self, method, None)
    
            self._rules = [copy.copy(r) for r in self.rules]
            for rule in self._rules:
                rule.callback = get_method(rule.callback)
                rule.process_links = get_method(rule.process_links)
                rule.process_request = get_method(rule.process_request)
    

    So how is Rule itself defined?

        class Rule(object):
    
            def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
                self.link_extractor = link_extractor
                self.callback = callback
                self.cb_kwargs = cb_kwargs or {}
                self.process_links = process_links
                self.process_request = process_request
                if follow is None:
                    self.follow = False if callback else True
                else:
                    self.follow = follow
    

    So the LinkExtractor we build is what gets passed in as link_extractor.
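
    The follow logic at the end of __init__ above is easy to miss; a quick sketch of the three combinations (the patterns are placeholders):

    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor

    # No callback: the rule only discovers links, and follow defaults to True.
    Rule(LinkExtractor(allow=(r'category\.php',)))

    # A callback but no follow: follow defaults to False, so matched pages are
    # parsed by parse_item but their own links are not followed by this rule.
    Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item')

    # Both: parse the matched pages and keep following links found on them.
    Rule(LinkExtractor(allow=(r'item\.php',)), callback='parse_item', follow=True)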

    Pages with a callback are handled by the function we specify; which function handles the pages without one?

    From the walkthrough above, _parse_response handles the responses that do have a callback:
    cb_res = callback(response, **cb_kwargs) or ()
    For the URLs matched inside a page, _requests_to_follow issues requests whose callback is self._response_downloaded:
    r = Request(url=link.url, callback=self._response_downloaded)
    _response_downloaded then looks up the rule that produced the request and calls _parse_response again with that rule's callback (which may simply be None).

    How to simulate a login in a CrawlSpider

    Because a CrawlSpider, like a Spider, starts by issuing requests from start_requests, we can override it for the login. The code below, borrowed from Andrew_liu, shows how to log in to Zhihu:

    # These methods replace the default start_requests inside our CrawlSpider
    # subclass; the callback chain is start_requests -> post_login -> after_login.
    # They need: from scrapy import Request, FormRequest, Selector
    def start_requests(self):
        return [Request("http://www.zhihu.com/#signin",
                        meta={'cookiejar': 1},
                        callback=self.post_login)]

    def post_login(self, response):
        print('Preparing login')
        # Grab the _xsrf field from the page we just fetched; it has to be
        # submitted along with the form for the login to succeed.
        xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
        print(xsrf)
        # FormRequest.from_response is a Scrapy helper for posting forms.
        # After a successful login, the after_login callback is invoked.
        return [FormRequest.from_response(response,   # "http://www.zhihu.com/login"
                            meta={'cookiejar': response.meta['cookiejar']},
                            headers=self.headers,   # assumes a headers attribute defined on the spider
                            formdata={
                                '_xsrf': xsrf,
                                'email': '1527927373@qq.com',
                                'password': '321324jia'
                            },
                            callback=self.after_login,
                            dont_filter=True
                            )]

    # make_requests_from_url builds requests whose callback is parse, which hands
    # control back to CrawlSpider's parse and therefore to the rules.
    def after_login(self, response):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
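
    One caveat when combining this login flow with rules (pointed out by readers of the original post): the requests that _requests_to_follow generates do not carry the 'cookiejar' meta key, so the logged-in session is not reused for the links the rules follow. A possible workaround, sketched here as an untested assumption with a made-up class name, is to propagate the meta key yourself:

    from scrapy.http import Request
    from scrapy.spiders import CrawlSpider

    class LoginCrawlSpider(CrawlSpider):   # hypothetical subclass holding the methods above

        def after_login(self, response):
            # carry the cookiejar into the requests for the start_urls as well
            for url in self.start_urls:
                yield Request(url, meta={'cookiejar': response.meta['cookiejar']})

        def _requests_to_follow(self, response):
            # copy the cookiejar onto every request the rules generate
            for request in super(LoginCrawlSpider, self)._requests_to_follow(response):
                if 'cookiejar' in response.meta:
                    request.meta['cookiejar'] = response.meta['cookiejar']
                yield request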
    

    That is the theory; if anything is missing or unclear, feel free to tell me in the comments.
    Next, I will write a spider that crawls every Jianshu user to show how to use CrawlSpider in practice.


    Finally, here is the source code of scrapy.spiders.CrawlSpider for reference:

    """
    This modules implements the CrawlSpider which is the recommended spider to use
    for scraping typical web sites that requires crawling pages.
    
    See documentation in docs/topics/spiders.rst
    """
    
    import copy
    import six
    
    from scrapy.http import Request, HtmlResponse
    from scrapy.utils.spider import iterate_spider_output
    from scrapy.spiders import Spider
    
    
    def identity(x):
        return x
    
    
    class Rule(object):
    
        def __init__(self, link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=identity):
            self.link_extractor = link_extractor
            self.callback = callback
            self.cb_kwargs = cb_kwargs or {}
            self.process_links = process_links
            self.process_request = process_request
            if follow is None:
                self.follow = False if callback else True
            else:
                self.follow = follow
    
    
    class CrawlSpider(Spider):
    
        rules = ()
    
        def __init__(self, *a, **kw):
            super(CrawlSpider, self).__init__(*a, **kw)
            self._compile_rules()
    
        def parse(self, response):
            return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)
    
        def parse_start_url(self, response):
            return []
    
        def process_results(self, response, results):
            return results
    
        def _requests_to_follow(self, response):
            if not isinstance(response, HtmlResponse):
                return
            seen = set()
            for n, rule in enumerate(self._rules):
                links = [lnk for lnk in rule.link_extractor.extract_links(response)
                         if lnk not in seen]
                if links and rule.process_links:
                    links = rule.process_links(links)
                for link in links:
                    seen.add(link)
                    r = Request(url=link.url, callback=self._response_downloaded)
                    r.meta.update(rule=n, link_text=link.text)
                    yield rule.process_request(r)
    
        def _response_downloaded(self, response):
            rule = self._rules[response.meta['rule']]
            return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
    
        def _parse_response(self, response, callback, cb_kwargs, follow=True):
            if callback:
                cb_res = callback(response, **cb_kwargs) or ()
                cb_res = self.process_results(response, cb_res)
                for requests_or_item in iterate_spider_output(cb_res):
                    yield requests_or_item
    
            if follow and self._follow_links:
                for request_or_item in self._requests_to_follow(response):
                    yield request_or_item
    
        def _compile_rules(self):
            def get_method(method):
                if callable(method):
                    return method
                elif isinstance(method, six.string_types):
                    return getattr(self, method, None)
    
            self._rules = [copy.copy(r) for r in self.rules]
            for rule in self._rules:
                rule.callback = get_method(rule.callback)
                rule.process_links = get_method(rule.process_links)
                rule.process_request = get_method(rule.process_request)
    
        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
            spider._follow_links = crawler.settings.getbool(
                'CRAWLSPIDER_FOLLOW_LINKS', True)
            return spider
    
        def set_crawler(self, crawler):
            super(CrawlSpider, self).set_crawler(crawler)
            self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
