
Using scrapy rules

Author: seven1010 | Published 2018-07-14 22:47
    • The typical logic of a crawler is: given one or more start pages, fetch them, extract every link the page contains, push those links onto a queue, then visit the queued links one by one until some boundary condition ends the crawl. To handle the common list-page + detail-page pattern, the link extraction logic has to be constrained. Fortunately scrapy already provides for this; the key is knowing the interface exists and using it flexibly:
    # SgmlLinkExtractor from the original post is long deprecated; LinkExtractor is its modern replacement
    rules = (
        Rule(LinkExtractor(allow=(r'category/20/index_\d+\.html',), restrict_xpaths=("//div[@class='left']",))),  # list-page pagination
        Rule(LinkExtractor(allow=(r'a/\d+/\d+\.html',), restrict_xpaths=("//div[@class='left']",)), callback='parse_item'),  # detail pages
    )
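
    If you are not sure an extractor matches what you expect, you can test it against a response before wiring it into a Rule. A minimal sketch, assuming you run it inside scrapy shell <url> (where response is predefined) and that the target page really contains a <div class="left"> container:

    from scrapy.linkextractors import LinkExtractor

    le = LinkExtractor(allow=(r'category/20/index_\d+\.html',),
                       restrict_xpaths=("//div[@class='left']",))
    for link in le.extract_links(response):  # returns a list of Link objects
        print(link.url)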
    

    Explanation:

    • What the parameters mean
    • A Rule defines how links are extracted. The two rules above match, respectively, the paginated list pages and the detail pages; the key point is using restrict_xpaths to limit link extraction to a specific part of the page.
    • A CrawlSpider's rules attribute extracts urls directly from the response returned for the start urls and automatically creates new requests for them; the callback then parses the responses those extracted urls return.
    • What follow does
      For example, this is the rule I used to crawl Douban's new books: rules = (Rule(LinkExtractor(allow=(r'^https://book.douban.com/subject/[0-9]*/',)), callback='parse_item', follow=False),). Under this rule, only the links on the start page (start_urls) that match the pattern are crawled. If follow is changed to True, the spider keeps looking for matching urls inside every page it crawls, looping until the whole site has been covered.
    • CrawlSpider overrides the parse method, so every response from the automatically created requests is parsed by it. Whether or not a rule has a callback, the response is handled by the same _parse_response method, which simply checks whether follow and callback are set (see the sketch below).
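
    To make that dispatch concrete, here is a simplified sketch of CrawlSpider._parse_response, paraphrased from Scrapy's source rather than copied verbatim (names and details vary between Scrapy versions):

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        # run the rule's callback first, if there is one, and yield its items/requests
        if callback:
            for item_or_request in callback(response, **cb_kwargs) or ():
                yield item_or_request
        # then, if the rule says follow (and CRAWLSPIDER_FOLLOW_LINKS is enabled),
        # extract links according to the rules and schedule new requests
        if follow and self._follow_links:
            for request in self._requests_to_follow(response):
                yield request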

    Example

    # -*- coding: utf-8 -*-
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    
    class ToscrapeRuleSpider(CrawlSpider):
        name = 'toscrape-rule'
        allowed_domains = ['toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']
        custom_settings = {
            'FEED_FORMAT': 'json',  # exporter names are lowercase; 'Json' is not recognized
            'FEED_EXPORT_ENCODING': 'utf-8',
            'FEED_URI': 'rule1.json'
        }
        # rules must be an iterable of Rule objects (a list or tuple)
        rules = [
            # follow=False: only extract matching urls from the start page, crawl those pages, parse via callback
            # follow=True: keep looking for matching urls inside each crawled page, looping until the whole site is covered
            Rule(LinkExtractor(allow=(r'/page/'), deny=(r'/tag/')), callback='parse_item', follow=True)
        ]
    
        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            for quote in response.xpath('//div[@class="quote"]'):
                yield {
                    'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                    'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                    'tags': quote.xpath('.//div[@class="tags"]/a/text()').extract()
                }
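
    Assuming the spider lives inside a Scrapy project, run it with scrapy crawl toscrape-rule (a standalone file also works via scrapy runspider); the quotes are written to rule1.json per the custom_settings above. To reproduce the follow=False result below, flip follow=True to follow=False in the Rule.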
    
    • Result (follow=True): all the index pages were crawled
    2018-07-14 22:36:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-07-14 22:36:41 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
    2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/3/
    2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/1/
    2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/4/
    2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/5/
    2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/6/
    2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/7/
    2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/8/
    2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/9/
    2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/10/
    2018-07-14 22:36:44 [scrapy.core.engine] INFO: Closing spider (finished)
    
    • Result (follow=False): only /page/2/ was crawled, because that is the only matching link extracted from the home page
    2018-07-14 22:44:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-07-14 22:44:08 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
    2018-07-14 22:44:08 [scrapy.core.engine] INFO: Closing spider (finished)
    
