- Reference
- The general logic of a crawler is: given a start page, fetch it, extract all the links it contains, put those links into a queue, then visit the queued links one by one until a boundary condition ends the crawl. To handle the list-page + detail-page pattern, the link-extraction logic needs to be constrained. Fortunately Scrapy already provides this; the key is knowing the interface and applying it flexibly:
rules = (
    Rule(SgmlLinkExtractor(allow=('category/20/index_\d+\.html'), restrict_xpaths=("//div[@class='left']"))),
    Rule(SgmlLinkExtractor(allow=('a/\d+/\d+\.html'), restrict_xpaths=("//div[@class='left']")), callback='parse_item'),
)
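Note: SgmlLinkExtractor comes from older Scrapy releases and has since been deprecated and removed; LinkExtractor from scrapy.linkextractors accepts the same allow and restrict_xpaths arguments and is the drop-in replacement used in the rest of this post.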
Explanation:
- Parameter meanings
  - Rule defines a rule for extracting links. The two rules above correspond to the paginated list pages and to the detail pages respectively; the key point is using restrict_xpaths so that the links to be crawled next are only extracted from a specific part of the page (a full spider sketch follows this list).
  - CrawlSpider's rules attribute extracts URLs directly from the response returned for the start URL, then automatically creates new requests and fetches their responses; the response of each URL extracted by a rule is parsed by that rule's callback.
- What follow does
  - For example, this is the rule I used to crawl Douban's new books: rules = (Rule(LinkExtractor(allow=(r'^https://book.douban.com/subject/[0-9]*/',)), callback='parse_item', follow=False),). Under this rule, only the links on the start page (start_urls) that match the pattern are crawled. If follow is changed to True, the crawler also searches the pages it crawls for URLs matching the rule, and so on, until the whole site has been crawled.
  - CrawlSpider has already overridden the parse function, so every response returned by the automatically created requests is handled by parse. Whether or not a rule has a callback, it goes through the same _parse_response function, which simply checks whether follow and callback are set.
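Putting the two rules and restrict_xpaths together, here is a minimal sketch of a CrawlSpider for the list-page + detail-page pattern. The spider name, domain, start URL, XPaths and URL patterns simply mirror the snippet above and are assumptions for illustration, not a real site; LinkExtractor stands in for the deprecated SgmlLinkExtractor.
# A minimal sketch, assuming a hypothetical site: the domain, start URL,
# XPath and URL patterns below just mirror the snippet above and are not real.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ListDetailSpider(CrawlSpider):
    name = 'list-detail'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/category/20/index_1.html']

    rules = (
        # List-page pagination: no callback, so these links are only followed
        # (for rules without a callback, follow defaults to True).
        Rule(LinkExtractor(allow=(r'category/20/index_\d+\.html',),
                           restrict_xpaths=("//div[@class='left']",))),
        # Detail pages: parsed by parse_item.
        Rule(LinkExtractor(allow=(r'a/\d+/\d+\.html',),
                           restrict_xpaths=("//div[@class='left']",)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Placeholder extraction; replace the XPath with the real detail-page fields.
        yield {
            'title': response.xpath('//h1/text()').extract_first(),
            'url': response.url,
        }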
Example
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ToscrapeRuleSpider(CrawlSpider):
    name = 'toscrape-rule'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'rule1.json'
    }
    # rules must be an iterable of Rule objects (a list or tuple)
    rules = [
        # follow=False (do not follow): only URLs on the start page that match the
        # rule are extracted; those pages are crawled and parsed by the callback.
        # follow=True (follow links): keep looking for matching URLs on the crawled
        # pages as well, repeating until the whole site has been crawled.
        Rule(LinkExtractor(allow=(r'/page/'), deny=(r'/tag/')), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a/text()').extract()
            }
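Assuming the file sits inside a standard Scrapy project, the spider can be run with scrapy crawl toscrape-rule; with the FEED_* settings above, the scraped quotes are written to rule1.json.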
- Result (follow=True): all of the index pages were crawled
2018-07-14 22:36:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 22:36:41 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/3/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/1/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/4/
2018-07-14 22:36:42 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/5/
2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/6/
2018-07-14 22:36:43 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/7/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/8/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/9/
2018-07-14 22:36:44 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/10/
2018-07-14 22:36:44 [scrapy.core.engine] INFO: Closing spider (finished)
- Result (follow=False): only page 2 is crawled, because /page/2/ is the only matching link extracted from the start page
2018-07-14 22:44:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-14 22:44:08 [toscrape-rule] INFO: Hi, this is an item page! http://quotes.toscrape.com/page/2/
2018-07-14 22:44:08 [scrapy.core.engine] INFO: Closing spider (finished)