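response.follow mainly exists to simplify URL joining.
The most basic way to join URLs in Scrapy is response.urljoin, which resolves a relative href against the URL of the current page. The code looks like this: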
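# Method of a scrapy.Spider subclass; the module needs `import scrapy`.
def parse(self, response):
    # Extract the raw (possibly relative) href strings.
    href_list = response.xpath("//div[@class='card']/a/@href").extract()
    for href in href_list:
        # Turn the relative href into an absolute URL.
        url = response.urljoin(href)
        yield scrapy.Request(url=url, callback=self.parse_next)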
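To see what the joining step does without running a spider, here is a minimal sketch using the standard library's urljoin, which applies the same resolution rules that response.urljoin relies on. The page URL and href values are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and href values, for illustration only.
page_url = "https://example.com/list/page1.html"
hrefs = [
    "/detail/1.html",                    # root-relative
    "detail/2.html",                     # relative to the current directory
    "https://other.example.com/3.html",  # already absolute
]

# Resolve every href against the URL of the page it was found on.
resolved = [urljoin(page_url, href) for href in hrefs]

for url in resolved:
    print(url)
```

Root-relative paths replace the whole path, plain relative paths are resolved against the directory of the current page, and absolute URLs pass through unchanged.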
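This is already concise, but the explicit urljoin call is still redundant, which is why follow was introduced.
follow: usage 1
Pass the relative URL string straight to follow. It is resolved against the response URL for you, so there is no need to handle the joining yourself.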
def parse(self, response):
    href_list = response.xpath("//div[@class='card']/a/@href").extract()
    for href in href_list:
        yield response.follow(url=href, callback=self.parse_next)
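follow: usage 2
Pass a Selector object straight to follow. For an <a> element, follow reads the href attribute itself, so the @href step can be left out of the XPath.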
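def parse(self, response):
    # Select the <a> elements themselves. The XPath must not end with "/",
    # because a trailing slash makes the expression invalid.
    link_list = response.xpath("//div[@class='card']/a")
    for link in link_list:
        yield response.follow(url=link, callback=self.parse_next)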
follow_all simplifies follow further by removing the explicit loop. It is available from Scrapy 2.0 onward.
follow_all: usage 1
Pass a SelectorList directly.
def parse(self, response):
    link_list = response.xpath("//div[@class='card']/a")
    yield from response.follow_all(urls=link_list, callback=self.parse_next)
follow_all: usage 2
Pass the extraction rule itself through the xpath (or css) argument. This style reads much like a link extractor.
def parse(self, response):
    yield from response.follow_all(xpath="//div[@class='card']/a", callback=self.parse_next)