Recursively Crawling Web Pages with Scrapy

Author: aManNoName | Published 2017-12-06 15:28

A Scrapy spider's parse method can return two kinds of values: an item (BaseItem) or a Request. Returning Requests is how recursive crawling is implemented.

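If the data to scrape is on the current page, parse it and return the item directly (in the code, change the line marked with the ** comment to yield item).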

If the data is on a page that the current page links to, return a Request and specify parse_item as its callback.

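If part of the data is on the current page and part is on the linked page (a blog or forum, for example, where the list page has the title, summary, and URL and the detail page has the full content), use the Request's meta argument to pass the data already parsed from the current page to parse_item, which then fills in the rest of the item.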

To move on to other pages (such as the next page) after finishing the current one, return a Request whose callback is parse.

One odd thing: parse cannot return a list of items, yet parse_item, used as a callback, can. I am not sure why.
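A likely explanation, offered here as an assumption rather than something the original confirms: in this article's code parse contains yield, which makes it a generator function, while parse_item is an ordinary function. Scrapy iterates over whatever a callback returns, so an ordinary function may return a list, but a generator only delivers what it yields. The sketch below is plain Python with no Scrapy involved; the function names and the strings standing in for items and requests are hypothetical.

```python
def callback_returning_list():
    # Ordinary function: the caller receives the whole list.
    return ["item-1", "item-2"]


def callback_yielding():
    # Generator function: each yielded value is delivered one at a time.
    yield "item-1"
    yield "item-2"


def generator_with_return():
    # Mixing the two styles: the returned list is NOT delivered to a
    # caller that simply iterates (in Python 2 this is a SyntaxError).
    yield "request-1"
    return ["item-1", "item-2"]


def collect(callback):
    # Hypothetical stand-in for a framework that iterates over the result.
    return list(callback())


print(collect(callback_returning_list))
print(collect(callback_yielding))
print(collect(generator_with_return))
```

If this is indeed the cause, the fix inside a generator-style parse is to yield each item in a loop instead of returning the list.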

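Also, the text obtained by calling extract() on text() directly does not include the content of child tags. Use d.xpath('node()').extract() instead: it returns the text together with the HTML markup, and filtering out the tags then leaves plain text.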

I did not find a method that returns the HTML directly.
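The tag-filtering step mentioned above can be sketched with the standard library alone. The fragment lists below are made-up examples of what text() and node() would produce for the markup `<p>Hello <a href="#">world</a>!</p>`; the helper name fragments_to_text is hypothetical and is not part of Scrapy.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of the markup it is fed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def fragments_to_text(fragments):
    # `fragments` is a list of strings such as the one returned by
    # d.xpath('node()').extract(): plain text mixed with HTML snippets.
    collector = _TextCollector()
    for fragment in fragments:
        collector.feed(fragment)
    collector.close()
    return "".join(collector.parts)


# Assumed sample data, written by hand rather than scraped.
text_only = ["Hello ", "!"]                             # like text()
with_nodes = ["Hello ", '<a href="#">world</a>', "!"]   # like node()

print("".join(text_only))
print(fragments_to_text(with_nodes))
```

Depending on the page, the XPath string() function, as in d.xpath('string()').extract(), may be a shorter route, since it concatenates the text of all descendant nodes.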

    from scrapy.http import Request
    from scrapy.selector import Selector
    from scrapy.spiders import Spider

    from dirbot.items import Article


    class YouyousuiyueSpider(Spider):
        name = "youyousuiyue2"
        allowed_domains = ["youyousuiyue.sinaapp.com"]
        start_urls = [
            'http://youyousuiyue.sinaapp.com',
        ]

        def load_item(self, d):
            # Fill in the fields available on the list page: title and URL.
            item = Article()
            title = d.xpath('header/h1/a')
            item['title'] = title.xpath('text()').extract()
            self.logger.info('title: %s', item['title'][0])
            item['url'] = title.xpath('@href').extract()
            return item

        def parse_item(self, response):
            # Callback for the detail page: take the partially filled item
            # from meta and add the full content.
            item = response.meta['item']
            sel = Selector(response)
            d = sel.xpath('//div[@class="entry-content"]/div')
            item['content'] = d.xpath('text()').extract()
            return item

        def parse(self, response):
            """
            The lines below are a spider contract. For more info see:
            http://doc.scrapy.org/en/latest/topics/contracts.html

            @url http://youyousuiyue.sinaapp.com
            @scrapes name
            """
            self.logger.info('parsing %s', response.url)
            sel = Selector(response)
            articles = sel.xpath('//div[@id="content"]/article')
            for d in articles:
                item = self.load_item(d)
                # ** or yield item
                yield Request(item['url'][0], meta={'item': item},
                              callback=self.parse_item)

            # Follow the "previous" navigation link to the next list page.
            links = sel.xpath('//div[@class="nav-previous"]/a/@href').extract()
            if not links:
                return  # no further page to follow
            link = links[0]
            if link[-1] == '4':
                return  # stop once the link ends in '4'
            self.logger.info('yielding %s', link)
            yield Request(link, callback=self.parse)