Python web crawler

Author: 轻狂清风 | Published: 2018-09-16 13:57


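To loop over the records in a JSON response, index from the top-level key down to the key that holds the data, adding one intermediate key for each level of nesting in between:

for each in response.json['top-level key'][... intermediate keys ...]['data-level key']: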

For example, given JSON in this format:

{"code":1,
 "msg":"操作成功",
 "data":
    {"pageNo":1,
     "hasNext":true,
     "list": [{"docid":"DRQQ35F90511ELD5","boardid":"dy_wemedia_bbs","postid":null,"topicid":null,"recommendtids":null,"userid":null,"nickname":null,"userinfo":null,"title":"海湾被鲜血染成血红色：100多只海豚和鲸鱼惨遭法罗群岛渔民斩杀"}]
    }
}

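Code:

for each in response.json['data']['list']:
    # Each element is one record (a dict) from the "list" array.
    print(each['docid'], each['title'])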
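
The same indexing works at any nesting depth. As a minimal sketch, the helper below walks a sequence of keys into parsed JSON and reports which part of the path is missing if the structure differs from what was expected. The name `dig` and the trimmed sample payload are assumptions made for illustration; they are not part of pyspider. In a pyspider callback the parsed dict would come from response.json rather than json.loads.

```python
import json

# Trimmed sample payload shaped like the response above (illustrative values).
SAMPLE = '''
{"code": 1, "msg": "ok",
 "data": {"pageNo": 1, "hasNext": true,
          "list": [{"docid": "DRQQ35F90511ELD5", "title": "example title"}]}}
'''


def dig(payload, *keys):
    """Follow keys one level at a time; raise KeyError naming the failing path."""
    node = payload
    for depth, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            raise KeyError("/".join(keys[:depth + 1]))
        node = node[key]
    return node


parsed = json.loads(SAMPLE)
# Equivalent to parsed['data']['list'], but with a clearer error on a bad path.
docids = [each["docid"] for each in dig(parsed, "data", "list")]
print(docids)
```

A failed lookup such as dig(parsed, "data", "missing") raises KeyError with the path "data/missing", which is usually easier to debug than a bare KeyError raised deep inside a loop.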

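Passing parameters in pyspider
At first I was not using save to pass parameters between callbacks. The basic pattern looks like this: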

def on_start(self):
    self.crawl('http://www.example.org/',
               callback=self.callback, save={'a': 123})

def callback(self, response):
    return response.save['a']
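
To make the round trip concrete without a running pyspider instance, here is a toy model. The classes `ToyResponse` and `ToyHandler` are hypothetical stand-ins written only for this sketch. They mimic a single behavior: whatever is given as save when a request is scheduled shows up as response.save in that request's callback. Real pyspider also fetches the page and schedules tasks, all of which is omitted here.

```python
class ToyResponse:
    """Hypothetical stand-in for a response: carries only the URL and saved payload."""

    def __init__(self, url, save):
        self.url = url
        self.save = save


class ToyHandler:
    """Hypothetical stand-in for a handler; this is NOT pyspider's BaseHandler."""

    def __init__(self):
        self.pending = []

    def crawl(self, url, callback, save=None):
        # Remember the request together with its save payload.
        self.pending.append((url, callback, save))

    def run(self):
        # Deliver each pending request to its callback, in scheduling order.
        results = []
        while self.pending:
            url, callback, save = self.pending.pop(0)
            results.append(callback(ToyResponse(url, save)))
        return results

    def on_start(self):
        self.crawl('http://www.example.org/',
                   callback=self.callback, save={'a': 123})

    def callback(self, response):
        return response.save['a']


handler = ToyHandler()
handler.on_start()
results = handler.run()
print(results)
```

The point of the model is that the callback never sees the local variables of the method that scheduled it, so anything it needs has to travel with the request.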

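Pass the values scraped in the previous step through save, then read them back in the callback:

    def index_page(self, response):
        for each in response.json['data']['list']:
            docid = each['docid']
            title = each['title']
            imgsrc = each['imgsrc']
            # Attach imgsrc to the request so detail_page can read it from response.save.
            self.crawl('http://www.***.com/***', callback=self.detail_page,
                       save={'imgsrc': imgsrc})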

    @config(priority=2)
    def detail_page(self, response):
        # Read the value carried over from the previous step.
        imgsrc = response.save['imgsrc']
        content = response.doc('#content').html()
        return {
            "content": content,
            "title": response.doc('h2').text(),
            "imgsrc": imgsrc,
        }

This makes the values from the previous step available in the next callback.
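
One practical detail: the sample JSON earlier does not show every field the callbacks read, so a key such as imgsrc may be absent for some records. A minimal sketch of building the save payload defensively follows. The helper name `build_save` and its default field tuple are assumptions for illustration; the dict it returns is what would be passed as save= to self.crawl.

```python
def build_save(item, fields=("docid", "title", "imgsrc")):
    """Copy only the wanted fields; a missing key becomes None instead of raising."""
    return {name: item.get(name) for name in fields}


# Illustrative record: it has no "imgsrc" key and carries an extra "userid" key.
item = {"docid": "DRQQ35F90511ELD5", "title": "example title", "userid": None}
save = build_save(item)
print(save)
```

Copying only the needed fields keeps the payload small, and the callback can then check for None rather than failing on a missing key.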