Python Web Crawling

Author: 轻狂清风 | Published: 2018-09-16 13:57 | Views: 0

    for each in response.json['top_level_key'][...]['data_key']:  # [...] = intermediate keys, one per level of JSON nesting

    For example, given JSON in this format:

    {
        "code": 1,
        "msg": "操作成功",
        "data": {
            "pageNo": 1,
            "hasNext": true,
            "list": [
                {
                    "docid": "DRQQ35F90511ELD5",
                    "boardid": "dy_wemedia_bbs",
                    "postid": null,
                    "topicid": null,
                    "recommendtids": null,
                    "userid": null,
                    "nickname": null,
                    "userinfo": null,
                    "title": "海湾被鲜血染成血红色:100多只海豚和鲸鱼惨遭法罗群岛渔民斩杀"
                }
            ]
        }
    }
    

    Code:

    # "data" is the top-level key and "list" holds the array of entries
    for each in response.json['data']['list']:
        docid = each['docid']
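
The same traversal can be tried outside a crawler. The sketch below is a minimal illustration, not part of the original post: `json.loads` stands in for pyspider's `response.json`, the payload is trimmed to a few fields with placeholder `msg` and `title` values, and the `dig` helper is a hypothetical name introduced here to show one way of coping with a missing level.

```python
import json

# Sample payload shaped like the response above (trimmed; msg and title are placeholders).
raw = '''
{"code": 1, "msg": "ok",
 "data": {"pageNo": 1, "hasNext": true,
          "list": [{"docid": "DRQQ35F90511ELD5",
                    "boardid": "dy_wemedia_bbs",
                    "title": "sample title"}]}}
'''

def dig(obj, *keys, default=None):
    """Follow keys one level at a time; return default if any level is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj

payload = json.loads(raw)

# Same result as payload['data']['list'], but tolerant of a missing level.
items = dig(payload, 'data', 'list', default=[])
docids = [item['docid'] for item in items]
print(docids)
```

Plain indexing such as `payload['data']['list']` raises `KeyError` when a level is absent, so a helper like this may be worth having when the API sometimes returns an error body without `data`.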
    
    

    Passing parameters in pyspider
    At first I was not using save to pass parameters. The basic pattern with save looks like this:

    def on_start(self):
        self.crawl('http://www.example.org/',
        callback=self.callback, save={'a': 123})
    
    def callback(self, response):
        return response.save['a']
    

    Take the values scraped in the previous step, attach them to the request, and read them back in the callback:

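        def index_page(self, response):
            # Walk the "list" array of the JSON response
            for each in response.json['data']['list']:
                docid = each['docid']
                title = each['title']
                imgsrc = each['imgsrc']
                # Attach imgsrc to the task so the next callback can read it
                self.crawl('http://www.***.com/***', callback=self.detail_page,
                           save={'imgsrc': imgsrc})

        @config(priority=2)
        def detail_page(self, response):
            # response.save holds the dict that index_page passed as save
            imgsrc = response.save['imgsrc']
            content = response.doc('#content').html()
            return {
                "content": content,
                "title": response.doc('h2').text(),
                "imgsrc": imgsrc
            }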
    
    

    This way the values from the previous step are available in the next callback.
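
Why the value has to travel with the request can be seen in a toy model. The sketch below is a simplified stand-in and not pyspider's implementation: `MiniCrawler`, `Response`, the item data, and the example.org URLs are all hypothetical. It only assumes that a callback runs later than the `crawl` call that scheduled it, so local variables from the scheduling function are gone by then and anything needed must be stored on the task.

```python
from collections import deque

class Response:
    """Hypothetical stand-in for a crawler response; it only carries url and save."""
    def __init__(self, url, save=None):
        self.url = url
        self.save = save

class MiniCrawler:
    def __init__(self):
        self.queue = deque()
        self.results = []

    def crawl(self, url, callback, save=None):
        # The task is only queued here. The callback runs later,
        # so whatever it needs must be stored with the task.
        self.queue.append((url, callback, save))

    def run(self):
        while self.queue:
            url, callback, save = self.queue.popleft()
            result = callback(Response(url, save))
            if result is not None:
                self.results.append(result)

    def index_page(self, response):
        # Made-up items standing in for response.json['data']['list']
        items = [{'docid': 'a1', 'imgsrc': 'a1.jpg'},
                 {'docid': 'b2', 'imgsrc': 'b2.jpg'}]
        for each in items:
            self.crawl('http://www.example.org/' + each['docid'],
                       callback=self.detail_page,
                       save={'imgsrc': each['imgsrc']})

    def detail_page(self, response):
        # The value set in index_page comes back through response.save
        return {'url': response.url, 'imgsrc': response.save['imgsrc']}

crawler = MiniCrawler()
crawler.crawl('http://www.example.org/', callback=crawler.index_page)
crawler.run()
print(crawler.results)
```

If `save` were left out of the `crawl` call, `response.save` would be `None` in this model and `response.save['imgsrc']` would fail with a `TypeError`.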
