Baijiahao Scraping (2)


Author: 偷了月光的猫 | Published: 2019-01-22 10:21 | Views: 14

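    This post covers how to scrape the comment count and read count of Baijiahao articles.

    Both counts are returned by a separate JSON (JSONP) endpoint, requested with a URL like this one: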

    https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp&params=%5B%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229683117499664348209%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221113000014175815%22%2C%22feed_id%22%3A%229683117499664348209%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228997120757336896754%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171319%22%2C%22feed_id%22%3A%228997120757336896754%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229442416292259854102%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221106000014171220%22%2C%22feed_id%22%3A%229442416292259854102%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%228994022518148142722%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221084000014170786%22%2C%22feed_id%22%3A%228994022518148142722%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229180210467318996709%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221110000014181138%22%2C%22feed_id%22%3A%229180210467318996709%22%7D%2C%7B%22user_type%22%3A%223%22%2C%22dynamic_id%22%3A%229470100560664750777%22%2C%22dynamic_type%22%3A%222%22%2C%22dynamic_sub_type%22%3A%222001%22%2C%22thread_id%22%3A%221119000014172446%22%2C%22feed_id%22%3A%229470100560664750777%22%7D%5D&uk=D0hHfmuMEVka02HZelKA7g&_=1548119615162&callback=jsonp1
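
    To see what the long params value contains, it can be percent-decoded and parsed as JSON. The sketch below uses only the standard library. For brevity it works on a shortened copy of the URL above that keeps just the first of the six entries; every value in it is taken verbatim from that URL.

```python
import json
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the URL above: only the first entry of `params` is kept.
url = (
    "https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp"
    "&params=%5B%7B%22user_type%22%3A%223%22%2C"
    "%22dynamic_id%22%3A%229683117499664348209%22%2C"
    "%22dynamic_type%22%3A%222%22%2C"
    "%22dynamic_sub_type%22%3A%222001%22%2C"
    "%22thread_id%22%3A%221113000014175815%22%2C"
    "%22feed_id%22%3A%229683117499664348209%22%7D%5D"
    "&uk=D0hHfmuMEVka02HZelKA7g&_=1548119615162&callback=jsonp1"
)

# parse_qs percent-decodes every query value and returns {name: [values]}.
query = parse_qs(urlsplit(url).query)

# The decoded `params` value is a JSON array with one object per article.
entries = json.loads(query["params"][0])

for entry in entries:
    print(entry)
print(query["callback"][0])  # name of the JSONP wrapper function
```

    Decoded this way, params turns out to be a plain JSON array of objects, each carrying the same six string fields.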

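    How this URL is built

    The params value is made up mainly of the following fields, all taken from the JSON response of the previous step:

    "user_type"

    "dynamic_id"

    "dynamic_type"

    "dynamic_sub_type"

    "thread_id"

    "feed_id"

    These are concatenated, in percent-encoded form, into one string.

    The code is as follows: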

    import re

    # title, url, date, cerate, publish, updated and asyncData are the lists
    # built in the previous step; b is the value used for the `_` parameter.
    # readjson is assumed to already hold the URL up to and including
    # 'params=%5B%7B%22' (the encoded '[{"').
    for i in range(len(title)):
        user_type = re.findall(r'"user_type":"(.+?)",', asyncData[i])[0]
        dynamic_id = re.findall(r'"dynamic_id":"(.+?)",', asyncData[i])[0]
        dynamic_type = re.findall(r'"dynamic_type":"(.+?)",', asyncData[i])[0]
        dynamic_sub_type = re.findall(r'"dynamic_sub_type":"(.+?)",', asyncData[i])[0]
        thread_id = re.findall(r'"thread_id":"(.+?)",', asyncData[i])[0]
        feed_id = re.findall(r'"feed_id":"(.+?)"', asyncData[i])[0]
        print(title[i], url[i], date[i], cerate[i], publish[i], updated[i])

        if i < len(title) - 1:
            # Not the last item: close this object and open the next one.
            readjson += 'user_type%22%3A%22' + user_type + '%22%2C%22' \
                + 'dynamic_id%22%3A%22' + dynamic_id + '%22%2C%22' \
                + 'dynamic_type%22%3A%22' + dynamic_type + '%22%2C%22' \
                + 'dynamic_sub_type%22%3A%22' + dynamic_sub_type + '%22%2C%22' \
                + 'thread_id%22%3A%22' + thread_id + '%22%2C%22' \
                + 'feed_id%22%3A%22' + feed_id + '%22%7D%2C%7B%22'
        else:
            # Last item: close the object and the whole array.
            readjson += 'user_type%22%3A%22' + user_type + '%22%2C%22' \
                + 'dynamic_id%22%3A%22' + dynamic_id + '%22%2C%22' \
                + 'dynamic_type%22%3A%22' + dynamic_type + '%22%2C%22' \
                + 'dynamic_sub_type%22%3A%22' + dynamic_sub_type + '%22%2C%22' \
                + 'thread_id%22%3A%22' + thread_id + '%22%2C%22' \
                + 'feed_id%22%3A%22' + feed_id + '%22%7D%5D'

    # After the loop, append the remaining query parameters.
    readjson += '&uk=D0hHfmuMEVka02HZelKA7g&_=' + str(b)

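    Note: the last feed_id is followed by %22%7D%5D (the encoded "}] that closes the array), not by the '%22%7D%2C%7B%22' (the encoded "},{") used after every earlier item.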
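
    The two different endings are needed only because the JSON array is glued together by hand. As an alternative sketch (not the author's code), the same params string can be produced by serializing a list of dicts and percent-encoding it in one step, which removes the special case for the last item. build_interact_url, items, BASE and FIELDS are hypothetical names introduced here. Treating `_` as the current time in milliseconds is an assumption based on the sample value 1548119615162, and callback=jsonp1 is appended because the sample URL ends with it.

```python
import json
import time
from urllib.parse import quote

BASE = "https://mbd.baidu.com/webpage?type=homepage&action=interact&format=jsonp"
FIELDS = ("user_type", "dynamic_id", "dynamic_type",
          "dynamic_sub_type", "thread_id", "feed_id")


def build_interact_url(items, uk, timestamp_ms=None, callback="jsonp1"):
    """Build the stats URL from a list of dicts holding the six fields."""
    if timestamp_ms is None:
        # Assumption: `_` is a cache-busting timestamp in milliseconds.
        timestamp_ms = int(time.time() * 1000)
    # Keep only the six fields, in the order used by the sample URL.
    ordered = [{key: item[key] for key in FIELDS} for item in items]
    # Compact separators, so the JSON text contains no spaces.
    params = quote(json.dumps(ordered, separators=(",", ":")), safe="")
    return f"{BASE}&params={params}&uk={uk}&_={timestamp_ms}&callback={callback}"


# First entry of the sample URL, written as a plain dict.
items = [{
    "user_type": "3",
    "dynamic_id": "9683117499664348209",
    "dynamic_type": "2",
    "dynamic_sub_type": "2001",
    "thread_id": "1113000014175815",
    "feed_id": "9683117499664348209",
}]
print(build_interact_url(items, uk="D0hHfmuMEVka02HZelKA7g",
                         timestamp_ms=1548119615162))
```

    Because json.dumps writes the separators and the closing bracket itself, the loop no longer has to know which item is the last one.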
