2018-04-17 Weibo Scraping

Author: 纳米片 | Published 2018-04-17 13:37

This is a practice exercise based on Cui Qingcai's tutorial.

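Ajax stands for Asynchronous JavaScript and XML. It is not a programming language but a technique: JavaScript exchanges data with the server and updates part of the page, without reloading the page or changing its URL.

On a traditional web page, updating the content means reloading the whole page. With Ajax, the content can change without a full reload. Behind the scenes the page exchanges data with the server, and once the data arrives, JavaScript modifies the page so that the new content appears.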

1. My Weibo

https://m.weibo.cn/u/5650807778

2. Basic principle

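The Ajax requests are visible in the browser's developer console. The path from sending an Ajax request to the page updating can be divided into 3 steps: (1) send the request; (2) parse the content; (3) render the page.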

More details: https://cuiqingcai.com/5593.html

3. Ajax analysis

Click Preview to see the response content, which is in JSON format.

Selecting one of the Ajax requests shows its request URL: https://m.weibo.cn/api/container/getIndex?type=uid&value=5650807778&containerid=1076035650807778&page=2

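The URL carries four parameters: type, value, containerid, and page. type is always uid, value is always the user ID, and containerid is always 107603 followed by the user ID. Only page varies.

page controls pagination; page=2 means the second page.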
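
The parameter rules above can be turned into a small URL builder. This is a minimal sketch: the helper name build_url is made up for illustration, and the 107603 prefix is simply what was observed in the captured request, not a documented rule.

```python
from urllib.parse import urlencode

BASE_URL = 'https://m.weibo.cn/api/container/getIndex?'


def build_url(uid, page):
    """Build the getIndex URL for one page of a user's posts (hypothetical helper)."""
    params = {
        'type': 'uid',                  # constant in the captured requests
        'value': uid,                   # the user ID
        'containerid': '107603' + uid,  # prefix observed in the captured request
        'page': page,                   # the only parameter that changes
    }
    return BASE_URL + urlencode(params)


url = build_url('5650807778', 2)
print(url)
```

For user 5650807778 and page 2 this reproduces the request URL captured above.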

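In the JSON data, cardlistInfo and cards matter most. The total: 33 entry in cardlistInfo gives the total number of posts, which can be used to work out the number of pages. cards is a list containing 10 elements.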
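
As a quick illustration of the pagination arithmetic, the page count is the total divided by the page size, rounded up. The sketch below assumes every full page holds 10 cards, as seen in this response; the function name page_count is made up here.

```python
import math


def page_count(total, per_page=10):
    """Number of pages needed to cover `total` posts (assumes a fixed page size)."""
    return math.ceil(total / per_page)


print(page_count(33))  # 33 posts at 10 per page -> 4
```

With total: 33 this gives 4 pages, the last of which would hold only 3 posts.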

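The mblog entry of each card holds attitudes_count (number of likes), comments_count (number of comments), reposts_count (number of reposts), created_at (publish time), text (the post body), and so on, all as structured fields.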
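
To make the field layout concrete, here is a rough sketch that pulls those fields out of a single card. The sample card and all its values are made up for illustration, including the created_at format. The tag stripping uses the standard library's html.parser instead of pyquery so the snippet runs without extra packages.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.parts)


def extract(card):
    """Map one card to a flat dict, or None if the card has no mblog."""
    mblog = card.get('mblog')
    if not mblog:
        return None
    return {
        'id': mblog.get('id'),
        'text': strip_tags(mblog.get('text', '')),
        'attitudes': mblog.get('attitudes_count'),
        'comments': mblog.get('comments_count'),
        'reposts': mblog.get('reposts_count'),
        'created_at': mblog.get('created_at'),
    }


# Made-up sample card; real responses carry many more fields.
sample_card = {
    'mblog': {
        'id': '0000000000000000',
        'text': 'Hello <a href="/n/someone">@someone</a>, nice day',
        'attitudes_count': 3,
        'comments_count': 1,
        'reposts_count': 0,
        'created_at': '04-17',  # format made up for illustration
    }
}

print(extract(sample_card))
```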

from urllib.parse import urlencode
import requests
import json
from pyquery import PyQuery as pq
from pymongo import MongoClient

base_url = 'https://m.weibo.cn/api/container/getIndex?'
headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/5650807778',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',  # marks the request as an Ajax request
}

client = MongoClient()  # connect to MongoDB on the default host and port
db = client['weibo']  # database named 'weibo' (created on first write)
collection = db['weibo']  # collection named 'weibo'

# Fetch one page and return the parsed JSON, or None on failure
def get_page(page):
    params = {
        'type': 'uid',
        'value': '5650807778',  # user ID
        'containerid': '1076035650807778',  # 107603 + user ID
        'page': page  # page number
    }
    url = base_url + urlencode(params)  # build the request URL
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:  # request succeeded
            return response.json()  # decode the JSON body into a dict
    except requests.ConnectionError as e:
        print('Error', e.args)
    return None

# Parse one page of JSON and yield one dict per post
def parse_page(jsons):
    if jsons:
        # Pretty-print the raw JSON to the console
        print(json.dumps(jsons, sort_keys=True, indent=4, separators=(',', ': ')))
        items = (jsons.get('data') or {}).get('cards') or []  # all cards on this page
        for item in items:
            item = item.get('mblog')
            if not item:  # skip cards that carry no post
                continue
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['text'] = pq(item.get('text')).text()  # strip HTML tags with pyquery
            weibo['attitudes'] = item.get('attitudes_count')
            weibo['comments'] = item.get('comments_count')
            weibo['reposts'] = item.get('reposts_count')
            yield weibo  # yield makes this function a generator

# Store one post in the database
def save_to_mongo(result):
    if collection.insert_one(result).inserted_id:  # insert a single document
        print('saved to Mongo')

if __name__ == '__main__':
    for page in range(1,11):
        jsons = get_page(page)
        results = parse_page(jsons)
        for result in results:
            print(result)
            save_to_mongo(result)
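
The loop above always requests pages 1 to 10, although this account has only 33 posts. One possible variation is to stop as soon as a page comes back without cards. The sketch below assumes that pages past the end return an empty or missing cards list, which has not been verified against the real API. The names crawl and fake_fetch are made up, and fake_fetch stands in for get_page so the snippet runs offline.

```python
def crawl(fetch, max_pages=10):
    """Yield cards page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        payload = fetch(page) or {}
        cards = (payload.get('data') or {}).get('cards') or []
        if not cards:  # assumed end-of-data signal
            break
        for card in cards:
            yield card


def fake_fetch(page):
    """Offline stand-in for get_page: two pages of data, then nothing."""
    pages = {1: ['a', 'b'], 2: ['c']}
    return {'data': {'cards': pages.get(page, [])}}


print(list(crawl(fake_fetch)))
```

Passing the fetch function in as an argument keeps the stopping logic separate from the network code, so it can be tried out without hitting the server.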