美文网首页大数据 爬虫Python AI SqlPython小哥哥
Python采集微博热评进行情感分析祝你猪年脱单!

Python采集微博热评进行情感分析祝你猪年脱单!

作者: 14e61d025165 | 来源:发表于2019-04-15 15:55 被阅读0次

    <bi style="box-sizing: border-box; display: block;">Ps: 重要的事情说三遍!!! 结尾有彩蛋,结尾有彩蛋,结尾有彩蛋。</bi>

    如果自己需要爬(cai)(ji)的数据量比较大,为了防止被网站封Ip,可以分时段爬取,另外对于爬到的数据一般是用来存储数据库,这就需要对数据进行去重处理,记录上次爬取的状态,就可以实现在爬虫中断后,可以快速继续上次的状态,实现增量爬取,这里可以参考我之前写过的一个新闻采集,增量采集新闻数据,本文写的对新浪微博的数据采集和处理完整代码在我的Github。

    玩微博的人大多数应该知道微博搞笑排行榜的,刚好写这篇文之前看到榜姐1月8号0点话题是一人说一个,追女孩的小道理,感觉这个话题简直是对广大单身男性的福利啊,ヾ(✿゚゚)ノ,故有了何不就采集一下评论来分析一波的想法。

    1.使用新浪微博提供的API对数据进行采集

    作为一个爬虫菜鸟来说,如果不会使用代理IP池,同时对网站的反爬机制不太清楚,建议先去看下网站是否自己提供的有API,今天我们要爬取的网站是新浪微博,当然新浪网作为为全球用户24小时提供全面及时的中文资讯的大网站,一定是提供自己的API接口的。这样的大网站,必定是经历了无数场爬虫与反爬之间的战争,也一定有很健全的反爬策略,所以我们可以通过调用新浪微博的开放平台来获取我们想要的信息。使用之前请详细阅读API文档,在开放平台认证为开发者,附App key链接。

    • APIClient下载地址

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># 如果这里引入失败,可以直接下载SDK和文件放一块就ok
    from weibo import APIClient
    import webbrowser
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    APP_KEY = '你的App Key ' # 获取的app key
    APP_SECRET = '你的AppSecret' # 获取的appsecret
    CALLBACK_URL = 'https://api.weibo.com/oauth2/default.html' #回调链接

    在网站设置"使用微博账号登陆"的链接,当用户点击链接后,引导用户跳转至如下地址

    client = APIClient(app_key=APP_KEY, app_secret=APP_SECRET, redirect_uri=CALLBACK_URL)

    得到授权页面的url,利用webbrowser打开这个url

    url = client.get_authorize_url()
    webbrowser.open_new(url) #打开默认浏览器获取code参数

    获取URL参数code:

    print '输入url中code后面的内容后按回车键:'
    code = raw_input() # 人工输入网址后面的code内容
    r = client.request_access_token(code) # 获得用户授权
    access_token = r.access_token # 新浪返回的token,类似abc123xyz456
    expires_in = r.expires_in

    设置得到的access_token,client可以直接调用API了

    client.set_access_token(access_token, expires_in)
    </pre>

    获取某个用户最新发表的微博列表

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873136" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    uid 的获取方式,我们点开不同的微博,会发现链接中https://m.weibo.cn/u/2706896955?sudaref=login.sina.com.cn&display=0&retcode=6102 u之后的数字就是用户的uid。

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">content = client.statuses.user_timeline(uid=2706896955, count=100)
    </pre>

    返回的结果是json格式的

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">{
    "statuses": [
    {
    "created_at": "Tue May 31 17:46:55 +0800 2011",
    "id": 11488058246,
    "text": "求关注。",
    "source": "<a href="http://weibo.com" rel="nofollow">新浪微博</a>",
    "favorited": false,
    "truncated": false,
    "in_reply_to_status_id": "",
    "in_reply_to_user_id": "",
    "in_reply_to_screen_name": "",
    "geo": null,
    "mid": "5612814510546515491",
    "reposts_count": 8,
    "comments_count": 9,
    "annotations": [],
    "user": {
    "id": 1404376560,
    "screen_name": "zaku",
    "name": "zaku",
    "province": "11",
    "city": "5",
    "location": "北京 朝阳区",
    "description": "人生五十年,乃如梦如幻;有生斯有死,壮士复何憾。",
    "url": "http://blog.sina.com.cn/zaku",
    "profile_image_url": "http://tp1.sinaimg.cn/1404376560/50/0/1",
    "domain": "zaku",
    "gender": "m",
    "followers_count": 1204,
    "friends_count": 447,
    "statuses_count": 2908,
    "favourites_count": 0,
    "created_at": "Fri Aug 28 00:00:00 +0800 2009",
    "following": false,
    "allow_all_act_msg": false,
    "remark": "",
    "geo_enabled": true,
    "verified": false,
    "allow_all_comment": true,
    "avatar_large": "http://tp1.sinaimg.cn/1404376560/180/0/1",
    "verified_reason": "",
    "follow_me": false,
    "online_status": 0,
    "bi_followers_count": 215
    }
    },
    ...
    ],
    "previous_cursor": 0, // 暂未支持
    "next_cursor": 11488013766, // 暂未支持
    "total_number": 81655
    }
    </pre>

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873212" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    返回的字段说明

    假设我们想要查看的是微博信息内容调用text即可

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">for info in content.comments:
    text = info.text
    </pre>

    2.新浪微博爬虫

    chrome浏览器右键检查查看network这些老套路我就不说了,不懂得可以翻Python网络爬虫(一)- 入门基础 从头开始看。

    另外:代码是针对新浪微博移动端 https://m.weibo.cn/

    进行信息采集,之所以爬移动端而不是PC所有社交网站爬虫,优先选择爬移动版(不要来问我为什么好爬,我也不知道 逃

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873224" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    点来链接https://m.weibo.cn/single/rcListformat=cards&id=4193705642468999&type=comment&hot=0&page=2即为返回的json格式的数据

    接下来直接上代码

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import re
    import time
    import requests
    uid = '4193705642468999'
    url = 'https://m.weibo.cn/single/rcList?format=cards&id=' + uid + '&type=comment&hot=0&page={}'
    headers = {
    "Accept": "application/json, text/javascript, /; q=0.01",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Cookie": "你的cookie",
    "Host": "m.weibo.cn",
    "Referer": "https://m.weibo.cn/status/" + uid,
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Mobile Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    }
    i = 0
    comment_num = 1 # 第几条评论
    while True:
    res = requests.get(url=url.format(i), headers=headers)
    r = res.json()
    content = r[0]['card_group']
    if r.status_code == 200:
    try:
    for j in range(0, len(content)):
    hot_data = content[j]
    comment_id = hot_data['user']['id'] # 用户id
    user_name = hot_data['user']['screen_name'] # 用户名
    created_at = hot_data['created_at'] # 评论时间
    comment = re.sub('<.?>|回复<.?>:|[\U00010000-\U0010ffff]|[\uD800-\uDBFF][\uDC00-\uDFFF]', '', hot_data['text']) # 评论内容
    like_counts = hot_data['like_counts'] # 点赞数
    comment_num += 1
    i += 1
    time.sleep(3)
    except Exception as e:
    logger.debug(e)
    else:
    break
    </pre>

    接下来就是对数据的保存和处理了。

    注意:

    新浪毕竟是大厂,对爬虫肯定有自己的反爬策略,为了防止访问频繁被封禁,可以设置代理ip池,限制抓取时间等等。你问我怎么知道的,我才不会告诉你~

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873261" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    如果你出现了这个页面或者采集不到任何信息,恭喜你,被新浪宠幸了

    3.数据的存储和处理

    因为现在越来越多的公司开始逐渐使用PostgreSQL作为公司数据库,这里我们就把数据存储于Postgresql,为了使我们的整个项目更加工程化,我们把对数据库的操作单独定义方法。

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># 对数据库实现查询的方法
    def execute_select(conn, sql, params=None):
    with conn.cursor() as cur:
    if params:
    cur.execute(sql, params)
    else:
    cur.execute(sql)
    return cur.fetchall()

    对数据库实现增删改的方法

    def execute_sql(conn, sql, params=None):
    with conn.cursor() as cur:
    if params:
    cur.execute(sql, params)
    else:
    cur.execute(sql)
    </pre>

    大功告成了一半,运行代码 --> 保存数据库 接下来当然是对我们拿下的数据进行分(hu)析(shuo)展(ba)示(dao)了(千年不变的套路hhhhhh..)

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873292" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    这里我们可以看到数据已经成功存储与数据库

    4.数据的处理和分析

    既然说到对中文数据的处理和展示,我们常用的就几种方法,词云、情感分析、数据可视化展示,这里我就必须提到python中比较出名的一个中文NLP库:snowNLP,snowNLP能够根据给出的句子生成一个0-1之间的值,当值大于0.5时代表句子的情感极性偏向积极,当分值小于0.5时,情感极性偏向消极,越偏向两头,情感就越敏感。使用一个库最简单暴力的方法———读官方文档。

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873299" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    snownlp的使用也很简单

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873304" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873310" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    我随机抽取了两张结果,简单标注了一下,我们不难发现涉及到主动、长得帅、有钱的、要勇敢、口红、情商这几个词生成的值都在0.9,矮矬穷、渣、你他妈这些词生成的值都在0.5以下,林佳,给我留一口啊!是什么鬼,竟然0.7???

    • 虽然数据量大(其实是没有剔除停用词ヾ(✿゚゚)ノ)导致的词云图效果不太好,但是我们还是可以看到聊天、主动、好看这几个词的词频较高,至于为什么我不剔除停用词,是因为没有语料库还是因为不会用,都不是,因为我懒,我懒,我懒… 剔除停用词的教程之前写的文章中有:Python数据科学(三)- python与数据科学应用(Ⅲ)

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">def word_cloud(comment):
    logger.info('制作词云图...word_cloud')
    comment_text = ''
    back_coloring = imread("static/heart.jpg")
    cloud = WordCloud(font_path='static/simhei.ttf',
    background_color="white", # 背景颜色
    max_words=2000,
    mask=back_coloring,
    max_font_size=100,
    width=1000, height=860, margin=2,
    random_state=42,
    )
    for li in comment:
    comment_text += ' '.join(jieba.cut(li, cut_all=False))
    wc = cloud.generate(comment_text)
    image_colors = ImageColorGenerator(back_coloring)
    plt.figure("wordc")
    plt.imshow(wc.recolor(color_func=image_colors))
    wc.to_file('微博评论词云图.png')
    </pre>

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873355" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    • 对处理过得情感值列表进行统计,并生成分布图,采集的评论大概有5w条

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">def snow_analysis(comment):
    logger.info('自然语言处理NLP...snow_analysis')
    sentimentslist = []
    for li in comment:
    s = SnowNLP(li)
    # logger.debug(li)
    # logger.debug(li, s.sentiments)
    print(li, s.sentiments)
    sentimentslist.append(s.sentiments)
    fig1 = plt.figure("sentiment")
    plt.hist(sentimentslist, bins=np.arange(0, 1, 0.02))
    plt.show()
    </pre>

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873381" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    微博 一个人说一个,追女孩的小道理 评论的情感值分布

    可以看到情感值在接近0.6~1.0左右位置频率较高,说明粉丝们对于这则微博的评论积极态度占大多数,因为这个微博本身就是偏积极性的,得出的结果也说明了这个问题。

    我们的初衷是为了如何追女孩子,我就统计了一下出现比较多的评论(有博主为了抢热门频繁刷评论?),三行代码就可以搞定,这个Counter的用法之前也写过,传送门:使用python中的第三方库Counter

    <pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;"># 使用python的第三方库
    from collections import Counter
    userdict = Counter(comment_list)
    print(userdict.most_common(8))
    </pre>

    <bi style="box-sizing: border-box; display: block;">1.一定要主动啊 不然等女孩子主动吗!但是主动也要适度 别让对方觉得害怕…</bi><bi style="box-sizing: border-box; display: block;">2.人品要好,三观要正确,责任感,孝顺善良这些内在因素也很重要</bi><bi style="box-sizing: border-box; display: block;">3.追某个女孩时 只追她一个人 千万别撩别人</bi><bi style="box-sizing: border-box; display: block;">4.言谈幽默风趣但不要轻佻</bi><bi style="box-sizing: border-box; display: block;">5.对她当女儿养吧</bi><bi style="box-sizing: border-box; display: block;">6.女孩子是要用来宠的,不是来跟她讲道理的。</bi><bi style="box-sizing: border-box; display: block;">7.多陪她聊天,多关心她,爱护她,保护她,了解她,宠她,尊重她,给她安全感</bi><bi style="box-sizing: border-box; display: block;">8.不要暧昧不清,不要套路</bi>

    文末彩蛋:

    有很多男生抱怨自己追不到喜欢的姑娘,追了几个星期就放弃了。其实,要改变的是你自己,只要努力向上,让自己变得更优秀,同时对姑娘保持适当的关心和热情,坚持几个月,总有一天你就会发现,不喜欢就是不喜欢这是没有办法的事情。

    <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1555314873411" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image

    <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>

    如果你在学习Python的过程当中有遇见任何问题,可以加入我的python交流学习qq群:683380553,多多交流问题,互帮互助,群里有不错的学习教程和开发工具。学习python有任何问题(学习方法,学习效率,如何就业),可以随时来咨询我,如果你准备学习大数据,也欢迎加入大数据学习交流qq群683380553,每天与大家分享学习资源哦。

    相关文章

      网友评论

        本文标题:Python采集微博热评进行情感分析祝你猪年脱单!

        本文链接:https://www.haomeiwen.com/subject/zfouwqtx.html