Analyzing 180,000 Reviews of 《八佰》 (The Eight Hundred) with Python: What Did Audiences Say?

Author: python草莓 | Published: 2020-09-05 22:09

    Straight to the practical content. Message me if you would like more hands-on web-scraping project material.

    Data acquisition

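    import json

    def parse_page(html):
        """Parse one page of the JSON response into a list of comment dicts."""
        try:
            data = json.loads(html)['cmts']  # parse the JSON string into Python objects
            comments = []
            for item in data:
                comment = {
                    'id': item['id'],
                    'nickName': item['nickName'],
                    'cityName': item['cityName'] if 'cityName' in item else '',  # cityName may be missing
                    'content': item['content'].replace('\n', ' '),  # flatten every line break in the review text
                    'score': item['score'],
                    'startTime': item['startTime']
                }
                comments.append(comment)
            return comments
        except Exception as e:
            # Malformed or unexpected response: report it and return no comments
            print(f'parse_page failed: {e}')
            return []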
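
The article does not show how the parsed comments end up in `comments.txt`. The following is a minimal sketch of one way to do it, assuming each comment is stored as one comma-separated line in the field order used by the cleaning step. `save_comments`, `FIELDS`, and the sample payload are hypothetical and not part of the original code.

```python
import json

# Field order assumed to match the six columns named in the cleaning step
FIELDS = ['id', 'nickName', 'cityName', 'content', 'score', 'startTime']

def parse_page(html):
    """Same idea as parse_page above: pull the comment fields out of a JSON string."""
    comments = []
    for item in json.loads(html).get('cmts', []):
        comments.append({
            'id': item['id'],
            'nickName': item['nickName'],
            'cityName': item.get('cityName', ''),
            'content': item['content'].replace('\n', ' '),
            'score': item['score'],
            'startTime': item['startTime'],
        })
    return comments

def save_comments(comments, path='comments.txt'):
    """Hypothetical helper: append one comma-separated line per comment."""
    with open(path, 'a', encoding='utf-8') as f:
        for c in comments:
            f.write(','.join(str(c[k]) for k in FIELDS) + '\n')

# Made-up payload shaped like the fields parse_page expects
sample = json.dumps({'cmts': [
    {'id': 1, 'nickName': 'a', 'content': 'line1\nline2', 'score': 5,
     'startTime': '2020-08-21 10:00:00'},
]})
rows = parse_page(sample)
```

Calling `save_comments(rows)` after each parsed page would build up the text file one page at a time.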
    

    Data cleaning

    Load the review data

    import pandas as pd
    import numpy as np
    data=[]
    with open('comments.txt', 'r',encoding='utf-8-sig') as f_input:
        for line in f_input:
            data.append(list(line.strip().split(',')))
    data
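
Splitting on every comma shifts the columns whenever the review text itself contains a comma. Here is a hedged sketch of a more careful split, assuming the line layout is `id,nickName,cityName,content,score,startTime` and that only the content field can contain commas. `split_comment_line` is a hypothetical helper and the sample line is made up.

```python
def split_comment_line(line):
    """Split one line into 6 fields, keeping commas inside the review text.

    Assumes the layout id,nickName,cityName,content,score,startTime and that
    only the content field can contain commas.
    """
    head = line.strip().split(',', 3)      # id, nickName, cityName, rest
    if len(head) < 4:
        return None                        # malformed line
    tail = head[3].rsplit(',', 2)          # content, score, startTime
    if len(tail) < 3:
        return None                        # malformed line
    return head[:3] + tail

# Made-up line whose review text contains two commas
line = '1,viewer,成都,好看,震撼,值得二刷,5,2020-08-21 10:00:00'
fields = split_comment_line(line)
```

Lines for which the helper returns `None` can be skipped before building the DataFrame.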
    

    Convert to a DataFrame and name the columns

    df = pd.DataFrame(data).iloc[:, 0:6]
    df.columns = ['观众ID','观众昵称','城市','评论内容','评分','评论时间']
    

    Drop duplicate records and missing values

    df = df.drop_duplicates()
    df = df.dropna()
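
Because every field was read as a string, a row whose columns were shifted or whose score is not a number survives `dropna()`. The sketch below adds one possible extra check, converting `评分` to a numeric type and dropping rows that fail to parse. The sample rows are made up for illustration.

```python
import pandas as pd

columns = ['观众ID', '观众昵称', '城市', '评论内容', '评分', '评论时间']
df = pd.DataFrame(
    [['1', 'a', '成都', '好看', '5', '2020-08-21 10:00:00'],
     ['1', 'a', '成都', '好看', '5', '2020-08-21 10:00:00'],   # exact duplicate
     ['2', 'b', '', '一般', '3.5', '2020-08-21 11:00:00'],     # empty city is kept
     ['3', 'c', '北京', '震撼', None, None],                    # short line
     ['4', 'd', '上海', '剧情', '不错', '5']],                   # shifted columns
    columns=columns)

df = df.drop_duplicates()
df = df.dropna()
# Non-numeric scores become NaN and are then dropped
df['评分'] = pd.to_numeric(df['评分'], errors='coerce')
df = df.dropna(subset=['评分'])
```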
    

    Preview and save

    df.sample(5)
    df.to_csv("八佰.csv",index=False,encoding="utf_8_sig")
    

    Data visualization

    Import the libraries

    import jieba
    import re
    import matplotlib.pyplot as plt
    from pyecharts.charts import *
    from pyecharts import options as opts
    from pyecharts.globals import ThemeType
    import stylecloud
    from IPython.display import Image
    

    Word cloud of all reviews

    data = pd.read_csv("八佰.csv")
    data['评论内容'] = data['评论内容'].astype('str')
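    # Tokenization helper
    def get_cut_words(content_series):
        # Load the stop-word list
        stop_words = []
        
        with open("stop_words.txt", 'r', encoding='utf-8') as f:
            lines = f.readlines()
            for line in lines:
                stop_words.append(line.strip())
    
        # Register custom keywords with jieba
        my_words = ['', '']
        
        for i in my_words:
            jieba.add_word(i)
    
        # Extra stop words specific to this dataset
        my_stop_words = ['电影', '中国','一部']
        stop_words.extend(my_stop_words)
        stop_words = set(stop_words)  # set membership is much faster than a list
    
        # Tokenize all reviews as one long string
        word_num = jieba.lcut(content_series.str.cat(sep='。'), cut_all=False)
    
        # Keep tokens that are not stop words and have at least 2 characters
        word_num_selected = [i for i in word_num if i not in stop_words and len(i)>=2]
        
        return word_num_selected
    
    # Draw the word cloud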
    text1 = get_cut_words(content_series=data['评论内容'])
    stylecloud.gen_stylecloud(text=' '.join(text1), max_words=500,
                              collocations=False,
                              font_path='字酷堂清楷体.ttf',
                              icon_name='fas fa-square',
                              size=653,
                              #palette='matplotlib.Inferno_9',
                              output_name='./1.png')
    Image(filename='./1.png')
    

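    We tokenized the 180,000 reviews and drew a word cloud from the 500 most frequent words. Audiences responded to this war film with strong emotion: besides words of praise such as 好看 (good) and 不错 (not bad), the cloud is dominated by 震撼 (stunning), 感人 (moving), 历史 (history), and 勿忘国耻 (never forget national humiliation), words that carry strong national sentiment.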
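
To check which words dominate without reading them off the image, the token list can be counted directly. This is a minimal sketch using `collections.Counter`; the `tokens` list is a made-up stand-in for the output of `get_cut_words`.

```python
from collections import Counter

# Stand-in for the token list returned by get_cut_words
tokens = ['震撼', '好看', '历史', '震撼', '感人', '好看', '震撼']

# (word, count) pairs, most frequent first
top_words = Counter(tokens).most_common(3)
for word, count in top_words:
    print(word, count)
```

With the real data, `most_common(500)` would give the same 500 words that the word cloud is limited to by `max_words=500`.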

    Distribution of review types

    data['评论类型'] = pd.cut(data['评分'],[0,3,4,6],labels=['差评','中评','好评'],right=False)
    df1 = data.groupby('评论类型')['评论内容'].count()
    df1 = df1.sort_values(ascending=False)
    regions = df1.index.to_list()
    values = df1.to_list()
    c = (
            Pie(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
            .add("", [list(z) for z in zip(regions,values)],radius=["40%", "70%"])
            .set_global_opts(title_opts=opts.TitleOpts(title="评论类型占比",subtitle="数据来源:猫眼电影",pos_top="0.5%",pos_left = 'center'))
            .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%",font_size=18))
        )
    c.render_notebook()
    
    With a positive-review rate above 90%, the box office of more than 2 billion yuan is no accident.
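
The binning behind these shares uses `pd.cut` with `right=False`, so the intervals are [0, 3), [3, 4), and [4, 6). The small sketch below shows where boundary scores land and how a share can be computed. The scores are made up for illustration.

```python
import pandas as pd

# Made-up scores covering the bin boundaries
scores = pd.Series([0.5, 2.5, 3, 3.5, 4, 4.5, 5, 5])

# right=False makes each bin closed on the left: 3 is 中评, 4 is 好评
labels = pd.cut(scores, [0, 3, 4, 6], labels=['差评', '中评', '好评'], right=False)

# Fraction of reviews in each category
share = labels.value_counts(normalize=True)
print(share)
```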

    Sampling negative reviews

    Negative reviews are few, but they concentrate on criticizing the ending of 《八佰》.
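
The article shows only a screenshot for this step. One way to draw such a sample, assuming the `评论类型` column created earlier, might look like the sketch below. The review texts are invented for illustration.

```python
import pandas as pd

# Made-up reviews standing in for the real data
data = pd.DataFrame({
    '评论内容': ['结局太仓促', '好看', '震撼', '结尾没看懂'],
    '评分': [1, 5, 4.5, 2],
})
data['评论类型'] = pd.cut(data['评分'], [0, 3, 4, 6],
                        labels=['差评', '中评', '好评'], right=False)

# Keep only negative reviews, then draw up to 5 of them reproducibly
negative = data[data['评论类型'] == '差评']
sample = negative.sample(n=min(5, len(negative)), random_state=0)
print(sample[['评分', '评论内容']])
```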

    Top 10 cities by number of reviews

    df2 = data.groupby('城市')['评分'].count() # group by city and count the reviews
    df2 = df2.sort_values(ascending=False)[:10]
    # print(df2)
    bar = Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
    bar.add_xaxis(df2.index.to_list())
    bar.add_yaxis("",df2.to_list()) # review count for each city
    bar.set_global_opts(title_opts=opts.TitleOpts(title="城市影评数量TOP10",subtitle="数据来源:猫眼电影",pos_top="2%",pos_left = 'center'),
                       xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=16)), # x-axis label font size
                       yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=16)), # y-axis label font size
                       )
    bar.set_series_opts(label_opts=opts.LabelOpts(font_size=16,position='top'))
    bar.render_notebook()
    
    Viewers in Chengdu were the most active reviewers of 《八佰》, ahead of big cities such as Beijing, Shanghai, Guangzhou, and Shenzhen.
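
The same ranking can be produced in one step with `value_counts`, which counts and sorts in descending order and skips missing cities by default. This is a sketch on made-up data, not the article's own code.

```python
import pandas as pd

# Made-up city column; None stands for a review without a city
data = pd.DataFrame({'城市': ['成都', '北京', '成都', None, '上海', '成都', '北京', None]})

# Count reviews per city, most active first, and keep the top 10
top = data['城市'].value_counts().head(10)
print(top)
```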

    Mentions of cast members

    result = []
    for i in data['评论内容']:
        # Split each review into clauses on Chinese and Western punctuation
        result.append(re.split(r'[::,,.。!!~·`;;……、]',i))
        
    def actor_comment(data,result):
        # One 0/1 column per actor/character; 1 means the review mentions either name
        actors = pd.DataFrame(np.zeros((len(data), 5)),
                          columns = ['欧豪/端午','王千源/羊拐','姜武/老铁','张译/老算盘','张俊一/小湖北'])
        for i in range(len(result)):
            words = result[i]
            for word in words:
                # .at writes into the frame directly; chained iloc[i][col] may write to a copy
                if '端午' in word or '欧豪' in word:
                    actors.at[i, '欧豪/端午'] = 1
                if '羊拐' in word or '王千源' in word:
                    actors.at[i, '王千源/羊拐'] = 1
                if '老铁' in word or '姜武' in word:
                    actors.at[i, '姜武/老铁'] = 1
                if '老算盘' in word or '张译' in word:
                    actors.at[i, '张译/老算盘'] = 1
                if '小湖北' in word or '张俊一' in word:
                    actors.at[i, '张俊一/小湖北'] = 1
        final_result = pd.concat([data,actors],axis = 1)
        return final_result
    
    df3 = actor_comment(data,result)
    df3.sample(5)
    df4 = df3.iloc[:,7:].sum().reset_index().sort_values(0,ascending = False)
    df4.columns = ['角色','次数']
    df4['占比'] = df4['次数'] / df4['次数'].sum()
    
    c = (
        Bar(init_opts=opts.InitOpts(theme=ThemeType.CHALK))
        .add_xaxis(df4['角色'].to_list())
        .add_yaxis("",df4['次数'].to_list())
        .set_global_opts(title_opts=opts.TitleOpts(title="主演及其饰演的角色被提及次数",subtitle="数据来源:猫眼电影",pos_top="2%",pos_left = 'center'),
                           xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=16)), # x-axis label font size
                           yaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(font_size=16)), # y-axis label font size
                           )
        .set_series_opts(label_opts=opts.LabelOpts(font_size=16,position='top'))
        )
    c.render_notebook()
    
    欧豪 and his character 端午 are mentioned most often in the reviews. Is that down to acting or looks?
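
On 180,000 rows the nested Python loop is slow. Since a name found in any clause is also found in the full review, the same 0/1 flags can be computed with vectorized `str.contains`. This is a sketch of that alternative on made-up reviews; the `patterns` dict is a hypothetical name introduced here.

```python
import pandas as pd

# Made-up reviews standing in for the real data
data = pd.DataFrame({'评论内容': ['欧豪演得好', '老算盘和羊拐都很真实', '震撼', '端午最后长大了']})

# Column name -> regex matching either the actor or the character
patterns = {
    '欧豪/端午': '欧豪|端午',
    '王千源/羊拐': '王千源|羊拐',
    '姜武/老铁': '姜武|老铁',
    '张译/老算盘': '张译|老算盘',
    '张俊一/小湖北': '张俊一|小湖北',
}
for column, pattern in patterns.items():
    # 1 if the review mentions either name, else 0
    data[column] = data['评论内容'].str.contains(pattern, regex=True).astype(int)

# Number of reviews mentioning each actor/character
counts = data[list(patterns)].sum().sort_values(ascending=False)
print(counts)
```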

    Reviews about 端午 (Duanwu)

    ouhao = df3.loc[df3['欧豪/端午'] == 1,]
    text = get_cut_words(content_series=ouhao['评论内容'])
    stylecloud.gen_stylecloud(text=' '.join(text), max_words=500,
                              collocations=False,
                              font_path='字酷堂清楷体.ttf',
                              icon_name='fas fa-camera',
                              #palette='matplotlib.Inferno_9',
                              size=653,
                              output_name='./ouhao.png')
    Image(filename='./ouhao.png')
    

    Reviews about 老算盘 (Lao Suanpan)

    zhangyi = df3.loc[df3['张译/老算盘'] == 1,]
    text = get_cut_words(content_series=zhangyi['评论内容'])
    stylecloud.gen_stylecloud(text=' '.join(text), max_words=500,
                              collocations=False,
                              font_path='字酷堂清楷体.ttf',
                              icon_name='fas fa-video',
                              #palette='matplotlib.Inferno_9',
                              size=653,
                              output_name='./zhangyi.png')
    Image(filename='./zhangyi.png')
    

    Reviews about 羊拐 (Yangguai)

    wangqianyuan = df3.loc[df3['王千源/羊拐'] == 1,]
    text = get_cut_words(content_series=wangqianyuan['评论内容'])
    stylecloud.gen_stylecloud(text=' '.join(text), max_words=500,
                              collocations=False,
                              font_path='字酷堂清楷体.ttf',
                              icon_name='fas fa-thumbs-up',
                              #palette='matplotlib.Inferno_9',
                              size=653,
                              output_name='./wangqianyuan.png')
    Image(filename='./wangqianyuan.png')
    

    The content above is excerpted from "J哥".

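    For a systematic set of Python learning materials, follow the link below and learn along with everyone else:
    https://shimo.im/docs/QvG8JqxGKvcrXQhH/ 《python基础到进阶学习资料》 (copy the link and open it in the 石墨文档 app or mini program)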


        Article link: https://www.haomeiwen.com/subject/ozhgsktx.html