Scraping and Visualizing Douban Movie Reviews

Author: Demafic | Source: published 2019-01-20 20:59, read 3 times

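Lately, 《白蛇：缘起》 (White Snake) seems to be yet another well-received Chinese animated film, so I decided to scrape its short comments on Douban. I only know a little Python, so criticism is welcome, and if you think this is OK, feel free to give it a like. Finger hearts~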

Scraping the short comments

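First, the target URL: https://movie.douban.com/subject/30331149/comments?start=0&limit=20&sort=new_score&status=P
Click "next page" and watch the URL: it is easy to see that only the number after start changes, increasing by 20 for each page.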
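To make the pagination pattern concrete, here is a minimal sketch that builds the URLs of the first few pages. The helper name `page_urls` and its `page_size` parameter are assumptions made for this illustration; only the URL itself comes from the page being scraped.

```python
# Sketch: generate paginated comment URLs by stepping `start` in increments of 20.
# `page_urls` is a hypothetical helper, not part of any library.
BASE_URL = ('https://movie.douban.com/subject/30331149/comments'
            '?start={0}&limit=20&sort=new_score&status=P')

def page_urls(n_pages, page_size=20):
    """Return the URLs of the first n_pages pages."""
    return [BASE_URL.format(i * page_size) for i in range(n_pages)]

for u in page_urls(3):
    print(u)
```

The three printed URLs differ only in start=0, start=20 and start=40.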
Right-click a comment and choose Inspect, and you can write down the XPath for the comment text:

//p/span[@class="short"]/text()

With that, the scraping code is simple to write:

import requests
from lxml import etree
import time

# URL template; {0} is filled with the start offset (0, 20, 40, ...) for each page
url = 'https://movie.douban.com/subject/30331149/comments?start={0}&limit=20&sort=new_score&status=P'
# Note: an 'http' entry only applies to http:// URLs; add an 'https' key to proxy https requests
ipAgent = {
        'http':'http://221.214.181.98:53281'
    }
headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
    }

def get_comment(url, page):
    # Fill the offset into the template, fetch the page, and extract the short comments
    txt = requests.get(url.format(page), headers=headers, proxies=ipAgent).text
    html = etree.HTML(txt)
    comment = html.xpath('//p/span[@class="short"]/text()')
    return comment

comments = []
for page in range(5):
    page = page*20  # start offset of this page
    comment = get_comment(url, page)
    comments.extend(comment)
    time.sleep(2)  # pause between requests
with open('douban.txt','w') as f:
    f.write(str(comments))

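Remember to send headers, and don't crawl too frequently, or you will trigger Douban's anti-scraping mechanism, as happened to me.
You can, however, switch to another IP and keep crawling.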
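One way to keep the request rate low is to wait a slightly randomized time between pages. The sketch below is only an illustration under my own assumptions: `polite_delay` and `crawl` are hypothetical helpers, the 2-second base and 1-second jitter are arbitrary values, and nothing here is known to satisfy Douban's actual limits.

```python
import random
import time

def polite_delay(base=2.0, jitter=1.0):
    """Return a wait time in seconds: the base delay plus a random extra of up to `jitter`."""
    return base + random.uniform(0, jitter)

def crawl(pages, fetch, base=2.0, jitter=1.0, sleep=time.sleep):
    """Call fetch(page) for each page, pausing between consecutive requests."""
    results = []
    for i, page in enumerate(pages):
        if i > 0:  # no need to wait before the very first request
            sleep(polite_delay(base, jitter))
        results.append(fetch(page))
    return results
```

Here `fetch` would be whatever function downloads one page, for example a small wrapper around get_comment. The `sleep` parameter exists only so the pause can be replaced when trying the function out.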

Data cleaning

First, read the comments back in:

# Read the saved comments; the with block closes the file automatically
with open('douban.txt','r') as data:
    comments = data.read()

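At this point the comments contain quite a lot of punctuation, which we don't need for word-frequency counting, so it has to be removed. The method is a regular expression that keeps only runs of Chinese characters (which also drops digits and Latin letters):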

import re
pattern = re.compile(r'[\u4e00-\u9fa5]+')
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)
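To see what this pattern keeps and discards, here is a tiny sketch run on a made-up string (the sample text is invented for illustration and mimics the list-style text saved in douban.txt):

```python
import re

# Same pattern as above: one or more consecutive characters in the range U+4E00..U+9FA5
pattern = re.compile(r'[\u4e00-\u9fa5]+')

sample = "['画面很美，剧情一般！', '10/10 would watch again']"  # made-up text
pieces = pattern.findall(sample)
cleaned = ''.join(pieces)
print(pieces)   # ['画面很美', '剧情一般']
print(cleaned)  # 画面很美剧情一般
```

Because the pieces are joined without separators, text from neighbouring comments ends up side by side, so the segmenter in the next step may occasionally form a word across that boundary.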

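Since we want word frequencies, the Chinese text first has to be segmented into words. Here I use jieba. If it is not installed, run pip install jieba in a terminal. (Note: pip list shows whether these libraries are installed.) The code is as follows: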

import jieba    # word segmentation package
import pandas as pd

segment = jieba.lcut(cleaned_comments)
# lcut returns the segmentation result as a list
words_df = pd.DataFrame({'segment': segment})

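We also know that many words in the comments carry no meaning, for example function words such as “的” and “太”, so they should be removed.
You can search for stopwords.txt to download a stop-word list, then compare our data against it (the code below assumes the list is saved as chineseStopWords.txt, GBK-encoded, one word per line). The code is as follows: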

stopwords = pd.read_csv('chineseStopWords.txt',encoding='gbk',sep='\t',names=['stopword'])
# names sets the column names of the result; sep specifies the delimiter
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
# isin simply means "is in", and ~ negates the result
# ~words_df.segment.isin(stopwords.stopword) selects the words in words_df.segment that are not in stopwords.stopword
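For readers who find the ~isin expression opaque, the same filtering idea can be sketched in plain Python. The token list and the three-word stop set below are made up for illustration; a real stop-word file is far longer.

```python
# Made-up tokens and a tiny hypothetical stop-word set, for illustration only
tokens = ['画面', '的', '很', '美', '太', '好看', '的']
stop_set = {'的', '太', '很'}

# Keep a token only if it is not a stop word (the plain-Python counterpart of ~isin)
kept = [t for t in tokens if t not in stop_set]
print(kept)  # ['画面', '美', '好看']
```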

Word-frequency statistics

Next comes the word-frequency count. The code is as follows:

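# Older pandas accepted agg({"计数": numpy.size}); pandas 1.0 and later reject that
# dict form on a single column, so named aggregation is used instead.
words_stat = words_df.groupby(by=['segment'])['segment'].agg(计数='size')  # "计数" means "count"
words_stat = words_stat.reset_index().sort_values(by=["计数"], ascending=False)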

For how groupby and agg work, see my post python学习——Pandas2.
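For comparison, the same counting idea can be sketched with the standard library's collections.Counter instead of pandas. This is a different technique from the groupby call above, shown only to make the result easier to picture; the token list is made up.

```python
from collections import Counter

tokens = ['好看', '画面', '好看', '剧情', '画面', '好看']  # made-up tokens
word_counts = Counter(tokens)

# most_common(n) returns the n most frequent words with their counts, highest first
top = word_counts.most_common(2)
print(top)  # [('好看', 3), ('画面', 2)]
```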

Displaying a word cloud

Straight to the code:

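import matplotlib.pyplot as plt
import matplotlib
from wordcloud import WordCloud  # word cloud package
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)  # font file, background color, and maximum font size
# simhei.ttf is the SimHei (黑体) font, needed to render Chinese characters; it can be downloaded online
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}  # the .values attribute returns the table as an array of rows
# each row has the form [word, count], e.g. [[x, y], [a, b], ..., [y, z]]
# so the comprehension turns the rows into a {word: count} dictionary
wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)
plt.axis("off")  # hide the axes around the image
plt.show()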

Result

[Image: 词云.png (the resulting word cloud)]
