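Generating WordCloud Reports for Natural Language Processing in Python

Author: 拓端tecdat | Published: 2020-03-26 19:15

Original link: http://tecdat.cn/?p=8585

Learn how to use WordCloud in Python to perform exploratory data analysis for natural language processing.

What is a WordCloud?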

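You have probably seen a cloud filled with words of different sizes, where the size of each word represents its frequency or importance. This is called a tag cloud or word cloud. In this tutorial, you will learn how to create your own WordCloud in Python and customize it as you see fit.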
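The idea that word size tracks word frequency can be sketched with the standard library alone. This is only an illustration of the concept, not the algorithm the wordcloud library uses; the function name word_sizes, the size range, and the sample text are all made up for this example.

```python
from collections import Counter
import re

def word_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size that grows with its frequency (illustrative only)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    top = max(counts.values())
    # The most frequent word gets max_size; rarer words get proportionally smaller sizes.
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }

sample = "cherry oak cherry plum cherry oak"  # made-up text
sizes = word_sizes(sample)
for word, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {size:.1f}")
```

A real word cloud also has to place the words without overlap, which is the part the wordcloud library handles for you.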

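Prerequisites

The numpy library is one of the most popular and useful libraries for handling multi-dimensional arrays and matrices. It is also used together with the Pandas library to perform data analysis.

Installing wordcloud can be a little tricky. If you only need it to plot a basic wordcloud, then pip install wordcloud or conda install -c conda-forge wordcloud is sufficient. To install the latest version from source instead, run: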

git clone https://github.com/amueller/word_cloud.git
cd word_cloud
pip install .

Dataset:

First, load all the necessary libraries:

# Start with loading all necessary libraries
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt
%matplotlib inline

# Suppress library warnings
import warnings
warnings.filterwarnings("ignore")

Load the dataframe. Note that with index_col=0 the row names (index) are not read in as a separate column.

# Load in the dataframe
df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)

# Looking at first 5 rows of the dataset
df.head()
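To see what index_col=0 changes, here is a small sketch that reads the same CSV text twice. The CSV content is made up for illustration; it only mimics a file whose first, unnamed column holds the row labels.

```python
import io
import pandas as pd

# Made-up CSV whose first, unnamed column holds the row labels,
# mimicking a file written by DataFrame.to_csv().
csv_text = ",country,points\n0,Italy,87\n1,Portugal,87\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # the label column became the index
print(list(without_index.columns))  # the label column shows up as "Unnamed: 0"
```

Without index_col=0 the row labels come back as an extra data column, which is usually not what you want when the file was saved from a DataFrame.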

Print a few summary facts about the dataset:

print("There are {} observations and {} features in this dataset. \n".format(df.shape[0],df.shape[1]))
print("There are {} types of wine in this dataset such as {}... \n".format(len(df.variety.unique()), ", ".join(df.variety.unique()[0:5])))
print("There are {} countries producing wine in this dataset such as {}... \n".format(len(df.country.unique()), ", ".join(df.country.unique()[0:5])))

There are 129971 observations and 13 features in this dataset.
There are 708 types of wine in this dataset such as White Blend, Portuguese Red, Pinot Gris, Riesling, Pinot Noir...
There are 44 countries producing wine in this dataset such as Italy, Portugal, US, Spain, France...

df[["country", "description","points"]].head()

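   country   description                                         points
0  Italy     Aromas include tropical fruit, broom, brimston...   87
1  Portugal  This is ripe and fruity, a wine that is smooth...   87
2  US        Tart and snappy, the flavors of lime flesh and...   87
3  US        Pineapple rind, lemon pith and orange blossom ...   87
4  US        Much like the regular bottling from 2012, this...   87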

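Use groupby() to compute summary statistics.

With the wine dataset, you can group by country and look at the summary statistics of points and price for every country.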

This selects the top 5 highest average points among all 44 countries:

           points      price
country
England    91.581081   51.681159
India      90.222222   13.333333
Austria    90.101345   30.762772
Germany    89.851732   42.257547
Canada     89.369650   35.712598
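A grouped summary like the one above can be sketched on a tiny frame. The rows below are made up and only reuse the column names country, points, and price; this is one possible way to write the grouping, not necessarily the exact calls behind the table.

```python
import pandas as pd

# Made-up rows; the real dataset has the same column names but far more data.
df = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Italy"],
    "points": [90, 86, 92, 88, 87],
    "price": [30.0, 20.0, 50.0, 40.0, 25.0],
})

country = df.groupby("country")

# Summary statistics (count, mean, std, quartiles, ...) for each numeric column
print(country[["points", "price"]].describe())

# Mean points and price per country, highest average score first
top = (
    country[["points", "price"]]
    .mean()
    .sort_values(by="points", ascending=False)
    .head(5)
)
print(top)
```

Selecting the numeric columns before calling mean() avoids errors from text columns such as the review description.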

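You can plot the number of wines by country using the plot method of a Pandas DataFrame together with Matplotlib.

# Bar chart of the number of wines per country, largest first
df.groupby("country").size().sort_values(ascending=False).plot.bar()
plt.ylabel("Number of Wines")
plt.show()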

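Among the 44 wine-producing countries, the US has more than 50,000 wines in the wine review dataset, more than twice as many as the next country in the ranking: France, a country famous for its wine. Italy also produces a lot of quality wine, with nearly 20,000 wines available for review.

Does quantity outweigh quality?

Now, look at the plot of all 44 countries ranked by their highest-rated wine: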

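# Bar chart of the highest score per country, largest first
df.groupby("country")["points"].max().sort_values(ascending=False).plot.bar()
plt.ylabel("Highest point of Wines")
plt.show()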

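Australia, the US, Portugal, Italy, and France all have 100-point wines. Notice that Portugal ranks 5th and Australia ranks 9th by number of wines in the dataset, and both countries have fewer than 8,000 wines.

Setting up a basic WordCloud

Before using any function, the first thing you may want to do is check its docstring and see all required and optional arguments. To do so, type ?function and run it to get all the information.

You can see that the only input a WordCloud needs is the text, which is passed to generate(); all constructor arguments are optional.

So let's start with a simple example: using the description of the first observation as the input for the wordcloud. The three steps are:

Extract the review (text document)

Create and generate a wordcloud image

Display the cloud using matplotlib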

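# Start with one review:
text = df.description[0]

# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()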

You can see that the first review talks a lot about the aromas of the wine.

Now, change some optional arguments of the WordCloud, such as max_font_size, max_words, and background_color.

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

If you want to save the image, WordCloud provides the to_file function:

# Save the image in the img folder:
wordcloud.to_file("img/first_review.png")

When you load the saved image back in, the result looks like this:

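So now you will combine all wine reviews into one big text and create a big fat cloud to see which characteristics are most common in these wines.

# len(text) counts characters, not words
text = " ".join(review for review in df.description)
print("There are {} characters in the combination of all reviews.".format(len(text)))

There are 31661073 characters in the combination of all reviews.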

# Display the generated image:
# the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

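It seems that black cherry and full-bodied are the most mentioned characteristics, and Cabernet Sauvignon is the most popular variety. This fits the fact that Cabernet Sauvignon is one of the world's most widely recognized red wine grape varieties.

Now, let's pour these words into a cup of wine!

To create a shape for your wordcloud, you first need to find a PNG file to use as the mask. Below is a nice one that can be found on the Internet: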

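To make sure the mask works, let's look at it in numpy array form:

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

WordCloud treats pure white pixels (value 255) as the area to leave empty, so the background of the mask must be 255 rather than 0. First, use the transform_format() function to swap the number 0 with 255.

def transform_format(val):
    if val == 0:
        return 255
    else:
        return val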

Then, create a new mask with the same shape as the existing mask and apply the transform_format() function to every value in every row of the previous mask.
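The row-by-row step just described can be sketched as follows. The small array is made up and stands in for the mask loaded from the PNG file, and the names mask and transformed_mask are hypothetical; the np.where line is an optional vectorized alternative that gives the same values.

```python
import numpy as np

def transform_format(val):
    # Same rule as in the text: 0 becomes 255, everything else is unchanged
    return 255 if val == 0 else val

# Made-up 3x4 mask standing in for the image loaded from the PNG file
mask = np.array([[0, 0, 7, 0],
                 [0, 9, 9, 0],
                 [0, 0, 0, 0]], dtype=np.uint8)

# New array with the same shape as the old mask, filled row by row
transformed_mask = np.ndarray(mask.shape, np.int32)
for i in range(len(mask)):
    transformed_mask[i] = list(map(transform_format, mask[i]))

# Vectorized alternative producing the same values
vectorized = np.where(mask == 0, 255, mask)

print(transformed_mask)
```

On a full-size image the vectorized form is much faster than calling a Python function once per pixel.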

Now you have a new mask in the correct form.

array([[255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       ...,
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255]])

Okay! With the right mask, you can start making the wordcloud in your selected shape.

# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

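You created a wordcloud in the shape of a wine bottle! It seems that wine descriptions most often mention black cherry, fruit flavors, and the full-bodied character of the wine. Now, let's take a closer look at the reviews for each country: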

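Creating a wordcloud following a color pattern

You can combine all the reviews of the five countries that have the most wines. To find those countries, you can either look at the plot of country versus number of wines above, or use the group from above to find the number of observations for each country (each group) and call sort_values() with the argument ascending=False to sort in descending order.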

country
US          54504
France      22093
Italy       19540
Spain        6645
Portugal     5691
dtype: int64

So now you have the 5 top countries: the US, France, Italy, Spain, and Portugal. Listing more of the sorted result shows the top 10:

country
US           54504
France       22093
Italy        19540
Spain         6645
Portugal      5691
Chile         4472
Argentina     3800
Austria       3345
Australia     2329
Germany       2165
dtype: int64

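For now, 5 countries should be enough.

To get all the reviews for each country, you can concatenate them using the " ".join(list) syntax, which joins all elements of a list into one string separated by spaces.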
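The counting and joining steps can be sketched together on a small frame. The rows are made up and reuse only the column names country and description; the names counts, top_countries, and text_by_country are hypothetical.

```python
import pandas as pd

# Made-up rows with the same column names as the wine dataset
df = pd.DataFrame({
    "country": ["US", "France", "US", "Italy", "US", "France"],
    "description": ["ripe cherry", "crisp apple", "soft oak",
                    "dried herb", "bright citrus", "dark plum"],
})

# Number of reviews per country, largest first
counts = df.groupby("country").size().sort_values(ascending=False)
top_countries = counts.head(2).index.tolist()

# One long string per country: all of its reviews separated by spaces
text_by_country = {
    name: " ".join(df.loc[df["country"] == name, "description"])
    for name in top_countries
}
print(text_by_country["US"])
```

Each string in text_by_country can then be passed to generate() to build one cloud per country.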

Then, create the wordcloud as outlined above.

# store to file
plt.savefig("img/us_wine.png", format="png")
plt.show()

Looks good! Now, let's repeat this with the reviews from France.

# store to file
plt.savefig("img/fra_wine.png", format="png")
#plt.show()

Note that you should save the image after plotting so that the word cloud has the desired color pattern.

# store to file
plt.savefig("img/ita_wine.png", format="png")
#plt.show()

After Italy comes Spain:

# store to file
plt.savefig("img/spa_wine.png", format="png")
#plt.show()

Finally, Portugal:

# store to file
plt.savefig("img/por_wine.png", format="png")
#plt.show()

The final results are shown in the table below.
