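Given a large set of product titles (electric kettle category), the goal is to extract hot words (selling points) from the titles and then group those selling points into categories by clustering.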
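The approach: first apply some text preprocessing; segment each title into a list of words with jieba; remove stop words and low-frequency words to obtain the keyword set; build a keyword-title matrix, a 0-1 matrix whose entries record whether a keyword appears among a title's words; compute the pairwise Jaccard distance between keywords; and run hierarchical clustering on those distances.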
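To make the distance step concrete, here is a minimal sketch on made-up data. The three keywords and the four-title matrix are illustrative assumptions, not values from the real dataset. The Jaccard distance between two keywords is one minus the number of titles containing both, divided by the number of titles containing at least one of them; the sketch checks a hand-written version against SciPy's `pdist`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical keywords and a toy 0-1 keyword-title matrix
# (rows = keywords, columns = titles); all values are made up.
keywords = ["stainless", "steel", "auto-off"]
matrix = np.array([
    [1, 1, 1, 0],   # "stainless" appears in titles 0, 1, 2
    [1, 1, 0, 0],   # "steel" appears in titles 0, 1
    [0, 0, 1, 1],   # "auto-off" appears in titles 2, 3
], dtype=bool)


def jaccard_distance(a, b):
    """Return 1 - |A and B| / |A or B| for two boolean occurrence vectors."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union


# Pairs in the order (0,1), (0,2), (1,2), the same order pdist uses.
manual = [jaccard_distance(matrix[i], matrix[j])
          for i in range(3) for j in range(i + 1, 3)]
condensed = pdist(matrix, metric="jaccard")
square = squareform(condensed)  # full symmetric matrix, zeros on the diagonal
print(np.round(square, 3))
```

Keywords that tend to appear in the same titles get a small distance, and keywords that never co-occur get a distance of 1, which is what lets the clustering step group related selling points.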
Core code (some helper functions are omitted, so it does not run as-is):
import jieba
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance as ssd

# clean() and file_to_list() are the omitted helper functions.

# Read one title per line and clean it.
filename = "data.sql"
titles = []
with open(filename, encoding="utf-8") as file:
    for line in file:
        titles.append(clean(line))

# Segment each title with jieba and count word frequencies.
jieba.load_userdict("my_dict.txt")
word_count = {}
titles_words = []  # one list of words per title
for title in titles:
    title_words = list(jieba.cut(title))  # jieba.cut returns a generator
    for word in title_words:
        word_count[word] = word_count.get(word, 0) + 1
    titles_words.append(title_words)

# Keep only words that are frequent enough and are not stop words.
stop_words = set(file_to_list("stopwords.txt"))
threshold = 10
word_list = [w for w, c in word_count.items()
             if c >= threshold and w not in stop_words]

# Build the 0-1 keyword-title matrix: one row per keyword, one column per title.
rows = len(word_list)
cols = len(titles_words)
print(rows)
print(cols)
title_sets = [set(words) for words in titles_words]  # fast membership tests
word_title_matrix = np.zeros((rows, cols), dtype=bool)
for row, word in enumerate(word_list):
    for col, title_set in enumerate(title_sets):
        if word in title_set:
            word_title_matrix[row, col] = True

# Pairwise Jaccard distances between keywords, then average-linkage clustering.
dis_array = ssd.pdist(word_title_matrix, metric="jaccard")
Z = sch.linkage(dis_array, "average")

# Use a font with Chinese glyphs so the keyword labels render correctly.
mpl.rcParams["font.sans-serif"] = ["SimHei"]
mpl.rcParams["axes.unicode_minus"] = False
plt.figure(figsize=(10, 300), dpi=200)
sch.dendrogram(Z, labels=word_list, orientation="left")
plt.savefig("figure1.png")
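The final dendrogram is very large. In the excerpt shown here, similar or related words can be seen already grouped together at the lower levels of the tree.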
![](https://img.haomeiwen.com/i6969567/4a76a664006ad641.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
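Because the full dendrogram is too large to read at a glance, one option is to cut the tree at a distance threshold and list the resulting groups of keywords. The sketch below does this with SciPy's `fcluster` on a toy matrix. The keywords, the matrix values, and the 0.7 cut-off are illustrative assumptions, not values from the analysis above; a suitable threshold for real data would have to be chosen by inspecting the tree.

```python
from collections import defaultdict

import numpy as np
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance as ssd

# Toy stand-ins for word_list and word_title_matrix (made-up values).
word_list = ["stainless", "steel", "auto-off", "boil-dry"]
word_title_matrix = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=bool)

dis_array = ssd.pdist(word_title_matrix, metric="jaccard")
Z = sch.linkage(dis_array, "average")

# Cut the tree: keywords whose cophenetic distance is at most 0.7
# receive the same flat-cluster label.
labels = sch.fcluster(Z, t=0.7, criterion="distance")

# Collect the keywords under each label.
clusters = defaultdict(list)
for word, label in zip(word_list, labels):
    clusters[int(label)].append(word)

for label, words in sorted(clusters.items()):
    print(label, words)
```

With these toy values the two material words fall into one group and the two safety words into another. Lowering the threshold splits the groups further, and raising it merges them.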