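Getting started

First, download the novel. I won't post the download link here; please find it yourself with a search engine such as Baidu.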

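Filtering

The downloaded txt file contains many special characters that should not show up in the final statistics, so filter them out first.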

#! /usr/bin/env python3
import re

replace_list = []


# Find the non-Chinese characters in a line, one character at a time
def find_unchinese(file):
    pattern = re.compile(r'[\u4e00-\u9fa5]')
    unchinese = re.sub(pattern, "", file)
    for i in unchinese:
        if i != "\n":
            replace_list.append(i)


# Read the file line by line and collect the non-Chinese characters
with open("doupo.txt") as f:
    for line in f:
        find_unchinese(line)
# Deduplicate
replace_list = list(set(replace_list))
print(replace_list, len(replace_list))

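with open("doupo_new.txt", "w") as b:
    with open("doupo.txt") as f:
        for line in f:
            # Strip every collected special character from the line
            for i in replace_list:
                line = line.replace(i, "")
            # Spaces only need to be removed once per line, not once per character
            line = line.replace(" ", "")
            # Write the line only if something is left after filtering
            if line.strip() != "":
                b.write(line)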

After filtering, the novel contains only its Chinese text.

(Image: the novel after filtering)
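
As an alternative to collecting the special characters first and then replacing them one by one, the same result can probably be reached in a single pass by negating the character class. This is only a sketch, not the script used above: the helper names `keep_chinese` and `filter_lines` and the sample lines are made up for illustration, and it assumes the same `\u4e00-\u9fa5` range counts as "Chinese".

```python
import re

# Matches every character that is neither in the \u4e00-\u9fa5 range
# nor a newline (newlines are kept so the line structure survives).
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5\n]")


def keep_chinese(line):
    """Return the line with every non-Chinese character removed."""
    return NON_CHINESE.sub("", line)


def filter_lines(lines):
    """Yield the cleaned lines, skipping those that end up empty."""
    for line in lines:
        cleaned = keep_chinese(line)
        if cleaned.strip():
            yield cleaned


# Made-up sample input, not taken from the novel.
sample = ["第一章 开始!\n", "*** www.example.com ***\n", "今天,天气很好。\n"]
print(list(filter_lines(sample)))
```

To process the real file, `filter_lines` could be fed the open `doupo.txt` handle and its output written to `doupo_new.txt`.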

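Preparing the stop word list

Stop words are a special class of words; removing them can speed up retrieval, and here it also keeps them out of the word counts. I recommend the stop word list from Harbin Institute of Technology (HIT) (download link).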
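
To make the effect of a stop word list concrete, here is a minimal sketch. It assumes the list holds one stop word per line; the helper names `load_stop_words` and `remove_stop_words`, the tiny stop word list, and the pre-split tokens are all made up for illustration.

```python
def load_stop_words(lines):
    """Build a set from stop word lines, ignoring blank lines and whitespace."""
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words, preserving order."""
    return [token for token in tokens if token not in stop_words]


# Made-up stand-ins for the lines of a stop word file and a segmented sentence.
stop_words = load_stop_words(["的\n", "了\n", "\n", "是\n"])
tokens = ["他", "是", "家族", "的", "天才"]
print(remove_stop_words(tokens, stop_words))
```

Keeping the stop words in a set makes each membership test cheap, which matters once every token of a whole novel is checked against it.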

Storing the files in HDFS

Put the processed novel and the stop word list into HDFS.

hdfs dfs -copyFromLocal doupo_new.txt /doupo_new.txt
hdfs dfs -copyFromLocal stop_words.txt /stop_words.txt
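
Before running the Spark job it can help to confirm that both files really landed in HDFS. The sketch below wraps `hdfs dfs -test -e`, which exits with 0 when the path exists. The helper names `exists_command` and `missing_in_hdfs` are hypothetical, and the check is skipped when no `hdfs` command is found on the PATH.

```python
import shutil
import subprocess


def exists_command(path):
    """Build the argument list for an HDFS existence check."""
    return ["hdfs", "dfs", "-test", "-e", path]


def missing_in_hdfs(paths):
    """Return the paths that are missing, or None if the hdfs CLI is unavailable."""
    if shutil.which("hdfs") is None:
        return None
    return [p for p in paths if subprocess.run(exists_command(p)).returncode != 0]


print(missing_in_hdfs(["/doupo_new.txt", "/stop_words.txt"]))
```

An empty list means both uploads are in place; `None` means the check could not be run on this machine.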

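Downloading a font

The default font of word_cloud does not support Chinese, so Chinese text comes out garbled. Prepare a font in advance; this post uses the PangMenZhengDao (旁门正道) font.

PangMenZhengDao was made by an individual who has declared it permanently free, which avoids a lot of copyright trouble.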

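Preparing the background image

Find an image with a white background; the non-white area is what ends up filled with words. I took one from the internet and will remove it on request if it infringes.

(Image: the background image)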
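
One practical caveat: wordcloud treats only pure white (255) pixels as masked out, and a JPEG background often contains near-white pixels that are not exactly 255. The sketch below, which is not part of the original script, snaps near-white pixels to pure white and reports how much of the image stays drawable. The helper names, the threshold of 240, and the tiny 2x2 sample array are assumptions for illustration.

```python
import numpy as np


def whiten_background(pixels, threshold=240):
    """Return a copy where pixels with all RGB channels >= threshold become pure white."""
    pixels = np.asarray(pixels, dtype=np.uint8)
    cleaned = pixels.copy()
    near_white = np.all(pixels[..., :3] >= threshold, axis=-1)
    cleaned[near_white] = 255
    return cleaned


def drawable_fraction(pixels):
    """Fraction of pixels that are not pure white, i.e. where words may be drawn."""
    pixels = np.asarray(pixels)
    masked_out = np.all(pixels[..., :3] == 255, axis=-1)
    return 1.0 - masked_out.mean()


# A made-up 2x2 RGB image: one pure white, one near-white, two colored pixels.
sample = np.array(
    [[[255, 255, 255], [250, 252, 249]],
     [[10, 20, 30], [200, 100, 50]]],
    dtype=np.uint8,
)
print(drawable_fraction(sample), drawable_fraction(whiten_background(sample)))
```

With a real image, the array returned by `np.array(Image.open(...))` could be passed through `whiten_background` before being handed to `WordCloud` as the mask.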

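Installing the required libraries

The script uses the Python libraries pyspark, jieba, word_cloud, numpy and matplotlib, among others (it also imports PIL, which the Pillow package provides). Install them with pip beforehand; note that word_cloud is published on PyPI as wordcloud.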
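
A quick way to see which of these libraries are still missing is to ask Python whether each module can be found. This is only a convenience sketch: the `REQUIRED` mapping and the `missing_packages` helper are made up here, and the pip names on the right are the usual PyPI names for the modules the script imports.

```python
import importlib.util

# Import name -> name to pass to pip (they differ for PIL).
REQUIRED = {
    "pyspark": "pyspark",
    "jieba": "jieba",
    "wordcloud": "wordcloud",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "PIL": "Pillow",
}


def missing_packages(required):
    """Return the pip names of the modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("pip install " + " ".join(missing))
else:
    print("all libraries are importable")
```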

Writing the Spark job with pyspark

Everything is ready, so write the script and generate the word cloud.

# -*- coding: utf-8 -*-
import jieba
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
from pyspark.sql import SparkSession


# Step 1: read the files from HDFS with textFile
spark = SparkSession.builder.master("local").config("spark.hadoop.mapreduce.job.run-local", "true").getOrCreate()
context = spark.sparkContext

# Load the stop word list
stop_word_rdd = context.textFile("hdfs://127.0.0.1:9000/stop_words.txt")
stop_words = set(stop_word_rdd.collect())


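# Segment the Chinese text and remove the stop words with a set difference.
# Note: set() also deduplicates, so each word is counted at most once per line.
def get_word(line):
    return set(jieba.cut(line, cut_all=False)) - stop_words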


# The result is a list of (word, num) pairs that has to be sorted by num
def sort_result(elem):
    return elem[1]


rdd = context.textFile("hdfs://127.0.0.1:9000/doupo_new.txt")
new_rdd = rdd.flatMap(lambda line: get_word(line))
result = new_rdd.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y).collect()
result.sort(key=sort_result, reverse=True)

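# Keep the 500 most frequent words as candidates; max_words below caps the image at 100
words_dict = dict()
for i in result[:500]:
    words_dict[i[0]] = i[1]

# Load the background image to use as the mask
color_mask = np.array(Image.open("background.jpeg"))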
cloud = WordCloud(
    font_path="PangMenZhengDaoBiaoTiTi-1.ttf",
    mask=color_mask,
    background_color='white',
    max_words=100,
    min_font_size=10,
    max_font_size=40,
    collocations=False,
    random_state=42
)
wCloud = cloud.generate_from_frequencies(words_dict)
image_colors = ImageColorGenerator(color_mask)
plt.imshow(wCloud, interpolation="bilinear")
plt.axis("off")
plt.savefig("2.jpg")
# Recolor the words with colors sampled from the background image
plt.imshow(wCloud.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")
plt.savefig("4.jpg")
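
One consequence of the set difference in get_word is worth spelling out: because set() deduplicates, a word is counted at most once per line, so the numbers are closer to "lines containing the word" than to raw occurrences. The pure-Python sketch below (no Spark, no jieba) contrasts the two ways of counting. The stop word set, the pre-tokenized lines, and both helper names are made up for illustration.

```python
from collections import Counter

stop_words = {"的", "了"}  # made-up stop word set

# Each inner list stands in for the tokens jieba.cut would yield for one line.
tokenized_lines = [
    ["少年", "的", "天赋", "少年"],
    ["天赋", "了", "少年"],
]


def count_once_per_line(lines):
    """Mirror set(tokens) - stop_words: each word counts once per line."""
    counts = Counter()
    for tokens in lines:
        counts.update(set(tokens) - stop_words)
    return counts


def count_every_occurrence(lines):
    """Count every non-stop-word token, including repeats within a line."""
    counts = Counter()
    for tokens in lines:
        counts.update(t for t in tokens if t not in stop_words)
    return counts


print(count_once_per_line(tokenized_lines))
print(count_every_occurrence(tokenized_lines))
```

If raw frequencies are wanted in the Spark job, get_word could return a filtered list instead of a set; which of the two is preferable depends on what the word cloud is meant to show.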