Install the R packages
jiebaR, jiebaRD: word segmentation
wordcloud2: word cloud generation
# These packages are distributed on CRAN, so install.packages() is sufficient
install.packages(c("jiebaR", "jiebaRD", "wordcloud2"))
library(jiebaR)      # also loads jiebaRD, which jiebaR depends on
library(wordcloud2)
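To avoid reinstalling packages that are already present, the install step can be made conditional. This is only a sketch: the helper name missing_packages is hypothetical, and the install call assumes a CRAN mirror is reachable.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("jiebaR", "jiebaRD", "wordcloud2")
to_install <- missing_packages(needed)

# Install only in an interactive session, and only when something is missing,
# so that running this as a script never triggers a download by surprise.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

The interactive() guard is a choice made for this sketch; drop it if the script is meant to provision a fresh machine.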
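Installation succeeded
> library(jiebaR,jiebaRD)
Loading required package: jiebaRD
Warning messages:
1: package ‘jiebaR’ was built under R version 3.6.3
2: package ‘jiebaRD’ was built under R version 3.6.3
> library(wordcloud2)
Warning message:
package ‘wordcloud2’ was built under R version 3.6.3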
Process the data
Syntax
Call worker() from the jiebaR package to create the segmentation engine.
Its parameters are as follows:
worker(type = "mix", dict = DICTPATH, hmm = HMMPATH,
user = USERPATH, idf = IDFPATH, stop_word = STOPPATH, write = T,
qmax = 20, topn = 5, encoding = "UTF-8", detect = T,
symbol = F, lines = 1e+05, output = NULL, bylines = F,
user_weight = "max")
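- type = "mix": the default segmentation engine is the mixed model (MixSegment)
- user: the user dictionary
- stop_word: the stop word list
- write: whether to write the output to a file or return the result. Only used when the input is a file path. Defaults to TRUE
- qmax: maximum number of characters in a word, default 20
- topn: number of keywords, default 5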
- ……
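Custom dictionary
A custom dictionary prevents a term such as "大数据" (big data) from being split into "大" and "数据".
- First, run
show_dictpath()
to see where the dictionaries are stored.
- In the current folder, create a new text file and enter the words you need, one per line.
- Rename it "your_name.utf8" (utf8 indicates the encoding).
- When calling worker(), set the user parameter to your custom file.
Creating a stop_word dictionary in the same way and pointing stop_word at it does not work. It fails with:
There is no such file for stop words.
A different way to filter stop words is described below.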
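The dictionary file can also be written from R instead of by hand. The sketch below assumes the one-word-per-line layout described above; the file name my_dict.utf8 and the second entry are illustrative, and the file goes to a temporary folder so nothing in the working directory is touched.

```r
# Words that should be kept whole, one per line (illustrative entries).
custom_words <- c("大数据", "云计算")

# Hypothetical file name and location, chosen only for this example.
dict_path <- file.path(tempdir(), "my_dict.utf8")

# enc2utf8() plus useBytes = TRUE writes the strings as UTF-8 bytes
# regardless of the platform's native encoding.
writeLines(enc2utf8(custom_words), dict_path, useBytes = TRUE)

# Read the file back to confirm there is one entry per line.
dict_lines <- readLines(dict_path, encoding = "UTF-8", warn = FALSE)
length(dict_lines)
```

The resulting path can then be passed as the user argument of worker().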
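Import the data
Reading CSV table data
sm_total <- read.csv("filename.csv", stringsAsFactors = FALSE)  # read the file
title <- sm_total$Title  # extract the column needed
A side note
Remember to set stringsAsFactors = FALSE in read.csv. Otherwise, in R versions before 4.0.0 (where the default is TRUE), the column is read as a factor,
and the later call to segment() fails with: Error in segment(data, wk) : Argument 'code' must be an string.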
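The difference is easy to check on a small file. This sketch uses made-up rows written to a temporary CSV, not the author's data, and also shows how to recover if a column has already been read as a factor.

```r
# Write a tiny CSV to a temporary file (illustrative data only).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Title,Year",
    "Big data in medicine,2019",
    "Deep learning for text,2020"),
  csv_path
)

as_factor <- read.csv(csv_path, stringsAsFactors = TRUE)
as_text   <- read.csv(csv_path, stringsAsFactors = FALSE)

class(as_factor$Title)  # "factor"
class(as_text$Title)    # "character"

# A column that was already read as a factor can be converted
# back to character before it is passed to segment().
title <- as.character(as_factor$Title)
```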
Word segmentation
Use segment()
wk <- worker(user = 'SM_dict.utf8')  # custom dictionary
sm_seg <- segment(title, wk)  # one of several ways to run segmentation
Remove stop words
That is, remove uninformative words such as "a", "and", "the", "of", ...
Use filter_segment()
# words to filter out
filter <- c("a","an","is","was","are","been","and","or","as","its","of","for","by","in","on","from","the")
sm_seg_filter <- filter_segment(sm_seg, filter)
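The same idea can be written in base R, which is handy for checking what ends up being dropped. This is a sketch on a toy token vector and is not how filter_segment() is implemented. It lower-cases tokens before comparing, which is a choice made here and not something the steps above say about filter_segment().

```r
# Toy segmentation result standing in for sm_seg (illustrative only).
tokens <- c("The", "role", "of", "big", "data", "in", "the", "clinic")
stop_words <- c("a", "an", "and", "of", "in", "the")

# Keep a token only if its lower-cased form is not a stop word.
kept <- tokens[!(tolower(tokens) %in% stop_words)]
kept
# "role" "big" "data" "clinic"
```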
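Draw the word cloud
Compute word frequencies
word_frequency <- table(sm_seg_filter)
The result lists each word with its number of occurrences beneath it.
My file is large, so I use only the 100 most frequent words to build the word cloud.
Sort first with sort()
freq_sort <- sort(word_frequency, decreasing = TRUE)
head(freq_sort)  # show the first 6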
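On a small vector the counting and sorting steps look like this. The tokens are invented for the example; the variable names follow the ones used above.

```r
# Toy tokens standing in for sm_seg_filter (illustrative only).
tokens <- c("data", "model", "data", "cloud", "model", "data")

# table() counts each distinct word and orders the result alphabetically.
word_frequency <- table(tokens)

# sort() reorders the counts from most to least frequent.
freq_sort <- sort(word_frequency, decreasing = TRUE)

names(freq_sort)      # "data" "model" "cloud"
as.vector(freq_sort)  # 3 2 1

# head() with an explicit n keeps only the top n words.
head(freq_sort, 2)
```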
Drawing
Syntax
wordcloud2(data, size = 1, minSize = 0, gridSize = 0,
fontFamily = 'Segoe UI', fontWeight = 'bold',
color = 'random-dark', backgroundColor = "white",
minRotation = -pi/4, maxRotation = pi/4, shuffle = TRUE,
rotateRatio = 0.4, shape = 'circle', ellipticity = 0.65,
widgetsize = NULL, figPath = NULL, hoverFunction = NULL)
- size: font size
- fontFamily: font
- minRotation/maxRotation: range of rotation angles
- rotateRatio: proportion of words that are rotated; if set to 1, every word is rotated
- shape: shape of the cloud. The default is 'circle'. Other options are 'cardioid' (apple or heart shape), 'star', 'diamond', 'triangle-forward', 'triangle', and 'pentagon'
- ……
wordcloud2(head(freq_sort,100),color = "random-light",minRotation = pi/6,maxRotation = pi/6,rotateRatio = 1)
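If you prefer to pass an explicit data frame, the sorted table can be converted first. This is a sketch on toy counts: the column names word and freq and the variable cloud_data are chosen for this example, and the plotting call is skipped when wordcloud2 is not installed.

```r
# Toy frequency table (illustrative only); the post uses the top 100 words.
freq_sort <- sort(
  table(c("data", "model", "data", "cloud", "model", "data")),
  decreasing = TRUE
)
top_n <- 2
top_words <- head(freq_sort, top_n)

# One row per word: the word itself and its count.
cloud_data <- data.frame(
  word = names(top_words),
  freq = as.vector(top_words),
  stringsAsFactors = FALSE
)

# Build the widget only if the package is available; print `cloud` to display it.
if (requireNamespace("wordcloud2", quietly = TRUE)) {
  cloud <- wordcloud2::wordcloud2(cloud_data, shape = "circle", rotateRatio = 0.4)
}
```

Having the counts in a data frame also makes it easy to inspect or filter them before plotting.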
Result

Ref.
https://blog.csdn.net/snowdroptulip/article/details/78836941
https://www.jianshu.com/p/a4ba7637680c