Loading the packages

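Rwordseg: word segmentation package. Don't try to segment very large files: in this run it took about five minutes to segment a novel of roughly 1 MB (a few hundred thousand characters).
wordcloud2: word cloud drawing package
RColorBrewer: color palettes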

library(Rwordseg)
library(wordcloud2)
library(RColorBrewer)

Setting the colors

wordcolor <- rep(brewer.pal(n = 10, name = 'Set3'), 10)
# Take 10 colors from the Set3 palette and cycle through them 10 times (100 colors in total, one per word)
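
As a small base-R sketch of what rep() does in the line above: rep(x, 2) cycles through the whole vector, whereas rep(x, each = 2) repeats each element in a row. The three-color vector pal is a made-up stand-in for the Set3 palette, used only so the example runs without RColorBrewer.

```r
# Hypothetical 3-color palette standing in for brewer.pal(n = 10, name = 'Set3')
pal <- c("#8DD3C7", "#FFFFB3", "#BEBADA")

cycled  <- rep(pal, 2)         # whole palette repeated: 1 2 3 1 2 3
blocked <- rep(pal, each = 2)  # each color repeated:    1 1 2 2 3 3

print(cycled)
print(blocked)
```

With the cycled form, neighboring words in the frequency ranking get different colors, which is probably what you want for a word cloud.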

Segmentation

Arguments of the wordcloud2 function

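parameter	explanation
data	the data (a data frame with a word column and a frequency column, or a frequency table)
size	font size; larger values give bigger words
minSize	minimum font size; smaller words are not drawn
gridSize	grid size; larger values leave bigger gaps between words
fontFamily	font family
fontWeight	font weight
color	text color
backgroundColor	background color
minRotation	minimum rotation angle of the text, in radians
maxRotation	maximum rotation angle of the text, in radians (if both values are equal, every rotated word uses the same angle)
shuffle	whether to shuffle the drawing order, so the layout differs on each run
rotateRatio	probability that a word is rotated
shape	shape of the cloud
ellipticity	how flat the shape is
widgetsize	size of the widget
figPath	path to an image used as a mask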

Available shapes:

'circle' (default),
'cardioid' (apple or heart shape curve, the most known polar equation),
'diamond' (alias of square),
'triangle-forward',
'triangle',
'pentagon',
'star'
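
To make the table above more concrete, here is a minimal sketch of a call that sets several of these arguments at once. The data frame freq_df and its contents are invented for illustration, the argument values are arbitrary, and the call is wrapped in requireNamespace() so the snippet still runs when wordcloud2 is not installed.

```r
# Hypothetical word/frequency data: first column words, second column counts
freq_df <- data.frame(
  word = c("sword", "master", "river", "mountain", "moon"),
  freq = c(50, 30, 20, 10, 5),
  stringsAsFactors = FALSE
)

if (requireNamespace("wordcloud2", quietly = TRUE)) {
  wc <- wordcloud2::wordcloud2(
    freq_df,
    size = 0.8,                 # overall font scale
    color = "random-dark",      # random dark color per word
    backgroundColor = "white",
    minRotation = -pi / 6,      # rotation angles are given in radians
    maxRotation = pi / 6,
    rotateRatio = 0.5,          # about half of the words get rotated
    shape = "diamond"
  )
  # Evaluating `wc` at the console (or print(wc)) renders the widget.
}
```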

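x <- readLines('data.txt', encoding = 'UTF-8') # read the file

xx <- segmentCN(x, analyzer = 'hmm', returnType = 'vector') # segment the Chinese text with segmentCN

xx <- unlist(xx) # not needed for a single paragraph; with several paragraphs the result is a list

xx <- xx[nchar(xx) > 1] # keep strings longer than one character (use nchar, not length)

xx <- xx[xx != '说道']
# Drop '说道' ("said"): it appears almost 3,000 times in 300,000+ characters. Does someone really "say" something every hundred characters?

top <- head(sort(table(xx), decreasing = TRUE), 100) # the 100 most frequent words; decreasing = TRUE sorts in descending order

wordcloud2(top, color = wordcolor, shape = 'star')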
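
The filtering and counting steps can be tried without Rwordseg on a hand-written token vector. This is only a sketch: tokens is a made-up stand-in for the output of segmentCN(), with English words in place of Chinese ones and "said" playing the role of '说道'.

```r
# Hypothetical tokens standing in for the segmentation result
tokens <- c("said", "sword", "a", "sword", "I", "said",
            "master", "sword", "master", " ")

length(tokens)  # 10: the number of tokens, not their character counts
nchar(tokens)   # characters per token, which is what the filter needs

kept <- tokens[nchar(tokens) > 1]  # drops "a", "I" and the lone space
kept <- kept[kept != "said"]       # drops the over-frequent word

top_words <- head(sort(table(kept), decreasing = TRUE), 100)
print(top_words)
```

head(..., 100) simply returns everything when there are fewer than 100 distinct words, so the result never contains NA entries.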

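The values returned by segmentCN() do not contain spaces. Still, to be safe it is best to run xx <- xx[nchar(xx)>1], which removes any stray spaces along with single-character words.

To filter out numbers, you can use xx<-xx[!grepl('[1-9]',xx)]. Note that '[1-9]' does not match 0; use '[0-9]' if tokens made up only of zeros should be removed as well.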

The grep() and grepl() functions

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
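Pattern-matching function.
It searches a character vector for a pattern (a regular expression).
By default it returns the positions of the elements that match the pattern.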

y <- c('a','b','z','g','c','c','d')
grep("[a-d]", y)
[1] 1 2 5 6 7
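
As a short sketch of two other arguments from the signature above: value = TRUE returns the matching elements themselves instead of their positions, and invert = TRUE returns the elements that do not match.

```r
y <- c('a', 'b', 'z', 'g', 'c', 'c', 'd')

matched_values <- grep("[a-d]", y, value = TRUE)   # "a" "b" "c" "c" "d"
unmatched_pos  <- grep("[a-d]", y, invert = TRUE)  # 3 4

print(matched_values)
print(unmatched_pos)
```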

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Works like grep(), but returns a logical vector with one value per element.

grepl("[a-d]", y)
[1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
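
A logical result is convenient for subsetting and for negation with !. The following sketch reuses the same vector y and also shows how the two functions relate.

```r
y <- c('a', 'b', 'z', 'g', 'c', 'c', 'd')
hit <- grepl("[a-d]", y)

inside  <- y[hit]     # elements that match:       "a" "b" "c" "c" "d"
outside <- y[!hit]    # elements that do not match: "z" "g"
n_hits  <- sum(hit)   # TRUE counts as 1, so this is the number of matches: 5

# which() turns the logical vector back into the positions grep() returns
same_positions <- identical(which(hit), grep("[a-d]", y))

print(outside)
print(same_positions)
```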

m <- as.character(c(1:9))
grepl('[1-5]',m)
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

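So grepl('[1-9]',xx) tells, for each string in xx, whether it contains any of the digits 1 to 9.

!grepl('[1-9]',xx) negates that result, so xx[!grepl('[1-9]',xx)] selects the strings in xx that contain none of the digits 1 to 9.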
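
The single-character examples above cannot show the difference between "is a digit" and "contains a digit", so here is a small sketch with longer strings. The vector tok is invented for illustration; it also compares '[1-9]' with '[0-9]'.

```r
# Hypothetical tokens mixing digits and letters
tok <- c("100", "00", "abc", "2nd", "x0")

has_1_to_9 <- grepl('[1-9]', tok)  # TRUE FALSE FALSE TRUE FALSE
has_digit  <- grepl('[0-9]', tok)  # TRUE TRUE  FALSE TRUE TRUE

kept_1_to_9 <- tok[!has_1_to_9]    # "00" "abc" "x0": zeros slip through
kept_digit  <- tok[!has_digit]     # "abc"

print(kept_1_to_9)
print(kept_digit)
```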