Text Mining for Machine Learning (1): Word Clouds


Author: changlugen | Published: 2018-10-28 16:26

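This text-mining walkthrough is built on the R tm package. The dataset is Barack Obama's addresses to Congress (the State of the Union speeches, 2010 to 2016).
Link: https://github.com/datameister66/data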

1. Load the data

library(tm)

Build the path to the folder that contains the speech transcripts:

name <- file.path("/Users/mac/rstudio-workplace/txtData")

List the files in that folder:

dir(name)
[1] "sou2010.txt" "sou2011.txt" "sou2012.txt" "sou2013.txt" "sou2014.txt" "sou2015.txt"
[7] "sou2016.txt"

Count the files in the folder:

length(dir(name))
[1] 7

Build the corpus with Corpus() and name it docs:

docs <- Corpus(DirSource(name))

Use inspect() to view the contents of the corpus:

inspect(docs[1])
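
If the speech files are not at hand, the same steps can be tried on a small in-memory corpus. The following is a minimal sketch, assuming the tm package is installed; the two sentences and the names texts and toy are made up for illustration and are not part of the dataset.

```r
library(tm)

# Invented stand-ins for the speech files (illustration only).
texts <- c("Jobs and new jobs for the American people.",
           "We will make America work in 2016!")

# VectorSource() treats each element of the vector as one document.
toy <- VCorpus(VectorSource(texts))

n_docs <- length(toy)                  # number of documents in the corpus
first_text <- as.character(toy[[1]])   # raw text of the first document
print(n_docs)
print(first_text)
```

Swapping VectorSource(texts) for DirSource(name) gives a corpus read from the folder instead.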

2. Transform the text with tm_map() from the tm package

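Convert letters to lowercase with tolower. Because tolower is a base R function rather than a tm transformation, wrap it in content_transformer() so that tm_map() keeps returning a valid corpus:

docs <- tm_map(docs, content_transformer(tolower))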

Remove numbers: removeNumbers

docs <- tm_map(docs,removeNumbers)

Remove punctuation: removePunctuation

docs <- tm_map(docs,removePunctuation)

Remove stop words: removeWords with stopwords("english")

docs <- tm_map(docs,removeWords,stopwords("english"))

Collapse extra whitespace: stripWhitespace

docs <- tm_map(docs,stripWhitespace)

Remove other unwanted words by passing removeWords a character vector:

docs <- tm_map(docs,removeWords,c("applause","can","cant","will","that","weve","dont","wont","youll","youre"))
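
To see what the transformations do to a single sentence, here is a minimal sketch on a one-document corpus, assuming the tm package is installed. The sentence and the names toy and cleaned are invented for illustration. It also suggests why words such as "cant" have to be removed by hand: removePunctuation has already stripped the apostrophe, so the stop-word entry "can't" no longer matches.

```r
library(tm)

# One invented sentence (illustration only).
toy <- VCorpus(VectorSource("We CAN'T stop now: 2 million new jobs!"))

toy <- tm_map(toy, content_transformer(tolower))  # lowercase
toy <- tm_map(toy, removeNumbers)                 # drop digits
toy <- tm_map(toy, removePunctuation)             # "can't" becomes "cant"
toy <- tm_map(toy, removeWords,
              c(stopwords("english"), "cant"))    # stop words plus a custom word
toy <- tm_map(toy, stripWhitespace)               # collapse runs of spaces

cleaned <- trimws(as.character(toy[[1]]))
print(cleaned)
```

Running stripWhitespace last, as in this sketch, also cleans up the gaps left by the custom word removal.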

3. Put the corpus into a document-term matrix

dtm <- DocumentTermMatrix(docs)

The matrix has 7 documents and 4715 terms:

dim(dtm)
[1] 7 4715

Inspect the matrix:

inspect(dtm)
<<DocumentTermMatrix (documents: 7, terms: 4715)>>
Non-/sparse entries: 10899/22106
Sparsity           : 67%
Maximal term length: 17
Weighting          : term frequency (tf)
Sample             :
             Terms
Docs          america american jobs make new now people thats work years
  sou2010.txt      18       18   23   14  20  30     32    26   21    20
  sou2011.txt      18       19   25   23  36  25     31    24   20    25
  sou2012.txt      30       34   34   15  27  26     21    24   16    18
  sou2013.txt      24       19   32   20  24  35     18    18   20    22
  sou2014.txt      28       21   23   22  29  11     24    19   27    21
  sou2015.txt      35       19   18   23  41  15     22    30   20    25
  sou2016.txt      21       16    8   17  16  15     21    29   20    17

Inspect only the part of the matrix you are interested in:

inspect(dtm[1:3,1:3])
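
A document-term matrix is simply a table with one row per document and one column per term, holding word counts. The following base R sketch builds one by hand for two tiny documents; the documents and the names docs_toy, vocab and dtm_toy are invented for illustration, and this is not how tm builds its sparse matrix internally.

```r
# Two invented, already-cleaned documents (illustration only).
docs_toy <- c(a = "jobs new jobs", b = "new america")

# Split each document into words.
tokens <- strsplit(docs_toy, " ", fixed = TRUE)

# The vocabulary is the sorted set of distinct words.
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every document (one column per document).
counts <- vapply(tokens,
                 function(w) as.integer(table(factor(w, levels = vocab))),
                 integer(length(vocab)))

# Transpose so that rows are documents and columns are terms.
dtm_toy <- t(counts)
colnames(dtm_toy) <- vocab
print(dtm_toy)
```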

4. Term frequency analysis

Compute the column sums, i.e. the total count of each term across all documents:

freq <- colSums(as.matrix(dtm))
head(freq)
     abide    ability       able     abroad absolutely     abuses 
         1          4         14         13          4          1 

Get the ordering of freq from most to least frequent (order() returns positions, not the sorted values):

ord <- order(-freq)
head(ord)
[1] 913 60 1386 991 755 922

Look at the six most frequent terms:

freq[head(ord)]
    new america   thats  people    jobs     now 
    193     174     170     169     163     157 

Look at the six least frequent terms:

freq[tail(ord)]

withers  wordvoices worldexcept     worldin       worry yearsnamely 
      1           1           1           1           1           1 
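
The colSums() and order() steps can be checked on a matrix small enough to verify by eye. This is a minimal base R sketch; the counts and the names m, freq_toy and ord_toy are invented for illustration.

```r
# Invented counts: rows are documents, columns are terms.
m <- matrix(c(2, 0, 1, 0, 0, 3), nrow = 2,
            dimnames = list(c("doc1", "doc2"), c("jobs", "new", "work")))

freq_toy <- colSums(m)        # total count of each term: jobs 2, new 1, work 3
ord_toy <- order(-freq_toy)   # positions of the terms, most frequent first

print(freq_toy[ord_toy])      # terms reordered by decreasing count
print(sort(freq_toy, decreasing = TRUE))  # the same ranking in one call
```

Keeping the index vector from order() is handy when the same ranking has to be applied with both head() and tail(), as above.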

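Look at the frequency of frequencies, i.e. how many terms occur a given number of times.

head() shows the lowest counts (2226 terms occur only once); tail() shows the highest counts, each of which is reached by a single term: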

head(table(freq))
freq
   1    2    3    4    5    6 
2226  788  382  234  142  137 
tail(table(freq))
freq
157 163 169 170 174 193 
  1   1   1   1   1   1 
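
How table() turns term counts into a frequency of frequencies is easier to see on a handful of values. A minimal base R sketch, with invented counts and hypothetical names freq_toy and fof:

```r
# Invented term counts: terms a and b occur once, c twice, d five times.
freq_toy <- c(a = 1, b = 1, c = 2, d = 5)

# table() counts how many terms share each count value.
fof <- table(freq_toy)
print(fof)

# Number of terms that occur exactly once.
n_once <- as.integer(fof["1"])
print(n_once)
```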

Use findFreqTerms() to find the terms that occur at least N times (here N = 125):

findFreqTerms(dtm,125)
[1] "america" "american" "americans" "jobs" "make" "new" "now"
[8] "people" "thats" "work" "year" "years"
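
The same kind of filter can be written directly against a named frequency vector. This base R sketch reuses the six top counts reported earlier in this section; the name freq_top and the threshold of 170 are chosen only for illustration.

```r
# The six highest term counts from the freq vector shown earlier.
freq_top <- c(new = 193, america = 174, thats = 170,
              people = 169, jobs = 163, now = 157)

# Keep the names of the terms whose total count reaches the threshold.
frequent <- names(freq_top)[freq_top >= 170]
print(frequent)
```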

Use findAssocs() to compute correlations and find associations between terms.

For example, the terms whose correlation with job is at least 0.9:

findAssocs(dtm,"job",corlimit = 0.9)
$job
    wrong pollution    forces  together   achieve  training 
     0.97      0.96      0.93      0.93      0.93      0.91 
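
The association score is a correlation between the per-document counts of two terms. The base R sketch below reproduces the idea with cor() on invented counts for four documents; the numbers and the names m_toy, r and strong are assumptions for illustration, not values from the speeches. Since the corpus here has only seven documents, each correlation rests on seven points, so high values should be read with some caution.

```r
# Invented per-document counts for three terms (one row per document).
m_toy <- cbind(job      = c(1, 2, 3, 4),
               training = c(2, 4, 6, 9),
               tax      = c(4, 1, 3, 2))

# Pearson correlation of "job" with each of the other terms.
r <- cor(m_toy)["job", c("training", "tax")]
print(round(r, 2))

# Keep only the terms at or above the correlation limit.
strong <- names(r)[r >= 0.9]
print(strong)
```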

5. Generate the word cloud

library(wordcloud)
wordcloud(names(freq),freq,min.freq = 70,scale = c(3,.3),colors = brewer.pal(6,"Dark2"))
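
Word placement in wordcloud() is random, so the picture changes from run to run. The following is a minimal sketch of a reproducible variant that also writes the image to disk, assuming the wordcloud package is installed. It reuses the six top counts reported above; the seed value, the image size and the file name wordcloud_sketch.png are arbitrary choices for illustration.

```r
library(wordcloud)   # also makes brewer.pal() from RColorBrewer available

# The six highest term counts from the freq vector shown earlier.
freq_top <- c(new = 193, america = 174, thats = 170,
              people = 169, jobs = 163, now = 157)

out_file <- "wordcloud_sketch.png"   # hypothetical output file name

set.seed(123)                        # fix the random layout
png(out_file, width = 800, height = 800)
wordcloud(names(freq_top), freq_top,
          min.freq = 1,              # keep every word in this small example
          random.order = FALSE,      # place the most frequent words first
          colors = brewer.pal(6, "Dark2"))
dev.off()
```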


[Figure: wordcloud01.png, the generated word cloud]
