This text-mining walkthrough is based on the tm package; the data set is Obama's State of the Union addresses to Congress.
Link: https://github.com/datameister66/data
1. Load the data
library(tm)
Build the path that contains the speech files:
name <- file.path("/Users/mac/rstudio-workplace/txtData")
List the files under that path:
dir(name)
[1] "sou2010.txt" "sou2011.txt" "sou2012.txt" "sou2013.txt" "sou2014.txt" "sou2015.txt"
[7] "sou2016.txt"
Count the files under the path:
length(dir(name))
[1] 7
Use Corpus() to build the corpus and name it docs:
docs <- Corpus(DirSource(name))
You can view the corpus contents with the inspect() function:
inspect(docs[1])
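inspect() prints a summary; to read the raw text of a single speech, a minimal sketch using the docs object built above:
# Extract the first document (sou2010.txt) and print its lines;
# as.character() returns the document text as a character vector
writeLines(as.character(docs[[1]]))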
2. Text transformations with tm's tm_map() function
Convert letters to lowercase with tolower. Wrapping it in content_transformer() keeps the corpus structure intact (passing bare tolower can break the document class in current tm versions):
docs <- tm_map(docs, content_transformer(tolower))
Remove numbers: removeNumbers
docs <- tm_map(docs,removeNumbers)
Remove punctuation: removePunctuation
docs <- tm_map(docs,removePunctuation)
Remove stop words: removeWords with the stopwords() list
docs <- tm_map(docs,removeWords,stopwords("english"))
Strip extra whitespace: stripWhitespace
docs <- tm_map(docs,stripWhitespace)
Remove other unwanted words: removeWords with a custom character vector
docs <- tm_map(docs,removeWords,c("applause","can","cant","will","that","weve","dont","wont","youll","youre"))
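tm_map() also accepts custom functions through content_transformer(), which is useful for cleanup the built-in transformations miss. A minimal sketch of that pattern (the toSpace helper and the "/" pattern are illustrative, not part of the original walkthrough):
# Illustrative helper: replace every match of a regex with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")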
3. Build a document-term matrix from the corpus
dtm <- DocumentTermMatrix(docs)
7 documents and 4715 terms:
dim(dtm)
[1] 7 4715
Inspect the matrix:
inspect(dtm)
<<DocumentTermMatrix (documents: 7, terms: 4715)>>
Non-/sparse entries: 10899/22106
Sparsity : 67%
Maximal term length: 17
Weighting : term frequency (tf)
Sample :
Terms
Docs america american jobs make new now people thats work years
sou2010.txt 18 18 23 14 20 30 32 26 21 20
sou2011.txt 18 19 25 23 36 25 31 24 20 25
sou2012.txt 30 34 34 15 27 26 21 24 16 18
sou2013.txt 24 19 32 20 24 35 18 18 20 22
sou2014.txt 28 21 23 22 29 11 24 19 27 21
sou2015.txt 35 19 18 23 41 15 22 30 20 25
sou2016.txt 21 16 8 17 16 15 21 29 20 17
View just a slice of the matrix:
inspect(dtm[1:3,1:3])
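When the matrix is too sparse for downstream modeling, tm's removeSparseTerms() drops the rarest terms; a sketch (the 0.75 threshold is an arbitrary choice for illustration):
# Drop terms absent from more than 75% of the documents;
# with 7 documents this keeps terms appearing in at least 2 of them
dtm_dense <- removeSparseTerms(dtm, 0.75)
dim(dtm_dense)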
4. Word frequency analysis
Sum each column to get the total count for each term:
freq <- colSums(as.matrix(dtm))
head(freq)
abide ability able abroad absolutely abuses
1 4 14 13 4 1
Sort freq in descending order; order() returns the positions of the sorted values:
ord <- order(-freq)
head(ord)
[1] 913 60 1386 991 755 922
Look at the six most frequent terms:
freq[head(ord)]
new america thats people jobs now
193 174 170 169 163 157
Look at the six least frequent terms:
freq[tail(ord)]
withers wordvoices worldexcept worldin worry yearsnamely
1 1 1 1 1 1
Tabulate the frequency of frequencies:
head() shows the rarest end, i.e. how many terms occur 1 to 6 times:
head(table(freq))
freq
1 2 3 4 5 6
2226 788 382 234 142 137
tail() shows the highest counts, each reached by a single term:
tail(table(freq))
freq
157 163 169 170 174 193
1 1 1 1 1 1
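Most of the vocabulary is rare. A quick check of how many terms occur exactly once, using the freq vector from above:
# Count singletons and their share of the 4715-term vocabulary
sum(freq == 1)     # 2226 terms appear exactly once
mean(freq == 1)    # about 0.47 of all terms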
Use findFreqTerms() to list terms that occur at least N times; here, at least 125 times:
findFreqTerms(dtm,125)
[1] "america" "american" "americans" "jobs" "make" "new" "now"
[8] "people" "thats" "work" "year" "years"
Use findAssocs() to compute correlations and find associations between terms.
For example, terms whose correlation with "job" is at least 0.9:
findAssocs(dtm,"job",corlimit = 0.9)
$job
wrong pollution forces together achieve training
0.97 0.96 0.93 0.93 0.93 0.91
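findAssocs() also accepts a vector of terms and returns one list element per term; a sketch (the term "economy" and the 0.85 cutoff are illustrative assumptions, not from the original output):
# Associations for several terms at once; corlimit applies to each term
findAssocs(dtm, c("job", "economy"), corlimit = 0.85)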
Generate a word cloud:
library(wordcloud)
wordcloud(names(freq),freq,min.freq = 70,scale = c(3,.3),colors = brewer.pal(6,"Dark2"))
[Figure: wordcloud01.png, word cloud of terms with frequency >= 70]
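The cloud layout is randomized, so each run looks different; fixing the RNG seed and capping the word count makes the figure reproducible. A sketch:
# Fix the seed so the layout is identical across runs
set.seed(1234)
wordcloud(names(freq), freq, min.freq = 70, max.words = 100,
          random.order = FALSE, scale = c(3, .3),
          colors = brewer.pal(6, "Dark2"))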