This text-mining walkthrough is based on the tm package; the data set is Obama's State of the Union addresses to Congress.
Link: https://github.com/datameister66/data
1. Load the data
library(tm)
Build the path that contains the speech files:
name <- file.path("/Users/mac/rstudio-workplace/txtData")
List the files under that path:
dir(name)
[1] "sou2010.txt" "sou2011.txt" "sou2012.txt" "sou2013.txt" "sou2014.txt" "sou2015.txt"
[7] "sou2016.txt"
Count the files under the path:
length(dir(name))
[1] 7
Use Corpus() to build the corpus and name it docs:
docs <- Corpus(DirSource(name))
You can view the corpus contents with the inspect() function:
inspect(docs[1])
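inspect() prints a summary; to read the raw text of a single speech, a minimal sketch using the docs object built above:
# Extract the first document (sou2010.txt) and print its lines;
# as.character() returns the document text as a character vector
writeLines(as.character(docs[[1]]))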
2. Text transformations with tm's tm_map() function
Convert letters to lowercase with tolower. Wrapping it in content_transformer() keeps the corpus structure intact (passing bare tolower can break the document class in current tm versions):
docs <- tm_map(docs, content_transformer(tolower))
Remove numbers: removeNumbers
docs <- tm_map(docs,removeNumbers)
Remove punctuation: removePunctuation
docs <- tm_map(docs,removePunctuation)
Remove stop words: removeWords with the stopwords() list
docs <- tm_map(docs,removeWords,stopwords("english"))
Strip extra whitespace: stripWhitespace
docs <- tm_map(docs,stripWhitespace)
Remove other unwanted words: removeWords with a custom character vector
docs <- tm_map(docs,removeWords,c("applause","can","cant","will","that","weve","dont","wont","youll","youre"))
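tm_map() also accepts custom functions through content_transformer(), which is useful for cleanup the built-in transformations miss. A minimal sketch of that pattern (the toSpace helper and the "/" pattern are illustrative, not part of the original walkthrough):
# Illustrative helper: replace every match of a regex with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")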
3. Build a document-term matrix from the corpus
dtm <- DocumentTermMatrix(docs)
7 documents and 4715 terms:
dim(dtm)
[1] 7 4715
Inspect the matrix:
inspect(dtm)
<<DocumentTermMatrix (documents: 7, terms: 4715)>>
Non-/sparse entries: 10899/22106
Sparsity : 67%
Maximal term length: 17
Weighting : term frequency (tf)
Sample :
Terms
Docs america american jobs make new now people thats work years
sou2010.txt 18 18 23 14 20 30 32 26 21 20
sou2011.txt 18 19 25 23 36 25 31 24 20 25
sou2012.txt 30 34 34 15 27 26 21 24 16 18
sou2013.txt 24 19 32 20 24 35 18 18 20 22
sou2014.txt 28 21 23 22 29 11 24 19 27 21
sou2015.txt 35 19 18 23 41 15 22 30 20 25
sou2016.txt 21 16 8 17 16 15 21 29 20 17
View just a slice of the matrix:
inspect(dtm[1:3,1:3])
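When the matrix is too sparse for downstream modeling, tm's removeSparseTerms() drops the rarest terms; a sketch (the 0.75 threshold is an arbitrary choice for illustration):
# Drop terms absent from more than 75% of the documents;
# with 7 documents this keeps terms appearing in at least 2 of them
dtm_dense <- removeSparseTerms(dtm, 0.75)
dim(dtm_dense)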
4. Word frequency analysis
Sum each column to get the total count for each term:
freq <- colSums(as.matrix(dtm))
head(freq)
abide ability able abroad absolutely abuses
1 4 14 13 4 1
Sort freq in descending order; order() returns the positions of the sorted values:
ord <- order(-freq)
head(ord)
[1] 913 60 1386 991 755 922
Look at the six most frequent terms:
freq[head(ord)]
new america thats people jobs now
193 174 170 169 163 157
Look at the six least frequent terms:
freq[tail(ord)]
withers wordvoices worldexcept worldin worry yearsnamely
1 1 1 1 1 1
Tabulate the frequency of frequencies:
head() shows the rarest end, i.e. how many terms occur 1 to 6 times:
head(table(freq))
freq
1 2 3 4 5 6
2226 788 382 234 142 137
tail() shows the highest counts, each reached by a single term:
tail(table(freq))
freq
157 163 169 170 174 193
1 1 1 1 1 1
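Most of the vocabulary is rare. A quick check of how many terms occur exactly once, using the freq vector from above:
# Count singletons and their share of the 4715-term vocabulary
sum(freq == 1)     # 2226 terms appear exactly once
mean(freq == 1)    # about 0.47 of all terms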
Use findFreqTerms() to list terms that occur at least N times; here, at least 125 times:
findFreqTerms(dtm,125)
[1] "america" "american" "americans" "jobs" "make" "new" "now"
[8] "people" "thats" "work" "year" "years"
Use findAssocs() to compute correlations and find associations between terms.
For example, terms whose correlation with "job" is at least 0.9:
findAssocs(dtm,"job",corlimit = 0.9)
$job
wrong pollution forces together achieve training
0.97 0.96 0.93 0.93 0.93 0.91
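findAssocs() also accepts a vector of terms and returns one list element per term; a sketch (the term "economy" and the 0.85 cutoff are illustrative assumptions, not from the original output):
# Associations for several terms at once; corlimit applies to each term
findAssocs(dtm, c("job", "economy"), corlimit = 0.85)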
Generate a word cloud:
library(wordcloud)
wordcloud(names(freq),freq,min.freq = 70,scale = c(3,.3),colors = brewer.pal(6,"Dark2"))
[Figure: wordcloud01.png, word cloud of terms with frequency >= 70]
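The cloud layout is randomized, so each run looks different; fixing the RNG seed and capping the word count makes the figure reproducible. A sketch:
# Fix the seed so the layout is identical across runs
set.seed(1234)
wordcloud(names(freq), freq, min.freq = 70, max.words = 100,
          random.order = FALSE, scale = c(3, .3),
          colors = brewer.pal(6, "Dark2"))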