
Machine Learning Text Mining (1): Word Clouds

Author: changlugen | Published 2018-10-28 16:26

    This text mining walkthrough is built on the tm package; the data set consists of Obama's State of the Union addresses.
    Link: https://github.com/datameister66/data
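    If you prefer to fetch the files from within R rather than cloning the repository, a minimal sketch is shown below; the raw.githubusercontent.com URL pattern, the branch name, and the local txtData folder are assumptions about the repository layout, not part of the original post.

    # Hypothetical download sketch: fetch the seven speech files into ./txtData.
    # The raw-file URL pattern and branch name ("master") are assumptions.
    dir.create("txtData", showWarnings = FALSE)
    base_url <- "https://raw.githubusercontent.com/datameister66/data/master/"
    files <- paste0("sou", 2010:2016, ".txt")
    for (f in files) {
      download.file(paste0(base_url, f), destfile = file.path("txtData", f))
    }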

    1. Load the Data

    library(tm)

    Create a path pointing to the folder that contains the speech files

    name <- file.path("/Users/mac/rstudio-workplace/txtData")

    List the files in that directory

    dir(name)
    [1] "sou2010.txt" "sou2011.txt" "sou2012.txt" "sou2013.txt" "sou2014.txt" "sou2015.txt"
    [7] "sou2016.txt"

    Count the files in the directory

    length(dir(name))
    [1] 7

    Use Corpus() to build the corpus and name it docs

    docs <- Corpus(DirSource(name))

    You can use the inspect() function to view the corpus contents

    inspect(docs[1])
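    Depending on your version of tm, inspect() may only print summary metadata. To read the actual text of a document, a small sketch like this works:

    # Print the raw text of the first speech in the corpus
    writeLines(as.character(docs[[1]]))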

    2. Transform the Text with tm_map()

    Convert letters to lowercase: tolower (wrapped in content_transformer() so the documents keep their corpus structure in newer versions of tm)

    docs <- tm_map(docs,content_transformer(tolower))

    Remove numbers: removeNumbers

    docs <- tm_map(docs,removeNumbers)

    Remove punctuation: removePunctuation

    docs <- tm_map(docs,removePunctuation)

    Remove stop words: removeWords with stopwords("english")

    docs <- tm_map(docs,removeWords,stopwords("english"))

    Strip extra whitespace: stripWhitespace

    docs <- tm_map(docs,stripWhitespace)

    Remove other unwanted words: removeWords with a custom character vector

    docs <- tm_map(docs,removeWords,c("applause","can","cant","will","that","weve","dont","wont","youll","youre"))
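    An equivalent, slightly tidier alternative (a sketch; the myStopwords name is just illustrative) is to combine the built-in English stop words with the custom list and remove them in a single call:

    # Combine built-in stop words with the custom list (alternative sketch)
    myStopwords <- c(stopwords("english"),
                     "applause","can","cant","will","that","weve",
                     "dont","wont","youll","youre")
    # docs <- tm_map(docs,removeWords,myStopwords)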

    3. Build the Document-Term Matrix

    dtm <- DocumentTermMatrix(docs)

    The result has 7 documents and 4715 terms

    dim(dtm)
    [1] 7 4715

    Inspect the matrix

    inspect(dtm)
    <<DocumentTermMatrix (documents: 7, terms: 4715)>>
    Non-/sparse entries: 10899/22106
    Sparsity           : 67%
    Maximal term length: 17
    Weighting          : term frequency (tf)
    Sample             :
                  Terms
    Docs          america american jobs make new now people thats work years
      sou2010.txt      18       18   23   14  20  30     32    26   21    20
      sou2011.txt      18       19   25   23  36  25     31    24   20    25
      sou2012.txt      30       34   34   15  27  26     21    24   16    18
      sou2013.txt      24       19   32   20  24  35     18    18   20    22
      sou2014.txt      28       21   23   22  29  11     24    19   27    21
      sou2015.txt      35       19   18   23  41  15     22    30   20    25
      sou2016.txt      21       16    8   17  16  15     21    29   20    17

    Inspect just the portion of the matrix you want, for example the first three documents and first three terms

    inspect(dtm[1:3,1:3])
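    If the matrix is too sparse for later modeling, tm's removeSparseTerms() can drop the rarest terms. A sketch, where the 0.75 threshold is an arbitrary illustrative choice:

    # Keep only terms whose sparsity is at most 75%,
    # i.e. terms appearing in at least ~25% of the 7 documents
    dtm_dense <- removeSparseTerms(dtm,0.75)
    dim(dtm_dense)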

    4. Word Frequency Analysis

    Compute the column sums to get each term's total frequency across all speeches

    freq <- colSums(as.matrix(dtm))
    head(freq)
         abide    ability       able     abroad absolutely     abuses
             1          4         14         13          4          1

    Get the ordering that sorts freq in descending order (ord holds the column indices)

    ord <- order(-freq)
    head(ord)
    [1] 913 60 1386 991 755 922

    View the six most frequent terms

    freq[head(ord)]
    new america thats people jobs now
    193     174   170    169  163 157

    View the six least frequent terms

    freq[tail(ord)]

    withers  wordvoices worldexcept     worldin       worry yearsnamely 
          1           1           1           1           1           1 
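    A small frequency table makes it easy to plot the most common terms with base R; a minimal sketch (the data frame name wf is just illustrative):

    # Data frame of terms and counts, sorted from most to least frequent
    wf <- data.frame(term = names(freq), freq = freq, row.names = NULL)
    wf <- wf[order(-wf$freq), ]
    head(wf, 10)
    # Bar chart of the ten most frequent terms
    barplot(wf$freq[1:10], names.arg = wf$term[1:10], las = 2)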
    

    Look at the distribution of the term frequencies themselves

    head(table(freq)) shows how many terms occur at the lowest counts (for example, 2226 terms appear exactly once), and tail(table(freq)) shows the highest counts

    head(table(freq))
    freq
       1    2    3    4    5    6
    2226  788  382  234  142  137

    tail(table(freq))
    freq
    157 163 169 170 174 193
      1   1   1   1   1   1

    Use findFreqTerms() to find terms that occur at least N times; here, terms that appear at least 125 times

    findFreqTerms(dtm,125)
    [1] "america" "american" "americans" "jobs" "make" "new" "now"
    [8] "people" "thats" "work" "year" "years"
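    findFreqTerms() also takes an upper bound, which is handy for pulling out mid-frequency terms; a sketch with arbitrary cutoffs:

    # Terms occurring between 50 and 100 times across the seven speeches
    findFreqTerms(dtm, lowfreq = 50, highfreq = 100)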

    Use findAssocs() to compute correlations between terms and find word associations

    For example, terms whose correlation with "job" is at least 0.9

    findAssocs(dtm,"job",corlimit = 0.9)
    $job
    wrong pollution forces together achieve training
     0.97      0.96   0.93     0.93    0.93     0.91

    Generate the word cloud

    library(wordcloud)
    wordcloud(names(freq),freq,min.freq = 70,scale = c(3,.3),colors = brewer.pal(6,"Dark2"))


    [Figure: wordcloud01.png, the resulting word cloud]
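    The layout produced by wordcloud() is randomized, so setting a seed makes the figure reproducible. The same package also provides comparison.cloud() for contrasting word usage across documents; in the sketch below the seed and max.words values are arbitrary choices, not from the original post.

    # Reproducible version of the word cloud above
    set.seed(1234)
    wordcloud(names(freq),freq,min.freq = 70,scale = c(3,.3),colors = brewer.pal(6,"Dark2"))

    # Optional: contrast word usage across the seven speeches
    tdm <- as.matrix(TermDocumentMatrix(docs))
    comparison.cloud(tdm,max.words = 100,random.order = FALSE)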
