Word Clouds in R


Author: 北欧森林 | Published: 2020-12-30 00:34

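Background: this was an assignment set by Jimmy on the 生信技能树 WeChat public account: draw a word cloud from the titles of the CNS-level (Cell, Nature, Science) papers of the TCGA project. The code did not look hard and I felt I should be able to finish it, so I quietly gave it a try. Once I was done, it turned out my feeling was right.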

Making a word cloud breaks down into 5 steps:
  • Step 1: Create a text file
  • Step 2: Install and load the required packages
  • Step 3: Text mining
  • Step 4: Build a term-document matrix
  • Step 5: Generate the word cloud
Detailed steps and code:

Step 1: Create a text file

  • Copy and paste the text in a plain text file (e.g : ml.txt)
  • Save the file

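The .txt file for this assignment was built from the titles of TCGA-related CNS papers listed on two web pages. The first covers the 2018 TCGA Pan-Cancer project papers, all published in Cell and its sister journals. The second covers the 22 Pan-Cancer Analysis of Whole Genomes papers published in 2020 in Nature and its sister journals.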
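
Instead of pasting by hand, the file can also be written from R. This is only a sketch: the two titles are placeholders (not real paper titles), and the file name tcga_titles.txt and the use of a temporary directory are my own assumptions.

```r
# Placeholder titles for illustration only; replace them with the real paper titles.
titles <- c(
  "Placeholder title one about pan-cancer analysis",
  "Placeholder title two about whole genomes"
)

# Hypothetical output location: a temporary directory keeps the demo self-contained.
out_file <- file.path(tempdir(), "tcga_titles.txt")

# Write one title per line so that readLines() later returns one element per title.
writeLines(titles, out_file)

# Read the file back to confirm the round trip.
check <- readLines(out_file)
print(check)
```

Keeping one title per line matters later on, because each line becomes one document in the corpus.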

Step 2: Install and load the required packages

# Install
install.packages("tm")  # for text mining
install.packages("SnowballC") # for text stemming
install.packages("wordcloud") # word-cloud generator 
install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

Step 3: Text mining
3.1 Load the text

# import the text file created in Step 1
text <- readLines(file.choose())

# Load the data as a corpus
docs <- Corpus(VectorSource(text))

# Inspect the content of the document
inspect(docs)
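
file.choose() asks the user to pick a file, which is awkward in a script that should run unattended. A fixed path can be passed to readLines() instead. A minimal sketch, assuming a UTF-8 file with one title per line; the path, the variable name lines_clean and the sample lines are made up for the demo:

```r
# Hypothetical path; a temporary file stands in for the file created in Step 1.
path <- file.path(tempdir(), "titles_demo.txt")
writeLines(c("First placeholder title", "", "  Second placeholder title  "), path)

# Read the file with an explicit encoding instead of picking it interactively.
lines_raw <- readLines(path, encoding = "UTF-8", warn = FALSE)

# Strip leading/trailing whitespace and drop empty lines,
# so that blank lines do not turn into empty documents in the corpus.
lines_clean <- trimws(lines_raw)
lines_clean <- lines_clean[nzchar(lines_clean)]
print(lines_clean)
```

The cleaned vector can then be passed to VectorSource() in the same way as text above.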

3.2 Text transformation

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
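
To see what toSpace does, the same gsub() call can be tried on a single string without building a corpus. The sample string and the helper name replace_with_space are invented for illustration. The "|" is escaped as "\\|" because a bare | is a regex metacharacter (alternation).

```r
# Same replacement as toSpace, but as a plain function on a character vector.
replace_with_space <- function(x, pattern) gsub(pattern, " ", x)

# Made-up sample string containing the three characters replaced above.
sample_title <- "tumor/normal pairs | user@example.org"

step1 <- replace_with_space(sample_title, "/")    # "/" becomes a space
step2 <- replace_with_space(step1, "@")           # "@" becomes a space
step3 <- replace_with_space(step2, "\\|")         # literal "|" becomes a space
print(step3)
```

The extra spaces this leaves behind are not a problem, since stripWhitespace collapses them in the cleaning step.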

3.3 Cleaning the text

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop words
# (replace the placeholders with a character vector of your own words)
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white space
docs <- tm_map(docs, stripWhitespace)
# Text stemming (optional; left disabled here)
# docs <- tm_map(docs, stemDocument)
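
To check what the cleaning steps leave behind, they can be run on a tiny corpus first. This is a sketch under my own assumptions: the two sentences are made up, "cancer" is used as an example of a custom stop word, and VCorpus is used here instead of Corpus.

```r
library(tm)

# Two made-up titles standing in for the real text.
toy <- VCorpus(VectorSource(c("The 33 Genomes of Cancer!",
                              "Cancer genomes: an atlas")))

# Hypothetical custom stop word: a term so common in the titles that it says little.
my_stopwords <- c("cancer")

# Same order of steps as in the cleaning code above.
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, removeNumbers)
toy <- tm_map(toy, removeWords, stopwords("english"))
toy <- tm_map(toy, removeWords, my_stopwords)
toy <- tm_map(toy, removePunctuation)
toy <- tm_map(toy, stripWhitespace)

# Pull the cleaned text back out as a plain character vector.
cleaned <- trimws(unname(vapply(toy, as.character, character(1))))
print(cleaned)
```

Lower-casing comes before removeWords on purpose: the stop word lists are lower case, so "The" would otherwise survive.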

Step 4: Build a term-document matrix

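# Count how often each term occurs in each document (terms are rows)
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
# Sum across documents and sort the terms by total frequency
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# Show the 10 most frequent terms
head(d, 10)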

[Figure: wordclouds1.jpg]
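
A small example makes the matrix easier to picture: each row is a term, each column a document, and rowSums() adds up the counts per term. It also shows one thing to watch for: by default tm drops terms shorter than three characters, which the wordLengths control option changes. The three documents below are made up.

```r
library(tm)

# Three made-up, already cleaned documents.
toy <- VCorpus(VectorSource(c("map of cancer genomes",
                              "cancer genomes",
                              "cancer")))

# Default settings: terms with fewer than 3 characters (here "of") are dropped.
m_default <- as.matrix(TermDocumentMatrix(toy))
freq <- sort(rowSums(m_default), decreasing = TRUE)
print(freq)

# Keep short terms as well by lowering the minimum word length to 1.
m_all <- as.matrix(TermDocumentMatrix(toy, control = list(wordLengths = c(1, Inf))))
print(rownames(m_all))
```

For paper titles this default is worth knowing about, since short gene names or abbreviations can silently disappear from the word cloud.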

Step 5: Generate the word cloud

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Result:

[Figure: WordClouds.png]
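
To keep the picture as a file rather than only in the plot window, the wordcloud() call can be wrapped in png() and dev.off(). A sketch with made-up frequencies; the data frame d_demo, the file name and the image size are assumptions, not the values used for the figure above.

```r
library(wordcloud)
library(RColorBrewer)

# Made-up word frequencies standing in for the data frame d from Step 4.
d_demo <- data.frame(word = c("cancer", "genomes", "atlas", "pan", "analysis"),
                     freq = c(12, 9, 5, 4, 3))

# Hypothetical output file in a temporary directory.
out_png <- file.path(tempdir(), "wordcloud_demo.png")

# Open a PNG device, draw the cloud, then close the device to write the file.
png(out_png, width = 1600, height = 1600, res = 300)
set.seed(1234)  # fixes the random layout so the picture is reproducible
wordcloud(words = d_demo$word, freq = d_demo$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
invisible(dev.off())
```

Here random.order = FALSE places words in decreasing order of frequency, and rot.per = 0.35 is the proportion of words drawn rotated by 90 degrees.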

References

The reference code for this post comes from: Text mining and word cloud fundamentals in R : 5 simple steps you should know
