写在前面
之前写了publish or perish的使用方法。那么问题来了,我们获取了这些数据能拿来干什么?于是萌生了一个想法,不如统计一下引用TBtools的文章,统计一下每年文章发表数量(截止12月初),发表的层次,用户都用了哪些功能等等。当初的想法是我一个滑铲就全部搞定了,但实现起来嘛……一干就是两个周,并且短期内我是再也不想看基因挖掘相关的任何东西了……统计到最后实在是想吐了,如果你发现你发的文章被错误的统计了,或者没统计到……欢迎私信告诉我,我一定不改(
数据导入和准备工作
这里的话数据我已经整理好了,如果你也感兴趣的话点这里进行下载(https://cowtransfer.com/s/6f63b578a5e34f)。
首先载入数据
#Wipe the environment
rm(list=ls())
#Set the work directory
setwd("/Users/yangshao/OneDrive")
#Import the data
dta <- read.csv("clean_cite.csv", header = T)
#Take a peek at the data
skim(dta)
#Dimension of the data
dim(dta)
PS: 这里是我的地址,不要无脑复制……
之后载入需要用到的包,接下来就可以进行分析了!如果没有安装对应的包,先让电脑安装就好了。实在不会可以让你师兄先来学,然后教你(
library(skimr)
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(ggthemes)
library(tidytext)
library(stringr)
library(wordcloud)
library(networkD3)
越来越多的文章使用并引用TBtools
首先是每年引用文章数的可视化
#Citation per year
cite_trend <- sort(table(dta$Year), decreasing = TRUE)
cite_trend <- as.data.frame(cite_trend)
#Visualization
colnames(cite_trend) <- c('Year','Num.')
ggbarplot(cite_trend, x = "Year", fill = "Year", order = c('2018','2019','2020'), y = "Num.", palette = 'jco') +
scale_y_continuous(expand = c(0, 0))
从2018年至2020年,TBtools的每年引用数呈现上升趋势(废话)。截止2020年12月初,2020年的引用数已经超过500次。需要说明的是,这里的数据是相对比较保守的数据,实际引用数可能远不止这些。
TBtools的用户都喜欢把文章发在谁家?
使用以下代码完成可视化,别问我为啥用ggpubr……
#Sort the top 15 publisher
cite_publisher <- head(sort(table(dta$Publisher), decreasing = TRUE),15)
#Transform the data to a dataframe
cite_publisher <- as.data.frame(cite_publisher)
#Rename the col names
colnames(cite_publisher) <- c('Publisher','Count')
#Visualization
ggbarplot(cite_publisher, x = "Publisher", fill = "Publisher", y = "Count") +
scale_y_continuous(expand = c(0, 0)) +
theme_pubr(
base_family = "",
border = FALSE,
margin = TRUE,
legend = 'none',
x.text.angle = 45
)
Cite_Publisher
截止2020年12月初,mdpi发表了多数文章,施普林格(包含BMC)和爱斯维尔紧随其后。
三个出版商一共发表了538篇,占总数的65.1%。
cite_publisher %>% head(3) %>% select(Count) %>% sum()
cite_publisher %>% head(3) %>% select(Count) %>% sum()/count(dta)
那么引用了TBtools的文章发表在了哪些杂志上呢?同样的方式完成可视化,别问我为什么不用ggpubr……
#Sort the top 15 Journal
cite_source <- head(sort(table(dta$Source), decreasing = TRUE),15)
#Transform the data to a dataframe
cite_source <- as.data.frame(cite_source)
#Rename the col
colnames(cite_source) <- c('Journal Name','Count')
#Visualization
ggplot(cite_source,aes(x = cite_source$`Journal Name`, y = Count, fill = cite_source$`Journal Name`)) +
geom_bar(stat='identity') +
scale_y_continuous(expand = c(0, 0)) +
xlab("Journal Name") +
guides(fill=guide_legend(title="Journal Name")) +
theme_pubr(
base_family = "",
border = FALSE,
margin = TRUE
) + rremove("x.text")
Cite_journal
从结果上看,多数文章发在了OA杂志上,例如来自mdpi的International Journal of Molecular Sciences,例如来自Springer的BMC Plant Biology和BMC Genomics,例如来自Frontier的Frontier in Genetics和Frontiers in Physiology,例如PeerJ, PloS等……按你胃,有钱发OA真好……
用户最爱用热图
每篇文章都会在文章里提到使用了TBtools的哪些功能。很遗憾这部分功能只能手动登录了,因为很多情况下,大家对同一个功能叙述存在差异,例如同样是一个热图功能有的人说'the DEGs were visualized by TBtools'有的人会说'the heatmaps were used to show/draw 。。。。(TBtools, gihub)'或者是'the correlation coefficients were visualized by TBtools' 等等。还有Circos图,很多时候难以分辨到底是Circos画的还是用TBtools画的,这种情况提到了TBtools索性都用Synteny代替了。还有一些懒得索性就说'the data were visualized by aaa, bbb, and TBtools' 的,这种也只能自己手动找。笔者已经很认真的过滤了,但人工做这方面工作难免会出错,有些时候甚至看到某几个图就把几个常用功能一列完事了……所以如果你发现我列举的功能跟文章提到的有出入,李姐万岁……
在这里分析的时候需要先做成长透视表(pivot_longer),然后选择数据开始作图,这里因为需要做三次同样的图,干脆就把表格的命令做成一个函数,直接调用就好了。这样显得不那么弱智(
#Create a pivot table
dta %>% select(Year,17:25) %>% pivot_longer(!Year,
names_to = 'Name',
values_to = 'Fun') -> Fun_count_pivot
#Function for counting
Fun_count <- function(year, Pivot){
table(Pivot %>% filter(Year == year) %>% .$Fun) %>%
as.data.frame() %>% .[-1,] -> Count_Year
colnames(Count_Year) <-c('Function', 'Count')
top_three <- tail(Count_Year[order(Count_Year$Count),],3)
ggbarplot(Count_Year, x = "Function", fill = "Function", y = "Count") +
scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0,.1))) +
theme_pubr(
base_family = "",
border = FALSE,
margin = TRUE
) + rremove("x.text") + geom_text(top_three, mapping = aes(label = Function), nudge_y = top_three$Count*.2) + geom_text(top_three, mapping = aes(label = Count), nudge_y = top_three$Count*.1)
}
#Visualization the mentioned functions in papers
Fun_count("2018",Fun_count_pivot)
Fun_count("2019",Fun_count_pivot)
Fun_count("2020",Fun_count_pivot)
2018年
2019年
2020年
从2018年到2020年,用户使用TBtools的姿势越来越多:2018年,主要提及的功能只有15个,而到了2020年文章中提及的功能有29个。从2019年开始,TBtools用户群体最多使用的三个功能为:'Heatmap', 'motif'和'Gene structure'。
2018 | Count | 2019 | Count | 2020 | Count | |
---|---|---|---|---|---|---|
1 | motif | 8 | Heatmap | 71 | Heatmap | 315 |
2 | Synteny | 8 | motif | 66 | motif | 183 |
3 | Gene structure | 7 | Gene structure | 45 | Gene structure | 171 |
4 | GO | 7 | Synteny | 41 | Synteny | 102 |
5 | Heatmap | 6 | Gene location | 30 | Phylogenetic tree | 96 |
6 | Gene location | 5 | Phylogenetic tree | 22 | Gene location | 88 |
7 | KEGG | 5 | GO | 15 | Venn diagram | 43 |
8 | Venn diagram | 4 | Venn diagram | 14 | GO | 37 |
9 | Fasta extraction | 3 | Fasta extraction | 9 | Seqlogo | 35 |
10 | Phylogenetic tree | 3 | MEME | 8 | Fasta extraction | 21 |
11 | Seqlogo | 3 | blastv | 7 | Hierarchical clustering | 21 |
12 | blast | 2 | Ks&Ka | 6 | Ks&Ka | 21 |
13 | MEME | 2 | Seqlogo | 6 | KEGG | 13 |
14 | eFP | 1 | KEGG | 5 | Upset plot | 13 |
15 | Ks&Ka | 1 | Hierarchical clustering | 3 | MEME | 12 |
最多提及的功能可以通过以下代码查看
Fun_sum <- function(year, Pivot){
table(Pivot %>% filter(Year == year) %>% .$Fun) %>%
as.data.frame() %>% .[-1,] -> Fun_sum
colnames(Fun_sum) <- c('Function', 'Count')
Fun_sum <- head(Fun_sum[order(-Fun_sum$Count),],15)
return(Fun_sum)
}
Count_2018 <- Fun_sum(2018,Fun_count_pivot)
Count_2019 <- Fun_sum(2019,Fun_count_pivot)
Count_2020 <- Fun_sum(2020,Fun_count_pivot)
data.frame("2018" = Count_2018,"2019" = Count_2019, "2020" = Count_2020) %>% view()
与2019年相比,今年的TBtools用户更喜欢用TBtools做进化树,韦恩图和fasta文件的相关操作。在2020年,TBtools的多数功能确实都有用户在用。尤其是陈博士重点照顾的Heatmap,gene structure annotation,motif和phylogenetic tree。常用到你不一起做就觉得少了点什么的程度。当然,这也侧面反映了今年家族挖掘的研究还是很流行……
使用了TBtools的文章都发在了什么水平杂志上
我知道IF这东西不好做横向对比。之前也有想法做个TBtools到底有没有帮助做着把文章发在更好的杂志上。但最后还是放弃了。说到底我认为一篇文章能否发在更好的杂志上还得看你的文章是否有真东西。工具是帮助你更好的展现你的想法,讲好你的故事而不是帮你做一些花里胡哨的图糊弄reviewer的。
所以在这里做一下TBtools这三年的IF趋势吧。
IF_by_year <- dta %>% select(Year, Source, ImpactFactor)
IF_by_year$Year <- as.character(IF_by_year$Year)
IF_by_year$ImpactFactor <- as.numeric(IF_by_year$ImpactFactor)
gghistogram(IF_by_year, x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8,xticks.by = 1)
gghistogram(IF_by_year, x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8, facet.by = "Year",xticks.by = 1)
有意思的是,这三年TBtools的文章主要落在了IF=3-4的区间。而在2020年,IF的两端出现了凸起。:一方面,拿家族基因挖掘当财富密码的人越来越多。这块相互竞争激烈,一些文章不得不选择低IF的杂志发表文章。而另一方面也有不少文章发表在了老牌期刊上,而这些期刊的IF常常落在6-7的区间。当然这里有个例外:'Genomics',根据学之策公众号的实时IF统计显示,'Genomics'的实时IF实际在3.2左右。我能说什么呢,套路深啊……
dta %>% select(Year,Source, ImpactFactor) %>%
filter(ImpactFactor < 7 & ImpactFactor > 6 & Source != 'Genomics') %>% group_by(Source) %>%
summarise(Count = n()) %>% ggplot(aes(x=Source, y=Count, fill = Source)) +
geom_bar(stat='identity',show.legend = FALSE) +
scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0,.1))) +
coord_flip() + theme_few()
为了减少这部分的影响,这里剔除了这个'Genomics'期刊。
IF_by_year %>%
filter(Source != 'Genomics') %>%
gghistogram(., x = 'ImpactFactor', bins = 100, add = "mean", color = "Year", palette = 'mpg', rug = TRUE,add_density = TRUE, alpha = .8,xticks.by = 1)
dta %>% select(Year,Source, ImpactFactor,17:25) %>%
filter(ImpactFactor > 6& Source != 'Genomics') %>% view()
这里再次看了下6分以上的文章都喜欢用TBtools的什么功能。一看不得了,这可不是随便用的啊,一看就是有备而来。最多用的功能还是Heatmap。广州这个地方肯定很热……嗯,希望没什么奇怪的联想。这里就不做图了,感兴趣的不妨自己试试。
TBtools用户都喜欢研究什么
这个问题说真的我很难回答,毕竟做这个我不专业。大概琢磨了下分词,弄了个词云出来看看。
dta %>% select(Year, Title) %>%
unnest_tokens(output = 'word', input = 'Title', token = "words", to_lower = TRUE) %>% #Text segmentation
select(word) %>%
anti_join(stop_words) %>% #delete the stop words
count(word, sort = TRUE) -> word_count
set.seed(2020) #for reproducibility
wordcloud(words = word_count$word,
freq = word_count$n,
min.freq = 1,
max.words = 500,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(8, "Dark2")
)
wordcloud
hmm,genome-wid and gene family expression analysis/identification...这还能说什么呢。可涨点薪吧。
最后,因为植物口出身所以看一下都用了什么植物做研究……
get_source <- function(name){
dta %>% select(Title, ImpactFactor) %>%
mutate(lower_title = tolower(Title)) %>%
filter(str_detect(lower_title, name)) %>%
select(ImpactFactor)
}
list <- c('cotton','arabidopsis', 'tea','potato','rice','tomato','wheat','soybean','cucumber','barley','maize','tobacco','brachypodium','pepper','radish')
list_fruit <- c('pear','apple','citrus','cucumis','sugarcane','strawberry','dimocarpus','grape','litchi','banana','watermelon','jujube','peach','prunus')
index <- data.frame(list.i. = '', n = '')
for (i in 1:length(list)){
count <- get_source(list[i]) %>% count()
index_i <- data.frame(list[i],count)
index <- rbind(index, index_i)
}
index <- index[-1,]
colnames(index) <- c('Species name','Count')
index$Count <- as.numeric(index$Count)
index <- index[order(-index$Count),]
view(index)
Species name | Count |
---|---|
wheat | 56 |
tea | 36 |
cotton | 27 |
rice | 26 |
soybean | 18 |
cucumber | 18 |
potato | 15 |
arabidopsis | 12 |
tomato | 10 |
barley | 10 |
maize | 9 |
tobacco | 6 |
brachypodium | 5 |
pepper | 5 |
radish | 5 |
嗨,我觉得该改标题了,就叫什么?研究小麦的居然还不知道TBtools?好了……本来还想一个滑铲把家族基因统计了,我怕又铲到沟里去,算了……另外还有个包含水果的list(我知道这么分类不专业,别骂了别骂了),感兴趣的可以自己看看。
写在最后
没啥想说的了,最后搞个山鸡图和热图祝大家科研顺利吧……
#Create a new dataset
dta %>% filter(Source != 'Genomics') %>%
select(Year, Publisher, Source, ImpactFactor, 17:25) %>%
pivot_longer(cols = starts_with('Fun_'),
names_to = 'Name',
values_to = 'Fun') %>%
select(Year, Publisher, Source, ImpactFactor, Fun) %>%
filter(Fun != '') -> sankey_dta
#Add an IF column
sankey_dta %>%
mutate(IF = ifelse(ImpactFactor > 6,'General Impact',
ifelse(ImpactFactor < 2, 'Have Impact',
ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific Impact',"no")))) %>%
select(Year,Publisher,Source,ImpactFactor, IF,Fun) -> sankey_dta
sankey_dta %>% group_by(Source,ImpactFactor) %>% filter(ImpactFactor>5|n()>10) -> sankey_dta
sankey_dta %>% select(Year,Publisher) %>%
group_by(Year,Publisher) %>%
summarise(value = n()) -> sankey_index
colnames(sankey_index) <- c('source','target','value')
sankey_index$source <- as.character(sankey_index$source)
sankey_index <- sankey_dta %>% select(Publisher,Source) %>%
group_by(Publisher,Source) %>%
summarise(value = n()) %>%
rename(.,source = Publisher, target = Source) %>%
rbind(sankey_index,.)
sankey_dta %>% select(Source,ImpactFactor) %>%
mutate(IF = ifelse(ImpactFactor > 6,'General_Impact',
ifelse(ImpactFactor < 2, 'Have_Impact',
ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific_Impact',"no")))) %>%
group_by(Source,IF) %>%
summarise(value = n()) %>% rename(.,source = Source, target = IF) %>%
rbind(sankey_index,.) -> sankey_index
sankey_dta %>% select(ImpactFactor,Fun) %>%
mutate(IF = ifelse(ImpactFactor > 6,'General_Impact',
ifelse(ImpactFactor < 2, 'Have_Impact',
ifelse(ImpactFactor > 2 & ImpactFactor < 6, 'Specific_Impact',"no")))) %>%
group_by(IF,Fun) %>%
summarise(value = n()) %>% rename(.,source = IF, target = Fun) %>%
rbind(sankey_index,.) -> sankey_index
sankey_index[order(sankey_index$source,sankey_index$target),]
nodes <- data.frame(
name=c(as.character(sankey_index$source), as.character(sankey_index$target)) %>%
unique()
)
sankey_index$IDsource <- match(sankey_index$source, nodes$name)-1
sankey_index$IDtarget <- match(sankey_index$target, nodes$name)-1
sankey_index[order(sankey_index$IDsource,sankey_index$IDtarget),]
p <- sankeyNetwork(Links = sankey_index, Nodes = nodes, Source = "IDsource", Target = "IDtarget",
Value = "value", NodeID = "name", fontSize = 12, nodeWidth = 20,LinkGroup="source")
p
sankey
热图当然用TBtools做了。最近天气好冷,希望能热起来。嗯
heat.jpg
网友评论