**Article name: **decoupleR: ensemble of computational methods to infer biological activities from omics data
**Journal: Bioinformatics Advances
Doi: https://doi.org/10.1093/bioadv/vbac016
IF: **创刊
在生信分析过程中,哪些基因有可能导致这个通路的激活/抑制,其实是比较关键的,今天的推送就是为此而来!
1
环境配置
首先,吸取教训,将我的配置环境分享出来,这样读者在安装包时也能有个参考。
> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Chinese (Simplified Han)_Hong Kong SAR.utf8
[2] LC_CTYPE=Chinese (Simplified Han)_Hong Kong SAR.utf8
[3] LC_MONETARY=Chinese (Simplified Han)_Hong Kong SAR.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified Han)_Hong Kong SAR.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggrepel_0.9.1 pheatmap_1.0.12 ggplot2_3.3.6 tidyr_1.2.1 tibble_3.1.7
[6] dplyr_1.0.9 decoupleR_2.3.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 pillar_1.8.1 compiler_4.2.1 cellranger_1.1.0
[5] RColorBrewer_1.1-3 BiocManager_1.30.18 tools_4.2.1 lifecycle_1.0.3
[9] gtable_0.3.1 lattice_0.20-45 pkgconfig_2.0.3 rlang_1.0.6
[13] Matrix_1.4-1 DBI_1.1.3 cli_3.3.0 rstudioapi_0.13
[17] withr_2.5.0 generics_0.1.3 vctrs_0.4.1 grid_4.2.1
[21] tidyselect_1.2.0 glue_1.6.2 R6_2.5.1 fansi_1.0.3
[25] readxl_1.4.1 purrr_0.3.4 magrittr_2.0.3 scales_1.2.1
[29] ellipsis_0.3.2 assertthat_0.2.1 colorspace_2.0-3 utf8_1.2.2
[33] munsell_0.5.0
2
需要安装的包
library(decoupleR)
library(dplyr)
library(tibble)
library(tidyr)
library(ggplot2)
library(pheatmap)
library(ggrepel)
3
正片开始
如果所有的包都成功library之后,就可以正式分析了,这一步主要是把所用到的3个数据整合到了一个list中,而我们自己分析的话仅需符合它的条件即可。 需要注意的是,示例数据中并非count值,后续使用的也不是count值所使用的分析方法,仅为举栗子,自己分析时输入输出可做调整。
inputs_dir <- system.file("extdata", package = "decoupleR")
data <- readRDS(file.path(inputs_dir, "bk_data.rds"))
# count矩阵
> summary(data$counts)
gene PANC1.WT.Rep1 PANC1.WT.Rep2 PANC1.WT.Rep3 PANC1.FOXA2KO.Rep1
Length:11093 Min. : 6.198 Min. : 6.265 Min. : 6.167 Min. : 6.254
Class :character 1st Qu.: 7.726 1st Qu.: 7.772 1st Qu.: 7.785 1st Qu.: 7.718
Mode :character Median : 9.187 Median : 9.237 Median : 9.236 Median : 9.203
Mean : 9.378 Mean : 9.417 Mean : 9.402 Mean : 9.400
3rd Qu.:10.749 3rd Qu.:10.808 3rd Qu.:10.776 3rd Qu.:10.803
Max. :18.526 Max. :18.481 Max. :18.197 Max. :18.489
NA's :67 NA's :140 NA's :82 NA's :62
PANC1.FOXA2KO.Rep2 PANC1.FOXA2KO.Rep3
Min. : 6.220 Min. : 6.247
1st Qu.: 7.776 1st Qu.: 7.775
Median : 9.226 Median : 9.259
Mean : 9.404 Mean : 9.420
3rd Qu.:10.779 3rd Qu.:10.825
Max. :18.333 Max. :18.230
NA's :77 NA's :157
# 实验设计
> data$design
# A tibble: 6 × 2
sample condition
<chr> <chr>
1 PANC1.WT.Rep1 PANC1.WT
2 PANC1.WT.Rep2 PANC1.WT
3 PANC1.WT.Rep3 PANC1.WT
4 PANC1.FOXA2KO.Rep1 PANC1.FOXA2KO
5 PANC1.FOXA2KO.Rep2 PANC1.FOXA2KO
6 PANC1.FOXA2KO.Rep3 PANC1.FOXA2KO
# 差异分析结果
> head(data$limma_ttop)
# A tibble: 6 × 7
ID logFC AveExpr t P.Value adj.P.Val B
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 RHBDL2 -1.82 8.60 -12.8 0.00000303 0.0336 3.95
2 PLEKHH2 -1.57 7.70 -10.8 0.00000993 0.0548 3.27
3 HEG1 -1.73 8.64 -9.79 0.0000194 0.0548 2.84
4 CLU -1.79 12.2 -9.76 0.0000198 0.0548 2.83
5 FHL1 2.09 9.61 8.95 0.0000355 0.0788 2.43
6 RBP4 -1.73 7.39 -8.53 0.0000490 0.0862 2.19
关于前期差异分析流程,可以移步gzh
原文还对数据进行了预处理
# 删除NA值,第一列作为行名
counts <- data$counts %>%
dplyr::mutate_if(~ any(is.na(.x)), ~ if_else(is.na(.x),0,.x)) %>%
column_to_rownames(var = "gene") %>%
as.matrix()
head(counts)
#> PANC1.WT.Rep1 PANC1.WT.Rep2 PANC1.WT.Rep3 PANC1.FOXA2KO.Rep1 PANC1.FOXA2KO.Rep2 PANC1.FOXA2KO.Rep3
#> NOC2L 10.052588 11.949123 12.057774 12.312291 12.139918 11.494205
#> PLEKHN1 7.535115 8.125993 8.714880 8.048196 8.290154 8.621239
#> PERM1 6.281242 6.424582 6.589668 6.293285 6.486136 6.775344
#> ISG15 10.938252 11.469081 11.425415 11.549986 11.371464 11.178157
#> AGRN 6.956335 7.196108 7.522550 7.061549 7.485534 7.071555
#> C1orf159 9.546224 9.788721 9.794589 9.850830
# 提取gene与t value的关系
deg <- data$limma_ttop %>%
select(ID, t) %>%
filter(!is.na(t)) %>%
column_to_rownames(var = "ID") %>%
as.matrix()
head(deg)
#> t
#> RHBDL2 -12.810588
#> PLEKHH2 -10.794453
#> HEG1 -9.788112
#> CLU -9.761618
#> FHL1 8.950191
#> RBP4 -8.529074
然后利用PROGENy数据库,获取通路中基因的weigh和*P *值,为后续分析做准备。(仅支持人和小鼠),top主要限制通路中基因的数量(P排序)
net <- get_progeny(organism = 'human', top = 100)
[2022-11-04 23:00:15] [SUCCESS] [OmnipathR] Loaded 700257 annotation records from cache.
> net
# A tibble: 1,400 × 4
source target weight p_value
<chr> <chr> <dbl> <dbl>
1 Androgen TMPRSS2 11.5 2.38e-47
2 Androgen NKX3-1 10.6 2.21e-44
3 Androgen MBOAT2 10.5 4.63e-44
4 Androgen KLK2 10.2 1.94e-40
5 Androgen SARG 11.4 2.79e-40
6 Androgen SLC38A4 7.36 1.25e-39
7 Androgen MTMR9 6.13 2.53e-38
8 Androgen ZBTB16 10.6 1.57e-36
9 Androgen KCNN2 9.47 7.71e-36
10 Androgen OPRK1 -5.63 1.11e-35
# … with 1,390 more rows
# ℹ Use `print(n = ...)` to see more rows
既然有了权重,自然可以使用加权平均的方法来计算每个样本的得分,这里使用的是wmean方法,当然也可以使用show_methods()查看其他可以使用的方法,输入数据主要为count值和上一步得到了net。
sample_acts <- run_wmean(mat=counts, net=net, .source='source', .target='target',
.mor='weight', times = 100, minsize = 5)
> show_methods()
# A tibble: 12 × 2
Function Name
<chr> <chr>
1 run_aucell AUCell
2 run_consensus Consensus score between methods
3 run_fgsea Fast Gene Set Enrichment Analysis (FGSEA)
4 run_gsva Gene Set Variation Analysis (GSVA)
5 run_mdt Multivariate Decision Trees (MDT)
6 run_mlm Multivariate Linear Model (MLM)
7 run_ora Over Representation Analysis (ORA)
8 run_udt Univariate Decision Tree (UDT)
9 run_ulm Univariate Linear Model (ULM)
10 run_viper Virtual Inference of Protein-activity by Enriched Regulon analysis (VIPER)
11 run_wmean Weighted Mean (WMEAN)
12 run_wsum Weighted Sum (WSUM)
选择好看的颜色和渐变效果
palette_length = 100
my_color = colorRampPalette(c("Darkblue", "white","red"))(palette_length)
my_breaks <- c(seq(-3, 0, length.out=ceiling(palette_length/2) + 1),
seq(0.05, 3, length.out=floor(palette_length/2)))
然后绘制好看的热图,总体聚类结果还行,就是有个WT_Rep3效果不是特别理想,然后可以看到每个样本中通路的上下调情况,红色上调蓝色下调。当然,效果不好可以尝试使用其他方法再次计算。
pheatmap(sample_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)
也可以更换行顺序
pheatmap(sample_acts_mat, border_color = NA, color=my_color, breaks = my_breaks,row_order = rownames(sample_acts_mat),cluster_rows = F)
当然,差异分析结果也可以利用起来,查看实验组与对照组之间异常活化或抑制的通路。
contrast_acts <- run_wmean(mat=deg, net=net, .source='source', .target='target',
.mor='weight', times = 100, minsize = 5)
contrast_acts
f_contrast_acts <- contrast_acts %>%
filter(statistic == 'norm_wmean')
ggplot(f_contrast_acts, aes(x = reorder(source, score), y = score)) +
geom_bar(aes(fill = score), stat = "identity") +
scale_fill_gradient2(low = "darkblue", high = "indianred",
mid = "whitesmoke", midpoint = 0) +
theme_minimal() +
theme(axis.title = element_text(face = "bold", size = 12),
axis.text.x =
element_text(angle = 45, hjust = 1, size =10, face= "bold"),
axis.text.y = element_text(size =10, face= "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
xlab("Pathways")
最后,哪些基因在通路中起了积极/消极的作用呢?
pathway <- 'JAK-STAT'
df <- net %>%
filter(source == pathway) %>%
arrange(target) %>%
mutate(ID = target, color = "3") %>%
column_to_rownames('target')
inter <- sort(intersect(rownames(deg),rownames(df)))
df <- df[inter, ]
df['t_value'] <- deg[inter, ]
df <- df %>%
mutate(color = if_else(weight > 0 & t_value > 0, '1', color)) %>%
mutate(color = if_else(weight > 0 & t_value < 0, '2', color)) %>%
mutate(color = if_else(weight < 0 & t_value > 0, '2', color)) %>%
mutate(color = if_else(weight < 0 & t_value < 0, '1', color))
ggplot(df, aes(x = weight, y = t_value, color = color)) + geom_point() +
scale_colour_manual(values = c("red","royalblue3","grey")) +
geom_label_repel(aes(label = ID)) +
theme_minimal() +
theme(legend.position = "none") +
geom_vline(xintercept = 0, linetype = 'dotted') +
geom_hline(yintercept = 0, linetype = 'dotted') +
ggtitle(pathway)
红色表示权重与t值成正相关,蓝色相反。t值正负又代表什么呢?其实看下图就明白啦!
最后~需要完整代码的小伙伴后台回复decoupleR,所有我运行成功的结果也保存到了results.Rdata中,足以复现。
敬请批评指正!
网友评论