美文网首页
decoupleR:丰富你的通路富集分析结果

decoupleR:丰富你的通路富集分析结果

作者: 生命数据科学 | 来源:发表于2022-12-13 00:18 被阅读0次
    图片

    **Article name: **decoupleR: ensemble of computational methods to infer biological activities from omics data
    **Journal: Bioinformatics Advances
    Doi: https://doi.org/10.1093/bioadv/vbac016
    IF: **创刊

    在生信分析过程中,哪些基因有可能导致这个通路的激活/抑制,其实是比较关键的,今天的推送就是为此而来!

    1

    环境配置

    首先,吸取教训,将我的配置环境分享出来,这样读者在安装包时也能有个参考。

    
    > sessionInfo()
    R version 4.2.1 (2022-06-23 ucrt)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 19043)
    
    Matrix products: default
    
    locale:
    [1] LC_COLLATE=Chinese (Simplified Han)_Hong Kong SAR.utf8 
    [2] LC_CTYPE=Chinese (Simplified Han)_Hong Kong SAR.utf8   
    [3] LC_MONETARY=Chinese (Simplified Han)_Hong Kong SAR.utf8
    [4] LC_NUMERIC=C                                           
    [5] LC_TIME=Chinese (Simplified Han)_Hong Kong SAR.utf8    
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
    [1] ggrepel_0.9.1   pheatmap_1.0.12 ggplot2_3.3.6   tidyr_1.2.1     tibble_3.1.7   
    [6] dplyr_1.0.9     decoupleR_2.3.2
    
    loaded via a namespace (and not attached):
     [1] Rcpp_1.0.8.3        pillar_1.8.1        compiler_4.2.1      cellranger_1.1.0   
     [5] RColorBrewer_1.1-3  BiocManager_1.30.18 tools_4.2.1         lifecycle_1.0.3    
     [9] gtable_0.3.1        lattice_0.20-45     pkgconfig_2.0.3     rlang_1.0.6        
    [13] Matrix_1.4-1        DBI_1.1.3           cli_3.3.0           rstudioapi_0.13    
    [17] withr_2.5.0         generics_0.1.3      vctrs_0.4.1         grid_4.2.1         
    [21] tidyselect_1.2.0    glue_1.6.2          R6_2.5.1            fansi_1.0.3        
    [25] readxl_1.4.1        purrr_0.3.4         magrittr_2.0.3      scales_1.2.1       
    [29] ellipsis_0.3.2      assertthat_0.2.1    colorspace_2.0-3    utf8_1.2.2         
    [33] munsell_0.5.0
    

    2

    需要安装的包

    
    library(decoupleR)
    library(dplyr)
    library(tibble)
    library(tidyr)
    library(ggplot2)
    library(pheatmap)
    library(ggrepel)
    

    3

    正片开始

    如果所有的包都成功library之后,就可以正式分析了,这一步主要是把所用到的3个数据整合到了一个list中,而我们自己分析的话仅需符合它的条件即可。 需要注意的是,示例数据中并非count值,后续使用的也不是count值所使用的分析方法,仅为举栗子,自己分析时输入输出可做调整。

    
    inputs_dir <- system.file("extdata", package = "decoupleR")
    data <- readRDS(file.path(inputs_dir, "bk_data.rds"))
    # count矩阵
    > summary(data$counts)
         gene           PANC1.WT.Rep1    PANC1.WT.Rep2    PANC1.WT.Rep3    PANC1.FOXA2KO.Rep1
     Length:11093       Min.   : 6.198   Min.   : 6.265   Min.   : 6.167   Min.   : 6.254    
     Class :character   1st Qu.: 7.726   1st Qu.: 7.772   1st Qu.: 7.785   1st Qu.: 7.718    
     Mode  :character   Median : 9.187   Median : 9.237   Median : 9.236   Median : 9.203    
                        Mean   : 9.378   Mean   : 9.417   Mean   : 9.402   Mean   : 9.400    
                        3rd Qu.:10.749   3rd Qu.:10.808   3rd Qu.:10.776   3rd Qu.:10.803    
                        Max.   :18.526   Max.   :18.481   Max.   :18.197   Max.   :18.489    
                        NA's   :67       NA's   :140      NA's   :82       NA's   :62        
     PANC1.FOXA2KO.Rep2 PANC1.FOXA2KO.Rep3
     Min.   : 6.220     Min.   : 6.247    
     1st Qu.: 7.776     1st Qu.: 7.775    
     Median : 9.226     Median : 9.259    
     Mean   : 9.404     Mean   : 9.420    
     3rd Qu.:10.779     3rd Qu.:10.825    
     Max.   :18.333     Max.   :18.230    
     NA's   :77         NA's   :157
     
     # 实验设计
    > data$design
    # A tibble: 6 × 2
      sample             condition    
      <chr>              <chr>        
    1 PANC1.WT.Rep1      PANC1.WT     
    2 PANC1.WT.Rep2      PANC1.WT     
    3 PANC1.WT.Rep3      PANC1.WT     
    4 PANC1.FOXA2KO.Rep1 PANC1.FOXA2KO
    5 PANC1.FOXA2KO.Rep2 PANC1.FOXA2KO
    6 PANC1.FOXA2KO.Rep3 PANC1.FOXA2KO
    
    # 差异分析结果
    > head(data$limma_ttop)
    # A tibble: 6 × 7
      ID      logFC AveExpr      t    P.Value adj.P.Val     B
      <chr>   <dbl>   <dbl>  <dbl>      <dbl>     <dbl> <dbl>
    1 RHBDL2  -1.82    8.60 -12.8  0.00000303    0.0336  3.95
    2 PLEKHH2 -1.57    7.70 -10.8  0.00000993    0.0548  3.27
    3 HEG1    -1.73    8.64  -9.79 0.0000194     0.0548  2.84
    4 CLU     -1.79   12.2   -9.76 0.0000198     0.0548  2.83
    5 FHL1     2.09    9.61   8.95 0.0000355     0.0788  2.43
    6 RBP4    -1.73    7.39  -8.53 0.0000490     0.0862  2.19
    

    关于前期差异分析流程,可以移步gzh

    原文还对数据进行了预处理

    
    # 删除NA值,第一列作为行名
    counts <- data$counts %>%
      dplyr::mutate_if(~ any(is.na(.x)), ~ if_else(is.na(.x),0,.x)) %>% 
      column_to_rownames(var = "gene") %>% 
      as.matrix()
    head(counts)
    #>          PANC1.WT.Rep1 PANC1.WT.Rep2 PANC1.WT.Rep3 PANC1.FOXA2KO.Rep1 PANC1.FOXA2KO.Rep2 PANC1.FOXA2KO.Rep3
    #> NOC2L        10.052588     11.949123     12.057774          12.312291          12.139918          11.494205
    #> PLEKHN1       7.535115      8.125993      8.714880           8.048196           8.290154           8.621239
    #> PERM1         6.281242      6.424582      6.589668           6.293285           6.486136           6.775344
    #> ISG15        10.938252     11.469081     11.425415          11.549986          11.371464          11.178157
    #> AGRN          6.956335      7.196108      7.522550           7.061549           7.485534           7.071555
    #> C1orf159      9.546224      9.788721      9.794589           9.850830
    
    # 提取gene与t value的关系
    deg <- data$limma_ttop %>%
        select(ID, t) %>% 
        filter(!is.na(t)) %>% 
        column_to_rownames(var = "ID") %>%
        as.matrix()
    head(deg)
    #>                  t
    #> RHBDL2  -12.810588
    #> PLEKHH2 -10.794453
    #> HEG1     -9.788112
    #> CLU      -9.761618
    #> FHL1      8.950191
    #> RBP4     -8.529074
    

    然后利用PROGENy数据库,获取通路中基因的weigh和*P *值,为后续分析做准备。(仅支持人和小鼠),top主要限制通路中基因的数量(P排序)

    
    net <- get_progeny(organism = 'human', top = 100)
    [2022-11-04 23:00:15] [SUCCESS] [OmnipathR] Loaded 700257 annotation records from cache.
    > net
    # A tibble: 1,400 × 4
       source   target  weight  p_value
       <chr>    <chr>    <dbl>    <dbl>
     1 Androgen TMPRSS2  11.5  2.38e-47
     2 Androgen NKX3-1   10.6  2.21e-44
     3 Androgen MBOAT2   10.5  4.63e-44
     4 Androgen KLK2     10.2  1.94e-40
     5 Androgen SARG     11.4  2.79e-40
     6 Androgen SLC38A4   7.36 1.25e-39
     7 Androgen MTMR9     6.13 2.53e-38
     8 Androgen ZBTB16   10.6  1.57e-36
     9 Androgen KCNN2     9.47 7.71e-36
    10 Androgen OPRK1    -5.63 1.11e-35
    # … with 1,390 more rows
    # ℹ Use `print(n = ...)` to see more rows
    

    既然有了权重,自然可以使用加权平均的方法来计算每个样本的得分,这里使用的是wmean方法,当然也可以使用show_methods()查看其他可以使用的方法,输入数据主要为count值和上一步得到了net。

    
    sample_acts <- run_wmean(mat=counts, net=net, .source='source', .target='target',
                      .mor='weight', times = 100, minsize = 5)
    > show_methods()
    # A tibble: 12 × 2
       Function      Name                                                                      
       <chr>         <chr>                                                                     
     1 run_aucell    AUCell                                                                    
     2 run_consensus Consensus score between methods                                           
     3 run_fgsea     Fast Gene Set Enrichment Analysis (FGSEA)                                 
     4 run_gsva      Gene Set Variation Analysis (GSVA)                                        
     5 run_mdt       Multivariate Decision Trees (MDT)                                         
     6 run_mlm       Multivariate Linear Model (MLM)                                           
     7 run_ora       Over Representation Analysis (ORA)                                        
     8 run_udt       Univariate Decision Tree (UDT)                                            
     9 run_ulm       Univariate Linear Model (ULM)                                             
    10 run_viper     Virtual Inference of Protein-activity by Enriched Regulon analysis (VIPER)
    11 run_wmean     Weighted Mean (WMEAN)                                                     
    12 run_wsum      Weighted Sum (WSUM)  
    

    选择好看的颜色和渐变效果

    palette_length = 100
    my_color = colorRampPalette(c("Darkblue", "white","red"))(palette_length)
    my_breaks <- c(seq(-3, 0, length.out=ceiling(palette_length/2) + 1),
                   seq(0.05, 3, length.out=floor(palette_length/2)))
    

    然后绘制好看的热图,总体聚类结果还行,就是有个WT_Rep3效果不是特别理想,然后可以看到每个样本中通路的上下调情况,红色上调蓝色下调。当然,效果不好可以尝试使用其他方法再次计算。

    pheatmap(sample_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)
    
    图片

    也可以更换行顺序

    pheatmap(sample_acts_mat, border_color = NA, color=my_color, breaks = my_breaks,row_order = rownames(sample_acts_mat),cluster_rows = F)
    
    图片

    当然,差异分析结果也可以利用起来,查看实验组与对照组之间异常活化或抑制的通路。

    
    contrast_acts <- run_wmean(mat=deg, net=net, .source='source', .target='target',
                               .mor='weight', times = 100, minsize = 5)
    contrast_acts
    f_contrast_acts <- contrast_acts %>%
      filter(statistic == 'norm_wmean')
     
    ggplot(f_contrast_acts, aes(x = reorder(source, score), y = score)) + 
      geom_bar(aes(fill = score), stat = "identity") +
      scale_fill_gradient2(low = "darkblue", high = "indianred", 
                           mid = "whitesmoke", midpoint = 0) + 
      theme_minimal() +
      theme(axis.title = element_text(face = "bold", size = 12),
            axis.text.x = 
              element_text(angle = 45, hjust = 1, size =10, face= "bold"),
            axis.text.y = element_text(size =10, face= "bold"),
            panel.grid.major = element_blank(), 
            panel.grid.minor = element_blank()) +
      xlab("Pathways")
    
    图片

    最后,哪些基因在通路中起了积极/消极的作用呢?

    pathway <- 'JAK-STAT'
    
    df <- net %>%
      filter(source == pathway) %>%
      arrange(target) %>%
      mutate(ID = target, color = "3") %>%
      column_to_rownames('target')
    inter <- sort(intersect(rownames(deg),rownames(df)))
    df <- df[inter, ]
    df['t_value'] <- deg[inter, ]
    df <- df %>%
      mutate(color = if_else(weight > 0 & t_value > 0, '1', color)) %>%
      mutate(color = if_else(weight > 0 & t_value < 0, '2', color)) %>%
      mutate(color = if_else(weight < 0 & t_value > 0, '2', color)) %>%
      mutate(color = if_else(weight < 0 & t_value < 0, '1', color))
    
    ggplot(df, aes(x = weight, y = t_value, color = color)) + geom_point() +
      scale_colour_manual(values = c("red","royalblue3","grey")) +
      geom_label_repel(aes(label = ID)) + 
      theme_minimal() +
      theme(legend.position = "none") +
      geom_vline(xintercept = 0, linetype = 'dotted') +
      geom_hline(yintercept = 0, linetype = 'dotted') +
      ggtitle(pathway)
    
    图片

    红色表示权重与t值成正相关,蓝色相反。t值正负又代表什么呢?其实看下图就明白啦!

    图片

    最后~需要完整代码的小伙伴后台回复decoupleR,所有我运行成功的结果也保存到了results.Rdata中,足以复现。

    图片

    敬请批评指正!

    相关文章

      网友评论

          本文标题:decoupleR:丰富你的通路富集分析结果

          本文链接:https://www.haomeiwen.com/subject/prxhqdtx.html