单细胞转录组基础分析六：伪时间分析

作者: Seurat_Satija | 来源:发表于2021-03-10 16:40 被阅读0次

单细胞转录组基础分析六：伪时间分析
Hemberg-lab单细胞转录组数据分析（五）
Hemberg-lab单细胞转录组数据分析（六）
Hemberg-lab单细胞转录组数据分析（四）
10xGenomics单细胞转录组亚群细分策略
单细胞转录组数据分析|| scanpy教程：使用ingest和B
单细胞转录组数据分析|| scanpy教程：可视化套件
Hemberg-lab单细胞转录组数据分析（三）
单细胞scRNA-seq学习笔记2-数据预处理
Hemberg-lab单细胞转录组数据分析（九）- Scater

本文是参考学习单细胞转录组基础分析六：伪时间分析的学习笔记。可能根据学习情况有所改动。
Monocle进行伪时间分析的核心技术是一种机器学习算法——反向图形嵌入 (Reversed Graph Embedding)。它分析的前提需要一张展现细胞转录特征相似性关系的图，Monocle2使用DDTree降维图，Monocle3使用UMAP降维图。Monocle的机器学习算法可以依据上述降维图形，学习描述细胞如何从一种状态过渡到另一种状态的轨迹。Monocle假设轨迹是树状结构，一端是“根”，另一端是“叶”。一个细胞在生物过程的开始，从根开始沿着主干进行，直到它到达第一个分支。然后，该细胞必须选择一条路径，并沿着树移动越来越远，直到它到达一片叶子。一个细胞的假时间值是它返回根所需的距离。降维方面monocle与seurat的过程大同小异，首先进行数据标准化，其次选择部分基因代表细胞转录特征，最后选用适当的算法降维。对Monocle原理感兴趣的同学可以登录官网查看：

http://cole-trapnell-lab.github.io/monocle-release/

数据导入与处理

轨迹分析的前提是待分析的细胞有紧密的发育关系，PBMC细胞不是很好的的示例数据，我们选择T细胞群体演示一下。Monocle建议导入原始表达矩阵，由它完成数据标准化和其他预处理。

dir.create("pseudotime")

expressionFamily参数用于指定表达矩阵的数据类型，有几个选项可以选择：

稀疏矩阵用negbinomial.size()，
FPKM值用tobit()，
logFPKM值用gaussianff()

mycds是Monocle为我们的数据生成的对象，相当于我们在seurat使用的scRNA对象。数据导入后需要进行标准化和其他预处理：

mycds <- estimateSizeFactors(mycds)

与seurat把标准化后的表达矩阵保存在对象中不同，monocle只保存一些中间结果在对象中，需要用时再用这些中间结果转化。经过上面三个函数的计算，mycds对象中多了SizeFactors、Dipersions、num_cells_expressed和num_genes_expressed等信息。

选择代表性基因

完成数据导入和预处理后，就可以考虑选择哪些基因代表细胞的发育特征，Monocle官网教程提供了4个选择方法：

选择发育差异表达基因
选择clusters差异表达基因
选择离散程度高的基因
自定义发育marker基因

前三种都是无监督分析方法，细胞发育轨迹生成完全不受人工干预；最后一种是半监督分析方法，可以使用先验知识辅助分析。第一种方法要求实验设计有不同的时间点，对起点和终点的样本做基因表达差异分析，挑选显著差异的基因进行后续分析。对于没有时序设计的实验样本，可以使用第2、3种方法挑选基因。第2种方法要先对细胞降维聚类，然后用clusters之间差异表达的基因开展后续分析。Monocle有一套自己的降维聚类方法，与seurat的方法大同小异，很多教程直接使用seurat的差异分析结果。第3种方法使用离散程度高的基因开展分析，seurat有挑选高变基因的方法，monocle也有自己选择的算法。本案例数据不具备使用第1、4种方法的条件，因此这里只演示2、3种方法的使用。

##使用clusters差异表达基因

图片

选择不同的基因集，拟时分析的结果不同，实践中可以几种方法都试一下。

降维及****细胞排序

使用disp.genes开展后续分析

#降维

图片

使用diff.genes分析的结果

图片

轨迹图分面显示

p1 <- plot_cell_trajectory(mycds, color_by = "State") + facet_wrap(~State, nrow = 1)

图片

Monocle基因可视化

s.genes <- c("ITGB1","CCR7","KLRB1","GNLY")

图片

拟时相关基因聚类热图

Monocle中differentialGeneTest()函数可以按条件进行差异分析，将相关参数设为fullModelFormulaStr = "~sm.ns(Pseudotime)"时，可以找到与拟时先关的差异基因。我们可以按一定的条件筛选基因后进行差异分析，全部基因都输入会耗费比较长的时间。建议使用cluster差异基因或高变基因输入函数计算。分析结果主要依据qval区分差异的显著性，筛选之后可以用plot_pseudotime_heatmap函数绘制成热图。

#cluster差异基因

图片

BEAM分析

单细胞轨迹中通常包括分支，它们的出现是因为细胞的表达模式不同。当细胞做出命运选择时，或者遗传、化学或环境扰动时，就会表现出不同的基因表达模式。BEAM(Branched expression analysis modeling)是一种统计方法，用于寻找以依赖于分支的方式调控的基因。

disp_table <- dispersionTable(mycds)

图片

> dir.create("pseudotime")
> scRNAsub <- readRDS("scRNAsub.rds")  #scRNAsub是上一节保存的T细胞子集seurat对象
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file 'scRNAsub.rds', probable reason 'No such file or directory'
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
 [1] splines   parallel  stats4    stats     graphics  grDevices utils     datasets 
 [9] methods   base     

other attached packages:
 [1] monocle_2.18.0              DDRTree_0.1.5               irlba_2.3.3                
 [4] VGAM_1.1-5                  Matrix_1.2-18               patchwork_1.1.1            
 [7] celldex_1.0.0               SingleR_1.4.1               SummarizedExperiment_1.20.0
[10] Biobase_2.50.0              GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
[13] IRanges_2.24.1              S4Vectors_0.28.1            BiocGenerics_0.36.0        
[16] MatrixGenerics_1.2.0        matrixStats_0.57.0          forcats_0.5.0              
[19] stringr_1.4.0               dplyr_1.0.2                 purrr_0.3.4                
[22] readr_1.4.0                 tidyr_1.1.2                 tibble_3.0.4               
[25] ggplot2_3.3.3               tidyverse_1.3.0             Seurat_3.2.3               

loaded via a namespace (and not attached):
  [1] reticulate_1.18               tidyselect_1.1.0             
  [3] RSQLite_2.2.2                 AnnotationDbi_1.52.0         
  [5] htmlwidgets_1.5.3             docopt_0.7.1                 
  [7] grid_4.0.2                    combinat_0.0-8               
  [9] BiocParallel_1.24.1           Rtsne_0.15                   
 [11] munsell_0.5.0                 codetools_0.2-18             
 [13] ica_1.0-2                     future_1.21.0                
 [15] miniUI_0.1.1.1                withr_2.3.0                  
 [17] fastICA_1.2-2                 colorspace_2.0-0             
 [19] rstudioapi_0.13               ROCR_1.0-11                  
 [21] tensor_1.5                    listenv_0.8.0                
 [23] labeling_0.4.2                slam_0.1-48                  
 [25] GenomeInfoDbData_1.2.4        polyclip_1.10-0              
 [27] bit64_4.0.5                   farver_2.0.3                 
 [29] pheatmap_1.0.12               parallelly_1.23.0            
 [31] vctrs_0.3.6                   generics_0.1.0               
 [33] BiocFileCache_1.14.0          R6_2.5.0                     
 [35] rsvd_1.0.3                    bitops_1.0-6                 
 [37] spatstat.utils_1.20-2         DelayedArray_0.16.0          
 [39] assertthat_0.2.1              promises_1.1.1               
 [41] scales_1.1.1                  gtable_0.3.0                 
 [43] beachmat_2.6.4                globals_0.14.0               
 [45] goftest_1.2-2                 rlang_0.4.9                  
 [47] lazyeval_0.2.2                broom_0.7.3                  
 [49] BiocManager_1.30.10           yaml_2.2.1                   
 [51] reshape2_1.4.4                abind_1.4-5                  
 [53] modelr_0.1.8                  backports_1.2.0              
 [55] httpuv_1.5.4                  tools_4.0.2                  
 [57] ellipsis_0.3.1                RColorBrewer_1.1-2           
 [59] sessioninfo_1.1.1             ggridges_0.5.3               
 [61] Rcpp_1.0.5                    plyr_1.8.6                   
 [63] sparseMatrixStats_1.2.1       zlibbioc_1.36.0              
 [65] RCurl_1.98-1.2                densityClust_0.3             
 [67] rpart_4.1-15                  deldir_0.2-3                 
 [69] viridis_0.5.1                 pbapply_1.4-3                
 [71] cowplot_1.1.1                 zoo_1.8-8                    
 [73] haven_2.3.1                   ggrepel_0.9.0                
 [75] cluster_2.1.0                 fs_1.5.0                     
 [77] magrittr_2.0.1                RSpectra_0.16-0              
 [79] data.table_1.13.6             scattermore_0.7              
 [81] lmtest_0.9-38                 reprex_0.3.0                 
 [83] RANN_2.6.1                    fitdistrplus_1.1-3           
 [85] hms_0.5.3                     mime_0.9                     
 [87] xtable_1.8-4                  sparsesvd_0.2                
 [89] readxl_1.3.1                  gridExtra_2.3                
 [91] HSMMSingleCell_1.10.0         compiler_4.0.2               
 [93] KernSmooth_2.23-18            crayon_1.3.4                 
 [95] htmltools_0.5.1.1             mgcv_1.8-33                  
 [97] later_1.1.0.1                 lubridate_1.7.9.2            
 [99] DBI_1.1.0                     ExperimentHub_1.16.0         
[101] dbplyr_2.0.0                  MASS_7.3-53                  
[103] rappdirs_0.3.1                cli_2.2.0                    
[105] igraph_1.2.6                  pkgconfig_2.0.3              
[107] plotly_4.9.3                  xml2_1.3.2                   
[109] XVector_0.30.0                rvest_0.3.6                  
[111] digest_0.6.27                 sctransform_0.3.2            
[113] RcppAnnoy_0.0.18              spatstat.data_1.7-0          
[115] cellranger_1.1.0              leiden_0.3.6                 
[117] uwot_0.1.10                   DelayedMatrixStats_1.12.3    
[119] curl_4.3                      shiny_1.5.0                  
[121] lifecycle_0.2.0               nlme_3.1-151                 
[123] jsonlite_1.7.2                BiocNeighbors_1.8.2          
[125] viridisLite_0.3.0             limma_3.46.0                 
[127] fansi_0.4.1                   pillar_1.4.7                 
[129] lattice_0.20-41               fastmap_1.0.1                
[131] httr_1.4.2                    survival_3.2-7               
[133] interactiveDisplayBase_1.28.0 glue_1.4.2                   
[135] qlcMatrix_0.9.7               FNN_1.1.3                    
[137] spatstat_1.64-1               png_0.1-7                    
[139] BiocVersion_3.12.0            bit_4.0.4                    
[141] stringi_1.5.3                 blob_1.2.1                   
[143] BiocSingular_1.6.0            AnnotationHub_2.22.0         
[145] memoise_1.1.0                 future.apply_1.7.0           
> #图片
> ##保存数据
> saveRDS(scRNAsub, file="scRNAsub.rds")
> scRNAsub <- readRDS("scRNAsub.rds")  #scRNAsub是上一节保存的T细胞子集seurat对象
> data <- as(as.matrix(scRNAsub@assays$RNA@counts), 'sparseMatrix')
> fd <- new('AnnotatedDataFrame', data = fData)
Error in value[[3L]](cond) : 
  AnnotatedDataFrame 'data' is class 'standardGeneric' but should be or extend 'data.frame'
  AnnotatedDataFrame 'initialize' could not update varMetadata:
  perhaps pData and varMetadata are inconsistent?
> data <- as(as.matrix(scRNAsub@assays$RNA@counts), 'sparseMatrix')
> pd <- new('AnnotatedDataFrame', data = scRNAsub@meta.data)
> scRNAsub <- readRDS("scRNAsub.rds")  #scRNAsub是上一节保存的T细胞子集seurat对象
> data <- as(as.matrix(scRNAsub@assays$RNA@counts), 'sparseMatrix')
> pd <- new('AnnotatedDataFrame', data = scRNAsub@meta.data)
> fData <- data.frame(gene_short_name = row.names(data), row.names = row.names(data))
> fd <- new('AnnotatedDataFrame', data = fData)
> mycds <- newCellDataSet(data,
+                         phenoData = pd,
+                         featureData = fd,
+                         expressionFamily = negbinomial.size())
> mycds <- estimateSizeFactors(mycds)
> mycds <- estimateDispersions(mycds, cores=4, relative_expr = TRUE)
Removing 276 outliers
Warning messages:
1: `group_by_()` is deprecated as of dplyr 0.7.0.
Please use `group_by()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
2: `select_()` is deprecated as of dplyr 0.7.0.
Please use `select()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
> ##使用clusters差异表达基因
> diff.genes <- read.csv('subcluster/diff_genes_wilcox.csv')
> diff.genes <- subset(diff.genes,p_val_adj<0.01)$gene
> mycds <- setOrderingFilter(mycds, diff.genes)
> p1 <- plot_ordering_genes(mycds)
> ##使用seurat选择的高变基因
> var.genes <- VariableFeatures(scRNAsub)
> mycds <- setOrderingFilter(mycds, var.genes)
> p2 <- plot_ordering_genes(mycds)
> ##使用monocle选择的高变基因
> disp_table <- dispersionTable(mycds)
> disp.genes <- subset(disp_table, mean_expression >= 0.1 & dispersion_empirical >= 1 * dispersion_fit)$gene_id
> mycds <- setOrderingFilter(mycds, disp.genes)
> p3 <- plot_ordering_genes(mycds)
> ##结果对比
> p1|p2|p3
Warning messages:
1: Transformation introduced infinite values in continuous y-axis 
2: Transformation introduced infinite values in continuous y-axis 
3: Transformation introduced infinite values in continuous y-axis 
4: Transformation introduced infinite values in continuous y-axis 
5: Transformation introduced infinite values in continuous y-axis 
> #降维
> mycds <- reduceDimension(mycds, max_components = 2, method = 'DDRTree')
> #排序
> mycds <- orderCells(mycds)
There were 50 or more warnings (use warnings() to see the first 50)
> #State轨迹分布图
> plot1 <- plot_cell_trajectory(mycds, color_by = "State")
> plot1
> ggsave("pseudotime/State.pdf", plot = plot1, width = 6, height = 5)
> ggsave("pseudotime/State.png", plot = plot1, width = 6, height = 5)
> ##Cluster轨迹分布图
> plot2 <- plot_cell_trajectory(mycds, color_by = "seurat_clusters")
> ggsave("pseudotime/Cluster.pdf", plot = plot2, width = 6, height = 5)
> ggsave("pseudotime/Cluster.png", plot = plot2, width = 6, height = 5)
> plot2
> ##Pseudotime轨迹图
> plot3 <- plot_cell_trajectory(mycds, color_by = "Pseudotime")
> plot3
> ggsave("pseudotime/Pseudotime.pdf", plot = plot3, width = 6, height = 5)
> ggsave("pseudotime/Pseudotime.png", plot = plot3, width = 6, height = 5)
> ##合并作图
> plotc <- plot1|plot2|plot3
> plotc
> ggsave("pseudotime/Combination.pdf", plot = plotc, width = 10, height = 3.5)
> ggsave("pseudotime/Combination.png", plot = plotc, width = 10, height = 3.5)
> ##保存结果
> write.csv(pData(mycds), "pseudotime/pseudotime.csv")
> p1 <- plot_cell_trajectory(mycds, color_by = "State") + facet_wrap(~State, nrow = 1)
> p2 <- plot_cell_trajectory(mycds, color_by = "seurat_clusters") + facet_wrap(~seurat_clusters, nrow = 1)
> plotc <- p1/p2
> plotc <- p1/p2
> p1
> p2
> plotc
> ggsave("pseudotime/trajectory_facet.png", plot = plotc, width = 6, height = 5)
> #cluster差异基因
> diff.genes <- read.csv('subcluster/diff_genes_wilcox.csv')
> sig_diff.genes <- subset(diff.genes,p_val_adj<0.0001&abs(avg_logFC)>0.75)$gene
> sig_diff.genes <- unique(as.character(sig_diff.genes))
> diff_test <- differentialGeneTest(mycds[sig_diff.genes,], cores = 1, 
+                                   fullModelFormulaStr = "~sm.ns(Pseudotime)")
> sig_gene_names <- row.names(subset(diff_test, qval < 0.01))
> p1 = plot_pseudotime_heatmap(mycds[sig_gene_names,], num_clusters=3,
+                              show_rownames=T, return_heatmap=T)
> p1
> ggsave("pseudotime/pseudotime_heatmap1.png", plot = p1, width = 5, height = 8)
> #高变基因
> disp_table <- dispersionTable(mycds)
> disp.genes <- subset(disp_table, mean_expression >= 0.5&dispersion_empirical >= 1*dispersion_fit)
> disp.genes <- as.character(disp.genes$gene_id)
> diff_test <- differentialGeneTest(mycds[disp.genes,], cores = 4, 
+                                   fullModelFormulaStr = "~sm.ns(Pseudotime)")
> sig_gene_names <- row.names(subset(diff_test, qval < 1e-04))
> p2 = plot_pseudotime_heatmap(mycds[sig_gene_names,], num_clusters=5,
+                              show_rownames=T, return_heatmap=T)
Warning messages:
1: In slot(family, "validparams") :
  closing unused connection 7 (<-DESKTOP-2F2KC96:11566)
2: In slot(family, "validparams") :
  closing unused connection 6 (<-DESKTOP-2F2KC96:11566)
3: In slot(family, "validparams") :
  closing unused connection 5 (<-DESKTOP-2F2KC96:11566)
4: In slot(family, "validparams") :
  closing unused connection 4 (<-DESKTOP-2F2KC96:11566)
> ggsave("pseudotime/pseudotime_heatmap2.png", plot = p2, width = 5, height = 10)
> disp_table <- dispersionTable(mycds)
> disp.genes <- subset(disp_table, mean_expression >= 0.5&dispersion_empirical >= 1*dispersion_fit)
> disp.genes <- as.character(disp.genes$gene_id)
> mycds_sub <- mycds[disp.genes,]
> plot_cell_trajectory(mycds_sub, color_by = "State")
> beam_res <- BEAM(mycds_sub, branch_point = 1, cores = 8)
Warning messages:
1: In if (progenitor_method == "duplicate") { :
  the condition has length > 1 and only the first element will be used
2: In if (progenitor_method == "sequential_split") { :
  the condition has length > 1 and only the first element will be used
> beam_res <- beam_res[order(beam_res$qval),]
> beam_res <- beam_res[,c("gene_short_name", "pval", "qval")]
> mycds_sub_beam <- mycds_sub[row.names(subset(beam_res, qval < 1e-4)),]
> plot_genes_
Error: object 'plot_genes_' not found

单细胞转录组基础分析六：伪时间分析
本文是参考学习单细胞转录组基础分析六：伪时间分析[https://mp.weixin.qq.com/s?__biz...
Hemberg-lab单细胞转录组数据分析（五）
Hemberg-lab单细胞转录组数据分析（一） Hemberg-lab单细胞转录组数据分析（二） Hemberg...
Hemberg-lab单细胞转录组数据分析（六）
Hemberg-lab单细胞转录组数据分析（一） Hemberg-lab单细胞转录组数据分析（二） Hemberg...
Hemberg-lab单细胞转录组数据分析（四）
Hemberg-lab单细胞转录组数据分析（一） Hemberg-lab单细胞转录组数据分析（二） Hemberg...
10xGenomics单细胞转录组亚群细分策略
10xGenomics单细胞转录组 10xGenomics单细胞转录组分析的核心就是聚类，但是单细胞转录组分析的聚...
单细胞转录组数据分析|| scanpy教程：使用ingest和B
单细胞转录组数据分析|| scanpy教程：预处理与聚类单细胞转录组数据分析|| scanpy教程：PAGA轨迹...
单细胞转录组数据分析|| scanpy教程：可视化套件
单细胞转录组数据分析|| scanpy教程：预处理与聚类单细胞转录组数据分析|| scanpy教程：PAGA轨迹...
Hemberg-lab单细胞转录组数据分析（三）
Hemberg-lab单细胞转录组数据分析（一） Hemberg-lab单细胞转录组数据分析（二）收藏|北大生信...
单细胞scRNA-seq学习笔记2-数据预处理
单细胞scRNA-seq学习笔记2-数据预处理课程学习生信技能树单细胞转录组（基础）一、转录组上游分析主要是...
Hemberg-lab单细胞转录组数据分析（九）- Scater
往期系列 Hemberg-lab单细胞转录组数据分析（一） Hemberg-lab单细胞转录组数据分析（二） He...