美文网首页单细胞测序专题集合单细胞R语言,代码
单细胞测序数据整合练习(详细代码)

单细胞测序数据整合练习(详细代码)

作者: 生信start_site | 来源:发表于2020-03-25 10:17 被阅读0次

    最近在学习芬兰CSC-IT科学中心主讲的生物信息课程(https://www.csc.fi/web/training/-/scrnaseq)视频,官网上还提供了练习素材以及详细代码,今天就来练习一下单细胞数据整合的过程。跟着官网的代码走一遍:

    https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-integration/Data_Integration.md

    该练习中使用两种方法进行多个单细胞测序dataset的整合,之后进行批次效应的去除,并且定量评估整合后的数据质量。练习中的datasets分别来自:CelSeq (GSE81076) CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061)。原始矩阵和相关metadata在这里下载。(这里需要注意的是,作者上传的这个矩阵是已经经过整合的,但是并没有去除批次效应,后面代码里会将这个矩阵拆分成4个datasets,然后再进行整合)

    开始之前,加载R包:

    > library("Seurat")
    > library("ggplot2")
    > library("cowplot")
    > library("scater")
    > library("scran")
    > library("BiocParallel")
    > library("BiocNeighbors")
    

    (一)利用Seurat (anchors and CCA) 方法进行数据整合以及批次效应处理

    加载表达矩阵和metadata,其中metadata里包含测序平台(列),细胞类型注释(列)

    > pancreas.data <- readRDS(file = "pancreas_expression_matrix.rds")
    > metadata <- readRDS(file = "pancreas_metadata.rds")
    

    看一下这个metadata:

    创建seurat对象:

    > pancreas <- CreateSeuratObject(pancreas.data, meta.data = metadata)
    

    在做任何批次效应处理之前,都要先查看一下dataset,我们先做标准的预处理(log-标准化),然后识别变量(“vst”),接下来scale整合后的data,跑PCA和可视化,再将整合后的细胞分群(cluster)

    # 标准化并且寻找变量(variable features)
    > pancreas <- NormalizeData(pancreas, verbose = FALSE)
    > pancreas <- FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
    # 跑标准的流程(可视化和clustering)
    > pancreas <- ScaleData(pancreas, verbose = FALSE)
    > pancreas <- RunPCA(pancreas, npcs = 30, verbose = FALSE)
    > pancreas <- RunUMAP(pancreas, reduction = "pca", dims = 1:30)
    > p1 <- DimPlot(pancreas, reduction = "umap", group.by = "tech")
    > p2 <- DimPlot(pancreas, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + 
      NoLegend()
    > plot_grid(p1, p2)
    
    这是这个整合后的数据,但是这个数据并没有去除批次效应。左图里4个不同平台测序的结果重合度很低,右图里根据细胞类型分群也没有很好的clustering

    下面作者将这个整合的数据拆分成一个列表(包含4个不同的datasets),每一个dataset作为一个元素。进行标准的预处理(log-normalization),识别每一个datset的变量特征("vst"):

    > pancreas.list <- SplitObject(pancreas, split.by = "tech")
    
    > for (i in 1:length(pancreas.list)) {
      pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
      pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000, 
                                                 verbose = FALSE)
    }
    

    整合4个胰岛细胞的datasets
    利用FindIntegrationAnchors功能识别anchor,seurat对象列表作为输入:

    > reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2", "fluidigmc1")]
    > pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
    
    Computing 2000 integration features
    Scaling features for provided objects
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s  
    Finding all pairwise anchors
      |                                                  | 0 % ~calculating  Running CCA
    Merging objects
    Finding neighborhoods
    Finding anchors
        Found 3499 anchors
    Filtering anchors
        Retained 2821 anchors
    Extracting within-dataset neighbors
      |+++++++++                                         | 17% ~01m 01s      Running CCA
    Merging objects
    Finding neighborhoods
    Finding anchors
        Found 3515 anchors
    Filtering anchors
        Retained 2701 anchors
    Extracting within-dataset neighbors
      |+++++++++++++++++                                 | 33% ~49s          Running CCA
    Merging objects
    Finding neighborhoods
    Finding anchors
        Found 6173 anchors
    Filtering anchors
        Retained 4634 anchors
    Extracting within-dataset neighbors
      |+++++++++++++++++++++++++                         | 50% ~50s          Running CCA
    Merging objects
    Finding neighborhoods
    Finding anchors
        Found 2176 anchors
    Filtering anchors
        Retained 1841 anchors
    Extracting within-dataset neighbors
      |++++++++++++++++++++++++++++++++++                | 67% ~27s          Running CCA
    Merging objects
    Finding neighborhoods
    Finding anchors
        Found 2774 anchors
    Filtering anchors
        Retained 2478 anchors
    Extracting within-dataset neighbors
      |++++++++++++++++++++++++++++++++++++++++++        | 83% ~12s          Running CCA
    Merging objects
    Finding neighborhoods
    Finding anchors
        Found 2723 anchors
    Filtering anchors
        Retained 2410 anchors
    Extracting within-dataset neighbors
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 10s
    

    然后将上面这些anchors传递给IntegrateData函数,该函数返回一个Seurat对象:

    > pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
    

    运行IntegrateData后,Seurat对象将包含一个新的整合后的(或“批量校正”)表达矩阵的Assay,请注意,原始矩阵(未修正的值)仍然存储在Seurat对象的RNA Assay中,因此可以来回切换。

    然后我们可以使用这个新的整合的矩阵进行下游分析和可视化。在这里,我们scale整合的数据,运行PCA,并使用UMAP可视化结果。整合的数据集按细胞类型cluster,而不是按技术。

    #切换到整合后的assay
    > DefaultAssay(pancreas.integrated) <- "integrated"
    

    跑标准流程(可视化和clustering):

    > pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
    > pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
    > pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
    > p3 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
    > p4 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) + 
       NoLegend()
    > plot_grid(p3, p4)
    
    这时的图就是拆分后再次整合、去除批次效应之后的图了。左图的4个平台测序分类的结果重叠度很高,右图按照细胞类型分类的clustering结果也很好

    (二)利用Mutual Nearest Neighbor (MNN)方法进行数据整合

    你可以用count矩阵创建一个singlecellexper(SCE)对象,也可以从Seurat转换成SCE对象:

    > celseq.data <- as.SingleCellExperiment(pancreas.list$celseq)
    > celseq2.data <- as.SingleCellExperiment(pancreas.list$celseq2)
    > fluidigmc1.data <- as.SingleCellExperiment(pancreas.list$fluidigmc1)
    > smartseq2.data <- as.SingleCellExperiment(pancreas.list$smartseq2)
    

    寻找共同的基因,并且把每个dataset简化成由那些共同基因组成的dataset:

    > keep_genes <- Reduce(intersect, list(rownames(celseq.data),rownames(celseq2.data),
    +                                      rownames(fluidigmc1.data),rownames(smartseq2.data)))
    > celseq.data <- celseq.data[match(keep_genes, rownames(celseq.data)), ]
    > celseq2.data <- celseq2.data[match(keep_genes, rownames(celseq2.data)), ]
    > fluidigmc1.data <- fluidigmc1.data[match(keep_genes, rownames(fluidigmc1.data)), ]
    > smartseq2.data <- smartseq2.data[match(keep_genes, rownames(smartseq2.data)), ]
    

    接下来使用calculateQCMetrics()计算质量控制特征,通过发现异常count数低的或可检测到的基因总数少的异常值来确定低质量细胞:

    # 处理celseq.data
    > celseq.data <- calculateQCMetrics(celseq.data)
    > low_lib_celseq.data <- isOutlier(celseq.data$log10_total_counts, type="lower", nmad=3)
    > low_genes_celseq.data <- isOutlier(celseq.data$log10_total_features_by_counts, type="lower", nmad=3)
    > celseq.data <- celseq.data[, !(low_lib_celseq.data | low_genes_celseq.data)]
    # 处理celseq2.data
    > celseq2.data <- calculateQCMetrics(celseq2.data)
    > low_lib_celseq2.data <- isOutlier(celseq2.data$log10_total_counts, type="lower", nmad=3)
    > low_genes_celseq2.data <- isOutlier(celseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
    > celseq2.data <- celseq2.data[, !(low_lib_celseq2.data | low_genes_celseq2.data)]
    # 处理fluidigmc1.data
    > fluidigmc1.data <- calculateQCMetrics(fluidigmc1.data)
    > low_lib_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_counts, type="lower", nmad=3)
    > low_genes_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_features_by_counts, type="lower", nmad=3)
    > fluidigmc1.data <- fluidigmc1.data[, !(low_lib_fluidigmc1.data | low_genes_fluidigmc1.data)]
    # 处理smartseq2.data
    > smartseq2.data <- calculateQCMetrics(smartseq2.data)
    > low_lib_smartseq2.data <- isOutlier(smartseq2.data$log10_total_counts, type="lower", nmad=3)
    > low_genes_smartseq2.data <- isOutlier(smartseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
    > smartseq2.data <- smartseq2.data[, !(low_lib_smartseq2.data | low_genes_smartseq2.data)]
    

    然后使用computeSumFactors()和scran包的Normalize()函数计算sizefactor来标准化数据:

    # Compute sizefactors
    > celseq.data <- computeSumFactors(celseq.data)
    > celseq2.data <- computeSumFactors(celseq2.data)
    > fluidigmc1.data <- computeSumFactors(fluidigmc1.data)
    > smartseq2.data <- computeSumFactors(smartseq2.data)
    # Normalize
    > celseq.data <- normalize(celseq.data)
    > celseq2.data <- normalize(celseq2.data)
    > fluidigmc1.data <- normalize(fluidigmc1.data)
    > smartseq2.data <- normalize(smartseq2.data)
    

    features(基因)选择:使用trendVar()和decomposeVar()函数来计算每个基因的variance,并将其分为技术variance和生物学的variance:

    # celseq.data
    > fit_celseq.data <- trendVar(celseq.data, use.spikes=FALSE) 
    > dec_celseq.data <- decomposeVar(celseq.data, fit_celseq.data)
    > dec_celseq.data$Symbol_TENx <- rowData(celseq.data)$Symbol_TENx
    > dec_celseq.data <- dec_celseq.data[order(dec_celseq.data$bio, decreasing = TRUE), ]
    # celseq2.data
    > fit_celseq2.data <- trendVar(celseq2.data, use.spikes=FALSE) 
    > dec_celseq2.data <- decomposeVar(celseq2.data, fit_celseq2.data)
    > dec_celseq2.data$Symbol_TENx <- rowData(celseq2.data)$Symbol_TENx
    > dec_celseq2.data <- dec_celseq2.data[order(dec_celseq2.data$bio, decreasing = TRUE), ]
    # fluidigmc1.data
    > fit_fluidigmc1.data <- trendVar(fluidigmc1.data, use.spikes=FALSE) 
    > dec_fluidigmc1.data <- decomposeVar(fluidigmc1.data, fit_fluidigmc1.data)
    > dec_fluidigmc1.data$Symbol_TENx <- rowData(fluidigmc1.data)$Symbol_TENx
    > dec_fluidigmc1.data <- dec_fluidigmc1.data[order(dec_fluidigmc1.data$bio, decreasing = TRUE), ]
    # smartseq2.data
    > fit_smartseq2.data <- trendVar(smartseq2.data, use.spikes=FALSE) 
    > dec_smartseq2.data <- decomposeVar(smartseq2.data, fit_smartseq2.data)
    > dec_smartseq2.data$Symbol_TENx <- rowData(smartseq2.data)$Symbol_TENx
    > dec_smartseq2.data <- dec_smartseq2.data[order(dec_smartseq2.data$bio, decreasing = TRUE), ]
    # 选择最能提供信息的基因,这些基因在所有的dataset里都表达
    > universe <- Reduce(intersect, list(rownames(dec_celseq.data),rownames(dec_celseq2.data),
                                       rownames(dec_fluidigmc1.data),rownames(dec_smartseq2.data)))
    > mean.bio <- (dec_celseq.data[universe,"bio"] + dec_celseq2.data[universe,"bio"] + 
                    dec_fluidigmc1.data[universe,"bio"] + dec_smartseq2.data[universe,"bio"])/4
    > hvg_genes <- universe[mean.bio > 0]
    

    将这些datasets结合到一个统一的SingleCellExperiment里:

    # 总原始counts的整合
    > counts_pancreas <- cbind(counts(celseq.data), counts(celseq2.data), 
                              counts(fluidigmc1.data), counts(smartseq2.data))
    # 总的标准化后的counts整合 (with multibatch normalization)
    > logcounts_pancreas <- cbind(logcounts(celseq.data), logcounts(celseq2.data), 
                                 logcounts(fluidigmc1.data), logcounts(smartseq2.data))
    # 构建整合数据的sce对象
    > sce <- SingleCellExperiment( 
       assays = list(counts = counts_pancreas, logcounts = logcounts_pancreas),  
       rowData = rowData(celseq.data), # same as rowData(pbmc4k) 
       colData = rbind(colData(celseq.data), colData(celseq2.data), 
                       colData(fluidigmc1.data), colData(smartseq2.data)) 
     )
    # 将前面的hvg_genes存储到sce对象的metadata slot中 
    > metadata(sce)$hvg_genes <- hvg_genes
    

    用MNN处理批次效应之前先看一下这些datasets:

    > sce <- runPCA(sce,
                   ncomponents = 20,
                   feature_set = hvg_genes,
                   method = "irlba")
    > 
    > names(reducedDims(sce)) <- "PCA_naive" 
    > 
    > p5 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "tech") + 
       ggtitle("PCA Without batch correction")
    > p6 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "celltype") + 
       ggtitle("PCA Without batch correction")
    > plot_grid(p5, p6)
    
    去除批次效应之前

    使用fastMNN() 功能处理批次效应。跑fastMNN()之前,我们需要先rescale每一个批次,来调整不同批次之间的测序深度。用scran包里的multiBatchNorm()功能对size factor进行调整后,重新计算log标准化的表达值,以适应不同SingleCellExperiment对象的系统差异。之前的size factors仅能移除单个批次里细胞之间的bias。现在我们要通过消除批次之间技术差异来提高了校正的质量:

    > rescaled <- multiBatchNorm(celseq.data, celseq2.data, fluidigmc1.data, smartseq2.data) 
    > celseq.data_rescaled <- rescaled[[1]]
    > celseq2.data_rescaled <- rescaled[[2]]
    > fluidigmc1.data_rescaled <- rescaled[[3]]
    > smartseq2.data_rescaled <- rescaled[[4]]
    

    跑fastMNN,把降维的MNN representation存在sce对象的 reducedDims slot里:

    > mnn_out <- fastMNN(celseq.data_rescaled, 
                       celseq2.data_rescaled,
                       fluidigmc1.data_rescaled,
                       smartseq2.data_rescaled,
                       subset.row = metadata(sce)$hvg_genes,
                       k = 20, d = 50, approximate = TRUE,
                       # BPPARAM = BiocParallel::MulticoreParam(8),
                       BNPARAM = BiocNeighbors::AnnoyParam())
    
    > reducedDim(sce, "MNN") <- mnn_out$correct
    

    需要注意的是,fastMNN()不会生成批次处理后的表达矩阵。因此,fastMNN()的结果只能作为降维表示,适用于直接绘图、TSNE/UMAP、聚类和轨迹分析。
    画批次矫正后的图:

    > p7 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "tech") + ggtitle("MNN Ouput Reduced Dimensions")
    > p8 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "celltype") + ggtitle("MNN Ouput Reduced Dimensions")
    > plot_grid(p7, p8)
    
    去除批次效应之后的图

    相关文章

      网友评论

        本文标题:单细胞测序数据整合练习(详细代码)

        本文链接:https://www.haomeiwen.com/subject/jgwzyhtx.html