I have recently been watching the bioinformatics course videos hosted by the CSC – IT Center for Science in Finland (https://www.csc.fi/web/training/-/scrnaseq). The course site also provides exercise material and detailed code, so today let's practice the single-cell data integration workflow by following the code from the course material:
https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-integration/Data_Integration.md
The exercise uses two methods to integrate multiple single-cell sequencing datasets, removes the batch effects, and then quantitatively evaluates the quality of the integrated data. The datasets come from four platforms: CelSeq (GSE81076), CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). The raw matrix and the associated metadata can be downloaded from the course material linked above. (Note that the matrix provided by the authors has already been merged, but the batch effects have not been removed; the code below splits this matrix back into 4 datasets and then integrates them.)
Before we start, load the R packages:
> library("Seurat")
> library("ggplot2")
> library("cowplot")
> library("scater")
> library("scran")
> library("BiocParallel")
> library("BiocNeighbors")
(1) Data integration and batch-effect correction with Seurat (anchors and CCA)
Load the expression matrix and the metadata; the metadata contains a column for the sequencing platform (tech) and a column for the cell-type annotation (celltype):
> pancreas.data <- readRDS(file = "pancreas_expression_matrix.rds")
> metadata <- readRDS(file = "pancreas_metadata.rds")
Take a look at the metadata:
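For example, check the first few rows and the composition by platform and cell type (a quick sketch using the objects loaded above; output not shown):
> head(metadata)
> table(metadata$tech)
> table(metadata$tech, metadata$celltype)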
Create a Seurat object:
> pancreas <- CreateSeuratObject(pancreas.data, meta.data = metadata)
Before doing any batch-effect correction, it is worth inspecting the dataset first. We run the standard preprocessing (log-normalization), identify variable features ("vst"), then scale the merged data, run PCA, visualize the result, and cluster the merged cells.
# Normalize and find variable features
> pancreas <- NormalizeData(pancreas, verbose = FALSE)
> pancreas <- FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000, verbose = FALSE)
# Run the standard workflow (visualization and clustering)
> pancreas <- ScaleData(pancreas, verbose = FALSE)
> pancreas <- RunPCA(pancreas, npcs = 30, verbose = FALSE)
> pancreas <- RunUMAP(pancreas, reduction = "pca", dims = 1:30)
> p1 <- DimPlot(pancreas, reduction = "umap", group.by = "tech")
> p2 <- DimPlot(pancreas, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
NoLegend()
> plot_grid(p1, p2)
This is the merged data before any batch-effect correction. In the left panel, cells from the 4 sequencing platforms barely overlap; in the right panel, the grouping by cell type does not form clean clusters either.
Next, the merged object is split into a list of 4 datasets, one element per dataset. We then run the standard preprocessing (log-normalization) and identify the variable features ("vst") for each dataset:
> pancreas.list <- SplitObject(pancreas, split.by = "tech")
> for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
verbose = FALSE)
}
Integrate the 4 pancreatic islet datasets
Identify anchors with the FindIntegrationAnchors function, which takes the list of Seurat objects as input:
> reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2", "fluidigmc1")]
> pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
Computing 2000 integration features
Scaling features for provided objects
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
Finding all pairwise anchors
| | 0 % ~calculating Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3499 anchors
Filtering anchors
Retained 2821 anchors
Extracting within-dataset neighbors
|+++++++++ | 17% ~01m 01s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 3515 anchors
Filtering anchors
Retained 2701 anchors
Extracting within-dataset neighbors
|+++++++++++++++++ | 33% ~49s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 6173 anchors
Filtering anchors
Retained 4634 anchors
Extracting within-dataset neighbors
|+++++++++++++++++++++++++ | 50% ~50s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 2176 anchors
Filtering anchors
Retained 1841 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++ | 67% ~27s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 2774 anchors
Filtering anchors
Retained 2478 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++++++++++ | 83% ~12s Running CCA
Merging objects
Finding neighborhoods
Finding anchors
Found 2723 anchors
Filtering anchors
Retained 2410 anchors
Extracting within-dataset neighbors
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 10s
Then pass these anchors to the IntegrateData function, which returns a Seurat object:
> pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
After running IntegrateData, the Seurat object contains a new Assay holding the integrated (or "batch-corrected") expression matrix. Note that the original (uncorrected) values are still stored in the RNA assay of the Seurat object, so you can switch back and forth between the two.
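A quick way to check which assays are present and to switch between them (a minimal sketch using the object created above):
> Assays(pancreas.integrated)                        # "RNA" and "integrated"
> DefaultAssay(pancreas.integrated) <- "RNA"         # back to the uncorrected values
> DefaultAssay(pancreas.integrated) <- "integrated"  # forth to the corrected ones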
We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the result with UMAP. The integrated cells now cluster by cell type rather than by sequencing technology.
# Switch to the integrated assay
> DefaultAssay(pancreas.integrated) <- "integrated"
Run the standard workflow (visualization and clustering; a clustering sketch follows the plots below):
> pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
> pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
> pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
> p3 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
> p4 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
NoLegend()
> plot_grid(p3, p4)
These plots show the data after splitting, re-integration, and batch-effect removal. In the left panel the cells from the 4 platforms now overlap well, and in the right panel the clustering by cell type is also clean.
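To actually cluster the integrated cells (the "clustering" step mentioned above), a minimal graph-based sketch on the integrated assay could look like this (the resolution value is illustrative, not taken from the original exercise):
> pancreas.integrated <- FindNeighbors(pancreas.integrated, reduction = "pca", dims = 1:30)
> pancreas.integrated <- FindClusters(pancreas.integrated, resolution = 0.5)
> # cluster labels are stored in the "seurat_clusters" metadata column
> DimPlot(pancreas.integrated, reduction = "umap", group.by = "seurat_clusters", label = TRUE)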
(2) Data integration with the Mutual Nearest Neighbors (MNN) method
You can either create a SingleCellExperiment (SCE) object from a count matrix or convert the Seurat objects into SCE objects:
> celseq.data <- as.SingleCellExperiment(pancreas.list$celseq)
> celseq2.data <- as.SingleCellExperiment(pancreas.list$celseq2)
> fluidigmc1.data <- as.SingleCellExperiment(pancreas.list$fluidigmc1)
> smartseq2.data <- as.SingleCellExperiment(pancreas.list$smartseq2)
Find the genes shared by all datasets and subset each dataset down to those common genes:
> keep_genes <- Reduce(intersect, list(rownames(celseq.data),rownames(celseq2.data),
+ rownames(fluidigmc1.data),rownames(smartseq2.data)))
> celseq.data <- celseq.data[match(keep_genes, rownames(celseq.data)), ]
> celseq2.data <- celseq2.data[match(keep_genes, rownames(celseq2.data)), ]
> fluidigmc1.data <- fluidigmc1.data[match(keep_genes, rownames(fluidigmc1.data)), ]
> smartseq2.data <- smartseq2.data[match(keep_genes, rownames(smartseq2.data)), ]
Next, compute quality-control metrics with calculateQCMetrics() and identify low-quality cells as outliers with abnormally low total counts or an abnormally low number of detected genes:
# Process celseq.data
> celseq.data <- calculateQCMetrics(celseq.data)
> low_lib_celseq.data <- isOutlier(celseq.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq.data <- isOutlier(celseq.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq.data <- celseq.data[, !(low_lib_celseq.data | low_genes_celseq.data)]
# Process celseq2.data
> celseq2.data <- calculateQCMetrics(celseq2.data)
> low_lib_celseq2.data <- isOutlier(celseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_celseq2.data <- isOutlier(celseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> celseq2.data <- celseq2.data[, !(low_lib_celseq2.data | low_genes_celseq2.data)]
# Process fluidigmc1.data
> fluidigmc1.data <- calculateQCMetrics(fluidigmc1.data)
> low_lib_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_counts, type="lower", nmad=3)
> low_genes_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_features_by_counts, type="lower", nmad=3)
> fluidigmc1.data <- fluidigmc1.data[, !(low_lib_fluidigmc1.data | low_genes_fluidigmc1.data)]
# Process smartseq2.data
> smartseq2.data <- calculateQCMetrics(smartseq2.data)
> low_lib_smartseq2.data <- isOutlier(smartseq2.data$log10_total_counts, type="lower", nmad=3)
> low_genes_smartseq2.data <- isOutlier(smartseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
> smartseq2.data <- smartseq2.data[, !(low_lib_smartseq2.data | low_genes_smartseq2.data)]
Then normalize the data by computing size factors with computeSumFactors() from scran and applying normalize():
# Compute size factors
> celseq.data <- computeSumFactors(celseq.data)
> celseq2.data <- computeSumFactors(celseq2.data)
> fluidigmc1.data <- computeSumFactors(fluidigmc1.data)
> smartseq2.data <- computeSumFactors(smartseq2.data)
# Normalize
> celseq.data <- normalize(celseq.data)
> celseq2.data <- normalize(celseq2.data)
> fluidigmc1.data <- normalize(fluidigmc1.data)
> smartseq2.data <- normalize(smartseq2.data)
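For a quick sanity check on the size factors just computed, the sizeFactors() accessor can be summarized per dataset (a small sketch; output not shown):
> summary(sizeFactors(celseq.data))
> summary(sizeFactors(smartseq2.data))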
Feature (gene) selection: use trendVar() and decomposeVar() to model the variance of each gene and decompose it into technical and biological components:
# celseq.data
> fit_celseq.data <- trendVar(celseq.data, use.spikes=FALSE)
> dec_celseq.data <- decomposeVar(celseq.data, fit_celseq.data)
> dec_celseq.data$Symbol_TENx <- rowData(celseq.data)$Symbol_TENx
> dec_celseq.data <- dec_celseq.data[order(dec_celseq.data$bio, decreasing = TRUE), ]
# celseq2.data
> fit_celseq2.data <- trendVar(celseq2.data, use.spikes=FALSE)
> dec_celseq2.data <- decomposeVar(celseq2.data, fit_celseq2.data)
> dec_celseq2.data$Symbol_TENx <- rowData(celseq2.data)$Symbol_TENx
> dec_celseq2.data <- dec_celseq2.data[order(dec_celseq2.data$bio, decreasing = TRUE), ]
# fluidigmc1.data
> fit_fluidigmc1.data <- trendVar(fluidigmc1.data, use.spikes=FALSE)
> dec_fluidigmc1.data <- decomposeVar(fluidigmc1.data, fit_fluidigmc1.data)
> dec_fluidigmc1.data$Symbol_TENx <- rowData(fluidigmc1.data)$Symbol_TENx
> dec_fluidigmc1.data <- dec_fluidigmc1.data[order(dec_fluidigmc1.data$bio, decreasing = TRUE), ]
# smartseq2.data
> fit_smartseq2.data <- trendVar(smartseq2.data, use.spikes=FALSE)
> dec_smartseq2.data <- decomposeVar(smartseq2.data, fit_smartseq2.data)
> dec_smartseq2.data$Symbol_TENx <- rowData(smartseq2.data)$Symbol_TENx
> dec_smartseq2.data <- dec_smartseq2.data[order(dec_smartseq2.data$bio, decreasing = TRUE), ]
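To eyeball how well the fitted mean-variance trend captures the technical component for one of the batches, the fields returned by trendVar() can be plotted (a sketch assuming this scran version, where fit$trend is a function of the mean):
> plot(fit_celseq.data$mean, fit_celseq.data$var, xlab = "Mean log-expression", ylab = "Variance")
> curve(fit_celseq.data$trend(x), col = "red", add = TRUE)  # fitted technical trend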
# Select the most informative genes, i.e. those present in all the datasets
> universe <- Reduce(intersect, list(rownames(dec_celseq.data),rownames(dec_celseq2.data),
rownames(dec_fluidigmc1.data),rownames(dec_smartseq2.data)))
> mean.bio <- (dec_celseq.data[universe,"bio"] + dec_celseq2.data[universe,"bio"] +
dec_fluidigmc1.data[universe,"bio"] + dec_smartseq2.data[universe,"bio"])/4
> hvg_genes <- universe[mean.bio > 0]
Combine the datasets into a single SingleCellExperiment:
# Combine the raw counts
> counts_pancreas <- cbind(counts(celseq.data), counts(celseq2.data),
counts(fluidigmc1.data), counts(smartseq2.data))
# Combine the log-normalized counts
> logcounts_pancreas <- cbind(logcounts(celseq.data), logcounts(celseq2.data),
logcounts(fluidigmc1.data), logcounts(smartseq2.data))
# Build the SCE object for the combined data
> sce <- SingleCellExperiment(
assays = list(counts = counts_pancreas, logcounts = logcounts_pancreas),
rowData = rowData(celseq.data), # gene annotation, identical across the four datasets
colData = rbind(colData(celseq.data), colData(celseq2.data),
colData(fluidigmc1.data), colData(smartseq2.data))
)
# Store the hvg_genes selected above in the metadata slot of the sce object
> metadata(sce)$hvg_genes <- hvg_genes
Before correcting the batch effects with MNN, take a look at the combined datasets:
> sce <- runPCA(sce,
ncomponents = 20,
feature_set = hvg_genes,
method = "irlba")
>
> names(reducedDims(sce)) <- "PCA_naive"
>
> p5 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "tech") +
ggtitle("PCA Without batch correction")
> p6 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "celltype") +
ggtitle("PCA Without batch correction")
> plot_grid(p5, p6)
Before batch-effect removal
We use fastMNN() to correct the batch effects. Before running fastMNN(), each batch needs to be rescaled to account for differences in sequencing depth between batches. The multiBatchNorm() function from scran adjusts the size factors and recomputes the log-normalized expression values so that systematic differences between the SingleCellExperiment objects are accounted for. The size factors computed earlier only remove biases between cells within a single batch; rescaling across batches removes the technical differences between batches and improves the quality of the subsequent correction:
> rescaled <- multiBatchNorm(celseq.data, celseq2.data, fluidigmc1.data, smartseq2.data)
> celseq.data_rescaled <- rescaled[[1]]
> celseq2.data_rescaled <- rescaled[[2]]
> fluidigmc1.data_rescaled <- rescaled[[3]]
> smartseq2.data_rescaled <- rescaled[[4]]
Run fastMNN and store the low-dimensional MNN representation in the reducedDims slot of the sce object:
> mnn_out <- fastMNN(celseq.data_rescaled,
celseq2.data_rescaled,
fluidigmc1.data_rescaled,
smartseq2.data_rescaled,
subset.row = metadata(sce)$hvg_genes,
k = 20, d = 50, approximate = TRUE,
# BPPARAM = BiocParallel::MulticoreParam(8),
BNPARAM = BiocNeighbors::AnnoyParam())
> reducedDim(sce, "MNN") <- mnn_out$correct
Note that fastMNN() does not produce a batch-corrected expression matrix. Its output should therefore be treated only as a reduced-dimension representation, suitable for plotting, t-SNE/UMAP, clustering, and trajectory analysis.
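For example, downstream clustering and t-SNE can be run directly on the MNN dimensions stored above (a minimal sketch assuming the same scran/scater versions used in this post; the value of k and the column name cluster_mnn are illustrative):
> snn_graph <- buildSNNGraph(sce, use.dimred = "MNN", k = 20)                # SNN graph on the corrected dims
> sce$cluster_mnn <- factor(igraph::cluster_walktrap(snn_graph)$membership)  # graph-based clusters
> sce <- runTSNE(sce, use_dimred = "MNN")                                    # t-SNE computed from the MNN dims
> plotTSNE(sce, colour_by = "cluster_mnn")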
Plot the batch-corrected result:
> p7 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "tech") + ggtitle("MNN Output Reduced Dimensions")
> p8 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "celltype") + ggtitle("MNN Output Reduced Dimensions")
> plot_grid(p7, p8)
The plots after batch-effect removal