基于液滴的单细胞捕获方法通常其单细胞液滴也会捕获环境基因(ambient RNA)。环境基因的表达由于并不来自barcode细胞,会导致基因表达矩阵计数不准确,更会影响用标记基因鉴定细胞群。SoupX (Young et al.) 可以对环境基因表达做估量并从表达基因表达矩阵中去除其影响。
首先加载R包和数据集 (数据下载地址:https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k)
library(cowplot)
library(Seurat)
library(SoupX)
library(Matrix)
# Load data
scPBMC = load10X("./PBMC_4k/")
问题的提出如下图, IGKC作为B细胞特异表达的基因不仅在B细胞中(上方细胞群),在其他细胞群中多个细胞中其表达量不为零。
dd = scPBMC$metaData[colnames(scPBMC$toc), ]
dd$clusters <- as.factor(dd$clusters)
dd$IGKC = scPBMC$toc["IGKC", ]
dd$IGKC <- as.integer(dd$IGKC)
gg = ggplot(dd, aes(tSNE1, tSNE2)) + geom_point(aes(colour = IGKC > 0), size=1)
plot(gg)
image.png
首先我们用自动的方法对环境基因表达量做估测, 可以看到环境基因表达的预估值为0.058。
# Estimate the contamination fraction - automatic
scPBMC = autoEstCont(scPBMC)
scPBMC$autoEst$rhoEst = scPBMC$fit$rhoEst
#248 genes passed tf-idf cut-off and 177 soup quantile filter. Taking the top 100.
#Using 427 independent estimates of rho.
#Estimated global rho of 0.06
image.png
其次我们用手动的方法,这时需要指定用来估测环境基因表达的基因。这里我们选取了IG家族基因。还需要用estimateNonExpressingCells函数选取用来评估背景表达的细胞,如下图中所示绿色边框的细胞为选定的细胞。可以看到手动方法得到的预估值为0.055, 和自动方法的到的值非常接近。
# Estimate the contamination fraction - manual
igGenes = c("IGHA1", "IGHA2", "IGHG1", "IGHG2", "IGHG3", "IGHG4", "IGHD", "IGHE",
"IGHM", "IGLC1", "IGLC2", "IGLC3", "IGLC4", "IGLC5", "IGLC6", "IGLC7", "IGKC")
ute = estimateNonExpressingCells(scPBMC, nonExpressedGeneList = list(IG = igGenes) )
scPBMC = calculateContaminationFraction(scPBMC,nonExpressedGeneList= list(IG = igGenes),useToEst=ute,forceAccept = TRUE)
#Estimated global contamination fraction of 5.51%
image.png
最后我们用adjustCounts函数得到修正后的矩阵。
# corrected count matrix
out = adjustCounts(scPBMC,setNames(scPBMC$metaData$clusters,rownames(scPBMC$metaData)),verbose=2, roundToInt = TRUE)
我们可以检查一下修正后的效果, 如下图所示,可以看到除了B细胞群,其他细胞群里IGKC的表达的细胞大幅度减少了。
standard10X = function(dat,nPCs=50,res=1.0,verbose=FALSE){
srat = CreateSeuratObject(dat)
srat = NormalizeData(srat,verbose=verbose)
srat = ScaleData(srat,verbose=verbose)
srat = FindVariableFeatures(srat,verbose=verbose)
srat = RunPCA(srat,verbose=verbose)
srat = RunTSNE(srat,dims=seq(nPCs),verbose=verbose)
srat = FindNeighbors(srat,dims=seq(nPCs),verbose=verbose)
srat = FindClusters(srat,res=res,verbose=verbose)
return(srat)
}
sra_obj <- standard10X(out, nPCs=30, res=0.6)
ee = sra_obj@meta.data
ee = cbind(ee, Embeddings(sra_obj[["tsne"]]))
ee$IGKC = sra_obj@assays$RNA@counts[rownames(sra_obj@assays$RNA@counts) == "IGKC", ]
ee$IGKC <- as.integer(ee$IGKC)
gg = ggplot(ee, aes(tSNE_1, tSNE_2)) + geom_point(aes(colour = IGKC > 0), size=1)
plot(gg)
image.png
网友评论