2020-04-13
参考文献:A Cellular Taxonomy of the Bone Marrow Stroma in Homeostasis and Leukemia
目标:利用文章中描述的工具和方法,复现文中主要的scRNA-seq分析图表
前言
最有效的学习方法就是在实践中学习,今天开始要尝试复现的文章是A Cellular Taxonomy of the Bone Marrow Stroma in Homeostasis and Leukemia,该文章于2019年发表在Cell 杂志上。文章的分析代码并未开源,因此只能根据原文描述来尽量复现其结果。
摘要
Stroma is a poorly defined non-parenchymal component of virtually every organ with key roles in organ development, homeostasis, and repair. Studies of the bone marrow stroma have defined individual populations in the stem cell niche regulating hematopoietic regeneration and capable of initiating leukemia. Here, we use single-cell RNA sequencing (scRNAseq) to define a cellular taxonomy of the mouse bone marrow stroma and its perturbation by malignancy. We identified seventeen stromal subsets expressing distinct hematopoietic regulatory genes spanning new fibroblastic and osteoblastic subpopulations including distinct osteoblast differentiation trajectories. Emerging acute myeloid leukemia impaired mesenchymal osteogenic differentiation and reduced regulatory molecules necessary for normal hematopoiesis. These data suggest that tissue stroma responds to malignant cells by disadvantaging normal parenchymal cells. Our taxonomy of the stromal compartment provides a comprehensive bone marrow cell census and experimental support for cancer cell crosstalk with specific stromal elements to impair normal tissue function and thereby enable emergent cancer.
文章思路
- 利用单细胞测序分析小鼠正常骨髓基质,鉴定了17个细胞亚群及其基因表达特征,以及在稳定造血状态下表达关键龛位(niche)因子的基质细胞。
- 进一步推断细胞亚群的分化关系。
- 描绘了急性髓细胞白血病(Acute myeloid leukemia , AML)对骨髓微环境的整体影响及相应的细胞和分子异常。
scRNA-seq
- 建库方式:Chromium Single Cell 30 v2 Reagent Kit (10x Genomics)
- 上游分析:Cellranger toolkit (version 2.0.1, 10X Genomics)
- 表达矩阵下载:GSE128423
- 下游分析:Seurat v2.3.4、destiny等
这么多样本,重新跑Cellranger流程比较花时间,我们直接从下载表达矩阵开始。按照Cellranger输出格式来组织文件,每个数据集放在单独的文件夹,内含matrix.mtx.gz
、barcodes.tsv.gz
和genes.tsv.gz
(CellRanger 3.0以上版本改为features.tsv.gz
)三个文件。这里我手动改成了3.0的形式,以便于保持.gz
压缩格式读入Seurat
lyc@lyc-VirtualBox:~/lyc-1995/BM_Mouse/data$ tree
.
├── b1
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── b2
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── b3
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── b4
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── bm1
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── bm2
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── bm3
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── bm4
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_10May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_16May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_26May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_7Jun
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── ctrl_8May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── MLL_10May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── MLL_26May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── MLL_31May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── MLL_8May
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std1
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std2
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std3
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std4
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── std5
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
└── std6
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
安装Seurat
Seurat目前已经更新到3.1.4版,但基本流程和2.3.4并没有太大的差别。详见:单细胞Seurat包升级,换汤不换药
我们直接安装最新版Seurat:
> install.packages('Seurat')
> library(Seurat)
分析稳态下的小鼠骨髓(n = 6)
主要目的是得到Figure S1中的两张细胞分群tSNE图:
标注的std7和std8可能是笔误吧……
读取表达矩阵
> getwd()
[1] "/home/lyc/lyc-1995/BM_Mouse"
批量读取表达矩阵,并在细胞barcode前添加样本标签:
sample.id <- paste0('std', 1:6)
raw.data <- lapply(sample.id, function(x) {
counts <- Read10X(data.dir = file.path('data', x))
colnames(counts) <- paste0(x, '_', colnames(counts))
return(counts)
})
names(raw.data) <- sample.id
数据质控
原文对数据质控的描述:
Pre-processing of scRNA-seq data
ScRNA-Seq data were demultiplexed, aligned to the mouse genome, version mm10, and UMI-collapsed with the Cellranger toolkit (version 2.0.1, 10X Genomics). We excluded cells with fewer than 500 detected genes (where each gene had to have at least one UMI aligned). Gene expression was represented as the fraction of its UMI count with respect to total UMI in the cell and then multiplied by 10,000. We denoted it by TP10K – transcripts per 10K transcripts.
Filtering hematopoietic clusters and doublets
Based on cluster annotations with characteristic genes, we removed hematopoietic clusters from further analysis. It is further expected that a small fraction of data should consist of cell doublets (and to an even lesser extent of higher order multiplets) due to co-encapsulation into droplets and/or as occasional pairs of cells that were not dissociated in sample preparation. Therefore, when we found small clusters of cells expressing both hematopoietic and stromal markers we removed them from further analysis (original cluster 14). A small number of additional clusters and subclusters was marked by genes differentially expressed in at least two larger stromal clusters and were annotated as doublets if their average number of expressed genes was higher than the averages for corresponding suspected singlet cluster sources and/or they were not characterized by specific differentially expressed genes (original clusters 18 and 19). All marked doublets were removed from the discussion.
按照原文描述,先过滤掉检测基因数少于500的细胞,再根据下游聚类分析移除一些推定的doublets。我们先据此构建Seurat对象:
seu <- lapply(raw.data, CreateSeuratObject, min.features = 500)
检查了第一步质控,留下36,000+细胞,和作者原文给出的30,543数量差得有点多啊。原文并未给出分析用的源代码,因此只能猜测是作者省略了一些方法学上的描述。我们先按照常规流程走一遍看看吧:
seu.merge <- merge(seu[[1]], seu[2:length(seu)])
> dim(seu.merge)
[1] 27998 36181
进一步检查UMI数、基因数和线粒体相关UMI比例:
seu.merge[["percent.mt"]] <- PercentageFeatureSet(seu.merge, pattern = "^mt-")
VlnPlot(seu.merge, features = c('nCount_RNA', 'nFeature_RNA', 'percent.mt'))
线粒体基因高比例的细胞还是不少的,通常代表低质量细胞。而且也存在UMI数和基因数异常高的细胞,可能提示潜在的doublets。
再检查一下UMI数、基因数和线粒体基因比例的相关性:
plot1 <- FeatureScatter(seu.merge, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot2 <- FeatureScatter(seu.merge, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot1 + plot2
根据以上质控指标,设置进一步过滤:
seu.clean <- subset(seu.merge, nCount_RNA < 50000 & nFeature_RNA < 6000 & percent.mt < 7.5)
rm(seu, seu.merge);gc(reset = TRUE)
> dim(seu.clean)
[1] 27998 34062
呃……还是多出来很多细胞。文章对这部分质控没有更多描述了,我们先暂且搁置这个问题。
Normalization
用的是最经典的LogNormalize
> seu.clean <- NormalizeData(seu.clean)
Performing log-normalization
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Feature selection and Dimensionality reduction
Dimensionality reduction
We performed dimensionality reduction using gene expression data for a subset of variable genes. The variable genes were selected based on dispersion of binned variance to mean expression ratios using FindVariableGenes function of Seurat package (Satija et al., 2015) followed by filtering of cell-cycle, ribosomal protein, and mitochondrial genes. Next, we performed principal component analysis (PCA) and reduced the data to the top 50 PCA components (number of components was chosen based on standard deviations of the principal components – in a plateau region of an ‘‘elbow plot’’).
利用Seurat 2.3.4 版的FindVariableGenes
函数,基于每个基因的平均表达量和离散程度,选择高可变基因进行下游的降维和聚类分析。在Seurat 3 中对应FindVariableFeatures
函数,将selection.method
参数改为'mvp'
来对应 2.3.4 版。
原文没有给出具体参数,这里使用默认参数:
> seu.clean <- FindVariableFeatures(seu.clean, selection.method = 'mvp')
Calculating gene means
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating gene variance to mean ratios
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
> length(VariableFeatures(seu.clean))
[1] 1127
top10 <- head(VariableFeatures(seu.clean), 10)
plot1 <- VariableFeaturePlot(seu.clean)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
筛选到了1,127个高可变基因。作者同时从其中剔除了线粒体基因、核糖体基因和细胞周期相关的基因,但是并没有指出挑选基因的规则,这里我们同样只能按照一般做法来尝试:
mt.genes <- grep(pattern = '^mt-', rownames(seu.clean), value = TRUE)
ribo.genes <- grep(pattern = '^(Rpl[0-9]|Rps[0-9])', rownames(seu.clean), value = TRUE)
# 细胞周期基因
install.packages('org.Mm.eg.db')
library(org.Mm.eg.db)
cellcycle <- select(org.Mm.eg.db, keys = "GO:0007049", columns = "SYMBOL", keytype = "GOALL")
cellcycle <- unique(cellcycle$SYMBOL)
rm.genes <- c(mt.genes, ribo.genes, cellcycle)
VariableFeatures(seu.clean) <- setdiff(VariableFeatures(seu.clean), rm.genes)
> length(VariableFeatures(seu.clean))
[1] 1043
主成分分析(PCA),默认计算50个主成分:
> seu.clean <- ScaleData(seu.clean)
Centering and scaling data matrix
|============================================================================================================================================| 100%
> seu.clean <- RunPCA(seu.clean)
PC_ 1
Positive: Col1a2, Cst3, Mgp, Igfbp5, Abi3bp, Comp, Dcn, Cxcl14, Clu, Serping1
1500015O10Rik, Fam46a, Htra1, Fmod, Spp1, Ibsp, Mt2, Fibin, Chad, Pam
Gsn, Scara3, Itm2a, Olfml3, Col1a1, Col3a1, Ogn, Meg3, Pth1r, Col11a1
Negative: Fabp4, Cldn5, Cdh5, Lrg1, Tfpi, Kdr, Stab2, Gpihbp1, Mmrn2, Emcn
Fam167b, Esam, Ctla2a, Gpm6a, Stab1, Cd93, Apold1, Flt1, Gm1673, Cd36
Kcnj8, Ptprb, Mrc1, Flt4, Cyp4b1, Pecam1, Ecscr, Dnase1l3, Ushbp1, Abcc9
PC_ 2
Positive: Tmem176b, Cxcl12, Vcam1, Gas6, Lpl, Gdpd2, Fbln5, Esm1, Nrp1, Kitl
Adipoq, Serping1, Hp, Dpep1, Pappa, Rarres2, Cxcl14, Epas1, Cdh11, Cyp1b1
Ebf3, 1500009L16Rik, Sfrp4, Tnc, Lepr, Angptl4, Trf, Arrdc4, Agt, Chrdl1
Negative: Chchd10, Rac2, Vpreb3, Cd79a, Cd79b, Ptprcap, Coro1a, Cd37, Mzb1, Lrmp
Pafah1b3, Blnk, Cnp, Arl5c, Rhoh, Laptm5, Cd72, Pou2af1, Gmfg, Cd53
Siglecg, Atp1b1, Fcrla, Xrcc6, Dusp2, Cytip, Tifa, Spib, Dnajc7, Bcl7a
PC_ 3
Positive: Comp, Chad, Fmod, Pcolce2, 1500015O10Rik, Col11a1, Meg3, Anxa8, Ndufa4l2, Cilp2
Mfge8, Dcn, Hapln1, Acan, Scrg1, Cilp, Fibin, Ucma, 3110079O15Rik, Mgp
Col2a1, Igfbp6, Nbl1, Prg4, Tnfrsf11b, Dhx58os, Crispld1, Col11a2, S100a4, Tppp3
Negative: Ebf1, Vpreb3, Cd79a, Cd79b, Zeb2, Hp, Tifa, Ptprcap, Rac2, Gdpd2
Coro1a, Chchd10, Esm1, Adipoq, Cxcl12, Mzb1, Dpep1, Cd37, Blnk, Lpl
Lrmp, Pappa, Arl5c, Cd72, Kitl, Pou2af1, Atp1b1, Siglecg, Cyp1b1, Rhoh
PC_ 4
Positive: Igfbp6, Nbl1, Crip1, Col3a1, Tppp3, Dcn, S100a4, Col1a1, Abi3bp, Cilp2
Mustn1, Cdh13, Ly6c1, Cilp, Tnxb, Ebf1, Angptl7, Lgals1, Col1a2, Ly6a
Vpreb3, Thbs4, Anxa8, Medag, Cd79a, Slurp1, Cd79b, Htra1, Clec3b, Cav1
Negative: Alox5ap, Lyz2, Slpi, Pglyrp1, Rgs18, Plek, Bin2, Wfdc21, Tmem40, Tyrobp
Hcst, Lcn2, S100a8, Prkar2b, Ppbp, Fcer1g, Ncf1, S100a9, Mcemp1, Ifitm6
Gp1bb, Nfe2, Ngp, Pf4, Gp9, Chil3, Camp, Fermt3, Clec1b, Ly6c2
PC_ 5
Positive: Col9a2, Col9a1, Col9a3, Mia, 3110079O15Rik, Matn3, Fxyd2, Col11a2, Lect1, Col27a1
Hapln1, Col2a1, Acan, Scrg1, Ucma, Epyc, Col11a1, Prkg2, Il17b, Pth1r
Ppa1, Panx3, Serpina1a, Dhx58os, C1qtnf3, Serpina1d, Bhlhe41, Calml3, Pla2g5, Tnni2
Negative: Igfbp6, S100a4, Alox5ap, Tppp3, Lyz2, Nbl1, Slpi, Col3a1, Rgs18, Plek
Bin2, Pglyrp1, Tmem40, Ppbp, Col1a1, Mustn1, Gp9, Tnxb, Stx11, Dcn
Wfdc21, Hcst, Clec1b, Tyrobp, Itga2b, Rgs10, Fcer1g, Abi3bp, Pf4, Fermt3
查看碎石图:
ElbowPlot(seu.clean, ndims = 50)
文章没说选择多少个PC来进行下游分析,根据碎石图拐点,我们大致选择25个PC左右,进行tSNE降维:
seu.clean <- RunTSNE(seu.clean, dims = 1:25)
原文的tSNE图显示数据集之间几乎没有批次效应,我们来检查一下:
DimPlot(seu.clean, reduction = 'tsne')
DimPlot(seu.clean, reduction = 'tsne', split.by = 'orig.ident', ncol = 3)
可以看到6个数据集之间确实没有明显的批次效应。
无监督聚类
原文对于无监督聚类策略的描述:
Clustering and sub-clustering
We used graph-based clustering of the PCA reduced data with the Louvain Method (Blondel et al., 2008) after computing a shared nearest neighbor graph (Satija et al., 2015). We visualized the clusters on a 2D map produced with t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008). For sub-clustering, we applied the same procedure of finding variable genes, dimensionality reduction, and clustering to the restricted set of data (usually restricted to one initial cluster).
作者在PCA降维后的数据空间内计算 shared nearest neighbor graph(SNN graph),再利用Louvain算法基于图聚类(graph-based clustering)获得细胞分群。作者针对初始聚类结果的子集重复上述的“特征选择-降维-聚类”流程进行亚聚类(sub-clustering)分析。原文没有给出具体的分辨率(resolution)参数,但是根据Figure S1C可以看出似乎是先得到了10个初始分群,亚聚类分析后再得出33个亚群。
Seurat 3把原Seurat 2的FindClusters
函数拆分成FindNeighbors
和FindClusters
,分别用于计算SNN graph和聚类,其实并没有太大的变化。
和tSNE降维时一样,我们使用1-25个主成分构建SNN graph,FindClusters
的默认参数是algorithm = 1
(Louvain),分辨率是resolution = 0.8
:
> seu.clean <- FindNeighbors(seu.clean, dims = 1:25)
Computing nearest neighbor graph
Computing SNN
> seu.clean <- FindClusters(seu.clean)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 34062
Number of edges: 1270143
Running Louvain algorithm...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Maximum modularity in 10 random starts: 0.9290
Number of communities: 33
Elapsed time: 8 seconds
哎哟,直接得到了和原文FigureS1C一样的亚群数(33个)?可视化检查一下:
DimPlot(seu.clean, reduction = 'tsne', label = TRUE, repel = TRUE) + NoLegend()
当然也不能排除作者是得到了33个聚类后根据已知marker基因在FigureS1中做了手动标注
过滤造血干细胞和doublets
由于作者关注的是骨髓基质细胞,因此对造血干祖细胞相关的亚群进行了过滤,并将同时表达两种或以上细胞类型marker的聚类判断为doublets加以去除:
Filtering hematopoietic clusters and doublets
Based on cluster annotations with characteristic genes, we removed hematopoietic clusters from further analysis. It is further expected that a small fraction of data should consist of cell doublets (and to an even lesser extent of higher order multiplets) due to co-encapsulation into droplets and/or as occasional pairs of cells that were not dissociated in sample preparation. Therefore, when we found small clusters of cells expressing both hematopoietic and stromal markers we removed them from further analysis (original cluster 14). A small number of additional clusters and subclusters was marked by genes differentially expressed in at least two larger stromal clusters and were annotated as doublets if their average number of expressed genes was higher than the averages for corresponding suspected singlet cluster sources and/or they were not characterized by specific differentially expressed genes (original clusters 18 and 19). All marked doublets were removed from the discussion.
1. 基于差异表达分析得到各亚群marker基因
Differential expression of gene signatures
For each cluster, we used the Wilcoxon Rank-Sum Test to find genes that had significantly different RNA-seq TP10K expression when compared to the remaining clusters (paired tests when indicated) (after multiple hypothesis testing correction). As a support measure for ranking differentially expressed genes we also used the area under receiver operating characteristic (ROC) curve.
作者基于LogNormalize
后的表达值,使用了Wilcoxon Rank-Sum Test的统计检验方式进行差异表达。Seurat中通过FindAllMarkers
函数实现,默认参数test.use = "wilcox"
。
利用future
包的plan
函数开启并行运算(根据自己的系统配置合理选择核心数),详见 Parallelization in Seurat with future。
最后不要忘记调回单核运算,并释放内存。
library(future)
plan('multiprocess', workers = 8)
all.markers <- FindAllMarkers(seu.clean)
plan('sequential');gc(reset = TRUE)
即使是在并行运算下,差异表达也花了超过了1小时。接下来根据p值对差异基因进行过滤(p_val
或p_val_adj
都行)。尽管文中没有明说是否做了这一步,但单细胞测序数据本身就具有高噪声的特点,根据统计学显著性筛选基因可以减少假阳性结果:
all.markers <- subset(all.markers, p_val_adj < 0.05)
作者同时还使用了接收者操作特征(receiver operating characteristic curve, ROC)曲线的检验方法作为辅助。我们令test.use = "roc"
:
all.markers.roc <- FindAllMarkers(seu.clean, test.use = 'roc')
需要注意的是ROC方法不支持future
的并行计算,这真是要算到地老天荒了……
关于ROC的方法,Seurat帮助文档里面已经说得比较清楚了:
"roc" : Identifies 'markers' of gene expression using ROC analysis. For each gene, evaluates (using AUC) a classifier built on that gene alone, to classify between two groups of cells. An AUC value of 1 means that expression values for this gene alone can perfectly classify the two groupings (i.e. Each of the cells in cells.1 exhibit a higher level than each of the cells in cells.2). An AUC value of 0 also means there is perfect classification, but in the other direction. A value of 0.5 implies that the gene has no predictive power to classify the two groups. Returns a 'predictive power' (
abs(AUC-0.5) * 2
) ranked matrix of putative differentially expressed genes.
ROC曲线可以用来评估给定二分类模型的优劣。简单地说,就是对于每一个给定的基因,评估其能否较准确地识别出目的细胞亚群。ROC的曲线下面积(AUC)可以反映分类模型的正确率,大于或小于0.5都是有意义的(小于0.5时说明基于该基因的分类模型总是给出错误预测,此时我们取相反结果即可得到正确预测,也就意味着该基因可能在目的亚群中表达下调),而AUC在0.5附近时说明基于该基因的分类模型近似于随机分类,该基因可能在两群细胞之间没有显著差异表达。Seurat根据每个基因的AUC计算了分类power代替p值。
2. 基于已知的亚群marker基因
作者同时还比较了已知的不同细胞亚群marker基因的表达分布(Figure S1D):
我们知道Seurat可以在降维图中展示基因表达特征:
FeaturePlot(seu.clean, features = c('Cd79a', 'Cd79b'))
同样可以利用
AddModuleScore
来评估一个基因集的表达情况。我们根据Figure S1D选择基因集:
cd <- list(
C1 = c('Cd79a', 'Cd79b'),
C2 = c('Gypa', 'Hbb-bt', 'Rhag', 'Rhd', 'Tfrc'),
C3 = c('Cd52', 'Cd177', 'Plaur', 'Clec4a2'),
C4 = c('Cd52', 'Selplg', 'Ms4a6c', 'Cd53'),
C5 = c('Gp9', 'Itga2b', 'Cd9', 'Gp1bb'),
C6 = c('Ms4a3', 'Clec12a', 'Fcgr3')
)
seu.clean <- AddModuleScore(
object = seu.clean,
features = cd,
ctrl = 5,
name = 'Known_markers'
)
FeaturePlot(seu.clean, features = paste0('Known_markers', 1:6), ncol = 3)
基本和原文吻合。
网友评论