其实我看单细胞的各种资料,已经断断续续看了有一段时间了。只是没有整理出来,就从今天这篇上路吧~且搜且查,加上问豆豆,开始发布笔记。
1.数据、代码和R包准备
代码:https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html
rm(list = ls())
options(stringsAsFactors = F)
library(Seurat)
library(dplyr)
2.读取数据
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")
dir("filtered_gene_bc_matrices/hg19/")
## [1] "barcodes.tsv" "genes.tsv" "matrix.mtx"
pbmc <- CreateSeuratObject(counts = pbmc.data,
project = "pbmc3k",
min.cells = 3,
min.features = 200)
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features)
#查看表达矩阵
exp = pbmc[["RNA"]]@counts;dim(exp)
## [1] 13714 2700
exp[30:34,1:4]
## 5 x 4 sparse Matrix of class "dgCMatrix"
## AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG
## MRPL20 1 . 1 .
## ATAD3C . . . .
## ATAD3B . . . .
## ATAD3A . . . .
## SSU72 . 1 . 3
很多是.,也就是0,换了一种更省空间的表示方式。
3.质控
质控指标:
线粒体基因含量不能过高;
nFeature_RNA 不能过高或过低
为什么? nFeature_RNA是每个细胞中检测到的基因数量。nCount_RNA是细胞内检测到的分子总数。nFeature_RNA过低,表示该细胞可能已死/将死或是空液滴。高nCount_RNA和/或nFeature_RNA表明“细胞”实际上可以是两个或多个细胞。结合线粒体基因count数除去异常值,即可除去大多数双峰/死细胞/空液滴,因此它们过滤是常见的预处理步骤。 参考自:https://www.biostars.org/p/407036/
3.1 质控指标的可视化
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data, 3)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC pbmc3k 2419 779 3.0177759
## AAACATTGAGCTAC pbmc3k 4903 1352 3.7935958
## AAACATTGATCAGC pbmc3k 3147 1129 0.8897363
VlnPlot(pbmc,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3)
![](https://img.haomeiwen.com/i9475888/4b9f8648c9c74153.png)
根据这个图,确定了过滤标准是:
nFeature_RNA在200~2500之间;线粒体基因占比在5%以下。
3.2 三者之间的相关性
plot1 <- FeatureScatter(pbmc,
feature1 = "nCount_RNA",
feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc,
feature1 = "nCount_RNA",
feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1, plot2))
![](https://img.haomeiwen.com/i9475888/5cf9ebada6adfc94.png)
可见,绝大多数细胞的nCount_RNA与percent.mt在正常范围内,percent.mt异常高时,nCount_RNA则异常低,反之亦然;nCount_RNA和nFeature_RNA基本成正比。
过滤走起
pbmc <- subset(pbmc,
subset = nFeature_RNA > 200 &
nFeature_RNA < 2500 &
percent.mt < 5)
dim(pbmc)
## [1] 13714 2638
可以看到过滤掉了几十个细胞。
4.数据归一化
pbmc <- NormalizeData(pbmc)
pbmc@assays$RNA[30:34,1:3]
## 5 x 3 sparse Matrix of class "dgCMatrix"
## AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC
## MRPL20 1.635873 . 1.429744
## ATAD3C . . .
## ATAD3B . . .
## ATAD3A . . .
## SSU72 . 1.111715 .
5.找差异基因HVGs
# 有三种算法:vst、mean.var.plot、dispersion;默认选择2000个HVG
pbmc <- FindVariableFeatures(pbmc,
selection.method = "vst",
nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10);top10
## [1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "PF4"
## [8] "FTH1" "GNG11" "S100A8"
可视化展示HVGs,可以给top差异基因加上标签
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
CombinePlots(plots = list(plot1, plot2))
![](https://img.haomeiwen.com/i9475888/225e4bf9ca315195.png)
6.数据标准化
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc[["RNA"]]@scale.data[30:34,1:3]
## AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC
## MRPL20 1.50566156 -0.56259931 1.24504892
## ATAD3C -0.04970561 -0.04970561 -0.04970561
## ATAD3B -0.10150202 -0.10150202 -0.10150202
## ATAD3A -0.13088200 -0.13088200 -0.13088200
## SSU72 -0.68454728 0.58087748 -0.68454728
7.降维
7.1 线性降维-PCA
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# 查看前三个主成分由哪些feature组成
print(pbmc[["pca"]], dims = 1:3, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
#可视化
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
![](https://img.haomeiwen.com/i9475888/3fa86758f46d5a8c.png)
#每个主成分对应基因的热图
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
![](https://img.haomeiwen.com/i9475888/f228306eefbc1dd9.png)
# 应该选多少个主成分进行后续分析
ElbowPlot(pbmc)
![](https://img.haomeiwen.com/i9475888/206458209ad8d6e4.png)
引用一下豆豆:它是根据每个主成分对总体变异水平的贡献百分比排序得到的图,我们主要关注”肘部“的PC,它是一个转折点(也即是这里的PC9-10),说明取前10个主成分可以比较好地代表总体变化
#PC1和2
DimPlot(pbmc, reduction = "pca")+ NoLegend()
![](https://img.haomeiwen.com/i9475888/0fba863200a258c7.png)
细胞聚类
# 因为我们前面挑选了10个PCs,所以这里dims定义为10个
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5) #分辨率
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 96033
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds
# 结果聚成几类,用Idents查看
length(levels(Idents(pbmc)))
## [1] 9
resolution的大小决定着分群的数量,3000细胞时,0.4~1.2的分辨率表现较好。
8.非线性降维–umap + 聚类
pbmc <- RunUMAP(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindNeighbors(pbmc, dims = 1:10, verbose = FALSE)
pbmc <- FindClusters(pbmc, verbose = FALSE)
length(levels(Idents(pbmc)))
## [1] 11
DimPlot(pbmc, label = TRUE) + NoLegend()
![](https://img.haomeiwen.com/i9475888/18293706e9807f6a.png)
可以看到,聚类后的细胞分成了9个群(cluster)
9.找biomarker基因
这个就类似于找每个cluster的标记,有一点差异基因的意思。
找cluster1中的marker基因
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## IL32 1.662549e-95 0.8407273 0.949 0.461 2.280020e-91
## LTB 3.900931e-89 0.8859959 0.980 0.640 5.349736e-85
## CD3D 4.422655e-73 0.6559217 0.919 0.428 6.065228e-69
## IL7R 1.359897e-68 0.8167820 0.743 0.324 1.864963e-64
## LDHB 8.461106e-66 0.6143392 0.941 0.614 1.160356e-61
比较cluster5和0,3的差别
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## S100A8 8.339302e-188 4.398822 0.987 0.075 1.143652e-183
## S100A9 5.303672e-165 4.573074 0.991 0.150 7.273455e-161
## LGALS2 5.247158e-163 2.733291 0.836 0.034 7.195953e-159
## TYROBP 5.566866e-160 3.240671 0.987 0.164 7.634399e-156
## CST3 9.451167e-155 3.300451 0.982 0.182 1.296133e-150
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
# 这一步过滤(进行了分类、排序、挑前2个)
tops = pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC);tops
## # A tibble: 22 x 7
## # Groups: cluster [11]
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 6.63e-112 0.756 0.956 0.597 9.09e-108 0 LDHB
## 2 2.44e- 86 0.900 0.481 0.118 3.35e- 82 0 CCR7
## 3 3.90e- 89 0.886 0.98 0.64 5.35e- 85 1 LTB
## 4 2.36e- 61 0.857 0.646 0.243 3.23e- 57 1 CD2
## 5 0\. 2.99 0.936 0.041 0\. 2 CD79A
## 6 9.48e-271 2.49 0.622 0.022 1.30e-266 2 TCL1A
## 7 1.25e-174 2.05 0.956 0.243 1.71e-170 3 CCL5
## 8 6.09e-164 2.00 0.593 0.058 8.36e-160 3 GZMK
## 9 5.60e-225 1.93 0.973 0.133 7.68e-221 4 LGALS2
## 10 3.26e-137 2.18 1 0.562 4.47e-133 4 LYZ
## # … with 12 more rows
多种可视化:
# 一个基因在所有cluster中的表达量
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
![](https://img.haomeiwen.com/i9475888/4baba9c7e194987a.png)
# 基因表达量投射在降维聚类结果中
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
![](https://img.haomeiwen.com/i9475888/15005c345b185472.png)
#对每个cluster的top10 gene 绘制热图
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
![](https://img.haomeiwen.com/i9475888/d2d6ce3642f929f3.png)
网友评论