美文网首页单细胞测序
Seurat 4.0 || 分析scRNA和膜蛋白数据

Seurat 4.0 || 分析scRNA和膜蛋白数据

作者: 周运来就是我 | 来源:发表于2020-10-30 21:33 被阅读0次

    前情回顾

    Seurat 4.0 ||单细胞多模态数据整合算法WNN
    Seurat 4.0 ||单细胞数据分析工具箱有更新

    得益于单细胞多模态技术的发展,允许我们在单细胞水平从不同侧面考察细胞状态,如CITE-seq技术可以同时对单细胞转录组和膜蛋白进行定量。单细胞多模态数据分析面临的挑战和Seurat给出的解决方案,在预印本文章Integrated analysis of multimodal single-cell data中进行了介绍(中译版已由单细胞天地于近日推送)。算法落实到实现层面,我们来学习一下WNN的几篇教程,本文介绍了用于分析多模态单细胞数据集的加权最近邻(WNN)工作流程。该流程由三个步骤组成:

    • 每个模态独立的预处理和降维
    • 学习细胞特定的模态“权重”,并构建一个集成了这些模态的WNN图
    • WNN图的下游分析(如可视化、聚类等)

    我们使用来自(Stuart, Butler等人,Cell 2019)的CITE-seq数据集,该数据集包含30,672个scRNA-seq 数据及25个抗体数据。研究对象包括RNA和抗体衍生标签(ADT)两种检测方法。

    要运行本教程,请安装Seurat v4,在github页面上有beta版本。

    remotes::install_github("satijalab/seurat", ref = "release/4.0.0")
    

    如下载入我们的老朋友:

    library(Seurat)
    library(SeuratData)
    library(cowplot)
    library(dplyr)
    

    下载示例数据集

    InstallData("bmcite")
    bm <- LoadData(ds = "bmcite")
    

    照例我们看看数据是怎么样的:

    bm
    An object of class Seurat 
    17034 features across 30672 samples within 2 assays 
    Active assay: RNA (17009 features, 2000 variable features)
     1 other assay present: ADT
     1 dimensional reduction calculated: spca
    
    

    我们看到这个Seurat对象有两个 assays每个 assays 都有相应的表达谱。

    dim(bm@assays$RNA)
    [1] 17009 30672
    
    bm@assays$ADT@counts[1:4,1:4]
    4 x 4 sparse Matrix of class "dgCMatrix"
                a_AAACCTGAGCTTATCG-1 a_AAACCTGAGGTGGGTT-1 a_AAACCTGAGTACATGA-1 a_AAACCTGCAAACCTAC-1
    CD11a                        110                  541                  120                  645
    CD11c                         51                   12                   10                   42
    CD123                         14                    5                    1                    5
    CD127-IL7Ra                   20                  187                   43                  115
    

    RNA的assay我们是很熟悉了,ADT数据也是一个表达谱,包含的抗体:

    rownames(bm@assays$ADT)
     [1] "CD11a"       "CD11c"       "CD123"       "CD127-IL7Ra" "CD14"        "CD16"        "CD161"       "CD19"        "CD197-CCR7"  "CD25"        "CD27"       
    [12] "CD278-ICOS"  "CD28"        "CD3"         "CD34"        "CD38"        "CD4"         "CD45RA"      "CD45RO"      "CD56"        "CD57"        "CD69"       
    [23] "CD79b"       "CD8a"        "HLA.DR"     
    
    

    我们首先分别对这两种数据进行预处理和降维。我们使用标准的标准化,但您也可以使用SCTransform或任何替代方法。

    DefaultAssay(bm) <- 'RNA'
    bm <- NormalizeData(bm) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()
    
    DefaultAssay(bm) <- 'ADT'
    # we will use all ADT features for dimensional reduction
    # we set a dimensional reduction name to avoid overwriting the 
    VariableFeatures(bm) <- rownames(bm[["ADT"]])
    bm <- NormalizeData(bm, normalization.method = 'CLR', margin = 2) %>% 
      ScaleData() %>% RunPCA(reduction.name = 'apca')
    
    Normalizing across cells
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=05s  
    Centering and scaling data matrix
      |================================================================================================================================================================| 100%
    Warning in irlba(A = t(x = object), nv = npcs, ...) :
      You're computing too large a percentage of total singular values, use a standard svd instead.
    Warning in irlba(A = t(x = object), nv = npcs, ...) :
      did not converge--results might be invalid!; try increasing work or maxit
    PC_ 1 
    Positive:  CD3, CD28, CD27, CD127-IL7Ra, CD278-ICOS, CD4, CD8a, CD161, CD25, CD45RO 
           CD45RA, CD197-CCR7 
    Negative:  HLA.DR, CD11c, CD14, CD38, CD69, CD123, CD11a, CD34, CD19, CD79b 
           CD16, CD56 
    PC_ 2 
    Positive:  CD45RO, CD11a, CD14, CD4, CD11c, CD28, CD3, CD69, CD278-ICOS, CD127-IL7Ra 
           CD38, CD161 
    Negative:  CD45RA, CD19, CD79b, CD197-CCR7, CD8a, CD57, CD56, CD16, CD27, CD25 
           HLA.DR, CD34 
    PC_ 3 
    Positive:  CD16, CD56, CD8a, CD11a, CD57, CD161, CD45RA, CD38, CD3, CD127-IL7Ra 
           CD27, CD11c 
    Negative:  CD19, CD79b, CD4, CD25, HLA.DR, CD197-CCR7, CD69, CD34, CD28, CD45RO 
           CD14, CD278-ICOS 
    PC_ 4 
    Positive:  CD161, CD25, CD45RO, CD56, CD16, CD4, CD79b, CD19, CD28, CD57 
           CD127-IL7Ra, HLA.DR 
    Negative:  CD8a, CD27, CD14, CD45RA, CD3, CD11c, CD69, CD38, CD197-CCR7, CD11a 
           CD34, CD123 
    PC_ 5 
    Positive:  CD4, CD38, CD123, CD16, CD34, CD45RA, CD56, CD278-ICOS, CD28, CD197-CCR7 
           CD27, CD3 
    Negative:  CD45RO, CD8a, CD69, CD161, CD79b, CD19, CD25, CD57, CD14, CD11a 
           CD127-IL7Ra, HLA.DR 
    Warning messages:
    1: Requested number is larger than the number of available items (25). Setting to 25. 
    2: Requested number is larger than the number of available items (25). Setting to 25. 
    3: Requested number is larger than the number of available items (25). Setting to 25. 
    4: Requested number is larger than the number of available items (25). Setting to 25. 
    5: Requested number is larger than the number of available items (25). Setting to 25. 
    6: Cannot add objects with duplicate keys (offending key: PC_), setting key to 'apca_' 
    
    

    随着单细胞多模态技术的发展,我们要习惯一个对象有多个assay的数据结构。数据结构就像一处盖好的大楼,房间都在那了,就是往数据里面填我们计算好的数据,比如这里的apca。

    head(bm@reductions$apca@cell.embeddings)
                            apca_1     apca_2     apca_3     apca_4     apca_5      apca_6     apca_7      apca_8      apca_9    apca_10     apca_11    apca_12     apca_13
    a_AAACCTGAGCTTATCG-1 -1.546214 -1.0560783 -0.3193425  0.5561288  1.1332821 -3.46190907 -0.6363978 -0.69711363 -0.15623767 -0.1291178  0.43826804  0.4848001  0.02257897
    a_AAACCTGAGGTGGGTT-1  2.698538  1.5314222  2.2152152  3.8489785 -0.2302644  0.41040618  1.7093026 -2.08814571  1.93091407  1.2356946  0.21717667 -0.6694691 -1.46231202
    a_AAACCTGAGTACATGA-1  3.287854  0.2591767  0.1191195 -0.7199622  1.7688191  0.01789973 -1.2586785  0.08800729  0.42980044  0.9671439 -0.73346613 -0.3946358 -1.28168949
    a_AAACCTGCAAACCTAC-1  3.100410  2.3937004 -0.9758685  1.0520679  0.0893280  0.01760703 -0.4626806  0.44875816  0.70985188  0.6547350 -0.12642028  0.2164491  1.06710170
    a_AAACCTGCAAGGTGTG-1 -3.707086  2.5657331 -0.3487033 -0.6951109 -0.7601989  0.20388700 -0.7173617 -0.12882660 -0.08661708  0.3569946 -0.10704170 -0.5074276 -0.07922931
    a_AAACCTGCACGGTAGA-1 -2.627561 -3.7290142 -2.2912037  0.5042828 -0.3032089  0.12050431  0.3047013 -0.01124192  0.74586469 -0.4926128  0.04005597  0.3078714  0.62619888
                             apca_14      apca_15     apca_16     apca_17    apca_18     apca_19    apca_20     apca_21     apca_22     apca_23     apca_24
    a_AAACCTGAGCTTATCG-1  0.03922873  0.284971458 -0.05987775 -0.63780150 -0.3893929  0.72669647  0.4473322  0.30682227 -0.56594325 -0.14519199  0.26113206
    a_AAACCTGAGGTGGGTT-1 -1.26101465  0.727477850 -1.33392808  0.06360178 -0.2847066  0.47726469 -0.2159961 -0.75962725  0.31336909  0.27562030 -0.46558363
    a_AAACCTGAGTACATGA-1 -0.09693877 -0.060252586  0.21063702 -0.31319328  0.2271467  0.17773498  0.3805679  0.28309195  0.15551460 -0.01404676 -0.12147257
    a_AAACCTGCAAACCTAC-1  0.09764741 -0.001231692  0.34339056 -0.05304418  0.1969659  0.05418865 -0.2202433 -0.16040501 -0.12525310  0.05894026 -0.03718946
    a_AAACCTGCAAGGTGTG-1  0.24172883  0.040471346  0.24044018  0.34446924 -0.1282743 -0.11257312 -0.2512336  0.18092572  0.23236869 -0.10883569  0.05730309
    a_AAACCTGCACGGTAGA-1 -0.18285025  0.911541345  0.20308131 -0.81552469 -0.5860654  0.25639653  0.3117227 -0.06062992 -0.07510367 -0.29029817 -0.01214964
    
    

    对于每个细胞,我们根据RNA和蛋白质相似度的加权组合来计算它在数据集中最近的邻居。细胞特有的模态权重和多模态邻居在一个函数中计算,该函数在此数据集上运行约2分钟。我们指定每个模态的维数(类似于指定要包含在scRNA-seq集群中的pc的数量),但是您可以改变这些设置,以看到小的更改对总体结果的影响最小。这是Seurat 4.0 最大的一个更新,一个函数整合多模态数据。所谓的模态在我们数据分析中就是多了一个表格与同模态不同的是,这里的数据来源不同。在使用之前我们先看一下这个函数的文档?FindMultiModalNeighbors

    ?FindMultiModalNeighbors
    # Identify multimodal neighbors. These will be stored in the neighbors slot, 
    # and can be accessed using bm[['weighted.nn']]
    # The WNN graph can be accessed at bm[["wknn"]], 
    # and the SNN graph used for clustering at bm[["wsnn"]]
    # Cell-specific modality weights can be accessed at bm$RNA.weight
    bm <- FindMultiModalNeighbors(
      bm, reduction.list = list("pca", "apca"), 
      dims.list = list(1:30, 1:18), modality.weight.name = "RNA.weight"
    )
    

    可以看出整合多模态是在各自的降维空间里进行的,可以指定降维方法和维度。

    Calculating cell-specific modality weights
    Finding 20 nearest neighbors for each modality.
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=18s  
    Calculating kernel bandwidths
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=02s  
    Finding multimodal neighbors
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 07s
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=03s  
    Constructing multimodal KNN graph
    Constructing multimodal SNN graph
    
    

    我们来看看这个加权矩阵有哪些内容。

    str(bm@neighbors$weighted.nn)
    Formal class 'Neighbor' [package "Seurat"] with 5 slots
      ..@ nn.idx    : num [1:30672, 1:20] 21408 26747 2354 14276 2344 ...
      ..@ nn.dist   : num [1:30672, 1:20] 0.146 0.202 0.297 0.336 0.301 ...
      ..@ alg.idx   : NULL
      ..@ alg.info  : list()
      ..@ cell.names: chr [1:30672] "a_AAACCTGAGCTTATCG-1" "a_AAACCTGAGGTGGGTT-1" "a_AAACCTGAGTACATGA-1" "a_AAACCTGCAAACCTAC-1" ...
    
    
    head(bm@neighbors$weighted.nn@nn.idx)
          [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
    [1,] 21408 24052  3638 22480 24855 21388 12916 21993 15870  2248 20577 15011 27793 16535 12400 29694 30082  8634 15448 20145
    [2,] 26747  7037  3857 10338  8055 25281 14106 28950 10337  9524 17717 12946 29411  7192   638 14297 27010 17611  7282  8051
    [3,]  2354  9333  3422  8414 22000 30077   993 29367 19518  8283 20360  4269 17634  8542  9665  8855 12444  3587 14110  1120
    [4,] 14276 28818 12742 11694 27019 16333   809 24540 27160 22614 28896  3774  1841 18030  5055 25481 23135 11424 22303 21478
    [5,]  2344  7799  6268 12324 13830  4142  7393 10436 12439 27303 19328 25339 24713  5597 14001 20544 19067 12394  4108 28421
    [6,]  5797 21480 12600 24915 21105  2197 18431 23934 22306 19231 11036  1687 24800 23613 24455 19422 14841 17582 29698 28724
    
    

    WNN图结构被存在bm[["wknn"]],用于聚类的SNN图在bm[["wsnn"]],既然是图,我们不免想起要用igraph探索一番。以bm[["wknn"]]为例。

    library(igraph)
    net <- graph_from_adjacency_matrix(bm[["wknn"]]) 
    net
    
    net
    IGRAPH 2231c24 DN-- 30672 1014296 -- 
    + attr: name (v/c)
    + edges from 2231c24 (vertex names):
     [1] a_AAACCTGAGCTTATCG-1->a_AAACCTGAGCTTATCG-1 a_AGGCCACTCTTCATGT-1->a_AAACCTGAGCTTATCG-1 a_CACAGTAAGTATTGGA-1->a_AAACCTGAGCTTATCG-1
     [4] a_CTGTTTACACCTCGGA-1->a_AAACCTGAGCTTATCG-1 a_GATGCTAGTCAATGTC-1->a_AAACCTGAGCTTATCG-1 a_TCGGGACGTGACTACT-1->a_AAACCTGAGCTTATCG-1
    + ... omitted several edges
    

    这是一个有向图,节点数是30672 (细胞数),边数量为1014296 。

    is_weighted(net)
    [1] FALSE
     is_simple(net)
    [1] FALSE
     edge_density(net)
    [1] 0.001078188
    transitivity(net)
    [1] 0.1941536
     reciprocity(net, mode="default")
    [1] 1
     is_connected(net)
    [1] TRUE
    comps <- decompose(net)
     table(sapply(comps, vcount))
    
    30672 
        1 
    
    

    边的分布:

    d.net <- degree(net)
    hist(d.net,col="blue",
         xlab="Degree", ylab="Frequency",
         main="Degree Distribution")
    

    取子图,主要是这么多节点不好画图。

    par(mfrow=c(1, 2))
    plot(subnet)
    visualize_graph( subnet, centrality.type="Markov Centrality")
    

    可以看出有的细胞有许多链接,但是大部分细胞和其他的没有什么链接。

    然后,我们基于这个网络,来用igraph的cluster_fast_greedy聚类。

    # igraph::as.undirected()
    kc <- cluster_fast_greedy( as.undirected(net)) # 
    sizes(kc)
    
    Community sizes
       1    2    3    4    5    6    7    8 
      76  377  370 9532   23 3997 8088 8209
    

    聚成了8个类,看每个细胞的归宿。

    head(membership(kc))# 社团归宿 )
    a_AAACCTGAGCTTATCG-1 a_AAACCTGAGGTGGGTT-1 a_AAACCTGAGTACATGA-1 a_AAACCTGCAAACCTAC-1 a_AAACCTGCAAGGTGTG-1 a_AAACCTGCACGGTAGA-1 
                       4                    8                    7                    7                    4                    6 
    

    对聚类结果做个可视化

    library(ape)
    dendPlot(kc, mode="phylo")
    
    细胞太多了

    好了,结束我们对数据结构的探索。这不是教程的一部分,而是我们学了网络图之后,探究起细胞组成的网络是可以有许多自己的视角。我们现在可以使用这些结果进行下游分析,比如可视化和聚类。例如,我们可以基于RNA和蛋白质数据的加权组合创建数据的UMAP可视化。我们还可以执行基于图形的聚类,并在UMAP上可视化这些结果,以及一组细胞注释。

    
    bm <- RunUMAP(bm, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
    bm <- FindClusters(bm, graph.name = "wsnn", algorithm = 3, resolution =seq( 0,2,0.4), verbose = FALSE)
    
    
    head(bm@meta.data)
                         orig.ident nCount_RNA nFeature_RNA nCount_ADT nFeature_ADT      lane  donor      celltype.l1 celltype.l2 RNA.weight wsnn_res.2
    a_AAACCTGAGCTTATCG-1     bmcite       7546         2136       1350           25 HumanHTO4 batch1 Progenitor cells    Prog_RBC  0.4827011         19
    a_AAACCTGAGGTGGGTT-1     bmcite       1029          437       2970           25 HumanHTO1 batch1           T cell         gdT  0.2417890         10
    a_AAACCTGAGTACATGA-1     bmcite       1111          429       2474           23 HumanHTO5 batch1           T cell   CD4 Naive  0.5077136          1
    a_AAACCTGCAAACCTAC-1     bmcite       2741          851       4799           25 HumanHTO3 batch1           T cell  CD4 Memory  0.4313079          4
    a_AAACCTGCAAGGTGTG-1     bmcite       2099          843       5434           25 HumanHTO2 batch1          Mono/DC   CD14 Mono  0.5685085          2
    a_AAACCTGCACGGTAGA-1     bmcite       2291          783       4658           25 HumanHTO6 batch1           B cell     Naive B  0.4255879          5
                         seurat_clusters wsnn_res.0 wsnn_res.0.4 wsnn_res.0.8 wsnn_res.1.2 wsnn_res.1.6
    a_AAACCTGAGCTTATCG-1              19          0            9            8           18           19
    a_AAACCTGAGGTGGGTT-1              10          0           10            9            9            9
    a_AAACCTGAGTACATGA-1               1          0            1            1            0            1
    a_AAACCTGCAAACCTAC-1               4          0            3            3            4            4
    a_AAACCTGCAAGGTGTG-1               2          0            0            0            1            2
    a_AAACCTGCACGGTAGA-1               5          0            4            4            5            5
    
    

    注意,这里有一列RNA.weight是指在RNA和膜蛋白中RNA的权重,而Protein weight is 1 - RNA weight.

    library(clustree)
    clustree(bm,prefix = 'wsnn_res.')
    
    p1 <- DimPlot(bm, reduction = 'wnn.umap', label = TRUE, repel = TRUE, label.size = 2.5) + NoLegend()
    p2 <- DimPlot(bm, reduction = 'wnn.umap', group.by = 'celltype.l2', label = TRUE, repel = TRUE, label.size = 2.5) + NoLegend()
    p1 + p2
    
    

    我们也可以计算UMAP可视化仅基于RNA和蛋白质数据和比较。我们发现,在识别祖细胞状态方面,RNA分析比ADT分析提供的信息更多(ADT panel包含分化细胞的标记),而T细胞状态则相反(ADT分析优于RNA分析)。

    bm <- RunUMAP(bm, reduction = 'pca', dims = 1:30, assay = 'RNA', 
                  reduction.name = 'rna.umap', reduction.key = 'rnaUMAP_')
    bm <- RunUMAP(bm, reduction = 'apca', dims = 1:18, assay = 'ADT', 
                  reduction.name = 'adt.umap', reduction.key = 'adtUMAP_') # 这个dims选择的有技术含量啊
    
    
    p3 <- DimPlot(bm, reduction = 'rna.umap', group.by = 'celltype.l2', label = TRUE, 
                  repel = TRUE, label.size = 2.5) + NoLegend()
    p4 <- DimPlot(bm, reduction = 'adt.umap', group.by = 'celltype.l2', label = TRUE, 
                  repel = TRUE, label.size = 2.5) + NoLegend()
    p3 + p4
    

    我们可以可视化多模态UMAP上典型标记基因和蛋白的表达,这有助于验证所提供的注释:

    p5 <- FeaturePlot(bm, features = c("adt_CD45RA","adt_CD16","adt_CD161"),
                      reduction = 'wnn.umap', max.cutoff = 2, 
                      cols = c("lightgrey","darkgreen"), ncol = 3)
    p6 <- FeaturePlot(bm, features = c("rna_TRDC","rna_MPO","rna_AVP"), 
                      reduction = 'wnn.umap', max.cutoff = 3, ncol = 3)
    p5 / p6
    

    最后,我们可以将每个细胞的模态权重可视化。RNA重量最高的每个群体代表祖细胞,而蛋白重量最高的群体代表T细胞。这符合我们的生物学预期,因为抗体panel不包含可以区分不同祖细胞群体的标记。

    VlnPlot(bm, features = "RNA.weight", group.by = 'celltype.l2', sort = TRUE, pt.size = 0) +
      NoLegend()
    

    https://satijalab.org/seurat/v4.0/weighted_nearest_neighbor_analysis.html
    https://github.com/satijalab/seurat/issues/3670

    相关文章

      网友评论

        本文标题:Seurat 4.0 || 分析scRNA和膜蛋白数据

        本文链接:https://www.haomeiwen.com/subject/odzdvktx.html