cell2location学习笔记

作者: 小丑竟是我自己0815 | 来源:发表于2022-03-28 11:15 被阅读0次

    Cell2location 是一个原则性的贝叶斯模型,它可以解析空间转录数据中的细胞类型,并创建不同组织的全面细胞图。Cell2location 解释了变异的技术来源,并借用了不同位置的统计强度,从而使单细胞和空间转录组的集成比现有工具具有更高的灵敏度和分辨率。这是通过估计哪种细胞类型的组合,其细胞丰度可以给出空间数据中的mRNA counts数,同时模拟技术效应(平台/技术效应,污染 RNA,无法解释的方差)来实现的。
    Cell2location需要未转化、未标准化的空间mRNA counts数作为输入;还需要提供给Cell2location每个位置被期望的细胞丰度,这是用来作为一个估计完整细胞丰度的先导,这个数值取决于组织,可以通过计算配对组织学图像中少数部位的细胞核来估计,可以是近似的。
    Cell2location提供了从 scRNA-seq 数据中估计参考细胞类型特征的两种方法:

    1. 一种基于负二项回归的统计方法。开发者通常推荐使用 NB 回归,它允许将跨技术和批次处理的数据有力地结合起来,从而提高了空间映射的准确性。
    2. 硬编码计算个体基因的每簇平均 mRNA counts数(cell2location.cluster_averages.compute_cluster_averages)。当批次效应很小时,这种计算每个集群平均值的更快的硬编码方法提供了同样高的准确性。我们还建议对于非 UMI 技术使用硬编码方法,如 Smart-Seq 2。
      image.png

    加载需要的包

    import sys
    
    IN_COLAB = "google.colab" in sys.modules
    if IN_COLAB and branch == "stable":
        !pip install --quiet git+https://github.com/BayraktarLab/cell2location#egg=cell2location[tutorials]
    
    import scanpy as sc
    import anndata
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    
    import cell2location
    import scvi
    
    from matplotlib import rcParams
    rcParams['pdf.fonttype'] = 42 # enables correct plotting of text
    import seaborn as sns
    #定义分析结果保存在哪
    results_folder = './results/lymph_nodes_analysis/'
    
    # create paths and names to results folders for reference regression and cell2location models
    ref_run_name = f'{results_folder}/reference_signatures'
    run_name = f'{results_folder}/cell2location_map'
    

    加载 Visium 和 scRNA-seq 参考数据

    #通过scanpy导入公开发表的淋巴结空间转录组数据
    adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
    adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]
    
    # rename genes to ENSEMBL
    adata_vis.var['SYMBOL'] = adata_vis.var_names#_names:提取行名
    adata_vis.var_names = adata_vis.var['gene_ids']
    adata_vis.var_names.name = None#_names.name = None将名称设为无
    

    注:线粒体编码基因(基因名称以 MT 或 MT-开头)对空间定位无关,因为它们的表达代表了单个细胞和细胞核数据中的技术性假象,而不是线粒体的生物丰度。然而,这些基因在每个位置的mRNA中占15-40% 。因此,为了避免绘制伪影,强烈建议移除线粒体基因。

    # find mitochondria-encoded (MT) genes
    adata_vis.var['MT_gene'] = [gene.startswith('MT-') for gene in adata_vis.var['SYMBOL']]
    
    # remove MT genes for spatial mapping (keeping their counts in the object)
    adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()#传递了稀疏矩阵,但需要密集数据。使用X.toarray()转换为密集的numpy数组
    adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]
    

    已发表的淋巴结 scRNA-seq 数据集通常缺乏足够的代表性的生发中心相关的免疫细胞群体,由于患者捐赠者的年龄。因此,我们将涵盖淋巴结、脾脏和扁桃体的 scRNA-seq 数据集包含在单细胞参考数据中,以确保在空间转录数据集中能够捕捉到可能存在的免疫细胞状态的全部多样性。
    在这里,我们下载这个数据集,导入到anndata并改变变量名称为 ENSEMBL 基因标识符。

    # Download data if not already here
    import os
    if not os.path.exists('./data/sc.h5ad'):
        !cd ./data/ && wget https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad
    
    # Read data
    adata_ref = sc.read(f'./data/sc.h5ad')
    
    # Use ENSEMBL as gene IDs to make sure IDs are unique and correctly matched
    adata_ref.var['SYMBOL'] = adata_ref.var.index
    adata_ref.var.index = adata_ref.var['GeneID-2'].copy()
    adata_ref.var_names = adata_ref.var['GeneID-2'].copy()
    adata_ref.var.index.name = None
    adata_ref.raw.var['SYMBOL'] = adata_ref.raw.var.index
    adata_ref.raw.var.index = adata_ref.raw.var['GeneID-2'].copy()
    adata_ref.raw.var.index.name = None
    
    # before we estimate the reference cell type signature we recommend to perform very permissive genes selection
    # in this 2D histogram orange rectangle lays over excluded genes.
    # In this case, the downloaded dataset was already filtered using this method,
    # hence no density under the orange rectangle
    from cell2location.utils.filtering import filter_genes
    selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)
    
    # filter the object
    adata_ref = adata_ref[:, selected].copy()
    
    image.png

    参考细胞类型特征的估计(NB 回归)

    # prepare anndata for the regression model
    scvi.data.setup_anndata(adata=adata_ref,
                            # 10X reaction / sample / batch
                            batch_key='Sample',
                            # cell type, covariate used for constructing signatures
                            labels_key='Subset',
                            # multiplicative technical effects (platform, 3' vs 5', donor effect)
                            categorical_covariate_keys=['Method']
                           )
    scvi.data.view_anndata_setup(adata_ref)
    
    image.png
    image.png
    image.png
    # create and train the regression model
    from cell2location.models import RegressionModel
    mod = RegressionModel(adata_ref)
    
    # Use all data for training (validation not implemented yet, train_size=1)
    mod.train(max_epochs=250, batch_size=2500, train_size=1, lr=0.002, use_gpu=True)
    
    # plot ELBO loss history during training, removing first 20 epochs from the plot
    mod.plot_history(20)
    
    image.png
    # In this section, we export the estimated cell abundance (summary of the posterior distribution).
    adata_ref = mod.export_posterior(
        adata_ref, sample_kwargs={'num_samples': 1000, 'batch_size': 2500, 'use_gpu': True}
    )
    
    # Save model
    mod.save(f"{ref_run_name}", overwrite=True)
    
    # Save anndata object with results
    adata_file = f"{ref_run_name}/sc.h5ad"
    adata_ref.write(adata_file)
    adata_file
    
    检查QC后的图
    1. 重建准确性去评估是否有任何组织存在问题
    2. 由于批次效应,估计的表达特征不同于每个集群中的平均表达。对于不受批次效应影响的 scRNA-seq 数据集(这个数据集) ,可以使用聚类平均表达式来代替使用模型估计特征。当这个图与对角线图非常不同时(例如 y 轴上的值非常低,密度无处不在) ,它表明特征估计存在问题。
    mod.plot_QC()
    
    image.png
    #模型和输出 h5ad 可以像下面这样加载:
    mod = cell2location.models.RegressionModel.load(f"{ref_run_name}", adata_ref)
    adata_file = f"{ref_run_name}/sc.h5ad"
    adata_ref = sc.read_h5ad(adata_file)
    
    # export estimated expression in each cluster
    if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
        inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
                                        for i in adata_ref.uns['mod']['factor_names']]].copy()
    else:
        inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
                                        for i in adata_ref.uns['mod']['factor_names']]].copy()
    inf_aver.columns = adata_ref.uns['mod']['factor_names']
    inf_aver.iloc[0:5, 0:5]
    
    image.png

    Cell2location: 空间比对

    # find shared genes and subset both anndata and reference signatures
    intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
    adata_vis = adata_vis[:, intersect].copy()
    inf_aver = inf_aver.loc[intersect, :].copy()
    
    # prepare anndata for cell2location model
    scvi.data.setup_anndata(adata=adata_vis, batch_key="sample")
    scvi.data.view_anndata_setup(adata_vis)
    
    image.png
    在这里,需要指定2个用户提供的超参数(n_cells_per_locationdetection_alpha)
    注:虽然可以经常使用 detection_alpha 超参数的默认值,但是将预期的细胞丰度 n_cells per_location调整到每个组织是有用的。
    这个值可以通过成对的组织学图像估计,如上面的注释所述。将本教程中提供的值(n_cells_ per_location = 30)更改为在您的组织中观察到的值。
    # create and train the model
    mod = cell2location.models.Cell2location(
        adata_vis, cell_state_df=inf_aver,
        # the expected average cell abundance: tissue-dependent
        # hyper-prior which can be estimated from paired histology:
        N_cells_per_location=30,
        # hyperparameter controlling normalisation of
        # within-experiment variation in RNA detection (using default here):
        detection_alpha=200
    )
    
    mod.train(max_epochs=30000,
              # train using full data (batch_size=None)
              batch_size=None,
              # use all data points in training because
              # we need to estimate cell abundance at all locations
              train_size=1,
              use_gpu=True)
    
    # plot ELBO loss history during training, removing first 100 epochs from the plot
    mod.plot_history(1000)
    plt.legend(labels=['full data training']);
    
    image.png
    # In this section, we export the estimated cell abundance (summary of the posterior distribution).
    adata_vis = mod.export_posterior(
        adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'use_gpu': True}
    )
    
    # Save model
    mod.save(f"{run_name}", overwrite=True)
    
    # mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
    
    # Save anndata object with results
    adata_file = f"{run_name}/sp.h5ad"
    adata_vis.write(adata_file)
    adata_file
    
    image.png
    #模型和输出 h5ad 可以像下面这样加载:
    mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
    adata_file = f"{run_name}/sp.h5ad"
    adata_vis = sc.read_h5ad(adata_file)
    
    # Examine reconstruction accuracy to assess if there are any issues with mapping
    # the plot should be roughly diagonal, strong deviations will signal problems
    mod.plot_QC()
    
    image.png

    当整合多个空间批次时和当处理载玻片中检测到的 RNA 差异很大的数据集时(这在组织学上无法用高细胞密度来解释) ,评估 cell2location 是否使这些影响恢复正常是很重要的。
    期望看到的是不同批次中有相似的总细胞数和不同的RNA检测灵敏度(这都可以通过cell2location评估)。
    期望组织学中的总细胞数量反映出高细胞密度。

    Visualising cell abundance in spatial coordinates

    # add 5% quantile, representing confident cell abundance, 'at least this amount is present',
    # to adata.obs with nice names for plotting
    adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']
    
    # select one slide
    from cell2location.utils import select_slide
    slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
    
    # plot in spatial coordinates
    with mpl.rc_context({'axes.facecolor':  'black','figure.figsize': [4.5, 5]}):
        sc.pl.spatial(slide, cmap='magma',
                      # show first 8 cell types
                      color=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC','B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo'],
                      ncols=4, size=1.3,
                      img_key='hires',
                      # limit color scale at 99.2% quantile of cell abundance
                      vmin=0, vmax='p99.2',
    save=True
                     )
    
    image.png
    # Now we use cell2location plotter that allows showing multiple cell types in one panel
    from cell2location.plt import plot_spatial
    
    # select up to 6 clusters
    clust_labels = ['T_CD4+_naive', 'B_naive', 'FDC']
    clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels
    
    slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
    
    import matplotlib.pyplot as plt
    with mpl.rc_context({'figure.figsize': (15, 15)}):
        fig = plot_spatial(
            adata=slide,
            # labels to show on a plot
            color=clust_col, labels=clust_labels,
            show_img=True,
            # 'fast' (white background) or 'dark_background'
            style='fast',
            # limit color scale at 99.2% quantile of cell abundance
            max_color_quantile=0.992,
            # size of locations (adjust depending on figure size)
            circle_diameter=6,
            colorbar_position='right'
        )
        plt.savefig('showing_multiple_cell_types_in_one_panel.pdf')
    
    image.png

    下游分析

    基于 Leiden 聚类的离散组织区域识别

    我们通过使用由cell2location估计的细胞丰度来确定细胞组成不同的组织区域。
    我们通过估计每种细胞类型的细胞丰度来聚集 Visium 点,从而找到组织区域。
    首先构造了一个 k 邻近的 neigbour (KNN)图表示细胞丰度估计值中位置的相似性,然后应用 Leiden 聚类算法进行聚类。邻近区域的数量应适应数据集的大小和解剖学定义区域的大小(例如海马区域相当小,因此可能被大n_neighbors所掩盖)。这可以做一个范围的 KNN 近邻和Leiden聚类分辨率,直到一个聚类匹配的组织解剖结构获得。
    聚类是跨所有 Visium 部分/批次共同完成的,因此区域身份是直接可比的。当多个批次之间存在强烈的技术效应时(这里不是这种情况) ,sc.external.pp.bbknn 原则上可以用来解释 KNN 构建过程中的这些效应。
    生成的集群保存在 adata _ vis. obs [‘ region _ cluster’]

    # compute KNN using the cell2location output stored in adata.obsm
    sc.pp.neighbors(adata_vis, use_rep='q05_cell_abundance_w_sf',
                    n_neighbors = 15)
    
    # Cluster spots into regions using scanpy
    sc.tl.leiden(adata_vis, resolution=1.1)
    
    # add region as categorical variable
    adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"].astype("category")
    
    # compute UMAP using KNN graph based on the cell2location output
    sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)
    
    # show regions in UMAP coordinates
    with mpl.rc_context({'axes.facecolor':  'white',
                         'figure.figsize': [8, 8]}):
        sc.pl.umap(adata_vis, color=['region_cluster'], size=30,
                   color_map = 'RdPu', ncols = 2, legend_loc='on data',
                   legend_fontsize=20)
        sc.pl.umap(adata_vis, color=['sample'], size=30,
                   color_map = 'RdPu', ncols = 2,
                   legend_fontsize=20)
    
    # plot in spatial coordinates
    with mpl.rc_context({'axes.facecolor':  'black',
                         'figure.figsize': [4.5, 5]}):
        sc.pl.spatial(adata_vis, color=['region_cluster'],
                      size=1.3, img_key='hires', alpha=0.5)
    
    image.png
    image.png

    利用基质因子分解(NMF)识别细胞区间/组织区域

    在这里,我们使用 cell2location 映射结果来确定细胞类型的空间共存,以便更好地理解组织组织和预测细胞相互作用。我们用cell2location进行了细胞类型丰度的非负矩阵分解估计。与将 NMF 应用于常规 scRNA-seq 的有点相似,加性 NMF 分解产生了一组空间细胞类型丰度分布图,这些分布图可以捕捉共同定位的细胞类型。这种基于NMF 的分解自然地解释了多种细胞类型和微环境可以在相同的 Visium 位置共存的事实 ,同时跨组织区域(例如个体生发中心)共享信息。
    提示在实际应用中,最好对一系列因子 R = 5,..,30进行 NMF 训练。并选择 R作为捕获细小组织区域和分裂已知部分之间的平衡。如果你想找到一些最独特的细胞区间,使用小一点的R。如果你想找到非常强的同位信号,并假设大多数细胞类型不同位,使用大一点的R。
    下面我们将展示如何执行此分析。为了帮助这个分析,我们把这个分析包装成一个管道,这个管道可以自动训练不同R的 NMF 模型:

    from cell2location import run_colocation
    res_dict, adata_vis = run_colocation(
        adata_vis,
        model_name='CoLocatedGroupsSklearnNMF',
        train_args={
          'n_fact': np.arange(11, 13), # IMPORTANT: use a wider range of the number of factors (5-30)
          'sample_name_col': 'sample', # columns in adata_vis.obs that identifies sample
          'n_restarts': 3 # number of training restarts
        },
        export_args={'path': f'{run_name}/CoLocatedComb/'}
    )
    
    image.png
    对于每个factor number,模型生成以下文件夹输出列表:
    cell_type_fractions_heatmap/:a dot plot of the estimated NMF weights of cell types (rows) across NMF components (columns)
    cell_type_fractions_mean/:the data used for dot plot
    factor_markers/: tables listing top 10 cell types most speficic to each NMF factor
    models/:saved NMF models
    predictive_accuracy/:2D histogram plot showing how well NMF explains cell2location output
    spatial/:NMF weights across locatinos in spatial coordinates
    location_factors_mean/:the data used for the plot in spatial coordiantes
    stability_plots/:stability of NMF weights between training restarts
    要检查的关键输出是 cell _ type _ fraction _ heatmap/中的文件,其中显示了 NMF 组件(列)中对应于细胞间隔的各种类型(行)的估计 NMF 权重的点图。显示的是相对权重,每个单元类型的组件之间的规范化。
    # Here we plot the NMF weights (Same as saved to `cell_type_fractions_heatmap`)
    res_dict['n_fact12']['mod'].plot_cell_type_loadings()
    
    image.png

    高级应用

    估计空间数据中每个基因的细胞类型特异性表达

    为此,我们采用了由 Cable 等人提出的估计条件期望表达式的方法。使用 cell2location,我们可以查看后验概率分布,而不仅仅是细胞型特定表达式的估计值

    # Compute expected expression per cell type
    expected_dict = mod.module.model.compute_expected_per_cell_type(
        mod.samples["post_sample_q05"], mod.adata
    )
    
    # Add to anndata layers
    for i, n in enumerate(mod.factor_names_):
        adata_vis.layers[n] = expected_dict['mu'][i]
    
    # Save anndata object with results
    adata_file = f"{run_name}/sp.h5ad"
    adata_vis.write(adata_file)
    adata_file
    
    image.png
    # Look at cell type specific expression in spatial coordinates,
    # Here we highlight CD3D, pan T-cell marker expressed by
    # 2 subtypes of T cells in distinct locations but not expressed by co-located B cells
    ctypes = ['T_CD4+_TfH_GC', 'T_CD4+_naive', 'B_GC_LZ']
    genes = ['CD3D', 'CR2']
    
    with mpl.rc_context({'axes.facecolor':  'black'}):
        # select one slide
        slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
    
        from tutorial_utils import plot_genes_per_cell_type
        plot_genes_per_cell_type(slide, genes, ctypes);
    
    image.png

    与后验概率计算一起,计算任意分位数

    # Get posterior distribution samples for specific variables
    samples_w_sf = mod.sample_posterior(num_samples=1000, use_gpu=True, return_samples=True,
                                        batch_size=2020,
                                        return_sites=['w_sf', 'm_g', 'u_sf_mRNA_factors'])
    # samples_w_sf['posterior_samples'] contains 1000 samples as arrays with dim=(num_samples, ...)
    samples_w_sf['posterior_samples']['w_sf'].shape
    
    # Compute any quantile of the posterior distribution
    medians = mod.posterior_quantile(q=0.5, use_gpu = True)
    
    with mpl.rc_context({'axes.facecolor':  'white',
                         'figure.figsize': [5, 5]}):
        plt.scatter(medians['w_sf'].flatten(), mod.samples['post_sample_means']['w_sf'].flatten());
        plt.xlabel('median');
        plt.ylabel('mean');
    
    image.png
    参考:https://cell2location.readthedocs.io/en/latest/notebooks/cell2location_tutorial.html

    相关文章

      网友评论

        本文标题:cell2location学习笔记

        本文链接:https://www.haomeiwen.com/subject/xmzejrtx.html