Scanpy分析单细胞数据：轨迹分析PAGA

作者: Hayley笔记 | 来源:发表于2022-03-05 22:48 被阅读0次

单细胞转录组数据分析|| scanpy教程：使用ingest和B
单细胞转录组数据分析|| scanpy教程：可视化套件
Scanpy分析单细胞数据：轨迹分析PAGA
单细胞转录组数据分析|| scanpy教程：PAGA轨迹推断
GPU版scanpy (rapids)实践 | 大型单细胞数据分
单细胞转录组数据分析|| scanpy教程：预处理与聚类
scanpy分析单细胞数据
scanpy分析单细胞数据
单细胞转录数据分析之Scanpy
单细胞不同样本数据整合-解决AnnData合并时ValueErr

PAGA全称Partition-based graph abstraction（基于分区的图抽象）。在一种类似于图聚类的方法中，PAGA生成数据的最近邻图，然后生成细胞的分组，连接细胞之间的连接比随机期望的更多的组，从而构建数据的摘要图。由于“短路”在细胞群之间的连接比在单个细胞之间的连接更容易识别，所以PAGA排除了细胞群之间的虚假连接。

1. 导入数据

使用预处理和聚类中保存的注释好的pbmc3k.h5ad文件来做拟时序分析。（原则上PBMC的数据不推荐用来做拟时序，这里仅做演示）

安装scanpy的时候需要使用pip install 'scanpy[louvain]'，直接使用pip install scanpy后面跑sc.tl.paga(adata, groups='louvain')会报错。

import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc

sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
results_file = './pbmc3k_pseudo.h5ad'
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3), facecolor='white')  # low dpi (dots per inch) yields small inline figures

adata = sc.read_h5ad("pbmc3k.h5ad")
adata

adata.X = adata.X.astype('float64')

2. 预处理和可视化

sc.tl.draw_graph(adata)
sc.pl.draw_graph(adata, color='leiden', legend_loc='on data',title = "")
#pl.savefig("draw_graph.pdf")

看起来比较混乱，但是这个降维图是怎么来的呢？

让我们来看看sc.tl.draw_graph()这个函数的帮助文档

help(sc.tl.draw_graph)
<pre style="caret-color: rgb(0, 0, 0); color: var(--jp-content-font-color1); font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-indent: 0px; text-transform: none; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; text-decoration: none; box-sizing: unset; font-family: var(--jp-code-font-family); font-size: var(--jp-code-font-size); line-height: normal; border: none; margin: 0px; padding: 0px; overflow: auto; word-break: break-all; overflow-wrap: break-word; white-space: pre-wrap; font-variant-ligatures: normal; orphans: 2; text-align: left; widows: 2; background-color: rgb(255, 255, 255); text-decoration-thickness: initial;">Help on function draw_graph in module scanpy.tools._draw_graph:

draw_graph(adata: anndata._core.anndata.AnnData, layout: Literal['fr', 'drl', 'kk', 'grid_fr', 'lgl', 'rt', 'rt_circular', 'fa'] = 'fa', init_pos: Union[str, bool, NoneType] = None, root: Union[int, NoneType] = None, random_state: Union[NoneType, int, numpy.random.mtrand.RandomState] = 0, n_jobs: Union[int, NoneType] = None, adjacency: Union[scipy.sparse.base.spmatrix, NoneType] = None, key_added_ext: Union[str, NoneType] = None, neighbors_key: Union[str, NoneType] = None, obsp: Union[str, NoneType] = None, copy: bool = False, **kwds)
    Force-directed graph drawing [Islam11]_ [Jacomy14]_ [Chippada18]_.

    An alternative to tSNE that often preserves the topology of the data
    better. This requires to run :func:`~scanpy.pp.neighbors`, first.

    The default layout ('fa', `ForceAtlas2`) [Jacomy14]_ uses the package |fa2|_
    [Chippada18]_, which can be installed via `pip install fa2`.

    `Force-directed graph drawing`_ describes a class of long-established
    algorithms for visualizing graphs.
    It has been suggested for visualizing single-cell data by [Islam11]_.
    Many other layouts as implemented in igraph [Csardi06]_ are available.
    Similar approaches have been used by [Zunder15]_ or [Weinreb17]_.

    .. |fa2| replace:: `fa2`
    .. _fa2: [https://github.com/bhargavchippada/forceatlas2](https://github.com/bhargavchippada/forceatlas2)
    .. _Force-directed graph drawing: [https://en.wikipedia.org/wiki/Force-directed_graph_drawing](https://en.wikipedia.org/wiki/Force-directed_graph_drawing)

    Parameters
    ----------
    adata
        Annotated data matrix.
    layout
        'fa' (`ForceAtlas2`) or any valid `igraph layout
        <[http://igraph.org/c/doc/igraph-Layout.html>`__](http://igraph.org/c/doc/igraph-Layout.html%3E%60__). Of particular interest
        are 'fr' (Fruchterman Reingold), 'grid_fr' (Grid Fruchterman Reingold,
        faster than 'fr'), 'kk' (Kamadi Kawai', slower than 'fr'), 'lgl' (Large
        Graph, very fast), 'drl' (Distributed Recursive Layout, pretty fast) and
        'rt' (Reingold Tilford tree layout).
    root
        Root for tree layouts.
    random_state
        For layouts with random initialization like 'fr', change this to use
        different intial states for the optimization. If `None`, no seed is set.
    adjacency
        Sparse adjacency matrix of the graph, defaults to neighbors connectivities.
    key_added_ext
        By default, append `layout`.
    proceed
        Continue computation, starting off with 'X_draw_graph_`layout`'.
    init_pos
        `'paga'`/`True`, `None`/`False`, or any valid 2d-`.obsm` key.
        Use precomputed coordinates for initialization.
        If `False`/`None` (the default), initialize randomly.
    neighbors_key
        If not specified, draw_graph looks .obsp['connectivities'] for connectivities
        (default storage place for pp.neighbors).
        If specified, draw_graph looks
        .obsp[.uns[neighbors_key]['connectivities_key']] for connectivities.
    obsp
        Use .obsp[obsp] as adjacency. You can't specify both
        `obsp` and `neighbors_key` at the same time.
    copy
        Return a copy instead of writing to adata.
    **kwds
        Parameters of chosen igraph layout. See e.g. `fruchterman-reingold`_
        [Fruchterman91]_. One of the most important ones is `maxiter`.

        .. _fruchterman-reingold: [http://igraph.org/python/doc/igraph.Graph-class.html#layout_fruchterman_reingold](http://igraph.org/python/doc/igraph.Graph-class.html#layout_fruchterman_reingold)

    Returns
    -------
    Depending on `copy`, returns or updates `adata` with the following field.

    **X_draw_graph_layout** : `adata.obsm`
        Coordinates of graph layout. E.g. for layout='fa' (the default),
        the field is called 'X_draw_graph_fa'</pre>

这是tsne的一种代替方案，由fa2库完成，绘图用的是Force-directed_graph_drawing，说明文档中给出了相关的链接。

看一下umap的结果

sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden'])

3. Denoising the graph（可选步骤）

##去噪,扩散图空间表示
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X_diffmap')
sc.tl.draw_graph(adata)
sc.pl.draw_graph(adata, color='leiden', legend_loc='on data') #legend_loc='right margin'
#pl.savefig("draw_graph_diffmap.pdf")

额还是有点乱

4. 聚类和PAGA

## sc.tl.louvain(adata, resolution=1.0) 前面已经做过这一步了，做了3再做4，重新聚类分的群的数目会和一开始不一样
sc.tl.paga(adata, groups='leiden')
sc.pl.paga(adata, color=['leiden', 'CD4', 'CD8A', 'CD14'])

这是一个无向的网络图，点的颜色代表不同分群，点的大小代表该群细胞数的大小，连线只有一种颜色，粗细代表两个群之间连接更密切。这里并没有一个拟时的概念，但是下面我们会在这个基础上推断出一个拟时结构。

adata.obs['leiden'].cat.categories
# Index(['0', '1', '2', '3', '4', '5', '6', '7'], dtype='object')
#       dtype='object')

我们根据已知marker基因识别细胞类别，可以将细胞类型的信息注释上去。当然我们前面已经注释好了，保存在adata.obs['leiden_anno']。

# adata.obs['leiden_anno'] = adata.obs['leiden']
# adata.obs['leiden_anno'].cat.categories = ['CD4 T', 'CD14 Monocytes','B', 'CD8 T','NK', 'FCGR3A Monocytes','Dendritic', 'Megakaryocytes']
sc.tl.paga(adata, groups='leiden_anno')
sc.pl.paga(adata, threshold=0.03, show=False)

#利用PAGA重新计算细胞之间的距离
sc.tl.draw_graph(adata, init_pos='paga')
sc.pl.draw_graph(adata, color=['leiden_anno','MS4A1', 'NKG7', 'PPBP'], legend_loc='on data')

查看颜色，可以自行定义颜色

pl.figure(figsize=(8, 2))
for i in range(28):
    pl.scatter(i, 1, c=sc.pl.palettes.zeileis_28[i], s=200)
pl.show()

zeileis_colors = np.array(sc.pl.palettes.zeileis_28)
new_colors = np.array(adata.uns['leiden_anno_colors'])
new_colors[[0]] = zeileis_colors[[12]]  # CD4 T colors / green
new_colors[[1]] = zeileis_colors[[5]]  # CD14 Monocytes colors / red
new_colors[[2]] = zeileis_colors[[17]]  # B colors / yellow
new_colors[[3]] = zeileis_colors[[2]]  # CD8 T / grey
new_colors[[4]] = zeileis_colors[[18]]  # NK / turquoise
new_colors[[5]] = zeileis_colors[[6]]  # FCGR3A Monocytes / light blue
new_colors[[6]] = zeileis_colors[[0]]  # Dendritic / dark blue
new_colors[[7]] = zeileis_colors[[25]]  # Megakaryocytes / grey
#new_colors[[10, 17, 5, 3, 15, 6, 18, 13, 7, 12]] = zeileis_colors[[5, 5, 5, 5, 11, 11, 10, 9, 21, 21]]  # CD14 Monocytes colors / red

adata.uns['leiden_anno_colors'] = new_colors
sc.pl.paga_compare(
    adata, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5,
    legend_fontsize=12, fontsize=12, frameon=False, edges=True)

5. 针对给定的一组基因，沿PAGA路径重建基因变化

定义分化起点，计算每个细胞的拟时间，画拟时间分布（这儿我是随便取的B细胞作为root，请根据自身数据细胞类别选取）

adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden_anno']  == 'B')[0] ##假设分化起点为B cells,当然自己分析的时候需要根据数据实际情况选择分化起点
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color=['leiden_anno', 'dpt_pseudotime'],  legend_loc='on data',title = ['','pseudotime'], frameon=True)
sc.pl.draw_graph(adata, color=['leiden', 'dpt_pseudotime'],  legend_loc='on data',title = ['','pseudotime'], frameon=True)

针对给定基因，沿PAGA路径重建基因变化

gene_names = ['IL7R', 'CD14', 'LYZ',  'MS4A1', 'CD8A', 'CD8B', 'FCGR3A', 'MS4A7','GNLY', 'NKG7', 'FCER1A', 'CST3']                     # 选取了一列marker 基因，要根据实际情况选取
data_raw = sc.read_10x_mtx('./data/filtered_gene_bc_matrices/hg19/', var_names='gene_symbols', cache=True)  
data_raw.var_names_make_unique()
sc.pp.filter_cells(data_raw, min_genes=200)   # 去除表达基因200以下的细胞
sc.pp.filter_genes(data_raw, min_cells=3)     # 去除在3个细胞以下表达的基因
mito_genes=data_raw.var_names.str.startswith('MT-')
data_raw.obs['percent_mito']=np.sum(data_raw[:,mito_genes].X,axis=1).A1/np.sum(data_raw.X,axis=1).A1
data_raw.obs['n_counts']=data_raw.X.sum(axis=1).A1
data_raw = data_raw[data_raw.obs.n_genes < 2500, :]
data_raw = data_raw[data_raw.obs.percent_mito < 0.05, :]
sc.pp.normalize_total(data_raw, target_sum=1e4)
sc.pp.log1p(data_raw)
sc.pp.scale(data_raw)
adata.raw = data_raw #使用完整的原始数据进行可视化，处理方法和预处理和聚类里的一样

paths = [('NK', [2,0,3]), ('FCGR3A Monocytes', [2,6,1])] #自定义轨迹，以B细胞为起点，目标细胞为分化轨迹终点
adata.obs['distance'] = adata.obs['dpt_pseudotime']
adata.obs['clusters'] = adata.obs['leiden']  # just a cosmetic change
adata.uns['clusters_colors'] = adata.uns['leiden_anno_colors']
_, axs = pl.subplots(ncols=2, figsize=(6, 2.5), gridspec_kw={'wspace': 0.05, 'left': 0.12})
pl.subplots_adjust(left=0.05, right=0.98, top=0.82, bottom=0.2)
for ipath, (descr, path) in enumerate(paths):
    _, data = sc.pl.paga_path(
        adata, path, gene_names,
        show_node_names=False,
        ax=axs[ipath],
        ytick_fontsize=12,
        left_margin=0.15,
        n_avg=50,
        annotations=['distance'],
        show_yticks=True if ipath==0 else False,
        show_colorbar=False,
        color_map='Greys',
        groups_key='clusters',
        color_maps_annotations={'distance': 'viridis'},
        title='{} path'.format(descr),
        return_data=True,
        show=False)
    data.to_csv('./paga_path_{}.csv'.format(descr))
pl.savefig('paga_path_pbmc.pdf')
pl.show()

参考：https://scanpy-tutorials.readthedocs.io/en/latest/paga-paul15.html#Recomputing-the-embedding-using-PAGA-initialization

单细胞转录组数据分析|| scanpy教程：使用ingest和B
单细胞转录组数据分析|| scanpy教程：预处理与聚类单细胞转录组数据分析|| scanpy教程：PAGA轨迹...
单细胞转录组数据分析|| scanpy教程：可视化套件
单细胞转录组数据分析|| scanpy教程：预处理与聚类单细胞转录组数据分析|| scanpy教程：PAGA轨迹...
Scanpy分析单细胞数据：轨迹分析PAGA
Scanpy数据结构：AnnData[https://www.jianshu.com/p/9b057e105c42...
单细胞转录组数据分析|| scanpy教程：PAGA轨迹推断
单细胞转录组数据分析|| scanpy教程：预处理与聚类说起轨迹推断，很多人的第一印象就是monocle的轨迹图...
GPU版scanpy (rapids)实践 | 大型单细胞数据分
作者：ahworld链接：GPU版scanpy (rapids)实践 | 大型单细胞数据分析利器[https://...
单细胞转录组数据分析|| scanpy教程：预处理与聚类
scanpy 是一个用于分析单细胞转录组（single cell rna sequencing)数据的python...
scanpy分析单细胞数据
R在读取和处理数据的过程中会将所有的变量和占用都储存在RAM当中，这样一来，对于海量的单细胞RNA-seq数据（尤...
scanpy分析单细胞数据
scanpy和seurat是最常用的分析的单细胞的工具，seurat基于R，而scanpy基于python。lin...
单细胞转录数据分析之Scanpy
Scanpy 是一个基于 Python 分析单细胞数据的软件包，内容包括预处理，可视化，聚类，拟时序分析和差异表达...
单细胞不同样本数据整合-解决AnnData合并时ValueErr
项目场景：使用scanpy包进行单细胞数据分析时，往往需要整合多个样本的数据，也就是将多个AnnData对象合并...