Scanpy数据结构:AnnData
Scanpy分析单细胞数据:预处理和聚类
PAGA全称Partition-based graph abstraction(基于分区的图抽象)。在一种类似于图聚类的方法中,PAGA生成数据的最近邻图,然后生成细胞的分组,连接细胞之间的连接比随机期望的更多的组,从而构建数据的摘要图。由于“短路”在细胞群之间的连接比在单个细胞之间的连接更容易识别,所以PAGA排除了细胞群之间的虚假连接。

1. 导入数据
使用预处理和聚类中保存的注释好的pbmc3k.h5ad文件来做拟时序分析。(原则上PBMC的数据不推荐用来做拟时序,这里仅做演示)
安装scanpy的时候需要使用
pip install 'scanpy[louvain]'
,直接使用pip install scanpy后面跑sc.tl.paga(adata, groups='louvain')
会报错。
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
from matplotlib import rcParams
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_versions()
results_file = './pbmc3k_pseudo.h5ad'
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(3, 3), facecolor='white') # low dpi (dots per inch) yields small inline figures
adata = sc.read_h5ad("pbmc3k.h5ad")
adata
adata.X = adata.X.astype('float64')
2. 预处理和可视化
sc.tl.draw_graph(adata)
sc.pl.draw_graph(adata, color='leiden', legend_loc='on data',title = "")
#pl.savefig("draw_graph.pdf")

让我们来看看sc.tl.draw_graph()这个函数的帮助文档
help(sc.tl.draw_graph)
<pre style="caret-color: rgb(0, 0, 0); color: var(--jp-content-font-color1); font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-indent: 0px; text-transform: none; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; text-decoration: none; box-sizing: unset; font-family: var(--jp-code-font-family); font-size: var(--jp-code-font-size); line-height: normal; border: none; margin: 0px; padding: 0px; overflow: auto; word-break: break-all; overflow-wrap: break-word; white-space: pre-wrap; font-variant-ligatures: normal; orphans: 2; text-align: left; widows: 2; background-color: rgb(255, 255, 255); text-decoration-thickness: initial;">Help on function draw_graph in module scanpy.tools._draw_graph:
draw_graph(adata: anndata._core.anndata.AnnData, layout: Literal['fr', 'drl', 'kk', 'grid_fr', 'lgl', 'rt', 'rt_circular', 'fa'] = 'fa', init_pos: Union[str, bool, NoneType] = None, root: Union[int, NoneType] = None, random_state: Union[NoneType, int, numpy.random.mtrand.RandomState] = 0, n_jobs: Union[int, NoneType] = None, adjacency: Union[scipy.sparse.base.spmatrix, NoneType] = None, key_added_ext: Union[str, NoneType] = None, neighbors_key: Union[str, NoneType] = None, obsp: Union[str, NoneType] = None, copy: bool = False, **kwds)
Force-directed graph drawing [Islam11]_ [Jacomy14]_ [Chippada18]_.
An alternative to tSNE that often preserves the topology of the data
better. This requires to run :func:`~scanpy.pp.neighbors`, first.
The default layout ('fa', `ForceAtlas2`) [Jacomy14]_ uses the package |fa2|_
[Chippada18]_, which can be installed via `pip install fa2`.
`Force-directed graph drawing`_ describes a class of long-established
algorithms for visualizing graphs.
It has been suggested for visualizing single-cell data by [Islam11]_.
Many other layouts as implemented in igraph [Csardi06]_ are available.
Similar approaches have been used by [Zunder15]_ or [Weinreb17]_.
.. |fa2| replace:: `fa2`
.. _fa2: [https://github.com/bhargavchippada/forceatlas2](https://github.com/bhargavchippada/forceatlas2)
.. _Force-directed graph drawing: [https://en.wikipedia.org/wiki/Force-directed_graph_drawing](https://en.wikipedia.org/wiki/Force-directed_graph_drawing)
Parameters
----------
adata
Annotated data matrix.
layout
'fa' (`ForceAtlas2`) or any valid `igraph layout
<[http://igraph.org/c/doc/igraph-Layout.html>`__](http://igraph.org/c/doc/igraph-Layout.html%3E%60__). Of particular interest
are 'fr' (Fruchterman Reingold), 'grid_fr' (Grid Fruchterman Reingold,
faster than 'fr'), 'kk' (Kamadi Kawai', slower than 'fr'), 'lgl' (Large
Graph, very fast), 'drl' (Distributed Recursive Layout, pretty fast) and
'rt' (Reingold Tilford tree layout).
root
Root for tree layouts.
random_state
For layouts with random initialization like 'fr', change this to use
different intial states for the optimization. If `None`, no seed is set.
adjacency
Sparse adjacency matrix of the graph, defaults to neighbors connectivities.
key_added_ext
By default, append `layout`.
proceed
Continue computation, starting off with 'X_draw_graph_`layout`'.
init_pos
`'paga'`/`True`, `None`/`False`, or any valid 2d-`.obsm` key.
Use precomputed coordinates for initialization.
If `False`/`None` (the default), initialize randomly.
neighbors_key
If not specified, draw_graph looks .obsp['connectivities'] for connectivities
(default storage place for pp.neighbors).
If specified, draw_graph looks
.obsp[.uns[neighbors_key]['connectivities_key']] for connectivities.
obsp
Use .obsp[obsp] as adjacency. You can't specify both
`obsp` and `neighbors_key` at the same time.
copy
Return a copy instead of writing to adata.
**kwds
Parameters of chosen igraph layout. See e.g. `fruchterman-reingold`_
[Fruchterman91]_. One of the most important ones is `maxiter`.
.. _fruchterman-reingold: [http://igraph.org/python/doc/igraph.Graph-class.html#layout_fruchterman_reingold](http://igraph.org/python/doc/igraph.Graph-class.html#layout_fruchterman_reingold)
Returns
-------
Depending on `copy`, returns or updates `adata` with the following field.
**X_draw_graph_layout** : `adata.obsm`
Coordinates of graph layout. E.g. for layout='fa' (the default),
the field is called 'X_draw_graph_fa'</pre>
这是tsne的一种代替方案,由fa2库完成,绘图用的是Force-directed_graph_drawing,说明文档中给出了相关的链接。
看一下umap的结果
sc.tl.umap(adata)
sc.pl.umap(adata, color=['leiden'])

3. Denoising the graph(可选步骤)
##去噪,扩散图空间表示
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X_diffmap')
sc.tl.draw_graph(adata)
sc.pl.draw_graph(adata, color='leiden', legend_loc='on data') #legend_loc='right margin'
#pl.savefig("draw_graph_diffmap.pdf")

4. 聚类和PAGA
## sc.tl.louvain(adata, resolution=1.0) 前面已经做过这一步了,做了3再做4,重新聚类分的群的数目会和一开始不一样
sc.tl.paga(adata, groups='leiden')
sc.pl.paga(adata, color=['leiden', 'CD4', 'CD8A', 'CD14'])

这是一个无向的网络图,点的颜色代表不同分群,点的大小代表该群细胞数的大小,连线只有一种颜色,粗细代表两个群之间连接更密切。这里并没有一个拟时的概念,但是下面我们会在这个基础上推断出一个拟时结构。
adata.obs['leiden'].cat.categories
# Index(['0', '1', '2', '3', '4', '5', '6', '7'], dtype='object')
# dtype='object')
我们根据已知marker基因识别细胞类别,可以将细胞类型的信息注释上去。当然我们前面已经注释好了,保存在adata.obs['leiden_anno']。
# adata.obs['leiden_anno'] = adata.obs['leiden']
# adata.obs['leiden_anno'].cat.categories = ['CD4 T', 'CD14 Monocytes','B', 'CD8 T','NK', 'FCGR3A Monocytes','Dendritic', 'Megakaryocytes']
sc.tl.paga(adata, groups='leiden_anno')
sc.pl.paga(adata, threshold=0.03, show=False)

#利用PAGA重新计算细胞之间的距离
sc.tl.draw_graph(adata, init_pos='paga')
sc.pl.draw_graph(adata, color=['leiden_anno','MS4A1', 'NKG7', 'PPBP'], legend_loc='on data')

查看颜色,可以自行定义颜色
pl.figure(figsize=(8, 2))
for i in range(28):
pl.scatter(i, 1, c=sc.pl.palettes.zeileis_28[i], s=200)
pl.show()

zeileis_colors = np.array(sc.pl.palettes.zeileis_28)
new_colors = np.array(adata.uns['leiden_anno_colors'])
new_colors[[0]] = zeileis_colors[[12]] # CD4 T colors / green
new_colors[[1]] = zeileis_colors[[5]] # CD14 Monocytes colors / red
new_colors[[2]] = zeileis_colors[[17]] # B colors / yellow
new_colors[[3]] = zeileis_colors[[2]] # CD8 T / grey
new_colors[[4]] = zeileis_colors[[18]] # NK / turquoise
new_colors[[5]] = zeileis_colors[[6]] # FCGR3A Monocytes / light blue
new_colors[[6]] = zeileis_colors[[0]] # Dendritic / dark blue
new_colors[[7]] = zeileis_colors[[25]] # Megakaryocytes / grey
#new_colors[[10, 17, 5, 3, 15, 6, 18, 13, 7, 12]] = zeileis_colors[[5, 5, 5, 5, 11, 11, 10, 9, 21, 21]] # CD14 Monocytes colors / red
adata.uns['leiden_anno_colors'] = new_colors
sc.pl.paga_compare(
adata, threshold=0.03, title='', right_margin=0.2, size=10, edge_width_scale=0.5,
legend_fontsize=12, fontsize=12, frameon=False, edges=True)

5. 针对给定的一组基因,沿PAGA路径重建基因变化
定义分化起点,计算每个细胞的拟时间,画拟时间分布(这儿我是随便取的B细胞作为root,请根据自身数据细胞类别选取)
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden_anno'] == 'B')[0] ##假设分化起点为B cells,当然自己分析的时候需要根据数据实际情况选择分化起点
sc.tl.dpt(adata)
sc.pl.draw_graph(adata, color=['leiden_anno', 'dpt_pseudotime'], legend_loc='on data',title = ['','pseudotime'], frameon=True)
sc.pl.draw_graph(adata, color=['leiden', 'dpt_pseudotime'], legend_loc='on data',title = ['','pseudotime'], frameon=True)

针对给定基因,沿PAGA路径重建基因变化
gene_names = ['IL7R', 'CD14', 'LYZ', 'MS4A1', 'CD8A', 'CD8B', 'FCGR3A', 'MS4A7','GNLY', 'NKG7', 'FCER1A', 'CST3'] # 选取了一列marker 基因,要根据实际情况选取
data_raw = sc.read_10x_mtx('./data/filtered_gene_bc_matrices/hg19/', var_names='gene_symbols', cache=True)
data_raw.var_names_make_unique()
sc.pp.filter_cells(data_raw, min_genes=200) # 去除表达基因200以下的细胞
sc.pp.filter_genes(data_raw, min_cells=3) # 去除在3个细胞以下表达的基因
mito_genes=data_raw.var_names.str.startswith('MT-')
data_raw.obs['percent_mito']=np.sum(data_raw[:,mito_genes].X,axis=1).A1/np.sum(data_raw.X,axis=1).A1
data_raw.obs['n_counts']=data_raw.X.sum(axis=1).A1
data_raw = data_raw[data_raw.obs.n_genes < 2500, :]
data_raw = data_raw[data_raw.obs.percent_mito < 0.05, :]
sc.pp.normalize_total(data_raw, target_sum=1e4)
sc.pp.log1p(data_raw)
sc.pp.scale(data_raw)
adata.raw = data_raw #使用完整的原始数据进行可视化,处理方法和预处理和聚类里的一样
paths = [('NK', [2,0,3]), ('FCGR3A Monocytes', [2,6,1])] #自定义轨迹,以B细胞为起点,目标细胞为分化轨迹终点
adata.obs['distance'] = adata.obs['dpt_pseudotime']
adata.obs['clusters'] = adata.obs['leiden'] # just a cosmetic change
adata.uns['clusters_colors'] = adata.uns['leiden_anno_colors']
_, axs = pl.subplots(ncols=2, figsize=(6, 2.5), gridspec_kw={'wspace': 0.05, 'left': 0.12})
pl.subplots_adjust(left=0.05, right=0.98, top=0.82, bottom=0.2)
for ipath, (descr, path) in enumerate(paths):
_, data = sc.pl.paga_path(
adata, path, gene_names,
show_node_names=False,
ax=axs[ipath],
ytick_fontsize=12,
left_margin=0.15,
n_avg=50,
annotations=['distance'],
show_yticks=True if ipath==0 else False,
show_colorbar=False,
color_map='Greys',
groups_key='clusters',
color_maps_annotations={'distance': 'viridis'},
title='{} path'.format(descr),
return_data=True,
show=False)
data.to_csv('./paga_path_{}.csv'.format(descr))
pl.savefig('paga_path_pbmc.pdf')
pl.show()

网友评论