作者,Evil Genius
今日参考文章The covariance environment defines cellular niches for spatial inference | Nature Biotechnology。希望国内也能多发表几篇高分的方法论。
前几天有人开玩笑,说国内的数学落后西方80年左右,当然我没有多少概念,结果一个学数学出身的同学说了一句,那是因为学数学没有对应的公务员岗位,不然早就赶上来了。
但是数学真的很重要,每个行业进入的越深,发现全是数学。
- 高分辨率空间技术分析数据的一个关键挑战是适当地表示细胞邻域或生态位的特征。
- 利用生态位中细胞间基因-基因协变量结构来捕捉其中细胞相互作用的多变量性质
关于协变量,这是大学线性代数的内容,我本科学习的是生物医学专业,99%同学都去考那种不考数学的专业,但我不太喜欢生物医学,天天背书想想都头疼,于是去考了工科,需要考数二,就是高数和线代,以前我是从来不相信天赋一说的,认为努力都可以,后来学数学发现,这东西人与人的差异真的很大,用张雪峰的话说,只要考数学,就有牲口能考满分。
当时考研数学考了137分,全靠自学的,后来去上研究生,一问,宿舍4个人,我数学考的最低,尼玛,宿舍的牲口们。
知识背景
- 细胞的局部邻域或生态位是定义细胞相互作用的有用分辨率
- niche可以代表功能性解剖亚单位(如干细胞niche),是识别更大空间模式的基础。
- 现在Xenium扩展到5000+ 基因,提供了全新的视角。
- 需要同时表征空间niche和全转录组的分析
- COVET,假设细胞与其环境之间的相互作用在生态位细胞之间的基因表达中产生了生物学上有意义的协变量结构。
- Imaging-based spatial transcriptomics + scRNA
- 实现转录组范围的空间推断,同时将scRNA-seq和空间数据整合到一起。
- ENVI还可用于将有价值的空间信息投射到分离的scRNA-seq数据上,并可捕获大型复杂组织区域沿空间轴的连续变化。
结果1、COVET defines spatial neighborhoods
- 一个细胞会影响其附近的细胞,也会被其影响,从而在相互作用的细胞之间产生共变的表达模式。
- COVET,一种基于niche细胞间基因-基因协方差修正公式的邻域信息进行细胞表示。
- niche的距离度量
- 一种有效计算该距离度量的算法。与平均表达不同,基因-基因协方差捕获基因和细胞状态之间的关系,这些关系是由生态位内的细胞相互作用形成的。
首先通过数据集中每个细胞的k个空间近邻来定义该细胞的生态位,然后计算每个生态位的基因-基因shifted covariance matrix。
结果2、The ENVI algorithm(算法的东西很多我也不是很懂,喜欢数学奈何大学四年浪费在生物医学上了,全给荒废了)
ENVI采用条件变分自编码器来推断scRNA-seq数据中的空间背景,并通过将这两种模式映射到一个共同的嵌入来推算空间数据中缺失的基因。
结果分析示例,小鼠胚胎,核心就是通过考虑细胞-细胞接近度来捕获空间模式之间相似性的度量
结果3、ENVI将空间模式归因于单细胞基因组学数据
将空间信息独特地投射到scRNA-seq分析的游离细胞上。这种方法可以使用有限的空间分析数据来为单细胞图谱中的数百万个细胞赋予空间背景。COVET表示邻近细胞之间的基因covariation;因此,除了推断邻居的细胞类型外,还可以推断它们的基因表达。
单细胞的空间背景结果4、ENVI从单细胞数据中学习空间梯度
ENVI maps continuous spatial gradients in spine development from single-cell and spatial data.结果5、ENVI描绘了运动皮层的组织尺度模式
- ENVI对稀有细胞类型的恢复
-
ENVI还可以从仅包含少数成像基因的数据集中捕获皮层内的空间模式。
ENVI predicts the cortical localization of Sst interneuron subtypes
结果6、ENVI整合了Xenium的脑转移数据
ENVI的性能主要由三个因素驱动:(1)深度贝叶斯推理,在学习基因与生态位之间的非线性关系的同时,回归出与模态相关的混杂因素;(2)利用scRNA-seq数据对整个转录组进行显式建模;(3)通过COVET直接整合空间背景。
ENVI integrates Xenium and snRNA-seq data
最后来看看示例代码
#!pip install scenvi
import scenvi
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scanpy as sc
import colorcet
import sklearn.neighbors
import scipy.sparse
import umap.umap_ as umap
from fa2 import ForceAtlas2
###函数定义
def flatten(arr):
return(np.reshape(arr, [arr.shape[0], -1]))
def force_directed_layout(affinity_matrix, cell_names=None, verbose=True, iterations=500, device='cpu'):
"""" Function to compute force directed layout from the affinity_matrix
:param affinity_matrix: Sparse matrix representing affinities between cells
:param cell_names: pandas Series object with cell names
:param verbose: Verbosity for force directed layout computation
:param iterations: Number of iterations used by ForceAtlas
:return: Pandas data frame representing the force directed layout
"""
init_coords = np.random.random((affinity_matrix.shape[0], 2))
if device == 'cpu':
forceatlas2 = ForceAtlas2(
# Behavior alternatives
outboundAttractionDistribution=False,
linLogMode=False,
adjustSizes=False,
edgeWeightInfluence=1.0,
# Performance
jitterTolerance=1.0,
barnesHutOptimize=True,
barnesHutTheta=1.2,
multiThreaded=False,
# Tuning
scalingRatio=2.0,
strongGravityMode=False,
gravity=1.0,
# Log
verbose=verbose)
positions = forceatlas2.forceatlas2(
affinity_matrix, pos=init_coords, iterations=iterations)
positions = np.array(positions)
positions = pd.DataFrame(positions,
index=np.arange(affinity_matrix.shape[0]), columns=['x', 'y'])
return positions
def run_diffusion_maps(data_df, n_components=10, knn=30, alpha=0):
"""Run Diffusion maps using the adaptive anisotropic kernel
:param data_df: PCA projections of the data or adjacency matrix
:param n_components: Number of diffusion components
:param knn: Number of nearest neighbors for graph construction
:param alpha: Normalization parameter for the diffusion operator
:return: Diffusion components, corresponding eigen values and the diffusion operator
"""
# Determine the kernel
N = data_df.shape[0]
if(type(data_df).__module__ == np.__name__):
data_df = pd.DataFrame(data_df)
if not scipy.sparse.issparse(data_df):
print("Determing nearest neighbor graph...")
temp = sc.AnnData(data_df.values)
sc.pp.neighbors(temp, n_pcs=0, n_neighbors=knn)
kNN = temp.obsp['distances']
# Adaptive k
adaptive_k = int(np.floor(knn / 3))
adaptive_std = np.zeros(N)
for i in np.arange(len(adaptive_std)):
adaptive_std[i] = np.sort(kNN.data[kNN.indptr[i] : kNN.indptr[i + 1]])[
adaptive_k - 1
]
# Kernel
x, y, dists = scipy.sparse.find(kNN)
# X, y specific stds
dists = dists / adaptive_std[x]
W = scipy.sparse.csr_matrix((np.exp(-dists), (x, y)), shape=[N, N])
# Diffusion components
kernel = W + W.T
else:
kernel = data_df
# Markov
D = np.ravel(kernel.sum(axis=1))
if alpha > 0:
# L_alpha
D[D != 0] = D[D != 0] ** (-alpha)
mat = scipy.sparse.csr_matrix((D, (range(N), range(N))), shape=[N, N])
kernel = mat.dot(kernel).dot(mat)
D = np.ravel(kernel.sum(axis=1))
D[D != 0] = 1 / D[D != 0]
T = scipy.sparse.csr_matrix((D, (range(N), range(N))), shape=[N, N]).dot(kernel)
# Eigen value dcomposition
D, V = scipy.sparse.linalg.eigs(T, n_components, tol=1e-4, maxiter=1000)
D = np.real(D)
V = np.real(V)
inds = np.argsort(D)[::-1]
D = D[inds]
V = V[:, inds]
# Normalize
for i in range(V.shape[1]):
V[:, i] = V[:, i] / np.linalg.norm(V[:, i])
# Create are results dictionary
res = {"T": T, "EigenVectors": V, "EigenValues": D}
res["EigenVectors"] = pd.DataFrame(res["EigenVectors"])
if not scipy.sparse.issparse(data_df):
res["EigenVectors"].index = data_df.index
res["EigenValues"] = pd.Series(res["EigenValues"])
res["kernel"] = kernel
return res
def FDL(data, k = 30):
nbrs = sklearn.neighbors.NearestNeighbors(n_neighbors=int(k), metric='euclidean',
n_jobs=5).fit(data)
kNN = nbrs.kneighbors_graph(data, mode='distance')
# Adaptive k
adaptive_k = int(np.floor(k / 3))
nbrs = sklearn.neighbors.NearestNeighbors(n_neighbors=int(adaptive_k),
metric='euclidean', n_jobs=5).fit(data)
adaptive_std = nbrs.kneighbors_graph(data, mode='distance').max(axis=1)
adaptive_std = np.ravel(adaptive_std.todense())
# Kernel
x, y, dists = scipy.sparse.find(kNN)
# X, y specific stds
dists = dists / adaptive_std[x]
N = data.shape[0]
W = scipy.sparse.csr_matrix((np.exp(-dists), (x, y)), shape=[N, N])
# Diffusion components
kernel = W + W.T
layout = force_directed_layout(kernel)
return(layout)
读取数据,单细胞数据和原位数据
#!wget https://dp-lab-data-public.s3.amazonaws.com/ENVI/sc_data.h5ad
#!wget https://dp-lab-data-public.s3.amazonaws.com/ENVI/st_data.h5ad
st_data = sc.read_h5ad('st_data.h5ad')
sc_data = sc.read_h5ad('sc_data.h5ad')
cell_type_palette = {'Astro': (0.843137, 0.0, 0.0, 1.0),
'Endo': (0.54902, 0.235294, 1.0, 1.0),
'L23_IT': (0.007843, 0.533333, 0.0, 1.0),
'L45_IT': (0.0, 0.67451, 0.780392, 1.0),
'L56_NP': (0.596078, 1.0, 0.0, 1.0),
'L5_ET': (1.0, 0.498039, 0.819608, 1.0),
'L5_IT': (0.423529, 0.0, 0.309804, 1.0),
'L5_PT': (1.0, 0.647059, 0.188235, 1.0),
'L6_CT': (0.345098, 0.231373, 0.0, 1.0),
'L6_IT': (0.0, 0.341176, 0.34902, 1.0),
'L6_IT_Car3': (0.0, 0.0, 0.866667, 1.0),
'L6b': (0.0, 0.992157, 0.811765, 1.0),
'Lamp5': (0.631373, 0.458824, 0.415686, 1.0),
'Microglia': (0.737255, 0.717647, 1.0, 1.0),
'OPC': (0.584314, 0.709804, 0.470588, 1.0),
'Oligo': (0.752941, 0.015686, 0.72549, 1.0),
'Pericytes': (0.392157, 0.329412, 0.454902, 1.0),
'Pvalb': (0.47451, 0.0, 0.0, 1.0),
'SMC': (0.027451, 0.454902, 0.847059, 1.0),
'Sncg': (0.996078, 0.960784, 0.564706, 1.0),
'Sst': (0.0, 0.294118, 0.0, 1.0),
'VLMC': (0.560784, 0.478431, 0.0, 1.0),
'Vip': (1.0, 0.447059, 0.4, 1.0)}
cell_label_palette = {'GABAergic': (0.843137, 0.0, 0.0, 1.0),
'Glutamatergic': (0.54902, 0.235294, 1.0, 1.0),
'Non-Neuronal': (0.007843, 0.533333, 0.0, 1.0)}
####plot
plt.figure(figsize=(10,10))
sns.scatterplot(x = st_data.obsm['spatial'][st_data.obs['batch'] == 'mouse1_slice10'][:, 1],
y = -st_data.obsm['spatial'][st_data.obs['batch'] == 'mouse1_slice10'][:, 0], legend = True,
hue = st_data.obs['cell_type'][st_data.obs['batch'] == 'mouse1_slice10'],
s = 12, palette = cell_type_palette)
plt.axis('equal')
plt.axis('off')
plt.title("MERFISH Data")
plt.show()
plot scRNA
fit = umap.UMAP(
n_neighbors = 100,
min_dist = 0.8,
n_components = 2,
)
sc_data.layers['log'] = np.log(sc_data.X + 1)
sc.pp.highly_variable_genes(sc_data, layer = 'log', n_top_genes = 2048)
sc_data.obsm['UMAP_exp'] = fit.fit_transform(np.log(sc_data[:, sc_data.var['highly_variable']].X + 1))
fig = plt.figure(figsize = (10,10))
sns.scatterplot(x = sc_data.obsm['UMAP_exp'][:, 0], y = sc_data.obsm['UMAP_exp'][:, 1], hue = sc_data.obs['cell_type'], s = 16,
palette = cell_type_palette, legend = True)
plt.tight_layout()
plt.axis('off')
plt.title('scRNA-seq Data')
plt.show()
Running ENVI
envi_model = scenvi.ENVI(spatial_data = st_data, sc_data = sc_data)
###Training ENVI and run auxiliary function
envi_model.train()
envi_model.impute_genes()
envi_model.infer_niche_covet()
envi_model.infer_niche_celltype()
Read ENVI predictions
st_data.obsm['envi_latent'] = envi_model.spatial_data.obsm['envi_latent']
st_data.obsm['COVET'] = envi_model.spatial_data.obsm['COVET']
st_data.obsm['COVET_SQRT'] = envi_model.spatial_data.obsm['COVET_SQRT']
st_data.uns['COVET_genes'] = envi_model.CovGenes
st_data.obsm['imputation'] = envi_model.spatial_data.obsm['imputation']
st_data.obsm['cell_type_niche'] = envi_model.spatial_data.obsm['cell_type_niche']
sc_data.obsm['envi_latent'] = envi_model.sc_data.obsm['envi_latent']
sc_data.obsm['COVET'] = envi_model.sc_data.obsm['COVET']
sc_data.obsm['COVET_SQRT'] = envi_model.sc_data.obsm['COVET_SQRT']
sc_data.obsm['cell_type_niche'] = envi_model.sc_data.obsm['cell_type_niche']
sc_data.uns['COVET_genes'] = envi_model.CovGenes
Plot UMAPs of ENVI latent
fit = umap.UMAP(
n_neighbors = 100,
min_dist = 0.3,
n_components = 2,
)
latent_umap = fit.fit_transform(np.concatenate([st_data.obsm['envi_latent'], sc_data.obsm['envi_latent']], axis = 0))
st_data.obsm['latent_umap'] = latent_umap[:st_data.shape[0]]
sc_data.obsm['latent_umap'] = latent_umap[st_data.shape[0]:]
lim_arr = np.concatenate([st_data.obsm['latent_umap'], sc_data.obsm['latent_umap']], axis = 0)
delta = 1
pre = 0.1
xmin = np.percentile(lim_arr[:, 0], pre) - delta
xmax = np.percentile(lim_arr[:, 0], 100 - pre) + delta
ymin = np.percentile(lim_arr[:, 1], pre) - delta
ymax = np.percentile(lim_arr[:, 1], 100 - pre) + delta
fig = plt.figure(figsize = (13,5))
plt.subplot(121)
sns.scatterplot(x = sc_data.obsm['latent_umap'][:, 0],
y = sc_data.obsm['latent_umap'][:, 1], hue = sc_data.obs['cell_type'], s = 8, palette = cell_type_palette,
legend = False)
plt.title("scRNA-seq Latent")
plt.xlim([xmin, xmax])
plt.ylim([ymin, ymax])
plt.axis('off')
plt.subplot(122)
sns.scatterplot(x = st_data.obsm['latent_umap'][:, 0],
y = st_data.obsm['latent_umap'][:, 1], hue = st_data.obs['cell_type'], s = 8, palette = cell_type_palette, legend = True)
legend = plt.legend(title = 'Cell Type', prop={'size': 12}, fontsize = '12', markerscale = 3, ncol = 2, bbox_to_anchor = (1, 1))#, loc = 'lower left')
plt.setp(legend.get_title(),fontsize='12')
plt.title("MERFISH Latent")
plt.axis('off')
plt.tight_layout()
plt.xlim([xmin, xmax])
plt.ylim([ymin, ymax])
plt.show()
ENVI COVET analysis
st_data_sst = st_data[st_data.obs['cell_type'] == 'Sst']
sc_data_sst = sc_data[sc_data.obs['cell_type'] == 'Sst']
gran_sst_palette = {'Th': (0.0, 0.294118, 0.0, 1.0),
'Calb2': (0.560784, 0.478431, 0.0, 1.0),
'Chodl': (1.0, 0.447059, 0.4, 1.0),
'Myh8': (0.933333, 0.72549, 0.72549, 1.0),
'Crhr2': (0.368627, 0.494118, 0.4, 1.0),
'Hpse': (0.65098, 0.482353, 0.72549, 1.0),
'Hspe': (0.352941, 0.0, 0.643137, 1.0),
'Crh': (0.607843, 0.894118, 1.0, 1.0),
'Pvalb Etv1': (0.92549, 0.0, 0.466667, 1.0)}
Calculating FDL and DC on COVET
FDL_COVET = np.asarray(FDL(np.concatenate([flatten(st_data_sst.obsm['COVET_SQRT']),
flatten(sc_data_sst.obsm['COVET_SQRT'])], axis = 0), k = 30))
st_data_sst.obsm['FDL_COVET'] = FDL_COVET[:st_data_sst.shape[0]]
sc_data_sst.obsm['FDL_COVET'] = FDL_COVET[st_data_sst.shape[0]:]
DC_COVET = np.asarray(run_diffusion_maps(np.concatenate([flatten(st_data_sst.obsm['COVET_SQRT']),
flatten(sc_data_sst.obsm['COVET_SQRT'])], axis = 0), knn = 30)['EigenVectors'])[:, 1:]
st_data_sst.obsm['DC_COVET'] = -DC_COVET[:st_data_sst.shape[0]]
sc_data_sst.obsm['DC_COVET'] = -DC_COVET[st_data_sst.shape[0]:]
st_data_sst.obsm['DC_COVET'] = -DC_COVET[:st_data_sst.shape[0]]
sc_data_sst.obsm['DC_COVET'] = -DC_COVET[st_data_sst.shape[0]:]
lim_arr = np.concatenate([st_data_sst.obsm['FDL_COVET'], sc_data_sst.obsm['FDL_COVET']], axis = 0)
delta = 1000
pre = 0.01
xmin = np.percentile(lim_arr[:, 0], pre) - delta
xmax = np.percentile(lim_arr[:, 0], 100 - pre) + delta
ymin = np.percentile(lim_arr[:, 1], pre) - delta
ymax = np.percentile(lim_arr[:, 1], 100 - pre) + delta
plt.figure(figsize=(10,5))
plt.subplot(121)
sns.scatterplot(x = sc_data_sst.obsm['FDL_COVET'][:, 0],
y = sc_data_sst.obsm['FDL_COVET'][:, 1],
hue = sc_data_sst.obs['cluster_label'], s = 16, palette= gran_sst_palette, legend = True)
plt.tight_layout()
plt.xlim([xmin, xmax])
plt.ylim([ymin, ymax])
plt.title('scRNA-seq Sst, COVET FDL')
legend = plt.legend(title = 'Sst subtype', prop={'size': 8}, fontsize = '8', markerscale = 1, ncol = 2)
plt.axis('off')
plt.subplot(122)
ax = sns.scatterplot(x = st_data_sst.obsm['FDL_COVET'][:, 0],
y = st_data_sst.obsm['FDL_COVET'][:, 1],
c = st_data_sst.obsm['DC_COVET'][:,0], s = 16, cmap= 'cet_CET_D13', legend = False)
plt.tight_layout()
plt.xlim([xmin, xmax])
plt.ylim([ymin, ymax])
plt.axis('off')
plt.title('MERFISH Sst, COVET FDL')
plt.show()
fig = plt.figure(figsize=(25,5))
for ind, batch in enumerate(['mouse1_slice212', 'mouse1_slice162', 'mouse1_slice71', 'mouse2_slice270', 'mouse1_slice40']):
st_dataBatch = st_data[st_data.obs['batch'] == batch]
st_dataPlotBatch = st_data_sst[st_data_sst.obs['batch'] == batch]
plt.subplot(1,5, 1+ ind)
sns.scatterplot(x = st_dataBatch.obsm['spatial'][:, 0], y = st_dataBatch.obsm['spatial'][:, 1], color = (207/255,185/255,151/255, 1))
sns.scatterplot(x = st_dataPlotBatch.obsm['spatial'][:, 0], y = st_dataPlotBatch.obsm['spatial'][:, 1], marker = '^',
c = st_dataPlotBatch.obsm['DC_COVET'][:, 0], s = 256, cmap= 'cet_CET_D13', legend = False)
plt.title(batch)
plt.axis('off')
plt.tight_layout()
plt.show()
Subtype Depth
depth_df = pd.DataFrame()
depth_df['Subtype'] = sc_data_sst.obs['cluster_label']
depth_df['Depth'] = -sc_data_sst.obsm['DC_COVET'][:,0]
subtype_depth_order = depth_df.groupby(['Subtype']).mean().sort_values(by = 'Depth', ascending=False).index
plt.figure(figsize=(12,5))
sns.set(font_scale=1.7)
sns.set_style("whitegrid")
sns.boxenplot(depth_df, x = 'Subtype', y = 'Depth',# bw = 1, width = 0.9,
order = subtype_depth_order,
palette = gran_sst_palette)
plt.tight_layout()
plt.show()
Niche Cell Type Composition
subtype_canonical = pd.DataFrame([sc_data_sst[sc_data_sst.obs['cluster_label']==subtype].obsm['cell_type_niche'].mean(axis = 0) for subtype in subtype_depth_order],
index = subtype_depth_order, columns = sc_data.obsm['cell_type_niche'].columns)
subtype_canonical[subtype_canonical<0.2] = 0
subtype_canonical.drop(labels=subtype_canonical.columns[(subtype_canonical == 0).all()], axis=1, inplace=True)
subtype_canonical = subtype_canonical.div(subtype_canonical.sum(axis=1), axis=0)
Plot as Stacked Bar Plots
subtype_canonical.plot(kind = 'bar', stacked = 'True',
color = {col:cell_type_palette[col] for col in subtype_canonical.columns})
plt.legend(bbox_to_anchor = (1,1), ncols = 1, fontsize = 'x-small')
plt.title("Predicted Niche Composition")
plt.ylabel("Proportion")
plt.xlabel("Sst Subtype")
plt.show()
ENVI imputation on Spatial Data
tick_genes = np.asarray(['Adamts18','Pamr1', 'Dkkl1', 'Hs6st2', 'Slit1', 'Ighm'])
plt.figure(figsize=(15,10))
for ind, gene in enumerate(tick_genes):
plt.subplot(2,3,1+ind)
cvec = np.log(st_data[st_data.obs['batch'] == 'mouse1_slice10'].obsm['imputation'][gene] + 0.1)
sns.scatterplot(x = st_data.obsm['spatial'][st_data.obs['batch'] == 'mouse1_slice10'][:, 1],
y = -st_data.obsm['spatial'][st_data.obs['batch'] == 'mouse1_slice10'][:, 0], legend = False,
c = cvec, cmap = 'Reds',
vmax = np.percentile(cvec, 95), vmin = np.percentile(cvec, 30),
s = 24, edgecolor = 'k')#, palette = cell_type_palette)
plt.title(gene)
plt.axis('equal')
plt.axis('off')
plt.tight_layout()
plt.show()
网友评论