<生信交流与合作请关注公众号@生信探索>
之前版本的CellPhoneDB依赖的anndata不兼容导致使用h5ad的文件作为count matrix输入报错,没想到CellPhoneDB更新到了4.0解决了这个问题,而且运行速度超级快十几分钟就跑完了10几万细胞的主要流程
cd ~
mamba create -n cpdb python=3.8
mamba activate cpdb
mamba install -y ipykernel numpy pandas scikit-learn
pip install cellphonedb gseapy -i https://pypi.tuna.tsinghua.edu.cn/simple
下载数据库文件
![](https://img.haomeiwen.com/i4461150/7db983ca698b9c80.png)
Count data
- 如果是人的基因就直接使用
adata_noramlised_annotated.h5ad
文件就可以不需要下边的步骤 - 如果是小鼠基因,则需要把基因名转换为人的基因名
- 输入的count文件也可以是文本文件,但是h5ad文件速度更快
from gseapy import Biomart
import pandas as pd
bm = Biomart()
m2h_df = bm.query(dataset='mmusculus_gene_ensembl',
attributes=['ensembl_gene_id','external_gene_name',
'hsapiens_homolog_ensembl_gene',
'hsapiens_homolog_associated_gene_name'])
m2h=m2h_df.dropna(subset=['hsapiens_homolog_associated_gene_name'])
m2h=m2h.loc[:,['external_gene_name','hsapiens_homolog_associated_gene_name']]
import anndata
adata = anndata.read_h5ad('adata_noramlised_annotated.h5ad')
adata_raw=adata.raw.to_adata()
bdata=anndata.AnnData(X=adata_raw.X,
obs=pd.DataFrame({'cell_type':adata_raw.obs.CellType2},index=adata_raw.obs_names),
var=pd.DataFrame(index=adata_raw.var_names)
)
merged =pd.merge(bdata.var,m2h,left_index=True,right_on='external_gene_name')
bdata=bdata[:,merged.external_gene_name]
bdata.var_names=merged.hsapiens_homolog_associated_gene_name.values
bdata.write_h5ad('adata_for_cellphonedb.h5ad',compression='lzf')
Meta data
meta_file = pd.DataFrame({'Cell':bdata.obs.index,'cell_type':bdata.obs.cell_type})
meta_file.to_csv("meta_file.csv",index=False)
- 删除不需要的变量
del adata,meta_file,bdata,merged,m2h
cpdb_statistical_analysis_method
from cellphonedb.src.core.methods import cpdb_statistical_analysis_method
deconvoluted, means, pvalues, significant_means = cpdb_statistical_analysis_method.call(
cpdb_file_path = './cellphonedb.zip', # mandatory: CellPhoneDB database zip file.
meta_file_path = "./meta_file.csv", # mandatory: tsv file defining barcodes to cell label.
counts_file_path = './adata_for_cellphonedb.h5ad',# mandatory: normalized count matrix.
counts_data = 'hgnc_symbol', # defines the gene annotation in counts matrix.
microenvs_file_path = None, # optional (default: None): defines cells per microenvironment.
iterations = 1000, # denotes the number of shufflings performed in the analysis.
threshold = 0.1, # defines the min % of cells expressing a gene for this to be employed in the analysis.
threads = 8, # number of threads to use in the analysis.
debug_seed = 42, # debug randome seed. To disable >=0.
result_precision = 3, # Sets the rounding for the mean values in significan_means.
pvalue = 0.05, # P-value threshold to employ for significance.
subsampling = False, # To enable subsampling the data (geometri sketching).
subsampling_log = False, # (mandatory) enable subsampling log1p for non log-transformed data inputs.
subsampling_num_pc = 100, # Number of componets to subsample via geometric skectching (dafault: 100).
subsampling_num_cells = 1000, # Number of cells to subsample (integer) (default: 1/3 of the dataset).
separator = '|', # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
debug = False, # Saves all intermediate tables employed during the analysis in pkl format.
output_path = './', # Path to save results.
output_suffix = None # Replaces the timestamp in the output files by a user defined string in the (default: None).
)
文件目录
├── adata_for_cellphonedb.h5ad
├── meta_file.csv
├── statistical_analysis_deconvoluted_03_30_2023_11:11:04.txt
├── statistical_analysis_means_03_30_2023_11:11:04.txt
├── statistical_analysis_pvalues_03_30_2023_11:11:04.txt
└── statistical_analysis_significant_means_03_30_2023_11:11:04.txt
输出文件和可视化可以参考
https://mp.weixin.qq.com/s/FG8oQJEoM1BRclcaSFC3mw
Reference
https://cellphonedb.readthedocs.io/en/latest/RESULTS-DOCUMENTATION.html
网友评论