美文网首页
Python单细胞复现2022||02-数据介绍与下载

Python单细胞复现2022||02-数据介绍与下载

作者: 信你个鬼 | 来源:发表于2022-10-01 22:35 被阅读0次

    数据背景接上一篇:Python图文复现2022||01-文献阅读:致命COVID-19分子单细胞肺图谱

    数据获取

    有三种途径:

    这次就从GEO下载吧,下载完后:3个文件,一个处理后的csv表达数据,一个metadata,一个原始count数据压缩包tar

    image-20220930111820551.png

    原文代码:https://github.com/IzarLab/CUIMC-NYP_COVID_autopsy_lung

    但是本次我们使用一个利用这个数据讲python学习的资源,

    视频相关代码如下:

    数据下载后,开工!

    环境准备:

    # 使用conda安装好相关软件
    conda activate scanpy
    
    # 导入包,注意dir路径改成自己的
    import scanpy as sc
    dir = '/path/data/GSE171524/GSE171524_RAW/'
    

    数据读取

    GSM5226574_C51样本是个肺对照样本,总共包含6099个细胞,34546个基因。

    # 读取数据
    adata = sc.read_csv(dir + 'GSM5226574_C51ctr_raw_counts.csv').T
    adata
    
    #AnnData object with n_obs × n_vars = 6099 × 34546
    
    # 查看数据维度
    adata.X.shape
    #(6099, 34546)
    

    Doublet过滤

    # pip install scvi-tools
    import scvi
    
    # 过滤低表达基因以及高变基因选择
    sc.pp.filter_genes(adata, min_cells = 10)
    sc.pp.highly_variable_genes(adata, n_top_genes = 2000, subset = True, flavor = 'seurat_v3')
    
    # 训练模型
    scvi.model.SCVI.setup_anndata(adata)
    vae = scvi.model.SCVI(adata)
    vae.train()
    #GPU available: False, used: False
    #TPU available: False, using: 0 TPU cores
    #IPU available: False, using: 0 IPUs
    #HPU available: False, using: 0 HPUs
    #Epoch 400/400: 100%|████████████████████████████████████████████████████████| 400/400 [23:10<00:00,  3.21s/it, loss=320, v_num=1]`Trainer.fit` stopped: `max_epochs=400` reached.
    #Epoch 400/400: 100%|████████████████████████████████████████████████████████| 400/400 [23:10<00:00,  3.48s/it, loss=320, v_num=1]
    
    solo = scvi.external.SOLO.from_scvi_model(vae)
    solo.train()
    
    df = solo.predict()
    df['prediction'] = solo.predict(soft = False)
    df.index = df.index.map(lambda x: x[:-2])
    df
    

    结果:doublet这一列的值越高,表明这个细胞约可能是双包体;

    image-20220930154547326.png

    预测结果统计:

    有1245个细胞被预测为双包体,占总细胞的20%左右,这对于10X来说,双包率有点太高了

    df.groupby('prediction').count()
    
                doublet  singlet
    prediction                  
    doublet        1245     1245
    singlet        4854     4854
    

    因此,这里计算一个df值:

    df['dif'] = df.doublet - df.singlet
    df
    

    新增一列df值:

    image-20220930154853250.png

    给这个df绘制一个分布图:

    import seaborn as sns
    import matplotlib.pyplot as plt
    
    sns.displot(df[df.prediction == 'doublet'], x = 'dif')
    plt.savefig(dir+"01-df-displot.png")
    

    结果图:

    image-20220930155150864.png

    因此,将df大于1的预测为双包体:

    doublets = df[(df.prediction == 'doublet') & (df.dif > 1)]
    doublets
    
    adata = sc.read_csv(dir+'GSM5226574_C51ctr_raw_counts.csv').T
    adata.obs['doublet'] = adata.obs.index.isin(doublets.index)
    adata.obs
    

    预测结果:

    image-20220930155535720.png

    去除:还剩余5618个细胞

    adata = adata[~adata.obs.doublet]
    adata
    
    View of AnnData object with n_obs × n_vars = 5618 × 34546
        obs: 'doublet'
    

    预处理

    # 计算线粒体基因比例
    adata.var['mt'] = adata.var.index.str.startswith('MT-')
    adata.var
    
    # 核糖体RNA基因
    import pandas as pd
    ribo_url = "http://software.broadinstitute.org/gsea/msigdb/download_geneset.jsp?geneSetName=KEGG_RIBOSOME&fileType=txt"
    ribo_genes = pd.read_table(ribo_url, skiprows=2, header = None)
    ribo_genes
    
    #              0
    # 0          FAU
    # 1       MRPL13
    # 2        RPL10
    # 3       RPL10A
    # 4       RPL10L
    # ..         ...
    # 83        RPS9
    # 84        RPSA
    # 85     RSL24D1
    # 86  RSL24D1P11
    # 87       UBA52
    
    # [88 rows x 1 columns]
    
    adata.var['ribo'] = adata.var_names.isin(ribo_genes[0].values)
    

    接着计算qc指标

    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt', 'ribo'], percent_top=None, log1p=False, inplace=True)
    adata.var.sort_values('n_cells_by_counts')
    

    结果如下:

    image-20220930163534167.png

    低表达过滤:

    sc.pp.filter_genes(adata, min_cells=3)
    adata.var.sort_values('n_cells_by_counts')
    adata.obs.sort_values('n_genes_by_counts')
    
    # 绘制小提琴图
    sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt', 'pct_counts_ribo'], jitter=0.4, multi_panel=True)
    plt.savefig(dir+"02-qc_violin.png")
    

    结果图:

    image-20220930163814498.png

    按照分位数来过滤细胞:

    import numpy as np
    upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .98)
    #upper_lim = 3000
    upper_lim
    adata = adata[adata.obs.n_genes_by_counts < upper_lim]
    adata.obs
    

    过滤之后:

    image-20220930164011106.png

    线粒体比例与核糖体比例:

    adata = adata[adata.obs.pct_counts_mt < 20]
    adata = adata[adata.obs.pct_counts_ribo < 2]
    adata
    

    过滤后:

    View of AnnData object with n_obs × n_vars = 5489 × 24080
        obs: 'doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo'
        var: 'mt', 'ribo', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'n_cells'
    

    标准化Normalization

    标准化前后的区别可看:adata.X.sum(axis = 1)值的变化

    adata.X.sum(axis = 1)
    
    #normalize every cell to 10,000 UMI
    sc.pp.normalize_total(adata, target_sum=1e4) 
    adata.X.sum(axis = 1)
    
    #change to log counts
    sc.pp.log1p(adata) 
    adata.X.sum(axis = 1)
    
    adata.raw = adata
    

    聚类Clustering

    # 高变基因选择以及可视化
    sc.pp.highly_variable_genes(adata, n_top_genes = 2000)
    sc.pl.highly_variable_genes(adata)
    plt.savefig(dir+"03-highly_variable_genes.png")
    

    结果图:

    image-20220930165736971.png

    选择主成分:

    adata = adata[:, adata.var.highly_variable]
    sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt', 'pct_counts_ribo'])
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata, svd_solver='arpack')
    sc.pl.pca_variance_ratio(adata, log=True, n_pcs = 50)
    plt.savefig(dir+"04-pca_variance.png")
    

    结果图:

    image-20220930165952947.png

    这里选择30个PCs,然后聚类:

    sc.pp.neighbors(adata, n_pcs = 30)
    sc.tl.umap(adata)
    sc.pl.umap(adata)
    plt.savefig(dir+"05-umap.png")
    

    结果图:

    image-20220930170145867.png

    使用leiden在低维空间可视化:

    sc.tl.leiden(adata, resolution = 0.5)
    sc.pl.umap(adata, color=['leiden'])
    plt.savefig(dir+"05-umap_leiden.png")
    

    结果图:聚成了11类

    image-20220930172228261.png

    单个样本的分析演示到这里,下期进行所有样本整合分析~

    相关文章

      网友评论

          本文标题:Python单细胞复现2022||02-数据介绍与下载

          本文链接:https://www.haomeiwen.com/subject/icxjartx.html