美文网首页
2022-01-17 多个样本单细胞分析流程

2022-01-17 多个样本单细胞分析流程

作者: 徐添添 | 来源:发表于2022-01-16 17:21 被阅读0次

    读数据

    合并数据,创建seurat对象

    合并样本的数据可以使用cbind() 基因数相同时
    或者用merge() 基因数不同时
    创建时,用min.cells,min.features进行简单的过滤

    QC

    人类中MT开头,小鼠中mt开头

    实际情况中遇到的问题

    比如第二种情况如果选5%进行过滤,会过滤掉很多细胞,可以考虑用25%过滤
    第五种情况大概率样本不能用

    不同的样本可以用不同的cutoff,可以参考下面这篇文章中各种组织的线粒体cutoff
    Systematic determination of the mitochondrial proportion in human and mice tissues for single-cell RNA-sequencing data quality control - PubMed (nih.gov)

    饱和曲线

    过滤+标准化

    dim(),显示有几行几列

    #计算表达量变化显著的基因FindVariableFeatures
    experiment.aggregate <- FindVariableFeatures(experiment.aggregate, 
                                                 selection.method = "vst",
                                                 nfeatures = 1000) 
    

    正常情况表达量越高的基因,差异越大,如果不是,代表数据量太小了,或者QC有问题。

    均一化及PCA降维


    TSNE聚类(当细胞量几千时,用UMAP)

    找marker

    先初筛

    p_value:该基因在自己的亚群其他细胞亚群中的差异
    logFC:平均表达倍数
    pct1:该基因在自己的亚群中百分之多少的细胞中表达
    pct2:在其他亚群中百分之多少的细胞中表达
    p_val_adj:一般看这个,就是FDR

    细筛

    3.在自己的亚群中至少50%的细胞表达,也可以PCT2<0.5
    4.5.按情况选择要不要做

    注释(代码在cell_marker_annotation.R)

    0亚群很多是T细胞的marker,可以初步注释为T细胞
    1亚群也像T或者NK细胞
    0和1很相似,可以计算这个亚群的差异基因,代码在sub_analysis_scRNA

    计算特定两组细胞之间的差异基因,决定要不要把两个亚群合并

    这里看的是0和3的差异

    发现0和3差异大的都是线粒体基因,说明本身差异不大,那就可以合并
    再看看0,3和1的差异


    如果这些差异是有意义的,那么1可以不合并

    免疫细胞注释可以用singleR

    注释小结

    可以把每个亚群找到的marker用在线工具做功能富集,初步看一下功能




    代码

    ####  Code Description              ####
    #---  1. Written by WoLin @ 2019.03.15,last update 19.07.14 ---#
    #---  2. Analysis for single sample    ---#
    #---  3. Support 10X data & expression matrix ---#
    #---  4. Need to change:sample data, sam.name, dims ---#
    
    #### 1. 加载分析使用的工具包 ####
    library(Seurat)
    library(ggplot2)
    library(cowplot)
    library(Matrix)
    library(dplyr)
    library(ggsci)
    
    #### 2. 读入原始表达数据 ####
    # 以下两种方式二选一
    # 10X 数据
    bm1 <- Read10X("./BoneMarrow/BM1/")#read10×函数只需要写到文件夹,不需要写文件名
    bm2 <- Read10X("./BoneMarrow/BM2/")
    # panglaoBD下载的Rdata可以直接用load读出一个相同的矩阵
    
    # 这里的列名就是barcode,代表一个细胞,行名是基因
    colnames(bm1)[1:10]  # 看bm1中前10个细胞的名字
    colnames(bm2)[1:10]
    # 合并前在细胞上(列名)打上样本标签
    colnames(bm1) <- paste(colnames(bm1),"BM1",sep = "_") # 把列名改成细胞名_样本来源
    colnames(bm2) <- paste(colnames(bm2),"BM2",sep = "_")
    
    #将所有读入的数据合并成一个大的矩阵,确保行数(基因数)相等,增加列
    #合并时需注意行名一致
    #既有10X的数据又有表达矩阵的数据,全部转换为表达矩阵再进行合并
    #关于矩阵合并请见单独的矩阵合并脚本“merge_matrix.R”(行数不一样样时用这个代码)
    experiment.data <- cbind(bm1,bm2) # 行数相同时可以用cbind(样本1,样本2)
    
    #创建一个叫multi的文件夹用于存放分析结果
    sam.name <- "multi"
    if(!dir.exists(sam.name)){
      dir.create(sam.name)
    }
    
    #### 3. 创建Seurat分析对象 ####
    experiment.aggregate <- CreateSeuratObject(
      experiment.data,
      project = "multi", 
      min.cells = 10,#基因至少在10个细胞里有表达,过滤基因
      min.features = 200,#细胞最少表达200个基因,过滤细胞
      names.field = 2,
      names.delim = "_")
    #将数据写到文件中一边后续分析使用
    save(experiment.aggregate,file=paste0("./",sam.name,"/",sam.name,"_raw_SeuratObject.RData"))
    
    #### 4. 数据概览 & QC ####
    #查看SeuratObject中的对象
    slotNames(experiment.aggregate)
    #assay
    experiment.aggregate@assays
    #细胞及细胞中基因与RNA数量
    dim(experiment.aggregate@meta.data)#显示有几行几列,行是细胞
    View(experiment.aggregate@meta.data)
    #第一列是样本名,在创建seurat对象时定义下划线后面的部分是样本名
    #第二列ncountRNA是UMI数,第三列nfeature是基因数
    table(experiment.aggregate@meta.data$orig.ident)#统计matadata的第一列元素的个数
    
    #转换成表达矩阵
    experiment.aggregate.matrix <- as.matrix(experiment.aggregate@assays$RNA@counts)
    
    ##QC:统计线粒体基因在每个细胞中的占比
    experiment.aggregate[["percent.mt"]] <- PercentageFeatureSet(experiment.aggregate, 
    pattern = "^MT-")#MT开头的是线粒体基因,在小鼠中是mt开头
    #小提琴图可视化
    pdf(paste0("./",sam.name,"/QC-VlnPlot.pdf"),width = 8,height = 4.5)
    VlnPlot(experiment.aggregate, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
            ncol = 3)
    dev.off()
    
    ##QC:统计基因数,RNA,线粒体基因分布
    gene.freq <- do.call("cbind", tapply(experiment.aggregate@meta.data$nFeature_RNA,
    experiment.aggregate@meta.data$orig.ident,quantile,probs=seq(0,1,0.05)))
    rna.freq <- do.call("cbind", tapply(experiment.aggregate@meta.data$nCount_RNA,experiment.aggregate@meta.data$orig.ident,
    quantile,probs=seq(0,1,0.05)))
    mt.freq <- do.call("cbind", tapply(experiment.aggregate@meta.data$percent.mt,experiment.aggregate@meta.data$orig.ident,quantile,probs=seq(0,1,0.05)))
    freq.combine <- as.data.frame(cbind(gene.freq,rna.freq,mt.freq))
    colnames(freq.combine) <- c(paste(colnames(gene.freq),"Gene",sep = "_"),
                                paste(colnames(rna.freq),"RNA",sep = "_"),
                                paste(colnames(mt.freq),"MT",sep = "_"))
    write.table(freq.combine,file = paste0(sam.name,"/QC-gene_frequency.txt"),quote = F,sep = "\t")
    rm(gene.freq,rna.freq,mt.freq)
    View(freq.combine)
    #先看小提琴图,如果细胞质量还可以,保留80%的细胞,如果比较差,保留60-70%细胞
    
    ##QC:基因数与线粒体基因以及RNA数量的分布相关性
    plot1 <- FeatureScatter(experiment.aggregate, feature1 = "nCount_RNA", feature2 = "percent.mt")
    plot2 <- FeatureScatter(experiment.aggregate, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
    pdf(paste0("./",sam.name,"/QC-FeatureScatter.pdf"),width = 8,height = 4.5)
    CombinePlots(plots = list(plot1, plot2),legend = "none")
    dev.off()
    rm(plot1,plot2)
    #红色点(BM1)已经接近饱和曲线的拐点(测到umi越多,测到的基因越多,到一定程度,即使增加umi,也不能增加很多测到的基因)
    #蓝色点(BM2)还没到饱和曲线的拐点,这个样本的细胞测到的基因量少
    
    #### 5. 筛选细胞 ####
    cat("Before filter :",nrow(experiment.aggregate@meta.data),"cells\n")
    experiment.aggregate <- subset(experiment.aggregate, 
                                   subset = 
                                     nFeature_RNA > 500 &  # 基因数>500
                                     nCount_RNA > 1000 &   # UMI>1000
                                     nCount_RNA < 20000 &  # UMI<20000(过滤双细胞)
                                     percent.mt < 5)       # 线粒体基因百分比<5
    cat("After filter :",nrow(experiment.aggregate@meta.data),"cells\n")
    table(experiment.aggregate@meta.data$orig.ident)#看过滤完两个样本还有多少细胞
    
    #### 6. 表达量标准化 ####
    experiment.aggregate <- NormalizeData(experiment.aggregate, 
                                          normalization.method = "LogNormalize",
                                          scale.factor = 10000)
    
    #计算表达量变化显著的基因FindVariableFeatures
    experiment.aggregate <- FindVariableFeatures(experiment.aggregate, 
                                                 selection.method = "vst",
                                                 nfeatures = 1000) 
    #一般500-2500个feature(基因),细胞类型越复杂,需要的feature(基因)越多
    
    #展示标准化之后的整体表达水平
    top10 <- head(x = VariableFeatures(experiment.aggregate), 10)
    plot1 <- VariableFeaturePlot(experiment.aggregate)
    plot2 <- LabelPoints(plot = plot1, points = top10)
    pdf(file = paste0(sam.name,"/Norm-feature_variable_plot.pdf"),width = 8,height = 5)
    CombinePlots(plots = list(plot1, plot2),legend = "none")
    dev.off()
    
    #### 7. 均一化与PCA ####
    #均一化(需要一点时间)
    experiment.aggregate <- ScaleData(
      object = experiment.aggregate,
      do.scale = FALSE,
      do.center = FALSE,
      vars.to.regress = c("orig.ident","percent.mt"))#去批次的因素,这里选择不同样本来源和线粒体基因百分比
    #任何批次效应校正都会损失一些信息,所以一开始不要进行太强的批次效应校正
    
    #PCA降维计算(两个作用:1看批次效应校正得怎么样 2聚类)
    experiment.aggregate <- RunPCA(object = experiment.aggregate, 
                                   features = VariableFeatures(experiment.aggregate),
                                   verbose = F,npcs = 50)
    
    #PCA结果展示-1
    pdf(paste0("./",sam.name,"/PCA-VizDimLoadings.pdf"),width = 7,height = 5)
    VizDimLoadings(experiment.aggregate, dims = 1:2, reduction = "pca")
    dev.off()
    
    #PCA结果展示-2
    pdf(paste0("./",sam.name,"/PCA-DimPlot.pdf"),width = 5,height = 4)
    DimPlot(experiment.aggregate, reduction = "pca")
    dev.off()
    
    #PCA结果展示-3
    pdf(paste0("./",sam.name,"/PCA-DimHeatmap.pdf"),width = 5,height = 4)
    DimHeatmap(experiment.aggregate, dims = 1:6, cells = 500, balanced = TRUE)
    dev.off()
    
    #### 8. 确定细胞类群分析PC ####
    #耗时较久,一般不用
    experiment.aggregate <- JackStraw(experiment.aggregate, num.replicate = 100,dims = 40)
    experiment.aggregate <- ScoreJackStraw(experiment.aggregate, dims = 1:40)
    pdf(paste0("./",sam.name,"/PCA-JackStrawPlot_40.pdf"),width = 6,height = 5)
    JackStrawPlot(object = experiment.aggregate, dims = 1:40)
    dev.off()
    
    #碎石图
    pdf(paste0("./",sam.name,"/PCA-ElbowPlot.pdf"),width = 6,height = 5)
    ElbowPlot(experiment.aggregate,ndims = 40)
    dev.off()
    #一般拐点不超过20
    
    #确定用于细胞分群的PC
    dim.use <- 1:20
    
    #### 9. 细胞分群TSNE算法 ####
    #TSNE算法(细胞量比较少的时候(几千),用UMAP)
    experiment.aggregate <- FindNeighbors(experiment.aggregate, dims = dim.use)#计算细胞相似性
    experiment.aggregate <- FindClusters(experiment.aggregate, resolution = 0.5)#resolution越高,细胞分出来的类越多
    experiment.aggregate <- RunTSNE(experiment.aggregate, dims = dim.use, 
                                    do.fast = TRUE)
    pdf(paste0("./",sam.name,"/CellCluster-TSNEPlot_res0.5_",max(dim.use),"PC.pdf"),width = 5,height = 4)
    DimPlot(object = experiment.aggregate, pt.size=0.5,label = T)
    dev.off()
    
    #按照数据来源分组展示细胞异同--画在一张图中
    pdf(paste0("./",sam.name,"/CellCluster-TSNEPlot_SamGroup_",max(dim.use),"PC.pdf"),width = 5,height = 4)
    DimPlot(object = experiment.aggregate, 
            group.by="orig.ident", 
            pt.size=0.5,reduction = "tsne")
    dev.off()
    
    #按照数据来源分组展示细胞异同--画在多张图中
    pdf(paste0("./",sam.name,"/CellCluster-TSNEPlot_SamGroup_slipt_",max(dim.use),"PC.pdf"),width = 8,height = 4)
    DimPlot(object = experiment.aggregate, 
            split.by ="orig.ident", 
            pt.size=0.5,reduction = "tsne")
    dev.off()
    
    table(experiment.aggregate@meta.data$orig.ident)
    View(experiment.aggregate@meta.data)#刚才的分群结果已经加到每个细胞的最后一列
    table(experiment.aggregate@meta.data$orig.ident,experiment.aggregate@meta.data$seurat_clusters)#每个样本中每群细胞数量
    #个性化画图代码见Sub_analysis_scRNA
    
    #### 10. 计算marker基因 ####
    #这一步计算的时候可以把min.pct以及logfc.threshold调的比较低,然后再基于结果手动筛选
    all.markers <- FindAllMarkers(experiment.aggregate, only.pos = TRUE, 
                                  min.pct = 0.3, logfc.threshold = 0.25)
    write.table(all.markers,
                file=paste0("./",sam.name,"/",sam.name,"_total_marker_genes_tsne_",max(dim.use),"PC.txt"),
                sep="\t",quote = F,row.names = F)
    
    # 遍历每一个cluster然后展示其中前4个基因
    marker.sig <- all.markers %>% 
      mutate(Ratio = round(pct.1/pct.2,3)) %>%
      filter(p_val_adj <= 0.05)  # 本条件为过滤统计学不显著的基因
    
    for(cluster_id in unique(marker.sig$cluster)){
      # cluster.markers <- FindMarkers(experiment.aggregate, ident.1 = cluster, min.pct = 0.3)
      # cluster.markers <- as.data.frame(cluster.markers) %>% 
      #   mutate(Gene = rownames(cluster.markers))
      cl4.genes <- marker.sig %>% 
        filter(cluster == cluster_id) %>%
        arrange(desc(avg_log2FC))
      cl4.genes <- cl4.genes[1:min(nrow(cl4.genes),4),"gene"]
      
      #VlnPlot
      pvn <- VlnPlot(experiment.aggregate, features = cl4.genes,ncol = 2)
      pdf(paste0("./",sam.name,"/MarkerGene-VlnPlot_cluster",cluster_id,"_tsne_",max(dim.use),"PC.pdf"),width = 7,height = 6)
      print(pvn)
      dev.off()
      
      #Feather plot 
      pvn <- FeaturePlot(experiment.aggregate,features=cl4.genes,ncol = 2)
      pdf(paste0("./",sam.name,"/MarkerGene-FeaturePlot_cluster",cluster_id,"_tsne_",max(dim.use),"PC.pdf"),width = 7,height = 6)
      print(pvn)
      dev.off()
      
      #RidgePlot
      pvn<-RidgePlot(experiment.aggregate, features = cl4.genes, ncol = 2)
      pdf(paste0("./",sam.name,"/MarkerGene-RidgePlot_cluster",cluster_id,"_tsne_",max(dim.use),"PC.pdf"),width = 7,height = 6)
      print(pvn)
      dev.off()
    }
    rm(cl4.genes,cluster_id,pvn)
    
    #热图展示Top marker基因
    #筛选top5的marker基因,可以通过参数改为其他数值
    top5 <- marker.sig %>% group_by(cluster) %>% 
      top_n(n = 5, wt = avg_log2FC)
    
    #top-marker基因热图
    pdf(paste0("./",sam.name,"/MarkerGene-Heatmap_all_cluster_tsne_",max(dim.use),"PC.pdf"))
    DoHeatmap(experiment.aggregate, features = top5$gene,size = 2) +
      theme(legend.position = "none", 
            axis.text.y = element_text(size = 6))
    dev.off()
    
    #top-marker基因dotplot
    pdf(paste0("./",sam.name,"/MarkerGene-DotPlot_all_cluster_tsne_",max(dim.use),"PC.pdf"),width = 50,height = 5)
    DotPlot(experiment.aggregate, features = unique(top5$gene))+
      RotatedAxis()
    dev.off()
    

    相关文章

      网友评论

          本文标题:2022-01-17 多个样本单细胞分析流程

          本文链接:https://www.haomeiwen.com/subject/iqmchrtx.html