美文网首页单细胞实战单细胞高级分析36计单细胞测序
NBIS系列单细胞转录组数据分析实战(一):数据质控

NBIS系列单细胞转录组数据分析实战(一):数据质控

作者: Davey1220 | 来源:发表于2021-04-02 21:46 被阅读0次

    Single cell RNA-seq analysis workshop

    该系列课程由瑞典国家生物信息学基础设施(NBIS) 负责设置,为瑞典Phd学生,博士后,研究人员和瑞典学术界的其他工作人员开设的全国性研讨会。

    本课程将通过演讲和实战练习涵盖了单细胞转录组(scRNA-seq)数据分析的基本步骤。涵盖的主题包括:

    • An overview of the current scRNAseq technologies(当前常用scRNA-seq技术概述)
    • Basic overview of pipelines for processing raw reads into expression values(原始测序数据比对转换为基因表达矩阵的分析流程概述)
    • Quality control of scRNAseq data(scRNA-seq数据的质量控制)
    • Dimensionality reduction and clustering techniques(数据降维和聚类分析技术)
    • Data normalization(数据归一化处理)
    • Differential gene expression for scRNAseq data(scRNA-seq数据的差异基因表达分析)
    • Celltype prediction(细胞类型预测)
    • Trajectory analysis(细胞轨迹拟时分析)
    • Comparison of different analysis pipelines such as Seurat, Scran and Scanpy(不同分析工具流程的比较,如Seurat,Scran和Scanpy)

    实战练习

    在该实战练习中,我们使用了3种常用的scRNA-seq分析工具流程(SeuratScranScanpy),分别提供了不同分析工具的简短教程,我们可以任选自己熟悉的工具进行scRNA-seq数据分析。原则上,我们对这三种不同的分析工具执行相同的步骤,但是仍会存在一些细微的差异,因为并非在所有的分析工具中实现了所有不同的方法。

    image.png

    在本实战中,我选用了scRNA-seq数据分析中最常用的Seurat包进行练习演示。

    第一节:数据质控实战

    下载示例数据

    在本教程中,我们将使用来自3个covid-19患者和3个健康对照的6个PBMC的10x数据集运行所有的教程,每个样本已被二次采样为1500个细胞。

    # 新建数据存放的文件夹
    mkdir -p data/raw
    
    # first check if the files are there
    count=$(ls -l data/raw/*.h5 | grep -v ^d | wc -l )
    echo $count
    
    # if not 4 files, fetch the files from github.
    # 下载示例数据
    if (("$count" <  6)); then
      cd data/raw
      curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_13.h5
      curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_14.h5
      curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/Normal_PBMC_5.h5
      curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_15.h5
      curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_17.h5
      curl -O https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/new_dataset/labs/data/covid_data_GSE149689/sub/nCoV_PBMC_1.h5
      cd ../..
    fi
    
    # 查看文件大小
    ls -lGa data/raw
    

    安装并加载所需的R包

    # 安装所需的R包
    install.packages('Seurat')
    install.packages('Matrix')
    remotes::install_github("chris-mcginnis-ucsf/DoubletFinder", upgrade = F)
    
    # 加载所需的R包
    suppressMessages(require(Seurat))
    suppressMessages(require(Matrix))
    suppressMessages(require(DoubletFinder))
    

    读取示例数据并构建seuart对象

    # 使用Read10X_h5函数读取h5文件
    cov.15 <- Seurat::Read10X_h5(filename = "data/raw/nCoV_PBMC_15.h5", use.names = T)
    cov.1 <- Seurat::Read10X_h5(filename = "data/raw/nCoV_PBMC_1.h5", use.names = T)
    cov.17 <- Seurat::Read10X_h5(filename = "data/raw/nCoV_PBMC_17.h5", use.names = T)
    ctrl.5 <- Seurat::Read10X_h5(filename = "data/raw/Normal_PBMC_5.h5", use.names = T)
    ctrl.13 <- Seurat::Read10X_h5(filename = "data/raw/Normal_PBMC_13.h5", use.names = T)
    ctrl.14 <- Seurat::Read10X_h5(filename = "data/raw/Normal_PBMC_14.h5", use.names = T)
    
    # 查看数据信息
    dim(cov.1)
    #[1] 33538  1500
    dim(cov.15)
    #[1] 33538  1500
    dim(cov.17)
    #[1] 33538  1500
    dim(ctrl.5)
    #[1] 33538  1500
    dim(ctrl.13)
    #[1] 33538  1500
    dim(ctrl.14)
    #[1] 33538  1500
    
    # 使用CreateSeuratObject函数构建seurat对象
    sdata.cov15 <- CreateSeuratObject(cov.15, project = "covid_15")
    sdata.cov1 <- CreateSeuratObject(cov.1, project = "covid_1")
    sdata.cov17 <- CreateSeuratObject(cov.17, project = "covid_17")
    sdata.ctrl5 <- CreateSeuratObject(ctrl.5, project = "ctrl_5")
    sdata.ctrl13 <- CreateSeuratObject(ctrl.13, project = "ctrl_13")
    sdata.ctrl14 <- CreateSeuratObject(ctrl.14, project = "ctrl_14")
    
    # 添加样本分组信息
    sdata.cov1$type = "Covid"
    sdata.cov15$type = "Covid"
    sdata.cov17$type = "Covid"
    sdata.ctrl5$type = "Ctrl"
    sdata.ctrl13$type = "Ctrl"
    sdata.ctrl14$type = "Ctrl"
    
    # 查看seurat对象
    sdata.cov1
    # An object of class Seurat 
    # 33538 features across 1500 samples within 1 assay 
    # Active assay: RNA (33538 features, 0 variable features)
    sdata.cov15
    # An object of class Seurat 
    # 33538 features across 1500 samples within 1 assay 
    # Active assay: RNA (33538 features, 0 variable features)
    sdata.cov17
    # An object of class Seurat 
    # 33538 features across 1500 samples within 1 assay 
    # Active assay: RNA (33538 features, 0 variable features)
    sdata.ctrl5
    # An object of class Seurat 
    # 33538 features across 1500 samples within 1 assay 
    # Active assay: RNA (33538 features, 0 variable features)
    sdata.ctrl13
    # An object of class Seurat 
    # 33538 features across 1500 samples within 1 assay 
    # Active assay: RNA (33538 features, 0 variable features)
    sdata.ctrl14
    # An object of class Seurat 
    # 33538 features across 1500 samples within 1 assay 
    # Active assay: RNA (33538 features, 0 variable features)
    

    合并所有样本的数据

    # Merge datasets into one single seurat object
    # 合并所有样本
    alldata <- merge(x = sdata.cov15, 
                     y = c(sdata.cov1, sdata.cov17, sdata.ctrl5, 
                           sdata.ctrl13, sdata.ctrl14), 
                     add.cell.ids = c("covid_15", "covid_1", "covid_17",
                                      "ctrl_5", "ctrl_13", "ctrl_14"))
    
    # remove all objects that will not be used.
    # 删除不用的数据对象
    rm(cov.15, cov.1, cov.17, 
       ctrl.5, ctrl.13, ctrl.14, 
       sdata.cov15, sdata.cov1, sdata.cov17, 
       sdata.ctrl5, sdata.ctrl13, sdata.ctrl14)
    
    # run garbage collect to free up memory
    gc()
    #          used  (Mb) gc trigger   (Mb)  max used   (Mb)
    # Ncells 11991984 640.5   22255031 1188.6  22255031 1188.6
    # Vcells 96687060 737.7  169043189 1289.7 165738512 1264.5
    
    # 查看数据
    alldata
    # An object of class Seurat 
    # 33538 features across 9000 samples within 1 assay 
    # Active assay: RNA (33538 features, 0 variable features)
    
    as.data.frame(alldata@assays$RNA@counts[1:10, 1:2])
    #            covid_15_CTCCATGTCAACGTGT-15 covid_15_CATAAGCAGGAACGAA-15
    #MIR1302-2HG                            0                            0
    #FAM138A                                0                            0
    #OR4F5                                  0                            0
    #AL627309.1                             0                            0
    #AL627309.3                             0                            0
    #AL627309.2                             0                            0
    #AL627309.4                             0                            0
    #AL732372.1                             0                            0
    #OR4F29                                 0                            0
    #AC114498.1                             0                            0
    
    head(alldata@meta.data, 10)
    #                            orig.ident nCount_RNA nFeature_RNA  type
    #covid_15_CTCCATGTCAACGTGT-15   covid_15      14911         3526 Covid
    #covid_15_CATAAGCAGGAACGAA-15   covid_15        338          203 Covid
    #covid_15_TTCACCGTCAGGAAGC-15   covid_15      28486         4542 Covid
    #covid_15_CGTCCATGTCCGGACT-15   covid_15       1318          539 Covid
    #covid_15_GTCCACTAGTCGCCCA-15   covid_15       4805         1493 Covid
    #covid_15_ATCCATTGTTGATGTC-15   covid_15       5386         1617 Covid
    #covid_15_AGAAGCGAGGGCCTCT-15   covid_15        686          407 Covid
    #covid_15_GAGGGTAGTAGGTTTC-15   covid_15       2155         1116 Covid
    #covid_15_CAAGACTTCTGCTTTA-15   covid_15       1216          128 Covid
    #covid_15_GCCAACGAGCTCTATG-15   covid_15        729          356 Covid
    

    计算数据质量控制指标

    • 计算线粒体相关基因含量
    # 计算线粒体相关基因百分比
    # Way1: Doing it using Seurat function
    # 使用PercentageFeatureSet函数计算线粒体基因含量
    alldata <- PercentageFeatureSet(alldata, "^MT-", col.name = "percent_mito")
    
    # Way2: Doing it manually
    # 手动计算线粒体基因含量
    total_counts_per_cell <- colSums(alldata@assays$RNA@counts)
    head(total_counts_per_cell)
    #covid_15_CTCCATGTCAACGTGT-15 covid_15_CATAAGCAGGAACGAA-15 
    #                      14911                          338 
    #covid_15_TTCACCGTCAGGAAGC-15 covid_15_CGTCCATGTCCGGACT-15 
    #                       28486                         1318 
    #covid_15_GTCCACTAGTCGCCCA-15 covid_15_ATCCATTGTTGATGTC-15 
    #                        4805                         5386 
    # 获取线粒体相关基因
    mito_genes <- rownames(alldata)[grep("^MT-", rownames(alldata))]
    head(mito_genes)
    # [1] "MT-ND1"  "MT-ND2"  "MT-CO1"  "MT-CO2"  "MT-ATP8" "MT-ATP6"
    # 计算线粒体基因含量
    alldata$percent_mito <- colSums(alldata@assays$RNA@counts[mito_genes, ])/total_counts_per_cell
    
    • 计算核糖体相关基因含量
    # Way1: Doing it using Seurat function
    # 使用PercentageFeatureSet函数计算核糖体基因含量
    alldata <- PercentageFeatureSet(alldata, "^RP[SL]", col.name = "percent_ribo")
    
    # Way2: Doing it manually
    # 手动计算核糖体相关基因含量 
    ribo_genes <- rownames(alldata)[grep("^RP[SL]", rownames(alldata))]
    head(ribo_genes, 10)
    #[1] "RPL22"   "RPL11"   "RPS6KA1" "RPS8"    "RPL5"    "RPS27"   "RPS6KC1"
    #[8] "RPS7"    "RPS27A"  "RPL31" 
    alldata$percent_ribo <- colSums(alldata@assays$RNA@counts[ribo_genes, ])/total_counts_per_cell
    
    • 计算血红蛋白相关基因含量
    # Percentage hemoglobin genes - includes all genes starting with HB except HBP.
    alldata <- PercentageFeatureSet(alldata, "^HB[^(P)]", col.name = "percent_hb")
    

    数据质量控制指标的可视化

    feats <- c("nFeature_RNA", "nCount_RNA", "percent_mito", "percent_ribo", "percent_hb")
    # 使用VlnPlot函数绘制小提琴图可视化相关质控指标
    VlnPlot(alldata, group.by = "orig.ident", features = feats, pt.size = 0.1, ncol = 3) + 
        NoLegend()
    
    image.png
    # 使用FeatureScatter函数绘制散点图
    FeatureScatter(alldata, "nCount_RNA", "nFeature_RNA", group.by = "orig.ident", pt.size = 0.5)
    
    image.png

    数据过滤筛选

    • 基于检测的过滤
      在这里,我们将仅考虑至少检测到200个基因的细胞,并且至少需要在3个细胞中表达这些基因。请注意,这些值在很大程度上取决于所使用的单细胞文库制备方法。
    # 过滤至少检测到200个基因的细胞
    selected_c <- WhichCells(alldata, expression = nFeature_RNA > 200)
    # 过滤至少在3个细胞中能检测到的基因
    selected_f <- rownames(alldata)[Matrix::rowSums(alldata) > 3]
    
    data.filt <- subset(alldata, features = selected_f, cells = selected_c)
    dim(data.filt)
    ## [1] 18147  7973
    

    注意:检测到极高数量基因的细胞可能一些doublet。但是,根据样本中细胞类型的组成,也可能会存在一种具有更多基因(以及更高计数)的细胞。在这种情况下,我们将需要进一步进行doublet预测。

    此外,我们还可以看看哪些基因对这类reads贡献最大。例如,我们可以绘制每个基因计数的百分比。

    # Compute the relative expression of each gene per cell Use sparse matrix
    # operations, if your dataset is large, doing matrix devisions the regular way
    # will take a very long time.
    par(mar = c(4, 8, 2, 1))
    C <- data.filt@assays$RNA@counts
    C <- Matrix::t(Matrix::t(C)/Matrix::colSums(C)) * 100
    most_expressed <- order(apply(C, 1, median), decreasing = T)[20:1]
    
    boxplot(as.matrix(t(C[most_expressed, ])), cex = 0.1, las = 1, xlab = "% total count per cell", 
        col = (scales::hue_pal())(20)[20:1], horizontal = TRUE)
    
    image.png

    如图所示,我们可以看到MALAT1基因占单个细胞UMI的30%,其他最重要的基因是线粒体和核糖体相关基因。核内lincRNA与数据质量和线粒体基因reads相关联是非常普遍的,因此对MALAT1的高检测可能是一个技术问题。我们可以收集有关此类基因的一些信息,这些信息对于质量控制和下游过滤是非常重要。

    • 基与线粒体/核糖体基因含量进行过滤
      我们可以通过数据质控小提琴图,选择过滤线粒体或核糖体基因含量相关的阈值。在此例中,我们选择过滤掉线粒体读数低于20%,核糖体读数少于5%的细胞。
    selected_mito <- WhichCells(data.filt, expression = percent_mito < 0.2)
    selected_ribo <- WhichCells(data.filt, expression = percent_ribo > 0.05)
    
    # and subset the object to only keep those cells
    data.filt <- subset(data.filt, cells = selected_mito)
    data.filt <- subset(data.filt, cells = selected_ribo)
    
    dim(data.filt)
    ## [1] 18147  5762
    
    table(data.filt$orig.ident)
    ##  covid_1 covid_15 covid_17  ctrl_13  ctrl_14   ctrl_5 
    ##      878      585     1042     1154     1063     1040
    

    绘制过滤后的质量控制图

    让我们再次绘制相同的QC统计信息。

    feats <- c("nFeature_RNA", "nCount_RNA", "percent_mito", "percent_ribo", "percent_hb")
    
    VlnPlot(data.filt, group.by = "orig.ident", features = feats, pt.size = 0.1, ncol = 3) + 
        NoLegend()
    
    image.png

    过滤筛选相关基因

    由于线粒体和MALAT1等基因的高表达被认为主要是技术性的,因此在进行任何下一步分析之前,将其从数据集中删除是明智的。

    dim(data.filt)
    ## [1] 18147  5762
    
    # Filter MALAT1
    # 过滤MALAT1基因
    data.filt <- data.filt[!grepl("MALAT1", rownames(data.filt)), ]
    
    # Filter Mitocondrial
    # 过滤线粒体基因
    data.filt <- data.filt[!grepl("^MT-", rownames(data.filt)), ]
    
    # Filter Ribossomal gene (optional if that is a problem on your data) data.filt
    # 过滤核糖体基因
    # <- data.filt[ ! grepl('^RP[SL]', rownames(data.filt)), ]
    
    # Filter Hemoglobin gene (optional if that is a problem on your data)
    # 过滤血红蛋白基因
    data.filt <- data.filt[!grepl("^HB[^(P)]", rownames(data.filt)), ]
    
    dim(data.filt)
    ## [1] 18121  5762
    

    查看样本的性别信息

    在处理人类或动物样本数据时,理想情况下,应将实验设置为单性别,以避免在结论中出现性别偏见。但是,这并不总是可能的。

    通过查看来自Y染色体(雄性)和XIST基因表达(主要是雌性)的读数,我们可以很容易地确定每个样本的性别。如果样本元数据性别与计算预测不一致,它也是一种检测样本是否存在混合的好方法。

    在这里,我们下载了公共数据的基因注释文件信息,请确保将其放置在正确的位置以使路径genes.file起作用。

    genes.file = "data/results/genes.table.csv"
    
    if (!file.exists(genes.file)) {
        suppressMessages(require(biomaRt))
    
        # initialize connection to mart, may take some time if the sites are
        # unresponsive.
        mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
    
        # fetch chromosome info plus some other annotations
        genes.table <- try(biomaRt::getBM(attributes = c("ensembl_gene_id", "external_gene_name", 
            "description", "gene_biotype", "chromosome_name", "start_position"), mart = mart, 
            useCache = F))
    
        if (!dir.exists("data/results")) {
            dir.create("data/results")
        }
        if (is.data.frame(genes.table)) {
            write.csv(genes.table, file = genes.file)
        }
    
        if (!file.exists(genes.file)) {
            download.file("https://raw.githubusercontent.com/NBISweden/workshop-scRNAseq/master/labs/misc/genes.table.csv", 
                destfile = "data/results/genes.table.csv")
            genes.table = read.csv(genes.file)
        }
    
    } else {
        genes.table = read.csv(genes.file)
    }
    
    genes.table <- genes.table[genes.table$external_gene_name %in% rownames(data.filt), ]
    head(genes.table)
    #  X ensembl_gene_id external_gene_name
    #1 1 ENSG00000281486              SNTG2
    #2 2 ENSG00000262826              INTS3
    #3 3 ENSG00000274891           TRAPPC12
    #4 4 ENSG00000280830               ADI1
    #5 5 ENSG00000263163            SLC27A3
    #6 6 ENSG00000261992            GATAD2B
    #                                                                  description
    #1                      syntrophin gamma 2 [Source:HGNC Symbol;Acc:HGNC:13741]
    #2            integrator complex subunit 3 [Source:HGNC Symbol;Acc:HGNC:26153]
    #3 trafficking protein particle complex 12 [Source:HGNC Symbol;Acc:HGNC:24284]
    #4              acireductone dioxygenase 1 [Source:HGNC Symbol;Acc:HGNC:30576]
    #5       solute carrier family 27 member 3 [Source:HGNC Symbol;Acc:HGNC:10997]
    #6   GATA zinc finger domain containing 2B [Source:HGNC Symbol;Acc:HGNC:30778]
    #    gene_biotype    chromosome_name start_position
    #1 protein_coding  CHR_HSCHR2_3_CTG1        1209102
    #2 protein_coding CHR_HSCHR1_1_CTG31      153745298
    #3 protein_coding  CHR_HSCHR2_1_CTG1        3401777
    #4 protein_coding  CHR_HSCHR2_1_CTG1        3498728
    #5 protein_coding CHR_HSCHR1_1_CTG31      153791585
    #6 protein_coding CHR_HSCHR1_1_CTG31      153821956
    

    现在我们有了染色体信息,可以计算每个细胞中来自Y染色体读数的比例。

    chrY.gene = genes.table$external_gene_name[genes.table$chromosome_name == "Y"]
    chrY.gene
    #[1] "RPS4Y1"     "AC006157.1" "ZFY"        "ZFY-AS1"    "LINC00278" 
    #[6] "PCDH11Y"    "USP9Y"      "DDX3Y"      "UTY"        "TMSB4Y"    
    #[11] "TTTY14"     "AC010889.1" "KDM5D"      "EIF1AY"     "RPS4Y2"
    
    data.filt$pct_chrY = colSums(data.filt@assays$RNA@counts[chrY.gene, ])/colSums(data.filt@assays$RNA@counts)
    

    绘制XIST表达与chrY比例的关系散点图。

    FeatureScatter(data.filt, feature1 = "XIST", feature2 = "pct_chrY")
    
    image.png

    绘制小提琴图查看样本性别。

    VlnPlot(data.filt, features = c("XIST", "pct_chrY"))
    
    image.png

    如图所示,我们可以清楚地看到这里有两个男性和四个女性,您可以看到它们是哪些样本吗?您认为这会对下游分析造成任何问题吗?与您的小组讨论:应对这种性别bias的最佳方法是什么?

    计算细胞周期评分

    在这里,我们计算细胞周期评分。为了对特定基因列表进行评分,该算法计算给定基因列表的平均表达与参考基因集的平均表达之差。为了建立参考基因集,该函数随机选择一堆与给定基因列表的表达分布相匹配的基因。计算完细胞周期评分后,将会在数据中添加三个metadata信息,即S期评分,G2M期评分和预测的细胞周期。

    # Before running CellCycleScoring the data need to be normalized and logtransformed.
    # 进行数据标准化处理
    data.filt = NormalizeData(data.filt)
    
    # 使用CellCycleScoring函数计算细胞周期评分
    data.filt <- CellCycleScoring(object = data.filt, g2m.features = cc.genes$g2m.genes, 
        s.features = cc.genes$s.genes)
    ## Warning: The following features are not present in the object: MLF1IP, not
    ## searching for symbol synonyms
    
    ## Warning: The following features are not present in the object: FAM64A, HN1, not
    ## searching for symbol synonyms
    

    现在,我们还可以为细胞周期得分绘制小提琴图。

    VlnPlot(data.filt, features = c("S.Score", "G2M.Score"), group.by = "orig.ident", 
        ncol = 4, pt.size = 0.1)
    
    image.png

    预测双细胞doublet

    在同一小孔/液滴中,细胞的Doublets/Mulitples是scRNA-seq文库构建中的常见问题。尤其是在基于液滴的方法中,会导致细胞超负荷。在10x实验中,双细胞的比例与加载细胞的数量呈线性相关。如Chromium用户指南所述,双峰速率大约如下:

    image.png

    大多数双细胞检测工具通过合并细胞的计数来模拟双细胞,并将预测到的与模拟双细胞相似的细胞归为doublets。大多数此类软件包都需要对数据集中预期双细胞数的数量/比例进行假设。我们这里使用的数据是经过二次抽样的,但是原始数据集中每个样本包含了大约5000个细胞,因此我们可以假设它们加载了大约9000个细胞,并且其双细胞率约为4%。

    OBS!理想情况下,我们应该对每个样本分别进行doublet预测,尤其是在您不同的样本具有不同比例的细胞类型的情况下。在这里,我们对数据进行了二次采样,因此每个样本中只有很少的细胞,并且所有样本都是经过sorted的PBMC,因此可以将它们一起运行。

    在这里,我们将使用DoubletFinder包来预测双细胞。

    # remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')
    suppressMessages(require(DoubletFinder))
    
    # 筛选高变基因
    data.filt = FindVariableFeatures(data.filt, verbose = F)
    # 进行数据归一化
    data.filt = ScaleData(data.filt, vars.to.regress = c("nFeature_RNA", "percent_mito"), 
        verbose = F)
    
    # PCA降维
    data.filt = RunPCA(data.filt, verbose = F, npcs = 20)
    # UMAP可视化
    data.filt = RunUMAP(data.filt, dims = 1:10, verbose = F)
    

    然后,我们运行doubletFinder,选择前10个PC,pK值设为0.9。

    # Can run parameter optimization with paramSweep, but skip for now.
    # sweep.res <- paramSweep_v3(data.filt) 
    # sweep.stats <- summarizeSweep(sweep.res, GT = FALSE) 
    # bcmvn <- find.pK(sweep.stats) 
    # barplot(bcmvn$BCmetric, names.arg = bcmvn$pK, las=2)
    
    # define the expected number of doublet cellscells.
    nExp <- round(ncol(data.filt) * 0.04)  # expect 4% doublets
    # 使用doubletFinder_v3函数预测双细胞
    data.filt <- doubletFinder_v3(data.filt, pN = 0.25, pK = 0.09, nExp = nExp, PCs = 1:10)
    ## [1] "Creating 1921 artificial doublets..."
    ## [1] "Creating Seurat object..."
    ## [1] "Normalizing Seurat object..."
    ## [1] "Finding variable genes..."
    ## [1] "Scaling data..."
    ## [1] "Running PCA..."
    ## [1] "Calculating PC distance matrix..."
    ## [1] "Computing pANN..."
    ## [1] "Classifying doublets.."
    
    # name of the DF prediction can change, so extract the correct column name.
    DF.name = colnames(data.filt@meta.data)[grepl("DF.classification", colnames(data.filt@meta.data))]
    
    cowplot::plot_grid(ncol = 2, DimPlot(data.filt, group.by = "orig.ident") + NoAxes(), 
        DimPlot(data.filt, group.by = DF.name) + NoAxes())
    
    image.png

    理想情况下,双细胞中比单细胞检查到更多的基因,让我们检查一下我们预测的双细胞中是否也具有更多的检测到的基因。

    VlnPlot(data.filt, features = "nFeature_RNA", group.by = DF.name, pt.size = 0.1)
    
    image.png

    现在,我们可以将预测的双细胞从数据中删除。

    data.filt = data.filt[, data.filt@meta.data[, DF.name] == "Singlet"]
    dim(data.filt)
    ## [1] 18121  5532
    

    保存质控过滤后的数据

    最后,让我们保存经过QC过滤的数据以进行后续的分析。创建输出目录results并将数据保存到该文件夹中。

    dir.create("data/results", showWarnings = F)
    
    saveRDS(data.filt, "data/results/seurat_covid_qc.rds")
    
    sessionInfo()
    ## R version 4.0.3 (2020-10-10)
    ## Platform: x86_64-apple-darwin13.4.0 (64-bit)
    ## Running under: macOS Catalina 10.15.5
    ## 
    ## Matrix products: default
    ## BLAS/LAPACK: /Users/paulo.czarnewski/.conda/envs/scRNAseq2021/lib/libopenblasp-r0.3.12.dylib
    ## 
    ## locale:
    ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
    ## 
    ## attached base packages:
    ## [1] grid      stats     graphics  grDevices utils     datasets  methods  
    ## [8] base     
    ## 
    ## other attached packages:
    ## [1] KernSmooth_2.23-18  fields_11.6         spam_2.6-0         
    ## [4] dotCall64_1.0-0     DoubletFinder_2.0.3 Matrix_1.3-2       
    ## [7] Seurat_3.2.3        RJSONIO_1.3-1.4     optparse_1.6.6     
    ## 
    ## loaded via a namespace (and not attached):
    ##   [1] Rtsne_0.15            colorspace_2.0-0      deldir_0.2-9         
    ##   [4] ellipsis_0.3.1        ggridges_0.5.3        spatstat.data_1.7-0  
    ##   [7] farver_2.0.3          leiden_0.3.6          listenv_0.8.0        
    ##  [10] remotes_2.2.0         bit64_4.0.5           getopt_1.20.3        
    ##  [13] ggrepel_0.9.1         RSpectra_0.16-0       codetools_0.2-18     
    ##  [16] splines_4.0.3         knitr_1.30            polyclip_1.10-0      
    ##  [19] jsonlite_1.7.2        ica_1.0-2             cluster_2.1.0        
    ##  [22] png_0.1-7             uwot_0.1.10           shiny_1.5.0          
    ##  [25] sctransform_0.3.2     compiler_4.0.3        httr_1.4.2           
    ##  [28] assertthat_0.2.1      fastmap_1.0.1         lazyeval_0.2.2       
    ##  [31] later_1.1.0.1         formatR_1.7           htmltools_0.5.1      
    ##  [34] tools_4.0.3           rsvd_1.0.3            igraph_1.2.6         
    ##  [37] gtable_0.3.0          glue_1.4.2            RANN_2.6.1           
    ##  [40] reshape2_1.4.4        dplyr_1.0.3           maps_3.3.0           
    ##  [43] Rcpp_1.0.6            spatstat_1.64-1       scattermore_0.7      
    ##  [46] vctrs_0.3.6           nlme_3.1-151          lmtest_0.9-38        
    ##  [49] xfun_0.20             stringr_1.4.0         globals_0.14.0       
    ##  [52] mime_0.9              miniUI_0.1.1.1        lifecycle_0.2.0      
    ##  [55] irlba_2.3.3           goftest_1.2-2         future_1.21.0        
    ##  [58] MASS_7.3-53           zoo_1.8-8             scales_1.1.1         
    ##  [61] promises_1.1.1        spatstat.utils_1.20-2 parallel_4.0.3       
    ##  [64] RColorBrewer_1.1-2    yaml_2.2.1            curl_4.3             
    ##  [67] reticulate_1.18       pbapply_1.4-3         gridExtra_2.3        
    ##  [70] ggplot2_3.3.3         rpart_4.1-15          stringi_1.5.3        
    ##  [73] rlang_0.4.10          pkgconfig_2.0.3       matrixStats_0.57.0   
    ##  [76] evaluate_0.14         lattice_0.20-41       ROCR_1.0-11          
    ##  [79] purrr_0.3.4           tensor_1.5            labeling_0.4.2       
    ##  [82] patchwork_1.1.1       htmlwidgets_1.5.3     bit_4.0.4            
    ##  [85] cowplot_1.1.1         tidyselect_1.1.0      parallelly_1.23.0    
    ##  [88] RcppAnnoy_0.0.18      plyr_1.8.6            magrittr_2.0.1       
    ##  [91] R6_2.5.0              generics_0.1.0        DBI_1.1.1            
    ##  [94] withr_2.4.0           pillar_1.4.7          mgcv_1.8-33          
    ##  [97] fitdistrplus_1.1-3    survival_3.2-7        abind_1.4-5          
    ## [100] tibble_3.0.5          future.apply_1.7.0    crayon_1.3.4         
    ## [103] hdf5r_1.3.3           plotly_4.9.3          rmarkdown_2.6        
    ## [106] data.table_1.13.6     digest_0.6.27         xtable_1.8-4         
    ## [109] tidyr_1.1.2           httpuv_1.5.5          munsell_0.5.0        
    ## [112] viridisLite_0.3.0
    

    参考来源:https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/seurat/seurat_01_qc.html

    相关文章

      网友评论

        本文标题:NBIS系列单细胞转录组数据分析实战(一):数据质控

        本文链接:https://www.haomeiwen.com/subject/vauqkltx.html