美文网首页
scATAC-seq数据分析练习(成年小鼠大脑组织)

scATAC-seq数据分析练习(成年小鼠大脑组织)

作者: 生信start_site | 来源:发表于2020-09-19 08:24 被阅读0次

    教程官网以及练习数据下载:https://satijalab.org/signac/articles/

    在这个练习里使用了成年小鼠大脑细胞的scATAC-seq数据来作为例子,这个scATAC-seq是由10x Genomics平台测得的。练习数据是count矩阵。这篇笔记就是按照官网的教程走一遍代码,熟悉单细胞ATAC-seq的分析流程。

    > library(Signac)
    > library(Seurat)
    > library(GenomeInfoDb)
    > library(EnsDb.Mmusculus.v79)
    > library(ggplot2)
    > library(patchwork)
    > set.seed(1234)
    

    Pre-processing workflow数据预处理

    > counts <- Read10X_h5("./mouse_brain/atac_v1_adult_brain_fresh_5k_filtered_peak_bc_matrix.h5")
    > metadata <- read.csv(
      file = "./mouse_brain/atac_v1_adult_brain_fresh_5k_singlecell.csv",
      header = TRUE,
      row.names = 1
    )
    #从count矩阵构建一个ChromatinAssay对象,
    #输入的矩阵应该是以features x 细胞的格式
    > brain_assay <- CreateChromatinAssay(
      counts = counts, 
      sep = c(":", "-"),
      genome = "mm10",
      fragments = './mouse_brain/atac_v1_adult_brain_fresh_5k_fragments.tsv.gz',
      min.cells = 1
    )
    
    Computing hash
    Checking for 5337 cell barcodes
    

    上面的CreateChromatinAssay()函数是创建ChromatinAssay对象。参数分别是:

    counts:count是非标准化的,原始的count值
    sep:用于分隔基因组坐标的字符,这里是:和-。
    genome:说明你用的是哪一个物种的基因组。
    fragments:输入矩阵中包含的数据的tabix索引片段文件的路径。
    min.cells:至少在一个细胞里检测到features,这一步实际上过滤细胞。把什么reads都没有的细胞除去。
    这个函数还有一些其他的参数,具体请见:Create ChromatinAssay object

    构建Seurat对象:

    > brain <- CreateSeuratObject(
      counts = brain_assay, #可以是一个矩阵,也可以是assay-derived的对象
      assay = 'peaks', #Name of the initial assay
      project = 'ATAC', #Project name for the Seurat object
      meta.data = metadata
    )
    #会弹出:(不要紧,官网上的教程里也出现了这一行warning信息)
    Warning message:
    In CreateSeuratObject.Assay(counts = brain_assay, assay = "peaks",  :
      Some cells in meta.data not present in provided counts matrix.
    

    这里我们可以给这个brain对象添加基因注释信息,这会方便之后从这个seurat对象里直接拉取基因注释信息:

    #从EnsDb对象中提取所有染色体的转录信息
    > annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)
    #会弹出:
    Fetching data...OK
    Parsing exons...OK
    Defining introns...OK
    Defining UTRs...OK
    Defining CDS...OK
    aggregating...
    Done
    ......(此处省略N行)
    There were 21 warnings (use warnings() to see them)
    

    修改UCSC style:

    > seqlevelsStyle(annotations) <- 'UCSC'
    > genome(annotations) <- "mm10"
    

    给seurat对象加基因信息:

    > Annotation(brain) <- annotations
    

    Computing QC Metrics计算质量控制指标

    首先计算每一个细胞里的核小体信号强度。使用的是NucleosomeSignal()函数,计算147bp-294bp(单个核小体)之间的片段与小于147bp片段(非核小体)的比例。
    计算后会返回一个Seurat对象,其中添加了关于每个细胞单核小体片段与无核小体片段比例的metadata,以及每个比例的百分数rank。

    > brain <- NucleosomeSignal(object = brain)
    Found 5337 cell barcodes
    Done Processing 53 million lines
    

    我们可以查看所有细胞的片段长度周期性,并且把核小体高信号强度和低信号强度的细胞分群。这样可以看单核小体/非核小体的比例,从而检查离群细胞。剩下的细胞就是可以用来进行后续分析的细胞。

    #在brain对象里加一个metadata,名为nucleosome_group,如果nucleosome_signal里的数值大于4,则在nucleosome_group里标记NS>4,反之则标记NS<4.
    > brain$nucleosome_group <- ifelse(brain$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
    #绘制不同长度的片段在不同细胞群中出现的频率
    #这里我们绘制上面区分的NS>4和NS<4的细胞群里不同片段在chr1的内出现的频率。
    > FragmentHistogram(object = brain, group.by = 'nucleosome_group', region = 'chr1-1-10000000')
    

    Tn5整合事件在转录起始位点是富集的,所以也可以用来作为一个重要的质量控制参数来评价Tn5在ATAC-seq实验中的靶向效率。ENCODE定义了TSS富集分数(TSS周围的Tn5整合位点数量标准化为侧翼区域的Tn5整合位点数量)。

    转录起始点(TSS)富集分数: TSS富集计算是一个信号到噪声的计算。落在一系列TSSs周围的reads被收集起来,形成以TSSs为中心的reads的聚合分布,并在两个方向上扩展到1000 bp(总共为2000bp)。然后,通过取在每个末端100 bps中的平均reads深度(总共200bp的平均数据),计算在每个位置的read深度的倍数变化,对该分布进行标准化。这意味着侧翼(flanks)应该从1开始,如果在转录起始位点(基因组的高度开放区域)有高的reads信号,信号应该增加形成中间的peak。我们将分布的中心的信号值进行标准化,作为TSS的富集参数。用于评估ATAC-seq。

    具体内容可见: (https://www.encodeproject.org/data-standards/terms/). 我们可以使用Signac包里的TSSEnrichment()功能来计算每一个细胞TSS富集分数:

    > brain <- TSSEnrichment(brain, fast = FALSE)
    
    Extracting TSS positions
    Finding + strand cut sites
    Finding - strand cut sites
    |--------------------------------------------------|
    |==================================================|
    Computing mean insertion frequency in flanking regions
    Normalizing TSS score
    
    #将上面计算好的数值加到brain对象中,并区分>2的是高,反之则是低
    > brain$high.tss <- ifelse(brain$TSS.enrichment > 2, 'High', 'Low')
    #绘制每个位置相对于TSS的标准化TSS富集评分,这里分别绘制high和low两个组
    > TSSPlot(brain, group.by = 'high.tss') + NoLegend()
    

    然后我们要把所有细胞里落在peaks上的reads的比例算出来,并把落在“黑名单”区域上的Reads比例算出来:

    # add blacklist ratio and fraction of reads in peaks
    > brain$pct_reads_in_peaks <- brain$peak_region_fragments / brain$passed_filters * 100
    > brain$blacklist_ratio <- brain$blacklist_region_fragments / brain$peak_region_fragments
    > VlnPlot(
      object = brain,
      features = c('pct_reads_in_peaks', 'peak_region_fragments',
                   'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal'),
      pt.size = 0.1,
      ncol = 5
    )
    

    fraction of reads in peaks:
    reads在峰值的部分(FRiP)——所有比对到peak区域的reads的部分,即在显著富集的peak上的可用的reads除以所有可用的reads。一般来说,FRiP分数与区域的数量呈正相关。(Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)

    然后把离群值去掉:

    > brain <- subset(
      x = brain,
      subset = peak_region_fragments > 3000 &
        peak_region_fragments < 100000 &
        pct_reads_in_peaks > 40 &
        blacklist_ratio < 0.025 &
        nucleosome_signal < 4 &
        TSS.enrichment > 2
    )
    > brain
    An object of class Seurat 
    156815 features across 3512 samples within 1 assay 
    Active assay: peaks (156815 features, 0 variable features)
    

    Normalization and linear dimensional reduction标准化和线性降维

    > brain <- RunTFIDF(brain)
    Performing TF-IDF normalization
    > brain <- FindTopFeatures(brain, min.cutoff = 'q0')
    # A dimensional reduction object with key LSI_
    > brain <- RunSVD(object = brain)
    Running SVD
    Scaling cell embeddings
    

    第一个LSI成分通常捕获的是测序深度(技术variation)而不是生物学variation。如果是这种情况,则应该从下游分析中删除该成分。我们可以使用DepthCor()函数来评估每个LSI成分与序列深度的相关性:

    #计算总counts数与每一个降维成分之间的相关性
    > DepthCor(brain)
    

    在上图里,我们可以看到第一个LSI成分与细胞总count数有强烈的相关性,所以我们在后续的分析中要去掉这个成分。

    Non-linear dimension reduction and clustering非线性降维和聚类

    由于细胞被嵌入到低维空间中,我们就可以使用通常用于分析scRNA-seq数据的方法来执行graph-based聚类,并使用非线性降维来进行可视化。函数RunUMAP()、FindNeighbors()和FindClusters()都来自Seurat包:

    > brain <- RunUMAP(
       object = brain,
       reduction = 'lsi',
       dims = 2:30
     )
    Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
    To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
    This message will be shown once per session
    21:47:12 UMAP embedding parameters a = 0.9922 b = 1.112
    21:47:13 Read 3512 rows and found 29 numeric columns
    21:47:13 Using Annoy for neighbor search, n_neighbors = 30
    21:47:13 Building Annoy index with metric = cosine, n_trees = 50
    0%   10   20   30   40   50   60   70   80   90   100%
    [----|----|----|----|----|----|----|----|----|----|
    **************************************************|
    21:47:13 Writing NN index file to temp file C:\Users\YANFAN~1\AppData\Local\Temp\RtmpW4OpBN\file97830d16bc1
    21:47:13 Searching Annoy index using 1 thread, search_k = 3000
    21:47:15 Annoy recall = 100%
    21:47:16 Commencing smooth kNN distance calibration using 1 thread
    21:47:19 Initializing from normalized Laplacian + noise
    21:47:19 Commencing optimization for 500 epochs, with 130124 positive edges
    0%   10   20   30   40   50   60   70   80   90   100%
    [----|----|----|----|----|----|----|----|----|----|
    **************************************************|
    21:47:32 Optimization finished
    
    > brain <- FindNeighbors(
       object = brain,
       reduction = 'lsi',
       dims = 2:30
     )
    Computing nearest neighbor graph
    Computing SNN
    > brain <- FindClusters(   #根据shared nearest neighbor (SNN) 来鉴定clusters
      object = brain,
      algorithm = 3,
      resolution = 1.2,
      verbose = FALSE
    )
    > DimPlot(object = brain, label = TRUE) + NoLegend()
    

    Create a gene activity matrix创建基因活性矩阵

    #计算基因活性(计算每一个细胞里,落在gene body上和启动子上的counts数)
    > gene.activities <- GeneActivity(brain)
    Extracting gene coordinates
    Extracting reads overlapping genomic regions
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=09m 20s
    # 把基因活性矩阵作为一个新的assay添加到Seurat对象里
    > brain[['RNA']] <- CreateAssayObject(counts = gene.activities)
    > brain <- NormalizeData(
       object = brain,
       assay = 'RNA',
       normalization.method = 'LogNormalize',
       scale.factor = median(brain$nCount_RNA)
     )
    Performing log-normalization
    0%   10   20   30   40   50   60   70   80   90   100%
    [----|----|----|----|----|----|----|----|----|----|
    **************************************************|
    
    > DefaultAssay(brain) <- 'RNA'
    > FeaturePlot(
       object = brain,
       features = c('Sst','Pvalb',"Gad2","Neurod6","Rorb","Syt6"),
       pt.size = 0.1,
       max.cutoff = 'q95',
       ncol = 3
     )
    

    Integrating with scRNA-seq data与scRNA-seq数据整合

    为了帮助解释scATAC-seq数据,我们可以根据同一个生物系统(成年小鼠大脑)的scRNA-seq实验对细胞进行分类。这里我们使用描述的跨模态集成和标签转移的方法(here),这里有一个更详细的教程(here)。
    你可以从Allen Institute网站下载此实验的原始数据(website),并在GitHub上查看用于构造此对象的代码(GitHub)。或者,你可以在这里下载预处理的Seurat对象(here)。

    # 这里我们使用预处理过的scRNA-seq Seurat对象
    > allen_rna <- readRDS("./mouse_brain/allen_brain.rds")
    
    > allen_rna <- FindVariableFeatures(
       object = allen_rna,
       nfeatures = 5000
     )
    Calculating gene variances
    0%   10   20   30   40   50   60   70   80   90   100%
    [----|----|----|----|----|----|----|----|----|----|
    **************************************************|
    Calculating feature variances of standardized and clipped values
    0%   10   20   30   40   50   60   70   80   90   100%
    [----|----|----|----|----|----|----|----|----|----|
    **************************************************|
    
    #把ATAC-seq和RNA-seq的数据对应起来,这里需要输入你的参考对象、和要进行query(查询)的对象,这里reference是RNA-seq的对象,我们需要用ATAC-seq结果和它对应起来
    > transfer.anchors <- FindTransferAnchors( 
       reference = allen_rna,
       query = brain,
       reduction = 'cca', #这里降维的方法是CCA。如果你两个对象都是scRNA-seq的话,则用PCA,代码是reduction='pcaproject'
       dims = 1:40
     )
    Running CCA
    Merging objects
    Finding neighborhoods
    Finding anchors
        Found 7875 anchors
    Filtering anchors
        Retained 4854 anchors
    
    > predicted.labels <- TransferData(
       anchorset = transfer.anchors, #这里anchorset应该是你上一步生成的对象
       refdata = allen_rna$subclass, #data to transfer
       weight.reduction = brain[['lsi']],
       dims = 2:30
     )
    Finding integration vectors
    Finding integration vector weights
    0%   10   20   30   40   50   60   70   80   90   100%
    [----|----|----|----|----|----|----|----|----|----|
    **************************************************|
    Predicting cell labels
    
    > brain <- AddMetaData(object = brain, metadata = predicted.labels)
    
    > plot1 <- DimPlot(allen_rna, group.by = 'subclass', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')
    > plot2 <- DimPlot(brain, group.by = 'predicted.id', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')
    > plot1 + plot2
    

    Why did we change default parameters?为什么我们要修改默认参数?
    我们更改了FindIntegrationAnchors()和FindVariableFeatures()的默认参数(包括更多的features和维度)。你可以以两种方式运行分析,这两种方法的结果是非常相似的。然而,当使用默认参数时,我们误将cluster 11细胞标记为Vip间神经元,而实际上它们是最近我们和其他人描述的表达Meis2的CGE衍生的间神经元群体。原因是这个细胞群在scRNA-seq数据中非常罕见(0.3%),所以定义这个细胞群的基因(例如Meis2)表达过低,无法在最初的一组变量特征中被选择。因此,我们需要更多的基因和维度来进行跨模比对。有趣的是,与scRNA-seq数据相比,scATAC-seq数据中的这个细胞群有10倍的富集(请看这篇文章以理解这种现象的发生的可能原因 this paper )。

    你可以看到,基于RNA的分类与仅在ATAC-seq数据上计算的UMAP可视化完全一致。我们现在可以注释我们的scATAC-seq的细胞群(或者,我们可以使用RNA分类)。我们注意到三个小的细胞群(13,20,21),它们代表了scRNA-seq标签的细分。尝试从allen scRNA-seq数据集转移聚类标签(它显示了更细微的区别),对它们进行注释!

    # replace each label with its most likely prediction
    > for(i in levels(brain)) {
      cells_to_reid <- WhichCells(brain, idents = i)
      newid <- names(sort(table(brain$predicted.id[cells_to_reid]),decreasing=TRUE))[1]
      Idents(brain, cells = cells_to_reid) <- newid
    }
    

    Find differentially accessible peaks between clusters不同细胞群之间差异可接近性peak

    #switch back to working with peaks instead of gene activities
    > DefaultAssay(brain) <- 'peaks'
    
    > da_peaks <- FindMarkers( #Finds markers (differentially expressed genes) for identity classes
       object = brain,
       ident.1 = c("L2/3 IT"),
       ident.2 = c("L4", "L5 IT", "L6 IT"),
       min.pct = 0.4,
       test.use = 'LR',
       latent.vars = 'peak_region_fragments'
     )
      |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s  
    
    > head(da_peaks) #这里avg_logFC的值如果是正的,说明peaks在上一步里ident.1设置的细胞群里比较多;如果是负的,说明在ident.2细胞群里比较多
                                     p_val  avg_logFC pct.1 pct.2    p_val_adj
    chr4-86523678-86525285    6.291218e-67  0.5613693 0.401 0.036 9.865574e-62
    chr15-87605281-87607659   4.653646e-57  0.4891020 0.484 0.091 7.297614e-52
    chr2-118700082-118704897  1.747014e-56  0.3044948 0.624 0.181 2.739581e-51
    chr3-137056475-137058371  1.246008e-52  0.3551127 0.524 0.103 1.953928e-47
    chr13-69329933-69331707   3.007140e-52 -0.4234536 0.144 0.444 4.715647e-47
    chr10-107751762-107753240 1.533317e-51  0.3950263 0.612 0.184 2.404472e-46
    
    #对差异peak列表里的第一个peak(chr4的一个区域)进行不同细胞群之间的比较,并作图
    > plot1 <- VlnPlot(
       object = brain,
       features = rownames(da_peaks)[1],
       pt.size = 0.1,
       idents = c("L4","L5 IT","L2/3 IT","L6 IT")
     )
    > plot2 <- FeaturePlot(
       object = brain,
       features = rownames(da_peaks)[1],
       pt.size = 0.1,
       max.cutoff = 'q95'
     )
    > plot1 | plot2
    
    #标记显著差异的peaks。logFC>0.25说明在L2/3里peak更加容易接近,如果<-0.25说明在L4/5/6细胞群里peak更容易接近。
    > open_l23 <- rownames(da_peaks[da_peaks$avg_logFC > 0.25, ])
    > open_l456 <- rownames(da_peaks[da_peaks$avg_logFC < -0.25, ])
    > closest_l23 <- ClosestFeature(brain, open_l23) #closestFeature:寻找距离一组给定的基因组区域最近的features
    > closest_l456 <- ClosestFeature(brain, open_l456)
    > head(closest_l23)
    
    > head(closest_l23)
                                    tx_id gene_name            gene_id   gene_biotype type            closest_region
    ENSMUST00000151481 ENSMUST00000151481   Fam154a ENSMUSG00000028492 protein_coding  gap    chr4-86487920-86538964
    ENSMUSE00000647021 ENSMUST00000068088   Fam19a5 ENSMUSG00000054863 protein_coding exon   chr15-87625230-87625486
    ENSMUST00000104937 ENSMUST00000104937   Ankrd63 ENSMUSG00000078137 protein_coding  cds  chr2-118702266-118703438
    ENSMUST00000070198 ENSMUST00000070198    Ppp3ca ENSMUSG00000028161 protein_coding  utr  chr3-136935226-136937727
    ENSMUST00000165341 ENSMUST00000165341     Otogl ENSMUSG00000091455 protein_coding  utr chr10-107762223-107762309
    ENSMUST00000179893 ENSMUST00000179893      Ryr1 ENSMUSG00000030592 protein_coding  cds    chr7-29073019-29073139
                                    query_region distance
    ENSMUST00000151481    chr4-86523678-86525285        0
    ENSMUSE00000647021   chr15-87605281-87607659    17570
    ENSMUST00000104937  chr2-118700082-118704897        0
    ENSMUST00000070198  chr3-137056475-137058371   118747
    ENSMUST00000165341 chr10-107751762-107753240     8982
    ENSMUST00000179893    chr7-29070299-29073172        0
    
    > head(closest_l456)
                                    tx_id gene_name            gene_id   gene_biotype type           closest_region
    ENSMUST00000044081 ENSMUST00000044081     Papd7 ENSMUSG00000034575 protein_coding  utr  chr13-69497959-69499915
    ENSMUST00000084628 ENSMUST00000084628    Hs3st2 ENSMUSG00000046321 protein_coding  cds chr7-121392730-121393214
    ENSMUST00000161979 ENSMUST00000161979    Kcnab1 ENSMUSG00000027827 protein_coding  gap   chr3-65261847-65266490
    ENSMUST00000111627 ENSMUST00000111627      Mobp ENSMUSG00000032517 protein_coding  utr chr9-120173191-120176091
    ENSMUST00000034339 ENSMUST00000034339      Cdh5 ENSMUSG00000031871 protein_coding  gap chr8-104126660-104128051
    ENSMUST00000134253 ENSMUST00000134253    Nmnat3 ENSMUSG00000032456 protein_coding  gap   chr9-98354165-98399455
                                   query_region distance
    ENSMUST00000044081  chr13-69329933-69331707   166251
    ENSMUST00000084628 chr7-121391215-121395519        0
    ENSMUST00000161979   chr3-65262154-65263636        0
    ENSMUST00000111627 chr9-120174034-120175458        0
    ENSMUST00000034339 chr8-104126858-104127953        0
    ENSMUST00000134253   chr9-98387463-98389145        0
    

    Plotting genomic regions画基因组区域

    我们还可以使用CoveragePlot()函数为任何基因组区域创建按聚类、细胞类型或对象中存储的任何其他元数据分组的覆盖率图。这些代表了pseudo-bulk可接近性track,其中信号是一个组内所有细胞的平均,以可视化DNA可接近性。(实际上和IGV里的可视化是一样的)

    > levels(brain) <- c("L2/3 IT","L4","L5 IT","L5 PT","L6 CT", "L6 IT","NP","Sst","Pvalb","Vip","Lamp5","Meis2","Oligo","Astro","Endo","VLMC","Macrophage")
    #查看Neurod6基因附近的peak
    > CoveragePlot(
      object = brain,
      region = c("Neurod6"),
      extend.upstream = 1000,
      extend.downstream = 1000
    )
    
    #查看Gad2基因附近的peak
    > CoveragePlot(
      object = brain,
      region = c("Gad2"),
      extend.upstream = 1000,
      extend.downstream = 1000
    )
    

    你也可以同时画两个基因的peak情况:

    > CoveragePlot(
      object = brain,
      region = c("Neurod6", "Gad2"),
      extend.upstream = 1000,
      extend.downstream = 1000,
      ncol = 1
    )
    

    相关文章

      网友评论

          本文标题:scATAC-seq数据分析练习(成年小鼠大脑组织)

          本文链接:https://www.haomeiwen.com/subject/locdyktx.html