美文网首页scATAC-seqscATAC-seq学习资料scRNA
使用SnapATAC分析单细胞ATAC-seq数据(二):10X

使用SnapATAC分析单细胞ATAC-seq数据(二):10X

作者: Davey1220 | 来源:发表于2020-06-22 15:49 被阅读0次
    image.png

    10X Adult Mouse Brain

    在本示例中,我们将分析来自10X genomics平台测序产生的5000个成年小鼠大脑细胞的数据集。该示例数据可以从以下网址进行下载:http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X/

    分析内容

    • Step 0. Data download
    • Step 1. Barcode selection
    • Step 2. Add cell-by-bin matrix
    • Step 3. Matrix binarization
    • Step 4. Bin filtering
    • Step 5. Dimensionality reduction
    • Step 6. Determine significant components
    • Step 7. Graph-based clustering
    • Step 8. Visualization
    • Step 9. Gene based annotation
    • Step 10. Heretical clustering
    • Step 11. Identify peak
    • Step 12. Create a cell-by-peak matrix
    • Step 13. Add cell-by-peak matrix
    • Step 14. Identify differentially accessible regions
    • Step 15. Motif analysis
    • Step 16. GREAT analysis

    Step 0. Data download

    在本示例中,我们将直接下载所需的snap文件,该文件中已经包含了cell-by-bin/cell-by-peak matrix的计数矩阵。

    wget http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X/atac_v1_adult_brain_fresh_5k.snap
    wget http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X/atac_v1_adult_brain_fresh_5k_singlecell.csv
    

    Step 1. Barcode selection

    首先,我们对数据进行初步的过滤,基于以下标准选择出高质量barcodes的细胞: 1) number of unique fragments; 2) fragments in promoter ratio;

    # 加载SnapATAC包
    library(SnapATAC);
    # 使用createSnap函数构建snap对象
    x.sp = createSnap(
        file="atac_v1_adult_brain_fresh_5k.snap",
        sample="atac_v1_adult_brain_fresh_5k",
        num.cores=1
      );
    # 读取barcode信息
    barcodes = read.csv(
        "atac_v1_adult_brain_fresh_5k_singlecell.csv",
        head=TRUE
      );
    barcodes = barcodes[2:nrow(barcodes),];
    
    # 计算比对到promoter区域的比率
    promoter_ratio = (barcodes$promoter_region_fragments+1) / (barcodes$passed_filters + 1);
    UMI = log(barcodes$passed_filters+1, 10);
    data = data.frame(UMI=UMI, promoter_ratio=promoter_ratio);
    barcodes$promoter_ratio = promoter_ratio;
    
    library(viridisLite);
    library(ggplot2);
    p1 = ggplot(
        data, 
        aes(x= UMI, y= promoter_ratio)) + 
        geom_point(size=0.1, col="grey") +
        theme_classic() +
        ggtitle("10X Fresh Adult Brain") +
        ylim(0, 1) + xlim(0, 6) +
        labs(x = "log10(UMI)", y="promoter ratio") 
    p1 
    
    # 根据条件筛选barcode
    barcodes.sel = barcodes[which(UMI >= 3 & UMI <= 5 & promoter_ratio >= 0.15 & promoter_ratio <= 0.6),];
    rownames(barcodes.sel) = barcodes.sel$barcode;
    x.sp = x.sp[which(x.sp@barcode %in% barcodes.sel$barcode),];
    x.sp@metaData = barcodes.sel[x.sp@barcode,];
    x.sp
    ## number of barcodes: 4100
    ## number of bins: 0
    ## number of genes: 0
    ## number of peaks: 0
    ## number of motifs: 0
    
    image.png

    Step 2. Add cell-by-bin matrix

    接下来,我们将5kb分辨率的cell-by-bin计数矩阵添加到snap对象中。addBmatToSnap函数将自动读取cell-by-bin计数矩阵,并将其添加到snap对象的bmat slot 中。

    # show what bin sizes exist in atac_v1_adult_brain_fresh_5k.snap file
    # 使用showBinSizes函数查看snap文件中的bin size信息
    showBinSizes("atac_v1_adult_brain_fresh_5k.snap");
    [1] 1000 5000 10000
    
    # 使用addBmatToSnap函数添加cell-by-bin计数矩阵
    x.sp = addBmatToSnap(x.sp, bin.size=5000, num.cores=1);
    

    Step 3. Matrix binarization

    接下来,我们将cell-by-bin的计数矩阵转换为二进制矩阵。计数矩阵中的某些items具有异常高的覆盖率,这可能是由比对错误造成的。因此,我们会将计数矩阵中覆盖率最高的0.1%的items进行删除,然后将其余的非零的items转换为1。

    # 使用makeBinary函数将计数矩阵转换为二进制矩阵
    x.sp = makeBinary(x.sp, mat="bmat");
    

    Step 4. Bin filtering

    首先,我们将过滤掉与ENCODE blacklist区域重叠的bins,避免潜在的人为因素产生的误差。

    system("wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz");
    
    library(GenomicRanges);
    black_list = read.table("mm10.blacklist.bed.gz");
    black_list.gr = GRanges(
        black_list[,1], 
        IRanges(black_list[,2], black_list[,3])
      );
    
    idy = queryHits(findOverlaps(x.sp@feature, black_list.gr));
    if(length(idy) > 0){x.sp = x.sp[,-idy, mat="bmat"]};
    x.sp
    ## number of barcodes: 4100
    ## number of bins: 546103
    ## number of genes: 0
    ## number of peaks: 0
    ## number of motifs: 0
    

    接下来,我们将移除那些不需要的染色体上的信息。

    chr.exclude = seqlevels(x.sp@feature)[grep("random|chrM", seqlevels(x.sp@feature))];
    
    idy = grep(paste(chr.exclude, collapse="|"), x.sp@feature);
    if(length(idy) > 0){x.sp = x.sp[,-idy, mat="bmat"]};
    x.sp
    ## number of barcodes: 4100
    ## number of bins: 545183
    ## number of genes: 0
    ## number of peaks: 0
    ## number of motifs: 0
    

    最后,bin的覆盖率大致服从对数正态分布(log normal distribution),我们会将与invariant features(如管家基因的启动子)区域重叠的前5%的bins进行删除。

    bin.cov = log10(Matrix::colSums(x.sp@bmat)+1);
    hist(
        bin.cov[bin.cov > 0], 
        xlab="log10(bin cov)", 
        main="log10(Bin Cov)", 
        col="lightblue", 
        xlim=c(0, 5)
      );
    bin.cutoff = quantile(bin.cov[bin.cov > 0], 0.95);
    idy = which(bin.cov <= bin.cutoff & bin.cov > 0);
    x.sp = x.sp[, idy, mat="bmat"];
    x.sp
    ## number of barcodes: 4100
    ## number of bins: 474624
    ## number of genes: 0
    ## number of peaks: 0
    ## number of motifs: 0
    
    image.png

    Step 5. Dimensionality reduction

    使用diffusion maps的方法进行数据降维

    x.sp = runDiffusionMaps(
        obj=x.sp,
        input.mat="bmat", 
        num.eigs=50
      );
    

    Step 6. Determine significant components

    接下来,我们基于数据降维的结果确定用于下游分析的维数。我们绘制不同维数之间的配对散点图,并选择散点图开始看起来零散的维数。在下面的示例中,我们选择了前20个维度。

    plotDimReductPW(
        obj=x.sp, 
        eigs.dims=1:50,
        point.size=0.3,
        point.color="grey",
        point.shape=19,
        point.alpha=0.6,
        down.sample=5000,
        pdf.file.name=NULL, 
        pdf.height=7, 
        pdf.width=7
      );
    
    image.png

    Step 7. Graph-based clustering

    选择好有效的降维维度后,我们基于它们来构造一个K近邻(KNN)的聚类图(K =15)。在该图中,每个点代表一个细胞,并根据欧氏距离确定每个细胞的k近邻个点。

    x.sp = runKNN(
        obj=x.sp,
        eigs.dims=1:20,
        k=15
      );
    
    x.sp=runCluster(
        obj=x.sp,
        tmp.folder=tempdir(),
        louvain.lib="R-igraph",
        seed.use=10
      );
    
    x.sp@metaData$cluster = x.sp@cluster;
    

    Step 8. Visualization

    SnapATAC可以使用tSNE(FI-tsne)或UMAP的方法对降维聚类后的结果进行可视化展示。在此例中,我们计算t-SNE embedding,使用tSNE方法进行可视化展示。同时,我们还将测序深度或其他偏差投射到t-SNE embedding上。

    x.sp = runViz(
        obj=x.sp, 
        tmp.folder=tempdir(),
        dims=2,
        eigs.dims=1:20, 
        method="Rtsne",
        seed.use=10
      );
    
    par(mfrow = c(2, 2));
    plotViz(
        obj=x.sp,
        method="tsne", 
        main="10X Brain Cluster",
        point.color=x.sp@cluster, 
        point.size=1, 
        point.shape=19, 
        point.alpha=0.8, 
        text.add=TRUE,
        text.size=1.5,
        text.color="black",
        text.halo.add=TRUE,
        text.halo.color="white",
        text.halo.width=0.2,
        down.sample=10000,
        legend.add=FALSE
      );
    
    plotFeatureSingle(
        obj=x.sp,
        feature.value=log(x.sp@metaData[,"passed_filters"]+1,10),
        method="tsne", 
        main="10X Brain Read Depth",
        point.size=0.2, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99)
      ); 
    
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@metaData$peak_region_fragments / x.sp@metaData$passed_filters,
        method="tsne", 
        main="10X Brain FRiP",
        point.size=0.2, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99) # remove outliers
      );
    
    plotFeatureSingle(
        obj=x.sp,
        feature.value=x.sp@metaData$duplicate / x.sp@metaData$total,
        method="tsne", 
        main="10X Brain Duplicate",
        point.size=0.2, 
        point.shape=19, 
        down.sample=10000,
        quantiles=c(0.01, 0.99) # remove outliers
      );
    
    image.png

    Step 9. Gene based annotation

    为了帮助注释聚类后分群的细胞簇,SnapATAC接下来将创建cell-by-gene计数矩阵,并可视化marker基因的富集情况,根据marker基因的表达情况进行cluster的注释。

    system("wget http://renlab.sdsc.edu/r3fang/share/github/Mouse_Brain_10X/gencode.vM16.gene.bed");
    genes = read.table("gencode.vM16.gene.bed");
    genes.gr = GRanges(genes[,1], 
        IRanges(genes[,2], genes[,3]), name=genes[,4]
      );
    
    marker.genes = c(
        "Snap25", "Gad2", "Apoe",
        "C1qb", "Pvalb", "Vip", 
        "Sst", "Lamp5", "Slc17a7"
      );
    
    genes.sel.gr <- genes.gr[which(genes.gr$name %in% marker.genes)];
    
    # re-add the cell-by-bin matrix to the snap object;
    x.sp = addBmatToSnap(x.sp);
    x.sp = createGmatFromMat(
        obj=x.sp, 
        input.mat="bmat",
        genes=genes.sel.gr,
        do.par=TRUE,
        num.cores=10
      );
    
    # normalize the cell-by-gene matrix
    x.sp = scaleCountMatrix(
        obj=x.sp, 
        cov=x.sp@metaData$passed_filters + 1,
        mat="gmat",
        method = "RPM"
      );
    
    # smooth the cell-by-gene matrix
    x.sp = runMagic(
        obj=x.sp,
        input.mat="gmat",
        step.size=3
      );
    
    par(mfrow = c(3, 3));
    for(i in 1:9){
        plotFeatureSingle(
            obj=x.sp,
            feature.value=x.sp@gmat[, marker.genes[i]],
            method="tsne", 
            main=marker.genes[i],
            point.size=0.1, 
            point.shape=19, 
            down.sample=10000,
            quantiles=c(0, 1)
      )};
    
    images

    Step 10. Heretical clustering

    接下来,我们将属于同一cluster的细胞汇集到一起,用以创建每个cluster的聚合信号(aggregate signal)。

    # calculate the ensemble signals for each cluster
    ensemble.ls = lapply(split(seq(length(x.sp@cluster)), x.sp@cluster), function(x){
        SnapATAC::colMeans(x.sp[x,], mat="bmat");
        })
    
    # cluster using 1-cor as distance  
    hc = hclust(as.dist(1 - cor(t(do.call(rbind, ensemble.ls)))), method="ward.D2");
    plotViz(
        obj=x.sp,
        method="tsne", 
        main="10X Brain Cluster",
        point.color=x.sp@cluster, 
        point.size=1, 
        point.shape=19, 
        point.alpha=0.8, 
        text.add=TRUE,
        text.size=1.5,
        text.color="black",
        text.halo.add=TRUE,
        text.halo.color="white",
        text.halo.width=0.2,
        down.sample=10000,
        legend.add=FALSE
        );
    plot(hc, hang=-1, xlab="");
    

    在本示例中,cluster 20到25是兴奋性神经元细胞,cluster 19到5为抑制性神经元细胞,而其余的为非神经元细胞。


    image.png

    Step 11. Identify peaks

    接下来,我们将每个cluster群的细胞信息聚合起来,创建一个用于peak calling和可视化的集成track。在该步骤中,将生成一个.narrowPeak的文件,其中包含识别出的所有peak的信息,和一个.bedGraph文件,可以用于可视化展示。为了获得最robust的结果,我们不建议对细胞数目小于100的cluster执行此步骤。

    # 查看snaptools的安装路径
    system("which snaptools")
    /home/r3fang/anaconda2/bin/snaptools
    # 查看macs2的安装路径
    system("which macs2")
    /home/r3fang/anaconda2/bin/macs2
    
    # 调用macs2进行peak callling
    runMACS(
        obj=x.sp[which(x.sp@cluster==1),], 
        output.prefix="atac_v1_adult_brain_fresh_5k.1",
        path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
        path.to.macs="/home/r3fang/anaconda2/bin/macs2",
        gsize="mm", 
        buffer.size=500, 
        num.cores=5,
        macs.options="--nomodel --shift 37 --ext 73 --qval 1e-2 -B --SPMR --call-summits",
        tmp.folder=tempdir()
        );
    

    接下来,我们将提供一个简短的脚本,用于为所有的cluster进行批量操作此步骤。

    # call peaks for all cluster with more than 100 cells
    clusters.sel = names(table(x.sp@cluster))[which(table(x.sp@cluster) > 200)];
    
    peaks.ls = mclapply(seq(clusters.sel), function(i){
        print(clusters.sel[i]);
        runMACS(
            obj=x.sp[which(x.sp@cluster==clusters.sel[i]),], 
            output.prefix=paste0("atac_v1_adult_brain_fresh_5k.", gsub(" ", "_", clusters.sel)[i]),
            path.to.snaptools="/home/r3fang/anaconda2/bin/snaptools",
            path.to.macs="/home/r3fang/anaconda2/bin/macs2",
            gsize="hs", # mm, hs, etc
            buffer.size=500, 
            num.cores=1,
            macs.options="--nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR",
            tmp.folder=tempdir()
       );
     }, mc.cores=5);
    
    # assuming all .narrowPeak files in the current folder are generated from the clusters
    peaks.names = system("ls | grep narrowPeak", intern=TRUE);
    
    peak.gr.ls = lapply(peaks.names, function(x){
        peak.df = read.table(x)
        GRanges(peak.df[,1], IRanges(peak.df[,2], peak.df[,3]))
      })
    
    # 合并所有的peak信息
    peak.gr = reduce(Reduce(c, peak.gr.ls));
    peak.gr
    ## GRanges object with 242847 ranges and 0 metadata columns:
    ##           seqnames               ranges strand
    ##              <Rle>            <IRanges>  <Rle>
    ##       [1]     chr1   [3094889, 3095629]      *
    ##       [2]     chr1   [3113499, 3114060]      *
    ##       [3]     chr1   [3118103, 3118401]      *
    ##       [4]     chr1   [3119689, 3120845]      *
    ##       [5]     chr1   [3121534, 3121786]      *
    ##       ...      ...                  ...    ...
    ##  [242843]     chrY [90797373, 90798136]      *
    ##  [242844]     chrY [90804709, 90805456]      *
    ##  [242845]     chrY [90808580, 90808819]      *
    ##  [242846]     chrY [90808850, 90809131]      *
    ##  [242847]     chrY [90810817, 90811057]      *
    ##  -------
    

    Step 12. Create a cell-by-peak matrix

    接下来,我们基于合并后的peak信息作为参考,使用原始的snap文件创建一个 cell-by-peak的计数矩阵。

    peaks.df = as.data.frame(peak.gr)[,1:3];
    write.table(peaks.df,file = "peaks.combined.bed",append=FALSE,
            quote= FALSE,sep="\t", eol = "\n", na = "NA", dec = ".", 
            row.names = FALSE, col.names = FALSE, qmethod = c("escape", "double"),
            fileEncoding = "")
    saveRDS(x.sp, file="atac_v1_adult_brain_fresh_5k.snap.rds");
    

    我们使用snaptools创建cell-by-peak的计数矩阵,并将其添加到snap文件中,这一步可能需要一段时间。

    snaptools snap-add-pmat \
        --snap-file atac_v1_adult_brain_fresh_5k.snap \
        --peak-file peaks.combined.bed
    

    Step 13. Add cell-by-peak matrix

    接下来,我们将计算好的cell-by-peak计数矩阵添加到现有的snap对象中。

    x.sp = readRDS("atac_v1_adult_brain_fresh_5k.snap.rds");
    x.sp = addPmatToSnap(x.sp);
    x.sp = makeBinary(x.sp, mat="pmat");
    x.sp
    ## number of barcodes: 4100
    ## number of bins: 546206
    ## number of genes: 16
    ## number of peaks: 242847
    ## number of motifs: 0
    

    Step 14. Identify differentially accessible regions

    SnapATAC通过差异分析来识别出不同cluster群中的差异可及性区域( differentially accessible regions,DARs)。默认情况下,它只寻找每个cluster群中的positive peaks(可以通过cluster.pos参数指定), 与一组阴性对照细胞相比。如果默认的cluster.neg=NULL, findDAR函数将寻找最接近positive细胞的一组作为背景细胞。

    DARs = findDAR(
        obj=x.sp,
        input.mat="pmat",
        cluster.pos=26,
        cluster.neg.method="knn",
        test.method="exactTest",
        bcv=0.1, #0.4 for human, 0.1 for mouse
        seed.use=10
      );
    DARs$FDR = p.adjust(DARs$PValue, method="BH");
    idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);
    par(mfrow = c(1, 2));
    plot(DARs$logCPM, DARs$logFC, 
        pch=19, cex=0.1, col="grey", 
        ylab="logFC", xlab="logCPM",
        main="Cluster 26"
      );
    
    points(DARs$logCPM[idy], 
        DARs$logFC[idy], 
        pch=19, 
        cex=0.5, 
        col="red"
      );
    
    abline(h = 0, lwd=1, lty=2);
    covs = Matrix::rowSums(x.sp@pmat);
    vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
    vals.zscore = (vals - mean(vals)) / sd(vals);
    plotFeatureSingle(
        obj=x.sp,
        feature.value=vals.zscore,
        method="tsne", 
        main="Cluster 26",
        point.size=0.1, 
        point.shape=19, 
        down.sample=5000,
        quantiles=c(0.01, 0.99)
      );
    
    image.png

    接下来,我们识别出每个cluster群中的DARs。对于缺乏揭示DARs的静态能力(static power)的簇,特别是比较小的簇,我们根据peak的富集程度对其进行排序,并使用top 2000个peak用于motif discovery的代表性峰。

    idy.ls = lapply(levels(x.sp@cluster), function(cluster_i){
        DARs = findDAR(
            obj=x.sp,
            input.mat="pmat",
            cluster.pos=cluster_i,
            cluster.neg=NULL,
            cluster.neg.method="knn",
            bcv=0.1,
            test.method="exactTest",
            seed.use=10
            );
        DARs$FDR = p.adjust(DARs$PValue, method="BH");
        idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);
        if((x=length(idy)) < 2000L){
                PValues = DARs$PValue;
                PValues[DARs$logFC < 0] = 1;
                idy = order(PValues, decreasing=FALSE)[1:2000];
                rm(PValues); # free memory
        }
        idy
      })
    names(idy.ls) = levels(x.sp@cluster);
    par(mfrow = c(3, 3));
    for(cluster_i in levels(x.sp@cluster)){
        print(cluster_i)
        idy = idy.ls[[cluster_i]];
        vals = Matrix::rowSums(x.sp@pmat[,idy]) / covs;
        vals.zscore = (vals - mean(vals)) / sd(vals);
        plotFeatureSingle(
            obj=x.sp,
            feature.value=vals.zscore,
            method="tsne", 
            main=cluster_i,
            point.size=0.1, 
            point.shape=19, 
            down.sample=5000,
            quantiles=c(0.01, 0.99)
            );
      }
    
    image.png

    Step 15. Motif analysis identifies master regulators

    SnapATAC可以调用Homer来鉴定识别出的差异可及性区域(DARs)中富集的master regulators。运行完runHomer函数后,会在./homer/C5文件夹中生成一个homer motif的报告knownResults.html。我们需要预先安装好Homer程序。

    system("which findMotifsGenome.pl");
    /projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl
    
    motifs = runHomer(
        x.sp[,idy.ls[["5"]],"pmat"], 
        mat = "pmat",
        path.to.homer = "/projects/ps-renlab/r3fang/public_html/softwares/homer/bin/findMotifsGenome.pl",
        result.dir = "./homer/C5",
        num.cores=5,
        genome = 'mm10',
        motif.length = 10,
        scan.size = 300,
        optimize.count = 2,
        background = 'automatic',
        local.background = FALSE,
        only.known = TRUE,
        only.denovo = FALSE,
        fdr.num = 5,
        cache = 100,
        overwrite = TRUE,
        keep.minimal = FALSE
      );
    
    image.png

    SnapATAC还可以调用chromVAR(Schep等)来进行motif的可变性分析。

    library(chromVAR);
    library(motifmatchr);
    library(SummarizedExperiment);
    library(BSgenome.Mmusculus.UCSC.mm10);
    
    x.sp = makeBinary(x.sp, "pmat");
    x.sp@mmat = runChromVAR(
        obj=x.sp,
        input.mat="pmat",
        genome=BSgenome.Mmusculus.UCSC.mm10,
        min.count=10,
        species="Homo sapiens"
      );
    
    motif_i = "MA0497.1_MEF2C";
    dat = data.frame(x=x.sp@metaData[,"cluster"], y=x.sp@mmat[,motif_i]);
    
    p1 <- ggplot(dat, aes(x=x, y=y, fill=x)) + 
        theme_classic() +
        geom_violin() + 
        xlab("cluster") +
        ylab("motif enrichment") + 
        ggtitle(motif_i) +
        theme(
              plot.margin = margin(5,1,5,1, "cm"),
              axis.text.x = element_text(angle = 90, hjust = 1),
              axis.ticks.x=element_blank(),
              legend.position = "none"
       );
    
    motif_i = "MA0660.1_MEF2B";
    dat = data.frame(x=x.sp@metaData[,"cluster"], y=x.sp@mmat[,motif_i]);
    
    p2 <- ggplot(dat, aes(x=x, y=y, fill=x)) + 
        theme_classic() +
        geom_violin() + 
        xlab("cluster") +
        ylab("motif enrichment") + 
        ggtitle(motif_i) +
        theme(
              plot.margin = margin(5,1,5,1, "cm"),
              axis.text.x = element_text(angle = 90, hjust = 1),
              axis.ticks.x=element_blank(),
              legend.position = "none"
       );
    
    p1
    p2
    
    image.png
    image.png

    Step 16. GREAT analysis

    SnapATAC还可以使用GREAT来识别每个细胞cluster中活跃的生物学通路。在本示例中,我们将首先识别小胶质细胞(cluster 13)中的差异可及性区域DARs,并展示使用GREAT analysis识别出的top 6个GO term的富集情况。

    ## install R package rGREAT
    if (!requireNamespace("BiocManager", quietly=TRUE))
        install.packages("BiocManager")
    BiocManager::install("rGREAT")
    ## or install the latest version
    library(devtools)
    install_github("jokergoo/rGREAT")
    
    library(rGREAT);
    DARs = findDAR(
        obj=x.sp,
        input.mat="pmat",
        cluster.pos=13,
        cluster.neg.method="knn",
        test.method="exactTest",
        bcv=0.1, #0.4 for human, 0.1 for mouse
        seed.use=10
      );
    
    DARs$FDR = p.adjust(DARs$PValue, method="BH");
    idy = which(DARs$FDR < 5e-2 & DARs$logFC > 0);
    
    job = submitGreatJob(
        gr                    = x.sp@peak[idy],
        bg                    = NULL,
        species               = "mm10",
        includeCuratedRegDoms = TRUE,
        rule                  = "basalPlusExt",
        adv_upstream          = 5.0,
        adv_downstream        = 1.0,
        adv_span              = 1000.0,
        adv_twoDistance       = 1000.0,
        adv_oneDistance       = 1000.0,
        request_interval = 300,
        max_tries = 10,
        version = "default",
        base_url = "http://great.stanford.edu/public/cgi-bin"
      );
    
    job
    ## Submit time: 2019-09-04 14:14:02
    ## Version: default
    ## Species: mm10
    ## Inputs: 25120 regions
    ## Background: wholeGenome
    ## Model: Basal plus extension
    ##   Proximal: 5 kb upstream, 1 kb downstream,
    ##   plus Distal: up to 1000 kb
    ## Include curated regulatory domains
    ## 
    ## Enrichment tables for following ontologies have been downloaded:
    ##   None
    

    运行完所提交的job后,我们可以提取GREAT分析的结果。第一个主要的分析结果是一个富集统计的信息表。默认情况下,它将检索来自三种GO(包括MF,BP,CC)terms的结果和pathway的注释结果。所有的表中都包含所有terms 的统计信息,无论它们是否具有显著性。并且,用户可以通过自定义的cutoff 进行筛选。

    tb = getEnrichmentTables(job);
    names(tb);
    ## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"
    
    GBP = tb[["GO Biological Process"]];
    head(GBP[order(GBP[,"Binom_Adjp_BH"]),1:5]);
    ##           ID                                      name Binom_Genome_Fraction
    ## 1 GO:0002376                     immune system process            0.12515840
    ## 2 GO:0002682       regulation of immune system process            0.09012561
    ## 3 GO:0009987                          cellular process            0.80870120
    ## 4 GO:0048518 positive regulation of biological process            0.43002240
    ## 5 GO:0050789          regulation of biological process            0.68873070
    ## 6 GO:0050794            regulation of cellular process            0.66837300
    ##   Binom_Expected Binom_Observed_Region_Hits
    ## 1       3095.918                       5592
    ## 2       2229.347                       4148
    ## 3      20004.030                      22241
    ## 4      10637.030                      13697
    ## 5      17036.440                      19871
    ## 6      16532.870                      19356
    

    参考来源:https://gitee.com/booew/SnapATAC/blob/master/examples/10X_brain_5k/README.md

    相关文章

      网友评论

        本文标题:使用SnapATAC分析单细胞ATAC-seq数据(二):10X

        本文链接:https://www.haomeiwen.com/subject/bzrxfktx.html