美文网首页chip seq
ChIPseeker | 详解annotatePeak的三种注释

ChIPseeker | 详解annotatePeak的三种注释

作者: 生信云笔记 | 来源:发表于2023-12-16 13:51 被阅读0次

序 言

  ChIPseeker用来注释peak确实很方便,也内置了不少的可视化函数可以用来展示结果。ChIPseeker基于annotatePeak函数做注释虽然使用起来很简单,但在其内部却存在着三种独立的注释方式,分别为nearest gene annotationgenomic annotationflankDistance。下面咱们就来具体看看这三种注释方式分别是什么,以及哪些参数可以控制其行为。

nearest gene annotation

  该注释与其他软件理念一致,就是根据距离寻找与peak的最近基因,距离的计算依据是TSS,即注释结果中的distanceToTSS列。该注释方式其实是由内部的getNearestFeatureIndicesAndDistances函数完成,涉及到的参数如下:

Description:

     get index of features that closest to peak and calculate distance

Usage:

     getNearestFeatureIndicesAndDistances(
       peaks,
       features,
       sameStrand = FALSE,
       ignoreOverlap = FALSE,
       ignoreUpstream = FALSE,
       ignoreDownstream = FALSE,
       overlap = "TSS"
     )

  annotatePeak函数注释时会调用getNearestFeatureIndicesAndDistances函数,所以annotatePeak里面那些同名的参数会被getNearestFeatureIndicesAndDistances使用。也就是说,这些参数会影响nearest gene annotation的注释结果。这些参数在软件里面都有相应的解释,这里就不再赘述了。不过,ignoreOverlapoverlap这两个参数有点不好理解,为了搞清楚其对结果会有什么影响,就去扒拉了一下源码,其中涉及到这两参数的代码如下:

if (!ignoreOverlap && overlap == "all") {
    overlap_hit <- findOverlaps(peaks, unstrand(features))
}
features <- resize(features, width = 1)
... ...
... ...
if (!ignoreOverlap) {
    if (overlap == "all") {
        hit <- overlap_hit
        if (length(hit) != 0) {
            qh <- queryHits(hit)
            hit.idx <- getFirstHitIndex(qh)
            hit <- hit[hit.idx]
            peakIdx <- queryHits(hit)
            featureIdx <- subjectHits(hit)
            index[peakIdx] <- featureIdx
            distance_both_end <- data.frame(start = start(peaks[peakIdx]) -
              start(features[featureIdx]), end = end(peaks[peakIdx]) -
              start(features[featureIdx]))
            distance_idx <- apply(distance_both_end, 1, function(i) which.min(abs(i)))
            distance_minimal <- distance_both_end[, 1]
            distance_minimal[distance_idx == 2] <- distance_both_end[distance_idx ==
              2, 2]
            distanceToTSS[peakIdx] <- distance_minimal *
              ifelse(strand(features[featureIdx]) == "+",
                1, -1)
        }
    }
    hit <- findOverlaps(peaks, unstrand(features))
    if (length(hit) != 0) {
        qh <- queryHits(hit)
        hit.idx <- getFirstHitIndex(qh)
        hit <- hit[hit.idx]
        peakIdx <- queryHits(hit)
        featureIdx <- subjectHits(hit)
        index[peakIdx] <- featureIdx
        distanceToTSS[peakIdx] <- 0
    }
}

  从上面的代码可以看出,overlap参数受ignoreOverlap影响,如果ignoreOverlap设定为true,无论overlap设定为TSS或是all结果都一样。ignoreOverlap默认值为false,从上面的代码可以看出overlap设定为TSS或者all改变了peak寻找overlap区域的规则,设为TSS时是以转录起始位点为依据,而设为all是以整个基因区域为依据,故此时改变overlap的参数会影响注释的结果。注释结果中geneChrgeneStartgeneEndgeneLengthgeneStrandgeneIdtranscriptIddistanceToTSSENSEMBLSYMBOLGENENAME这些列都属于该注释方式,其中最后三列取决于有没有提供annoDb参数。

genomic annotation

  该注释方式为直接寻找与peak有overlap的基因组区域,即基因组内定义的不同功能区域,分别为Promoter5UTR3UTRExonIntronDownstreamIntergenic。这个注释方式是由内部的getGenomicAnnotation函数完成,涉及到的参数如下:

Description:

     get Genomic Annotation of peaks

Usage:

     getGenomicAnnotation(
       peaks,
       distance,
       tssRegion = c(-3000, 3000),
       TxDb,
       level,
       genomicAnnotationPriority,
       sameStrand = FALSE
     )

  distance参数是由nearest gene annotation计算出的peakTSS距离,这个参数可以忽略。这里的参数都很容易理解,就不再解释了。annotatePeak函数里面的assignGenomicAnnotation参数直接控制了是否做这个注释,默认包含这个结果,设置为false结果中就没有annotation这一列了。如下所示:

peakanno <- annotatePeak(peak, TxDb=txdb, assignGenomicAnnotation=F, verbose=F)

head(as.data.frame(peakanno))
  seqnames  start    end width strand                V4  V5 V6       V7
1     chr1 177887 178346   460      * macs2/spi1_peak_1  63  .  5.02370
2     chr1 502490 502734   245      * macs2/spi1_peak_2 114  .  8.47362
3     chr1 816135 816585   451      * macs2/spi1_peak_3 687  . 27.23509
4     chr1 817259 817655   397      * macs2/spi1_peak_4  30  .  4.25229
5     chr1 825329 825528   200      * macs2/spi1_peak_5  42  .  4.87237
6     chr1 906748 907307   560      * macs2/spi1_peak_6 113  .  6.11296
        V8       V9 V10 geneChr geneStart geneEnd geneLength geneStrand
1  8.41215  6.35401 169       1    182696  184174       1479          1
2 13.72868 11.49205 159       1    494464  502508       8045          2
3 71.78964 68.76992 193       1    817371  819837       2467          1
4  4.98931  3.09637 270       1    817371  819837       2467          1
5  6.17494  4.21122  51       1    825138  849592      24455          1
6 13.58947 11.36661 157       1    917370  918534       1165          2
     geneId      transcriptId distanceToTSS
1 102725121 ENST00000624431.2         -4350
2 100132287 ENST00000440038.7             0
3    400728 ENST00000326734.2          -786
4    400728 ENST00000326734.2             0
5    643837 ENST00000624927.3           191
6 100130417 ENST00000432961.1         11227

flankDistance

  该注释从peak的角度出发寻找上下游某个范围内的所有基因,由内部的getAllFlankingGene函数完成,下面是该函数的参数定义:

function (peak.gr, features, level = "transcript", distance = 5000)

  从上面的定义可知,用户可以用来控制结果的参数为leveldistance,参数的意义也很好理解。如果需要这个注释,将annotatePeak的参数addFlankGeneInfo设为true即可,结果就会多出flank_txIdsflank_geneIdsflank_gene_distances三列信息。

peakanno <- annotatePeak(peak, TxDb=txdb, addFlankGeneInfo=T, verbose=F)

head(as.data.frame(peakanno),3)
  seqnames  start    end width strand                V4  V5 V6       V7
1     chr1 177887 178346   460      * macs2/spi1_peak_1  63  .  5.02370
2     chr1 502490 502734   245      * macs2/spi1_peak_2 114  .  8.47362
3     chr1 816135 816585   451      * macs2/spi1_peak_3 687  . 27.23509
        V8       V9 V10        annotation geneChr geneStart geneEnd geneLength
1  8.41215  6.35401 169 Distal Intergenic       1    182696  184174       1479
2 13.72868 11.49205 159  Promoter (<=1kb)       1    494464  502508       8045
3 71.78964 68.76992 193  Promoter (<=1kb)       1    817371  819837       2467
  geneStrand    geneId      transcriptId distanceToTSS
1          1 102725121 ENST00000624431.2         -4350
2          2 100132287 ENST00000440038.7             0
3          1    400728 ENST00000326734.2          -786
                                            flank_txIds
1                                     ENST00000624431.2
2 ENST00000440038.7;ENST00000423728.6;ENST00000616311.5
3                                     ENST00000326734.2
                  flank_geneIds flank_gene_distances
1                     102725121                -4350
2 100132287;100132287;100132287        0;-3315;-3514
3                        400728                    0

结 语

  ChIPseeker的三种注释之间相互独立,所以结果就会出现信息不对称的现象,比如annotation注释到的转录本或基因可能与geneIdtranscriptId列不同。ChIPseeker做注释虽然很简单,但其中还是有不少的细节需要注意以便更准确地理解结果。关于ChIPseeker的应用这里并没有详细的介绍,网络上也有不少这方面的帖子可供参考,不过之前也写过一篇个性化注释的帖子,感兴趣的可以戳这里:[个性化ChIPseeker注释peak结果]。以上内容为个人理解,仅供参考。

往期回顾

macs2 | model vs nomodel
有些软件总能以某种方式劝退你 | SpectralTAD 之 TAD calling
生信分析不只是跑个软件 | TADCompare 差异分析要留心
HiC术语图解与分析软件汇总
R语言揭秘 | $符鲜为人知的秘密,避坑预警

相关文章

网友评论

    本文标题:ChIPseeker | 详解annotatePeak的三种注释

    本文链接:https://www.haomeiwen.com/subject/wvhmgdtx.html