美文网首页
GenomicRanges 根据重合比例合并基因区间数据

GenomicRanges 根据重合比例合并基因区间数据

作者: BeeBee生信 | 来源:发表于2022-01-17 20:15 被阅读0次

    最近在做 ATAC-Seq 分析,得到实验组和对照组 peaksets (本文称为 peaks1 和 peaks2) 后想合并得到 consensus peaksets. 用 bedtools intersect 做可以设置重合比例参数,但它无法从 peaks1 和 peaks2 计算并集为新 peaksets 然后输出。方法二是自己写 python 代码,但我觉得 GenomicRanges 可以实现,没必要重复造轮子。

    实现过程大致如下:先 findOverlaps 计算两个 peaksets 重合索引,然后 pintersect 得到 peaksets 重合区域,并计算每个重合区域长度,分别除以原 peak 区域长度得到重合比例。根据自定的阈值(案例是任意比例大于 0.5)选择保留的重合,最后用 punion 合并。

    library(GenomicRanges)
    
    overlap_index <- findOverlaps(peaks1, peaks2)
    # 重合区域及长度(bp)
    overlaps <- pintersect(peaks1[queryHits(overlap_index)], peaks2[subjectHits(overlap_index)])
    overlap_width <- width(overlaps)
    # 重合比例
    overlap_percent1 <- overlap_width / width(peaks1[queryHits(overlap_index)])
    overlap_percent2 <- overlap_width / width(peaks2[subjectHits(overlap_index)])
    
    # 按阈值筛选
    keep_overlaps <- (overlap_percent1 > 0.5) | (overlap_percent2 > 0.5)
    keep_index <- overlap_index[keep_overlaps]
    merged_peaks <- punion(peaks1[queryHits(keep_index)], peaks2[subjectHits(keep_index)])
    

    查看合并结果

    > merged_peaks[1:3]
    GRanges object with 3 ranges and 0 metadata columns:
          seqnames      ranges strand
             <Rle>   <IRanges>  <Rle>
      [1]     chr1  9959-10557      *
      [2]     chr1 11142-11518      *
      [3]     chr1 13231-13986      *
      -------
      seqinfo: 22 sequences from an unspecified genome; no seqlengths
    

    选取一条在两个 peaks 里重合比例都不为 1 的记录做检查,发现正确合并了两个 peak 区域。

    > c(peaks1[3], peaks2[3], merged_peaks[3])
    GRanges object with 3 ranges and 6 metadata columns:
          seqnames      ranges strand |        name     score signalValue    pValue    qValue      peak
             <Rle>   <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric> <integer>
      [1]     chr1 13231-13959      * |      peak_2      1000    1910.682  10.12469   7.41840       290
      [2]     chr1 13314-13986      * |      peak_2      1000     971.596   8.91901   5.75845       211
      [3]     chr1 13231-13986      * |        <NA>        NA          NA        NA        NA      <NA>
      -------
      seqinfo: 22 sequences from an unspecified genome; no seqlengths
    

    相关文章

      网友评论

          本文标题:GenomicRanges 根据重合比例合并基因区间数据

          本文链接:https://www.haomeiwen.com/subject/xtqhhrtx.html