最近在做 ATAC-Seq 分析,得到实验组和对照组 peaksets (本文称为 peaks1 和 peaks2) 后想合并得到 consensus peaksets. 用 bedtools intersect 做可以设置重合比例参数,但它无法从 peaks1 和 peaks2 计算并集为新 peaksets 然后输出。方法二是自己写 python 代码,但我觉得 GenomicRanges 可以实现,没必要重复造轮子。
实现过程大致如下:先 findOverlaps
计算两个 peaksets 重合索引,然后 pintersect
得到 peaksets 重合区域,并计算每个重合区域长度,分别除以原 peak 区域长度得到重合比例。根据自定的阈值(案例是任意比例大于 0.5)选择保留的重合,最后用 punion
合并。
library(GenomicRanges)
overlap_index <- findOverlaps(peaks1, peaks2)
# 重合区域及长度(bp)
overlaps <- pintersect(peaks1[queryHits(overlap_index)], peaks2[subjectHits(overlap_index)])
overlap_width <- width(overlaps)
# 重合比例
overlap_percent1 <- overlap_width / width(peaks1[queryHits(overlap_index)])
overlap_percent2 <- overlap_width / width(peaks2[subjectHits(overlap_index)])
# 按阈值筛选
keep_overlaps <- (overlap_percent1 > 0.5) | (overlap_percent2 > 0.5)
keep_index <- overlap_index[keep_overlaps]
merged_peaks <- punion(peaks1[queryHits(keep_index)], peaks2[subjectHits(keep_index)])
查看合并结果
> merged_peaks[1:3]
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 9959-10557 *
[2] chr1 11142-11518 *
[3] chr1 13231-13986 *
-------
seqinfo: 22 sequences from an unspecified genome; no seqlengths
选取一条在两个 peaks 里重合比例都不为 1 的记录做检查,发现正确合并了两个 peak 区域。
> c(peaks1[3], peaks2[3], merged_peaks[3])
GRanges object with 3 ranges and 6 metadata columns:
seqnames ranges strand | name score signalValue pValue qValue peak
<Rle> <IRanges> <Rle> | <character> <numeric> <numeric> <numeric> <numeric> <integer>
[1] chr1 13231-13959 * | peak_2 1000 1910.682 10.12469 7.41840 290
[2] chr1 13314-13986 * | peak_2 1000 971.596 8.91901 5.75845 211
[3] chr1 13231-13986 * | <NA> NA NA NA NA <NA>
-------
seqinfo: 22 sequences from an unspecified genome; no seqlengths
网友评论