Finding overlapping peaks between replicates
The bedtools intersect
command within bedtools is the one we want to use, since it is able to report back the peaks that are overlapping with respect to a given file (the file designated as “a”).
Note: If you are trying to intersect very large files and are having trouble with excessive memory usage, please presort your data by chromosome and then by start position (e.g., sort -k1,1 -k2,2n in.bed > in.sorted.bed for BED files) and then use the -sorted option. This invokes a memory-efficient algorithm designed for large files. This algorithm has been substantially improved in recent (>=2.18.0) releases
参数
image.pngBy default, if an overlap is found, bedtools intersect reports the shared interval between the two overlapping features.
$ cat A.bed
chr1 10 20
chr1 30 40
$ cat B.bed
chr1 15 20
#shared invervals
$ bedtools intersect -a A.bed -b B.bed
chr1 15 20
The intersect tool evaluates A (file 1) and finds regions that overlap in B (file 2). We will add the -wo which indicates to write the original A (file 1) and B (file 2) entries plus the number of base pairs of overlap between the two features.
Let’s start with the Nanog replicates:
$ bedtools intersect \
-a macs2/Nanog-rep1_peaks.narrowPeak \
-b macs2/Nanog-rep2_peaks.narrowPeak \
-wo > bedtools/Nanog-overlaps.bed
We’ll do the same for the Pou5f1 replicates:
$ bedtools intersect \
-a macs2/Pou5f1-rep1_peaks.narrowPeak \
-b macs2/Pou5f1-rep2_peaks.narrowPeak \
-wo > bedtools/Pou5f1-overlaps.bed
Historical Note: “A simpler heuristic for establishing reproducibility was previously used as a standard for depositing ENCODE data and was in effect when much of the currently available data was submitted. According to this standard, either 80% of the top 40% of the peaks identified from one replicate using an acceptable scoring method should overlap the list of peaks from the other replicate, OR peak lists scored using all available reads from each replicate should share more than 75% of regions in common. As with the current standards, this was developed based on experience with accumulated ENCODE ChIP-seq data, albeit with a much smaller sample size.” [ENCODE Guidelines, Landt et al, 2012]
Ref: https://hbctraining.github.io/Intro-to-ChIPseq/lessons/handling-replicates-bedtools.html
网友评论