To annotate the peaks generated by Macs2, the following codes can be used。
Peak注释需要考虑一下几种情况:
- Gene是在+还是-链上,不同链方向不一致,启动子区间设置时候需要注意,由于基因无论是在正义链还是反义链其注释的坐标都是5‘->3’,但是基因是有方向的,所以+链的promoter与-promoter计算是有区别的(下图a)。
- 注释的适合需要peak与基因overlap的如下三种种情况(下图b);这是后续写注释程序的逻辑所在。
- Gene in peak,这种很好判断。
- Gene 与peak overlap,这种一般考虑与peak的overlap>=50%。
- 没有overlap,gene在peak的上游或者下游(3的对立面是1和2两种情况)。
open PEAK, "$ARGV[0]";
while(<PEAK>){
chomp;
if(/^Chr/){
@temp=split /\t/,$_;
$chr=$temp[0];
$chr=~s/ *//;
$hash_peak{$chr}{"$temp[1]\t$temp[2]\t$temp[3]"}=$temp[5];
}
}
#foreach $chr(keys %hash_peak){
# foreach(keys %{$hash_peak{$chr}}){
# print"$chr\t$_\t$hash_peak{$chr}{$_}\n";
# }
#}
open GFF3,"$ARGV[1]";
while(<GFF3>){
chomp;
if(/^Chr/){
@temp=split /\t/,$_;
$GENE{$temp[2]}=[$temp[0],$temp[3],$temp[4],$temp[5]];
}
}
$gene_in_peak="Gene in peak";
$gene_body="Gene body";
$promoter="Promoter";
my %peak_anno=();
foreach (keys %GENE){
$LOC=$_;
my($chr,$gene_start,$gene_end,$strand)=@{$GENE{$LOC}};
foreach(keys %{$hash_peak{$chr}}){
my($peak_start,$peak_end,$peak_length)=split /\t/,$_;
if($strand eq "+"){
my($a1,$a2,$b1,$b2)=($gene_start,$gene_end,$peak_start,$peak_end); #gene_body
if($a1>=$b1 and $a1<=$b2 and $a2>$b1 and $a2<=$b2){
$peak_anno{$chr}{"$peak_start\t$peak_end"}.="$LOC\t$gene_body\t$gene_in_peak\t";
}
elsif(!($a1>$b2 or $a2<$b1)){
@compare=sort{$a<=>$b}($a1,$a2,$b1,$b2);
if(($compare[2]-$compare[1])/$peak_length>=0.5){
$peak_anno{$chr}{"$peak_start\t$peak_end"}.="$LOC\t$gene_body\t";
}
}
my($a1,$a2,$b1,$b2)=($gene_start-2000,$gene_start,$peak_start,$peak_end); #gene_2k_promoter
if($a1>=$b1 and $a1<=$b2 and $a2>$b1 and $a2<=$b2){
$peak_anno{$chr}{"$peak_start\t$peak_end"}.="$LOC\t$promoter\t";
}
elsif(!($a1>$b2 or $a2<$b1)){
@compare=sort{$a<=>$b}($a1,$a2,$b1,$b2);
if(($compare[2]-$compare[1])/$peak_length>=0.5){
$peak_anno{$chr}{"$peak_start\t$peak_end"}.="$LOC\t$promoter\t";
}
}
}
if($strand eq "-"){
my($a1,$a2,$b1,$b2)=($gene_start,$gene_end,$peak_start,$peak_end); #gene_body
if($a1>=$b1 and $a1<=$b2 and $a2>$b1 and $a2<=$b2){
$peak_anno{$chr}{"$peak_start\t$peak_end"}.="$LOC\t$gene_body\t$gene_in_peak\t";
}
elsif(!($a1>$b2 or $a2<$b1)){
@compare=sort{$a<=>$b}($a1,$a2,$b1,$b2);
if(($compare[2]-$compare[1])/$peak_length>=0.5){
$peak_anno{$chr}{"$peak_start\t$peak_end"}.="$LOC\t$gene_body\t";
}
}
my($a1,$a2,$b1,$b2)=($gene_end+2000,$gene_end,$peak_start,$peak_end); #gene_2k_promoter
if($a1>=$b1 and $a1<=$b2 and $a2>$b1 and $a2<=$b2){
$peak_anno{$chr}{"$peak_start\t$peak_end"}.="$LOC\t$promoter\t";
}
elsif(!($a1>$b2 or $a2<$b1)){
@compare=sort{$a<=>$b}($a1,$a2,$b1,$b2);
if(($compare[2]-$compare[1])/$peak_length>=0.5){
$peak_anno{$chr}{"$peak_start\t$peak_end"}.="$LOC\t$promoter\t";
}
}
}
}
}
open OU,">$ARGV[2]";
foreach $chr (sort keys %peak_anno ){
foreach(keys %{$peak_anno{$chr}}){
print OU "$chr\t$_\t$peak_anno{$chr}{$_}\n";
}
}
运行:perl new_5.pl WT_K5_rep1_peaks.xls GENE_FF3 Peak_anno_v3.txt
Prepare the GENE coordinates filels for all.gff3 files.
产生GENE_FF3
awk -F "\t" '$3=="gene"{print $0}' all.gff3 |awk -F ";" '{print $1}' |sed s/ID=// |awk '{print $1,$3,$9,$4,$5,$7}' OFS="\t" > GENE_FF3
To test whether the perl script was correct or not, we used the bedtools to annotate the same peak file and did the comparasions.
Prepare the the bedfile for bedtools. As the bedfile was 0-based and half-closed-half-open interval, thus the left coordinate should be 1 bp less ($4-1).
产生GENE_FF3.bed file
awk '{print $1,$4-1,$5,$2,$3}' GENE_FF3 >GENE_FF3.bed
bedtools进行注释(注意要用macs产生的peak 文件):
bedtools intersect -f 0.5 -a WT_K5_rep1_peaks.narrowPeak -b GENE_FF3.bed -wa -wb >Bedttols_annotaion_only_for_gene
对二者进行注释的结果进行比较,由于bedtoo注释只考虑了Gene body没有考虑promoter,所以只对Gene body注释的基因进行比较,结果如下:
image.png从结果看,基本符合,由于bedtools这里只考虑了peak 与Gene 有至少>=50%的overlap部分(下图情况1,2),而对于Gene in peak(下图情况3)的这种情况没有计算(wa wb互换且,-F 1),所以 有一些出入。
image.png
网友评论