R-loop数据分析之R-ChIP(peak calling)

作者: xuzhougeng | 来源:发表于2018-09-20 13:14 被阅读43次

    Peak Calling

    关于MACS2的使用方法, 我写了如何使用MACS进行peak calling详细地介绍了它的参数,在用MACS2之前尽量去阅读下。

    尽管 文章说他按照默认参数分别找narrow peak 和 broad peak, 即"We used MACS2 with default settings to call narrow (or broad when necessary) R-loop peaks",但是 MACS2的默认参数一开始会根据reads在正反链的分布建立双峰模型,确定偏移模型(shifting model),也就是它会认为示例CHTF8和CIRH1A位于正反链的信号视作一个IP结果的信号,因此用默认参数绝对有问题,所以 必须要增加--nomodel取消建模,且增加--shift 0 --extsize 150按照实验的条带长度对片段进行延长。

    cd analysis
    # HKE293 D210N
    macs2 callpeak --nomodel --shift 0 --extsize 150 -g hs --keep-dup all -f BAM -t 2-read-align/HKE293-D210N-V5ChIP-Rep*.flt.bam -c 2-read-align/HKE293-D210N-Input-Rep*.flt.bam --outdir 5-peak-calling/narrow -n HKE293-D210 2> log/HKE293-D210.macs2.narrow.log &
    macs2 callpeak --nomodel --shift 0 --extsize 150 -g hs --keep-dup all --broad -f BAM -t 2-read-align/HKE293-D210N-V5ChIP-Rep*.flt.bam -c 2-read-align/HKE293-D210N-Input-Rep*.flt.bam --outdir 5-peak-calling/broad -n HKE293-D210 2> log/HKE293-D210.macs2.broad.log &
    # HEK293 WKKD
    macs2 callpeak --nomodel --shift 0 --extsize 150  -g hs --keep-dup all -f BAM -t 2-read-align/HKE293-WKKD-V5ChIP.flt.bam -c 2-read-align/HKE293-WKKD-Input.flt.bam --outdir 5-peak-calling/narrow/ -n HKE293-WKKD 2> log/HKE293-WKKD.macs2.narrow.log &
    macs2 callpeak --nomodel --shift 0 --extsize 150  -g hs --keep-dup all --broad -f BAM -t 2-read-align/HKE293-WKKD-V5ChIP.flt.bam -c 2-read-align/HKE293-WKKD-Input.flt.bam --outdir 5-peak-calling/broad/ -n HKE293-WKKD 2> log/HKE293-WKKD.macs2.broad.log &
    

    可以将建模和不建模的narrowPeak结果在IGV进行比较,你会发现之前的CHTF8和CIRH1A上的两个信号峰在默认参数下就被认为是成一个。

    是否建模

    文章中对找到peak进行了一次筛选,标准是大于5倍富集和q-value 小于或等于0.001(broad peak则是0.0001),最后文章写着在D210N有12,906peak,然后剔除WKKD里的peak,还有12,507个。

    我找到的HKE293-D210的原始narrowPeak数为17,558, 按照作者的标准筛选后只剩下9,552,。对于HKE2930-D210的原始broadPeak数为42,912,过滤之后只剩2,998了。隐隐觉得peak有点太少了,于是我的标准是4倍变化。

    负对照的WKKD无论是narrowPeak还是broadPeak都是0

    cd analysis/5-peak-calling/
    fc=4
    # narrow peak
    awk -v fc=$fc '$7 >= fc && $9 >=3' narrow/HKE293-D210_peaks.narrowPeak > HKE293-D210_flt.narrowPeak
    # broad peak
    awk -v fc=$fc '$7 >= fc && $9 >=4' broad/HKE293-D210_peaks.broadPeak > HKE293-D210_flt.broadPeak
    

    之后可以用过滤后的D210N的narrowPeak和broadPeak的peak长度进行描述性统计分析,然后用箱线图展示其大小分布。

    library(data.table)
    library(ggplot2)
    
    narrowPeak <- fread(file="HKE293-D210_flt.narrowPeak",
                       sep="\t", header = F)
    broadPeak <- fread(file="HKE293-D210_flt.broadPeak",
                        sep="\t", header = F)
    peak_size <- log10(c(narrowPeak$V3 - narrowPeak$V2, broadPeak$V3 - broadPeak$V2 ))
    
    peak_from <- factor(rep(c('Narrow','Broad'), times=c(nrow(narrowPeak),nrow(broadPeak)) ),
                        levels=c("Narrow","Broad"))
    
    peak_df <- data.frame(size=peak_size, from=peak_from)
    
    my_clear_theme = theme_bw() + 
      theme(panel.grid.major = element_line(colour="NA"),
            panel.grid.minor = element_line(colour="NA"),
            panel.background = element_rect(fill="NA"),
            panel.border = element_rect(colour="black", fill=NA),
            legend.background = element_blank())
    
    ggplot(peak_df,aes(x=from,y=size,col=from)) + 
      geom_boxplot(notch = T,outlier.size = .5) +
      scale_y_continuous(breaks=c(log10(100),log10(300),log10(400),log10(450),log10(500),log10(1000),log10(2000)),
                         labels = c(100,300,400,450,500,1000,2000)) +
      coord_cartesian(ylim=c(2,3.5)) + labs(col="Peak Size") +
      my_clear_theme + xlab("") + ylab("Peak Size (bp)")
    ggsave("Fig2A_boxplot.pdf", width=4,height=6,units = "in")
    
    boxplot of peak width

    原文说自己的narrow peak 长度的中位数是199bp, broad peak的长度中位数是318 bp,和电镜观察的150–500 bp一致。我过滤后的peak的中位数是208,broad peak的中位数是581。

    如果你无脑用MACS2的参数话,就会得到右边的图,peak的长度明显都偏大了。

    还可以对Broad peak 和Narrow peak进行比较,看看有多少共同peak和特异性peak。

    #!/bin/bash
    
    peak1=${1?first peak}
    peak2=${2?second peak}
    outdir=${3?output dir}
    
    mkdir -p ${outdir}
    
    bedtools intersect -a $peak1 -b $peak2 > ${outdir}/$(basename ${peak1%%.*}).common.peak
    common=$(wc -l ${outdir}/$(basename ${peak1%%.*}).common.peak)
    echo "$common"
    
    bedtools subtract -A -a $peak1 -b $peak2 > ${outdir}/$(basename ${peak1}).only.bed
    left=$(wc -l ${outdir}/$(basename ${peak1}).only.bed)
    
    echo "$left"
    
    bedtools subtract -A -b $peak1 -a $peak2 > ${outdir}/$(basename ${peak2}).only.bed
    right=$(wc -l ${outdir}/$(basename ${peak2}).only.bed)
    
    echo "$right"
    

    运行:bash scripts/09.peak_comparison.sh analysis/5-peak-calling/HKE293-D210_flt.narrowPeak analysis/5-peak-calling/HKE293-D210_flt.broadPeak analysis/5-peak-calling/peak_compare > results/narrow_vs_broad.txt

    然后得到的数值就可以丢到R语言画图了,这个结果和Figure S2A一致。

    library(VennDiagram)
    grid.newpage()
    venn.plot <- VennDiagram::draw.pairwise.venn(area1=6401 + 7165,
                                    area2=286 + 7165,
                                    cross.area = 7165,
                                    category = c("Narrow specific","Broad specific"),
                                    scaled = F,
                                    fill=c("#c7d0e7","#f4babf"),
                                    lty='blank',
                                    cat.pos = c(180,0),#360度划分
                                    rotation.degree = 90 #整体旋转90
                                    )
    grid.draw(venn.plot)
    
    venn plot

    我们可以对这三类peak对应区间内的信号进行一下比较。统计信号强度的工具是bigwigSummary,来自于ucscGenomeBrowser工具集。

    脚本为scripts/10.peak_signal_summary.sh

    #!/bin/bash
    
    bed=${1?bed file}
    fwd=${2?forwad bigwig}
    rvs=${3?reverse bigwig}
    
    exec 0< $bed
    
    while read region
    do
        chr=$( echo $region | cut -d ' ' -f 1)
        sta=$( echo $region | cut -d ' ' -f 2)
        end=$( echo $region | cut -d ' ' -f 3)
        (bigWigSummary $fwd $chr $sta $end 1 ; bigWigSummary $rvs $chr $sta $end 1 )| awk -v out=0 '{out=out+$1} END{print out/2}'
    done
    

    分别统计三类peak的信号的信号

    bash scripts/10.peak_signal_summary.sh analysis/5-peak-calling/peak_compare/HKE293-D210_flt.broadPeak.only.bed analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_fwd.bw analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_rvs.bw > results/HKE293-D210N-V5ChIP.broadSignal
    bash scripts/10.peak_signal_summary.sh analysis/5-peak-calling/peak_compare/HKE293-D210_flt.narrowPeak.only.bed analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_fwd.bw analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_rvs.bw > results/HKE293-D210N-V5ChIP.narrowSignal
    bash scripts/10.peak_signal_summary.sh analysis/5-peak-calling/peak_compare/HKE293-D210_flt.common.peak analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_fwd.bw analysis/4-normliazed-bw/HKE293-D210N-V5ChIP_rvs.bw > results/HKE293-D210N-V5ChIP.commonSignal
    

    在R语言中绘图展示

    # peak signal
    commonSignal <- fread("./HKE293-D210N-V5ChIP.commonSignal")
    narrowSignal <- fread("./HKE293-D210N-V5ChIP.narrowSignal")
    broadSignal <-  fread("./HKE293-D210N-V5ChIP.broadSignal")
    
    data <- data.frame(signal=c(commonSignal$V1, narrowSignal$V1, broadSignal$V1),
                       type=factor(rep(c("common","narrow","broad"),
                                times=c(nrow(commonSignal),nrow(narrowSignal), nrow(broadSignal)))) )
    
    wilcox.test(data[data$type=="common",1],data[data$type=="broad",1])
    wilcox.test(data[data$type=="common",1],data[data$type=="narrow",1])
    
    
    p <- ggplot(data,aes(x=type,y=signal,col=type)) + 
      geom_boxplot(outlier.colour = "NA") +
      coord_cartesian(ylim=c(0,3.5)) +
      theme(panel.grid.major = element_line(colour="NA"),
            panel.grid.minor = element_line(colour="NA"),
            panel.background = element_rect(fill="NA"),
            panel.border = element_rect(colour="black", fill=NA)) +
      theme(axis.text.x = element_blank()) +
      xlab("") + ylab("R-loop Signal") + labs(col="") 
    
    
    p1 <- p + annotate("segment", x=1,xend=1,y=0.75,yend=2,size=0.5) + 
      annotate("segment", x=2,xend=2,y=1.25,yend=2,size=0.5) +
      annotate("segment", x=1,xend=2,y=2,yend=2,size=0.5) +
      annotate("text", x= 1.5, y=2.1, label="***") 
    
    p2 <- p1 + annotate("segment", x=3,xend=3,y=0.65,yend=3,size=0.5) + 
      annotate("segment", x=2,xend=2,y=2.2,yend=3,size=0.5) +
      annotate("segment", x=2,xend=3,y=3,yend=3,size=0.5) +
      annotate("text", x= 2.5, y=3.2, label="***") 
    
    p2
    
    signal comparison

    看图说话: 无论是narrow peak 特异区间的信号,还是broad peak 特异区间的信号都显著性低于其共有区间的信号。

    相关文章

      网友评论

        本文标题:R-loop数据分析之R-ChIP(peak calling)

        本文链接:https://www.haomeiwen.com/subject/mhagnftx.html