美文网首页ChipSeq数据分析ChIP-seq生信
R-loop数据分析之R-ChIP(样本间BAM比较和可视化)

R-loop数据分析之R-ChIP(样本间BAM比较和可视化)

作者: xuzhougeng | 来源:发表于2018-09-19 10:28 被阅读90次

    样本间相关性评估

    上一步得到各个样本的BAM文件之后,就可以在全基因组范围上看看这几个样本之间是否有差异。也就是先将基因组分成N个区间,然后用统计每个区间上比对上的read数。

    脚本scripts/07.genome_bin_read_coverage.sh如下

    #!/bin/bash
    
    set -e
    set -u
    set -o pipefail
    
    samples=${1?missing sample file}
    chromsize=${2:-index/hg19.chrom.sizes}
    size=${3:-3000}
    ALIGN_DIR="analysis/2-read-align"
    COV_DIR="analysis/3-genome-coverage"
    
    mkdir -p ${COV_DIR}
    
    exec 0< $samples
    
    while read sample
    do
        bedtools makewindows -g $chromsize -w $size | \
          bedtools intersect -b ${ALIGN_DIR}/${sample}.flt.bam -a - -c -bed > ${COV_DIR}/${sample}.ReadsCoverage
    done
    
    

    准备一个输入文件存放待处理样本的前缀,然后运行脚本bash scripts/07.genome_bin_read_coverage.sh samples_rep.txt

    HKE293-D210N-Input-Rep1
    HKE293-D210N-Input-Rep2
    HKE293-D210N-Input-Rep3
    HKE293-D210N-V5ChIP-Rep1
    HKE293-D210N-V5ChIP-Rep2
    HKE293-D210N-V5ChIP-Rep3
    

    最后将得到的文件导入到R语言中进行作图,使用的是基础绘图系统的光滑散点图(smoothScatter)。

    files_list <- list.files("r-chip/analysis/3-genome-coverage","ReadsCoverage")
    files_path <- file.path("r-chip/analysis/3-genome-coverage",files_list)
    
    input_rep1 <- read.table(files_path[1], sep='\t')
    input_rep2 <- read.table(files_path[2], sep='\t')
    input_rep3 <- read.table(files_path[3], sep='\t')
    chip_rep1 <- read.table(files_path[4], sep='\t')
    chip_rep2 <- read.table(files_path[5], sep='\t')
    chip_rep3 <- read.table(files_path[6], sep='\t')
    
    pw_plot <- function(x, y, 
                        xlab="x",
                        ylab="y", ...){
      log2x <- log2(x)
      log2y <- log2(y)
      smoothScatter(log2x,log2y,
                    cex=1.2,
                    xlim=c(0,12),ylim=c(0,12),
                    xlab=xlab,
                    ylab=ylab)
      text(3,10,paste("R = ",round(cor(x,y),2),sep=""))
    }
    
    par(mfrow=c(2,3))
    pw_plot(chip_rep1[,4], chip_rep2[,4],
            xlab = "Rep 1 (Log2 Tag Counts)",
            ylab = "Rep 2 (Log2 Tag Counts)")
    
    pw_plot(chip_rep1[,4], chip_rep3[,4],
            xlab = "Rep 1 (Log2 Tag Counts)",
            ylab = "Rep 2 (Log2 Tag Counts)")
    
    pw_plot(chip_rep2[,4], chip_rep3[,4],
            xlab = "Rep 1 (Log2 Tag Counts)",
            ylab = "Rep 2 (Log2 Tag Counts)")
    
    pw_plot(input_rep1[,4], input_rep2[,4],
            xlab = "Rep 1 (Log2 Tag Counts)",
            ylab = "Rep 2 (Log2 Tag Counts)")
    
    pw_plot(input_rep1[,4], input_rep3[,4],
            xlab = "Rep 1 (Log2 Tag Counts)",
            ylab = "Rep 3 (Log2 Tag Counts)")
    
    pw_plot(input_rep2[,4], input_rep3[,4],
            xlab = "Rep 2 (Log2 Tag Counts)",
            ylab = "Rep 3 (Log2 Tag Counts)")
    
    correlationship

    这种多个BAM文件之间相关性衡量,其实也可以用deepToolsplotCorrelation画出来,但是我觉得应该没有R语言画 的好看。

    BigWig可视化

    由于同一个样本间的BAM文件具有很强的相关性,因此可以将这些样本合并起来转换成bigwig格式,这样子在基因组浏览器(例如IGV, UCSC Browser, JBrowse)上方便展示。

    如下的代码的目的就是先合并BAM,然后转换成BigWig,拆分成正链和反链进行保存

    #!/bin/bash
    
    set -e
    set -u
    set -o pipefail
    
    samples=${1?please provied sample file}
    threads=${2-8}
    bs=${3-50}
    
    ALIGN_DIR="analysis/2-read-align"
    BW_DIR="analysis/4-normliazed-bw"
    
    mkdir -p $BW_DIR
    
    exec 0< $samples
    cd $ALIGN_DIR
    
    while read sample
    do
        bams=$(ls ${sample}*flt.bam | tr '\n' ' ')
        if [ ! -f ../$(basename ${BW_DIR})/${sample}.tmp.bam ]; then
        echo "samtools merge -f -@ ${threads} ../$(basename ${BW_DIR})/${sample}.tmp.bam $bams  &&
              samtools index ../$(basename ${BW_DIR})/${sample}.tmp.bam" | bash
        fi
    done
    
    cd ../../
    exec 0< $samples
    
    cd ${BW_DIR}
    while read sample
    do
        bamCoverage -b ${sample}.tmp.bam -o ${sample}_fwd.bw -of bigwig  \
          --filterRNAstrand forward --binSize ${bs} --normalizeUsing CPM --effectiveGenomeSize 2864785220 \
          --extendReads 150 -p ${threads} 2> ../log/${sample}_fwd.log
        bamCoverage -b ${sample}.tmp.bam -o ${sample}_rvs.bw -of bigwig  \
          --filterRNAstrand reverse --binSize ${bs} --normalizeUsing CPM --effectiveGenomeSize 2864785220 \
          --extendReads 150 -p ${threads} 2> ../log/${sample}_rvs.log
        rm -f ${sample}.tmp.bam ${sample}.tmp.bam.bai
    done
    
    

    得到的BW文件你可以在IGV上初步看看,比如说检查下文章Figure 1(E)提到的CIRH1A基因

    CIRH1A

    发表文章时肯定不能用上图,我们可以用R的Gviz进行展示下

    library(Gviz)
    
    #下面将scale等track写入tracklist
    tracklist<-list()
    itrack <- IdeogramTrack(genome = "hg19", chromosome = 'chr16',outline=T)
    tracklist[["itrack"]]<-itrack
    
    # 读取BigWig
    bw_file_path <- "C:/Users/DELL/Desktop/FigureYa/R-ChIP/"
    bw_file_names <- list.files(bw_file_path, "*.bw")
    bw_files <- file.path("C:/Users/DELL/Desktop/FigureYa/R-ChIP/",
                          bw_file_names)
    
    tracklist[['D210-fwd']] <- DataTrack(range = bw_files[3],
                                  genome="hg19",
                                  type="histogram",
                                  name='D210 + ',
                                  ylim=c(0,4),
                                  col.histogram="#2167a4",
                                  fill.histogram="#2167a4")
    tracklist[['D210-rvs']] <- DataTrack(range = bw_files[4],
                                         genome="hg19",
                                         type="histogram",
                                         name='D210 - ',
                                         ylim=c(4,0),
                                         col.histogram="#eb1700",
                                         fill.histogram="#eb1700")
    
    tracklist[['WKDD-fwd']] <- DataTrack(range = bw_files[9],
                                         genome="hg19",
                                         type="histogram",
                                         name='WKDD + ',
                                         ylim=c(0,4),
                                         ylab=2,
                                         col.histogram="#2167a4",
                                         fill.histogram="#2167a4")
    tracklist[['WKDD-rvs']] <- DataTrack(range = bw_files[10],
                                         genome="hg19",
                                         type="histogram",
                                         name=' WKDD -',
                                         ylim=c(4,0),
                                         col.histogram="#eb1700",
                                         fill.histogram="#eb1700",
                                         showAxis=TRUE)
    
    #写入比例尺
    scalebar <- GenomeAxisTrack(scale=0.25,
                                col="black",
                                fontcolor="black",
                                name="Scale",
                                labelPos="above",showTitle=F)
    tracklist[["scalebar"]]<-scalebar
    
    # 画图
    plotTracks(trackList = tracklist,
               chromosome = "chr16",
               from =  69141913, to= 69205033,
               background.panel = "#f6f6f6",
               background.title = "white",
               col.title="black",col.axis="black",
               rot.title=0,cex.title=0.5,margin=10,title.width=1.75,
               cex.axis=1
               )
    
    Gviz

    后续用AI修改下坐标轴,就几乎和原图差不多了。此外可能还要调整之前脚本的bin size, 使得整体更加平滑

    看图说话: 由于WKKD在D210N的基础上继续突变了几个位点,使得原本只丧失了RNASEH1的催化活性的D210N进一步丧失了结合到RNA/DNA杂合体的能力,在实验中就可以作为 负对照。上图就是其中一个有代表性的区间。

    相关文章

      网友评论

        本文标题:R-loop数据分析之R-ChIP(样本间BAM比较和可视化)

        本文链接:https://www.haomeiwen.com/subject/aiibnftx.html