美文网首页ATAC-seq
【ATAC-Seq 实战】四、计算插入片段长度,FRiP,IDR

【ATAC-Seq 实战】四、计算插入片段长度,FRiP,IDR

作者: 佳奥 | 来源:发表于2022-12-30 22:41 被阅读0次

    这里是佳奥!

    2022年的最后一天,让我们继续ATAC-Seq的学习!

    1 计算插入片段长度

    非冗余非线粒体能够比对的fragment、比对率、NRF、PBC1、PBC2、peak数、无核小体区NFR、TSS富集、FRiP 、IDR重复的一致性

    根据bam文件第9列,在R里面统计绘图

    samtools view 2-ce11-2.last.bam | cut -f 9 >1.txt
    
    apt install r-base-core
    
    $ R
    > a=read.table('1.txt')
    > dim(a)
    [1] 7292144       1
    > png('hist.png')
    > hist(as.numeric(a[,1]))
    > dev.off
    > q()
    
    hist.png
    hist(abs(as.numeric(a[,1])), breaks=100)
    
    hist2.png

    批量脚本

    ##创建一个config.last.bam文件,里面内容包含bam文件的名称
    2-cell-1.last.bam 2-cell-1.last
    2-cell-2.last.bam 2-cell-2.last
    2-cell-4.last.bam 2-cell-4.last
    2-cell-5.last.bam 2-cell-5.last
    
    ##提取bam文件的第九列indel插入长度信息
    cat config.last.bam | while read id;
    do
    arr=($id)
    sample=${arr[0]}
    sample_name=${arr[1]}
    samtools view $sample | awk '{print $9}'  > ${sample_name}.length.txt
    done
    
    ##准备一个用于R语言批量绘制indel分布的文本输入文件config.indel.length.distribution
    2-cell-1.last.length.txt 2-cell-1.last.length
    2-cell-2.last.length.txt 2-cell-2.last.length
    2-cell-4.last.length.txt 2-cell-4.last.length
    2-cell-5.last.length.txt 2-cell-5.last.length
    
    ##有了上面的文件就可以批量检验bam文件进行出图。创建批量运行的shell脚本
    cat config.indel.length.distribution  | while read id;
    do
    arr=($id)
    input=${arr[0]}
    output=${arr[1]}
    Rscript indel.length.distribution.R $input $output
    done
    
    ##indel.length.distribution.R
    cmd=commandArgs(trailingOnly=TRUE); 
    input=cmd[1]; output=cmd[2]; 
    a=abs(as.numeric(read.table(input)[,1])); 
    png(file=output);
    hist(a,
    main="Insertion Size distribution",
    ylab="Read Count",xlab="Insert Size",
    xaxt="n",
    breaks=seq(0,max(a),by=10)
    ); 
    
    axis(side=1,
    at=seq(0,max(a),by=100),
    labels=seq(0,max(a),by=100)
    );
    
    dev.off()  
    

    2 FRiP值的计算

    fraction of reads in called peak regions

    Fraction of reads in peaks (FRiP) - Fraction of all mapped reads that fall into the called peak regions, i.e. usable reads in significantly enriched peaks divided by all usable reads. In general, FRiP scores correlate positively with the number of regions. (Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)

    bedtools intersect -a ../align/2-ceLL-1.bed -b 2-ceLL-1_peaks.narrowPeak |wc -l
    148210
    
    wc ../align/2-ceLL-1.bed
    5105844
    wc ../align/2-ceLL-1.raw.bed
    5105844
    
    ls *narrowPeak|while  read id;
    do 
    echo $id
    bed=../align/$(basename $id "_peaks.narrowPeak").raw.bed
    #ls -lh $bed 
    Reads=$(bedtools intersect -a $bed -b $id |wc -l|awk '{print $1}')
    totalReads=$(wc -l $bed|awk '{print $1}')
    echo $Reads  $totalReads 
    echo '==> FRiP value:' $(bc <<< "scale=2;100*$Reads/$totalReads")'%'
    done 
    
    2-ce11-2_peaks.narrowPeak
    3420904 95149325
    ==> FRiP value: 3.59%
    2-ce11-4_peaks.narrowPeak
    1126859 29866961
    ==> FRiP value: 3.77%
    2-ce11-5_peaks.narrowPeak
    4259835 103697403
    ==> FRiP value: 4.10%
    2-ceLL-1_peaks.narrowPeak
    2488167 62365958
    ==> FRiP value: 3.98%
    

    只显示.bam,其他不显示:

    $ ls 2-ce11-?.raw.bam
    
    2-ce11-2.raw.bam  2-ce11-4.raw.bam  2-ce11-5.raw.bam
    

    可以使用R包看不同peaks文件的overlap情况:


    QQ截图20221231175609.png
    if(F){
      options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/") 
      options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
      source("http://bioconductor.org/biocLite.R") 
    
      BiocManager::install('ChIPseeker')
      BiocManager::install('ChIPpeakAnno')
    
    }
    
    library(ChIPseeker)
    library(ChIPpeakAnno)
    list.files('D:/ATAC-Seq/数据/',"*.narrowPeak")
    tmp=lapply(list.files('D:/ATAC-Seq/数据/',"*.narrowPeak"),function(x){
      return(readPeakFile(file.path('D:/ATAC-Seq/数据/', x))) 
    })
    
    ol <- findOverlapsOfPeaks(tmp[[1]],tmp[[4]])
    png('overlapVenn.png')
    makeVennDiagram(ol)
    dev.off()
    
    QQ截图20221231180630.png

    3 IDR计算

    也可以使用专业软件,IDR 来进行计算出来,同时考虑peaks间的overlap,和富集倍数的一致性 。

    详细的教程:

    https://www.jianshu.com/p/d8a7056b4294
    
    source activate atac
    # 可以用search先进行检索
    conda search idr
    source  deactivate
    ## 保证所有的软件都是安装在 py3 这个环境下面
    conda  create -n py3 -y python=3 idr
    conda activate py3
    conda install -c bioconda idr
    
    idr -h 
    idr --samples  2-ceLL-1_peaks.narrowPeak 2-ce11-2_peaks.narrowPeak --plot
    
    idr --samples 2-ceLL-1_peaks.narrowPeak 2-ce11-2_peaks.narrowPeak \
    --input-file-type narrowPeak \
    --rank p.value \
    --output-file sample-idr \
    --plot \
    --log-output-file sample.idr.log
    

    4 deeptools可视化

    需要把.bam转化为.bw

    http://www.bio-info-trainee.com/1815.html
    
    cd  ~/project/atac/align
    source activate atac
    # ls  *.bam  |xargs -i samtools index {} 
    ls *last.bam |while read id;do
    nohup bamCoverage -p 5 --normalizeUsing CPM -b $id -o ${id%%.*}.last.bw & 
    done 
    
    cd dup 
    ls  *.bam  |xargs -i samtools index {} 
    ls *.bam |while read id;do
    nohup bamCoverage --normalizeUsing CPM -b $id -o ${id%%.*}.rm.bw & 
    done 
    

    .bw文件的IGV可视化

    QQ截图20221231210807.png
    查看TSS附件信号强度
    ## both -R and -S can accept multiple files 
    mkdir -p  ~/project/atac/tss
    cd   ~/project/atac/tss 
    source activate atac
    
    computeMatrix reference-point  --referencePoint TSS  -p 15  \
    -b 10000 -a 10000    \
    -R /home/kaoku/refer/mm10/ucsc.refseq.bed  \
    -S /home/kaoku/project/atac/align/*.bw  \
    --skipZeros  -o matrix1_test_TSS.gz  \
    --outFileSortedRegions regions1_test_genes.bed
    
    ## both plotHeatmap and plotProfile will use the output from   computeMatrix
    plotHeatmap -m matrix1_test_TSS.gz  -out test_Heatmap.png
    plotHeatmap -m matrix1_test_TSS.gz  -out test_Heatmap.pdf --plotFileFormat pdf  --dpi 720  
    plotProfile -m matrix1_test_TSS.gz  -out test_Profile.png
    plotProfile -m matrix1_test_TSS.gz  -out test_Profile.pdf --plotFileFormat pdf --perGroup --dpi 720 
    

    下载参考.bed

    http://genome.ucsc.edu/cgi-bin/hgTables
    
    ##具体转化方法
    https://www.jianshu.com/p/5d078d517770
    
    QQ截图20221231211404.png
    绘制的热图
    test_Heatmap.png
    查看基因body的信号强度
    source activate atac
    computeMatrix scale-regions  -p 15  \
    -R /home/kaoku/refer/mm10/ucsc.refseq.bed  \
    -S /home/kaoku/project/atac/align/*.bw  \
    -b 10000 -a 10000  \
    --skipZeros -o matrix1_test_body.gz
    plotHeatmap -m matrix1_test_body.gz  -out ExampleHeatmap1.png 
    
    plotHeatmap -m matrix1_test_body.gz  -out test_body_Heatmap.png
    plotProfile -m matrix1_test_body.gz  -out test_body_Profile.png
    

    绘制的热图

    test_body_Heatmap.png
    ngsplot也是可以的。

    上面的批量代码其实就是为了统计全基因组范围的peak在基因特征的分布情况,也就是需要用到computeMatrix计算,用plotHeatmap以热图的方式对覆盖进行可视化,用plotProfile以折线图的方式展示覆盖情况。

    computeMatrix具有两个模式: scale-regionreference-point。前者用来信号在一个区域内分布,后者查看信号相对于某一个点的分布情况。无论是那个模式,都有有两个参数是必须的,-S是 提供bigwig文件,-R是提供基因的注释信息。

    ##deeptools官方文档
    https://deeptools.readthedocs.io/en/develop/content/tools/computeMatrix.html#id10
    

    补充:

    查看进程:
    top
    
    彩色界面:
    htop
    

    下一步便是peaks的注释。

    我们下一篇再见!

    相关文章

      网友评论

        本文标题:【ATAC-Seq 实战】四、计算插入片段长度,FRiP,IDR

        本文链接:https://www.haomeiwen.com/subject/hgsfcdtx.html