检测bam文件的完整度

作者: 因地制宜的生信达人 | 来源:发表于2019-04-03 07:42 被阅读131次

    检测bam文件的完整度

    本来以为有bai文件就说明是完整的,事实上我还是太年轻,最近处理一个 600个病人的肿瘤WES数据,走流程过程发现卡在CNVKIT,部分样本出现了:

      File "pysam/libcalignmentfile.pyx", line 729, in pysam.libcalignmentfile.AlignmentFile.__cinit__
      File "pysam/libcalignmentfile.pyx", line 930, in pysam.libcalignmentfile.AlignmentFile._open
      File "pysam/libchtslib.pyx", line 364, in pysam.libchtslib.HTSFile.check_truncation
    IOError: no BGZF EOF marker; file may be truncated
    

    很明显,是bam文件有问题!

    为什么bam文件报错

    因为这 600个病人的肿瘤WES数据我是批量运行的,日志也是一大堆,没有心情去检查到底是哪个步骤出错。

    就先调出错误的样本,重新跑一次流程。简单的脚本检查,发现我输出的bam文件都是有bai文件的,如下:

    ls *bam|while read id;do (ls -lh ${id/bam/bai});done
    

    也就是说,一个不完整的bam文件,软件仍然是可以为它构建bai索引文件的!

    而且诡异的是

    samtools view T_recal.bam|head # 没有问题
    samtools view T_recal.bam|tail # 报错
    # [W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
    

    报错如下:

    [W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
    [E::bgzf_read] bgzf_read_block error -1 after 31 of 32 bytes
    [main_samview] truncated file.
    

    而且两个统计命令,虽然都报错,其中有的有结果,有的没有结果。

    samtools flagstat $sample.bam > ${sample}.alignment.flagstat & ## 能统计
    samtools stats  $sample.bam > ${sample}.alignment.stat &  ## 空着
    

    结果如下:

    -rw-rw-r-- 1 jianmingzeng jianmingzeng    0 Dec 28 17:44 K1_recal.alignment.stat
    -rw-rw-r-- 1 jianmingzeng jianmingzeng    0 Dec 28 17:58 K2_recal.alignment.stat
    -rw-rw-r-- 1 jianmingzeng jianmingzeng    0 Dec 28 19:55 S1_recal.alignment.stat
    -rw-rw-r-- 1 jianmingzeng jianmingzeng    0 Dec 28 19:53 S2_recal.alignment.stat
    

    挑选出错误的bam文件

    既然上面是tail报错,就批量写脚本

    ls *bam|while read id;do (echo $id; samtools view $id|tail -1);done
    

    但是肿瘤WGS样本通常是测序量不小,这个运行速度令人发指,最后发现了那些报错样本

    K1_recal.bam
    [W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
    --
    K2.bam
    [W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
    --
    S1.bam
    [W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
    --
    S2.bam
    [W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
    

    经过群友点播,发现了更好的方法:

    for i in *.bam ;do (samtools quickcheck $i && echo "ok" || echo $i error);done
    

    还有一个很诡异的方法,可以检查一下bam文件尾部的28个字符是不是1f8b08040000000000ff0600424302001b0003000000000000000000,使用命令如下:

    tail -c 28 test.bam| xxd -p
    

    仔细查看发现问题所在

    是因为服务器限制我的任务只能运行24小时,在最后一点点时间被kill了。

    15:04:14.333 INFO  ProgressMeter - chr1_KI270711v1_random:18313            188.7              91048000         482377.5
    15:04:24.334 INFO  ProgressMeter - chr3_KI270777v1_alt:28930            188.9              91499000         484339.2
    15:04:34.337 INFO  ProgressMeter - chr11_KI270832v1_alt:127459            189.1              91955000         486323.8
    slurmstepd: *** JOB 213314 ON compute-1-0 CANCELLED AT 2019-02-23T15:05:06 DUE TO TIME LIMIT ***
    15:05:28.874 INFO  ProgressMeter - chr11_KI270832v1_alt:136360            190.0              91956000         484002.4
    

    相关文章

      网友评论

        本文标题:检测bam文件的完整度

        本文链接:https://www.haomeiwen.com/subject/lgqwbqtx.html