美文网首页
10-GATK的最佳实践-bqsr检查

10-GATK的最佳实践-bqsr检查

作者: jiarf | 来源:发表于2022-05-08 14:31 被阅读0次

    前面sed跑完之后就可以跑bqsr了,sed还是很快的,
    在跑一下

    GATK=/data1/jiarongf/learning/cancer-WES/run/gatk-4.2.6.1/gatk
    snp=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz
    indel=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    ref=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta
    cd /data1/jiarongf/learning/cancer-WES/5.GATK
    cat ../data/case  | while read id
    do
        if [ ! -f ${id}_bqsr.bam ]
        then
        echo "start BQSR for ${id}" `date`
        $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp/"  BaseRecalibrator \
        -R $ref  \
        -I ${id}_marked.chr.bam  \
        --known-sites $snp \
        --known-sites $indel \
        -O ${id}_recal.table \
        1>../log/${id}_log.recal 2>&1 
    
        $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp"  ApplyBQSR \
        -R $ref  \
        -I ${id}_marked.chr.bam  \
        -bqsr ${id}_recal.table \
        -O ${id}_bqsr.bam \
        1>../log/${id}_log.ApplyBQSR  2>&1 
    
        echo "end BQSR for ${id}" `date`
        fi
    done
    
    # gatk --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
        # -R /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta  \
        # -I case1_biorep_A_techrep_WES_marked.chr.sed.bam  \
        # --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz \
        # --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
        # -O case1_biorep_A_techrep_WES_recal.table \
    

    看着一点第一个出来的log。recal,那个chrMT还有没有问题了

    (KG) 11:37:50 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/log
    $
    tail case5_germline_WES_log.recal
    11:39:04.846 INFO  ProgressMeter -        chr11:8872371              6.8              17145000        2505382.4
    11:39:14.853 INFO  ProgressMeter -       chr11:18291832              7.0              17583000        2508262.0
    11:39:24.871 INFO  ProgressMeter -       chr11:31370706              7.2              17928000        2497973.9
    11:39:34.896 INFO  ProgressMeter -       chr11:44600085              7.3              18257000        2485941.1
    11:39:44.903 INFO  ProgressMeter -       chr11:55913813              7.5              18663000        2484794.3
    11:39:54.904 INFO  ProgressMeter -       chr11:62690975              7.7              19155000        2494931.1
    11:40:04.908 INFO  ProgressMeter -       chr11:67441227              7.8              19681000        2508960.9
    11:40:14.943 INFO  ProgressMeter -       chr11:75071241              8.0              20112000        2510375.6
    11:40:25.006 INFO  ProgressMeter -       chr11:86807547              8.2              20467000        2502307.7
    11:40:35.009 INFO  ProgressMeter -      chr11:100089766              8.3              20804000        2492696.1
    
    

    才跑到11号染色体,在等的
    感觉不对,发现sed那个还是没有把chrMT改成chrM

    samtools view -h case2_techrep_2_WES_marked.chr.bam | sed 's/chrMT/chrM/g' | samtools view -b > case2_techrep_2_WES_marked.chr.sed.bam
    

    先试一个,跑的挺慢的

    (KG) 14:09:06 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
    $
    samtools view -h case2_techrep_2_WES_marked.chr.bam | sed 's/chrMT/chrM/g' | samtools view -b > case2_techrep_2_WES_marked.chr.sed.bam
    (KG) 14:32:22 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
    $
    samtools view -h case2_techrep_2_WES_marked.chr.sed.bam | grep chrMT
    (KG) 14:41:19 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
    

    可以看到已经没有这chrMT了,在循环跑一下chrMT的代替

    (KG) 14:43:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
    $
    bash sed_chrMT.sh > ../log/sed_chrMT.sh.log 2>&1
    
    

    这个5.GATK下面已经六百多G了,删掉一些吧,目前这个文件下面



    删掉marked.bam就可以了

    (KG) 14:43:26 jiarongf@172.16.10.223:/data1/jiarongf/learning/cancer-WES/5.GATK
    $
    bash sed_chrMT.sh > ../log/sed_chrMT.sh.log 2>&1
    ^C
    

    因为跑的时间太长了,把每个都分开跑吧,多开几个screen
    只跑了case1的
    做bqsr

    GATK=/data1/jiarongf/learning/cancer-WES/run/gatk-4.2.6.1/gatk
    snp=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz
    indel=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    ref=/data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta
    cd /data1/jiarongf/learning/cancer-WES/5.GATK
    #cat ../data/case  | while read id
    ls *chr.sed.bam | while read id
    do
        if [ ! -f ${id}_bqsr.bam ]
        then
        echo "start BQSR for ${id}" `date`
        $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp/"  BaseRecalibrator \
        -R $ref  \
        -I ${id}_marked.chr.sed.bam  \
        --known-sites $snp \
        --known-sites $indel \
        -O ${id}_recal.table \
        1>../log/${id}_log.recal 2>&1 
    
        $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp"  ApplyBQSR \
        -R $ref  \
        -I ${id}_marked.chr.sed.bam  \
        -bqsr ${id}_recal.table \
        -O ${id}_bqsr.bam \
        1>../log/${id}_log.ApplyBQSR  2>&1 
    
        echo "end BQSR for ${id}" `date`
        fi
    done
    
    # gatk --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
        # -R /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Homo_sapiens_assembly38.fasta  \
        # -I case1_biorep_A_techrep_WES_marked.chr.sed.bam  \
        # --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/dbsnp_146.hg38.vcf.gz \
        # --known-sites /data1/jiarongf/learning/cancer-WES/run/gatk_resource/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
        # -O case1_biorep_A_techrep_WES_recal.table \
    
    又有问题了,我猜后面的染色体应该都有问题,所以这里告诉我们不应该乱加自己的chr,还有就是之前bwa比对的时候,那个基因组应该是加chr的染色体,不然比对的bam文件到后面的GATK的fasta基因组的时候会有很大的问题,所以比对的那个也要找找怎么回事

    好了这个我猜后面还有错误的,就不继续往下了,就是要记住自己比对的bwa基因组要和GTAK的基因组的染色体要注意一致。

    相关文章

      网友评论

          本文标题:10-GATK的最佳实践-bqsr检查

          本文链接:https://www.haomeiwen.com/subject/bcgayrtx.html