美文网首页
2020-06-24 学习WES数据分析流程3

2020-06-24 学习WES数据分析流程3

作者: 程凉皮儿 | 来源:发表于2020-06-24 21:46 被阅读0次

    今天的学习之前需要复习一下基础如上所示:
    使用HaplotypeCaller对上述的bam文件进行call variant,就是寻找变异的过程,生成一个VCF文件用于后续位点的质控和注释,从一个bam到一个vcf文件

    time ~/wes_cancer/biosoft/gatk-4.1.7.0/gatk HaplotypeCaller -R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta -I 9Y1640_bqsr.bam -O 9Y1640WES.HC.vcf
    

    这个过程比昨天的还要久运行结果如下 :

    6761758 total reads filtered
    09:19:47.687 INFO  ProgressMeter - HLA-DRB1*16:02:01:10700            464.1              11090871          23898.8
    09:19:47.687 INFO  ProgressMeter - Traversal complete. Processed 11090871 total regions in 464.1 minutes.
    09:19:48.095 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 17613.102690375003
    09:19:48.095 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 1732.41 sec
    09:19:48.095 INFO  HaplotypeCaller - Shutting down engine
    [June 24, 2020 at 9:19:48 AM UTC] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 464.12 minutes.
    Runtime.totalMemory()=2202009600
    
    real    464m13.962s
    user    474m2.054s
    sys 3m4.715s
    

    共耗时464.1分钟,之后就可以用不同的方法过滤掉置信度不高的位点,然后进行注释,注释过程见下文。
    先来比较一下和昨天获得的raw.vcf(也就是gvcf) .

    (base) root@1100150:~/wes_cancer/project/5.gatk# less -S 9Y1640_raw.vcf |grep -v "#"|less -S |head
    chr1    69091   .   A   <NON_REF>   .   .   END=69161   GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,11
    chr1    69162   .   T   <NON_REF>   .   .   END=69175   GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,73
    chr1    69176   .   G   <NON_REF>   .   .   END=69191   GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,93
    chr1    69192   .   A   <NON_REF>   .   .   END=69232   GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,54
    chr1    69233   .   T   <NON_REF>   .   .   END=69254   GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,95
    chr1    69255   .   G   <NON_REF>   .   .   END=69263   GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,151
    chr1    69264   .   C   <NON_REF>   .   .   END=69269   GT:DP:GQ:MIN_DP:PL  0/0:5:15:5:0,15,195
    chr1    69270   rs201219564 A   G,<NON_REF> 128.96  .   DB;DP=5;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQandDP=3125,5   GT:AD:DP:GQ:PL:SB   1/1:0,5,0:5:15:143,15,0,143,15,143:0,0,5,0
    chr1    69271   .   C   <NON_REF>   .   .   END=69276   GT:DP:GQ:MIN_DP:PL  0/0:5:15:5:0,15,189
    chr1    69277   .   G   <NON_REF>   .   .   END=69285   GT:DP:GQ:MIN_DP:PL  0/0:6:18:6:0,18,222
    (base) root@1100150:~/wes_cancer/project/5.gatk# less -S 9Y1640WES.HC.vcf |grep -v "#"|less -S |head
    chr1    14574   .   A   G   159.64  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.653;DP=37;ExcessHet=3.0103;FS=2.663;MLEAC=1;MLEAF=0.500;MQ=30.00;MQRankSum=-4.861;QD=4.56;ReadPosRankSum=-0.128;SOR=1.539    GT:AD:DP:GQ:PL  0/1:25,10:35:99:167,0,647
    chr1    14590   .   G   A   298.64  .   AC=1;AF=0.500;AN=2;BaseQRankSum=4.998;DP=48;ExcessHet=3.0103;FS=7.638;MLEAC=1;MLEAF=0.500;MQ=32.48;MQRankSum=-5.175;QD=6.22;ReadPosRankSum=1.525;SOR=2.670  GT:AD:DP:GQ:PL  0/1:38,10:48:99:306,0,1566
    chr1    14599   .   T   A   291.64  .   AC=1;AF=0.500;AN=2;BaseQRankSum=1.291;DP=50;ExcessHet=3.0103;FS=10.372;MLEAC=1;MLEAF=0.500;MQ=32.67;MQRankSum=-5.213;QD=5.83;ReadPosRankSum=1.614;SOR=3.014 GT:AD:DP:GQ:PL  0/1:40,10:50:99:299,0,1649
    chr1    14604   .   A   G   399.64  .   AC=1;AF=0.500;AN=2;BaseQRankSum=3.537;DP=59;ExcessHet=3.0103;FS=12.817;MLEAC=1;MLEAF=0.500;MQ=33.70;MQRankSum=-4.818;QD=6.77;ReadPosRankSum=0.247;SOR=3.533 GT:AD:DP:GQ:PL  0/1:46,13:59:99:407,0,1872
    chr1    14610   .   T   C   411.64  .   AC=1;AF=0.500;AN=2;BaseQRankSum=5.871;DP=66;ExcessHet=3.0103;FS=21.573;MLEAC=1;MLEAF=0.500;MQ=33.76;MQRankSum=-4.854;QD=6.33;ReadPosRankSum=0.367;SOR=4.266 GT:AD:DP:GQ:PL  0/1:51,14:65:99:419,0,1958
    chr1    14653   .   C   T   1074.64 .   AC=1;AF=0.500;AN=2;BaseQRankSum=1.334;DP=149;ExcessHet=3.0103;FS=6.839;MLEAC=1;MLEAF=0.500;MQ=37.03;MQRankSum=-2.288;QD=7.57;ReadPosRankSum=-3.040;SOR=1.266    GT:AD:DP:GQ:PL  0/1:89,53:142:99:1082,0,1965
    chr1    14932   .   G   T   67.64   .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.524;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=27.83;MQRankSum=-1.645;QD=13.53;ReadPosRankSum=1.036;SOR=1.609 GT:AD:DP:GQ:PL  0/1:3,2:5:75:75,0,120
    chr1    14937   .   T   C   67.64   .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.645;DP=5;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.53;MQRankSum=-1.645;QD=13.53;ReadPosRankSum=1.036;SOR=1.609 GT:AD:DP:GQ:PL  0/1:3,2:5:75:75,0,120
    chr1    15903   .   G   GC  585 .   AC=2;AF=1.00;AN=2;DP=19;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=37.15;QD=32.50;SOR=0.693    GT:AD:DP:GQ:PL  1/1:0,18:18:54:599,54,0
    chr1    16378   .   T   C   51.64   .   AC=1;AF=0.500;AN=2;BaseQRankSum=1.204;DP=10;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=23.26;MQRankSum=-2.189;QD=7.38;ReadPosRankSum=-1.718;SOR=0.495 GT:AD:DP:GQ:PL  0/1:4,3:7:59:59,0,98
    

    GATK找变异

    参考:https://zhuanlan.zhihu.com/p/69726572用GATK找变异

    #SNP calling
    # rescource 文件需要先index,参考文件尽量全,最后三行的文件都是输出
    # 步骤1-4
    time ~/wes_cancer/biosoft/gatk-4.1.7.0/gatk VariantRecalibrator -R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta -V 9Y1640WES.HC.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ~/wes_cancer/data/hapmap_3.3.hg38.vcf.gz \
    -resource:omini,known=false,training=true,truth=false,prior=12.0 ~/wes_cancer/data/1000G_omni2.5.hg38.vcf.gz \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 ~/wes_cancer/data/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ~/wes_cancer/data/dbsnp_146.hg38.vcf.gz \
    -an DP -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode SNP -tranche 100.0 \
    -tranche 99.9 -tranche 99.0 -tranche 95.0 -tranche 90.0 \
    -O 9Y1640WES.snp.recal \
    --tranches-file 9Y1640WES.snp.tranches \
    --rscript-file 9Y1640WES.snp.plots.R
    
    # 步骤5
    time ~/wes_cancer/biosoft/gatk-4.1.7.0/gatk ApplyVQSR -R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta -V 9Y1640WES.HC.vcf \
    --ts-filter-level 99.0 --tranches-file 9Y1640WES.snp.tranches \
    --recal-file 9Y1640WES.snp.recal \
    -mode SNP \
    --output data/9Y1640WES.snps.VQSR.vcf.gz
    
    ## 查看数据文件
    $ cat 9Y1640WES.snp.tranches
    

    -tranche默认是输出[100,99.9,99.0,90.0]4个tranche阈值的统计结果,如果想看其他阈值的结果,需要自行加上;结果就是看9Y1640WES.snp.tranches,而9Y1640WES.snp.recal文件则是用于变异质控ApplyVQSR的.

    #INDEL calling
    #加了--max-gaussians 6用于设定Gaussians(clusters of variants that have similar properties)的数目,即减少聚类的组数,从而使得每个组的变异位点数目达到要求
    time ~/wes_cancer/biosoft/gatk-4.1.7.0/gatk VariantRecalibrator -R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta -V data/9Y1640WES.snps.VQSR.vcf.gz \
    -resource:mills,known=true,training=true,truth=true,prior=12.0 ~/wes_cancer/data/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
    -an DP -an QD -an FS -an SOR -an ReadPosRankSum -an MQRankSum \
    -mode INDEL --max-gaussians 6 \
    --rscript-file 9Y1640WES_L1.snp.indel.plots.R \
    --tranches-file 9Y1640WES.snp.indel.tranches \
    --output data/9Y1640WES.snp.indel.recal
    
    
    ~/wes_cancer/biosoft/gatk-4.1.7.0/gatk ApplyVQSR \
    -R ~/wes_cancer/data/Homo_sapiens_assembly38.fasta \
    -V data/9Y1640WES.snps.VQSR.vcf.gz \
    -O 9Y1640WES.indel.VQSR.vcf \
    --truth-sensitivity-filter-level 99.0 \
    --tranches-file 9Y1640WES.snp.indel.tranches \
    --recal-file data/9Y1640WES.snp.indel.recal \
    -mode INDEL
    #(上面这个9Y1640WES.indel.VQSR.vcf就是我们千辛万苦想得到的文件!)
    

    VQSR大概步骤:

    1. GATK认为VQSR比根据各种annotations进行hard-filtering过滤要好,减少了人为阈值的局限性,避免了一刀切的弊端,从而在sensitivity和specificity之间达到一定的平衡

    2. VQSR根据机器学习算法从highly validated变异位点数据集(每个位点的annotation profile,一般使用5-8个annotation)中获取到good variants/bad variants

    3. 根据上述的位点从我们自己数据集中挑选出一个变异子集(probably true positives)来建模训练,获得一个可识别good variants的模型;bad variants的模型也是如此获得

    4. 然后根据上述获得的模型,对自己数- 据集的每个变异位点进行一个总的打分

    5. 最后根据设定的sensitivity阈值对变异位点进行过滤

    按照官方教程:

    SNP的VQSR过滤,选用的resource datasets为:

    • HapMap,hapmap_3.3.hg38.vcf.gz,truth=true表示VQSR将这个数据集中的变异位点作为真位点true sites,training=true表示VQSR将true sites用于训练recalibration model,并赋予这些变异位点prior likelihood值为Q15 (96.84%)
    • Omni,1000G_omni2.5.hg38.vcf.gz,truth=truetraining=false(文档中写着是true,参数建议中写着的是false。。。我就按照参数上的来了),Q12 (93.69%)
    • 1000G,1000G_phase1.snps.high_confidence.hg38.vcf.gz,truth=false表示VQSR考虑到在1000G数据集中的不仅包含了true variants还有false positives,training=trueQ10 (90%)
    • dbSNP,dbsnp_146.hg38.vcf.gz/dbsnp_138.hg38.vcf.gz,truth=false表示VQSR未将dbSNP数据集中的位点作为可信数据集,training=false表示不用于训练数据集,known=true表示stratify output metrics such as Ti/Tv ratio by whether variants are present in dbsnp or not,Q2 (36.90%)

    INDEL的VQSR过滤,选用的resource datasets为:

    • Mills,Mills_and_1000G_gold_standard.indels.hg38.vcf.gz,truth=truetraining=trueQ12 (93.69%)
    • dbSNP,dbsnp_146.hg38.vcf.gz/dbsnp_138.hg38.vcf.gz,truth=falsetraining=falseknown=trueQ2 (36.90%)

    查看最后的vcf文件:

    (wes) root@1100150:~/wes_cancer/project/5.gatk# tail 9Y1640WES.indel.VQSR.vcf
    chrUn_JTFH01001981v1_decoy  1912    .   T   C   841.06  PASS    AC=2;AF=1.00;AN=2;DP=22;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=42.43;QD=28.54;SOR=1.931;VQSLOD=0.691;culprit=SOR   GT:AD:DP:GQ:PL  1/1:0,22:22:57:855,57,0
    chrUn_JTFH01001981v1_decoy  1916    .   A   C   841.06  PASS    AC=2;AF=1.00;AN=2;DP=24;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=41.21;QD=33.27;SOR=2.124;VQSLOD=2.39;culprit=SORGT:AD:DP:GQ:PL  1/1:0,24:24:57:855,57,0
    chrUn_JTFH01001981v1_decoy  1923    .   T   C   886.06  VQSRTrancheSNP99.00to99.90  AC=2;AF=1.00;AN=2;DP=25;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=41.46;QD=29.92;SOR=1.893;VQSLOD=0.124;culprit=SOR   GT:AD:DP:GQ:PL  1/1:0,25:25:60:900,60,0
    chrUn_JTFH01001981v1_decoy  1945    .   T   C   1246.06 PASS    AC=2;AF=1.00;AN=2;DP=33;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=46.33;QD=30.47;SOR=1.751;VQSLOD=1.78;culprit=SORGT:AD:DP:GQ:PL  1/1:0,33:33:84:1260,84,0
    chrUn_JTFH01001981v1_decoy  1958    .   A   T   1370.06 PASS    AC=2;AF=1.00;AN=2;DP=39;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=45.74;QD=29.10;SOR=1.022;VQSLOD=2.74;culprit=FSGT:AD:DP:GQ:PL   1/1:0,33:33:99:1384,99,0
    chrUn_JTFH01001981v1_decoy  1995    .   T   A   906.06  VQSRTrancheSNP99.00to99.90  AC=2;AF=1.00;AN=2;DP=37;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=44.95;QD=24.49;SOR=0.746;VQSLOD=0.283;culprit=QD    GT:AD:DP:GQ:PL  1/1:0,37:37:99:920,99,0
    chrUn_JTFH01001981v1_decoy  2025    .   A   C   651.06  PASS    AC=2;AF=1.00;AN=2;DP=24;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=45.83;QD=27.13;SOR=0.859;VQSLOD=0.875;culprit=QDGT:AD:DP:GQ:PL  1/1:0,24:24:66:665,66,0
    chrUn_JTFH01001981v1_decoy  2069    .   A   C   447.06  VQSRTrancheSNP99.00to99.90  AC=2;AF=1.00;AN=2;DP=17;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=49.36;QD=26.30;SOR=1.739;VQSLOD=0.408;culprit=QD    GT:AD:DP:GQ:PL  1/1:0,17:17:48:461,48,0
    HLA-DRB1*16:02:01   5879    .   TGA T   40.60   VQSRTrancheINDEL99.00to99.90    AC=1;AF=0.500;AN=2;BaseQRankSum=-1.012;DP=15;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=51.08;MQRankSum=-1.571;NEGATIVE_TRAIN_SITE;QD=2.90;ReadPosRankSum=-1.184;SOR=0.223;VQSLOD=-2.913e+00;culprit=MQRankSum    GT:AD:DP:GQ:PL  0/1:12,2:14:48:48,0,428
    HLA-DRB1*16:02:01   5883    .   A   T   40.63   VQSRTrancheSNP99.90to100.00 AC=1;AF=0.500;AN=2;BaseQRankSum=-1.097;DP=15;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=51.08;MQRankSum=-1.571;NEGATIVE_TRAIN_SITE;QD=2.90;ReadPosRankSum=-1.602;SOR=0.223;VQSLOD=-4.746e+00;culprit=QD   GT:AD:DP:GQ:PL  0/1:12,2:14:48:48,0,428
    

    相关文章

      网友评论

          本文标题:2020-06-24 学习WES数据分析流程3

          本文链接:https://www.haomeiwen.com/subject/gfarfktx.html