美文网首页全基因组/外显子组测序分析生信小白
全外显子组数据分析笔记(四):变异检测

全外显子组数据分析笔记(四):变异检测

作者: TOP生物信息 | 来源:发表于2018-11-11 19:52 被阅读51次

    变异检测

    得到原始vcf的命令很简单

    #snp=/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/dbsnp_146.hg38.vcf.gz
    ${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" HaplotypeCaller \
                   -R ${ref}  \
                   -I ${info2}.smrb.bam \
                   --dbsnp ${snp} \
                   -O ${info2}.raw.vcf \
                   1>${info2}.HaplotypeCaller.log 2>&1
    

    如果只想对特定的区域进行变异检测呢?可以这样:

    samtools view -h CL100072545.44.smrb.bam chr19:1205798-1228434 | samtools sort -o STK11.smrb.bam -
    samtools index STK11.smrb.bam

    这里举一个错误的例子:

    samtools view -h CL100072545.44.smrb.bam chr19:1205798-1228434 > STK11.smrb.bam

    这是不对的,重定向(>)之前文件是以sam格式打开的,即使重新定义为bam,它本身还是sam文件(可以直接用less查看),所以可以samtools sort转换一下(虽然排序的作用是多余的,但生成了真的bam文件)。
    接着可以简单看一下指定区域的比对情况,用samtools tview命令,类似于IGV,但远没有IGV清晰。

    samtools tview -p chr19:1205798-1228434 STK11.smrb.bam \
    --reference ~/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta
    #如下
    #此界面下输入问号?显示帮助信息,区域小还能用一下,区域大了简直难受
       1206801   1206811   1206821   1206831   1206841   1206851   1206861   1206871   1206881   1206891   1206901   1206911   1206921
    TTTTTTTTTTCTTTTTTCTTTGTAAAATTTTGGAGAAGGGAAGTCGGAACACAAGGAAGGACCGCTCACCCGCGGACTCAGGGCTGGCGGCGGGACTCCAGGACCCTGGGTCCAGCATGGAGGTGGTGGACCC
    .....................................................................................................................................
    ......... ..................................................    .................................................. ,,,,,,,,,,,,,,,,,,
    ................ ..................................................   ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,  ,,,,,,,,,,,
    ..................    ..................................................   ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,
    ..................     ..................................................   ........................T.........................      .
    ..................     ..................................................   ..................................................
    ........................... ..................................................              ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
    ...................................  ..................................................        ......................................
    ......................................   ..................................................    ......................................
    ........................................... .................................................. ......................................
    ........................................... .................................................. ......................................
    ................................................   ..................................................  t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
         ..................................................   ............................................G.....   ......................
          ..................................................        .................................................. ,,,,,,,,,,,,,,,,,,
          ..................................................        .................................................. ,,,,,,,,,,,,,,,,,,
           ...........................A.........G............   .................................................. ,,,,,,,,,,,,,,,,,,
           ..................................................   ..................................................  .................
           ..................................................   ..................................................   ................
           ..................................................   ........................N.........................   ................
    

    针对特定基因找变异

    这里找变异的流程主要参考的是samtools官网提供的流程,链接http://www.htslib.org/workflow/#mapping_to_variant
    在官网主页有这样一段话:

    Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:
    Samtools
    Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format
    BCFtools
    Reading/writing BCF2/VCF/gVCF files and calling/filtering/summarising SNP and short indel sequence variants
    HTSlib
    A C library for reading/writing high-throughput sequencing data

    第一步
    bcftools mpileup --output-type u -o STK11.bcf \
    -f ~/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
    STK11.smrb.bam
    这里--output-type u表示输出不压缩的bcf文件
    

    这一步之后会生成bcf文件,我们知道vcf(variant call format)是一种用于存储基因序列突变信息的文本格式,而bcf格式文件是vcf格式的二进制文件,所以bcf不能直接查看。注意这里强调的是格式而不是内容,上面的bcf文件直接由bam得到,没有经过任何筛选,所以不可能每一个位置都是变异点。

    可以这样查看
    bcftools view ./STK11.bcf | lsx
    如下,ALT是<*>当然就是没有变异咯
    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CL100072545
    chr19   1206383 .       C       <*>     0       .       DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,0,0,0,0;QS=1,0;MQ0F=0   PL      0,3,30
    chr19   1206384 .       G       <*>     0       .       DP=1;I16=1,0,0,0,27,729,0,0,60,3600,0,0,1,1,0,0;QS=1,0;MQ0F=0   PL      0,3,27
    chr19   1206385 .       G       <*>     0       .       DP=1;I16=1,0,0,0,31,961,0,0,60,3600,0,0,2,4,0,0;QS=1,0;MQ0F=0   PL      0,3,31
    再来看看vcf文件(事先准备好的)
    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CL100072545
    chr19   1207142 .       GT      G       222     .       INDEL;IDV=29;IMF=0.432836;DP=67;VDB=0.74912;SGB=-0.693079;MQSB=1;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=15,23,15,14;MQ=60   GT:PL   0/1:255,0,255
    chr19   1207239 .       G       T       225     .       DP=68;VDB=0.150142;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,11,51;MQ=60  GT:PL   1/1:255,184,0
    chr19   1207281 .       C       T       131     .       DP=34;VDB=0.232203;SGB=-0.69168;RPB=0.572388;MQB=1;BQB=0.010939;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,14,0,19;MQ=60      GT:PL   0/1:164,0,183
    
    第二步
    bcftools call -v -m -O v -o STK11.vcf STK11.bcf
    -v  output variant sites only
    -O v  output type: 'v' uncompressed VCF
    

    变异校正(VQSR, Variant Quality Score Recalibration)

    第一步
    ${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" VariantRecalibrator \
            -R ${ref}  \
            -V ${info2}.raw.vcf \
            --resource 1000G,known=false,training=true,truth=false,prior=10.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
            --resource hapmap,known=false,training=true,truth=true,prior=15.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/hapmap_3.3.hg38.vcf.gz \
            --resource dbsnp,known=true,training=false,truth=false,prior=2.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/dbsnp_146.hg38.vcf.gz \
            -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
            -mode SNP \
            -O ${info2}.snps.recal \
            --tranches-file ${info2}.snps.tranches \
            --rscript-file ${info2}.snps.plots.R
    
    第二步
    ${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" ApplyVQSR \
            -R ${ref}  \
            -V ${info2}.raw.vcf \
            -mode SNP \
            --recal-file ${info2}.snps.recal \
            --tranches-file ${info2}.snps.tranches \
            -O ${info2}.snps.VQSR.vcf
            --truth-sensitivity-filter-level 99.0
    

    再对indel来一遍

    ${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" VariantRecalibrator \
            -R ${ref}  \
            -V ${info2}.snps.VQSR.vcf \
            -resource mills,known=true,training=true,truth=true,prior=12.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
            -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
            -mode INDEL \
            -O  ${info2}.snps.indels.recal \
            --tranches-file ${info2}.snps.indels.tranches \
            --rscript-file ${info2}.snps.indels.plots.R
    ${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" ApplyVQSR \
            -R ${ref}  \
            -V ${info2}.snps.VQSR.vcf \
            -mode INDEL \
            --recal-file ${info2}.snps.indels.recal \
            --tranches-file ${info2}.snps.indels.tranches \
            -O ${info2}.snps.indels.VQSR.vcf
            --truth-sensitivity-filter-level 99.0
    

    这一步其实就是变异过滤,通常来讲有两种策略:

    硬指标过滤

    给一些指标设定临界值,各项指标大于临界值的位点才会被保留。

    • 比如vcf文件中INFO这一列里面的:RMSMappingQuality (MQ) 40.0: 表示覆盖序列质量的均方值RMS,小于40的过滤掉;
    • QualByDepth (QD) 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples等等。
    • 还有QUAL这一列的值,表示在该位点存在突变的可能性;该值越高,则突变的可能性越大;计算方法:-10 * log (1-p) p为突变存在的概率。
      一些数据库不完备的物种可以这样做,但过滤阈值的选取不太好定,官网提供的推荐值(本文下方有链接)显然不能适用于所有情况。所以最好还是请教一下有经验的人。
    变异校正(机器学习的方法)

    就是上面的VQSR,更强大,但需要非常完备的数据集。


    Hard-filtering is a very blunt instrument
    Variant recalibration is far more discriminating

    接下来就可以按照snp和indel把vcf分开,到这儿才算结束了这一步。

    java -jar ~/GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
    -R /ifs1/Grp3/huangsiyuan/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
    -V CL100072545.44.snps.indels.VQSR.vcf \
    -selectType SNP -o snp.vcf 
    java -jar ~/GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
    -R /ifs1/Grp3/huangsiyuan/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
    -V CL100072545.44.snps.indels.VQSR.vcf \
    -selectType INDEL -o indel.vcf
    

    reference

    (howto) Apply hard filters to a call set: https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set

    相关文章

      网友评论

        本文标题:全外显子组数据分析笔记(四):变异检测

        本文链接:https://www.haomeiwen.com/subject/xduoxqtx.html