变异检测
得到原始vcf的命令很简单
#snp=/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/dbsnp_146.hg38.vcf.gz
${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" HaplotypeCaller \
-R ${ref} \
-I ${info2}.smrb.bam \
--dbsnp ${snp} \
-O ${info2}.raw.vcf \
1>${info2}.HaplotypeCaller.log 2>&1
如果只想对特定的区域进行变异检测呢?可以这样:
samtools view -h CL100072545.44.smrb.bam chr19:1205798-1228434 | samtools sort -o STK11.smrb.bam -
samtools index STK11.smrb.bam
这里举一个错误的例子:
samtools view -h CL100072545.44.smrb.bam chr19:1205798-1228434 > STK11.smrb.bam
这是不对的,重定向(>)之前文件是以sam格式打开的,即使重新定义为bam,它本身还是sam文件(可以直接用less查看),所以可以samtools sort转换一下(虽然排序的作用是多余的,但生成了真的bam文件)。
接着可以简单看一下指定区域的比对情况,用samtools tview命令,类似于IGV,但远没有IGV清晰。
samtools tview -p chr19:1205798-1228434 STK11.smrb.bam \
--reference ~/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta
#如下
#此界面下输入问号?显示帮助信息,区域小还能用一下,区域大了简直难受
1206801 1206811 1206821 1206831 1206841 1206851 1206861 1206871 1206881 1206891 1206901 1206911 1206921
TTTTTTTTTTCTTTTTTCTTTGTAAAATTTTGGAGAAGGGAAGTCGGAACACAAGGAAGGACCGCTCACCCGCGGACTCAGGGCTGGCGGCGGGACTCCAGGACCCTGGGTCCAGCATGGAGGTGGTGGACCC
.....................................................................................................................................
......... .................................................. .................................................. ,,,,,,,,,,,,,,,,,,
................ .................................................. ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,
.................. .................................................. ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,
.................. .................................................. ........................T......................... .
.................. .................................................. ..................................................
........................... .................................................. ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
................................... .................................................. ......................................
...................................... .................................................. ......................................
........................................... .................................................. ......................................
........................................... .................................................. ......................................
................................................ .................................................. t,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.................................................. ............................................G..... ......................
.................................................. .................................................. ,,,,,,,,,,,,,,,,,,
.................................................. .................................................. ,,,,,,,,,,,,,,,,,,
...........................A.........G............ .................................................. ,,,,,,,,,,,,,,,,,,
.................................................. .................................................. .................
.................................................. .................................................. ................
.................................................. ........................N......................... ................
针对特定基因找变异
这里找变异的流程主要参考的是samtools官网提供的流程,链接http://www.htslib.org/workflow/#mapping_to_variant
在官网主页有这样一段话:
Samtools is a suite of programs for interacting with high-throughput sequencing data. It consists of three separate repositories:
Samtools
Reading/writing/editing/indexing/viewing SAM/BAM/CRAM format
BCFtools
Reading/writing BCF2/VCF/gVCF files andcalling/filtering/summarising SNP and short indel sequence variants
HTSlib
A C library for reading/writing high-throughput sequencing data
第一步
bcftools mpileup --output-type u -o STK11.bcf \
-f ~/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
STK11.smrb.bam
这里--output-type u表示输出不压缩的bcf文件
这一步之后会生成bcf文件,我们知道vcf(variant call format)是一种用于存储基因序列突变信息的文本格式,而bcf格式文件是vcf格式的二进制文件,所以bcf不能直接查看。注意这里强调的是格式而不是内容,上面的bcf文件直接由bam得到,没有经过任何筛选,所以不可能每一个位置都是变异点。
可以这样查看
bcftools view ./STK11.bcf | lsx
如下,ALT是<*>当然就是没有变异咯
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CL100072545
chr19 1206383 . C <*> 0 . DP=1;I16=1,0,0,0,30,900,0,0,60,3600,0,0,0,0,0,0;QS=1,0;MQ0F=0 PL 0,3,30
chr19 1206384 . G <*> 0 . DP=1;I16=1,0,0,0,27,729,0,0,60,3600,0,0,1,1,0,0;QS=1,0;MQ0F=0 PL 0,3,27
chr19 1206385 . G <*> 0 . DP=1;I16=1,0,0,0,31,961,0,0,60,3600,0,0,2,4,0,0;QS=1,0;MQ0F=0 PL 0,3,31
再来看看vcf文件(事先准备好的)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CL100072545
chr19 1207142 . GT G 222 . INDEL;IDV=29;IMF=0.432836;DP=67;VDB=0.74912;SGB=-0.693079;MQSB=1;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=15,23,15,14;MQ=60 GT:PL 0/1:255,0,255
chr19 1207239 . G T 225 . DP=68;VDB=0.150142;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,11,51;MQ=60 GT:PL 1/1:255,184,0
chr19 1207281 . C T 131 . DP=34;VDB=0.232203;SGB=-0.69168;RPB=0.572388;MQB=1;BQB=0.010939;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=0,14,0,19;MQ=60 GT:PL 0/1:164,0,183
第二步
bcftools call -v -m -O v -o STK11.vcf STK11.bcf
-v output variant sites only
-O v output type: 'v' uncompressed VCF
变异校正(VQSR, Variant Quality Score Recalibration)
第一步
${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" VariantRecalibrator \
-R ${ref} \
-V ${info2}.raw.vcf \
--resource 1000G,known=false,training=true,truth=false,prior=10.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource hapmap,known=false,training=true,truth=true,prior=15.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/hapmap_3.3.hg38.vcf.gz \
--resource dbsnp,known=true,training=false,truth=false,prior=2.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/dbsnp_146.hg38.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
-mode SNP \
-O ${info2}.snps.recal \
--tranches-file ${info2}.snps.tranches \
--rscript-file ${info2}.snps.plots.R
第二步
${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" ApplyVQSR \
-R ${ref} \
-V ${info2}.raw.vcf \
-mode SNP \
--recal-file ${info2}.snps.recal \
--tranches-file ${info2}.snps.tranches \
-O ${info2}.snps.VQSR.vcf
--truth-sensitivity-filter-level 99.0
再对indel来一遍
${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" VariantRecalibrator \
-R ${ref} \
-V ${info2}.snps.VQSR.vcf \
-resource mills,known=true,training=true,truth=true,prior=12.0:/ifs1/Grp3/huangsiyuan/learn_wes/ref/database/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \
-mode INDEL \
-O ${info2}.snps.indels.recal \
--tranches-file ${info2}.snps.indels.tranches \
--rscript-file ${info2}.snps.indels.plots.R
${gatk4_path} --java-options "-Xmx10G -Djava.io.tmpdir=./" ApplyVQSR \
-R ${ref} \
-V ${info2}.snps.VQSR.vcf \
-mode INDEL \
--recal-file ${info2}.snps.indels.recal \
--tranches-file ${info2}.snps.indels.tranches \
-O ${info2}.snps.indels.VQSR.vcf
--truth-sensitivity-filter-level 99.0
这一步其实就是变异过滤,通常来讲有两种策略:
硬指标过滤
给一些指标设定临界值,各项指标大于临界值的位点才会被保留。
- 比如vcf文件中INFO这一列里面的:RMSMappingQuality (MQ) 40.0: 表示覆盖序列质量的均方值RMS,小于40的过滤掉;
- QualByDepth (QD) 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples等等。
- 还有QUAL这一列的值,表示在该位点存在突变的可能性;该值越高,则突变的可能性越大;计算方法:-10 * log (1-p) p为突变存在的概率。
一些数据库不完备的物种可以这样做,但过滤阈值的选取不太好定,官网提供的推荐值(本文下方有链接)显然不能适用于所有情况。所以最好还是请教一下有经验的人。
变异校正(机器学习的方法)
就是上面的VQSR,更强大,但需要非常完备的数据集。
Hard-filtering is a very blunt instrument
Variant recalibration is far more discriminating
接下来就可以按照snp和indel把vcf分开,到这儿才算结束了这一步。
java -jar ~/GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
-R /ifs1/Grp3/huangsiyuan/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
-V CL100072545.44.snps.indels.VQSR.vcf \
-selectType SNP -o snp.vcf
java -jar ~/GenomeAnalysisTK-3.4-0/GenomeAnalysisTK.jar -T SelectVariants \
-R /ifs1/Grp3/huangsiyuan/learn_wes/ref/Homo_sapiens_assembly38/Homo_sapiens_assembly38.fasta \
-V CL100072545.44.snps.indels.VQSR.vcf \
-selectType INDEL -o indel.vcf
reference
(howto) Apply hard filters to a call set: https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set
网友评论