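Author's note
I am preparing to change career tracks, so I have been going back over all my past projects. That brought me back to the most basic pipeline of all, SNV + indel calling. After walking someone through Mutect2's filtering rules and parameter choices again, I realized this material is worth far more than my earlier posts on SV and CNV. I am posting it here to build up some good karma for my exam.
Mutect2 is a widely trusted somatic variant caller. It is built on a somatic maximum-likelihood model: it identifies active regions and then iteratively searches them for candidate mutations.
Note that tumor-only calling in Mutect2 is currently still in BETA; only paired tumor-normal calling is academically validated.
In practice, though, to cut costs and reduce the burden on patients, hospitals and clinical testing labs generally use paired calling for large panels such as whole exome, and tumor-only calling for small panels. The resulting false positives are then filtered using SNP databases, empirical thresholds, or validation by first- or second-generation sequencing.
Filtering therefore has to be done with great care. Below are all of Mutect2's filtering parameters, based on chapter 8 of the official Mutect document, Mathematical Notes on Mutect.
Overview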
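Mutect2 has 14 filter tags in total (the tags that can appear in the FILTER column of the VCF), and each tag corresponds to one or more values. Every site in the VCF carries these values, and their meanings are listed for the INFO column of the VCF (the Key column of the table). Each key or filter corresponds to an argument set when running Mutect (the Argument column of the table).
Table: Mutect arguments with their corresponding filter tags and INFO key-value pairs
Take one site as an example: the base_quality tag appears in its FILTER column. The value estimated at calling time is MBQ=0, while min-median-base-quality was set to 20. Because 0 < 20, base_quality is added to the site's filter tags. The other parameters work the same way.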
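The comparison in this example can be sketched in a few lines of Python. This is only an illustration of the reasoning, not GATK's implementation: the function name `base_quality_tag` is hypothetical, and the only values taken from the text are MBQ=0 and the min-median-base-quality setting of 20.

```python
from typing import Optional


def base_quality_tag(mbq: int, min_median_base_quality: int = 20) -> Optional[str]:
    """Return the filter tag when the median alt base quality is below the threshold."""
    # Hypothetical helper mirroring the "0 < 20, so tag the site" reasoning in the text.
    if mbq < min_median_base_quality:
        return "base_quality"
    return None


print(base_quality_tag(0))   # MBQ=0, the example site from the text
print(base_quality_tag(30))  # an illustrative site with high base quality
```

The same pattern (compare one INFO value with one argument) applies to most of the simple threshold filters described in the sections that follow.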
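Parameters like this one, which are only loosely tied to the likelihood model, are easy to understand. But what about the t_lod tag? What does it mean, and how is it computed? That takes us into the mathematical model. I am working hard on an accessible introduction, but typing formulas in OneNote is so tiring that I have decided to derive everything by hand, so the next update will be an explainer made entirely of photos of handwritten formulas, lol.
GMAT math prep must have been too easy for me; why else would I put myself through this...
Anyway, back to the topic. The meaning of each tag follows, one at a time:
Parameter meanings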
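1. t_lod
- tumor-lod is the minimum likelihood of an allele, as determined by the somatic likelihoods model, required to pass.
- LOD threshold for calling tumor variant.
- This is the minimum log odds (LOD), under the likelihood model, for a site to be accepted as a somatic variant. The default is 5.3; a site scoring below 5.3 gets the t_lod tag.
- Default value: 5.3
Example
(Figure: t-lod example)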
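2. clustered_events
- max-events-in-region is the maximum allowable number of called variants co-occurring in a single assembly region. If the number of called variants exceeds this, they will all be filtered.
- Variants coming from an assembly region with more than this many events are filtered.
- Multiple mutations occur within one active region, with the mutated sites at least 3 bp apart. [Why 3 bp or more? Because sites only 2 bp apart can very likely be merged into a single event.]
- Default value: 2.
- This puzzled me for a long time, so I went straight to the GATK source code on GitHub and read the part that handles this parameter.
- I then asked the GATK team, and their reply was that such calls should be treated as false positives.
Example
Note
With a large panel, most sites carry this tag. In my experience, filtering on it removes true positives, so I recommend keeping these sites, or at least validating them. [You really cannot filter them out wholesale; I have validated this.] At the time I varied --max-events-in-region and counted the resulting clustered_events tags. Even with the value raised into the teens, the number of tagged sites on my panel barely dropped, so I kept the official default.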
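Counting how many sites carry a given tag, as in the parameter sweep described in this note, can be sketched as follows. This is a minimal sketch that assumes an uncompressed, tab-separated VCF; the function name `count_filter_tags` and the demo records are made up for illustration and do not come from a real run.

```python
from collections import Counter
from typing import Iterable


def count_filter_tags(vcf_lines: Iterable[str]) -> Counter:
    """Count how often each tag appears in the FILTER column (7th column) of a VCF."""
    counts: Counter = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        # FILTER holds semicolon-separated tags, or PASS / "." when none apply.
        for tag in fields[6].split(";"):
            counts[tag] += 1
    return counts


# Made-up records for illustration only.
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tclustered_events\t.",
    "chr1\t104\t.\tG\tC\t.\tclustered_events;base_quality\t.",
    "chr2\t500\t.\tC\tT\t.\tPASS\t.",
]
print(count_filter_tags(demo)["clustered_events"])
```

To use it on a real file, pass an open text handle instead of the list, for example `count_filter_tags(open(path))`, and repeat for each filtered VCF produced with a different setting.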
3. duplicated_evidence
- unique-alt-read-count is the minimum number of unique (start position, fragment length) pairs required to make a call. This count is a proxy for the number of unique molecules (as opposed to PCR duplicates) supporting an allele.
- Sets the minimum number of unique molecules that must support the allele. The default is 0.
- Filter a variant if a site contains fewer than this many unique (i.e. deduplicated) reads supporting the alternate allele.
- Default value: 0.
Example
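The idea of counting unique (start position, fragment length) pairs can be sketched like this. It is a simplified illustration of the definition quoted above, not GATK's code; the function names and the read tuples are hypothetical.

```python
from typing import Iterable, Tuple


def unique_alt_read_count(alt_reads: Iterable[Tuple[int, int]]) -> int:
    """Count distinct (start position, fragment length) pairs among alt-supporting reads."""
    return len(set(alt_reads))


def fails_unique_alt_read_count(alt_reads: Iterable[Tuple[int, int]],
                                min_unique: int = 0) -> bool:
    """True when fewer than `min_unique` unique pairs support the alt allele."""
    return unique_alt_read_count(alt_reads) < min_unique


# Hypothetical reads: the first two share start and fragment length (likely PCR duplicates).
reads = [(1000, 250), (1000, 250), (1003, 248)]
print(unique_alt_read_count(reads))
print(fails_unique_alt_read_count(reads, min_unique=3))
```

Because a count can never be below 0, the default of 0 means this check never fires unless the threshold is raised.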
4. multiallelic
- max-alt-allele-count is the maximum allowable number of alt alleles at a site. By default only biallelic variants pass the filter.
- The maximum number of alt alleles allowed at a site. By default, only biallelic variants pass the filter.
- Default value: 1.
- Filter variants with too many alt alleles.
Example
5. germline_risk
- max-germline-posterior is the maximum posterior probability, as determined by the above germline probability model, that a variant is a germline event.
- The maximum posterior probability that the site is a germline event. P_GERMLINE is computed from the model; if it exceeds the configured value, the germline_risk tag is added.
- Maximum posterior probability that an allele is a germline variant.
- Default value: 0.1.
Example
6. artifact_in_normal
- normal-artifact-lod is the maximum acceptable likelihood of an allele in the normal by the somatic likelihoods model. This is different from the normal likelihood that goes into the germline model, which makes a diploid assumption. Here we compute the normal likelihood as if it were a tumor in order to detect artifacts.
- In paired tumor-control calling, a separate log-odds threshold is applied to the normal (control) sample. Since this is a maximum acceptable value, the lower the threshold, the stricter the filter. Any alt evidence in the normal is presumed to be a false positive, so a low LOD value is used.
- LOD threshold for calling normal artifacts.
- Default value: 0.0.
Example
I have no tumor-normal pairs, so there is no figure to show here, poor me~
7. strand_artifact
- max-strand-artifact-probability is the posterior probability of a strand artifact, as determined by the model described above, required to apply the strand artifact filter.
- The posterior probability of a strand artifact: if the computed SA_POST_PROB exceeds the configured value, the site is filtered, subject to a second condition.
- This is necessary but not sufficient; we also require the estimated max a posteriori allele fraction to be less than min-strand-artifact-allele-fraction. The second condition prevents filtering real variants that also have significant strand bias, i.e. a true variant that also has some artifactual reads.
- SA_MAP_AF is the MAP (maximum a posteriori) estimate of the allele fraction. If it is not below min-strand-artifact-allele-fraction, the site is kept, so that true positives are not given the strand_artifact tag.
- Filter a variant if the probability of strand artifact exceeds this number.
- Default value: 0.99.
Example
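The two-condition logic for this filter can be written out as a small sketch. It is a simplification under stated assumptions: both INFO values are treated as single numbers, the function name is hypothetical, and `min_strand_artifact_allele_fraction` has to be supplied by the caller because its default is not given in this post. The demo values are illustrative, not from real data.

```python
def is_strand_artifact(sa_post_prob: float,
                       sa_map_af: float,
                       *,
                       min_strand_artifact_allele_fraction: float,
                       max_strand_artifact_probability: float = 0.99) -> bool:
    """Apply both conditions: high artifact posterior AND low MAP allele fraction."""
    high_posterior = sa_post_prob > max_strand_artifact_probability
    low_fraction = sa_map_af < min_strand_artifact_allele_fraction
    # Both must hold; a high posterior alone is necessary but not sufficient.
    return high_posterior and low_fraction


# Illustrative values only.
print(is_strand_artifact(0.995, 0.005, min_strand_artifact_allele_fraction=0.01))
print(is_strand_artifact(0.995, 0.30, min_strand_artifact_allele_fraction=0.01))
```

The second call shows the protective case: the posterior is high, but the allele fraction is large enough that the site is kept as a likely real variant with some strand bias.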
8. base_quality
- min-median-base-quality is the minimum median base quality of bases supporting a SNV.
- The minimum median base quality of the bases supporting an SNV. (Median base quality is a common measure of base quality; in FastQC the corresponding value is 25.)
- Filter variants for which alt reads' median base quality is too low.
- Default value: 20.
Example
9. mapping_quality
- min-median-mapping-quality is the minimum median mapping quality of reads supporting an allele.
- The minimum median mapping quality of the reads supporting the allele.
- Filter variants for which alt reads' median mapping quality is too low.
- Default value: 30.
Example
10. fragment_length
- max-median-fragment-length-difference is the maximum difference between the median fragment lengths of reads supporting alt and reference alleles. Note that fragment length is based on where paired reads are mapped, not the actual physical fragment length.
- The maximum difference between the median fragment lengths of reads supporting alt and ref. Fragment length here is derived from where the paired reads map, not from the actual physical fragment. Sites where the alt and ref fragment lengths differ greatly are filtered.
- Filter variants for which alt reads' median fragment length is very different from the median for ref reads.
- Default value: 10000.
Example
This one also shows up only with paired tumor-normal calling, so I have no example to show, poor me x2
11. read_position
- min-median-read-position is the minimum median length of bases supporting an allele from the closest end of the read. Indel positions are measured by the end farthest from the end of the read.
- The minimum median distance from the variant to the closest end of the read. Indel positions are measured from the end of the indel that is farthest from the end of the read.
- Filter variants for which the median position of alt alleles within reads is too near the end of reads.
- Default value: 5.
Example
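The "median distance from the closest end of the read" can be illustrated for SNVs with a short sketch. It assumes 0-based offsets within each read and ignores the special indel rule; how GATK counts positions exactly is not stated in this post, so treat the indexing as an assumption. Function names and demo values are hypothetical.

```python
from statistics import median
from typing import Iterable, Tuple


def median_distance_from_read_end(alt_reads: Iterable[Tuple[int, int]]) -> float:
    """Median distance to the nearest read end, from (offset in read, read length) pairs."""
    # Assumption: offsets are 0-based, so the last base of a read has distance 0.
    distances = [min(offset, length - 1 - offset) for offset, length in alt_reads]
    return median(distances)


def fails_read_position(alt_reads: Iterable[Tuple[int, int]],
                        min_median_read_position: int = 5) -> bool:
    """True when the alt allele sits too close to read ends, by median."""
    return median_distance_from_read_end(alt_reads) < min_median_read_position


# Illustrative alt reads: two place the variant 2 bases from an end, one mid-read.
alt_reads = [(2, 100), (97, 100), (50, 100)]
print(median_distance_from_read_end(alt_reads))
print(fails_read_position(alt_reads))
```

Using the median rather than the mean keeps one well-placed read from rescuing a site whose support otherwise comes from read ends.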
12. panel_of_normals
- One of the two unadjustable filters. The panel of normals filter removes all alleles at a site belonging to the panel of normals, which is a vcf of blacklisted artifact sites. It can be disabled by not passing a panel of normals to Mutect2.
- The site is also present in the PON.
Example
13. contamination
- If FilterMutectCalls is passed a contamination-table from CalculateContamination it will filter alleles with allele fraction less than the whole-bam contamination in the table.
- Filters out sample contamination.
- --contamination-table:File
Example
Not needed when calling a single sample.
14. str_contraction
- One of the two unadjustable filters, an STR contraction filter which removes variants that are the deletion of a single repeat unit of an STR when this repeat unit contains more than one base.
- When the repeat unit is longer than one base, contractions in short tandem repeat regions are filtered. STR = Variant is a short tandem repeat.
- RPA = Number of times tandem repeat unit is repeated, for each allele (including reference)
- RU = Tandem repeat unit (bases)
Example
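Using the RU and RPA keys described above, the stated rule (deletion of exactly one repeat unit, where the unit is longer than one base) can be sketched as follows. This is my reading of the quoted description, not GATK's code; the function name is hypothetical, the sketch only looks at the first alt allele, and the demo values are made up.

```python
from typing import Sequence


def is_str_contraction(ru: str, rpa: Sequence[int]) -> bool:
    """Check for loss of exactly one repeat unit when the unit has more than one base."""
    if len(rpa) < 2:
        return False  # need a repeat count for the reference and at least one alt
    ref_repeats, alt_repeats = rpa[0], rpa[1]  # RPA lists the reference first
    return len(ru) > 1 and alt_repeats == ref_repeats - 1


# Illustrative values only.
print(is_str_contraction("AC", [6, 5]))   # one AC unit deleted
print(is_str_contraction("A", [10, 9]))   # single-base unit, rule does not apply
print(is_str_contraction("AC", [6, 4]))   # two units deleted, rule does not apply
```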
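Closing remarks
Many of these parameters come straight out of the model: t_lod, for example, is the model's final test statistic, and artifact_in_normal is related to the posterior probability for the normal. Without understanding the model, you may not be able to follow the intermediate derivation of why a call is considered a false positive. I recommend that everyone read Mathematical Notes on Mutect carefully (David Benjamin and Takuto Sato, Broad Institute, 75 Ames Street, Cambridge, MA 02142; dated September 26, 2018).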