
nsSNP Pathogenicity Analysis (2): Existing Tools and Their Principles

Author: UnderStorm | Published 2019-01-23 14:28

Previous article in this series: nsSNP Pathogenicity Analysis (1): Writing Your Own Script to Analyze Protein Conservation

Contents

    1. SIFT
    2. PolyPhen-2
    3. CADD
    4. DANN
    5. MetaSVM
    6. dbNSFP database: integrated results from multiple nsSNP prediction tools

1. SIFT

Algorithm description:

For a given protein sequence, SIFT compiles a dataset of functionally related protein sequences by searching a protein database using the PSI-BLAST algorithm. It then builds an alignment from the homologous sequences with the query sequence.

In the second step of the algorithm, SIFT scans each position in the alignment and calculates the probabilities for all possible 20 amino acids at that position. These probabilities are normalized by the probability of the most frequent amino acid and are recorded in a scaled probability matrix.

SIFT predicts a substitution to affect protein function if the scaled probability, also termed the SIFT score, lies below a certain threshold value. Generally, a highly conserved position is intolerant to most substitutions, whereas a poorly conserved position can tolerate most substitutions.
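
The normalization step described above can be sketched in a few lines. This is a simplified illustration, not SIFT itself: it uses raw column frequencies with no pseudocounts or sequence weighting, the function names are hypothetical, and the 0.05 cutoff is the one listed for SIFT in the dbNSFP table later in this article.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def scaled_probabilities(column):
    """Divide each amino acid's frequency in one alignment column
    by the frequency of the most common amino acid in that column."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no standard amino acids")
    probs = {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}

def predict(column, substitution, threshold=0.05):
    """Return (scaled score, label) for substituting `substitution` at this column."""
    score = scaled_probabilities(column)[substitution]
    label = "deleterious" if score <= threshold else "tolerated"
    return score, label

# Toy alignment column: 18 Leu, 1 Ile, 1 Val
column = "LLLLLLLLLLLLLLLLLLIV"
print(predict(column, "I"))  # Ile was observed in the column
print(predict(column, "W"))  # Trp was never observed
```

In this toy column, a substitution to an amino acid that never appears gets a scaled score of 0, which is why the real algorithm needs the pseudocount correction discussed next.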

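Formula for the conservation of a given position:

How Pca is computed (a pseudocount-corrected PSSM):

Nc: the number of homologous sequences actually observed
gca: the observed frequency of amino acid a at position c of the target sequence
Bc: the pseudocount, i.e. the number of homologous sequences that were not observed
fca: the pseudocount frequency of amino acid a at position c of the target sequence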
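
For reference, the four quantities defined above fit the usual pseudocount mixture form, and the conservation value is the information content of the position. The block below is reconstructed from these definitions and from the SIFT paper (Ng and Henikoff, 2001), so it is worth checking against the original; the symbol R_c for the conservation value is an assumption of this sketch.

```latex
p_{ca} = \frac{N_c}{N_c + B_c}\, g_{ca} + \frac{B_c}{N_c + B_c}\, f_{ca}
\qquad
R_c = \log_2 20 + \sum_{a} p_{ca} \log_2 p_{ca}
```

With B_c = 0 the estimate reduces to the observed frequency g_ca, and a larger B_c pulls it toward f_ca. R_c ranges from 0 (all 20 amino acids equally likely) to log2 20 ≈ 4.32 (a single amino acid at that position).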

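How are the pseudocount terms Bc and fca determined?

For positions with a diverse amino acid composition, SIFT tends to use a larger pseudocount, because the more diverse a position is, the more homologous sequences are likely to have been missed.

Form of the resulting PSSM: a matrix holding one value for each of the 20 amino acids at every position.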

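2. PolyPhen-2

PolyPhen combines sequence and structural information. Its main assumption is that some amino acid changes can affect protein folding, interaction interfaces, or stability, and that a change in protein structure makes a change in protein function more likely. It therefore integrates both sequence-based features and 3D-structural features.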

Sequence-based features

(1) Using existing protein annotation databases (such as UniProtKB/Swiss-Prot), determine whether a substitution falls within a special region or site

Special sites include:

  • DISULFID, CROSSLNK bond or
  • BINDING, ACT_SITE, LIPID, METAL, SITE, MOD_RES, CARBOHYD, NON_STD site

Special regions include:

TRANSMEM, INTRAMEM, COMPBIAS, REPEAT, COILED, SIGNAL, PROPEP

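(2) In addition, the difference in PSIC score before and after the substitution is computed

PSIC scores are computed much like a PSSM: BLAST is used to search the UniRef100 database for sequences highly homologous to the query sequence, these sequences are combined into a multiple sequence alignment, and a profile matrix is derived from the alignment. In this matrix each row represents a specific amino acid position and each column represents one amino acid.

Each element of this matrix (the profile score) is computed as follows:

Here i is the row index and j the column index of the matrix, PSICi,j is the PSIC value in row i, column j, P(aa=Aj | pos=i) is the probability of observing amino acid Aj at position i of the query sequence, and P(aa=Aj) is the probability of observing Aj at any position.

If a nonsynonymous substitution from Am to An occurs at position i of the query sequence, ΔPSIC is the difference between the PSIC scores of Am and An at that position.

If ΔPSIC is a large positive number, the substitution is very unlikely to be observed at that position, so it is likely to be a deleterious mutation.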
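
Following the definitions in the text, the profile score can be read as a log-likelihood ratio, PSIC(i,j) = ln(P(aa=Aj | pos=i) / P(aa=Aj)), and the change for an Am to An substitution as ΔPSIC = PSIC(i,m) - PSIC(i,n). The sketch below works through that arithmetic. It assumes the natural logarithm, takes the probabilities as given, and leaves out the sequence weighting that the real PSIC method applies; the function names and the toy numbers are made up for illustration.

```python
import math

def psic(p_given_pos, p_background):
    """Log ratio of the position-specific probability to the background probability."""
    if p_given_pos <= 0 or p_background <= 0:
        raise ValueError("probabilities must be positive")
    return math.log(p_given_pos / p_background)

def delta_psic(pos_probs, background, wild_type, mutant):
    """PSIC of the wild-type residue minus PSIC of the mutant residue at one position."""
    return (psic(pos_probs[wild_type], background[wild_type])
            - psic(pos_probs[mutant], background[mutant]))

# Toy values for a single position: Leu is strongly preferred, Pro is rare.
pos_probs = {"L": 0.80, "P": 0.01}
background = {"L": 0.10, "P": 0.05}

print(round(delta_psic(pos_probs, background, "L", "P"), 3))
```

A common wild-type residue replaced by a residue that is rare at that position gives a large positive difference, which matches the interpretation above.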

Structural features

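First, find the 3D structure of the protein. If no structure is available but one exists for a protein with a similar sequence, homology modeling can be used to predict the structure.

Structural parameters for the site are then computed from this 3D structure. PolyPhen-2 uses the DSSP database to obtain the following structural parameters: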

  • Secondary structure (according to the DSSP nomenclature)
  • Solvent accessible surface area (absolute value in Ų)
  • Phi-psi dihedral angles

The prediction algorithm is a Naive Bayes classifier.

There are two training sets:

  • HumDiv

compiled from all damaging alleles with known effects on the molecular function causing human Mendelian diseases, present in the UniProtKB database, together with differences between human proteins and their closely related mammalian homologs, assumed to be non-damaging

  • HumVar

consisted of all human disease-causing mutations from UniProtKB, together with common human nsSNPs (MAF>1%) without annotated involvement in disease, which were treated as non-damaging.

Training on these two training sets yields two different prediction models, each suited to a different kind of nsSNP:

  • HVAR: should be used for diagnostics of Mendelian diseases, which requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles. The authors recommend calling "probably damaging" if the score is between 0.909 and 1, "possibly damaging" if the score is between 0.447 and 0.908, and "benign" if the score is between 0 and 0.446.
  • HDIV: should be used when evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data. The authors recommend calling "probably damaging" if the score is between 0.957 and 1, "possibly damaging" if the score is between 0.453 and 0.956, and "benign" if the score is between 0 and 0.452.

For most variants, look at HVAR.
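
The recommended cutoffs for the two models can be captured in a small lookup. This is only a convenience sketch built from the thresholds quoted above; the function and dictionary names are hypothetical and not part of PolyPhen-2.

```python
# (probably damaging lower bound, possibly damaging lower bound) per model
THRESHOLDS = {
    "HDIV": (0.957, 0.453),
    "HVAR": (0.909, 0.447),
}

def classify_polyphen2(score, model="HVAR"):
    """Map a PolyPhen-2 score in [0, 1] to the authors' recommended category."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("PolyPhen-2 scores lie between 0 and 1")
    probably, possibly = THRESHOLDS[model]
    if score >= probably:
        return "probably damaging"
    if score >= possibly:
        return "possibly damaging"
    return "benign"

# The same score can fall into different categories under the two models.
print(classify_polyphen2(0.95, "HDIV"))
print(classify_polyphen2(0.95, "HVAR"))
```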

3. CADD

CADD: Combined Annotation Dependent Depletion

CADD was created to fill a gap: before it, most assessments of SNV deleteriousness (or tolerability) relied on a single factor, whereas CADD integrates many kinds of features.

While many variant annotation and scoring tools are around, most annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Thus, a broadly applicable metric that objectively weights and integrates diverse information is needed. Combined Annotation Dependent Depletion (CADD) is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations.

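CADD introduced its own scoring method to measure how deleterious a variant is. For a set of variants, CADD builds a model that combines multiple factors, such as allelic diversity and variant pathogenicity, evaluates every variant, and assigns each one a score, called the C-score. The score output directly by the statistical model is the RawScore; the higher this value, the more likely the variant is deleterious.

For different sets of variants, for example the 1000G and ESP variant sets, the fitted models differ because the underlying factors differ, so RawScores cannot be compared directly across models. Scaled C-scores were introduced for this reason: the RawScores are sorted from highest to lowest, and the scaled C-score is computed as -10*log10(rank/total). Because this formula resembles the definition of the Phred quality score, scaled C-scores are also called PHRED scores.

When screening for potentially pathogenic variants, the PHRED score is commonly used as a filter. The official documentation mentions cutoffs such as 10, 15, or 20, but it more strongly recommends assessing pathogenicity by combining C-scores with other experimental evidence rather than applying a bare numeric cutoff.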
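
The scaling formula is easy to try on toy data. The sketch below ranks a list of raw scores from highest to lowest and applies -10*log10(rank/total); the function name is hypothetical, ties are broken arbitrarily, and the list passed in stands in for whatever reference set of variants the ranking is defined over.

```python
import math

def phred_scaled(raw_scores):
    """Return the PHRED-like scaled score for each raw score, in input order."""
    total = len(raw_scores)
    # Indices of the scores, ordered from highest raw score to lowest.
    order = sorted(range(total), key=lambda i: raw_scores[i], reverse=True)
    scaled = [0.0] * total
    for rank, idx in enumerate(order, start=1):
        # 10*log10(total/rank) is the same quantity as -10*log10(rank/total).
        scaled[idx] = 10 * math.log10(total / rank)
    return scaled

raw = list(range(1000))      # toy raw scores; larger means more deleterious
scaled = phred_scaled(raw)
print(scaled[999], scaled[900], scaled[0])
```

Read this way, a scaled score of 10 marks the top 10% of the ranked variants, 20 the top 1%, and 30 the top 0.1%, which is what the commonly quoted cutoffs correspond to.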

4. DANN

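DANN uses a neural network to assess how deleterious a variant is.

DANN can be seen as an improved version of CADD: it replaces the prediction algorithm and performs somewhat better than CADD.

At its core, CADD uses a support vector machine (SVM) with a linear kernel. SVMs are widely used in machine learning; a linear SVM performs well when the features are linearly related to the outcome, but less well when the relationships are nonlinear. DANN uses a neural network, which captures nonlinear relationships among features more easily, so it performs somewhat better than CADD.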

Bioinformatics. 2015 Mar 1; 31(5): 761–763.

In both ROC plots from the paper, DANN's AUC is larger than the SVM's, which indicates that DANN does outperform CADD.

5. MetaSVM

It consists of three steps:

(1) Perform imputation for whole-exome variants and fill out missing scores for SIFT, PolyPhen, MutationAssessor and so on.

(2) Normalize all scores to the 0-1 range.

(3) Use a radial SVM model to train a prediction model using all available scores and some population genetics parameters, and then apply the model on whole-exome variants.

In short, it combines the prediction scores of tools such as SIFT, PolyPhen, and MutationAssessor and trains an SVM model to make the prediction.
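
The first two steps can be illustrated with a small preprocessing sketch. It is only an illustration under loud assumptions: column-mean imputation is a stand-in chosen for brevity and is not claimed to be the imputation method MetaSVM uses, the function name is hypothetical, and the SVM training of step (3) is left out.

```python
def impute_and_normalize(rows):
    """rows: list of equal-length score lists, with None marking a missing score.

    Missing values are replaced by the column mean (an assumption of this
    sketch), then each column is min-max scaled to the 0-1 range.
    """
    n_cols = len(rows[0])
    out = [list(r) for r in rows]
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        mean = sum(observed) / len(observed)
        lo, hi = min(observed), max(observed)
        span = hi - lo
        for r in out:
            value = mean if r[j] is None else r[j]
            # A constant column carries no information; map it to 0.0.
            r[j] = 0.0 if span == 0 else (value - lo) / span
    return out

# Three variants scored by two hypothetical tools; one score is missing.
scores = [[0.0, 1.0], [0.5, None], [1.0, 3.0]]
print(impute_and_normalize(scores))
```

The normalized matrix is what a classifier such as a radial-kernel SVM would then be trained on.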

6. dbNSFP database: integrated results from multiple nsSNP prediction tools

Website: https://sites.google.com/site/jpopgen/dbNSFP

It integrates the scores from 20 functional prediction algorithms for nsSNPs and 6 conservation measures.

  • Functional prediction:

SIFT, Polyphen2-HDIV, Polyphen2-HVAR, LRT, MutationTaster2, MutationAssessor, FATHMM, MetaSVM, MetaLR, CADD, VEST3, PROVEAN, FATHMM-MKL coding, fitCons, DANN, GenoCanyon, Eigen coding, Eigen-PC, M-CAP, REVEL, MutPred

  • Conservation measures:

PhyloP x 2, phastCons x 2, GERP++ and SiPhy

| Score (dbtype) | # variants in LJB23 build hg19 | Categorical prediction |
| --- | --- | --- |
| SIFT (sift) | 77593284 | D: deleterious (sift<=0.05); T: tolerated (sift>0.05) |
| PolyPhen 2 HDIV (pp2_hdiv) | 72533732 | D: probably damaging (>=0.957); P: possibly damaging (0.453<=pp2_hdiv<=0.956); B: benign (pp2_hdiv<=0.452) |
| PolyPhen 2 HVar (pp2_hvar) | 72533732 | D: probably damaging (>=0.909); P: possibly damaging (0.447<=pp2_hvar<=0.908); B: benign (pp2_hvar<=0.446) |
| LRT (lrt) | 68069321 | D: deleterious; N: neutral; U: unknown |
| MutationTaster (mt) | 88473874 | "A" ("disease_causing_automatic"); "D" ("disease_causing"); "N" ("polymorphism"); "P" ("polymorphism_automatic") |
| MutationAssessor (ma) | 74631375 | H: high; M: medium; L: low; N: neutral. H/M means functional and L/N means non-functional |
| FATHMM (fathmm) | 70274896 | D: deleterious; T: tolerated |
| MetaSVM (metasvm) | 82098217 | D: deleterious; T: tolerated |
| MetaLR (metalr) | 82098217 | D: deleterious; T: tolerated |
| GERP++ (gerp++) | 89076718 | higher scores are more deleterious |
| PhyloP (phylop) | 89553090 | higher scores are more deleterious |
| SiPhy (siphy) | 88269630 | higher scores are more deleterious |

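dbNSFP data has been integrated into ANNOVAR; the latest version at the time of writing is dbnsfp33a.

# Download the annotation data
$ annotate_variation.pl -downdb -webfrom annovar -buildver hg19 dbnsfp33a humandb/

# Get all dbNSFP annotations at once
$ table_annovar.pl ex1.avinput humandb/ -protocol dbnsfp33a -operation f -buildver hg19 -nastring .

# Get the annotation from a single dbNSFP score. The file for that score must first be downloaded from the official ANNOVAR server; SIFT is used as the example here
$ annotate_variation.pl -filter -dbtype ljb23_sift -buildver hg19 -out ex1 example/ex1.avinput humandb/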
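
Once the annotation table exists, it can be filtered with a few lines of Python. The sketch below assumes a tab-delimited file with a header row and prediction columns named SIFT_pred and Polyphen2_HVAR_pred, as I understand the dbNSFP columns in table_annovar.pl output to be named; check the header of your own output file, since both the column names and the function name here are assumptions. An in-memory table stands in for the real file.

```python
import csv
import io

def damaging_by_all(handle, columns=("SIFT_pred", "Polyphen2_HVAR_pred")):
    """Return the rows in which every listed prediction column is 'D'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if all(row.get(c) == "D" for c in columns)]

# Stand-in for an annotated table; "." is the missing-value string set by -nastring.
demo = io.StringIO(
    "Chr\tStart\tSIFT_pred\tPolyphen2_HVAR_pred\n"
    "1\t100\tD\tD\n"
    "1\t200\tT\tD\n"
    "2\t300\t.\t.\n"
)

hits = damaging_by_all(demo)
print([(row["Chr"], row["Start"]) for row in hits])
```

For a real run, pass an open file handle instead of the StringIO object.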

References:

(1) Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protocols 4, 1073-1081 (2009).

(2) Predicting Deleterious Amino Acid Substitutions. Genome Res. 2001 May; 11(5): 863-874.

(3) PolyPhen-2 official website

(4) CADD official website

(5) [Jianshu] Introduction to the CADD database

(6) [Jianshu] DANN: assessing the deleteriousness of variants with a neural network

(7) Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2015;31(5):761-763.

(8) dbNSFP official website

(9) ANNOVAR documentation
