Journal
Genomics Proteomics & Bioinformatics (6.409/Q1)
KaKs_Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging
KaKs_Calculator is a software package that calculates nonsynonymous (Ka) and synonymous (Ks) substitution rates through model selection and model averaging. Existing methods for this estimation each adopt a specific mutation (substitution) model that considers different evolutionary features, which leads to diverse estimates. KaKs_Calculator therefore implements a set of candidate models in a maximum likelihood framework and adopts the Akaike information criterion to measure the fit between models and data, aiming to include as many features as needed for accurately capturing evolutionary information in protein-coding sequences. In addition, several existing methods for calculating Ka and Ks are incorporated into this software. KaKs_Calculator, including source code, compiled executables, and documentation, is freely available for academic use at http://evolution.genomics.org.cn/software.htm.
Key words: model selection, model averaging, AIC, approximate method, maximum likelihood method
Introduction
Calculating nonsynonymous (Ka) and synonymous (Ks) substitution rates is of great significance in reconstructing phylogeny and understanding the evolutionary dynamics of protein-coding sequences across closely related yet diverged species. Ka and Ks, or more often their ratio (Ka/Ks), indicate neutral mutation when Ka equals Ks, negative (purifying) selection when Ka is less than Ks, and positive (diversifying) selection when Ka exceeds Ks. Statistics of these two variables in genes from different evolutionary lineages therefore provide a powerful tool for quantifying molecular evolution.
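The three cases above can be written down as a small helper. This is only an illustrative sketch: the function name `classify_selection` and the tolerance `tol` are assumptions introduced here, and a real analysis would test whether the deviation from Ka = Ks is statistically significant rather than compare point estimates.

```python
def classify_selection(ka: float, ks: float, tol: float = 1e-9) -> str:
    """Label a (Ka, Ks) pair using the three cases described in the text.

    `tol` is a hypothetical tolerance for treating Ka/Ks as equal to 1.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive for Ka/Ks to be defined")
    omega = ka / ks
    if abs(omega - 1.0) <= tol:
        return "neutral"
    # Ka < Ks: purifying selection; Ka > Ks: positive selection.
    return "purifying" if omega < 1.0 else "positive"
```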
Over the past two decades, several methods have been developed for this purpose. They fall into two general classes: approximate methods and maximum likelihood methods. An approximate method involves three basic steps: (1) counting the numbers of synonymous and nonsynonymous sites, (2) calculating the numbers of synonymous and nonsynonymous substitutions, and (3) correcting for multiple substitutions.
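As a minimal sketch of step (3), the code below applies the Jukes-Cantor correction, which is the correction used by Nei-Gojobori-type approximate methods. It assumes steps (1) and (2) have already produced the site and substitution counts; the function and parameter names are hypothetical, and other approximate methods use different corrections.

```python
import math


def jc_correct(p: float) -> float:
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def approximate_ka_ks(syn_sites, nonsyn_sites, syn_subs, nonsyn_subs):
    """Return (Ka, Ks) from counts produced by steps (1) and (2)."""
    ks = jc_correct(syn_subs / syn_sites)        # synonymous rate
    ka = jc_correct(nonsyn_subs / nonsyn_sites)  # nonsynonymous rate
    return ka, ks
```

With 10 synonymous substitutions over 100 synonymous sites, for example, the observed proportion 0.1 is corrected upward to roughly 0.107, reflecting substitutions hidden by multiple hits.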
The maximum likelihood method, on the other hand, integrates evolutionary features (reflected in nucleotide models) into codon-based models and uses probability theory to complete all three steps in one go. However, these methods adopt different substitution or mutation models, built on different assumptions that account for various sequence features, and so give varied estimates of evolutionary distance. In other words, Ka and Ks estimation is sensitive to the underlying assumptions or mutation models. In addition, since the amount and degree of sequence substitution vary among datasets from diverse taxa, a single model or method may not be adequate for accurate Ka and Ks calculations. A model selection step, that is, choosing a best-fit model when estimating Ka and Ks, therefore becomes critical for capturing appropriate evolutionary information.
To this end, we have applied model selection and model averaging techniques to Ka and Ks estimation. We use a maximum likelihood method based on a set of candidate substitution models and adopt the Akaike information criterion (AIC) to measure the fit between models and data. After choosing the best-fit model for calculating Ka and Ks, we also average the parameters across the candidate models to include as many features as needed, since in practice the true model is seldom one of the candidate models. Finally, these considerations are incorporated into a software package, KaKs_Calculator.
Algorithm
Candidate models
Substitution models play a significant role in phylogenetic and evolutionary analyses of protein-coding sequences: they integrate diverse processes of sequence evolution through various assumptions and provide approximations to datasets. We focused on a set of time-reversible substitution models as shown in Table 1, ranging from the Jukes-Cantor (JC) model, which assumes equal substitution rates and equal nucleotide frequencies, to the general time-reversible (GTR) model, which considers six different substitution rates and unequal nucleotide frequencies. We then incorporated the parameters of each nucleotide model into a codon-based model. As a result, a general formula for the substitution rate qij from any sense codon i to j (i ≠ j) is given for all candidate models:
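A rate of this kind is commonly assembled as in the sketch below, which follows the general style of standard codon models and is not a transcription of the paper's exact formula. The assumptions are: codons differing at more than one position have rate zero; a single-nucleotide change has a rate proportional to a nucleotide-level rate ratio times the frequency of the target codon; and nonsynonymous changes are further scaled by ω = Ka/Ks. The names `codon_rate`, `rate_ratio`, `pi_j`, and `omega` are hypothetical.

```python
from itertools import product

# Standard genetic code in TCAG order; "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_rate(i, j, omega, pi_j, rate_ratio):
    """Sketch of the substitution rate from sense codon i to sense codon j.

    rate_ratio maps frozenset({x, y}) to the nucleotide-level rate ratio
    for exchanges between x and y (symmetric, as in time-reversible models).
    """
    if i == j:
        raise ValueError("rate is defined for i != j only")
    if CODON_TABLE[i] == "*" or CODON_TABLE[j] == "*":
        raise ValueError("sense codons only")
    diffs = [(x, y) for x, y in zip(i, j) if x != y]
    if len(diffs) != 1:
        return 0.0  # two or three positions differ
    x, y = diffs[0]
    rate = rate_ratio[frozenset((x, y))] * pi_j
    if CODON_TABLE[i] != CODON_TABLE[j]:
        rate *= omega  # nonsynonymous change
    return rate
```

Simpler nucleotide models correspond to constraining `rate_ratio`: setting every entry to 1 gives a JC-like codon model, while six free entries give a GTR-like one.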
[Table 1: candidate time-reversible substitution models]
Model selection
AIC has been widely used in model selection, alongside other methods such as the likelihood ratio test (LRT) and the Bayesian information criterion (BIC). AIC characterizes the Kullback-Leibler distance between a true model and an examined model, and this distance can be regarded as quantifying the information lost by approximating the true model. KaKs_Calculator uses a modification of AIC (AICC), which takes account of sample size (n), maximum likelihood score (lnLi), and the number of parameters (ki) in model i as follows:
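In its standard small-sample form (reconstructed here from the three quantities just named, so the exact notation is an assumption), the criterion reads:

```latex
\mathrm{AIC}_C = -2\ln L_i + 2k_i + \frac{2k_i(k_i+1)}{n-k_i-1}
```

The last term vanishes as n grows, leaving the ordinary AIC, -2 ln L_i + 2k_i.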
AICC was proposed to correct for small sample sizes, and it approaches AIC as the sample size goes to infinity. We can therefore use this equation to compute AICC for each candidate model and then identify the model with the smallest AICC, which indicates the best fit between model and data.
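The selection step can be sketched as follows, assuming the standard small-sample form AICC = -2 lnL + 2k + 2k(k+1)/(n-k-1). The function names and the input layout are hypothetical, and how n is defined for a given alignment is left to the caller.

```python
def aicc(ln_l: float, k: int, n: int) -> float:
    """Small-sample corrected AIC for one model.

    ln_l: maximized log-likelihood, k: number of parameters, n: sample size.
    """
    if n - k - 1 <= 0:
        raise ValueError("AICC requires n > k + 1")
    return -2.0 * ln_l + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)


def best_model(fits: dict, n: int):
    """Pick the model with the smallest AICC.

    fits maps a model name to (ln_l, k). Returns (best_name, all_scores).
    """
    scores = {name: aicc(ln_l, k, n) for name, (ln_l, k) in fits.items()}
    return min(scores, key=scores.get), scores
```

For instance, with made-up fits `{"JC": (-1005.0, 1), "HKY": (-1000.0, 2), "GTR": (-999.5, 6)}` and n = 300, the middle model wins: its likelihood gain over JC outweighs one extra parameter, while GTR's small further gain does not justify four more.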
Model averaging
A selected model is merely an approximate fit to a dataset, and the true evolutionary model is seldom one of the candidate models (8). An alternative is therefore model averaging, which assigns each candidate model a weight and uses more than one model to estimate average parameters across models. Accordingly, we first need to compute the Akaike weight (wi, where i = 1, 2, ..., m) for each model in a set of candidate models:
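In the usual formulation (a reconstruction consistent with the definitions in the surrounding text, not copied from the original equation), the weights are:

```latex
w_i = \frac{\exp\!\left[-\tfrac{1}{2}\left(\mathrm{AIC}_{C,i} - \min \mathrm{AIC}_C\right)\right]}
           {\sum_{j=1}^{m} \exp\!\left[-\tfrac{1}{2}\left(\mathrm{AIC}_{C,j} - \min \mathrm{AIC}_C\right)\right]}
```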
where min AICC is the smallest AICC value among the candidate models. We can then estimate model-averaged parameters. Taking κTC as an example, a model-averaged estimate can be calculated by:
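A common way to form such an estimate, assumed here rather than taken from the original equation, is a weighted mean over the models that actually contain the parameter: the sum of wi times the estimate from model i, divided by the sum of those wi. The sketch below computes the Akaike weights and then this average; the function names and input layout are hypothetical.

```python
import math


def akaike_weights(aicc_scores: dict) -> dict:
    """Map model name -> Akaike weight, given model name -> AICC."""
    best = min(aicc_scores.values())
    raw = {m: math.exp(-0.5 * (s - best)) for m, s in aicc_scores.items()}
    total = sum(raw.values())
    return {m: r / total for m, r in raw.items()}


def model_average(weights: dict, estimates: dict) -> float:
    """Weighted mean of one parameter over the models that include it.

    estimates maps model name -> estimate and lists only those models
    in which the parameter (for example kappa_TC) is defined.
    """
    support = sum(weights[m] for m in estimates)
    if support == 0:
        raise ValueError("no weight on models containing this parameter")
    return sum(weights[m] * v for m, v in estimates.items()) / support
```

Dividing by the summed weight of the contributing models keeps the estimate on the parameter's own scale even when some candidate models, such as JC, do not include it.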
Application
KaKs_Calculator is written in standard C++. It compiles and runs readily on Unix/Linux systems and workstations (tested on AIX, IRIX, and Solaris). In addition, we used Visual C++ 6.0 to build a graphical user interface and provide a Windows version that runs on any IBM-compatible computer under the Windows operating system (tested on Windows 2000/XP). Compiled executables for AIX/IRIX/Solaris and a setup application for Windows, as well as source code, example data, installation instructions, and documentation for KaKs_Calculator, are available at http://evolution.genomics.org.cn/software.htm.
Unlike other existing tools, KaKs_Calculator employs model-selected and model-averaged methods based on a set of candidate models to estimate Ka and Ks. It integrates as many features as needed from sequence data and in most cases yields more reliable evolutionary information (see the comparative results on simulated sequences at http://evolution.genomics.org.cn/doc/SimulatedResults.xls). KaKs_Calculator also reports comprehensive information estimated from the compared sequences, including the numbers of synonymous and nonsynonymous sites and substitutions, GC contents, maximum likelihood scores, and AICC. Moreover, KaKs_Calculator incorporates several other methods and allows users to choose one or more methods in a single run (Table 2).
[Table 2: methods incorporated in KaKs_Calculator]
#1 The approximate method involves three basic steps: Step 1: counting the numbers of synonymous and nonsynonymous sites; Step 2: calculating the numbers of synonymous and nonsynonymous substitutions; Step 3: correcting for multiple substitutions.
#2 The maximum likelihood method uses probability theory to complete the three steps in one go (4).
* No specific definition of synonymous and nonsynonymous sites or substitutions.
Although there are 203 time-reversible models of nucleotide substitution, model selection in practice is often limited to a subset of them, and model averaging can reduce the biases that arise from selecting a single model. Model-averaged methods should therefore be preferred for general calculations of Ka and Ks. Planned improvements include applying model selection and model averaging to detect positive selection at single amino acid sites, which requires high-speed computing for maximum likelihood estimation, especially when the adopted model becomes complex.
In conclusion, KaKs_Calculator incorporates as many features as needed for accurately extracting evolutionary information through model selection and model averaging; it may therefore be useful for in-depth studies of phylogeny and molecular evolution.
Acknowledgements
We thank Professor Ziheng Yang for permission to use his invaluable source code in PAML and two anonymous reviewers for their constructive comments on an earlier version of this manuscript. We are grateful to Ya-Feng Hu, Lin Fang, Jia Ye, Hai-Feng Yuan, and Heng Li for their help in software development. We also thank a number of users and members of our institutes for reporting bugs and giving suggestions. This work was supported by grants from the Ministry of Science and Technology of China (No. 2001AA231061) and the National Natural Science Foundation of China (No. 30270748) awarded to JY.
Authors’ contributions
ZZ designed and programmed this software and drafted the manuscript. JL carried out computer simulations to generate sequences. XQZ tested earlier versions of the software. JW and GKSW contributed to conceiving this software and participated in software design. JY supervised the study and revised the manuscript. All authors read and approved the final manuscript.