Journal
Genomics Proteomics & Bioinformatics (6.409/Q1)
KaKs_Calculator 2.0: A Toolkit Incorporating Gamma-Series Methods and Sliding Window Strategies
Abstract
We present KaKs_Calculator 2.0, an updated version of our integrated stand-alone software package. It incorporates 17 methods for calculating nonsynonymous and synonymous substitution rates. Among them are our modified versions of several widely used methods, the gamma series (γ-NG, γ-LWL, γ-MLWL, γ-LPB, γ-MLPB, γ-YN and γ-MYN), which have been demonstrated to perform better than their original forms under certain conditions and were not implemented in the previous version. The package can readily be used to identify positively selected sites with a sliding window that moves across the sequences of interest in the 5′ to 3′ direction of protein-coding sequences, and it improves overall performance in sequence analysis for evolutionary studies. A toolbox, including C++ and Java source code, executable files for both Windows and Linux platforms, and user instructions, can be downloaded for academic use from https://sourceforge.net/projects/kakscalculator2/.
(Translator's note: like KaKs_Calculator 1.0, this version has no Mac OS build, and it was still published.)
Key words: Ka/Ks, gamma-series methods, sliding window, positively selected sites
Introduction
Calculating nonsynonymous (Ka) and synonymous (Ks) substitution rates is a useful way to evaluate sequence variation among protein orthologs across different species or taxonomic lineages with unknown evolutionary status. Furthermore, it is often important to recognize positively selected sites and to identify genes with selective hotspots. Numerous methods and software tools have been developed for such purposes in the public domain, including PAML, MEGA, DnaSP, HyPhy and certain modules from Bioperl. However, after careful simulations and real data analysis, we concluded that no single method can readily be identified as suitable under all circumstances. We therefore created KaKs_Calculator 1.0, which adopted model-selection and model-averaging techniques to compute Ka/Ks values from a group of existing nucleotide substitution models.
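The model-selection and model-averaging idea can be illustrated with a short sketch. It assumes the standard small-sample corrected Akaike information criterion (AICc) and Akaike weights; the log-likelihoods, parameter counts and Ka/Ks estimates below are hypothetical values chosen for illustration, and the code is not taken from KaKs_Calculator itself.

```python
import math

def aicc(log_likelihood, k, n):
    """AICc for a model with k free parameters fitted to n sites."""
    return -2.0 * log_likelihood + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)

def akaike_weights(scores):
    """Turn AICc scores into weights summing to 1 (lower score, larger weight)."""
    best = min(scores)
    relative = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(relative)
    return [r / total for r in relative]

def model_average(estimates, scores):
    """Weight each model's estimate by its Akaike weight."""
    weights = akaike_weights(scores)
    return sum(w * e for w, e in zip(weights, estimates))

# Hypothetical candidate models: (log-likelihood, parameter count, Ka/Ks estimate)
fits = [(-1205.3, 1, 0.21), (-1201.8, 2, 0.25), (-1201.5, 5, 0.26)]
n_sites = 900  # hypothetical alignment length

scores = [aicc(lnl, k, n_sites) for lnl, k, _ in fits]
selected = min(range(len(fits)), key=lambda i: scores[i])  # model selection
averaged = model_average([f[2] for f in fits], scores)     # model averaging
```

Model selection keeps only the best-scoring model, whereas model averaging lets every candidate contribute in proportion to its support, which is useful when several models fit almost equally well.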
Because the majority of DNA sequence sites are considered invariable owing to functional constraints and evolutionary distances, selective pressure varies among sites in a sequence. Ka/Ks calculations based only on the entire gene are therefore not enough to detect individual sites subject to adaptive selection. To overcome this problem, a “sliding window” strategy has been introduced in several web servers such as SWAPSC and WSPMaker, but these tools adopt few (mostly one) models for Ka and Ks calculations. Here we provide an updated version of KaKs_Calculator that addresses both problems in a simple way. In particular, we have embedded gamma-series methods in this new version.
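A minimal sketch of the sliding window strategy may help here. It assumes two already aligned, gap-free coding sequences of equal length, and it measures the window and step in codons so that every slice stays in frame. The function name and the example sequences are illustrative and do not come from the package.

```python
def sliding_windows(seq_a, seq_b, window, step):
    """Yield (start, sub_a, sub_b) for in-frame windows, moving 5' to 3'.

    window and step are counted in codons; start is the 0-based
    nucleotide offset of the window within the alignment.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if len(seq_a) % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    n_codons = len(seq_a) // 3
    for first in range(0, n_codons - window + 1, step):
        lo, hi = first * 3, (first + window) * 3
        yield lo, seq_a[lo:hi], seq_b[lo:hi]

# Toy pair of 6-codon sequences (hypothetical data)
a = "ATGGCTGCTAAAGGTTAA"
b = "ATGGCAGCTAGAGGTTAA"
slices = list(sliding_windows(a, b, window=4, step=1))
```

Each slice can then be passed to any Ka/Ks method in turn. A smaller step gives a denser profile along the gene at the cost of more calculations, and very short windows contain few substitutions, so their estimates are noisy.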
New Features
We have introduced three novel features in KaKs_Calculator 2.0. First, unlike existing Ka/Ks algorithms, the new software can take variable mutation rates across sequence sites into account; these rates contain vital information for molecular evolutionary studies. We created seven related methods, namely γ-NG, γ-LWL, γ-MLWL, γ-LPB, γ-MLPB, γ-YN and γ-MYN, by introducing a gamma distribution to model the mutation rates. The importance of the new methods has been demonstrated, because ignoring rate variation gives rise to biased computational results. We therefore implemented these new methods in the updated core toolset of version 2.0, which has seventeen algorithms: seven original approximate methods, seven gamma-series methods, one maximum likelihood method (GY), and two expanding methods (model selected and model averaged). The methods provide not only the values of Ka, Ks and Ka/Ks, but also other key information from paired orthologous sequences, including the number of synonymous/nonsynonymous sites, substitutions, divergence time, substitution-rate ratio, GC content, and AICc. Second, we added three new modules (split, plot and dpss) to evaluate adaptive selection at the gene sequence level. As an expanding toolset, they adopt a sliding window whose window length and step length are defined by the user. Split divides the raw paired orthologous sequences into portions on the basis of dynamic windows in the positive direction. Plot processes the output of the core toolset after the nucleotide sequences from Split have been computed, producing a large collection of figures that illustrate Ka, Ks and Ka/Ks (omega) by interval. Dpss identifies the positions of positively selected sites based on the initial analyses. Third, it should be emphasized that all the above-mentioned processes are capable of handling massive data in a timely fashion.
In particular, all transferable data, including sequences and the resulting information, are contained in a single file. We provide executable files as well as source code for the package, and we tested all programs on both Windows 2000/XP/Vista and Linux (Red Hat 3.4.6-8) platforms. The toolkit is freely available (licensed under GPLv3) online at https://sourceforge.net/projects/kakscalculator2/.
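To show why ignoring rate variation biases the results, the sketch below compares the classic Jukes-Cantor distance with its textbook gamma-corrected form, d = (3/4)·α·[(1 − 4p/3)^(−1/α) − 1], where p is the observed proportion of differing sites and α is the gamma shape parameter. This is a generic illustration under that assumption; it is not claimed to be the exact formula used by any of the seven γ methods, and the value of p is hypothetical.

```python
import math

def jc_distance(p):
    """Jukes-Cantor distance, assuming one uniform rate across sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def gamma_jc_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates (shape alpha)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

p = 0.3                                 # hypothetical observed difference
uniform = jc_distance(p)                # assumes no rate variation
variable = gamma_jc_distance(p, 1.0)    # strong rate variation (alpha = 1)
```

For the same observed difference, the gamma-corrected distance is larger than the uniform-rate one, and the two converge as α grows. A method that assumes uniform rates therefore underestimates the number of substitutions when rates actually vary across sites.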
Implementation
To make the algorithms easy to update and the software friendly to users, we implemented the new version with a “toolkit” idea in mind. The integrated software is therefore divided into two essential parts that serve different functions: the core toolset, which calculates Ka and Ks, and the expanding toolset, which is responsible for additional computations based on the Ka and Ks calculation (e.g., with a sliding window strategy) (Figure 1). In the core toolset, we designed the GUI with Visual C++’s MFC (Microsoft Foundation Classes), which manages documents and allows users to view the objects, and the entire program is object-oriented. Each main method has its own class in the code, and the multi-threaded operations among them use the allocated CPU time very efficiently. We adopted Java 6 to program the expanding toolset because of its cross-platform advantages. We chose the R language (http://www.r-project.org/) to draw high-level graphics from the input data. To call R functions from Java, we employ a package named “Rserve” (http://www.rforge.net/Rserve/index.html), a program that responds to requests from clients over the TCP/IP protocol. In detail, we use Java to invoke the JRclient suite and connect to Rserve after it starts in the R environment; each connection then has its own workspace and directory. Moreover, the server allows many clients to plot their data simultaneously. As for running speed, a graph covering thousands of data points can be plotted in a few seconds.
(Translator's note: letting many clients plot their data simultaneously is a nice touch.)
Figure 1 Flowchart of the KaKs_Calculator 2.0 software design.
Evaluation
We have evaluated the performance of the gamma-series methods in Ka/Ks calculations in previous studies. To identify positively selected sites, we have also successfully applied the toolbox to two real cases: the animal alpha-defensin genes investigated in Lynn et al. and the TAS1R3 (taste receptor type 1 member 3) genes, which are reported to be responsible for the ability to recognize sweetness (Figure 2). It is important to combine the gamma-series methods with a sliding window strategy. The former models the variation in raw mutation rate across sites, and the latter reveals whether sites are driven by different selective pressures, on the assumption that omega (Ka/Ks) values are not equal along orthologous gene sequences. In particular, when the window slices become dense enough, the approach approximates the “site models”, much like the definition of an integral in mathematics. We believe that the software provides an excellent choice for detecting positively selected sites. As a final note, we will construct ancestral sequences to measure lineage-specific selective strength in our next update.
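The way per-window omega values point to candidate regions can be sketched as follows. The sketch assumes each window has already been given Ka and Ks estimates, flags windows whose Ka/Ks exceeds 1, and merges overlapping flagged windows into regions. The function names, the threshold and the numbers are hypothetical, and this is not the Dpss implementation.

```python
def flag_positive_windows(windows, threshold=1.0):
    """Return (start, end, omega) for windows with Ka/Ks above threshold.

    windows is a list of (start, end, ka, ks) tuples; windows with
    Ks <= 0 are skipped because omega is undefined for them.
    """
    flagged = []
    for start, end, ka, ks in windows:
        if ks <= 0:
            continue
        omega = ka / ks
        if omega > threshold:
            flagged.append((start, end, omega))
    return flagged

def merge_regions(flagged):
    """Merge overlapping or touching flagged windows into (start, end) regions."""
    regions = []
    for start, end, _ in sorted(flagged):
        if regions and start <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]

# Hypothetical per-window results: (start, end, Ka, Ks)
results = [
    (0, 60, 0.02, 0.10),
    (30, 90, 0.15, 0.10),
    (60, 120, 0.18, 0.12),
    (90, 150, 0.03, 0.00),   # Ks of zero: omega undefined, skipped
    (120, 180, 0.01, 0.20),
]
regions = merge_regions(flag_positive_windows(results))
```

An omega above 1 in a single window is only a pointer to a candidate region. Short windows contain few substitutions, so a formal test is still needed before calling any site positively selected.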
Figure 2 An example of displaying Ka, Ks and Ka/Ks to identify positively selected sites. This analysis was based on the TAS1R3 gene pair from Homo sapiens (NM_152228) and Canis familiaris (XM_843615).
Acknowledgements
This work was funded by the National Basic Research Program of China (973 Program) to JY (Grant No. 2006CB910404).
Authors’ contributions
DW and JY conceived and designed this study. DW and YZ programmed the software and drafted the manuscript. ZZ supplied several bug reports and modified schemes in the previous version of the software. DW and JZ contributed to data analyses and software testing. JY managed this project and revised the manuscript. All authors read and approved the final manuscript.