Efficient Sequence Alignment Algorithms
高效的序列比对算法
Journal of theoretical biology (图1)
2021年影响因子/JCR分区:2.691/Q2
图1Sequence alignments are becoming more important with the increase of nucleic acid data. Fitch and Smith have recently given an example where multiple insertion/deletions (rather than a series of adjacent single insertion/deletions) are necessary to achieve the correct alignment. Multiple insertion/deletions are known to increase computation time from O(n 2) to O(n 3) although Gotoh has presented an O(n 2) algorithm in the case the multiple insertion/deletion weighting function is linear. It is argued in this paper that it could be desirable to use concave weighting functions. For that case, an algorithm is derived that is conjectured to be 0(n2).
随着核酸数据的增加,序列比对变得越来越重要。 Fitch 和 Smith 最近给出了一个示例,其中需要多次插入/删除(而不是一系列相邻的单个插入/删除)才能实现正确对齐。 众所周知,多次插入/删除会将计算时间从 O(n 2) 增加到 O(n 3),尽管 Gotoh 在多次插入/删除权重函数是线性的情况下提出了 O(n 2) 算法。 本文认为使用凹加权函数可能是可取的。对于这种情况,推导出一个算法,推测为 0(n2)。
Introduction
With the advent of rapid methods for determining nucleic acid sequences, there is increased interest in computer methods for comparing these sequences. Sequencing is estimated to be proceeding at the rate of 10 6 bases per year and various data bases are being structured to organize the sequences and attendant information. Relevant portions of the data base are searched for sequences similar to ones recently determined; and rapid, efficient, and meaningful algorithms are necessary.
随着用于确定核酸序列的快速方法的出现,人们对用于比较这些序列的计算机方法越来越感兴趣。 测序估计以每年10 6 个碱基的速度进行,并且正在构建各种数据库以组织序列和伴随的信息。 在数据库的相关部分中搜索与最近确定的序列相似的序列; 快速、高效和有意义的算法是必要的——(CyDotian都满足,就是内存缺陷)。
The Needleman & Wunsch (1970) algorithm was the first rapid method in the biological literature for determining sequence homology and was followed by the metrics of Sankoff (1972) and Sellers (1974a, b) for finding the distance between two sequences. Kruskal (1983) recently presented an extensive review of these and related methods and applications.
Needleman & Wunsch (1970) 算法是生物学文献中第一个快速确定序列同源性的方法,随后是 Sankoff (1972) 和 Sellers (1974a, b) 的度量标准,用于查找两个序列之间的距离。 Kruskal (1983) 最近对这些和相关的方法和应用进行了广泛的回顾。
Sellers' work was generalized to include multiple insertion/deletions by Waterman, Smith & Beyer (1976). A review of the use of these techniques in sequence analysis was given by Smith, Fitch & Waterman (1981). Fitch & Smith (1983) recently studied these algorithms for a region of DNA coding for alpha and beta hemoglobin in chicken, where the alignment is assumed known by comparison of the many corresponding protein sequences, and they determined that a specific range of weights for multiple insertion/deletions were necessary to obtain correct alignments. Therefore, it is important to include these in application algorithms.
Sellers的工作被 Waterman, Smith & Beyer (1976) 概括为包括多次插入/删除。 Smith, Fitch & Waterman (1981) 回顾了这些技术在序列分析中的应用。 Fitch & Smith (1983) 最近研究了鸡中编码 α 和 β 血红蛋白的 DNA 区域的这些算法,其中假设通过比较许多相应的蛋白质序列已知比对,并且他们确定了多个特定的权重范围 插入/删除是获得正确对齐所必需的。 因此,将这些包含在应用算法中很重要。
In this paper we review recent work of Gotoh (1982) who presented an O(n 2) algorithm for the case of linear insertion/deletion functions. An extension to concave insertion/deletion functions is obtained here which has computation complexity close to O(n2).
在本文中,我们回顾了 Gotoh (1982) 最近的工作,他提出了一个 O(n 2) 算法用于线性插入/删除函数的情况。 这里获得了对凹插入/删除函数的扩展,其计算复杂度接近 O(n2)。
Linear Insertion/Deletions
To make the discussion specific, the algorithm of Waterman et al. (1976) will now be presented, a = a1a2 • • • an and b= b1b2 • •. bm, are the sequences of interest and D(a, b) will denote the (minimum) distance between a and b where d(x, y) is a dissimilarity measure between the sequence elements, and deletions of length k are assigned weight w(k). We take Di0 = w(i), D0j = w(j) and Dij = D(a1a2... ai, bib2.., bj) and proceed by
Of course, the final result is Dn,m = D(a, b). The computational efficiency of the algorithm is of order
当然,最终的结果是 Dn,m = D(a, b)。 该算法的计算效率是有序的
which, when n = m, is O(n3).
其中,当 n = m 时,为 O(n3)。
Paul Haeberli of the University of Wisconsin brought it to our attention that this algorithm can be improved to O(n2) when w(k) is a linear function of k, although he did not carefully state the condition on w(k) or give a complete proof. Much the same observation was made by Goad & Kanehisa (1982) regarding the secondary structure algorithm of Waterman & Smith (1978). More recently Gotoh (1983) gave a complete and clear proof for the new algorithm.
威斯康星大学的 Paul Haeberli 提请我们注意,当 w(k) 是 k 的线性函数时,该算法可以改进到 O(n2),尽管他没有仔细说明 w(k) 的条件或给出 一个完整的证明。 Goad & Kanehisa (1982) 对 Waterman & Smith (1978) 的二级结构算法做出了大致相同的观察。 最近 Gotoh (1983) 对新算法给出了完整而清晰的证明。
Concave Insertion/Deletions
The intuitive argument for multiple insertion/deletion functions is that a deletion of fourteen bases (say) should not be thought of as fourteen independent single deletions but as one deletion event which has weight less than the sum the weights of fourteen single deletions. This reasoning implies a general inequality
多个插入/删除函数的直观论点是,十四个碱基的删除(比如说)不应该被认为是十四个独立的单个删除,而是一个删除事件,其权重小于十四个单个删除的权重之和。 这种推理隐含着一个普遍的不等式
It is possible to give a slightly improved set of inequalities which proves useful in this problem. If we agree that bases are increasingly easier to insert (delete) as the insertion (deletion) length grows, then the resulting inequalities are
可以给出一组稍微改进的不等式,这在这个问题中证明是有用的。 如果我们同意随着插入(删除)长度的增加,碱基越来越容易插入(删除),那么由此产生的不等式是
If equality is required, then linearity follows:
如果要求相等,则线性如下:
The assumption of linearity is now transparent; w(k) linear means that after the first deletion (insertion) each successive addition of a base has equal cost. We argue that strict inequality could be preferable. Since the function w is increasing and has decreasing differences, it is concave downward, here simply referred to as concave. Such functions as
线性假设现在是透明的; w(k) 线性意味着在第一次删除(插入)之后,每个连续添加的碱基具有相同的成本。 我们认为严格的不平等可能更可取。 由于函数w是递增的,并且有递减的差值,所以它是向下凹的,这里简称为凹。 诸如此类的功能
are concave and have intuitive appeal.
都是凹的,具有直观的吸引力。
Next we make the assumption of linearity, w (k) = a + b (k - 1) and review Gotoh's proof. He presents the above recursion for Dij in the form
接下来我们做出线性假设,w (k) = a + b (k - 1) 并查看 Gotoh 的证明。 他以以下形式呈现了 Dij 的上述递归
The computational complexity of this algorithm depends on the rate of growth of S(i) as a function of sequence length. On a row of length m, how many times will length one deletions be optimal?We conjecture that this function does not grow faster than log (m).
该算法的计算复杂度取决于作为序列长度函数的 S(i) 的增长率。 在长度为 m 的一行上,一个删除的长度是多少倍是最优的?我们推测这个函数的增长速度不会比 log (m) 快。
Conclusion
More development of sequence comparison algorithms will occur in the near future. The rapidly increasing data base will force those working on algorithms for molecular biology to construct algorithms that are more efficient and that answer increasingly more special and involved questions. We are, for example, applying our results on concave deletion functions to algorithms for RNA secondary structure. One of the interesting aspects will be the effects that molecular biology and computer science will have as they continue to communicate and cooperate.
序列比较算法的更多发展将在不久的将来发生。 快速增长的数据库将迫使那些致力于分子生物学算法的人构建更有效的算法,并回答越来越特殊和复杂的问题。 例如,我们正在将我们关于凹缺失函数的结果应用于 RNA 二级结构的算法。 有趣的方面之一是分子生物学和计算机科学在继续交流和合作时将产生的影响。
The Department of Biochemistry and Biophysics of the University of California at San Francisco and System Development Foundation are each gratefully acknowledged for providing partial support of this work.
感谢加州大学旧金山分校生物化学和生物物理系和系统开发基金会对这项工作的部分支持。
总结:不愧为原创算法的奠基人,牛。。。
网友评论