美文网首页
算法文献阅读4:Smith-Waterman algorithm

算法文献阅读4:Smith-Waterman algorithm

作者: 龙star180 | 来源:发表于2022-06-08 11:21 被阅读0次

    Identification of Common Molecular Subsequences (1981年)

    公共分子子序列的鉴定

    authors: 坦普尔·史密斯(Temple F. Smith)和迈克尔·沃特曼(Michael S. Waterman)

    这个经典的不能再经典的经典算法所发表的期刊为Journal of molecular biology(是Elsevier(爱思唯尔)出版公司旗下的期刊(图1))2021年影响因子/JCR分区:5.469/Q1

    图1

    The identification of maximally homologous subsequences among sets of long sequences is an important problem in molecular sequence analysis. The problem is straightforward only if one restricts consideration to contiguous subsequences (segments) containing no internal deletions or insertions. The more general problem has its solution in an extension of sequence metrics developed to measure the minimum number of “events” required to convert one sequence into another.

    识别长序列组中的最大同源子序列是分子序列分析中的一个重要问题。只有当我们把问题限制在不包含内部删除或插入的连续子序列(片段)时,这个问题才是简单的。这个更普遍的问题在序列度量的扩展中得到了解决,该度量是为了测量将一个序列转化为另一个序列所需的最小 "事件 "数量。

    These developments in the modern sequence analysis began with the heuristic homology algorithm of Needleman & Wunsch (1970) which first introduced an iterative matrix method of calculation. Numerous other heuristic algorithms have been suggested including those of Fitch (1966) and Dayhoff (1969). More mathematically rigorous algorithms were suggested by Sankoff (1972), Reichert et al. (1973) and Beyer et al. (1979), but these were generally not biologically satisfying or interpretable. Success came with Sellers (1974) development of a true metric mewure of the distance between sequences. This metric was later generalized by Waterman et al. (1976) to include deletions/insertions of arbitrary length. This metric represents the minimum number of “mutational events” required to convert one sequence into another. It is of interest to note that Smith et al. (1980) have recently shown that under some conditions the generalized Sellers metric is equivalent to the original homology algorithm of Needleman & Wunsch (1970).

    现代序列分析的这些发展始于Needleman & Wunsch(1970)的启发式同源算法,该算法首次引入了迭代矩阵的计算方法。许多其他启发式算法也被提出,包括Fitch(1966)和Dayhoff(1969)的算法。Sankoff(1972)、Reichert等人(1973)和Beyer等人(1979)提出了更严格的数学算法,但这些算法通常在生物学上不能令人满意或解释。Sellers(1974)开发了一个真正的序列间距离的度量方法,取得了成功。这个指标后来被Waterman 等人(1976)将这一指标推广到包括任意长度的缺失/插入。这个度量代表了将一个序列转化为另一个序列所需的最小 "突变事件 "的数量。值得注意的是,Smith等人(1980)最近表明,在某些条件下,广义的Sellers度量等同于Needleman和Wunsch(1970)的原始同源算法。

    In this letter we extend the above ideas to find a pair of segments, one from each of two long sequences, such that there is no other pair of segments with greater similarity (homology). The similarity measure used here allows for arbitrary length deletions and insertions.

    在这个手稿中,我们扩展了上述想法,从两个长序列中的每一个中找到一对片段,这样就没有其他具有更大相似性(同源性)的片段对。 这里使用的相似性度量允许任意长度的删除和插入。

                                                                    Algorithm

    The two molecular sequences will be h=a1a2 . . . an, and B=b1b2, . . . bm. A similarity s(a,b) is given between sequence elements a and b. Deletions of length k are given weight Wk. To find pairs of segments with high degrees of similarity, we set up a matrix H. First set

    两个分子序列将是h=a1a2 ... an,和B=b1b2, ... bm。序列元素a和b之间有一个相似度s(a,b),长度为k的删减被赋予权重Wk。为了找到具有高相似度的片段对,我们设置了一个矩阵H。首先设置:

    Preliminary values of H have the interpretation that H, is the maximum similarity of two segments ending in ai and bj, respectively. These values are obtained from the relationship

    H的初值可以解释为Hij是分别以ai和bj结尾的两段最大相似度。这些值从关系中获得

    The formula for Hij follows by considering the possibilities for ending. the segments at any ai and bj.

    Hij 的公式通过考虑结束的可能性来遵循。 任意 ai 和 bj 处的段。

    (1) If ai and bj are associated, the similarity is

    (1) 如果 ai 和 bj 相关联,相似度为

    (2) If ai is at the end of a deletion oi length k, the similarity is
    (2)如果ai在一个长度为k的删除的末尾,则相似度为

    (3) If bj is at the end of a deletion of length l, the similarity is

    (3)如果 bj 在长度为 l 的删除末尾,则相似度为

    (4) Finally, a zero is included to prevent calculated negative similarity, indicating no similarity up to ai and bj.

    (4)最后加入一个0来防止计算出的负相似度,表示与ai和bj没有相似度。

    The pair of segments with maximum similarity is found by first locating the maximum element of H. The other matrix elements leading to this maximum value are than sequentially determined with a traceback procedure ending with an element of H equal to zero. This procedure identifies the segments as well as produces the corresponding alignment. The pair of segments with the next best similarity is found by applying the traceback procedure to the second largest element of H not associated with the first traceback.

    首先找到H的最大元素,找到具有最大相似性的一对片段,然后通过回溯程序依次确定导致该最大值的其他矩阵元素,直到H的元素等于零。这个程序确定了这些片段,并产生了相应的排列。通过对与第一次回溯无关的H的第二大元素应用回溯程序,找到具有下一个最佳相似性的一对片段。

    A simple example is given in Figure 1. In this example the parameters s(aibj) and Wk required were chosen on an a priori statistical basis. A match, ai = bj, produced an s(aibj) value of unity while a mismatch produced a minus one-third. These values have an average for long, random sequences over an equally probable four letter set of zero. The deletion weight must be chosen to be at least equal to the difference between a match and a mismatch. The value used here was Wk = 1.0 + 1/3*k.

    图 2 给出了一个简单的例子。在这个例子中,所需的参数 s(aibj) 和 Wk 是在先验统计基础上选择的。匹配,ai=bj,产生的s(aibj)值为1,而不匹配产生的值为负三分之一。 这些值在等可能的四个字母的零集合上具有长的随机序列的平均值。 删除权重必须选择为至少等于匹配和不匹配之间的差异。 这里使用的值是 Wk = 1.0 + 1/3*k。

    图2

    对序列A-A-U-G-C-A-U-G-C-G和C-A-G-C-U-C-U-A-G应用公式(1)产生的Hij矩阵。下划线的元素表示来自最大元素3.30的回溯路径。

    除非s(a,b)为负值,否则不需要包括0。

    Note, in this simple example, that the alignment obtained:

    注意,在这个简单的例子中,得到的对齐为:

    包含不匹配和内部删除对后者的鉴定以前是不可能以任何严格的方式进行的

    This algorithm not only puts the search for pairs of maximally similar segments on a mathematically rigorous basis but it can be efficiently and simply programmed on a computer.

    这种算法不仅将寻找最大相似段的工作建立在严格的数学基础上,而且可以在计算机上进行有效和简单的编程。

    vital reference

    Needleman S B, Wunsch C D. A general method applicable to the search for similarities in the amino acid sequence of two proteins[J]. Journal of molecular biology, 1970, 48(3): 443-453.

    总结:这么牛掰的划世纪算法(包括Needleman-Wunsch算法)(多少现存的算法是基于它的基础上改良并得到启发的——好多发了顶级一区)只发了中科院二区。这。。。上哪说理去呀。。。可能是年代的原因吧。

    相关文章

      网友评论

          本文标题:算法文献阅读4:Smith-Waterman algorithm

          本文链接:https://www.haomeiwen.com/subject/uxqumrtx.html