Analyzing data by edit distance 2018-1

Author: 11的雾 | Published 2018-11-06 15:42

    Edit distance
    Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.

    Different definitions of an edit distance use different sets of string operations. The Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the Levenshtein distance is usually what is meant by "edit distance".[1]
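
To make the three Levenshtein operations concrete, here is a small sketch that turns "kitten" into "sitting" in three single-character edits. The helper names `substitute`, `insert` and `delete` are made up for this illustration and are not part of the script discussed in this post.

```python
def substitute(s, i, ch):
    """Replace the character at index i with ch."""
    return s[:i] + ch + s[i + 1:]

def insert(s, i, ch):
    """Insert ch before index i (i == len(s) appends)."""
    return s[:i] + ch + s[i:]

def delete(s, i):
    """Remove the character at index i."""
    return s[:i] + s[i + 1:]

word = "kitten"
step1 = substitute(word, 0, "s")   # "sitten"
step2 = substitute(step1, 4, "i")  # "sittin"
step3 = insert(step2, 6, "g")      # "sitting"
edits_used = 3
```

Three edits are enough here, and no shorter sequence of edits exists, so the Levenshtein distance between "kitten" and "sitting" is 3.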

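    def minEditDist(sm, sn):
        """Return the Levenshtein distance between the strings sm and sn."""
        m, n = len(sm) + 1, len(sn) + 1
        # matrix[i][j] = distance between the first i chars of sm and the first j chars of sn
        matrix = [[0] * n for _ in range(m)]
        # first column: turning an i-char prefix of sm into "" takes i deletions
        for i in range(1, m):
            matrix[i][0] = i
        # first row: turning "" into a j-char prefix of sn takes j insertions
        for j in range(1, n):
            matrix[0][j] = j
        for i in range(1, m):
            for j in range(1, n):
                # no cost when the two characters already match
                cost = 0 if sm[i-1] == sn[j-1] else 1
                matrix[i][j] = min(matrix[i-1][j] + 1,       # deletion
                                   matrix[i][j-1] + 1,       # insertion
                                   matrix[i-1][j-1] + cost)  # match or substitution
        return matrix[m-1][n-1]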
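
The full matrix needs memory proportional to len(sm) * len(sn), which adds up for long sequences. If only the final distance is needed, each row depends only on the row before it, so two rows are enough. The following is a minimal sketch of that variant under this assumption; the name `min_edit_dist_two_rows` is hypothetical and does not come from the original script.

```python
def min_edit_dist_two_rows(a, b):
    """Levenshtein distance between a and b, keeping only two rows in memory."""
    # prev[j] = distance between the empty prefix of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # curr[0] = distance between the first i chars of a and the empty string
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

kitten_sitting = min_edit_dist_two_rows("kitten", "sitting")
flaw_lawn = min_edit_dist_two_rows("flaw", "lawn")
```

The trade-off is that the alignment itself can no longer be traced back, because the earlier rows are discarded.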
    

    Usage (assuming the function above is saved as edit_distance.py under /cygene/script/):

    vi test.py

    import sys
    sys.path.append("/cygene/script/")
    import edit_distance
    
    print(edit_distance.minEditDist("TGTGCTGTGGAGGATCAGTCGGGAACCT","CAATATCCAGAACCCTGACCCTGCCGTGTACCAGCTCTGCTTCTGAT"))
    

    Printed result: 28
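
Beyond a single pair, the same distance can rank several candidate sequences against one query, much like the spelling-correction use described earlier. The sketch below is self-contained and uses a memoized recursive form of the Levenshtein recurrence instead of the matrix. The function name `levenshtein`, the reference string and the reads are all invented for illustration, and `functools.lru_cache` requires Python 3.

```python
from functools import lru_cache

def levenshtein(a, b):
    """Levenshtein distance via the recursive definition, memoized per call."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # distance between the first i chars of a and the first j chars of b
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)
    return d(len(a), len(b))

# Hypothetical data: one reference and a few short reads.
reference = "ACGTACGT"
reads = ["TTTTTTTT", "ACGTTCGT", "ACGTACGT", "ACGACGT"]

# Sort reads by their distance to the reference, closest first.
ranked = sorted((levenshtein(reference, r), r) for r in reads)
closest_distance, closest_read = ranked[0]
```

The recursive form is only suitable for short strings, since the recursion depth grows with the combined length of the two inputs.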


        Article link: https://www.haomeiwen.com/subject/bfyuxqtx.html