String Similarity Measures


Author: thirsd | Published: 2019-03-14 00:00

    Recommended articles:

    1. FIVE MOST POPULAR SIMILARITY MEASURES IMPLEMENTATION IN PYTHON
    2. String similarity algorithms: Jaro-Winkler Distance

    Recommended code implementations:

    1. python-string-similarity
      A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and its siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity, etc.) are currently implemented.

    The main characteristics of each implemented algorithm are presented below. The "Cost" column estimates the computational cost of computing the similarity between two strings of lengths m and n, respectively.

    Algorithm                          Provides              Normalized?  Metric?  Type     Cost           Typical usage
    Levenshtein                        distance              No           Yes      -        O(m*n) [1]     -
    Normalized Levenshtein             distance, similarity  Yes          No       -        O(m*n) [1]     -
    Weighted Levenshtein               distance              No           No       -        O(m*n) [1]     OCR
    Damerau-Levenshtein [3]            distance              No           Yes      -        O(m*n) [1]     -
    Optimal String Alignment [3]       distance              No           No       -        O(m*n) [1]     -
    Jaro-Winkler                       similarity, distance  Yes          No       -        O(m*n)         typo correction
    Longest Common Subsequence         distance              No           No       -        O(m*n) [1][2]  diff utility, Git reconciliation
    Metric Longest Common Subsequence  distance              Yes          Yes      -        O(m*n) [1][2]  -
    N-Gram                             distance              Yes          No       -        O(m*n)         -
    Q-Gram                             distance              No           No       Profile  O(m+n)         -
    Cosine similarity                  similarity, distance  Yes          No       Profile  O(m+n)         -
    Jaccard index                      similarity, distance  Yes          Yes      Set      O(m+n)         -
    Sorensen-Dice coefficient          similarity, distance  Yes          No       Set      O(m+n)         -
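
The "Profile" and "Set" rows of the table (Q-Gram, Cosine, Jaccard, Sorensen-Dice) all start by cutting each string into overlapping substrings of length q, which is why their cost is linear in the string lengths. The following is a minimal pure-Python sketch of that idea, not the library's own code. The function names and the default q = 2 are assumptions made for illustration; the library may use a different default and may preprocess the input differently.

```python
from collections import Counter
from math import sqrt


def qgram_profile(s: str, q: int = 2) -> Counter:
    """Count every overlapping substring of length q in s (hypothetical helper)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the two q-gram count profiles."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    # A Counter returns 0 for a missing key, so the union of keys is safe.
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


def cosine_similarity(a: str, b: str, q: int = 2) -> float:
    """Cosine of the angle between the two q-gram count vectors."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    dot = sum(pa[g] * pb[g] for g in pa.keys() & pb.keys())
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(sum(v * v for v in pb.values()))
    return dot / norm if norm else 0.0


def jaccard_similarity(a: str, b: str, q: int = 2) -> float:
    """Size of the intersection over size of the union of the q-gram sets."""
    sa, sb = set(qgram_profile(a, q)), set(qgram_profile(b, q))
    if not sa and not sb:
        return 1.0  # assumption: two strings with no q-grams count as identical
    return len(sa & sb) / len(sa | sb)


def dice_similarity(a: str, b: str, q: int = 2) -> float:
    """Sorensen-Dice: twice the intersection over the sum of the set sizes."""
    sa, sb = set(qgram_profile(a, q)), set(qgram_profile(b, q))
    if not sa and not sb:
        return 1.0  # same assumption as in jaccard_similarity
    return 2 * len(sa & sb) / (len(sa) + len(sb))


if __name__ == "__main__":
    for f in (qgram_distance, cosine_similarity, jaccard_similarity, dice_similarity):
        print(f.__name__, f("night", "nacht"))
```

With q = 2, "night" and "nacht" share only the bigram "ht", so the set-based scores are low even though the strings differ in just two characters. This sensitivity to q is worth keeping in mind when comparing short strings.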

    [1] In this library, Levenshtein edit distance, LCS distance, and their siblings are computed using dynamic programming.
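
To make the O(m*n) cost in note [1] concrete, here is a minimal dynamic-programming sketch of Levenshtein distance and LCS distance in plain Python. It is an illustration under stated assumptions rather than the library's implementation: the function names are hypothetical, unit costs are assumed for every edit, and the LCS distance is taken to be len(s) + len(t) - 2 * LCS length, which is one common definition.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(s: str, t: str) -> int:
    """Length of the longest common subsequence of s and t."""
    prev = [0] * (len(t) + 1)
    for cs in s:
        curr = [0]
        for j, ct in enumerate(t, start=1):
            if cs == ct:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(s: str, t: str) -> int:
    """Assumed definition: characters of both strings that fall outside the LCS."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(lcs_distance("AGCAT", "GAC"))
```

Both functions fill an (m+1) by (n+1) table one row at a time, so the running time is O(m*n), while keeping only two rows in memory reduces the space to O(n).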


          Article link: https://www.haomeiwen.com/subject/lpqfmqtx.html