Recommended code implementation:
- python-string-similarity
A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and its siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity, etc.) are currently implemented.
The main characteristics of each implemented algorithm are presented below. The "Cost" column estimates the computational cost of computing the similarity between two strings of lengths m and n, respectively.
| Algorithm | Measure | Normalized? | Metric? | Type | Cost | Typical usage |
|---|---|---|---|---|---|---|
| Levenshtein | distance | No | Yes | | O(m*n) [1] | |
| Normalized Levenshtein | distance, similarity | Yes | No | | O(m*n) [1] | |
| Weighted Levenshtein | distance | No | No | | O(m*n) [1] | OCR |
| Damerau-Levenshtein [3] | distance | No | Yes | | O(m*n) [1] | |
| Optimal String Alignment [3] | distance | No | No | | O(m*n) [1] | |
| Jaro-Winkler | similarity, distance | Yes | No | | O(m*n) | typo correction |
| Longest Common Subsequence | distance | No | No | | O(m*n) [1,2] | diff utility, Git reconciliation |
| Metric Longest Common Subsequence | distance | Yes | Yes | | O(m*n) [1,2] | |
| N-Gram | distance | Yes | No | | O(m*n) | |
| Q-Gram | distance | No | No | Profile | O(m+n) | |
| Cosine similarity | similarity, distance | Yes | No | Profile | O(m+n) | |
| Jaccard index | similarity, distance | Yes | Yes | Set | O(m+n) | |
| Sorensen-Dice coefficient | similarity, distance | Yes | No | Set | O(m+n) | |
[1] In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the dynamic programming method, which has a cost of O(m*n).
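
To make the cost column more concrete, here is a minimal pure-Python sketch contrasting the two families in the table. It is not the library's own code: the function names `levenshtein`, `shingles` and `jaccard_similarity`, and the default shingle size `k=2`, are assumptions chosen for illustration. The first function fills a dynamic programming table row by row, which takes O(m*n) time. The second builds a set of k-character substrings for each string, which takes roughly O(m+n) time for a fixed k, and then compares the two sets.

```python
def levenshtein(s, t):
    """Edit distance by dynamic programming, keeping only two rows in memory."""
    if s == t:
        return 0
    if not s:
        return len(t)
    if not t:
        return len(s)

    # previous[j] = distance between the empty prefix of s and t[:j]
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        # current[0] = distance between s[:i] and the empty string
        current = [i]
        for j, ct in enumerate(t, start=1):
            substitution = previous[j - 1] + (cs != ct)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def shingles(s, k=2):
    """Set of all k-character substrings of s (hypothetical helper)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard_similarity(s, t, k=2):
    """Jaccard index of the two shingle sets: |A & B| / |A | B|."""
    a, b = shingles(s, k), shingles(t, k)
    if not a and not b:
        return 1.0  # convention assumed here for two strings with no shingles
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(jaccard_similarity("night", "nacht"))
```

The nested loops in `levenshtein` visit every pair of positions, which is where the m*n factor comes from, whereas `jaccard_similarity` scans each string once to build its set.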