标签相似度算法:levenshtein distance编辑距离算法
步骤:
1..两组标签组组合成二维数组。行:0-第一组标签长度,列:0-第二组标签的长度
2.双重循环在该数字的左边,左上,和上面取出对小值,如果该位置上的两组标签相同,则该位置值取最小值,若不相同则取最小值加一。重复此步骤
3.similarity = 1 -dic[len1][len2] / max 相似度等于1-矩阵最后一个元素/两组标签长度的最大值
###编辑距离算法
def distance(labels1, labels2):
len1 = len(labels1) len2 = len(labels2)
dic = np.zeros((len1 + 1, len2 + 1))
for i in range(len1):
dic[i][0] = i for j in range(len2):
dic[0][j] = j for i in range(len1 + 1):
if (i != 0): for j in range(len2 + 1):
if (j != 0):
if (labels1[i - 1] == labels2[j - 1]):
temp = 0
else:
temp = 1
dic[i][j] = min(dic[i - 1][j - 1] + temp, dic[i][j - 1] + 1, dic[i - 1][j] + 1)
max = len2
if (len1 > len2):
max = len1
similarity = 1 - dic[len1][len2] / max
return similarity
if __name__ == '__main__':
labels1 = ["起亚", "实拍", "汽车", "新闻", "广州车展", "东风", "资讯", "飞机"]
labels2 = ["广州", "现场", "汽车", "国际车展", "新闻", "首发", "资讯", "现代", "概念", "北京", "飞机"]
sim = distance(labels1, labels2) print(sim)
print(sim)
输出sim=0.2727272727272727
github源码:
网友评论