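I read up on how the simhash and minhash algorithms work.
Most of what I found uses them directly for similarity computation, but I wanted to see what the hashed values themselves look like.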
https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
For simhash, read the hash value through the .value attribute:
import re
from simhash import Simhash

def get_features(s):
    # Normalize the text, then split it into overlapping 3-character shingles.
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

# .value is the fingerprint as an integer; '%x' prints it in hexadecimal.
print('%x' % Simhash(get_features('How are you? I am fine. Thanks.')).value)
print('%x' % Simhash(get_features('How are u? I am fine. Thanks.')).value)
print('%x' % Simhash(get_features('How r you?I am fine. Thanks.')).value)
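Each line of output is one fingerprint printed in hexadecimal. Simhash defaults to f=64, so each value is a 64-bit integer of up to 16 hex digits.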
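To get a feel for where such a fingerprint comes from, here is a minimal sketch of the simhash idea using only the standard library. It is not the simhash package's implementation, and its values will not necessarily match the library's output. The names toy_simhash and hamming are made up for this illustration, and using the low 64 bits of MD5 as the per-feature hash is an assumption of the sketch.

```python
import hashlib
import re


def get_features(s, width=3):
    """Same character 3-gram features as in the snippet above."""
    s = re.sub(r'[^\w]+', '', s.lower())
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


def toy_simhash(features, bits=64):
    """Illustrative simhash: every feature hash votes on every bit position."""
    mask = (1 << bits) - 1
    votes = [0] * bits
    for feat in features:
        # Assumption of this sketch: low `bits` bits of MD5 as the feature hash.
        h = int.from_bytes(hashlib.md5(feat.encode('utf8')).digest(), 'big') & mask
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 when more features voted 1 than 0 there.
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count('1')


a = toy_simhash(get_features('How are you? I am fine. Thanks.'))
b = toy_simhash(get_features('How are u? I am fine. Thanks.'))
print('%016x' % a)
print('%016x' % b)
print('bits that differ:', hamming(a, b))
```

Similar texts share most of their features, so most bit votes come out the same way and the two fingerprints tend to differ in only a few bits. That Hamming distance is what simhash-based near-duplicate detection compares.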

For minhash, read the value with .digest():
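from datasketch import MinHash

m1 = MinHash()
m2 = MinHash()
# update() takes bytes; here each whole sentence is added as a single element.
m1.update('How are you? I am fine. Thanks.'.encode('utf8'))
m2.update('How r you?I am fine. Thanks.'.encode('utf8'))
# digest() returns the signature as an array of hash values.
print(m1.digest())
print(m2.digest())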
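The result is a 128-dimensional vector: a NumPy array with one hash value per permutation (num_perm defaults to 128).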
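To see why the signature has one entry per permutation, here is a minimal standard-library sketch of the MinHash idea. It is not how datasketch is implemented: this version swaps in salted SHA-1 hashes as stand-ins for the random permutations, and toy_minhash and estimate_jaccard are hypothetical names. It also adds one element per word, which is the more typical way to feed a set into MinHash than passing a whole sentence as a single element.

```python
import hashlib


def toy_minhash(tokens, num_perm=128):
    """Illustrative MinHash: for each salted hash function, keep the
    minimum hash value seen over all tokens in the set."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, 'big')
        # Assumption of this sketch: first 4 bytes of salted SHA-1 as a 32-bit hash.
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + t.encode('utf8')).digest()[:4], 'big')
            for t in tokens
        ))
    return signature


def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


s1 = set('How are you? I am fine. Thanks.'.lower().split())
s2 = set('How r you? I am fine. Thanks.'.lower().split())

sig1 = toy_minhash(s1)
sig2 = toy_minhash(s2)

print(len(sig1), sig1[:4])
print('estimated Jaccard:', estimate_jaccard(sig1, sig2))
```

Each position of the signature answers the question "which element has the smallest hash under this hash function?". Two sets agree at a position with probability equal to their Jaccard similarity, which is why comparing signatures position by position gives an estimate of it.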

Listing the hash algorithms available in hashlib:
https://docs.python.org/3.5/library/hashlib.html
import hashlib
print(hashlib.algorithms_guaranteed)
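For comparison with the simhash and minhash values above, this short sketch prints what a few ordinary hashlib digests look like for the same sentence. The choice of md5, sha1, and sha256 is only for illustration.

```python
import hashlib

# Names of the algorithms every Python build is documented to provide.
print(sorted(hashlib.algorithms_guaranteed))

data = 'How are you? I am fine. Thanks.'.encode('utf8')

# Hex digest of the same input under a few algorithms.
digests = {}
for name in ('md5', 'sha1', 'sha256'):
    h = hashlib.new(name, data)
    digests[name] = h.hexdigest()
    print(name, h.digest_size * 8, 'bits:', digests[name])

# A one-word edit produces an unrelated-looking digest.
near = 'How are u? I am fine. Thanks.'.encode('utf8')
print('md5 of edited text:', hashlib.md5(near).hexdigest())
```

Unlike simhash, these digests are designed so that a small change in the input scrambles the whole output, so they are useful for exact-match checks but not for measuring similarity.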
