布隆过滤器 用于大量数据快速去重
先说结果
一共有4348524
条数据, 不重复的有55888
条
数据库使用 select count(DISTINCT(address))
花了 13.644s
,
从数据库读取数据,再用bloom_filter处理,花了 1m26.710s
, 统计出不重复的有55888条,
准确率一致
4348524
条数据从文本读取,去重时间达到 0m12.150s
目前400w的数量级 bloom filter 跟数据库的DISTINCT
处理时间基本一样
更新
总数据 13766256
, 去重后的数据 363202
,
数据库 select count(DISTINCT(address))
花了 42.433s
,
bloom_filter 处理花了 0m50.983s
准确率一致
python script
import mmh3
from bitarray import bitarray
BIT_SIZE = 1 << 30
class BloomFilter:
def __init__(self):
# Initialize bloom filter, set size and all bits to 0
bit_array = bitarray(BIT_SIZE)
bit_array.setall(0)
self.bit_array = bit_array
def add(self, val):
point_list = self.get_postions(val)
for b in point_list:
self.bit_array[b] = 1
def get_postions(self, val):
# Get points positions in bit vector.
# 提供不同的hash种子得到多个hash函数, seed最好为质数
point1 = mmh3.hash(val, 5) % BIT_SIZE
point2 = mmh3.hash(val, 7) % BIT_SIZE
point3 = mmh3.hash(val, 11) % BIT_SIZE
point4 = mmh3.hash(val, 13) % BIT_SIZE
point7 = mmh3.hash(val, 19) % BIT_SIZE
point5 = mmh3.hash(val, 23) % BIT_SIZE
point6 = mmh3.hash(val, 31) % BIT_SIZE
return [point1, point2, point3, point4, point5, point6]
def is_contains(self, val):
point_list = self.get_postions(val)
result = True
for b in point_list:
result = result and self.bit_array[b]
return result
if __name__ == '__main__':
bf = BloomFilter()
# 第一次会显示 not exists
if bf.is_contains('zqw'):
print('exists')
else:
print('not exists')
bf.add('zqw')
if bf.is_contains('shooter'):
print('exists!')
else:
bf.add('shooter')
if bf.is_contains('zqw'):
print('exists!')
else:
bf.add('zqw')
Hash算法
计算散列值使用了mmh3算法, 因为 mmh3 是非加密哈希算法,计算速度非常快. 据说xxHash更快
参考:
http://www.zhiwenli.com/wordpress/?p=752
https://llimllib.github.io/bloomfilter-tutorial/
https://segmentfault.com/a/1190000010990136
http://www.alloyteam.com/2017/05/hash-functions-introduction/
https://coolshell.cn/articles/17225.html
https://yq.aliyun.com/articles/136154
https://blog.csdn.net/jiaomeng/article/details/1495500
https://github.com/LiuXingMing/Scrapy_Redis_Bloomfilter
https://china.googleblog.com/2007/07/bloom-filter_7469.html
https://www.cnblogs.com/cpselvis/p/6265825.html
http://beginman.cn/redis/2015/11/03/redis-bit/
https://en.wikipedia.org/wiki/MurmurHash
https://github.com/jaybaird/python-bloomfilter/
https://github.com/andymccurdy/redis-py
网友评论