BloomFilter大量数据快速去重

作者: 已不再更新_转移到qiita | 来源:发表于2018-04-17 17:27 被阅读129次

布隆过滤器用于大量数据快速去重

先说结果

一共有4348524条数据，不重复的有55888条
数据库使用 select count(DISTINCT(address)) 花了 13.644s，
从数据库读取数据，再用bloom_filter处理，花了 1m26.710s，统计出不重复的有55888条,
准确率一致
4348524条数据从文本读取，去重时间达到 0m12.150s

目前400w的数量级 bloom filter 跟数据库的DISTINCT 处理时间基本一样

更新

总数据 13766256, 去重后的数据 363202,
数据库 select count(DISTINCT(address)) 花了 42.433s,
bloom_filter 处理花了 0m50.983s
准确率一致

python script

import mmh3
from bitarray import bitarray

BIT_SIZE = 1 << 30

class BloomFilter:

    def __init__(self):
        # Initialize bloom filter, set size and all bits to 0
        bit_array = bitarray(BIT_SIZE)
        bit_array.setall(0)

        self.bit_array = bit_array

    def add(self, val):
        point_list = self.get_postions(val)

        for b in point_list:
            self.bit_array[b] = 1

    def get_postions(self, val):
        # Get points positions in bit vector.
        # 提供不同的hash种子得到多个hash函数, seed最好为质数

        point1 = mmh3.hash(val, 5)  % BIT_SIZE
        point2 = mmh3.hash(val, 7)  % BIT_SIZE
        point3 = mmh3.hash(val, 11) % BIT_SIZE
        point4 = mmh3.hash(val, 13) % BIT_SIZE
        point7 = mmh3.hash(val, 19) % BIT_SIZE
        point5 = mmh3.hash(val, 23) % BIT_SIZE
        point6 = mmh3.hash(val, 31) % BIT_SIZE

        return [point1, point2, point3, point4, point5, point6]

    def is_contains(self, val):
        point_list = self.get_postions(val)

        result = True
        for b in point_list:
            result = result and self.bit_array[b]

        return result


if __name__ == '__main__':

    bf = BloomFilter()

    # 第一次会显示 not exists

    if bf.is_contains('zqw'):
        print('exists')
    else:
        print('not exists')
        bf.add('zqw')

    if bf.is_contains('shooter'):
        print('exists!')
    else:
        bf.add('shooter')

    if bf.is_contains('zqw'):
        print('exists!')
    else:
        bf.add('zqw')

Hash算法

计算散列值使用了mmh3算法, 因为 mmh3 是非加密哈希算法,计算速度非常快. 据说xxHash更快

参考:

http://www.zhiwenli.com/wordpress/?p=752
https://llimllib.github.io/bloomfilter-tutorial/
https://segmentfault.com/a/1190000010990136
http://www.alloyteam.com/2017/05/hash-functions-introduction/

https://coolshell.cn/articles/17225.html
https://yq.aliyun.com/articles/136154
https://blog.csdn.net/jiaomeng/article/details/1495500
https://github.com/LiuXingMing/Scrapy_Redis_Bloomfilter
https://china.googleblog.com/2007/07/bloom-filter_7469.html
https://www.cnblogs.com/cpselvis/p/6265825.html
http://beginman.cn/redis/2015/11/03/redis-bit/
https://en.wikipedia.org/wiki/MurmurHash
https://github.com/jaybaird/python-bloomfilter/
https://github.com/andymccurdy/redis-py

网友评论

本文标题：BloomFilter大量数据快速去重

本文链接：https://www.haomeiwen.com/subject/tdspkftx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！