美文网首页
BloomFilter大量数据快速去重

BloomFilter大量数据快速去重

作者: 已不再更新_转移到qiita | 来源:发表于2018-04-17 17:27 被阅读129次

    布隆过滤器 用于大量数据快速去重

    先说结果

    一共有4348524条数据, 不重复的有55888
    数据库使用 select count(DISTINCT(address)) 花了 13.644s
    从数据库读取数据,再用bloom_filter处理,花了 1m26.710s, 统计出不重复的有55888条,
    准确率一致
    4348524条数据从文本读取,去重时间达到 0m12.150s

    目前400w的数量级 bloom filter 跟数据库的DISTINCT 处理时间基本一样

    更新

    总数据 13766256, 去重后的数据 363202,
    数据库 select count(DISTINCT(address)) 花了 42.433s,
    bloom_filter 处理花了 0m50.983s
    准确率一致

    python script

    import mmh3
    from bitarray import bitarray
    
    BIT_SIZE = 1 << 30
    
    class BloomFilter:
    
        def __init__(self):
            # Initialize bloom filter, set size and all bits to 0
            bit_array = bitarray(BIT_SIZE)
            bit_array.setall(0)
    
            self.bit_array = bit_array
    
        def add(self, val):
            point_list = self.get_postions(val)
    
            for b in point_list:
                self.bit_array[b] = 1
    
        def get_postions(self, val):
            # Get points positions in bit vector.
            # 提供不同的hash种子得到多个hash函数, seed最好为质数
    
            point1 = mmh3.hash(val, 5)  % BIT_SIZE
            point2 = mmh3.hash(val, 7)  % BIT_SIZE
            point3 = mmh3.hash(val, 11) % BIT_SIZE
            point4 = mmh3.hash(val, 13) % BIT_SIZE
            point7 = mmh3.hash(val, 19) % BIT_SIZE
            point5 = mmh3.hash(val, 23) % BIT_SIZE
            point6 = mmh3.hash(val, 31) % BIT_SIZE
    
            return [point1, point2, point3, point4, point5, point6]
    
        def is_contains(self, val):
            point_list = self.get_postions(val)
    
            result = True
            for b in point_list:
                result = result and self.bit_array[b]
    
            return result
    
    
    if __name__ == '__main__':
    
        bf = BloomFilter()
    
        # 第一次会显示 not exists
    
        if bf.is_contains('zqw'):
            print('exists')
        else:
            print('not exists')
            bf.add('zqw')
    
        if bf.is_contains('shooter'):
            print('exists!')
        else:
            bf.add('shooter')
    
        if bf.is_contains('zqw'):
            print('exists!')
        else:
            bf.add('zqw')
    

    Hash算法

    计算散列值使用了mmh3算法, 因为 mmh3 是非加密哈希算法,计算速度非常快. 据说xxHash更快


    参考:

    http://www.zhiwenli.com/wordpress/?p=752
    https://llimllib.github.io/bloomfilter-tutorial/
    https://segmentfault.com/a/1190000010990136
    http://www.alloyteam.com/2017/05/hash-functions-introduction/

    https://coolshell.cn/articles/17225.html
    https://yq.aliyun.com/articles/136154
    https://blog.csdn.net/jiaomeng/article/details/1495500
    https://github.com/LiuXingMing/Scrapy_Redis_Bloomfilter
    https://china.googleblog.com/2007/07/bloom-filter_7469.html
    https://www.cnblogs.com/cpselvis/p/6265825.html
    http://beginman.cn/redis/2015/11/03/redis-bit/
    https://en.wikipedia.org/wiki/MurmurHash
    https://github.com/jaybaird/python-bloomfilter/
    https://github.com/andymccurdy/redis-py

    相关文章

      网友评论

          本文标题:BloomFilter大量数据快速去重

          本文链接:https://www.haomeiwen.com/subject/tdspkftx.html