美文网首页
精确去重和Roaring BitMap (咆哮位图)

精确去重和Roaring BitMap (咆哮位图)

作者: zh_harry | 来源:发表于2020-01-26 14:22 被阅读0次

    基本概念

    Roaring BitMap 以下简称 RBM,中文翻译为咆哮位图,它本质上是定义了一个很大的 bit 数组,每个元素对应到 bit 数组的其中一位,一个Integer是32-bit, 一共有Integer.MAX_VALUE = 2 ^ 32个值,32-bit的unsigned integer的集合(共2 ^ 32 = 42 9496 7296个)
    这个数足够覆盖一款产品的user数或item数(item 泛指是新闻,商品等)
    由定义可知,其去重是针对int 型数据进行操作,对于非Integer类型的数据(比如String类型)可以通过数据字典映射成Integer,比如数据库的ID

    基本位图实现存在问题

    • 即使只存一个数(最大的),存储空间也是512MB
      对于原始的Bitmap来说,这就需要2 ^ 32长度的bit数组
      通过计算可以发现(2 ^ 32 / 8 bytes = 512MB), 一个普通的Bitmap需要耗费512MB的存储空间
      不管业务值的基数有多大,这个存储空间的消耗都是恒定不变的,这显然是不能接受的。
      redis 的基本的位图实现就存在这个问题
      REDIS BITMAP 测试

    localhost:0>flushdb
    localhost:0>info
    # Memory
    used_memory:690288
    used_memory_human:674.11K
    used_memory_rss:652376
    used_memory_rss_human:637.09K #表示该进程所占物理内存的大小
    used_memory_peak:667584400
    used_memory_peak_human:637.09K #是过去Redis内存使用的峰值
    localhost:0> setbit a 4294967295 1 #将最大位置1,此时基数为1
    # Memory
    used_memory:541755832
    used_memory_human:516.66M #set 一个值,直接占512m内存
    used_memory_rss:541717944
    used_memory_rss_human:516.62M
    used_memory_peak:667584400
    used_memory_peak_human:636.66M
    

    附redis raoring-bitmap 实现 https://github.com/aviggiano/redis-roaring

    Roaring Bitmap实现

    • RBM实现源理
      将32位无符号整数按照高16位分桶,即最多可能有216=65536个桶,论文内称为container。存储数据时,按照数据的高16位找到container(找不到就会新建一个),再将低16位放入container中。也就是说,一个RBM就是很多container的集合。
      RoaringBitmap 构造源码
     RoaringArray highLowContainer = null;
    
      /**
       * Create an empty bitmap
       */
      public RoaringBitmap() {
        highLowContainer = new RoaringArray();
      }
    

    RoaringArray

    static final int INITIAL_CAPACITY = 4;
      short[] keys = null;//高16位数组 有序数组,方便二分查找
    
      Container[] values = null;//低16位容器数组
    
      int size = 0;
    
      protected RoaringArray() {
        this(INITIAL_CAPACITY);
      }
    
    • 低16位容器数组存储示意图


      低16位容器数组存储示意图
    • 前1000个62的倍数
    • 高16位为1,低16位为0到99 0x00010000-0x00010063
    • 高16位为2, 低16位的所有偶数

    Container 实现

    Container只需要处理低16位的数据。

    • ArrayContainer
    public final class ArrayContainer extends Container implements Cloneable {
      private static final int DEFAULT_INIT_SIZE = 4;
      private static final int ARRAY_LAZY_LOWERBOUND = 1024;
      static final int DEFAULT_MAX_SIZE = 4096;// containers with DEFAULT_MAX_SZE or less integers
                                               // should be ArrayContainers
      private static final long serialVersionUID = 1L;
      protected int cardinality = 0;
      short[] content;
      /**
       * Create an array container with default capacity
       */
      public ArrayContainer() {
        this(DEFAULT_INIT_SIZE);
      }
    

    short[] content,将16位value直接存储。
    short[] content始终保持有序且不重,方便使用二分查找。
    根据源码可以看出,常量DEFAULT_MAX_SIZE值为4096,当容量超过这个值的时候会将当前Container替换为BitmapContainer。

    /**
       * running time is in O(n) time if insert is not in order.
       */
      @Override
      public Container add(final short x) {
        //基数为0或当前值大于容器中的最大值(因为有序,最后一个即最大值)
        if (cardinality == 0 || (cardinality > 0
                && toIntUnsigned(x) > toIntUnsigned(content[cardinality - 1]))) {
    //基数>=4096则转为BitmapContainer
          if (cardinality >= DEFAULT_MAX_SIZE) {
            return toBitmapContainer().add(x);
          }
    //如果容器空间不路,则扩容[见下方扩容源代码]
          if (cardinality >= this.content.length) {
            increaseCapacity();
          }
          content[cardinality++] = x;
        } else {
    //如果当前值在容器范围内,则找到该值对应的位置[见下二分查找源码]
          int loc = Util.unsignedBinarySearch(content, 0, cardinality, x);
    //未找到
          if (loc < 0) {
            // Transform the ArrayContainer to a BitmapContainer
            // when cardinality = DEFAULT_MAX_SIZE
    //虽然在容器范围内也可能会涉及到容器升级和扩容
            if (cardinality >= DEFAULT_MAX_SIZE) {
              return toBitmapContainer().add(x);
            }
            if (cardinality >= this.content.length) {
              increaseCapacity();
            }
            // insertion : shift the elements > x by one position to
            // the right
            // and put x in it's appropriate place
            System.arraycopy(content, -loc - 1, content, -loc, cardinality + loc + 1);
            content[-loc - 1] = x;
            ++cardinality;
          }
        }
        return this;
      }
    

    数组扩容逻辑

    private void increaseCapacity(boolean allowIllegalSize) {
        int newCapacity = (this.content.length == 0) ? DEFAULT_INIT_SIZE
            : this.content.length < 64 ? this.content.length * 2
                : this.content.length < 1067 ? this.content.length * 3 / 2
                    : this.content.length * 5 / 4;
        // never allocate more than we will ever need
        if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE && !allowIllegalSize) {
          newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
        }
        // if we are within 1/16th of the max, go to max
        if (newCapacity > ArrayContainer.DEFAULT_MAX_SIZE - ArrayContainer.DEFAULT_MAX_SIZE / 16
            && !allowIllegalSize) {
          newCapacity = ArrayContainer.DEFAULT_MAX_SIZE;
        }
        this.content = Arrays.copyOf(this.content, newCapacity);
      }
    

    数组的二分查找(找到则返回value的位置,未找到则返回当前值将要insert 的位置,为与找到值的位置区分,这里返回负值)

    /**
       * Look for value k in array in the range [begin,end). If the value is found, return its index. If
       * not, return -(i+1) where i is the index where the value would be inserted. The array is assumed
       * to contain sorted values where shorts are interpreted as unsigned integers.
       *
       * @param array array where we search
       * @param begin first index (inclusive)
       * @param end last index (exclusive)
       * @param k value we search for
       * @return count
       */
      public static int unsignedBinarySearch(final short[] array, final int begin, final int end,
          final short k) {
        if (USE_HYBRID_BINSEARCH) {
          return hybridUnsignedBinarySearch(array, begin, end, k);
        } else {
          return branchyUnsignedBinarySearch(array, begin, end, k);
        }
      }
    

    为什么是4096的时侯升级容器?

    bitmap vs array
    1. bitmap存储空间恒定为8K,最大的基数可达到8*1024*8=65536个
    2. array的基数与存储空间成正比,即基数越大,占用空占越多
      通过以上两点我们取两者交相交的点为平衡点,即小于该点array更省空间,大于该点bitmap更省空间。

    上图中的前两个container基数都没超过4096,所以均为ArrayContainer。

    • BitmapContainer
    /**
     * Simple bitset-like container.
     */
    public final class BitmapContainer extends Container implements Cloneable {
      protected static final int MAX_CAPACITY = 1 << 16;
    

    //保存低16位,所以最大值为216

      // the parameter is for overloading and symmetry with ArrayContainer
      protected static int serializedSizeInBytes(int unusedCardinality) {
        return MAX_CAPACITY / 8;
      }
      final long[] bitmap;
      int cardinality;
      // nruns value for which RunContainer.serializedSizeInBytes ==
      // BitmapContainer.getArraySizeInBytes()
      private final int MAXRUNS = (getArraySizeInBytes() - 2) / 4;
    
      /**
       * Create a bitmap container with all bits set to false
    *构造最大的long 值数组,每个long 64位
       */
      public BitmapContainer() {
        this.cardinality = 0;
        this.bitmap = new long[MAX_CAPACITY / 64];//1024个long
      }
    

    这种Container使用long[]存储位图数据。我们知道,每个Container处理16位整形的数据,也就是0~65535,因此根据位图的原理,需要65536个比特来存储数据,每个比特位用1来表示有,0来表示无。每个long有64位,因此需要1024个long来提供65536个bit。
    因此,每个BitmapContainer在构建时就会初始化长度为1024的long[]。这就意味着,不管一个BitmapContainer中只存储了1个数据还是存储了65536个数据,占用的空间都是同样的8kb。
    上图中的第三个container基数远远大于4096,所以要用BitmapContainer存储。
    bit map container add方法源码分析

    @Override
      public Container add(final short i) {
        final int x = Util.toIntUnsigned(i);
    //当前值/64取整找到long数组的索引
        final long previous = bitmap[x / 64];
    

    //找到当前值所在long 的第几位,并将该位置1,1L<<x 等价于1x<<(x%64)
    关于位移操作详情官方解释
    https://docs.oracle.com/javase/specs/jls/se10/html/jls-15.html#jls-15.19

    The operators << (left shift), >> (signed right shift), and >>> (unsigned right shift) are called the shift operators. The left-hand operand of a shift operator is the value to be shifted; the right-hand operand specifies the shift distance.
    The type of the shift expression is the promoted type of the left-hand operand.

    If the promoted type of the left-hand operand isint, then only the five lowest-order bits of the right-hand operand are used as the shift distance.(int 只有低5位有效)It is as if the right-hand operand were subjected to a bitwise logical AND operator & (§15.22.1) with the mask value 0x1f (0b11111). The shift distance actually used is therefore always in the range 0 to 31, inclusive.

    If the promoted type of the left-hand operand islong, then only the six lowest-order bits of the right-hand operand are used as the shift distance.(long 只有低6位有效)It is as if the right-hand operand were subjected to a bitwise logical AND operator & (§15.22.1) with the mask value 0x3f (0b111111). The shift distance actually used is therefore always in the range 0 to 63, inclusive.

        long newval = previous | (1L << x);
    //赋新值
        bitmap[x / 64] = newval;
        if (USE_BRANCHLESS) {
          cardinality += (previous ^ newval) >>> x;
        } else if (previous != newval) {
          ++cardinality;
        }
        return this;
      }
    
    • RunContainer
      该容器由RLE实现
    /**
     * This container takes the form of runs of consecutive values (effectively, run-length encoding).
     *
     * Adding and removing content from this container might make it wasteful so regular calls to
     * "runOptimize" might be warranted.
     */
    public final class RunContainer extends Container implements Cloneable {
      private static final int DEFAULT_INIT_SIZE = 4;
      private static final boolean ENABLE_GALLOPING_AND = false;
      private short[] valueslength;//主要存储结构 we interleave values and lengths, so
      // that if you have the values 11,12,13,14,15, you store that as 11,4 where 4 means that beyond 11
      // itself, there are
      // 4 contiguous values that follows.
      // Other example: e.g., 1, 10, 20,0, 31,2 would be a concise representation of 1, 2, ..., 11, 20,
      // 31, 32, 33
      int nbrruns = 0;// how many runs, this number should fit in 16 bits.
    
    • Run-Length Encoding(RLE)
      Run-length encoding是被许多bitmap文件格式支持的数据压缩算法
      RLE工作方式是减少重复字符的物理尺寸,被编码成两个字节,第一个字节表示字符数量,第二个字节表示本身字符值。
      字符串包含4个不同字符
      例如:
    AAAAAAbbbXXXXXt
    

    使用RLE,可以形成4个2字节包

    6A3b5X1t
    

    当手动执行runOptimize 方法时会触发优化

    /**
       * Use a run-length encoding where it is more space efficient
       *
       * @return whether a change was applied
       */
      public boolean runOptimize() {
        boolean answer = false;
        for (int i = 0; i < this.highLowContainer.size(); i++) {
          Container c = this.highLowContainer.getContainerAtIndex(i).runOptimize();
          if (c instanceof RunContainer) {
            answer = true;
          }
          this.highLowContainer.setContainerAtIndex(i, c);
        }
        return answer;
      }
    

    ArrayContainer 优化逻辑

    @Override
      public Container runOptimize() {
        // TODO: consider borrowing the BitmapContainer idea of early
        // abandonment
        // with ArrayContainers, when the number of runs in the arrayContainer
        // passes some threshold based on the cardinality.
        int numRuns = numberOfRuns();
        int sizeAsRunContainer = RunContainer.serializedSizeInBytes(numRuns);
    //如果RunContainer 比当前的容器省空间,则升级为 RunContainer。BitMap 同理
        if (getArraySizeInBytes() > sizeAsRunContainer) {
          return new RunContainer(this, numRuns); // this could be maybe
                                                  // faster if initial
                                                  // container is a bitmap
        } else {
          return this;
        }
      }
    

    numberOfRuns 计算run 的个数

    @Override
      int numberOfRuns() {
        if (cardinality == 0) {
          return 0; // should never happen
        }
        int numRuns = 1;
        int oldv = toIntUnsigned(content[0]);
        for (int i = 1; i < cardinality; i++) {
          int newv = toIntUnsigned(content[i]);
          if (oldv + 1 != newv) {
            ++numRuns;
          }
          oldv = newv;
        }
        return numRuns;
      }
    

    计算RunContainer 可能占用的容量

    protected static int serializedSizeInBytes(int numberOfRuns) {
        return 2 + 2 * 2 * numberOfRuns; // each run requires 2 2-byte entries.
    //值和长度成对出现,分别两个字节
      }
    

    容器转换总结

    • ArrayContainer:
      如果插入值后容量超过4096,则自动转换为BitmapContainer。因此正常使用的情况下不会出现容量超过4096的ArrayContainer。
      调用runOptimize()方法时,会比较和RunContainer的空间占用大小,选择是否转换为RunContainer。
    • BitmapContainer:
      如果删除某值后容量低至4096,则会自动转换为ArrayContainer。因此正常使用的情况下不会出现容量小于4096的BitmapContainer。
      调用runOptimize()方法时,会比较和RunContainer的空间占用大小,选择是否转换为RunContainer。
    • RunContainer:
      只有在调用runOptimize()方法才会发生转换,会分别和ArrayContainer、BitmapContainer比较空间占用大小,然后选择是否转换。

    参考

    《Better bitmap performance with Roaring bitmaps》
    《Consistently faster and smaller compressed bitmaps with Roaring》
    The Java® Language Specification
    https://www.jianshu.com/p/818ac4e90daf
    https://blog.csdn.net/luanpeng825485697/article/details/101110798

    相关文章

      网友评论

          本文标题:精确去重和Roaring BitMap (咆哮位图)

          本文链接:https://www.haomeiwen.com/subject/zcwrthtx.html