Roaring Bitmap Principles

Author: lj72808up | Published: 2023-07-05 16:57

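I. What are bitmaps for?

  1. A bitmap is an array of bits, stored as Array[Byte], that represents a set of integers, Set[Integer]. If the set contains an integer n, the n-th bit of the array is set to 1.
  2. Because of this representation, a bitmap can use the CPU's bitwise AND and bitwise OR instructions to compute the intersection and union of two integer sets very quickly. The cost is still linear in the length of the bitmap, not O(1), but each instruction processes a whole machine word (for example 64 bits) at a time, so the constant factor is very small.
    Suppose there are 1 billion documents, numbered 1 to 1 billion. How do we find the documents that contain both the word carrier and the word pigeon?
    Represent the set of document IDs containing carrier as arr1:Array[Byte] and the set of document IDs containing pigeon as arr2:Array[Byte]. The documents containing both words are given by the bitwise AND of the two bit arrays.
  3. Plain bitmaps have one drawback: when the largest value in the set is very large but the set has few elements, a huge amount of space is wasted.
    For example, the set [1, 1000000000] contains only 2 integers, yet it takes 1 billion bits (about 125 MB) to represent it.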
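
The points above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names `make_bitmap`, `intersect` and `to_list` are made up for this example, the universe is tiny (16 documents), and the AND is done one byte at a time rather than one machine word at a time.

```python
def make_bitmap(values, universe_size):
    """Build a plain bitmap (one bit per possible integer) as a bytearray."""
    bits = bytearray((universe_size + 7) // 8)
    for n in values:
        bits[n >> 3] |= 1 << (n & 7)  # set bit n
    return bits


def intersect(a, b):
    """Bitwise AND of two equal-length bitmaps, one byte at a time."""
    return bytearray(x & y for x, y in zip(a, b))


def to_list(bits):
    """Decode a bitmap back into the sorted list of integers it contains."""
    return [i for i in range(len(bits) * 8) if (bits[i >> 3] >> (i & 7)) & 1]


# Hypothetical document IDs containing each word.
carrier = make_bitmap([1, 5, 9, 12], universe_size=16)
pigeon = make_bitmap([5, 7, 12, 15], universe_size=16)
both = to_list(intersect(carrier, pigeon))  # documents containing both words

# Sparse case: the size depends on the largest value, not on the element count.
sparse_bytes = (1_000_000_000 + 1 + 7) // 8  # bytes needed to hold {1, 1000000000}
```

The work done by `intersect` grows with the bitmap length, which is why the sparse set costs as much to store and scan as a dense one.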

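II. What are Roaring bitmaps for?

Roaring bitmaps build on traditional bitmaps and use compression to solve the sparsity problem. Specifically, a Roaring bitmap partitions a set of 32-bit integers into buckets (containers) by their high 16 bits, giving at most 2^{16}=65536 containers. To store an integer, look up its container by the high 16 bits (creating one if none exists), then put the low 16 bits into that container. There are two common container types: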

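  1. ArrayContainer
    Used when the container holds at most 4096 values. It is essentially a sorted array of unsigned short values (exactly 16 bits each): Array[Short]. The array starts with a capacity of 4 and grows automatically as values are added, up to a maximum length of 4096, so an ArrayContainer occupies from 4 * 2B = 8B initially up to 4096 * 2B = 8KB. It also maintains a counter that tracks the cardinality.

  2. BitmapContainer
    Used when the container holds more than 4096 values. It is essentially a traditional bitmap with a fixed length of 2^{16} bits (8KB), able to hold 2^{16} integers. Physically it is a fixed-length array of 1024 unsigned long values (64 bits, 8B each): Array[Long] (size=1024), so the bitmap always occupies 8KB. 4096 is the break-even point: beyond it, a sorted array of 16-bit values would take more than these 8KB. It also has a counter.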
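
A minimal sketch of this layout, assuming a Python dict keyed by the high 16 bits, an `array("H")` for the array container and a list of 1024 integers for the bitmap container. The class name `RoaringSketch` and the constant `ARRAY_MAX` are hypothetical, and the cardinality counter is left out for brevity; real libraries differ in many details.

```python
from array import array
from bisect import bisect_left

ARRAY_MAX = 4096  # largest cardinality kept in an array container


class RoaringSketch:
    def __init__(self):
        self.containers = {}  # high 16 bits -> array container or bitmap container

    def add(self, n):
        high, low = n >> 16, n & 0xFFFF  # bucket key, value stored in the bucket
        c = self.containers.get(high)
        if c is None:
            c = self.containers[high] = array("H")  # new buckets start as arrays
        if isinstance(c, array):
            i = bisect_left(c, low)  # keep the array sorted
            if i < len(c) and c[i] == low:
                return  # already present
            if len(c) < ARRAY_MAX:
                c.insert(i, low)
                return
            # The array is full: convert it to a 1024 x 64-bit bitmap.
            words = [0] * 1024
            for v in c:
                words[v >> 6] |= 1 << (v & 63)
            c = self.containers[high] = words
        c[low >> 6] |= 1 << (low & 63)  # set the bit in the bitmap container


rb = RoaringSketch()
for n in range(4096):
    rb.add(n)        # bucket 0 is still an array container with 4096 values
rb.add(5000)         # the 4097th value converts bucket 0 to a bitmap container
rb.add(3 * 65536 + 7)  # a different high half goes to a new bucket, key 3
```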

III. How are exist, union, and intersect computed on Roaring bitmaps?

  1. Checking whether an integer N is in the set
    To check if an integer N exists, take N's 16 most significant bits (N / 2^16) and use them to find N's corresponding container in the Roaring bitmap.

If the container doesn't exist, then N is not in the Roaring bitmap.

Checking for existence in array and bitmap containers works differently:

Bitmap: check if the bit at N % 2^16 is set.
Array: use binary search to find N % 2^16 in the sorted array.
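
The lookup can be sketched as follows, assuming the containers live in a dict keyed by the high 16 bits, with an `array("H")` standing in for an array container and a list of 1024 64-bit words for a bitmap container. The function name `contains` and the sample data are made up for this example.

```python
from array import array
from bisect import bisect_left


def contains(containers, n):
    high, low = n >> 16, n & 0xFFFF  # N / 2^16 and N % 2^16
    c = containers.get(high)
    if c is None:
        return False  # no container for this chunk, so N is absent
    if isinstance(c, array):
        i = bisect_left(c, low)  # binary search in the sorted array
        return i < len(c) and c[i] == low
    return bool((c[low >> 6] >> (low & 63)) & 1)  # test one bit of the bitmap


# Hypothetical sample: chunk 0 is an array container, chunk 2 a bitmap container.
bitmap = [0] * 1024
bitmap[0] |= 1 << 9
containers = {0: array("H", [3, 100, 4095]), 2: bitmap}

in_array = contains(containers, 100)
in_bitmap = contains(containers, 2 * 65536 + 9)
missing_chunk = contains(containers, 5 * 65536)
```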

  2. Computing the intersection
    To intersect Roaring bitmaps A and B, it is sufficient to intersect matching containers in A and B.

This is possible because of how integers are partitioned in Roaring bitmaps: matching containers in A and B store integers with the same 16 most significant bits (the same chunks).

Intersection algorithms vary by the types of the containers involved, as do the resulting container types:

Bitmap / Bitmap: Compute the bitwise AND of the two bitmaps. If the cardinality is <= 4,096, store the result in an array container, otherwise store it in a bitmap container.
Bitmap / Array: Iterate over the array, checking for the existence of each 16-bit integer in the bitmap. If the integer exists, add it to the resulting array container – note that intersections of bitmap and array container types will always create an array container.
Array / Array: Intersections of two array containers always create a new array container. The algorithm used to compute the intersection is chosen by a cardinality heuristic: it is either a simple merge (as used in merge sort) or a galloping intersection.
If there is a container in either Roaring bitmap without a corresponding container in the other, it will not exist in the result: the intersection of an empty set and any set is an empty set.
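
The three cases can be sketched as below, under the same assumptions as before: a dict keyed by the high 16 bits, `array("H")` for array containers and a list of 1024 words for bitmap containers. All function names are hypothetical. The array / array case only shows the simple merge, not the galloping variant, and dropping empty result containers is a choice made for this sketch, not something the text specifies.

```python
from array import array

ARRAY_MAX = 4096


def bitmap_from(values):
    words = [0] * 1024
    for v in values:
        words[v >> 6] |= 1 << (v & 63)
    return words


def popcount(words):
    return sum(bin(w).count("1") for w in words)


def bitmap_to_array(words):
    return array("H", (i for i in range(65536) if (words[i >> 6] >> (i & 63)) & 1))


def intersect_containers(a, b):
    a_is_array, b_is_array = isinstance(a, array), isinstance(b, array)
    if a_is_array and b_is_array:
        # Array / Array: simple merge of two sorted arrays.
        out, i, j = array("H"), 0, 0
        while i < len(a) and j < len(b):
            if a[i] < b[j]:
                i += 1
            elif a[i] > b[j]:
                j += 1
            else:
                out.append(a[i])
                i += 1
                j += 1
        return out
    if a_is_array or b_is_array:
        # Bitmap / Array: probe the bitmap for each array value.
        arr, words = (a, b) if a_is_array else (b, a)
        return array("H", (v for v in arr if (words[v >> 6] >> (v & 63)) & 1))
    # Bitmap / Bitmap: word-wise AND, then pick the container type by cardinality.
    words = [x & y for x, y in zip(a, b)]
    return bitmap_to_array(words) if popcount(words) <= ARRAY_MAX else words


def intersect(A, B):
    result = {}
    for high in A.keys() & B.keys():  # only chunks present in both bitmaps
        c = intersect_containers(A[high], B[high])
        if len(c) > 0:  # sketch-only choice: skip empty array results
            result[high] = c
    return result
```

For example, `intersect({0: array("H", [1, 2, 3, 10]), 1: array("H", [5])}, {0: array("H", [2, 10, 11])})` keeps only chunk 0, because chunk 1 has no counterpart in the second bitmap.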

  3. Computing the union
    To union Roaring bitmaps A and B, union all matching containers in A and B.

Union algorithms vary by the container types involved, as do the resulting container types:

Bitmap / Bitmap: Compute the bitwise OR of the two bitmaps. Unions of two bitmap containers will always create another bitmap container.
Bitmap / Array: Copy the bitmap and set corresponding bits for all the integers in the array container. Unions of a bitmap and array container will always create another bitmap container.
Array / Array: If the sum of the cardinalities of the two array containers is <= 4,096, the resulting container will be an array container. In this case, add all integers from both arrays to a new array container. Otherwise, optimistically assume the resulting container will be a bitmap: create a new bitmap container and set all corresponding bits for all integers in both arrays. If the cardinality of the resulting container is <= 4,096, convert the bitmap container back into an array container.
Finally, add all containers in A and B that do not have a matching container to the result. Remember: this is a union, so all integers in Roaring bitmaps A and B must be in the resulting set.
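
The union rules can be sketched in the same style, again assuming a dict keyed by the high 16 bits, `array("H")` for array containers and a list of 1024 words for bitmap containers. The function names are hypothetical, and unmatched containers are shared by reference here for brevity, where a real implementation would likely copy them.

```python
from array import array

ARRAY_MAX = 4096


def bitmap_from(values):
    words = [0] * 1024
    for v in values:
        words[v >> 6] |= 1 << (v & 63)
    return words


def popcount(words):
    return sum(bin(w).count("1") for w in words)


def bitmap_to_array(words):
    return array("H", (i for i in range(65536) if (words[i >> 6] >> (i & 63)) & 1))


def union_containers(a, b):
    a_is_array, b_is_array = isinstance(a, array), isinstance(b, array)
    if a_is_array and b_is_array:
        if len(a) + len(b) <= ARRAY_MAX:
            # The result is guaranteed to fit in an array container.
            return array("H", sorted(set(a) | set(b)))
        # Optimistically build a bitmap, convert back if it turns out small.
        words = bitmap_from(a)
        for v in b:
            words[v >> 6] |= 1 << (v & 63)
        return bitmap_to_array(words) if popcount(words) <= ARRAY_MAX else words
    if a_is_array or b_is_array:
        # Bitmap / Array: copy the bitmap and set the array's bits in the copy.
        arr, words = (a, b) if a_is_array else (b, a)
        words = list(words)
        for v in arr:
            words[v >> 6] |= 1 << (v & 63)
        return words
    # Bitmap / Bitmap: word-wise OR always yields a bitmap container.
    return [x | y for x, y in zip(a, b)]


def union(A, B):
    result = {}
    for high in A.keys() | B.keys():
        if high in A and high in B:
            result[high] = union_containers(A[high], B[high])
        else:
            # Unmatched containers go into the result unchanged.
            result[high] = A[high] if high in A else B[high]
    return result
```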
