（Caffeine）FrequencySketch源码解读

作者: CayChan | 来源:发表于2021-01-08 16:50 被阅读0次

（Caffeine）FrequencySketch源码解读
（Caffeine）TimerWheel源码解读
Caffeine Cache实战
AFN 3.0学习总结（最后的总结转载）
spring源码解读网上知识摘要
文章目录
Element分析（组件篇）——TableHeader
JDK源码-Object
【Java源码计划】LongAdder<rt.jar_ja
AFNetworking 3.0 源码解读（五）之 AFURLS

Caffeine应用w-TinyLFU算法实现数据频次的记录，主要解决两个问题：

尽量降低数据的占用空间：将每个key的最大命中次数上限设为15，用4个bit表示。每个long为64位，可分为16个4bit，这样每个long型整数可以保存16个。
尽量保持数据相对"新鲜"：达到一定条件时，就将所有key的命中频次减半（降频）。

做法：

用一个Long型一维数组保存所有数据，每个数组元素包括16个分段（下面将每个数组元素称为一个slot，每个4bit分段称为一个counter）
使用hash算法将每个key的频次映射到某个Counter中。并借鉴BloomFilter思想，用多种hash算法降低hash重复带来的误差。
每个key计算一个hash值，再用hash值计算出4种index，表示在哪些slot中；用hash值计算出在16个counter中的位置，每个slot中的counter下标相差1位，counter下标从右到左为0-16。
每次命中后将4个counter值加1，最多15。
由于每个元素的hash值都可能跟其他key相同，所以每次频次加1时，可能将其他key的频次加1。所以取这4个key的最小值最为该key的频次

image.png

/**
 * A probabilistic multiset for estimating the popularity of an element within a time window. The
 * maximum frequency of an element is limited to 15 (4-bits) and an aging process periodically
 * halves the popularity of all elements.
 *
 * @author ben.manes@gmail.com (Ben Manes)
 */
final class FrequencySketch<E> {

  /*
   * This class maintains a 4-bit CountMinSketch [1] with periodic aging to provide the popularity
   * history for the TinyLfu admission policy [2]. The time and space efficiency of the sketch
   * allows it to cheaply estimate the frequency of an entry in a stream of cache access events.
   *
   * The counter matrix is represented as a single dimensional array holding 16 counters per slot. A
   * fixed depth of four balances the accuracy and cost, resulting in a width of four times the
   * length of the array. To retain an accurate estimation the array's length equals the maximum
   * number of entries in the cache, increased to the closest power-of-two to exploit more efficient
   * bit masking. This configuration results in a confidence of 93.75% and error bound of e / width.
   *
   * The frequency of all entries is aged periodically using a sampling window based on the maximum
   * number of entries in the cache. This is referred to as the reset operation by TinyLfu and keeps
   * the sketch fresh by dividing all counters by two and subtracting based on the number of odd
   * counters found. The O(n) cost of aging is amortized, ideal for hardware prefetching, and uses
   * inexpensive bit manipulations per array location.
   *
   * [1] An Improved Data Stream Summary: The Count-Min Sketch and its Applications
   * http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
   * [2] TinyLFU: A Highly Efficient Cache Admission Policy
   * https://dl.acm.org/citation.cfm?id=3149371
   */

  //hash种子，用来计算4个不同的hash值
  static final long[] SEED = { // A mixture of seeds from FNV-1a, CityHash, and Murmur3
      0xc3a5c85c97cb3127L, 0xb492b66fbe98f273L, 0x9ae16a3b2f90404fL, 0xcbf29ce484222325L};
  static final long RESET_MASK = 0x7777777777777777L; //二进制 0111 0111 0111 0111 0111 0111 0111 0111 0111 0111 0111 0111 0111 0111 0111 0111
    static final long ONE_MASK = 0x1111111111111111L; //二进制 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001

  //size累加到sampleSize时，执行减半操作
  int sampleSize;
  //hash&tableMask相当于取模，得到slot的index
  int tableMask;
  //记录frequency的一维数组
  long[] table;
  //所有counter之和
  int size;

  /**
   * Creates a lazily initialized frequency sketch, requiring {@link #ensureCapacity} be called
   * when the maximum size of the cache has been determined.
   */
  @SuppressWarnings("NullAway.Init")
  public FrequencySketch() {}

  /**
   * Initializes and increases the capacity of this <tt>FrequencySketch</tt> instance, if necessary,
   * to ensure that it can accurately estimate the popularity of elements given the maximum size of
   * the cache. This operation forgets all previous counts when resizing.
   * 初始化table，若maximumSize=0，table长度为1（无界缓存，所有数据都被记录下来，也就无需记录每个数据的频次了）
   * 否则table长度为大于等于maximumSize的最小的2的整数倍。maximumSize大于0时，sampleSize=10倍数组长度。
   * 数组长度最大为Integer.MAX_VALUE的二分之一，也就是2147483647/2=1073741823，大概10.7亿
   * tableMask=数组长度减1
   *
   * 比如：
   * maximumSize=0，table长度为1，sampleSize=Integer.MAX_VALUE
   * maximumSize=10，table长度为16，sampleSize=160
   * maximumSize=100，table长度为128，sampleSize=1280
   *
   * @param maximumSize the maximum size of the cache
   */
  public void ensureCapacity(@NonNegative long maximumSize) {
    requireArgument(maximumSize >= 0);
    int maximum = (int) Math.min(maximumSize, Integer.MAX_VALUE >>> 1);
    if ((table != null) && (table.length >= maximum)) {
      return;
    }

    table = new long[(maximum == 0) ? 1 : Caffeine.ceilingPowerOfTwo(maximum)];
    tableMask = Math.max(0, table.length - 1);
    sampleSize = (maximumSize == 0) ? 10 : (10 * maximum);
    if (sampleSize <= 0) {
      sampleSize = Integer.MAX_VALUE;
    }
    size = 0;
  }

  /**
   * Returns if the sketch has not yet been initialized, requiring that {@link #ensureCapacity} is
   * called before it begins to track frequencies.
   */
  public boolean isNotInitialized() {
    return (table == null);
  }

  /**
   * Returns the estimated number of occurrences of an element, up to the maximum (15).
   *
   * 获取元素的频次，由于每个元素都有4个hash算法，在4个位置记录了4个频次，取其中最小的频次作为该元素的频次
   * 获取元素的counter下标和slot下标，再取对应的4bit的数据
   *
   * @param e the element to count occurrences of
   * @return the estimated number of occurrences of the element; possibly zero but never negative
   */
  @NonNegative
  public int frequency(@NonNull E e) {
    if (isNotInitialized()) {
      return 0;
    }

    int hash = spread(e.hashCode());
    int start = (hash & 3) << 2; //start为counter下标，这个算法start只可能为0 4 8 12的其中一种
    int frequency = Integer.MAX_VALUE;
    for (int i = 0; i < 4; i++) {
      int index = indexOf(hash, i); //index为每个slot下标
      //table[index]为当前slot的long整数。
      //假设start为4，二进制为0100，table[index]的二进制为0100 1010... 1101 0011 1000 0011 1001
      //(start + i) << 2 左移两位0001 0000，也就是十进制16，其实就是从右数，下标为4的counter的二进制的右边第一位，这个counter就是1101
      //(table[index] >>> 16也就是0000 0000 0000 0000 0100 1010... 1101
      //oxfL也就是15，二进制为0000 0000...0000 1111，跟"0000 0000 0000 0000 0100 1010... 1101"相与，就是只保留后4位，1101，也就是这个counter的值=该元素在该位置的频次
      int count = (int) ((table[index] >>> ((start + i) << 2)) & 0xfL);
      frequency = Math.min(frequency, count);
    }
    return frequency;
  }

  /**
   * Increments the popularity of the element if it does not exceed the maximum (15). The popularity
   * of all elements will be periodically down sampled when the observed events exceeds a threshold.
   * This process provides a frequency aging to allow expired long term entries to fade away.
   *
   * 增加元素的频次
   * 获取4个slot位置，获取counter位置，并为每个counter加1
   *
   * 假设slot1的counter下标=c。slot1的counter下标等于c+1，slot2的counter下标等于c+2，slot3的counter下标等于c+3
   * 这么做是因为计算counter的算法得到的只能为0,4,8,12这4种情况，为了使用所有16个counter，所以这么处理（也可能是理论原因所以这么计算counter）
   * @param e the element to add
   */
  public void increment(@NonNull E e) {
    if (isNotInitialized()) {
      return;
    }

    //hash后再打乱一次，使hashcode更加均匀
    int hash = spread(e.hashCode());
    //使用hash值二进制后两位计算counter下标
    int start = (hash & 3) << 2;

    //用不同seed获取4个slot下标
    // Loop unrolling improves throughput by 5m ops/s
    int index0 = indexOf(hash, 0);
    int index1 = indexOf(hash, 1);
    int index2 = indexOf(hash, 2);
    int index3 = indexOf(hash, 3);

    //尝试在每个counter上加1，最多15
    boolean added = incrementAt(index0, start);
    added |= incrementAt(index1, start + 1);
    added |= incrementAt(index2, start + 2);
    added |= incrementAt(index3, start + 3);

    if (added && (++size == sampleSize)) {
      reset();
    }
  }

  /**
   * Increments the specified counter by 1 if it is not already at the maximum value (15).
   * 这里跟frequency方法类似，目的是找到counter所在的4位bit，然后加1，最多=15
   * offset为counter的4位bit的最低位在64位bit中的下标。
   * 64bit分为16个counter。从右往左数，j为counter下标，offset为64bit的下标
   *
   * @param i the table index (16 counters)
   * @param j the counter to increment
   * @return if incremented
   */
  boolean incrementAt(int i, int j) {
    int offset = j << 2;
    //mask为掩码，counter的那4个bit，在mask中相同位置的4bit为1111，其他位置为0。
    long mask = (0xfL << offset);
    if ((table[i] & mask) != mask) {//(table[i] & mask) != mask，表示那4bit不全为1，也就是不等于15
      //1L也就是0001，左移offset位，也就是将这4位移到跟counter相同的位置，然后相加。
      //比如table[index]的二进制为0100 1010...1101 0011 1000 0011 1001，offset=12，counter的4bit为1101，十进制=13
      //1L << offset的结果为     0000 0000...0001 0000 0000 0000 0000。相加后1101变成1110，十进制=14
      //虽然counter只加了1，但table[i]加了"1 0000 0000 0000 0000"，十进制507376
      table[i] += (1L << offset);
      return true;
    }
    return false;
  }

  /** Reduces every counter by half of its original value. */
  void reset() {
    int count = 0;
    for (int i = 0; i < table.length; i++) {
      count += Long.bitCount(table[i] & ONE_MASK);//16个counter中频次为奇数的个数
      //table[i] >>> 1，整体右移1位，其中每4个bit也右移1位，相当于除2。但每个counter的高位是上一个bit的低位，可能为1
      //& RESET_MASK，抹去新counter的最高位，保留低三位。最终实现每个counter除2
      //        1100 1001 0001 0010
      // 右移1位 0110 0100 1000 1001
      //  相与后 0110 0100 0000 0001
      table[i] = (table[i] >>> 1) & RESET_MASK;
    }
    //新size=老size/2-奇数数据/4。没明白为什么要除以4
    size = (size >>> 1) - (count >>> 2);
  }

  /**
   * Returns the table index for the counter at the specified depth.
   *
   * @param item the element's hash
   * @param i the counter depth
   * @return the table index
   */
  int indexOf(int item, int i) {
    long hash = (item + SEED[i]) * SEED[i];
    hash += (hash >>> 32);
    return ((int) hash) & tableMask;
  }

  /**
   * Applies a supplemental hash function to a given hashCode, which defends against poor quality
   * hash functions.
   */
  int spread(int x) {
    x = ((x >>> 16) ^ x) * 0x45d9f3b;
    x = ((x >>> 16) ^ x) * 0x45d9f3b;
    return (x >>> 16) ^ x;
  }
}

（Caffeine）FrequencySketch源码解读
Caffeine应用w-TinyLFU算法实现数据频次的记录，主要解决两个问题：尽量降低数据的占用空间：将每...
（Caffeine）TimerWheel源码解读
Caffeine Cache实战
Caffeine简介 Caffeine是基于JAVA 1.8 Version的高性能缓存库。Caffeine提供的...
AFN 3.0学习总结（最后的总结转载）
AFNetworking 3.0 源码解读总结（干货）（上） AFNetworking 3.0 源码解读总结（...
spring源码解读网上知识摘要
spring 启动源码 spring 启动流程源码解读spring 启动注解加载类源码解读springxml 文件...
文章目录
Go 源码解读篇《Go源码解读篇》之常见数据结构(list) 《Go源码解读篇》之 Error 工作中知识总结 ...
Element分析（组件篇）——TableHeader
说明 table-header是表头组件，较为复杂，直接看源码解读。源码解读
JDK源码-Object
简介源码解读
【Java源码计划】LongAdder<rt.jar_ja
LongAdder 源码解读源码解读部分按照我得理解翻译和解读注解并添加相关的部分代码解读保持一个或者多个变量...
AFNetworking 3.0 源码解读（五）之 AFURLS
本篇是AFNetworking 3.0 源码解读的第五篇了。 AFNetworking 3.0 源码解读（一）之 ...