Storage Format

Author: skyler_tao | Published 2017-01-21 17:27

    Overview (0.9.0)

    Data in Druid is stored in a custom column format known as a segment. Segments are composed of different types of columns. Column.java and the classes that extend it are a good place to start looking into the storage format.

    Basic classes

    ValueType

    An enum with four values:

    1. Float
    2. Long
    3. String
    4. Complex

    IndexedInts

    It has three main methods:

    int size();
    int get(int index);
    void fill(int index, int[] toFill);
    

    The main implementations are:

    1. EmptyIndexedInts
    2. IntBufferIndexedInts
    3. ListBasedIndexedInts
    4. VSizeIndexedInts

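    size() returns the number of elements still available to read or write in the underlying buffer;
    get(index) reads the element at position index in that buffer;
    fill() is intended to bulk-copy values into the toFill array, but none of the implementations currently support it.
    ListBasedIndexedInts stores its values in a List<Integer>.
    As the implementation names suggest, some of them use Java NIO buffers to operate on native memory.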
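    The contract described above can be illustrated with a minimal sketch. It assumes a local stand-in interface (IndexedIntsSketch) that only mirrors the three listed methods, and a hypothetical ListBackedInts class written in the spirit of ListBasedIndexedInts; neither is Druid's actual source.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in mirroring the three methods listed above (an assumption, not Druid's file).
interface IndexedIntsSketch {
    int size();
    int get(int index);
    void fill(int index, int[] toFill);
}

// Hypothetical list-backed implementation, similar in spirit to ListBasedIndexedInts.
class ListBackedInts implements IndexedIntsSketch {
    private final List<Integer> values;

    ListBackedInts(List<Integer> values) {
        this.values = values;
    }

    @Override
    public int size() {
        return values.size();
    }

    @Override
    public int get(int index) {
        return values.get(index);
    }

    @Override
    public void fill(int index, int[] toFill) {
        // Mirrors the behaviour described in the text: fill is not supported.
        throw new UnsupportedOperationException("fill not supported");
    }
}

class IndexedIntsDemo {
    public static void main(String[] args) {
        IndexedIntsSketch row = new ListBackedInts(Arrays.asList(3, 0, 7));
        for (int i = 0; i < row.size(); i++) {
            System.out.println(row.get(i));
        }
    }
}
```

    Callers iterate with size() and get(index) only, which is why the missing fill() support does not matter for ordinary reads.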

    ColumnCapabilities

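    Fields:

    private ValueType type = null;
    private boolean dictionaryEncoded = false;  // whether the column is dictionary encoded
    private boolean runLengthEncoded = false;  // whether the column is run-length encoded; run-length encoding is only a placeholder and can be ignored
    private boolean hasInvertedIndexes = false;  // whether the column has an inverted index
    private boolean hasSpatialIndexes = false;  // whether the column has a spatial index
    private boolean hasMultipleValues = false;  // whether the column has multi-value rows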
    

    DictionaryEncodedColumn

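    Basic methods:

    public int length();  // total number of rows in the dictionary-encoded column
    public boolean hasMultipleValues();  // whether any row holds multiple values
    public int getSingleValueRow(int rowNum);  // dictionary id of a single-value row
    public IndexedInts getMultiValueRow(int rowNum);  // dictionary ids of a multi-value row
    public String lookupName(int id);  // value for a dictionary id; note that both null and empty strings are converted to null
    public int lookupId(String name);  // dictionary id for a value
    public int getCardinality();  // cardinality, i.e. the size of the dictionary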
    

    The only implementation is SimpleDictionaryEncodedColumn, which has three fields:

    private final IndexedInts column;
    private final IndexedMultivalue<IndexedInts> multiValueColumn;
    private final CachingIndexed<String> cachedLookups;
    

    The interesting one is cachedLookups, which stores the dictionary.
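    To make the relationship between rows, ids and the dictionary concrete, here is a rough sketch of a single-value dictionary-encoded column. The class DictionaryColumnSketch is hypothetical: it reuses the method names listed above, but it keeps everything in plain arrays, does not handle null values, and returns -1 for unknown values, all of which are simplifying assumptions rather than Druid's behaviour.

```java
import java.util.Arrays;

// Hypothetical single-value dictionary-encoded column (a sketch, not Druid's class).
class DictionaryColumnSketch {
    private final String[] dictionary; // sorted distinct values
    private final int[] rows;          // one dictionary id per row

    DictionaryColumnSketch(String[] rawValues) {
        // Build a sorted dictionary of distinct values (null values are not handled here).
        this.dictionary = Arrays.stream(rawValues).distinct().sorted().toArray(String[]::new);
        this.rows = new int[rawValues.length];
        for (int i = 0; i < rawValues.length; i++) {
            // Each row stores only the position of its value in the dictionary.
            rows[i] = Arrays.binarySearch(dictionary, rawValues[i]);
        }
    }

    int length() {
        return rows.length;
    }

    int getSingleValueRow(int rowNum) {
        return rows[rowNum];
    }

    String lookupName(int id) {
        return dictionary[id];
    }

    int lookupId(String name) {
        // Simplification: report -1 for values that are not in the dictionary.
        int id = Arrays.binarySearch(dictionary, name);
        return id >= 0 ? id : -1;
    }

    int getCardinality() {
        return dictionary.length;
    }
}

class DictionaryColumnDemo {
    public static void main(String[] args) {
        DictionaryColumnSketch column =
            new DictionaryColumnSketch(new String[] {"us", "cn", "us", "de"});
        for (int row = 0; row < column.length(); row++) {
            int id = column.getSingleValueRow(row);
            System.out.println(row + " -> id " + id + " -> " + column.lookupName(id));
        }
    }
}
```

    Because the dictionary is sorted, lookupId can use binary search, and each row costs one int regardless of how long the string value is.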

    CachingIndexed

    The concrete class that backs the dictionary. It implements the Indexed interface; the other main implementations are:

    1. GenericIndexed
    2. ArrayIndexed
    3. BufferIndexed
    4. ListIndexed
    5. VSizeIndexed

    CachingIndexed wraps a given GenericIndexed and uses an LRU map, SizedLRUMap<Integer, T>, to store cachedValues.
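    The wrap-and-cache idea can be sketched as follows. This is not Druid's CachingIndexed: the class CachingLookupSketch is hypothetical, the wrapped GenericIndexed is replaced by an IntFunction<T>, and the LRU map is a LinkedHashMap bounded by entry count, which is an assumption about how such a cache could be bounded rather than a description of SizedLRUMap.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical caching wrapper; 'delegate' stands in for the wrapped index's get(int).
class CachingLookupSketch<T> {
    private final IntFunction<T> delegate;
    private final Map<Integer, T> cachedValues;
    int misses = 0; // exposed only so the example can observe cache behaviour

    CachingLookupSketch(IntFunction<T> delegate, final int maxEntries) {
        this.delegate = delegate;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cachedValues = new LinkedHashMap<Integer, T>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, T> eldest) {
                // Evict the least recently used entry once the bound is exceeded.
                return size() > maxEntries;
            }
        };
    }

    T get(int index) {
        T cached = cachedValues.get(index);
        if (cached != null) {
            return cached;
        }
        misses++;
        T value = delegate.apply(index);
        if (value != null) {
            cachedValues.put(index, value);
        }
        return value;
    }
}

class CachingLookupDemo {
    public static void main(String[] args) {
        String[] dictionary = {"cn", "de", "us"};
        CachingLookupSketch<String> cache =
            new CachingLookupSketch<>(i -> dictionary[i], 2);
        System.out.println(cache.get(0));
        System.out.println(cache.get(0));
        System.out.println("misses: " + cache.misses);
    }
}
```

    Caching pays off here because decoding a dictionary entry from a byte buffer into a String is far more expensive than a map lookup, and queries tend to hit the same ids repeatedly.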

    GenericIndexed

    A generic, flat storage mechanism. Use static methods fromArray() or fromIterable() to construct. If input is sorted, supports binary search index lookups. If input is not sorted, only supports array-like index lookups.
    V1 Storage Format:

    • byte 1: version (0x1)
    • byte 2 == 0x1 => allowReverseLookup
    • bytes 3-6 => numBytesUsed
    • bytes 7-10 => numElements
    • bytes 10-((numElements * 4) + 10): integers representing 'end' offsets of byte serialized values
    • bytes ((numElements * 4) + 10)-(numBytesUsed + 2): 4-byte integer representing length of value, followed by bytes for value
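    The layout above can be exercised with a small round-trip sketch. The class GenericIndexedV1Sketch is hypothetical and rests on three assumptions that the list above does not spell out: byte positions are counted from 0, numBytesUsed covers the numElements field plus the offset table plus the values, and each 'end' offset is relative to the start of the values section. Values are encoded as UTF-8 strings for simplicity.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical writer/reader for the V1 layout described above (a sketch under stated assumptions).
class GenericIndexedV1Sketch {
    static ByteBuffer write(String[] values, boolean allowReverseLookup) {
        byte[][] encoded = new byte[values.length][];
        int valuesSize = 0;
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            valuesSize += 4 + encoded[i].length; // 4-byte length prefix + value bytes
        }
        // Assumption: numBytesUsed = numElements field + offset table + values section.
        int numBytesUsed = 4 + 4 * values.length + valuesSize;
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 + numBytesUsed);
        buf.put((byte) 0x1);                               // version
        buf.put((byte) (allowReverseLookup ? 0x1 : 0x0));  // allowReverseLookup flag
        buf.putInt(numBytesUsed);
        buf.putInt(values.length);                         // numElements
        int end = 0;
        for (byte[] e : encoded) {
            end += 4 + e.length;
            buf.putInt(end); // 'end' offset, relative to the start of the values section
        }
        for (byte[] e : encoded) {
            buf.putInt(e.length);
            buf.put(e);
        }
        buf.flip();
        return buf;
    }

    static String get(ByteBuffer buf, int index) {
        int numElements = buf.getInt(6);
        if (index < 0 || index >= numElements) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int offsetsStart = 10;
        int valuesStart = offsetsStart + 4 * numElements;
        // A value starts where the previous one ended; the first starts at 0.
        int start = (index == 0) ? 0 : buf.getInt(offsetsStart + 4 * (index - 1));
        int end = buf.getInt(offsetsStart + 4 * index);
        byte[] bytes = new byte[end - start - 4]; // skip the 4-byte length prefix
        ByteBuffer view = buf.duplicate();
        view.position(valuesStart + start + 4);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}

class GenericIndexedV1Demo {
    public static void main(String[] args) {
        ByteBuffer buf = GenericIndexedV1Sketch.write(new String[] {"a", "bcd", ""}, true);
        for (int i = 0; i < 3; i++) {
            System.out.println(i + " -> '" + GenericIndexedV1Sketch.get(buf, i) + "'");
        }
    }
}
```

    The offset table is what makes array-like lookups cheap: reading element i takes two int reads and one slice, without scanning the preceding values.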

    Fields:

    private final ByteBuffer theBuffer;  // the backing ByteBuffer
    private final ObjectStrategy<T> strategy;
    private final boolean allowReverseLookup;
    private final int size;  // the int read from theBuffer at its current position (numElements in the format above)
    private final int valuesOffset;
    private final BufferIndexed bufferIndexed;  // the inner class BufferIndexed
    

    Column

    An interface; see the implementing class for details.

    SimpleColumn

    Fields:

    
    private final ColumnCapabilities capabilities;
    private final Supplier<DictionaryEncodedColumn> dictionaryEncodedColumn;
    private final Supplier<RunLengthColumn> runLengthColumn;
    private final Supplier<GenericColumn> genericColumn;
    private final Supplier<ComplexColumn> complexColumn;
    private final Supplier<BitmapIndex> bitmapIndex;
    private final Supplier<SpatialIndex> spatialIndex;
    
    

