A Summary of Big Data File Formats and Compression Algorithms

Author: chenfh5 | Published: 2018-03-05 00:04

    A short summary of file formats and compression algorithms in Hadoop/Hive. Contents:

    0. Overview
    1. File formats
    2. Compression algorithms
    3. Others
    4. Reference
    

    0. Overview

    File formats and compression algorithms are a closely watched optimization point in big data systems, and the two are usually tuned together.


    1. File formats

    A file format is the way in which information is stored or encoded in a computer file. In Hive, it refers to how records are stored inside the file. Because Hive deals with structured data, each record must follow a defined structure, and the way those records are encoded in a file is what defines the file format.

    File format  | Characteristics                                         | Hive storage option
    -------------|---------------------------------------------------------|-----------------------
    TextFile     | plain text, default format                              | STORED AS TEXTFILE
    SequenceFile | row-based, binary key-value, splittable                 | STORED AS SEQUENCEFILE
    Avro         | row-based, binary or JSON, splittable                   | STORED AS AVRO
    RCFile       | columnar, RLE                                           | STORED AS RCFILE
    ORCFile      | Optimized Row Columnar (improved RCFile), flattened     | STORED AS ORC
    Parquet      | column-oriented binary file, supports nested data       | STORED AS PARQUET
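
    To make the row-based versus columnar distinction in the table concrete, here is a minimal Python sketch. It is a toy model rather than how any of these formats is actually implemented: the records, field names, and helper functions (to_row_layout, to_column_layout) are made up for illustration.

```python
# Toy records; the field names and values are hypothetical.
records = [
    {"id": 1, "city": "Beijing", "amount": 10},
    {"id": 2, "city": "Beijing", "amount": 25},
    {"id": 3, "city": "Shanghai", "amount": 7},
]


def to_row_layout(rows):
    """Row-based layout: all fields of one record are stored together."""
    return [tuple(r.values()) for r in rows]


def to_column_layout(rows):
    """Columnar layout: all values of one field are stored together."""
    return {name: [r[name] for r in rows] for name in rows[0]}


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# A query such as SUM(amount) reads a single list in the columnar layout,
# while the row layout has to walk through every field of every record.
total_from_columns = sum(column_layout["amount"])
total_from_rows = sum(row[2] for row in row_layout)

print(row_layout)
print(column_layout)
```

    Storing each column contiguously also places similar values next to each other (for example the repeated city names), which is one reason columnar formats tend to pair well with encodings such as RLE.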

    2. Compression algorithms

    Choosing a compression codec means balancing the CPU needed to compress and decompress the data, the disk I/O needed to read and write it, and the network bandwidth needed to send it across the network.

    Compression is not recommended if your data is already compressed (such as images in JPEG format). The resulting file can even be larger than the original.
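
    The effect is easy to reproduce. The sketch below uses Python's standard zlib module and treats random bytes as a stand-in for already-compressed data, which is an assumption made for illustration (a real JPEG would behave similarly but not identically).

```python
import os
import zlib

text = b"hadoop hive orc parquet " * 4096   # highly repetitive input
noise = os.urandom(len(text))                # stand-in for already-compressed data


def ratio(data):
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


text_ratio = ratio(text)
noise_ratio = ratio(noise)

print(f"repetitive text: {text_ratio:.4f}")
print(f"random bytes:    {noise_ratio:.4f}")
```

    The repetitive text shrinks to a small fraction of its size, while the random input comes out slightly larger than it went in because the codec still has to add its own framing bytes.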

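    Compression format | Characteristics                                                              | Splittable
    -------------------|------------------------------------------------------------------------------|------------------------------------------
    DEFLATE            | Hadoop's DefaultCodec                                                        | no
    GZip               | uses more CPU than Snappy or LZO; higher compression ratio; good for cold data | no
    BZip2              | compresses more than GZip                                                    | yes
    LZO                | better choice for hot data                                                   | yes, if indexed
    LZ4                | significantly faster than LZO                                                | no
    Snappy             | performs better than LZO; better choice for hot data                         | no on its own; yes inside a splittable container format such as SequenceFile or Avro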
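
    The trade-offs in the table can be measured on your own data with a small harness. The sketch below only covers the codecs that ship with the Python standard library (DEFLATE via zlib, gzip, bzip2); Snappy, LZO, and LZ4 need third-party packages and are left out. The log-like sample is synthetic, so the numbers it prints are illustrative only.

```python
import bz2
import gzip
import time
import zlib

# Synthetic log-like sample; real results depend heavily on the data.
sample = b"".join(
    b"2018-03-05 00:04:%02d INFO user=%d action=click\n" % (i % 60, i % 1000)
    for i in range(20_000)
)

CODECS = {
    "deflate": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}


def benchmark(data):
    """Return compression ratio and compression time for each codec."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Sanity check: every codec must round-trip the input exactly.
        assert decompress(packed) == data
        results[name] = {"ratio": len(packed) / len(data), "seconds": elapsed}
    return results


results = benchmark(sample)
for name, stats in results.items():
    print(f"{name:8s} ratio={stats['ratio']:.3f} time={stats['seconds']:.3f}s")
```

    Running the same harness on a representative slice of production data is usually more informative than any general ranking, since both ratio and speed vary with the input.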

    3. Others

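    • Run-length encoding (RLE): commonly used in columnar storage, e.g. AAAABBBCCDEEEE -> 4A3B2C1D4E
    • Erasure coding (EC): introduced in Hadoop 3.0.0 as an alternative to replication; because of its high bandwidth and CPU cost it is mostly used for cold data; k original blocks + m parity blocks
    • Doc Values: greatest-common-divisor compression, offset encoding, sorted by doc ID, backed by memory-mapped files (mmap) with read-ahead
    • Skip list
    • Bitset: [1,3,4,7,10] -> [1,0,1,1,0,0,1,0,0,1]
    • Roaring Bitmap (an improvement on the bitset): similar in spirit to RLE, e.g. 4A3B
    • Frame of Reference encoding
    • Delta encoding: [73,300,302,332,343,372] -> [73,227,2,30,11,29]
    • Term index: trie
    • Term dictionary
    • Finite state transducers (FST)
    • Move dimension fields up into the parent document instead of repeating them in every child document, which reduces index size
    • Segment: a single int is enough to store it
    • HyperLogLog
    • Pipeline aggregation: aggregating over the results of a previous aggregation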
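
    Three of the encodings in the list (run-length encoding, the bitset, and delta encoding) are small enough to sketch directly. The Python below reproduces the examples given in the list; the function names are made up for illustration, and production systems use far more compact binary representations than Python lists and strings.

```python
from itertools import groupby


def rle_encode(text):
    """Run-length encode a string: 'AAAABBB' -> '4A3B'."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))


def rle_decode(encoded):
    """Inverse of rle_encode; assumes the symbols are non-digit characters."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch          # accumulate a possibly multi-digit run length
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)


def to_bitset(ids):
    """Positive ids -> dense 0/1 list where position i (1-based) marks id i."""
    members = set(ids)
    return [1 if i in members else 0 for i in range(1, max(ids) + 1)]


def delta_encode(values):
    """Keep the first value, then store the difference to the previous value."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


print(rle_encode("AAAABBBCCDEEEE"))
print(to_bitset([1, 3, 4, 7, 10]))
print(delta_encode([73, 300, 302, 332, 343, 372]))
```

    Delta encoding pays off on sorted sequences such as doc ID lists, because the differences are much smaller than the absolute values and therefore need fewer bits each.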

    4. Reference
