美文网首页
ClickHouse LowCardinality数据类型

ClickHouse LowCardinality数据类型

作者: 白奕新 | 来源:发表于2020-04-29 00:03 被阅读0次

    一、结论及原理

    1、what's this?

    (1)LowCardinality(Type),是一种数据类型上的字典编码封装。LowCardinality(String),又成StringWithDictionary,是最常见的LowCardinality类型。
    (2)使用LowCardinality编码的字段,在底层数据存储上做了修改。

    image.png
    (3)适用场景:适用于原始string字段冗长且去重后的计数值<1000w(It works the best when source strings are lengthy and the number of distinct values is not that big. There is no hard limit in ClickHouse, the results are usually good if number of distinct values is below 10M)。
    (4)优势:降低磁盘存储。对于单个字段的group by有单独特殊的设计,所以会比较快,多个字段以上就不一定了。(https://github.com/ClickHouse/ClickHouse/issues/4796

    2、how to use?

    create table datasets.base( 
    FlightDate Date,
    FlightNum LowCardinality(String),
    OriginAirportID LowCardinality(String)) ENGINE = MergeTree() PARTITION by FlightDate
    order by (FlightNum,OriginAirportID);
    

    二、Performance Test

    0、结论

    • write:写性能衰退30%
    • query:查询性能提升40%
    • disk usage:根据字段的稀疏程序,有不同程度的存储优化,从30%到100%

    1、source data

    使用官方数据给的美国航班表数据作为数据源。

    2、test for writing

    通过INSERT INTO SELECT的方式写入新表,评判写入性能。

    insert into TABLE select FlightDate,FlightNum,OriginAirportID,DestAirportID,Distance,ArrDelay from datasets.ontime where Year > 2008;
    
    (1)基础表
    create table datasets.base( FlightDate Date,
    FlightNum String,
    OriginAirportID String,
    DestAirportID String,
    Distance Int32,
    ArrDelay Int32) ENGINE = MergeTree() PARTITION by FlightDate
    order by (FlightNum,OriginAirportID);
    

    cost time for writing

    序号 耗时 速度
    1 48.069 sec 1.28 million rows/s., 41.75 MB/s
    2 48.398 sec 1.27 million rows/s., 41.46 MB/s.
    3 52.804 sec 1.16 million rows/s., 38.00 MB/s.

    avg cost time : 49.757 sec

    (2)LowCardinality表
    create table datasets.lowCardinality( FlightDate Date,
    FlightNum LowCardinality(String),
    OriginAirportID LowCardinality(String),
    DestAirportID LowCardinality(String),
    Distance Int32,
    ArrDelay Int32) ENGINE = MergeTree() PARTITION by FlightDate
    order by (FlightNum,OriginAirportID);
    

    cost time for writing

    序号 耗时 速度
    1 74.402 sec 826.37 thousand rows/s., 26.97 MB/s.
    2 67.489 sec 911.01 thousand rows/s., 29.73 MB/s.
    3 72.909 sec 843.29 thousand rows/s., 27.52 MB/s.

    avg cost time : 71.6 sec

    3、test for querying

    SELECT SUM(Distance) as all_distince, SUM(ArrDelay), FlightNum
    FROM TABLE
    WHERE (OriginAirportID LIKE '15%'
            OR OriginAirportID LIKE '13%')
        AND DestAirportID LIKE '13%'
    GROUP BY FlightNum, OriginAirportID
    ORDER BY all_distince DESC
    LIMIT 10;
    
    (1)基础表
    序号 耗时 速度
    1 1.368 sec 44.17 million rows/s., 469.02 MB/s.
    2 1.131 sec 53.43 million rows/s., 567.38 MB/s.
    3 1.208 sec. 50.00 million rows/s., 530.94 MB/s.

    avg cost time:1.236 sec

    (2)LowCardinality表
    序号 耗时 速度
    1 0.822 sec 73.53 million rows/s., 600.51 MB/s.
    2 0.754 sec 911.01 thousand rows/s., 29.73 MB/s.
    3 0.767 sec. 78.72 million rows/s., 642.94 MB/s.

    avg cost time:0.781 sec

    3、test for disk usage

    SELECT column, ANY(type), SUM(column_data_compressed_bytes) AS compressed
        , SUM(column_data_uncompressed_bytes) AS uncompressed
    FROM system.parts_columns
    WHERE table IN ('lowCardinality', 'base')
        AND active
        AND (column LIKE '%AirportID'
            OR column = 'FlightNum')
    GROUP BY column, table
    ORDER BY column ASC
    

    结果:


    image.png

    三、reference

    (1)https://raw.githubusercontent.com/ClickHouse/clickhouse-presentations/master/meetup19/string_optimization.pdf
    (2)https://www.altinity.com/blog/2019/3/27/low-cardinality

    后记:
    最近准备在线上慢慢使用上,有什么效果到时再来同步更新

    相关文章

      网友评论

          本文标题:ClickHouse LowCardinality数据类型

          本文链接:https://www.haomeiwen.com/subject/ofhwwhtx.html