一、结论及原理
1、what's this?
(1)LowCardinality(Type),是一种数据类型上的字典编码封装。LowCardinality(String),又成StringWithDictionary,是最常见的LowCardinality类型。
(2)使用LowCardinality编码的字段,在底层数据存储上做了修改。
(3)适用场景:适用于原始string字段冗长且去重后的计数值<1000w(It works the best when source strings are lengthy and the number of distinct values is not that big. There is no hard limit in ClickHouse, the results are usually good if number of distinct values is below 10M)。
(4)优势:降低磁盘存储。对于单个字段的group by有单独特殊的设计,所以会比较快,多个字段以上就不一定了。(https://github.com/ClickHouse/ClickHouse/issues/4796)
2、how to use?
create table datasets.base(
FlightDate Date,
FlightNum LowCardinality(String),
OriginAirportID LowCardinality(String)) ENGINE = MergeTree() PARTITION by FlightDate
order by (FlightNum,OriginAirportID);
二、Performance Test
0、结论
- write:写性能衰退30%
- query:查询性能提升40%
- disk usage:根据字段的稀疏程序,有不同程度的存储优化,从30%到100%
1、source data
使用官方数据给的美国航班表数据作为数据源。
2、test for writing
通过INSERT INTO SELECT
的方式写入新表,评判写入性能。
insert into TABLE select FlightDate,FlightNum,OriginAirportID,DestAirportID,Distance,ArrDelay from datasets.ontime where Year > 2008;
(1)基础表
create table datasets.base( FlightDate Date,
FlightNum String,
OriginAirportID String,
DestAirportID String,
Distance Int32,
ArrDelay Int32) ENGINE = MergeTree() PARTITION by FlightDate
order by (FlightNum,OriginAirportID);
cost time for writing
序号 | 耗时 | 速度 |
---|---|---|
1 | 48.069 sec | 1.28 million rows/s., 41.75 MB/s |
2 | 48.398 sec | 1.27 million rows/s., 41.46 MB/s. |
3 | 52.804 sec | 1.16 million rows/s., 38.00 MB/s. |
avg cost time : 49.757 sec
(2)LowCardinality表
create table datasets.lowCardinality( FlightDate Date,
FlightNum LowCardinality(String),
OriginAirportID LowCardinality(String),
DestAirportID LowCardinality(String),
Distance Int32,
ArrDelay Int32) ENGINE = MergeTree() PARTITION by FlightDate
order by (FlightNum,OriginAirportID);
cost time for writing
序号 | 耗时 | 速度 |
---|---|---|
1 | 74.402 sec | 826.37 thousand rows/s., 26.97 MB/s. |
2 | 67.489 sec | 911.01 thousand rows/s., 29.73 MB/s. |
3 | 72.909 sec | 843.29 thousand rows/s., 27.52 MB/s. |
avg cost time : 71.6 sec
3、test for querying
SELECT SUM(Distance) as all_distince, SUM(ArrDelay), FlightNum
FROM TABLE
WHERE (OriginAirportID LIKE '15%'
OR OriginAirportID LIKE '13%')
AND DestAirportID LIKE '13%'
GROUP BY FlightNum, OriginAirportID
ORDER BY all_distince DESC
LIMIT 10;
(1)基础表
序号 | 耗时 | 速度 |
---|---|---|
1 | 1.368 sec | 44.17 million rows/s., 469.02 MB/s. |
2 | 1.131 sec | 53.43 million rows/s., 567.38 MB/s. |
3 | 1.208 sec. | 50.00 million rows/s., 530.94 MB/s. |
avg cost time:1.236 sec
(2)LowCardinality表
序号 | 耗时 | 速度 |
---|---|---|
1 | 0.822 sec | 73.53 million rows/s., 600.51 MB/s. |
2 | 0.754 sec | 911.01 thousand rows/s., 29.73 MB/s. |
3 | 0.767 sec. | 78.72 million rows/s., 642.94 MB/s. |
avg cost time:0.781 sec
3、test for disk usage
SELECT column, ANY(type), SUM(column_data_compressed_bytes) AS compressed
, SUM(column_data_uncompressed_bytes) AS uncompressed
FROM system.parts_columns
WHERE table IN ('lowCardinality', 'base')
AND active
AND (column LIKE '%AirportID'
OR column = 'FlightNum')
GROUP BY column, table
ORDER BY column ASC
结果:
image.png
三、reference
(1)https://raw.githubusercontent.com/ClickHouse/clickhouse-presentations/master/meetup19/string_optimization.pdf
(2)https://www.altinity.com/blog/2019/3/27/low-cardinality
后记:
最近准备在线上慢慢使用上,有什么效果到时再来同步更新
网友评论