ClickHouse Tech Sharing, Round 2 (English Lecture Notes)

Author: LittleMagic | Published 2020-06-10 21:55

    Preface

    Below are the lecture notes for today's ClickHouse tech sharing session with my colleagues. Slides are too much work to make, so I simply wrote everything in Markdown and rendered it with the Markdown Preview Plus extension for Chrome, which works very well.

    The full text follows. It is all concentrated essentials, covering everything from my two earlier posts, "An Introduction to Materialized Views with Usage Examples in ClickHouse" and "ClickHouse Better Practices", plus some new material, such as:

    • Combinator suffixes of ClickHouse aggregate functions
    • Read amplification of distributed JOIN/IN and the GLOBAL keyword
    • Workarounds for ClickHouse SQL's lack of windowed analytical functions
      • Examples: ranking lists and YoY/MoM calculations
      • Plus the usage of ARRAY JOIN
    • The LowCardinality data type
    • The MergeTree index structure

    There was no time to cover arrays and higher-order functions; they will come up later.

    Let's begin~


    Advanced Usage & Better Practice of ClickHouse

    Part I - Materialized View

    Intro

    • Materialized view (MV): a persisted copy of a query result set

    • MVs ≠ normal views, but ≈ tables

    • Trades space for time

    • Exists in various DBMSs (Oracle/SQL Server/PostgreSQL/...)

    • MV in ClickHouse = Precomputation + Incremental refreshing + Explicit data cache

    • Usage: relieves the cluster from frequent, patterned aggregation queries

    Engines

    • MaterializedView: Implicit

    • (Replicated)AggregatingMergeTree: Performs automatic aggregation upon insertion according to user-defined logic

    • Distributed: Just like distributed tables before

    Creation

    • Best-selling merchandise points: PV / UV / first visit time / last visit time

    [Image removed here: it contained business data]

    CREATE MATERIALIZED VIEW IF NOT EXISTS dw.merchandise_point_pvuv_agg
    ON CLUSTER sht_ck_cluster_1
    ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/tables/{shard}/dw/merchandise_point_pvuv_agg','{replica}')
    PARTITION BY ts_date
    ORDER BY (ts_date,site_id,point_index,merchandise_id)
    SETTINGS index_granularity = 8192
    [POPULATE] AS SELECT
      ts_date,
      site_id,
      site_name,
      point_index,
      merchandise_id,
      merchandise_abbr,
      sumState(1) AS pv,
      uniqState(user_id) AS uv,
      maxState(ts_date_time) AS last_time,
      minState(ts_date_time) AS first_time
    FROM ods.analytics_access_log
    WHERE event_type = 'shtOpenGoodsDetail'
    AND active_id = 0
    AND site_id >= 0 AND merchandise_id >= 0 AND point_index >= 0
    GROUP BY ts_date,site_id,site_name,point_index,merchandise_id,merchandise_abbr;
    
    • MVs can have partition keys, order (primary) keys and setting parameters (again, like tables)

    • The POPULATE keyword:

      • Without POPULATE = only data inserted into the source table after the MV is created gets computed

      • With POPULATE = all historical data is computed while the MV is created, but new data ingested during that period is ignored

    • sum/uniq/max/minState() ???

    Under the Hood

    Distributed MV

    CREATE TABLE IF NOT EXISTS dw.merchandise_point_pvuv_agg_all
    ON CLUSTER sht_ck_cluster_1
    AS dw.merchandise_point_pvuv_agg
    ENGINE = Distributed(sht_ck_cluster_1,dw,merchandise_point_pvuv_agg,rand());
    

    Query

    SELECT
      merchandise_id,
      merchandise_abbr,
      sumMerge(pv) AS pv,
      uniqMerge(uv) AS uv,
      maxMerge(last_time) AS last_time,
      minMerge(first_time) AS first_time,
      arrayStringConcat(groupUniqArray(site_name),'|') AS site_names
    FROM dw.merchandise_point_pvuv_agg_all
    WHERE ts_date = today()
    AND site_id IN (10030,10031,10036,10037,10038)
    AND point_index = 2
    GROUP BY merchandise_id,merchandise_abbr
    ORDER BY pv DESC LIMIT 10;
    

    [Image removed here: it contained business data]

    • sum/uniq/max/minMerge() ???

    Part II - Aggregate Function Combinators

    -State

    • Does not return the aggregation result directly, but keeps an intermediate result (a "state") of the aggregation process

    • e.g. uniqState() keeps the hash table for cardinality approximation

    • Aggregate functions combined with -State will produce a column of type AggregateFunction(func,type)

    • AggregateFunction columns cannot be queried directly

    [Image removed here: it contained business data]
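
    • For instance, a minimal sketch on the built-in numbers() table (no business data involved):
    SELECT toTypeName(uniqState(number)) AS state_type
    FROM numbers(10);
    -- state_type = 'AggregateFunction(uniq, UInt64)': an intermediate state, not a readable value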

    -Merge

    • Aggregates the intermediate results and produces the final value

    • A variant, -MergeState, aggregates intermediate results into a new intermediate result (but what's the point?)
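
    • A minimal sketch of -Merge finalizing the states produced by -State (again on numbers()):
    SELECT uniqMerge(state) AS final_uniq
    FROM (
      SELECT number % 3 AS grp, uniqState(number) AS state
      FROM numbers(10)
      GROUP BY grp
    );
    -- the three per-group intermediate states are merged into the final value 10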

    -If

    • Conditional aggregation

    • Perform multi-condition processing within one statement

    SELECT
      sumIf(quantity, merchandise_abbr LIKE '%苹果%') AS apple_quantity,
      countIf(toStartOfHour(ts_date_time) = '2020-06-09 20:00:00') AS eight_oclock_sub_order_num,
      maxIf(quantity * price, coupon_money > 0) AS couponed_max_gmv
    FROM ods.ms_order_done
    WHERE ts_date = toDate('2020-06-09');
    ┌─apple_quantity─┬─eight_oclock_sub_order_num─┬─couponed_max_gmv─┐
    │           1365 │                      19979 │           318000 │
    └────────────────┴────────────────────────────┴──────────────────┘
    

    -Array

    • Array aggregation
    SELECT avgArray([33, 44, 99, 110, 220]);
    ┌─avgArray([33, 44, 99, 110, 220])─┐
    │                            101.2 │
    └──────────────────────────────────┘
    

    -ForEach

    • Array aggregation by indexes (position)
    SELECT sumForEach(arr)
    FROM (
      SELECT 1 AS id, [3, 6, 12] AS arr
      UNION ALL
      SELECT 2 AS id, [7, 14, 7, 5] AS arr
    );
    ┌─sumForEach(arr)─┐
    │ [10,20,19,5]    │
    └─────────────────┘
    

    Part III - Using JOIN Correctly

    Only two-table equi-joins are considered here

    Use IN When Possible

    • Prefer IN over JOIN when we only want to fetch data from the left table
    SELECT sec_category_name,count()
    FROM ods.analytics_access_log
    WHERE ts_date = today() - 1
    AND site_name like '长沙%'
    AND merchandise_id IN (
      SELECT merchandise_id
      FROM ods.ms_order_done
      WHERE price > 10000
    )
    GROUP BY sec_category_name;
    

    Put Small Table at Right

    • ClickHouse uses the hash-join algorithm whenever there is enough memory
    • The right table is always treated as the build table (kept in memory), while the left table is always the probe table

    • It falls back to a merge join on disk when memory runs out (not as efficient as hash join)
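
    • A quick sketch with hypothetical table names (left_big as the large fact table, right_small as the small dimension table):
    SELECT l.order_id, r.category_name
    FROM left_big AS l
    INNER JOIN right_small AS r ON l.category_id = r.category_id;
    -- right_small becomes the in-memory build (hash) table,
    -- while left_big is streamed through as the probe side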

    No Predicate Pushdown

    • Predicate pushdown is a common query optimization approach. e.g. in MySQL:
    SELECT l.col1,r.col2 FROM left_table l
    INNER JOIN right_table r ON l.key = r.key
    WHERE l.col3 > 123 AND r.col4 = '...';
    
    • The WHERE predicates are evaluated early, during the scan phase, thus reducing the data size in the join phase

    • But the ClickHouse optimizer is fairly weak and does not support this, so we should manually push the predicates "inside" the subqueries

    SELECT l.col1,r.col2 FROM (
      SELECT col1,key FROM left_table
      WHERE col3 > 123
    ) l INNER JOIN (
      SELECT col2,key FROM right_table
      WHERE col4 = '...'
    ) r ON l.key = r.key;
    

    Distributed JOIN/IN with GLOBAL

    • When joining or doing IN on two distributed tables/MVs, the GLOBAL keyword is crucial
    SELECT
      t1.merchandise_id,t1.merchandise_abbr,t1.pv,t1.uv,
      t2.total_quantity,t2.total_gmv
    FROM (
      SELECT
        merchandise_id,merchandise_abbr,
        sumMerge(pv) AS pv,
        uniqMerge(uv) AS uv
      FROM dw.merchandise_point_pvuv_agg_all  -- Distributed
      WHERE ts_date = today()
      AND site_id IN (10030,10031,10036,10037,10038)
      AND point_index = 1
      GROUP BY merchandise_id,merchandise_abbr
    ) t1
    GLOBAL LEFT JOIN (  -- GLOBAL
      SELECT
        merchandise_id,
        sumMerge(total_quantity) AS total_quantity,
        sumMerge(total_gmv) AS total_gmv
      FROM dw.merchandise_gmv_agg_all  -- Distributed
      WHERE ts_date = today()
      AND site_id IN (10030,10031,10036,10037,10038)
      GROUP BY merchandise_id
    ) t2
    ON t1.merchandise_id = t2.merchandise_id;
    
    • Distributed joining without GLOBAL
      • Causes read amplification: the right table will be read M*N times (or N² when both sides have N shards), which is very wasteful

    • Distributed joining with GLOBAL works fine: the right table is evaluated once, cached as an intermediate result, and sent to each shard
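
    • The same applies to IN. A sketch reusing the distributed MVs above: the GLOBAL subquery is evaluated only once and its result is shipped to every shard
    SELECT merchandise_id, uniqMerge(uv) AS uv
    FROM dw.merchandise_point_pvuv_agg_all  -- Distributed
    WHERE ts_date = today()
    AND merchandise_id GLOBAL IN (  -- GLOBAL, evaluated only once
      SELECT merchandise_id
      FROM dw.merchandise_gmv_agg_all  -- Distributed
      WHERE ts_date = today()
    )
    GROUP BY merchandise_id;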

    ARRAY JOIN

    • Special: not related to table joining, but to arrays

    • Used to expand a row containing an array into multiple rows, with extra column(s)

    • Seems like LATERAL VIEW EXPLODE in Hive?

    • An example in the next section
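
    • Meanwhile, a minimal standalone sketch (no business data involved):
    SELECT id, elem
    FROM (SELECT 1 AS id, ['a', 'b', 'c'] AS arr)
    ARRAY JOIN arr AS elem;
    -- one row holding an array becomes three rows: (1,'a'), (1,'b'), (1,'c')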

    Part IV - Alternative to Windowed Analytical Functions

    Drawback

    • ClickHouse lacks basic windowed analytical functions, such as (in Hive):
    row_number() OVER (PARTITION BY col1 ORDER BY col2)
    rank() OVER (PARTITION BY col1 ORDER BY col2)
    dense_rank() OVER (PARTITION BY col1 ORDER BY col2)
    lag(col,num) OVER (PARTITION BY col1 ORDER BY col2)
    lead(col,num) OVER (PARTITION BY col1 ORDER BY col2)
    
    • Any other way around?

    arrayEnumerate*()

    • arrayEnumerate(): Returns index array [1, 2, 3, …, length(array)]
    SELECT arrayEnumerate([99, 88, 77, 66, 88, 99, 88, 55]);
    ┌─arrayEnumerate([99, 88, 77, 66, 88, 99, 88, 55])─┐
    │ [1,2,3,4,5,6,7,8]                                │
    └──────────────────────────────────────────────────┘
    
    • arrayEnumerateDense(): Returns an array of the same size as the source array, indicating where each element first appears in the source array
    SELECT arrayEnumerateDense([99, 88, 77, 66, 88, 99, 88, 55]);
    ┌─arrayEnumerateDense([99, 88, 77, 66, 88, 99, 88, 55])─┐
    │ [1,2,3,4,2,1,2,5]                                     │
    └───────────────────────────────────────────────────────┘
    
    • arrayEnumerateUniq(): Returns an array the same size as the source array, indicating for each element what its position is among elements with the same value
    SELECT arrayEnumerateUniq([99, 88, 77, 66, 88, 99, 88, 55]);
    ┌─arrayEnumerateUniq([99, 88, 77, 66, 88, 99, 88, 55])─┐
    │ [1,1,1,1,2,2,3,1]                                    │
    └──────────────────────────────────────────────────────┘
    

    Ranking List

    • When the array is ordered, arrayEnumerate() = row_number(), arrayEnumerateDense() = dense_rank()

    • Pay attention to the usage of ARRAY JOIN: it 'flattens' the arrays back into human-readable columns

    SELECT main_site_id,merchandise_id,gmv,row_number,dense_rank
    FROM (
      SELECT main_site_id,
        groupArray(merchandise_id) AS merchandise_arr,
        groupArray(gmv) AS gmv_arr,
        arrayEnumerate(gmv_arr) AS gmv_row_number_arr,
        arrayEnumerateDense(gmv_arr) AS gmv_dense_rank_arr
      FROM (
        SELECT main_site_id,
          merchandise_id,
          sum(price * quantity) AS gmv
        FROM ods.ms_order_done
        WHERE ts_date = toDate('2020-06-01')
        GROUP BY main_site_id,merchandise_id
        ORDER BY gmv DESC
      )
      GROUP BY main_site_id
    ) ARRAY JOIN
      merchandise_arr AS merchandise_id,
      gmv_arr AS gmv,
      gmv_row_number_arr AS row_number,
      gmv_dense_rank_arr AS dense_rank
    ORDER BY main_site_id ASC,row_number ASC;
    ┌─main_site_id─┬─merchandise_id─┬────gmv─┬─row_number─┬─dense_rank─┐
    │          162 │         379263 │ 136740 │          1 │          1 │
    │          162 │         360845 │  63600 │          2 │          2 │
    │          162 │         400103 │  54110 │          3 │          3 │
    │          162 │         404763 │  52440 │          4 │          4 │
    │          162 │          93214 │  46230 │          5 │          5 │
    │          162 │         304336 │  45770 │          6 │          6 │
    │          162 │         392607 │  45540 │          7 │          7 │
    │          162 │         182121 │  45088 │          8 │          8 │
    │          162 │         383729 │  44550 │          9 │          9 │
    │          162 │         404698 │  43750 │         10 │         10 │
    │          162 │         102725 │  33284 │         11 │         11 │
    │          162 │         404161 │  29700 │         12 │         12 │
    │          162 │         391821 │  28160 │         13 │         13 │
    │          162 │         339499 │  26069 │         14 │         14 │
    │          162 │         404548 │  25600 │         15 │         15 │
    │          162 │         167303 │  25520 │         16 │         16 │
    │          162 │         209754 │  23940 │         17 │         17 │
    │          162 │         317795 │  22950 │         18 │         18 │
    │          162 │         404158 │  21780 │         19 │         19 │
    │          162 │         326096 │  21540 │         20 │         20 │
    │          162 │         404493 │  20950 │         21 │         21 │
    │          162 │         389508 │  20790 │         22 │         22 │
    │          162 │         301524 │  19900 │         23 │         23 │
    │          162 │         404506 │  19900 │         24 │         23 │
    │          162 │         404160 │  18130 │         25 │         24 │
    ........................
    
    • Use WHERE row_number <= N (or dense_rank <= N) to extract the top N per group

    neighbor()

    • neighbor() is actually the combination of lag & lead
    neighbor(column,offset[,default_value])
    -- offset > 0 = lead
    -- offset < 0 = lag
    -- default_value is used when the offset is out of bounds
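
    • A minimal sketch on numbers() (when the default is omitted, out-of-range offsets return the type's default value, 0 here):
    SELECT
      number,
      neighbor(number, -1) AS lag_1,   -- previous row, 0 when out of range
      neighbor(number, 1) AS lead_1    -- next row, 0 when out of range
    FROM numbers(5);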
    

    Baseline (YoY/MoM)

    • 同比 = YoY (year-over-year) rate = (value[month, year] - value[month, year - 1]) / value[month, year - 1]

    • 环比 = MoM (month-over-month) rate = (value[month] - value[month - 1]) / value[month - 1]

    • Let's make up some fake data and try it out

    WITH toDate('2019-01-01') AS start_date
    SELECT
      toStartOfMonth(start_date + number * 32) AS dt,
      rand(number) AS val,
      neighbor(val,-12) AS prev_year_val,
      neighbor(val,-1) AS prev_month_val,
      if (prev_year_val = 0,-32768,round((val - prev_year_val) / prev_year_val, 4) * 100) AS yoy_percent,
      if (prev_month_val = 0,-32768,round((val - prev_month_val) / prev_month_val, 4) * 100) AS mom_percent
    FROM numbers(18);
    ┌─────────dt─┬────────val─┬─prev_year_val─┬─prev_month_val─┬─yoy_percent─┬─────────mom_percent─┐
    │ 2019-01-01 │  344308231 │             0 │              0 │      -32768 │              -32768 │
    │ 2019-02-01 │ 2125630486 │             0 │      344308231 │      -32768 │              517.36 │
    │ 2019-03-01 │  799858939 │             0 │     2125630486 │      -32768 │ -62.370000000000005 │
    │ 2019-04-01 │ 1899653667 │             0 │      799858939 │      -32768 │               137.5 │
    │ 2019-05-01 │ 3073278541 │             0 │     1899653667 │      -32768 │               61.78 │
    │ 2019-06-01 │  882031881 │             0 │     3073278541 │      -32768 │               -71.3 │
    │ 2019-07-01 │ 3888311917 │             0 │      882031881 │      -32768 │              340.84 │
    │ 2019-08-01 │ 3791703268 │             0 │     3888311917 │      -32768 │               -2.48 │
    │ 2019-09-01 │ 3472517572 │             0 │     3791703268 │      -32768 │               -8.42 │
    │ 2019-10-01 │ 1010491656 │             0 │     3472517572 │      -32768 │  -70.89999999999999 │
    │ 2019-11-01 │ 2841992923 │             0 │     1010491656 │      -32768 │              181.25 │
    │ 2019-12-01 │ 1783039500 │             0 │     2841992923 │      -32768 │              -37.26 │
    │ 2020-01-01 │ 2724427263 │     344308231 │     1783039500 │      691.28 │  52.800000000000004 │
    │ 2020-02-01 │ 2472851287 │    2125630486 │     2724427263 │       16.33 │  -9.229999999999999 │
    │ 2020-03-01 │ 1699617807 │     799858939 │     2472851287 │      112.49 │ -31.269999999999996 │
    │ 2020-04-01 │  873033696 │    1899653667 │     1699617807 │      -54.04 │              -48.63 │
    │ 2020-05-01 │ 3524933462 │    3073278541 │      873033696 │        14.7 │              303.76 │
    │ 2020-06-01 │   85437434 │     882031881 │     3524933462 │      -90.31 │              -97.58 │
    └────────────┴────────────┴───────────────┴────────────────┴─────────────┴─────────────────────┘
    

    Part V - More on Data Types

    Date/DateTime

    • Do not use String for Date/DateTime (this rule applies to other types as well)

      • ClickHouse is strongly typed, no implicit conversions

      • All-String tables (as commonly seen in Hive) do not suit ClickHouse

    • Do not use Int-type timestamps for Date/DateTime either

      • Date: stored as the number of days since 1970-01-01

      • DateTime: stored directly as a Unix timestamp (fast)

    • Very flexible date/time functions are available
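
    • A few of them, as a quick sketch:
    SELECT
      today() AS d,
      toStartOfMonth(d) AS month_start,
      toDayOfWeek(d) AS day_of_week,
      toUnixTimestamp(now()) AS unix_ts,
      dateDiff('day', toDate('2020-01-01'), d) AS days_since_2020;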

    Nullable

    • ClickHouse columns are not nullable by default, but if you really want NULLs...
    merchandise_id Nullable(Int64)
    
    • But try to stay away from Nullable

      • A separate mark file is needed to track the NULLs

      • Nullable columns cannot be indexed

    • The default value itself can stand in for NULL (0 for Int, '' for String, etc.), or define one explicitly when creating the table

    merchandise_id Int64 DEFAULT -1
    

    LowCardinality

    • ClickHouse applies dictionary encoding to LowCardinality columns; operating on such data significantly increases the performance of SELECT queries for many applications

    • LowCardinality is almost always used with low-diversity String columns (cardinality < 10,000)

    -- event_type in access logs is quite suitable
    event_type LowCardinality(String)
    

    Arrays & Higher-order Functions [TBD]

    • TBD...

    Part VI - MergeTree Indices & Table Settings

    Index Structure

    • Not B-Tree style, but rather like Kafka log indices (sparse)

    • .bin (data) and .mrk (mark) files on disk for each column

    • primary.idx stores one primary-key entry per index granule (i.e. every index_granularity rows)
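
    • A quick way to see the sparse layout, assuming the ods.analytics_access_log table from this deck (system.parts is a built-in system table):
    SELECT name, rows, marks, round(rows / marks) AS rows_per_mark
    FROM system.parts
    WHERE database = 'ods' AND table = 'analytics_access_log' AND active;
    -- rows_per_mark should hover around index_granularity (8192)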

    Index Settings

    • Include the columns that frequently occur as predicates (in the WHERE clause)

    • Date/DateTime columns come first (when partitioning by date/time)

    • Highly distinctive (high-cardinality) columns are not suitable for indexing

    • Do not use too many columns, and do not change the index_granularity = 8192 setting unless there is a good reason

    Table TTL

    • Determines the lifetime of rows, enabling automatic expiration of historical data

    • When creating a table

    PARTITION BY ...
    ORDER BY (...)
    TTL ts_date + INTERVAL 6 MONTH
    
    • Or modify an existing table (only affects the data inserted after modification)
    ALTER TABLE ods.analytics_access_log ON CLUSTER sht_ck_cluster_1
    MODIFY TTL ts_date + INTERVAL 6 MONTH;
    
    • The settings parameter controlling how often parts are merged to drop TTL-expired data
    SETTINGS merge_with_ttl_timeout = 86400  -- 1 day
    

    ZooKeeper

    • ClickHouse utilizes ZooKeeper as: Coordination service + Mini log service + Metadata storage

    • The load on it is quite heavy, so try to keep the ZooKeeper cluster happy

    autopurge.purgeInterval = 1
    autopurge.snapRetainCount = 5
    
    • Also, replicated tables can store the headers of the data parts compactly using a single znode by defining:
    SETTINGS use_minimalistic_part_header_in_zookeeper = 1
    

    Review CREATE TABLE statement

    CREATE TABLE IF NOT EXISTS ods.analytics_access_log
    ON CLUSTER sht_ck_cluster_1 (
      ts_date Date,
      ts_date_time DateTime,
      user_id Int64,
      event_type String,
      column_type String,
      groupon_id Int64,
      site_id Int64,
      site_name String,
      -- ...
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/ods/analytics_access_log','{replica}')
    PARTITION BY ts_date
    ORDER BY (ts_date,toStartOfHour(ts_date_time),main_site_id,site_id,event_type,column_type)
    TTL ts_date + INTERVAL 6 MONTH
    SETTINGS index_granularity = 8192,
    use_minimalistic_part_header_in_zookeeper = 1,
    merge_with_ttl_timeout = 86400;
    

    The End

    Good night, everyone.
