1. Setting the Number of Impala Daemons
A More Precise Approach
A more precise sizing estimate would require not only queries per minute (QPM), but also an average data size
scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total data size. The
following equation can be used as a rough guide to estimate the number of nodes (N) needed:
Eq 1: N > QPM * D / 100 GB
Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is required to be
15 seconds or less when there are 100 concurrent queries. The QPM is 100/15*60 = 400. We can estimate the number
of nodes using the equation above.
N > QPM * D / 100 GB
N > 400 * 50 GB / 100 GB
N > 200
Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
The processing rate depends on the complexity of the query. (The 100 GB per node in Eq 1 corresponds to a scan rate of roughly 1.6 GB/second per node sustained over one minute.) If the query has more joins,
aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the processing rate will
be lower than 1.6 GB/second per node. On the other hand, if the query only scans and filters numeric columns, the
processing rate can be higher.
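As a rough illustration, the node-count estimate in Eq 1 can be sketched as a small helper. This is not part of any Impala tooling; the function name and the default 100 GB/minute per-node rate are assumptions taken from the formula above.

```python
import math

def estimate_nodes(concurrent_queries: int,
                   target_response_s: float,
                   avg_scan_gb: float,
                   per_node_gb_per_min: float = 100.0) -> int:
    """Rough node-count estimate from Eq 1: N > QPM * D / per-node rate.

    Hypothetical helper; 100 GB/min per node assumes ~1.6 GB/s of scan throughput.
    """
    qpm = concurrent_queries / target_response_s * 60  # queries finished per minute
    nodes = qpm * avg_scan_gb / per_node_gb_per_min
    return math.ceil(nodes)

# Worked example from the text: 100 concurrent queries, 15 s target, 50 GB scanned per query
print(estimate_nodes(100, 15, 50))  # -> 200
```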
2. Memory Configuration per Impala Daemon
Estimating Memory Requirements
Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the joined tables,
using the COMPUTE STATS statement. However, joining big tables does consume more memory. Follow the steps
below to calculate the minimum memory requirement.
Suppose you are running the following join:
select a.*, b.col_1, b.col_2, ... b.col_n
from a, b
where a.key = b.key
and b.col_1 in (1,2,4...)
and b.col_4 in (....);
And suppose table B is smaller than table A (but still a large table).
The memory requirement for the query is that the right-hand table (B), after decompression, filtering (b.col_1 in ... and b.col_4 in ...), and projection (only the columns actually referenced), must fit within the total memory of the cluster.
Cluster Total Memory Requirement = Size of the smaller table *
selectivity factor from the predicate *
projection factor * compression ratio
In this case, assume that table B is 100 TB in Parquet format with 200 columns. The predicate on B (b.col_1 in ... and b.col_4 in ...) selects only 10% of the rows from B, and the projection uses only 5 of the 200 columns. Snappy compression usually gives about 3x compression, so we estimate a 3x expansion factor when the data is decompressed.
Cluster Total Memory Requirement = Size of the smaller table *
selectivity factor from the predicate *
projection factor * compression ratio
= 100 TB * 10% * 5/200 * 3
= 0.75 TB
= 750 GB
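The same arithmetic can be sketched in a few lines of Python. The function and parameter names below are illustrative only, not part of Impala.

```python
def cluster_memory_requirement_tb(smaller_table_tb: float,
                                  selectivity: float,
                                  projected_cols: int,
                                  total_cols: int,
                                  compression_ratio: float) -> float:
    """Minimum cluster memory (TB) needed to hold the right-hand join table in memory."""
    return (smaller_table_tb
            * selectivity
            * projected_cols / total_cols
            * compression_ratio)

# Worked example from the text: 100 TB table, 10% selectivity, 5 of 200 columns, 3x compression
req_tb = cluster_memory_requirement_tb(100, 0.10, 5, 200, 3)
print(req_tb)  # -> 0.75 (TB), i.e. 750 GB
```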
So, if you have a 10-node cluster where each node has 128 GB of RAM and you give 80% of it to Impala, then you have
about 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your cluster can handle join queries of this
magnitude.
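Checking that requirement against cluster capacity is then a one-line comparison; the 80% share of node RAM given to Impala is the assumption used in the text.

```python
nodes = 10
ram_per_node_gb = 128
impala_fraction = 0.8  # share of each node's RAM given to Impala (assumption from the text)

usable_gb = nodes * ram_per_node_gb * impala_fraction  # 1024 GB, roughly 1 TB
required_gb = 750

print(usable_gb >= required_gb)  # -> True: the cluster can handle a join of this size
```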