Impala Configuration

Author: nightwish夜愿 | Published 2018-04-25 11:28

1. Sizing the Number of Impala Daemons

A More Precise Approach

A more precise sizing estimate requires not only the queries per minute (QPM), but also the average data size scanned per query (D). With a proper partitioning strategy, D is usually a fraction of the total data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed:

Eq 1: N > QPM * D / 100 GB

Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is required to be 15 seconds or less when there are 100 concurrent queries. Then QPM = 100/15*60 = 400. We can estimate the number of nodes using the equation above:

N > QPM * D / 100 GB
N > 400 * 50 GB / 100 GB
N > 200

Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500. Depending on the complexity of the query, the processing rate may change; the 100 GB denominator in Eq 1 corresponds to a nominal per-node scan rate of about 1.6 GB/second (roughly 100 GB per minute). If the query has more joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scans and filtering on numbers, the processing rate can be higher.
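As a rough illustration (not an official Cloudera tool), the arithmetic behind Eq 1 can be captured in a short Python helper; the function name and the default 100 GB/minute per-node rate are assumptions taken from the discussion above, and the rate can be lowered or raised for heavier or lighter queries.

def estimate_nodes(concurrent_queries, response_time_sec, scan_gb_per_query,
                   node_rate_gb_per_min=100.0):
    """Rough node-count estimate from Eq 1: N > QPM * D / per-node rate."""
    # QPM: each of the concurrent queries completes every response_time_sec
    # seconds, so together they finish (concurrent / t) * 60 queries a minute.
    qpm = concurrent_queries / response_time_sec * 60
    return qpm * scan_gb_per_query / node_rate_gb_per_min

# Worked example from the text: 100 concurrent queries, 15-second target,
# 50 GB scanned per query -> QPM = 400 and N > 200.
print(estimate_nodes(100, 15, 50))  # 200.0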

2. Memory Configuration per Impala Daemon

Estimating Memory Requirements

Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the joined tables, using the COMPUTE STATS statement. However, joining big tables does consume more memory. Follow the steps below to calculate the minimum memory requirement.

Suppose you are running the following join:

select a.*, b.col_1, b.col_2, ... b.col_n
from a, b
where a.key = b.key
  and b.col_1 in (1, 2, 4, ...)
  and b.col_4 in (...);

And suppose table B is smaller than table A (but still a large table).

The memory requirement for the query is that the size of the right-hand table (B), after decompression, filtering (b.col_1 in ... and b.col_4 in ...), and projection (keeping only the columns actually used), must be less than the total memory of the entire cluster.

Cluster Total Memory Requirement = Size of the smaller table *
  selectivity factor from the predicate *
  projection factor * compression ratio

In this case, assume that table B is 100 TB in Parquet format with 200 columns. The predicate on B (b.col_1 in ... and b.col_4 in ...) will select only 10% of the rows from B, and for projection we are only projecting 5 columns out of 200. Snappy compression usually gives about 3x compression, so we estimate a 3x compression factor.

Cluster Total Memory Requirement = Size of the smaller table *
  selectivity factor from the predicate *
  projection factor * compression ratio
  = 100 TB * 10% * 5/200 * 3
  = 0.75 TB
  = 750 GB
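A minimal sketch of this estimate in Python (the function name and parameters are illustrative for this example, not part of any Impala API):

def cluster_memory_requirement_gb(table_size_gb, selectivity,
                                  projection_fraction, compression_ratio):
    # Size of the smaller (right-hand) table after decompression,
    # predicate filtering, and column projection.
    return table_size_gb * selectivity * projection_fraction * compression_ratio

# 100 TB Parquet table, 10% of rows selected, 5 of 200 columns, 3x Snappy.
print(cluster_memory_requirement_gb(100_000, 0.10, 5 / 200, 3))  # 750.0 GB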

So, if you have a 10-node cluster, each node has 128 GB of RAM, and you give 80% of it to Impala, then you have about 1 TB (10 * 128 GB * 0.8 = 1024 GB) of usable memory for Impala, which is more than 750 GB. Therefore, your cluster can handle join queries of this magnitude.
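The same capacity check in the sketch's terms, again with the numbers from this example only:

# Usable Impala memory: 10 nodes * 128 GB each * 80% given to Impala.
usable_gb = 10 * 128 * 0.80      # 1024 GB, about 1 TB
required_gb = 750                # estimate from the formula above
print(usable_gb >= required_gb)  # True: the cluster can handle the join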
