1. Setting the Number of Impala Daemons
A More Precise Approach
A more precise sizing estimate would require not only queries per minute (QPM), but also an average data size
scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total data size. The
following equation can be used as a rough guide to estimate the number of nodes (N) needed:
Eq 1: N > QPM * D / 100 GB
Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is required to be
15 seconds or less when there are 100 concurrent queries. The QPM is 100/15*60 = 400. We can estimate the number
of nodes using the equation above.
N > QPM * D / 100 GB
N > 400 * 50 GB / 100 GB
N > 200
Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
The processing rate depends on the complexity of the query. (The 100 GB per node in Eq 1 corresponds to a scan rate of roughly 1.6 GB/second per node sustained over one minute.) If the query has more joins,
aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the processing rate will
be lower than 1.6 GB/second per node. On the other hand, if the query only scans and filters numeric columns, the
processing rate can be higher.
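As a rough illustration, the node-count estimate in Eq 1 can be sketched as a small helper. This is not part of any Impala tooling; the function name and the default 100 GB/minute per-node rate are assumptions taken from the formula above.

```python
import math

def estimate_nodes(concurrent_queries: int,
                   target_response_s: float,
                   avg_scan_gb: float,
                   per_node_gb_per_min: float = 100.0) -> int:
    """Rough node-count estimate from Eq 1: N > QPM * D / per-node rate.

    Hypothetical helper; 100 GB/min per node assumes ~1.6 GB/s of scan throughput.
    """
    qpm = concurrent_queries / target_response_s * 60  # queries finished per minute
    nodes = qpm * avg_scan_gb / per_node_gb_per_min
    return math.ceil(nodes)

# Worked example from the text: 100 concurrent queries, 15 s target, 50 GB scanned per query
print(estimate_nodes(100, 15, 50))  # -> 200
```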
2. Memory Configuration per Impala Daemon
Estimating Memory Requirements
Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the joined tables,
using the COMPUTE STATS statement. However, joining big tables does consume more memory. Follow the steps
below to calculate the minimum memory requirement.
Suppose you are running the following join:
select a.*, b.col_1, b.col_2, ... b.col_n
from a, b
where a.key = b.key
and b.col_1 in (1,2,4...)
and b.col_4 in (....);
And suppose table B is smaller than table A (but still a large table).
The memory requirement for the query is that the right-hand table (B), after decompression, filtering (b.col_1 in ... and b.col_4 in ...), and projection (only the columns actually referenced), must fit within the total memory of the cluster.
Cluster Total Memory Requirement = Size of the smaller table *
selectivity factor from the predicate *
projection factor * compression ratio
In this case, assume that table B is 100 TB in Parquet format with 200 columns. The predicate on B (b.col_1 in ... and b.col_4 in ...) selects only 10% of the rows from B, and the projection uses only 5 of the 200 columns. Snappy compression usually gives about 3x compression, so we estimate a 3x expansion factor when the data is decompressed.
Cluster Total Memory Requirement = Size of the smaller table *
selectivity factor from the predicate *
projection factor * compression ratio
= 100 TB * 10% * 5/200 * 3
= 0.75 TB
= 750 GB
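The same arithmetic can be sketched in a few lines of Python. The function and parameter names below are illustrative only, not part of Impala.

```python
def cluster_memory_requirement_tb(smaller_table_tb: float,
                                  selectivity: float,
                                  projected_cols: int,
                                  total_cols: int,
                                  compression_ratio: float) -> float:
    """Minimum cluster memory (TB) needed to hold the right-hand join table in memory."""
    return (smaller_table_tb
            * selectivity
            * projected_cols / total_cols
            * compression_ratio)

# Worked example from the text: 100 TB table, 10% selectivity, 5 of 200 columns, 3x compression
req_tb = cluster_memory_requirement_tb(100, 0.10, 5, 200, 3)
print(req_tb)  # -> 0.75 (TB), i.e. 750 GB
```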
So, if you have a 10-node cluster where each node has 128 GB of RAM and you give 80% of it to Impala, then you have
about 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your cluster can handle join queries of this
magnitude.
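Checking that requirement against cluster capacity is then a one-line comparison; the 80% share of node RAM given to Impala is the assumption used in the text.

```python
nodes = 10
ram_per_node_gb = 128
impala_fraction = 0.8  # share of each node's RAM given to Impala (assumption from the text)

usable_gb = nodes * ram_per_node_gb * impala_fraction  # 1024 GB, roughly 1 TB
required_gb = 750

print(usable_gb >= required_gb)  # -> True: the cluster can handle a join of this size
```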