Apache Hive Essentials Notes


Author: 蓝Renly | Published 2019-04-24 20:31


    1.CASCADE

    Note:

    Note that Hive keeps the database and the table in directory mode. In order to remove the parent directory, we need to remove the subdirectories first. By default, the database cannot be dropped if it is not empty, unless CASCADE is specified. CASCADE drops the tables in the database automatically before dropping the database.

    Hive stores databases and tables as directories. To remove the parent directory, the subdirectories must be removed first. By default, a non-empty database cannot be dropped unless CASCADE is specified.

    jdbc:hive2://> DROP DATABASE IF EXISTS myhivebook CASCADE;
    

    2.Tables

    Create the external table and load the data:
    jdbc:hive2://> CREATE EXTERNAL TABLE employee_external
    . . . . . . .> (
    . . . . . . .> name string,
    . . . . . . .> work_place ARRAY<string>,
    . . . . . . .> sex_age STRUCT<sex:string,age:int>,
    . . . . . . .> skills_score MAP<string,int>,
    . . . . . . .> depart_title MAP<STRING,ARRAY<STRING>>
    . . . . . . .> )
    . . . . . . .> COMMENT 'This is an external table'
    . . . . . . .> ROW FORMAT DELIMITED
    . . . . . . .> FIELDS TERMINATED BY '|'
    . . . . . . .> COLLECTION ITEMS TERMINATED BY ','
    . . . . . . .> MAP KEYS TERMINATED BY ':'
    . . . . . . .> STORED AS TEXTFILE
    . . . . . . .> LOCATION '/user/dayongd/employee';
    No rows affected (1.332 seconds)
    jdbc:hive2://> LOAD DATA LOCAL INPATH '/home/hadoop/employee.txt'
    . . . . . . .> OVERWRITE INTO TABLE employee_external;
    

    Note
    CREATE TABLE
    The Hive table does not yet support constraints the way a relational database does.
    If the folder in the path specified by the LOCATION property does not exist, Hive will create that folder. If there is another folder inside the folder specified by the LOCATION property, Hive will NOT report errors when creating the table, but will report an error when querying the table.
    A temporary table, which is automatically deleted at the end of the Hive session, is supported in Hive 0.14.0 by HIVE-7090 (https://issues.apache.org/jira/browse/HIVE-7090) through the CREATE TEMPORARY TABLE statement.
    The STORED AS property is set to TEXTFILE by default. Other file format values, such as SEQUENCEFILE, RCFILE, ORC, AVRO (since Hive 0.14.0), and PARQUET (since Hive 0.13.0), can also be specified.
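    A minimal sketch of specifying a non-default file format at table creation (the table name and columns here are illustrative, not part of the book's dataset):

    CREATE TABLE employee_orc
    (
    name string,
    work_place ARRAY<string>
    )
    STORED AS ORC; -- SEQUENCEFILE, RCFILE, AVRO, or PARQUET could be used the same way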

    Create the table as select(CTAS):

    #Create the table as select (CTAS):
    jdbc:hive2://> CREATE TABLE ctas_employee
    . . . . . . .> AS SELECT * FROM employee_external;
    No rows affected (1.562 seconds)
    

    CTAS copies the data as well as table definitions. The table created by CTAS is atomic; this means that other users do not see the table until all the query results are populated. CTAS has the following restrictions:
    The table created cannot be a partitioned table
    The table created cannot be an external table
    The table created cannot be a list bucketing table
    A CTAS statement always triggers a map job to populate the data, even though SELECT * by itself does not trigger any MapReduce job.

    CTAS with Common Table Expression (CTE) can be created as follows:

    jdbc:hive2://> CREATE TABLE cte_employee AS
    . . . . . . .> WITH r1 AS
    . . . . . . .> (SELECT name FROM r2
    . . . . . . .> WHERE name = 'Michael'),
    . . . . . . .> r2 AS
    . . . . . . .> (SELECT name FROM employee
    . . . . . . .> WHERE sex_age.sex= 'Male'),
    . . . . . . .> r3 AS
    . . . . . . .> (SELECT name FROM employee
    . . . . . . .> WHERE sex_age.sex= 'Female')
    . . . . . . .> SELECT * FROM r1 UNION ALL select * FROM r3;
    No rows affected (61.852 seconds)
    jdbc:hive2://> SELECT * FROM cte_employee;
    +----------------------------+
    | cte_employee.name |
    +----------------------------+
    | Michael |
    | Shelley |
    | Lucy |
    +----------------------------+
    3 rows selected (0.091 seconds)
    

    CTE is available since Hive 0.13.0. It is a temporary result set derived from a simple SELECT query specified in a WITH clause, followed by a SELECT or INSERT statement that operates on this result set. The CTE is defined only within the execution scope of a single statement. One or more CTEs can be used in a nested or chained way with Hive keywords, such as the SELECT, INSERT, CREATE TABLE AS SELECT, or CREATE VIEW AS SELECT statements.

    CTE is supported only since Hive 0.13.0. It is a temporary result set obtained from a simple query in a WITH clause, followed by a SELECT or INSERT keyword that operates on that result set.

    Empty tables can be created in two ways as follows:

    1. Use CTAS as shown here:
    jdbc:hive2://> CREATE TABLE empty_ctas_employee AS
    . . . . . . .> SELECT * FROM employee_internal WHERE 1=2;
    No rows affected (213.356 seconds)
    2. Use  LIKE as shown here:
    jdbc:hive2://> CREATE TABLE empty_like_employee
    . . . . . . .> LIKE employee_internal;
    No rows affected (0.115 seconds)
    

    The LIKE way, which is faster, does not trigger a MapReduce job since it is metadata duplication only.

    The LIKE approach is faster and does not trigger a MapReduce job, since it only copies metadata.

    The drop table command removes the metadata completely and moves the data to Trash or to the current directory if Trash is configured:

    jdbc:hive2://> DROP TABLE IF EXISTS empty_ctas_employee;
    

    The truncate table command removes all the rows from a table, which should be an internal table:

    jdbc:hive2://> TRUNCATE TABLE cte_employee;
    No rows affected (0.093 seconds)
    --Table is empty after truncate
    jdbc:hive2://> SELECT * FROM cte_employee;
    +--------------------+
    | cte_employee.name |
    +--------------------+
    +--------------------+
    No rows selected (0.059 seconds)
    

    The ALTER TABLE statement renames the table:

    jdbc:hive2://> !table
    +-------------+--------------------+------------+---------------------------+
    | TABLE_SCHEM | TABLE_NAME         | TABLE_TYPE | REMARKS                   |
    +-------------+--------------------+------------+---------------------------+
    | default     | employee           | TABLE      | NULL                      |
    | default     | employee_internal  | TABLE      | This is an internal table |
    | default     | employee_external  | TABLE      | This is an external table |
    | default     | ctas_employee      | TABLE      | NULL                      |
    | default     | cte_employee       | TABLE      | NULL                      |
    +-------------+--------------------+------------+---------------------------+
    jdbc:hive2://> ALTER TABLE cte_employee RENAME TO c_employee;
    No rows affected (0.237 seconds)
    
    Alter the table's file format:

    jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT RCFILE;
    No rows affected (0.235 seconds)

    Alter the table's location, which must be a full URI of HDFS:

    jdbc:hive2://> ALTER TABLE c_employee
    . . . . . . .> SET LOCATION
    . . . . . . .> 'hdfs://localhost:8020/user/dayongd/employee';
    No rows affected (0.169 seconds)

    Alter the table's protection to enable/disable NO_DROP, which prevents a table from being dropped, or OFFLINE, which prevents data (not metadata) in a table from being queried:

    jdbc:hive2://> ALTER TABLE c_employee ENABLE NO_DROP;
    jdbc:hive2://> ALTER TABLE c_employee DISABLE NO_DROP;
    jdbc:hive2://> ALTER TABLE c_employee ENABLE OFFLINE;
    jdbc:hive2://> ALTER TABLE c_employee DISABLE OFFLINE;

    Alter the table's concatenation to merge small files into larger files:

    --Convert to the file format supported
    jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT ORC;
    No rows affected (0.160 seconds)
    --Concatenate files
    jdbc:hive2://> ALTER TABLE c_employee CONCATENATE;
    No rows affected (0.165 seconds)
    --Convert to the regular file format
    jdbc:hive2://> ALTER TABLE c_employee SET FILEFORMAT TEXTFILE;
    No rows affected (0.143 seconds)

    Note
    CONCATENATE
    In Hive release 0.8.0, RCFile added support for fast block-level merging of small RCFiles using the CONCATENATE command. In Hive release 0.14.0, ORC added support for fast stripe-level merging of small ORC files using the CONCATENATE command. Other file formats are not supported yet. RCFiles merge at the block level and ORC files merge at the stripe level, thereby avoiding the overhead of decompressing and decoding the data. MapReduce is triggered when performing concatenation.
    

    The alter table/partition statement for file format, location, protections, and concatenation has the same syntax as the alter table statements and is shown here:

    ALTER TABLE table_name PARTITION partition_spec SET FILEFORMAT file_format;
    ALTER TABLE table_name PARTITION partition_spec SET LOCATION 'fullURI';
    ALTER TABLE table_name PARTITION partition_spec ENABLE NO_DROP;
    ALTER TABLE table_name PARTITION partition_spec ENABLE OFFLINE;
    ALTER TABLE table_name PARTITION partition_spec DISABLE NO_DROP;
    ALTER TABLE table_name PARTITION partition_spec DISABLE OFFLINE;
    ALTER TABLE table_name PARTITION partition_spec CONCATENATE;
    

    3.Hive Buckets

    Besides partition, bucket is another technique to cluster datasets into more manageable parts to optimize query performance. Different from partition, a bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses the year and month as the top-level partition. If there were a further request to use employee_id as the third level of partition, it would lead to many deep and small partitions and directories. Instead, we can bucket the employee_partitioned table using employee_id as the bucket column. The value of this column will be hashed by a user-defined number into buckets. The records with the same employee_id will always be stored in the same bucket (segment of files). By using buckets, Hive can easily and efficiently do sampling (see Chapter 6, Data Aggregation and Sampling) and map-side joins (see Chapter 4, Data Selection and Scope). An example to create a bucket table is as follows:

    Note: unlike partitioning, buckets correspond to segments of files in HDFS. Rather than adding employee_id as a third partition level, which would create many deep, small partitions and directories, we can bucket the employee_partitioned table on the employee_id column. The column value is hashed into a user-defined number of buckets, so records with the same employee_id always land in the same bucket (file segment), which lets Hive do sampling and map-side joins easily and efficiently.

    --Prepare another dataset and table for bucket table
    jdbc:hive2://> CREATE TABLE employee_id
    . . . . . . .> (
    . . . . . . .> name string,
    . . . . . . .> employee_id int,
    . . . . . . .> work_place ARRAY<string>,
    . . . . . . .> sex_age STRUCT<sex:string,age:int>,
    . . . . . . .> skills_score MAP<string,int>,
    . . . . . . .> depart_title MAP<string,ARRAY<string>>
    . . . . . . .> )
    . . . . . . .> ROW FORMAT DELIMITED
    . . . . . . .> FIELDS TERMINATED BY '|'
    . . . . . . .> COLLECTION ITEMS TERMINATED BY ','
    . . . . . . .> MAP KEYS TERMINATED BY ':';
    No rows affected (0.101 seconds)
    jdbc:hive2://> LOAD DATA LOCAL INPATH
    . . . . . . .> '/home/dayongd/Downloads/employee_id.txt'
    . . . . . . .> OVERWRITE INTO TABLE employee_id;
    No rows affected (0.112 seconds)
    --Create the buckets table
    jdbc:hive2://> CREATE TABLE employee_id_buckets
    . . . . . . .> (
    . . . . . . .> name string,
    . . . . . . .> employee_id int,
    . . . . . . .> work_place ARRAY<string>,
    . . . . . . .> sex_age STRUCT<sex:string,age:int>,
    . . . . . . .> skills_score MAP<string,int>,
    . . . . . . .> depart_title MAP<string,ARRAY<string >>
    . . . . . . .> )
    . . . . . . .> CLUSTERED BY (employee_id) INTO 2 BUCKETS
    . . . . . . .> ROW FORMAT DELIMITED
    . . . . . . .> FIELDS TERMINATED BY '|'
    . . . . . . .> COLLECTION ITEMS TERMINATED BY ','
    . . . . . . .> MAP KEYS TERMINATED BY ':';
    No rows affected (0.104 seconds)
    
    Bucket numbers (choosing the number of buckets):

    To define the proper number of buckets, we should avoid having too much or too little data in each bucket. A better choice is somewhere near two blocks of data. For example, we can plan 512 MB of data in each bucket if the Hadoop block size is 256 MB. If possible, use 2^N (a power of two) as the number of buckets. Bucketing has a close dependency on the underlying data loaded. To properly load data into a bucket table, we need to either set the maximum number of reducers to the same number of buckets specified in the table creation (for example, 2) or enable enforce bucketing as follows:

    set mapred.reduce.tasks = 2;

    set hive.enforce.bucketing = true;

    jdbc:hive2://> set mapred.reduce.tasks = 2;
    No rows affected (0.026 seconds)
    jdbc:hive2://> set hive.enforce.bucketing = true;
    No rows affected (0.002 seconds)
    To populate the data into the bucket table, we cannot use the LOAD keyword as we did for regular tables, since LOAD does not verify the data against the metadata. Instead, INSERT should be used to populate the bucket table as follows:

    jdbc:hive2://> INSERT OVERWRITE TABLE employee_id_buckets
    . . . . . . .> SELECT * FROM employee_id;
    No rows affected (75.468 seconds)

    --Verify the buckets in the HDFS
    -bash-4.1$ hdfs dfs -ls /user/hive/warehouse/employee_id_buckets
    Found 2 items
    -rwxrwxrwx 1 hive hive 900 2014-11-02 10:54 /user/hive/warehouse/employee_id_buckets/000000_0
    -rwxrwxrwx 1 hive hive 582 2014-11-02 10:54 /user/hive/warehouse/employee_id_buckets/000001_0
    

    4.Queries and Optimization

    1.Simple SELECT queries

    1.Enable fetch and verify the performance improvement:

    Query time without the setting: 162.452 seconds

    jdbc:hive2://> SELECT name FROM employee;
    +----------+
    | name |
    +----------+
    | Michael |
    | Will |
    | Shelley |
    | Lucy |
    +----------+
    4 rows selected (162.452 seconds)
    

    With the setting: 0.242 seconds

    jdbc:hive2://> SET hive.fetch.task.conversion=more;
    No rows affected (0.002 seconds)
    jdbc:hive2://> SELECT name FROM employee;
    +----------+
    | name |
    +----------+
    | Michael |
    | Will |
    | Shelley |
    | Lucy |
    +----------+
    4 rows selected (0.242 seconds)
    
    2.Nested SELECT using CTE can be implemented as follows:
    jdbc:hive2://> WITH t1 AS (
    . . . . . . .> SELECT * FROM employee
    . . . . . . .> WHERE sex_age.sex = 'Male')
    . . . . . . .> SELECT name, sex_age.sex AS sex FROM t1;
    +----------+-------+
    | name | sex |
    +----------+-------+
    | Michael | Male |
    | Will | Male |
    +----------+-------+
    2 rows selected (38.706 seconds)
    
    3.There are additional restrictions for subqueries used in WHERE clauses:

    Subqueries can only appear on the right-hand side of the WHERE clauses
    Nested subqueries are not allowed
    The IN and NOT IN statement supports only one column
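    As a quick illustration of these restrictions (a sketch assuming the employee and employee_hr tables used elsewhere in this chapter):

    -- Single-column IN subquery on the right-hand side of WHERE
    SELECT name FROM employee
    WHERE name IN (SELECT name FROM employee_hr);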

    2.INNER JOIN

    Self-join is a special  JOIN where one table joins itself. When doing such joins, a different alias should be given to distinguish the same table:
    jdbc:hive2://> SELECT emp.name
    . . . . . . .> FROM employee emp
    . . . . . . .> JOIN employee emp_b
    . . . . . . .> ON emp.name = emp_b.name;
    +-----------+
    | emp.name |
    +-----------+
    | Michael |
    | Will |
    | Shelley |
    | Lucy |
    +-----------+
    4 rows selected (59.891 seconds)
    Implicit join is a JOIN operation without using the JOIN keyword. It has been supported since Hive 0.13.0:
    jdbc:hive2://> SELECT emp.name, emph.sin_number
    . . . . . . .> FROM employee emp, employee_hr emph
    . . . . . . .> WHERE emp.name = emph.name;
    +-----------+------------------+
    | emp.name | emph.sin_number |
    +-----------+------------------+
    | Michael | 547-968-091 |
    | Will | 527-948-090 |
    | Lucy | 577-928-094 |
    +-----------+------------------+
    3 rows selected (47.241 seconds)
    A JOIN operation that uses different columns in its join conditions will create an additional MapReduce job:
    jdbc:hive2://> SELECT emp.name, empi.employee_id, emph.sin_number
    . . . . . . .> FROM employee emp
    . . . . . . .> JOIN employee_hr emph ON emp.name = emph.name
    . . . . . . .> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
    +-----------+-------------------+------------------+
    | emp.name | empi.employee_id | emph.sin_number |
    +-----------+-------------------+------------------+
    | Michael | 100 | 547-968-091 |
    | Will | 101 | 527-948-090 |
    | Lucy | 103 | 577-928-094 |
    +-----------+-------------------+------------------+
    3 rows selected (49.785 seconds)
    

    Note
    If JOIN uses different columns in the join conditions, it will request additional job stages to complete the join. If the JOIN operation uses the same column in the join conditions, Hive will join on this condition using one stage.

    hint: /*+ STREAMTABLE(table_name) */

    A hint works much like in Oracle, where it forces the optimizer's behavior. In Hive, the STREAMTABLE hint marks which table is the largest, telling the MapReduce job which table to stream, so the big table no longer has to be written at the rightmost end of the JOIN sequence.

    When JOIN is performed between multiple tables, MapReduce jobs are created to process the data in HDFS. Each of the jobs is called a stage. Usually, it is suggested that JOIN statements put the big table right at the end for better performance as well as to avoid Out Of Memory (OOM) exceptions, because the last table in the sequence is streamed through the reducers whereas the others are buffered in the reducers by default. Also, a hint, such as /*+ STREAMTABLE(table_name) */, can be specified to tell Hive which table is streamed, as follows:

    It is recommended to put the big table at the rightmost end to improve performance and avoid out-of-memory exceptions.

    jdbc:hive2://> SELECT /*+ STREAMTABLE(employee_hr) */
    . . . . . . .> emp.name, empi.employee_id, emph.sin_number
    . . . . . . .> FROM employee emp
    . . . . . . .> JOIN employee_hr emph ON emp.name = emph.name
    . . . . . . .> JOIN employee_id empi ON emph.employee_id = empi.employee_id;
    

    3.OUTER JOIN/CROSS JOIN

    OUTER JOIN

    This returns all rows from both tables, matching rows where possible; when there is no match in the left or right table, NULL is returned for the missing side.
    #m + n - m ∩ n
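    A minimal FULL OUTER JOIN sketch, assuming the same employee and employee_hr tables used in the inner join examples above:

    SELECT emp.name, emph.sin_number
    FROM employee emp
    FULL OUTER JOIN employee_hr emph ON emp.name = emph.name;
    -- Rows without a match on either side appear with NULL in the other table's columns.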
    

    CROSS JOIN

    This returns all row combinations of both tables to produce a Cartesian product.
    #m * n
    

    4.Special JOIN – MAPJOIN

    jdbc:hive2://> SELECT /*+ MAPJOIN(employee) */ emp.name, emph.sin_number
    . . . . . . .> FROM employee emp
    . . . . . . .> CROSS JOIN employee_hr emph WHERE emp.name <> emph.name;
    

    The MAPJOIN operation does not support the following:
    The use of MAPJOIN after UNION ALL , LATERAL VIEW , GROUP BY / JOIN / SORT BY / CLUSTER BY / DISTRIBUTE BY;
    The use of MAPJOIN before UNION , JOIN , and another MAPJOIN;

    Bucket map join and sort-merge JOIN

    The bucket map join is a special type of MAPJOIN that uses bucket columns (the column specified by CLUSTERED BY in the CREATE TABLE statement) as the join condition. Instead of fetching the whole table as done by a regular map join, the bucket map join only fetches the required bucket data. To enable bucket map join, we need to set hive.optimize.bucketmapjoin = true and make sure the bucket numbers are multiples of each other. If both joined tables are sorted and bucketed with the same number of buckets, a sort-merge join can be performed instead of caching all the small tables in memory. The following additional settings are needed to enable this behavior:

    The bucket map join is a special kind of MAPJOIN that uses the bucket columns (the columns given by CLUSTERED BY when creating the table) as the join condition. Instead of fetching the whole table like a regular map join, it only fetches the required bucket data.

    SET hive.optimize.bucketmapjoin = true;
    SET hive.optimize.bucketmapjoin.sortedmerge = true;
    SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    

    5.Data Manipulation

    Insert data from the CTE statement:

    jdbc:hive2://> WITH a AS (SELECT * FROM ctas_employee )
    . . . . . . .> FROM a
    . . . . . . .> INSERT OVERWRITE TABLE employee
    . . . . . . .> SELECT *;
    No rows affected (30.1 seconds)
    

    Run multiple INSERT by only scanning the source table once:

    jdbc:hive2://> FROM ctas_employee
    . . . . . . .> INSERT OVERWRITE TABLE employee
    . . . . . . .> SELECT *
    . . . . . . .> INSERT OVERWRITE TABLE employee_internal
    . . . . . . .> SELECT * ;
    No rows affected (27.919 seconds)
    
    1.Enabling dynamic partitions

    Dynamic partition is not enabled by default. We need to set the following properties to
    make it work:

    jdbc:hive2://> SET hive.exec.dynamic.partition=true;
    No rows affected (0.002 seconds)
    jdbc:hive2://> SET hive.exec.dynamic.partition.mode=nonstrict;
    No rows affected (0.002 seconds)
    jdbc:hive2://> INSERT INTO TABLE employee_partitioned
    . . . . . . .> PARTITION(year, month)
    . . . . . . .> SELECT name, array('Toronto') as work_place,
    . . . . . . .> named_struct("sex","Male","age",30) as sex_age,
    . . . . . . .> map("Python",90) as skills_score,
    . . . . . . .> map("R&D",array('Developer')) as depart_title,
    . . . . . . .> year(start_date) as year, month(start_date) as month
    . . . . . . .> FROM employee_hr eh
    . . . . . . .> WHERE eh.employee_id = 102;
    No rows affected (29.024 seconds)
    

    Inserting into local files:

    We can insert into local files with the default row separators. In some recent versions of Hadoop, the local directory path only works for directory levels less than two. We may need to set hive.insert.into.multilevel.dirs=true to get this fixed:

    jdbc:hive2://> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output1'
    . . . . . . .> SELECT * FROM employee;
    No rows affected (30.859 seconds)
    

    Note
    By default, many partial files could be created by the reducer when doing INSERT . To merge them into one, we can use HDFS commands, as shown in the following example:

    hdfs dfs -getmerge hdfs://<host_name>:8020/user/dayongd/output /tmp/test
    

    Insert to local files with specified row separators:

    jdbc:hive2://> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/output2'
    . . . . . . .> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    . . . . . . .> SELECT * FROM employee;
    No rows affected (31.937 seconds)
    
    2.Appending to files with hive -e

    Besides the Hive INSERT statement, Hive and HDFS shell commands can also be used to extract data to local or remote files, with both append and overwrite supported. The hive -e 'quoted_hql_string' or hive -f <hql_filename> commands can execute a Hive query statement or query file. Linux redirection operators and piping can be used with these commands to redirect result sets. The following are a few examples:

    Append to local files:
    $ hive -e 'select * from employee' >> test
    
    Overwrite to local files:
    $ hive -e 'select * from employee' > test
    
    Append to HDFS files:
    $ hive -e 'select * from employee'|hdfs dfs -appendToFile - /user/dayongd/output2/test
    
    Overwrite to HDFS files:
    $ hive -e 'select * from employee'|hdfs dfs -put -f - /user/dayongd/output2/test
    

    6.EXPORT and IMPORT

    Data exchange – EXPORT and IMPORT

    When working with Hive, sometimes we need to migrate data among different environments, or we may need to back up some data. Since Hive 0.8.0, EXPORT and IMPORT statements are available to support the import and export of data in HDFS for data migration or backup/restore purposes. The EXPORT statement will export both data and metadata from a table or partition. Metadata is exported in a file called _metadata. Data is exported in a subdirectory called data:

    jdbc:hive2://> EXPORT TABLE employee TO '/user/dayongd/output3';
    No rows affected (0.19 seconds)
    

    After EXPORT , we can manually copy the exported files to other Hive instances or use Hadoop distcp commands to copy to other HDFS clusters. Then, we can import the data in the following manner:

    Import data to a table with the same name. It throws an error if the table exists:

    jdbc:hive2://> IMPORT FROM '/user/dayongd/output3';
    Error: Error while compiling statement: FAILED: SemanticException
    [Error 10119]: Table exists and contains data files
    (state=42000,code=10119)
    

    Import data to a new table:

    jdbc:hive2://> IMPORT TABLE empolyee_imported FROM
    . . . . . . .> '/user/dayongd/output3';
    No rows affected (0.788 seconds)
    

    Import data to an external table, where the LOCATION property is optional:

    jdbc:hive2://> IMPORT EXTERNAL TABLE empolyee_imported_external
    . . . . . . .> FROM '/user/dayongd/output3'
    . . . . . . .> LOCATION '/user/dayongd/output4' ;
    No rows affected (0.256 seconds)
    

    Export and import partitions:

    jdbc:hive2://> EXPORT TABLE employee_partitioned partition
    . . . . . . .> (year=2014, month=11) TO '/user/dayongd/output5';
    No rows affected (0.247 seconds)
    jdbc:hive2://> IMPORT TABLE employee_partitioned_imported
    . . . . . . .> FROM '/user/dayongd/output5';
    No rows affected (0.14 seconds)
    

    7.ORDER and SORT

    1.ORDER BY (ASC|DESC)
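    ORDER BY performs a global sort through a single reducer, so it should be used carefully on large result sets. A minimal sketch against the employee table used throughout these notes:

    SELECT name FROM employee ORDER BY name DESC;
    -- Globally sorted output; all rows pass through one reducer.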

    2.SORT BY(ASC|DESC)

    The SORT BY statement does not perform a global sort and only makes sure data is locally sorted in each reducer unless we set mapred.reduce.tasks=1. In this case, it is equal to the result of ORDER BY . It can be used as follows:

    --Use only 1 reducer
    
    jdbc:hive2://> SET mapred.reduce.tasks = 1;
    No rows affected (0.002 seconds)
    
    jdbc:hive2://> SELECT name FROM employee SORT BY NAME DESC;
    
    +----------+
    | name |
    +----------+
    | Will |
    | Shelley |
    | Michael |
    | Lucy |
    +----------+
    4 rows selected (46.03 seconds)
    

    3.DISTRIBUTE BY :

    Rows with matching column values will be partitioned to the same reducer. When used alone, it does not guarantee sorted input to the reducer. The DISTRIBUTE BY statement is similar to GROUP BY in RDBMS in terms of deciding which reducer to distribute the mapper output to. When using with SORT BY ,
    DISTRIBUTE BY must be specified before the SORT BY statement. And, the column used to distribute must appear in the select column list. It can be used as follows:

    Rows with matching column values are partitioned to the same reducer. When used alone, DISTRIBUTE BY does not guarantee sorted input to the reducer. It is similar to GROUP BY in an RDBMS in that it decides which reducer the mapper output is distributed to. When used with SORT BY, DISTRIBUTE BY must be specified before the SORT BY statement, and the column used to distribute must appear in the select column list.

    --Used with SORT BY
    jdbc:hive2://> SELECT name, employee_id
    . . . . . . .> FROM employee_hr
    . . . . . . .> DISTRIBUTE BY employee_id SORT BY name;
    +----------+--------------+
    | name | employee_id |
    +----------+--------------+
    | Lucy | 103 |
    | Michael | 100 |
    | Steven | 102 |
    | Will | 101 |
    +----------+--------------+
    4 rows selected (38.01 seconds)
    

    4.CLUSTER BY

    This is a shorthand operator to perform DISTRIBUTE BY and SORT BY operations on the same group of columns. And, it is sorted locally in each reducer. The CLUSTER BY statement does not support ASC or DESC yet. Compared to ORDER BY, which is globally sorted, the CLUSTER BY operation is sorted in each distributed group. To fully utilize all the available reducers when doing a global sort, we can do CLUSTER BY first and then ORDER BY. This can be used as follows:

    jdbc:hive2://> SELECT name, employee_id
    . . . . . . .> FROM employee_hr CLUSTER BY name;
    +----------+--------------+
    | name | employee_id |
    +----------+--------------+
    | Lucy | 103 |
    | Michael | 100 |
    | Steven | 102 |
    | Will | 101 |
    +----------+--------------+
    4 rows selected (39.791 seconds)
    

    8.Operators and functions

    To further manipulate data, we can also use expressions, operators, and functions in Hive to transform data. The Hive wiki (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) offers specifications for each expression and function, so we do not repeat all of them here except for a few important usages or tips in this chapter.

    Hive has defined relational operators, arithmetic operators, logical operators, complex type constructors, and complex type operators. The relational, arithmetic, and logical operators are similar to standard operators in SQL/Java and are not repeated in this chapter. The operators on complex data types were introduced in the Understanding Hive data types section of Chapter 3, Data Definition and Description, as well as in the dynamic partition insert example in this chapter.

    The functions in Hive are categorized as follows:
    Mathematical functions: These functions are mainly used to perform mathematical calculations, such as RAND() and E().
    Collection functions: These functions are used to find the size, keys, and values for complex types, such as SIZE(Array<T>).
    Type conversion functions: These are mainly the CAST and BINARY functions to convert one type to another.
    Date functions: These functions are used to perform date-related calculations, such as YEAR(string date) and MONTH(string date).
    Conditional functions: These functions are used to check specific conditions with a defined value returned, such as COALESCE, IF, and CASE WHEN.
    String functions: These functions are used to perform string-related operations, such as UPPER(string A) and TRIM(string A).
    Aggregate functions: These functions are used to perform aggregation (introduced in more detail in the next chapter), such as SUM() and COUNT(*).
    Table-generating functions: These functions transform a single input row into multiple output rows, such as EXPLODE(MAP) and JSON_TUPLE(jsonString, k1, k2, …).
    Customized functions: These functions are created with Java code as extensions for Hive. They are introduced in Chapter 8, Extensibility Considerations.

    To list Hive built-in functions/UDFs, we can use the following commands in the Hive CLI:

    SHOW FUNCTIONS; --List all functions
    DESCRIBE FUNCTION <function_name>; --Detail for specified function
    DESCRIBE FUNCTION EXTENDED <function_name>; --Even more details
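    As an example of a table-generating function, here is a minimal LATERAL VIEW sketch using the work_place array of the employee table defined earlier in these notes:

    SELECT name, loc
    FROM employee
    LATERAL VIEW EXPLODE(work_place) wp AS loc;
    -- Each employee row is expanded into one row per work_place element.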
    

    9.Transactions

    Before Hive version 0.13.0, Hive did not support row-level transactions. As a result, there was no way to update, insert, or delete rows of data. Hence, data overwrite could only happen on tables or partitions. This made Hive very difficult to use for concurrent read/write and data-cleaning use cases. Since Hive version 0.13.0, Hive fully supports row-level transactions by offering full Atomicity, Consistency, Isolation, and Durability (ACID) to Hive. For now, all the transactions are autocommitted and only support data in the Optimized Row Columnar (ORC) file format (available since Hive 0.11.0) and in bucketed tables.
    The following configuration parameters must be set appropriately to turn on transaction support in Hive:

    SET hive.support.concurrency = true;
    SET hive.enforce.bucketing = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.compactor.initiator.on = true;
    SET hive.compactor.worker.threads = 1;
    
    The  SHOW TRANSACTIONS command is added since Hive 0.13.0 to show currently open and
    aborted transactions in the system:
    
    jdbc:hive2://> SHOW TRANSACTIONS;
    +-----------------+--------------------+-------+-----------+
    | txnid | state | user | host |
    +-----------------+--------------------+-------+-----------+
    | Transaction ID | Transaction State | User | Hostname |
    +-----------------+--------------------+-------+-----------+
    1 row selected (15.209 seconds)
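    With these settings, row-level changes become possible on tables that meet the ACID requirements (ORC format, bucketed, transactional). A minimal sketch; the table name and columns here are illustrative, and UPDATE/DELETE require Hive 0.14.0 or later:

    CREATE TABLE employee_acid (id int, name string)
    CLUSTERED BY (id) INTO 2 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    UPDATE employee_acid SET name = 'Mike' WHERE id = 1;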
    

    10.Aggregation

    GROUP BY
    
    GROUPING SETS
    
    ROLLUP and CUBE
    
    HAVING
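    These aggregation clauses are only listed here; a minimal sketch, assuming the employee_contract table (with dept_num and salary columns) used later in this chapter:

    SELECT dept_num, SUM(salary) AS total_salary
    FROM employee_contract
    GROUP BY dept_num WITH ROLLUP  -- adds a grand-total row; GROUPING SETS and CUBE follow the same pattern
    HAVING SUM(salary) > 10000;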
    

    11.Analytic functions

    Standard aggregations: This can be either  COUNT() ,  SUM() ,  MIN() ,  MAX() , or  AVG() .
    
    RANK : It ranks items in a group, such as finding the top N rows for specific conditions.
    
    DENSE_RANK : It is similar to  RANK , but leaves no gaps in the ranking sequence when there are ties. For example, if we rank a match using  DENSE_RANK and had two players tie for second place, we would see that the two players were in second place and that the next person is ranked as third. However, the  RANK function would also rank two people in second place, but the next person would be in fourth place.
    
    ROW_NUMBER : It assigns a unique sequence number starting from 1 to each row according to the partition and order specification.
    
    CUME_DIST : It computes the number of rows whose value is smaller than or equal to the value of the current row, divided by the total number of rows.
    
    PERCENT_RANK : It is similar to CUME_DIST, but it uses rank values rather than row counts in its numerator: (current rank - 1) divided by (total number of rows - 1). Therefore, it returns the percent rank of a value relative to a group of values.
    
    NTILE : It divides an ordered dataset into number of buckets and assigns an appropriate bucket number to each row. It can be used to divide rows into equal sets and assign a number to each row.
    
    LEAD : The LEAD function, lead(value_expr[,offset[,default]]), is used to return data from the next row. The number of rows (offset) to lead can optionally be specified; if it is not, the lead is one row by default. The function returns the default value (or null when no default is specified) when the lead for the current row extends beyond the end of the window.
    
    LAG : The LAG function, lag(value_expr[,offset[,default]]), is used to access data from a previous row. The number of rows (offset) to lag can optionally be specified; if it is not, the lag is one row by default. The function returns the default value (or null when no default is specified) when the lag for the current row extends before the beginning of the window.
    
    FIRST_VALUE : It returns the first result from an ordered set.
    
    LAST_VALUE : It returns the last result from an ordered set. For  LAST_VALUE , using the default windowing clause, the result can be a little unexpected. This is because the default windowing clause is  RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW , which in this example means the current row will always be the last value.Changing the windowing clause to  RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING gives us the result we probably expected (see the  last_value column in the following examples).
    
    jdbc:hive2://> SELECT name, dept_num, salary,
    . . . . . . .> COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,
    . . . . . . .> SUM(salary) OVER(PARTITION BY dept_num
    . . . . . . .> ORDER BY dept_num) AS deptTotal,
    . . . . . . .> SUM(salary) OVER(ORDER BY dept_num)
    . . . . . . .> AS runningTotal1, SUM(salary)
    . . . . . . .> OVER(ORDER BY dept_num, name rows unbounded
    . . . . . . .> preceding) AS runningTotal2
    . . . . . . .> FROM employee_contract
    . . . . . . .> ORDER BY dept_num, name;
    +-------+--------+------+-------+---------+-------------+-------------+
    | name |dept_num|salary|row_cnt|deptTotal|runningTotal1|runningTotal2|
    +-------+--------+------+-------+---------+-------------+-------------+
    |Lucy |1000 |5500 |5 |24900 |24900 |5500 |
    |Michael|1000 |5000 |5 |24900 |24900 |10500 |
    |Steven |1000 |6400 |5 |24900 |24900 |16900 |
    |Will |1000 |4000 |5 |24900 |24900 |24900 |
    |Will |1000 |4000 |5 |24900 |24900 |20900 |
    |Jess |1001 |6000 |3 |17400 |42300 |30900 |
    |Lily |1001 |5000 |3 |17400 |42300 |35900 |
    |Mike |1001 |6400 |3 |17400 |42300 |42300 |
    |Richard|1002 |8000 |3 |20500 |62800 |50300 |
    |Wei |1002 |7000 |3 |20500 |62800 |57300 |
    |Yun |1002 |5500 |3 |20500 |62800 |62800 |
    +-------+--------+------+-------+---------+-------------+-------------+
    11 rows selected (359.918 seconds)
    
    jdbc:hive2://> SELECT name, dept_num, salary,
    . . . . . . .> RANK() OVER (PARTITION BY dept_num ORDER BY salary)
    . . . . . . .> AS rank,
    . . . . . . .> DENSE_RANK()
    . . . . . . .> OVER (PARTITION BY dept_num ORDER BY salary)
    . . . . . . .> AS dense_rank, ROW_NUMBER() OVER () AS row_num,
    . . . . . . .> ROUND((CUME_DIST() OVER (PARTITION BY dept_num
    . . . . . . .> ORDER BY salary)), 1) AS cume_dist,
    . . . . . . .> PERCENT_RANK() OVER(PARTITION BY dept_num
    . . . . . . .> ORDER BY salary) AS percent_rank, NTILE(4)
    . . . . . . .> OVER(PARTITION BY dept_num ORDER BY salary)
    . . . . . . .> AS ntile
    . . . . . . .> FROM employee_contract ORDER BY dept_num;
    +-------+--------+------+----+----------+-------+---------+------------+-----+
    | name  |dept_num|salary|rank|dense_rank|row_num|cume_dist|percent_rank|ntile|
    +-------+--------+------+----+----------+-------+---------+------------+-----+
    |Will   | 1000   | 4000 | 1  | 1        | 11    | 0.4     | 0.0        | 1   |
    |Will   | 1000   | 4000 | 1  | 1        | 10    | 0.4     | 0.0        | 1   |
    |Michael| 1000   | 5000 | 3  | 2        | 9     | 0.6     | 0.5        | 2   |
    |Lucy   | 1000   | 5500 | 4  | 3        | 8     | 0.8     | 0.75       | 3   |
    |Steven | 1000   | 6400 | 5  | 4        | 7     | 1.0     | 1.0        | 4   |
    |Lily   | 1001   | 5000 | 1  | 1        | 6     | 0.3     | 0.0        | 1   |
    |Jess   | 1001   | 6000 | 2  | 2        | 5     | 0.7     | 0.5        | 2   |
    |Mike   | 1001   | 6400 | 3  | 3        | 4     | 1.0     | 1.0        | 3   |
    |Yun    | 1002   | 5500 | 1  | 1        | 3     | 0.3     | 0.0        | 1   |
    |Wei    | 1002   | 7000 | 2  | 2        | 2     | 0.7     | 0.5        | 2   |
    |Richard| 1002   | 8000 | 3  | 3        | 1     | 1.0     | 1.0        | 3   |
    +-------+--------+------+----+----------+-------+---------+------------+-----+
    11 rows selected (367.112 seconds)
    
    jdbc:hive2://> SELECT name, dept_num, salary,
    . . . . . . .> LEAD(salary, 2) OVER(PARTITION BY dept_num
    . . . . . . .> ORDER BY salary) AS lead,
    . . . . . . .> LAG(salary, 2, 0) OVER(PARTITION BY dept_num
    . . . . . . .> ORDER BY salary) AS lag,
    . . . . . . .> FIRST_VALUE(salary) OVER (PARTITION BY dept_num
    . . . . . . .> ORDER BY salary) AS first_value,
    . . . . . . .> LAST_VALUE(salary) OVER (PARTITION BY dept_num
    . . . . . . .> ORDER BY salary) AS last_value_default,
    . . . . . . .> LAST_VALUE(salary) OVER (PARTITION BY dept_num
    . . . . . . .> ORDER BY salary
    . . . . . . .> RANGE BETWEEN UNBOUNDED PRECEDING
    . . . . . . .> AND UNBOUNDED FOLLOWING) AS last_value
    . . . . . . .> FROM employee_contract ORDER BY dept_num;
    +-------+--------+------+----+----+-----------+------------------+----------+
    | name  |dept_num|salary|lead|lag |first_value|last_value_default|last_value|
    +-------+--------+------+----+----+-----------+------------------+----------+
    |Will   |1000    |4000  |5000|0   |4000       |4000              |6400      |
    |Will   |1000    |4000  |5500|0   |4000       |4000              |6400      |
    |Michael|1000    |5000  |6400|4000|4000       |5000              |6400      |
    |Lucy   |1000    |5500  |NULL|4000|4000       |5500              |6400      |
    |Steven |1000    |6400  |NULL|5000|4000       |6400              |6400      |
    |Lily   |1001    |5000  |6400|0   |5000       |5000              |6400      |
    |Jess   |1001    |6000  |NULL|0   |5000       |6000              |6400      |
    |Mike   |1001    |6400  |NULL|5000|5000       |6400              |6400      |
    |Yun    |1002    |5500  |8000|0   |5500       |5500              |8000      |
    |Wei    |1002    |7000  |NULL|0   |5500       |7000              |8000      |
    |Richard|1002    |8000  |NULL|5500|5500       |8000              |8000      |
    +-------+--------+------+----+----+-----------+------------------+----------+
    11 rows selected (92.572 seconds)
    

    12.Sampling

    1.Random sampling

    Random sampling uses the RAND() function and the LIMIT keyword to get a sample of data, as shown in the following example. The DISTRIBUTE and SORT keywords are used here to make sure the data is also randomly distributed among mappers and reducers efficiently. The ORDER BY RAND() statement can achieve the same purpose, but its performance is not as good:

    SELECT * FROM <Table_Name> DISTRIBUTE BY RAND() SORT BY RAND() LIMIT <N rows to sample>;
    
    2.Bucket table sampling

    Bucket table sampling is a special sampling optimized for bucket tables, as shown in the following syntax and example. The colname value specifies the column on which to sample the data. The RAND() function can also be used when sampling on entire rows. If the sample column is also the CLUSTERED BY column, the TABLESAMPLE statement will be more efficient.

    --Syntax(语法)
    SELECT * FROM <Table_Name> TABLESAMPLE(BUCKET <specified bucket number to sample> OUT OF <total number of buckets> ON [colname|RAND()]) table_alias;
    
    --An example
    jdbc:hive2://> SELECT name FROM employee_id_buckets
    . . . . . . .> TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;
    +----------+
    | name |
    +----------+
    | Lucy |
    | Shelley |
    | Lucy |
    | Lucy |
    | Shelley |
    | Lucy |
    | Will |
    | Shelley |
    | Michael |
    | Will |
    | Will |
    | Will |
    | Will |
    | Will |
    | Lucy |
    +----------+
    15 rows selected (0.07 seconds)
    
    3.Block sampling

    Block sampling allows Hive to randomly pick up N rows of data, a percentage (n percent) of the data size, or N bytes of data. The sampling granularity is the HDFS block size. Its syntax and examples are as follows:

    --Syntax
    SELECT *
    FROM <Table_Name> TABLESAMPLE(N PERCENT|ByteLengthLiteral|N ROWS) s;
    -- ByteLengthLiteral
    -- (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')
    
    --Sample by number of rows
    jdbc:hive2://> SELECT name
    . . . . . . .> FROM employee_id_buckets TABLESAMPLE(4 ROWS) a;
    +----------+
    | name |
    +----------+
    | Lucy |
    | Shelley |
    | Lucy |
    | Shelley |
    +----------+
    4 rows selected (0.055 seconds)
    
    --Sample by percentage of data size
    jdbc:hive2://> SELECT name
    . . . . . . .> FROM employee_id_buckets TABLESAMPLE(10 PERCENT) a;
    +----------+
    | name |
    +----------+
    | Lucy |
    | Shelley |
    | Lucy |
    +----------+
    3 rows selected (0.061 seconds)
    
    --Sample by data size
    jdbc:hive2://> SELECT name
    . . . . . . .> FROM employee_id_buckets TABLESAMPLE(3M) a;
    +----------+
    | name |
    +----------+
    | Lucy |
    | Shelley |
    | Lucy |
    | Shelley |
    | Lucy |
    | Shelley |
    | Lucy |
    | Shelley |
    | Lucy |
    | Will |
    | Shelley |
    | Lucy |
    | Will |
    | Shelley |
    | Michael |
    | Will |
    | Shelley |
    | Lucy |
    | Will |
    | Will |
    | Will |
    | Will |
    | Will |
    | Lucy |
    | Shelley |
    +----------+
    25 rows selected (0.07 seconds)
    

    13.Performance Considerations

    1.Performance utilities

    Hive provides the EXPLAIN and ANALYZE statements that can be used as utilities to check and identify the performance of queries.

    EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query

    The following keywords can be used:
    EXTENDED : This provides additional information for the operators in the plan, such as file pathname and abstract syntax tree.
    DEPENDENCY : This provides a JSON format output that contains a list of tables and partitions that the query depends on. It is available since HIVE 0.10.0.
    AUTHORIZATION : This lists all entities needed to be authorized including input and output to run the Hive query and authorization failures, if any. It is available since HIVE 0.14.0.

    A typical query plan contains the following three sections. We will also have a look at an example later:
    Abstract syntax tree (AST): Hive uses a parser generator called ANTLR (see http://www.antlr.org/) to automatically generate a syntax tree for HQL. We can usually ignore this most of the time.
    Stage dependencies: This lists all dependencies and number of stages used to run the query.
    Stage plans: It contains important information, such as operators and sort orders, for running the job.
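    For example, prefixing a query with EXPLAIN returns its plan without running it (a sketch against the employee table):

    EXPLAIN SELECT sex_age.sex, count(*) FROM employee GROUP BY sex_age.sex;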

    The ANALYZE statement

    Hive statistics are a collection of data that describe more details, such as the number of rows, number of files, and raw data size, of the objects in the Hive database. Statistics are a kind of metadata of Hive data. Hive supports statistics at the table, partition, and column level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which is an optimizer that picks the query plan with the lowest cost in terms of the system resources required to complete the query.

    The statistics are gathered through the ANALYZE statement since Hive 0.10.0 on tables, partitions, and columns as given in the following examples:

    jdbc:hive2://> ANALYZE TABLE employee COMPUTE STATISTICS;
    No rows affected (27.979 seconds)
    
    jdbc:hive2://> ANALYZE TABLE employee_partitioned
    . . . . . . .> PARTITION(year=2014, month=12) COMPUTE STATISTICS;
    No rows affected (45.054 seconds)
    
    jdbc:hive2://> ANALYZE TABLE employee_id COMPUTE STATISTICS
    . . . . . . .> FOR COLUMNS employee_id;
    No rows affected (41.074 seconds)
    

    Once the statistics are built, we can check the statistics by the DESCRIBE EXTENDED / FORMATTED statement. From the table/partition output, we can find the statistics information inside the parameters, such as parameters:{numFiles=1,COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1417726247, numRows=4,totalSize=227, rawDataSize=223}) . The following is an example:

    jdbc:hive2://> DESCRIBE EXTENDED employee_partitioned
    . . . . . . .> PARTITION(year=2014, month=12);
    jdbc:hive2://> DESCRIBE EXTENDED employee;
    …
    parameters:{numFiles=1, COLUMN_STATS_ACCURATE=true,
    transient_lastDdlTime=1417726247, numRows=4, totalSize=227,
    rawDataSize=223}).
    
    jdbc:hive2://> DESCRIBE FORMATTED employee.name;
    +--------+---------+---+---+---------+--------------+-----------+-----------+
    |col_name|data_type|min|max|num_nulls|distinct_count|avg_col_len|max_col_len|
    +--------+---------+---+---+---------+--------------+-----------+-----------+
    | name | string | | | 0 | 5 | 5.6 | 7|
    +--------+---------+---+---+---------+--------------+-----------+-----------+
    +---------+----------+-----------------+
    |num_trues|num_falses| comment |
    +---------+----------+-----------------+
    | | |from deserializer|
    +---------+----------+-----------------+
    3 rows selected (0.116 seconds)
    

    Hive statistics are persisted in the metastore to avoid computing them every time. For newly created tables and/or partitions, statistics are automatically computed by default if we enable the following setting:

    jdbc:hive2://> SET hive.stats.autogather=true;
    

    Note
    Hive logs
    Logs provide useful information to find out how a Hive query/job runs. By checking the Hive logs, we can identify runtime problems and issues that may cause bad performance. There are two types of logs available in Hive: the system log and the job log. The system log contains the Hive running status and issues. It is configured in {HIVE_HOME}/conf/hive-log4j.properties. The following three lines for the Hive log can be found:

    hive.root.logger=WARN,DRFA
    hive.log.dir=/tmp/${user.name}
    hive.log.file=hive.log
    

    To modify the logging settings, we can either modify the preceding lines in hive-log4j.properties (applies to all users) or set them from the Hive CLI (applies only to the current user and current session) as follows:

    hive --hiveconf hive.root.logger=DEBUG,console
    

    The job log contains Hive query information and is saved at the same place,/tmp/${user.name} , by default as one file for each Hive user session. We can override it in hive-site.xml with the hive.querylog.location property. If a Hive query generates MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI.
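    For example, the same --hiveconf mechanism shown above can redirect the job log for a single session (the path here is illustrative):

    hive --hiveconf hive.querylog.location=/tmp/hive_query_logs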

    2.Design optimization

    Design optimization covers several data layout and design strategies to improve performance.

    3.Partition tables

    Hive partitioning is one of the most effective methods to improve query performance on larger tables. A query with partition filtering will only load the data in the specified partitions (subdirectories), so it can execute much faster than a normal query that filters by a non-partitioning field. The selection of the partition key is always an important factor for performance. It should always be a low-cardinality attribute to avoid the overhead of too many subdirectories.
    The following are some commonly used dimensions as partition keys:
    Partitions by date and time: Use date and time, such as year, month, and day (even hours), as partition keys when data is associated with the time dimension

    Partitions by locations: Use country, territory, state, and city as partition keys when data is location related

    Partitions by business logic: Use department, sales region, applications, customers, and so on as partition keys when data can be separated evenly by some business logic
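    A minimal sketch of a table partitioned by date dimensions, following the same pattern as the book's employee_partitioned table (the table and columns here are illustrative):

    CREATE TABLE sales_partitioned
    (
    order_id int,
    amount double
    )
    PARTITIONED BY (year int, month int)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';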

    4.Bucket tables

    Similar to partitioning, a bucket table organizes data into separate files in the HDFS. Bucketing can speed up the data sampling in Hive with sampling on buckets. Bucketing can also improve the join performance if the join keys are also bucket keys because bucketing ensures that the key is present in a certain bucket. More details are given in the Job and Query optimization section in this chapter.

    14.Local mode

    jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default false
    jdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
    jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5;
    --default 4
    

    15.JVM reuse

    By default, Hadoop launches a new JVM for each map or reduce task and runs the map or reduce tasks in parallel. When the map or reduce task is a lightweight job running only for a few seconds, the JVM startup process could be a significant overhead. The MapReduce framework (version 1 only, not Yarn) has an option to reuse the JVM by sharing it so that mappers/reducers run serially in the same JVM instead of each starting a new one. JVM reuse applies to map or reduce tasks in the same job; tasks from different jobs always run in separate JVMs. To enable reuse, we can set the maximum number of tasks per job for JVM reuse using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1:

    jdbc:hive2://> SET mapred.job.reuse.jvm.num.tasks=5;
    

    We can also set the value to -1 to indicate that all the tasks for a job will run in the same JVM.

    16.Parallel execution

    Hive queries commonly are translated into a number of stages that are executed by the default sequence. These stages are not always dependent on each other. Instead, they can run in parallel to save the overall job running time. We can enable this feature with the following settings:

    jdbc:hive2://> SET hive.exec.parallel=true; --default false
    jdbc:hive2://> SET hive.exec.parallel.thread.number=16;
    -- default 8, it defines the max number for running in parallel
    

    Parallel execution will increase the cluster utilization. If the utilization of a cluster is already very high, parallel execution will not help much in terms of overall performance.

    17.Join optimization

    1.Common join

    The common join is also called the reduce side join. It is a basic join in Hive and works most of the time. For common joins, we need to make sure the big table is on the rightmost side or specified by a hint, as follows:
    /*+ STREAMTABLE(stream_table_name) */.

    2.Map join

    Map join is used when one of the join tables is small enough to fit in the memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert map join automatically with the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true; --default false
    jdbc:hive2://> SET hive.mapjoin.smalltable.filesize=600000000;
    --default 25M
    jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask=true;
    --default false. Set to true so that map join hint is not needed
    jdbc:hive2://> SET hive.auto.convert.join.noconditionaltask.size=10000000;
    --The default value controls the size of table to fit in memory
    

    Once autoconvert is enabled, Hive will automatically check if the smaller table file size is bigger than the value specified by hive.mapjoin.smalltable.filesize , and then Hive will convert the join to a common join. If the file size is smaller than this threshold, it will try to convert the common join into a map join. Once autoconvert join is enabled, there is no need to provide the map join hints in the query.

    3.Bucket map join

    Bucket map join is a special type of map join applied on the bucket tables. To enable bucket map join, we need to enable the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true; --default false
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true; --default false
    

    In bucket map join, all the join tables must be bucket tables and must join on the bucket columns.
    In addition, the number of buckets in the bigger table must be a multiple of the number of buckets in the smaller tables.

    4.Sort merge bucket (SMB) join

    SMB is the join performed on the bucket tables that have the same sorted, bucket, and join
    condition columns. It reads data from both bucket tables and performs common joins (map
    and reduce triggered) on the bucket tables. We need to enable the following properties to
    use SMB:

    jdbc:hive2://> SET hive.input.format=
    . . . . . . .> org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
    

    5.Sort merge bucket map (SMBM) join

    SMBM join is a special bucket join but triggers map-side join only. It can avoid caching all rows in the memory like map join does. To perform SMBM joins, the join tables must have the same bucket, sort, and join condition columns. To enable such joins, we need to enable the following settings:

    jdbc:hive2://> SET hive.auto.convert.join=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin=true;
    jdbc:hive2://> SET hive.optimize.bucketmapjoin.sortedmerge=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
    jdbc:hive2://> SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=
    org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;
    

    18.Skew join (data skew)

    When working with data that has a highly uneven distribution, the data skew could happen in such a way that a small number of compute nodes must handle the bulk of the computation. The following setting informs Hive to optimize properly if data skew happens:

    jdbc:hive2://> SET hive.optimize.skewjoin=true;
    --If there is data skew in join, set it to true. Default is false.
    jdbc:hive2://> SET hive.skewjoin.key=100000;
    --This is the default value. If the number of key is bigger than
    --this, the new keys will send to the other unused reducers.
    

    Note
    Skew data could happen on the GROUP BY data too. To optimize it, we need to do the following settings to enable skew data optimization in the GROUP BY result:

    SET hive.groupby.skewindata=true;
    

    Once configured, Hive will first trigger an additional MapReduce job whose map output will be randomly distributed to the reducers to avoid data skew. For more information about Hive join optimization, please refer to the Apache Hive wiki, available at
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization and
    https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization.
