HQL支持以下四种关键字来进行数据排序,
- ORDER BY [ASC|DESC]
- SORT BY [ASC|DESC]
- DISTRIBUTE BY
- CLUSTER BY
以下依次为这些关键字的说明和解释,
- ORDER BY [ASC|DESC]:
和SQL中SELECT语句的ORDER BY子句相似。需要注意的是,该排序是针对每一个reducer的输出汇总后整体的排序,只使用一个reducer来做全局整体排序,通常会消耗较多时间来完成这个排序任务。
ASC这个开关项是缺省开关项,如果没有指定ASC|DESC,ASC就会被缺省使用。自Hive 2.1.0版本起,HQL支持可以指定NULL值位于排序中的起始/结束。NULL FIRST, NULL值位于起始;NULL LAST, NULL值位于结束位置。缺省情况下,NULL值被视为最小,在做ASC排序时,位于起始位置,在做DESC排序时,位于结束位置。
> SELECT name FROM employee ORDER BY name DESC; -- By columns
+----------+
| name |
+----------+
| Will |
| Shelley |
| Michael |
| Lucy |
+----------+
4 rows selected (24.057 seconds)
> SELECT name
> FROM employee -- Order by expression
> ORDER BY CASE WHEN name = 'Will' THEN 0 ELSE 1 END DESC;
+----------+
| name |
+----------+
| Lucy |
| Shelley |
| Michael |
| Will |
+----------+
4 rows selected (25.057 seconds)
> SELECT * FROM emp_simple ORDER BY work_place NULL LAST;
+---------+------------+
| name | work_place |
+---------+------------+
| Lucy | Montreal |
| Michael | Toronto |
| Will | NULL | -- NULL stays at the last
+---------+------------+
3 rows selected (0.263 seconds)
推荐LIMIT子句和ORDER BY一起使用。当设置hive.mapred.mode为strict时,ORDER BY子句一定要和LIMIT子句一起使用,否则执行相关HQL语句时会返回异常。另外,Hive 1.x版本,该设置缺省为 nonstrict,但自从Hive 2.x后,该设置缺省为 strict。
- SORT BY [ASC|DESC]: 用来指定提供给Reducer输入记录的排序方式。由于是给Reducer的输入数据排序,所以这个排序操作在Reducer开始处理数据前就已经完成了。该排序不是全局排序,只是在每个Reducer内局部排。当时有且仅有一个Reducer时,和ORDER BY的效果一致。另外SORT BY通常需要和DISTRIBUTE BY一起使用,否则无实际效果。
> SET mapred.reduce.tasks = 2; -- Sort by with more than 1 reducer
No rows affected (0.001 seconds)
> SELECT name FROM employee SORT BY name DESC;
+---------+
| name |
+---------+
| Shelley | -- Once result is collected to client, it is
| Michael | order-less
| Lucy |
| Will |
+---------+
4 rows selected (54.386 seconds)
> SET mapred.reduce.tasks = 1; -- Sort by one reducer
No rows affected (0.002 seconds)
> SELECT name FROM employee SORT BY name DESC;
+----------+
| name |
+----------+
| Will | -- Same result to ORDER BY
| Shelley |
| Michael |
| Lucy |
+----------+
4 rows selected (46.03 seconds)
- DISTRIBUTE BY:
和GROUP BY子句相似,Mapper用来确定数据输出至哪个Reducer。
和GROUP BY子句不一样的地方在于,DISTRIBUTE BY子句无法用于数据聚合函数,如COUNT(*),而只是指定数据输出至哪个Reducer。
DISTRIBUTE BY子句通常用于按照某些列来重新存放数据。另外该子句后的列,必须包含在SELECT子句的列列表中,否在会报错。
-- Error when not specify distributed column employee_id in select
> SELECT name FROM employee_hr DISTRIBUTE BY employee_id;
Error: Error while compiling statement: FAILED: SemanticException
[Error 10004]: Line 1:44 Invalid table alias or column reference 'employee_id': (possible column names are: name)
> SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id;
+----------+--------------+
| name | employee_id |
+----------+--------------+
| Lucy | 103 |
| Steven | 102 |
| Will | 101 |
| Michael | 100 |
+----------+--------------+
4 rows selected (38.92 seconds)
-- Used with SORT BY to order name started on the same day
> SELECT name, start_date
> FROM employee_hr
> DISTRIBUTE BY start_date SORT BY name;
+----------+--------------+
| name | start_date |
+----------+--------------+
| Lucy | 2010-01-03 |
| Michael | 2014-01-29 |
| Steven | 2012-11-03 |
| Will | 2013-10-02 |
+----------+--------------+
4 rows selected (38.01 seconds)
- CLUSTER BY: 等用于DISTRIBUTE BY子句和SORT BY子句同时作用于同一组数据列。另外,该子句不可以指定ASC或DESC选项。
> SELECT name, employee_id FROM employee_hr CLUSTER BY name;
+----------+--------------+
| name | employee_id |
+----------+--------------+
| Lucy | 103 |
| Michael | 100 |
| Steven | 102 |
| Will | 101 |
+----------+--------------+
4 rows selected (39.791 seconds)
最后,以上子句的区别,大家可以通过以下图表来体会,
HQL 排序子句的区别
网友评论