美文网首页
HQL之数据排序

HQL之数据排序

作者: 长较瘦 | 来源:发表于2019-10-28 14:03 被阅读0次

    HQL支持以下四种关键字来进行数据排序,

    • ORDER BY [ASC|DESC]
    • SORT BY [ASC|DESC]
    • DISTRIBUTE BY
    • CLUSTER BY
      以下依次为这些关键字的说明和解释,
    1. ORDER BY [ASC|DESC]:
      和SQL中SELECT语句的ORDER BY子句相似。需要注意的是,该排序是针对每一个reducer的输出汇总后整体的排序,只使用一个reducer来做全局整体排序,通常会消耗较多时间来完成这个排序任务。
      ASC这个开关项是缺省开关项,如果没有指定ASC|DESC,ASC就会被缺省使用。自Hive 2.1.0版本起,HQL支持可以指定NULL值位于排序中的起始/结束。NULL FIRST, NULL值位于起始;NULL LAST, NULL值位于结束位置。缺省情况下,NULL值被视为最小,在做ASC排序时,位于起始位置,在做DESC排序时,位于结束位置。
    > SELECT name FROM employee ORDER BY name DESC; -- By columns      
    +----------+      
    |   name   |      
    +----------+      
    | Will     |      
    | Shelley  |      
    | Michael  |      
    | Lucy     |      
    +----------+      
    4 rows selected (24.057 seconds)            
    > SELECT name       
    > FROM employee  -- Order by expression      
    > ORDER BY CASE WHEN name = 'Will' THEN 0 ELSE 1 END DESC;      
    +----------+      
    |   name   |      
    +----------+      
    | Lucy     |      
    | Shelley  |      
    | Michael  |      
    | Will     |      
    +----------+      
    4 rows selected (25.057 seconds)            
    > SELECT * FROM emp_simple ORDER BY work_place NULL LAST;      
    +---------+------------+      
    | name    | work_place |      
    +---------+------------+      
    | Lucy    | Montreal   |      
    | Michael | Toronto    |      
    | Will    | NULL       | -- NULL stays at the last      
    +---------+------------+      
    3 rows selected (0.263 seconds)
    

    推荐LIMIT子句和ORDER BY一起使用。当设置hive.mapred.mode为strict时,ORDER BY子句一定要和LIMIT子句一起使用,否则执行相关HQL语句时会返回异常。另外,Hive 1.x版本,该设置缺省为 nonstrict,但自从Hive 2.x后,该设置缺省为 strict。

    1. SORT BY [ASC|DESC]: 用来指定提供给Reducer输入记录的排序方式。由于是给Reducer的输入数据排序,所以这个排序操作在Reducer开始处理数据前就已经完成了。该排序不是全局排序,只是在每个Reducer内局部排。当时有且仅有一个Reducer时,和ORDER BY的效果一致。另外SORT BY通常需要和DISTRIBUTE BY一起使用,否则无实际效果。
    > SET mapred.reduce.tasks = 2; -- Sort by with more than 1 reducer      
    No rows affected (0.001 seconds)            
    > SELECT name FROM employee SORT BY name DESC;      
    +---------+      
    |   name  |      
    +---------+      
    | Shelley | -- Once result is collected to client, it is       
    | Michael |   order-less      
    | Lucy    |      
    | Will    |      
    +---------+      
    4 rows selected (54.386 seconds)            
    > SET mapred.reduce.tasks = 1; -- Sort by one reducer      
    No rows affected (0.002 seconds)            
    > SELECT name FROM employee SORT BY name DESC;      
    +----------+      
    |   name   |      
    +----------+      
    | Will     | -- Same result to ORDER BY      
    | Shelley  |      
    | Michael  |      
    | Lucy     |      
    +----------+      
    4 rows selected (46.03 seconds)
    
    1. DISTRIBUTE BY:
      和GROUP BY子句相似,Mapper用来确定数据输出至哪个Reducer。
      和GROUP BY子句不一样的地方在于,DISTRIBUTE BY子句无法用于数据聚合函数,如COUNT(*),而只是指定数据输出至哪个Reducer。
      DISTRIBUTE BY子句通常用于按照某些列来重新存放数据。另外该子句后的列,必须包含在SELECT子句的列列表中,否在会报错。
    -- Error when not specify distributed column employee_id in select      
    > SELECT name FROM employee_hr DISTRIBUTE BY employee_id;       
    Error: Error while compiling statement: FAILED: SemanticException
    [Error 10004]: Line 1:44 Invalid table alias or column reference 'employee_id': (possible column names are: name)             
    > SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id;       
    +----------+--------------+      
    |   name   | employee_id  |      
    +----------+--------------+      
    | Lucy     | 103          |      
    | Steven   | 102          |      
    | Will     | 101          |      
    | Michael  | 100          |      
    +----------+--------------+      
    4 rows selected (38.92 seconds)            
    -- Used with SORT BY to order name started on the same day      
    > SELECT name, start_date      
    > FROM employee_hr      
    > DISTRIBUTE BY start_date SORT BY name;      
    +----------+--------------+      
    |   name   |  start_date  |      
    +----------+--------------+      
    | Lucy     | 2010-01-03   |      
    | Michael  | 2014-01-29   |      
    | Steven   | 2012-11-03   |     
    | Will     | 2013-10-02   |      
    +----------+--------------+      
    4 rows selected (38.01 seconds)
    
    1. CLUSTER BY: 等用于DISTRIBUTE BY子句和SORT BY子句同时作用于同一组数据列。另外,该子句不可以指定ASC或DESC选项。
    > SELECT name, employee_id FROM employee_hr CLUSTER BY name;      
    +----------+--------------+      
    |   name   | employee_id  |      
    +----------+--------------+      
    | Lucy     | 103          |      
    | Michael  | 100          |      
    | Steven   | 102          |      
    | Will     | 101          |      
    +----------+--------------+      
    4 rows selected (39.791 seconds)
    

    最后,以上子句的区别,大家可以通过以下图表来体会,


    HQL 排序子句的区别

    相关文章

      网友评论

          本文标题:HQL之数据排序

          本文链接:https://www.haomeiwen.com/subject/nrovvctx.html