美文网首页
HQL聚集计算之窗口篇

HQL聚集计算之窗口篇

作者: 长较瘦 | 来源:发表于2019-10-29 15:26 被阅读0次

自Hive 0.11.0之后,窗口函数,作为一组可以以扫描多个输入行作为输入来计算和生成每一个返回值的特殊函数,正式进入HQL大家族。该组函数的语法位,
Function (arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_expression>]),其中Function (arg1,..., argn)分为以下四类,

  • 聚集函数 如 SUM, MAX, MIN
  • 排序函数 如 RANK, ROW_NUMBER
  • 分析函数 用于统计或比较,如 LEAD, LAG, FIRST_VALUE
  1. 聚集函数
    自Hive 2.2.0之后,DISTINCT 可以和聚集函数一起使用于窗口函数中。以下示例1,用于创建表和加载数据以方便演示功能
> CREATE TABLE IF NOT EXISTS employee_contract (      
> name string,      
> dept_num int,      
> employee_id int,      
> salary int,      
> type string,      
> start_date date      
> )      
> ROW FORMAT DELIMITED      
> FIELDS TERMINATED BY '|'      
> STORED as TEXTFILE;      
No rows affected (0.282 seconds)      
> LOAD DATD INPATH '/tmp/hivedemo/data/employee_contract.txt'      
> OVERWRITE INTO TABLE employee_contract;      
No rows affected (0.48 seconds)

以下示例2,演示了一般聚集函数的窗口函数用法

> SELECT       
> name,       
> dept_num as deptno,       
> salary,      
> count(*) OVER (PARTITION BY dept_num) as cnt,      
> count(distinct dept_num) OVER (PARTITION BY dept_num) as dcnt,      
> sum(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) as       sum1,      
> sum(salary) OVER(ORDER BY dept_num) as sum2,      
> sum(salary) OVER(ORDER BY dept_num, name) as sum3      
> FROM employee_contract      
> ORDER BY deptno, name;      
+---------+--------+--------+-----+-----+-------+-------+-------+      
| name    | deptno | salary | cnt | dcnt| sum1  | sum2  | sum3  |      
+---------+--------+--------+-----+-----+-------+-------+-------+      
| Lucy    | 1000   | 5500   | 5   | 1   | 24900 | 24900 | 5500  |      
| Michael | 1000   | 5000   | 5   | 1   | 24900 | 24900 | 10500 |      
| Steven  | 1000   | 6400   | 5   | 1   | 24900 | 24900 | 16900 |      
| Wendy   | 1000   | 4000   | 5   | 1   | 24900 | 24900 | 20900 |      
| Will    | 1000   | 4000   | 5   | 1   | 24900 | 24900 | 24900 |     
| Jess    | 1001   | 6000   | 3   | 1   | 17400 | 42300 | 30900 |      
| Lily    | 1001   | 5000   | 3   | 1   | 17400 | 42300 | 35900 |      
| Mike    | 1001   | 6400   | 3   | 1   | 17400 | 42300 | 42300 |      
| Richard | 1002   | 8000   | 3   | 1   | 20500 | 62800 | 50300 |      
| Wei     | 1002   | 7000   | 3   | 1   | 20500 | 62800 | 57300 |      
| Yun     | 1002   | 5500   | 3   | 1   | 20500 | 62800 | 62800 |      
+---------+--------+--------+-----+-----+-------+-------+-------+      
11 rows selected (111.856 seconds)
  1. 排序函数,
  • ROW_NUMBER 根据分区和排序,从1开始给每一行指定唯一的序号。
  • RANK 在一个组中,对组内的记录条目排序
  • DENSE_RANK 和RANK相似,但是排名是连续无断开和跳跃。
  • PERCENT_RANK 公式为(current rank - 1)/(total number of rows - 1)
  • NTILE 把一个排序过的数据集分成若干部分,给每一条记录分配至这些部分的序号。
    以下示例演示了以上函数的使用,
> SELECT 
> name, 
> dept_num as deptno, 
> salary,
> row_number() OVER () as rnum, -- sequence in orginal table
> rank() OVER (PARTITION BY dept_num ORDER BY salary) as rk, 
> dense_rank() OVER (PARTITION BY dept_num ORDER BY salary) as drk,
> percent_rank() OVER(PARTITION BY dept_num ORDER BY salary) as prk,
> ntile(4) OVER(PARTITION BY dept_num ORDER BY salary) as ntile
> FROM employee_contract
> ORDER BY deptno, name;
+---------+--------+--------+------+----+-----+------+-------+
| name    | deptno | salary | rnum | rk | drk | prk  | ntile |
+---------+--------+--------+------+----+-----+------+-------+
| Lucy    | 1000   | 5500   | 7    | 4  | 3   | 0.75 | 3     |
| Michael | 1000   | 5000   | 11   | 3  | 2   | 0.5  | 2     |
| Steven  | 1000   | 6400   | 8    | 5  | 4   | 1.0  | 4     |
| Wendy   | 1000   | 4000   | 9    | 1  | 1   | 0.0  | 1     |
| Will    | 1000   | 4000   | 10   | 1  | 1   | 0.0  | 1     |
| Jess    | 1001   | 6000   | 5    | 2  | 2   | 0.5  | 2     |
| Lily    | 1001   | 5000   | 6    | 1  | 1   | 0.0  | 1     |
| Mike    | 1001   | 6400   | 4    | 3  | 3   | 1.0  | 3     |
| Richard | 1002   | 8000   | 1    | 3  | 3   | 1.0  | 3     |
| Wei     | 1002   | 7000   | 3    | 2  | 2   | 0.5  | 2     |
| Yun     | 1002   | 5500   | 2    | 1  | 1   | 0.0  | 1     |
+---------+--------+--------+------+----+-----+------+-------+
11 rows selected (80.052 seconds)

自Hive 2.1.0之后,聚集函数可以用于OVER子句中,请看以下示例,

> SELECT
> dept_num,
> rank() OVER (PARTITION BY dept_num ORDER BY sum(salary)) as rk
> FROM employee_contract
> GROUP BY dept_num;
+----------+----+
| dept_num | rk |
+----------+----+
| 1000     | 1  |
| 1001     | 1  |
| 1002     | 1  | 
+----------+----+
3 rows selected (54.43 seconds)
  1. 分析函数
  • cume_dist (number of rows ≤ current row)/(total number of rows)
  • lead 返回后续记录
  • lag 返回之前记录
  • first_value 返回排序数据集的第一个值
  • last_value 返回排序数据集的最后一个值
    以下示例,
> SELECT 
> name,
> dept_num as deptno,
> salary,
> cume_dist() OVER (PARTITION BY dept_num ORDER BY salary) as cume,
> lead(salary, 2) OVER (PARTITION BY dept_num ORDER BY salary) as lead,
> lag(salary, 2, 0) OVER (PARTITION BY dept_num ORDER BY salary) as lag,
> first_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as fval,
> last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary) as lval,
> last_value(salary) OVER (PARTITION BY dept_num ORDER BY salary RANGE BETWEEN UNBOUNDED 
> PRECEDING AND UNBOUNDED FOLLOWING) as lval2
> FROM employee_contract 
> ORDER BY deptno, salary;
+--------+------+--------+------+------+-----+------+------+-------+
| name   |deptno| salary | cume | lead | lag | fval |lvalue|lvalue2|
+--------+------+--------+------+------+-----+------+------+-------+
| Will   | 1000 | 4000   | 0.4  | 5500 | 0   | 4000 | 4000 | 6400  |
| Wendy  | 1000 | 4000   | 0.4  | 5000 | 0   | 4000 | 4000 | 6400  |
| Michael| 1000 | 5000   | 0.6  | 6400 | 4000| 4000 | 5000 | 6400  |
| Lucy   | 1000 | 5500   | 0.8  | NULL | 4000| 4000 | 5500 | 6400  |
| Steven | 1000 | 6400   | 1.0  | NULL | 5000| 4000 | 6400 | 6400  |
| Lily   | 1001 | 5000   | 0.33 | 6400 | 0   | 5000 | 5000 | 6400  |
| Jess   | 1001 | 6000   | 0.67 | NULL | 0   | 5000 | 6000 | 6400  |
| Mike   | 1001 | 6400   | 1.0  | NULL | 5000| 5000 | 6400 | 6400  |
| Yun    | 1002 | 5500   | 0.33 | 8000 | 0   | 5500 | 5500 | 8000  |
| Wei    | 1002 | 7000   | 0.67 | NULL | 0   | 5500 | 7000 | 8000  |
| Richard| 1002 | 8000   | 1.0  | NULL | 5500| 5500 | 8000 | 8000  |
+--------+------+--------+------+------+-----+------+------+-------+
11 rows selected (55.203 seconds)

其中last_value缺省的窗口为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
[<window_expression>] 用来进一步子分区结果并应用窗口函数,有两类窗口类型,行类型和范围类型。另外rank(...), ntile(...), dense_rank(...), cume_dist(...), percent_rank(...), lead(...), lag(...), and row_number(...) 暂不支持这一个功能。

window_expression definition
行类型示例1
-- Preceding and Following
> SELECT 
> name, dept_num as dno, salary as sal,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING) win4,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS 2 PRECEDING) win6, -- FOLLOWING does not work in this way
> max(salary) OVER (PARTITION BY dept_num ORDER BY name
> ROWS UNBOUNDED PRECEDING) win7
> FROM employee_contract
> ORDER BY dno, name;
+---------+------+------+------+------+------+------+------+------+------+
| name    | dno  | sal  | win1 | win2 | win3 | win4 | win5 | win6 | win7 |
+---------+------+------+------+------+------+------+------+------+------+
| Lucy    | 1000 | 5500 | 5500 | 6400 | 6400 | NULL | 6400 | 5500 | 5500 |
| Michael | 1000 | 5000 | 5500 | 6400 | 6400 | 5500 | 6400 | 5500 | 5500 |
| Steven  | 1000 | 6400 | 6400 | 6400 | 6400 | 5500 | 4000 | 6400 | 6400 |
| Wendy   | 1000 | 4000 | 6400 | 6400 | 6400 | 6400 | 4000 | 6400 | 6400 |
| Will    | 1000 | 4000 | 6400 | 6400 | 4000 | 6400 | NULL | 6400 | 6400 |
| Jess    | 1001 | 6000 | 6000 | 6400 | 6400 | NULL | 6400 | 6000 | 6000 |
| Lily    | 1001 | 5000 | 6000 | 6400 | 6400 | 6000 | 6400 | 6000 | 6000 |
| Mike    | 1001 | 6400 | 6400 | 6400 | 6400 | 6000 | NULL | 6400 | 6400 |
| Richard | 1002 | 8000 | 8000 | 8000 | 8000 | NULL | 7000 | 8000 | 8000 |
| Wei     | 1002 | 7000 | 8000 | 8000 | 8000 | 8000 | 5500 | 8000 | 8000 |
| Yun     | 1002 | 5500 | 8000 | 8000 | 7000 | 8000 | NULL | 8000 | 8000 |
+---------+------+------+------+------+------+------+------+------+------+
11 rows selected (55.885 seconds)-- Current and Unbounded
> SELECT 
> name, dept_num as dno, salary as sal,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win8,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win9,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win10,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) win11,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win12,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win13,
> max(salary) OVER (PARTITION BY dept_num ORDER BY name 
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) win14
> FROM employee_contract> ORDER BY dno, name;
+-------+----+------+------+------+-------+-------+-------+-------+-------+
|name   |dno | sal  | win8 | win9 | win10 | win11 | win12 | win13 | win14 |
+-------+----+------+------+------+-------+-------+-------+-------+-------+
|Lucy   |1000| 5500 | 5500 | 5500 | 6400  | NULL  | 5500  | 5500  | 6400  |
|Michael|1000| 5000 | 5000 | 6400 | 6400  | 5500  | 5500  | 6400  | 6400  |
|Steven |1000| 6400 | 6400 | 6400 | 6400  | 5500  | 6400  | 6400  | 6400  |
|Wendy  |1000| 4000 | 4000 | 4000 | 4000  | 6400  | 6400  | 6400  | 6400  |
|Will   |1000| 4000 | 4000 | 4000 | 4000  | 6400  | 6400  | 6400  | 6400  |
|Jess   |1001| 6000 | 6000 | 6000 | 6400  | NULL  | 6000  | 6000  | 6400  |
|Lily   |1001| 5000 | 5000 | 6400 | 6400  | 6000  | 6000  | 6400  | 6400  |
|Mike   |1001| 6400 | 6400 | 6400 | 6400  | 6000  | 6400  | 6400  | 6400  |
|Richard|1002| 8000 | 8000 | 8000 | 8000  | NULL  | 8000  | 8000  | 8000  |
|Wei    |1002| 7000 | 7000 | 7000 | 7000  | 8000  | 8000  | 8000  | 8000  |
|Yun    |1002| 5500 | 5500 | 5500 | 5500  | 8000  | 8000  | 8000  | 8000  |
+-------+----+------+------+------+-------+-------+-------+-------+-------+
11 rows selected (53.754 seconds)

行类型示例2

> SELECT 
> name, dept_num, salary,
> max(salary) OVER w1 as win1,
> max(salary) OVER w2 as win2,
> max(salary) OVER w3 as win3
> FROM employee_contract
> WINDOW w1 as (
> PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
> ),
> w2 as w3,
> w3 as (
> PARTITION BY dept_num ORDER BY name
> ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING
> );
+---------+----------+--------+------+------+------+
| name    | dept_num | salary | win1 | win2 | win3 |
+---------+----------+--------+------+------+------+
| Lucy    | 1000     | 5500   | 5500 | 6400 | 6400 |
| Michael | 1000     | 5000   | 5500 | 6400 | 6400 |
| Steven  | 1000     | 6400   | 6400 | 6400 | 6400 |
| Wendy   | 1000     | 4000   | 6400 | 6400 | 6400 |
| Will    | 1000     | 4000   | 6400 | 4000 | 4000 |
| Jess    | 1001     | 6000   | 6000 | 6400 | 6400 |
| Lily    | 1001     | 5000   | 6000 | 6400 | 6400 |
| Mike    | 1001     | 6400   | 6400 | 6400 | 6400 |
| Richard | 1002     | 8000   | 8000 | 8000 | 8000 |
| Wei     | 1002     | 7000   | 8000 | 8000 | 8000 |
| Yun     | 1002     | 5500   | 8000 | 7000 | 7000 |
+---------+----------+--------+------+------+------+
11 rows selected (57.204 seconds)

范围类型示例1

> SELECT
> dept_num, start_date, name, salary,
> max(salary) OVER (PARTITION BY dept_num ORDER BY salary
> RANGE BETWEEN 500 PRECEDING AND 1000 FOLLOWING) win1,
> max(salary) OVER (PARTITION BY dept_num ORDER BY salary
> RANGE BETWEEN 500 PRECEDING AND CURRENT ROW) win2
> FROM employee_contract
> order by dept_num, start_date;
+----------+------------+---------+--------+------+------+
| dept_num | start_date | name    | salary | win1 | win2 |
+----------+------------+---------+--------+------+------+
| 1000     | 2010-01-03 | Lucy    | 5500   | 6400 | 5500 |
| 1000     | 2012-11-03 | Steven  | 6400   | 6400 | 6400 |
| 1000     | 2013-10-02 | Will    | 4000   | 5000 | 4000 |
| 1000     | 2014-01-29 | Michael | 5000   | 5500 | 5000 |
| 1000     | 2014-10-02 | Wendy   | 4000   | 5000 | 4000 |
| 1001     | 2013-11-03 | Mike    | 6400   | 6400 | 6400 |
| 1001     | 2014-11-29 | Lily    | 5000   | 6000 | 5000 |
| 1001     | 2014-12-02 | Jess    | 6000   | 6400 | 6000 |
| 1002     | 2010-04-03 | Wei     | 7000   | 8000 | 7000 |
| 1002     | 2013-09-01 | Richard | 8000   | 8000 | 8000 |
| 1002     | 2014-01-29 | Yun     | 5500   | 5500 | 5500 |
+----------+------------+---------+--------+------+------+
11 rows selected (60.784 seconds)

相关文章

  • HQL聚集计算之窗口篇

    自Hive 0.11.0之后,窗口函数,作为一组可以以扫描多个输入行作为输入来计算和生成每一个返回值的特殊函数,正...

  • HQL聚集计算之条件篇

    自Hive 0.70之后,HAVING 子句可以用于对聚集结果进行条件过滤。该用法和SQL中类似,这样我们就不需要...

  • HQL聚集计算之进阶篇

    HQL聚集函数可以使用GROUPING SETS, CUBE, 和ROLLUP等关键词。 GROUPING SET...

  • HQL聚集计算之基本篇

    基本内置聚集函数通常需要和GROUP BY子句一起使用。如果没有使用GROUP BY子句,聚集函数会缺省按照整行所...

  • HQL之抽样篇

    由于数据量本身通常过于浩大,我们需要一部分数据来加速数据分析过程。我们就需要运用抽样技术来发现整个数据集的模式和趋...

  • Spring Data之@Query中的org.hibernat

    Spring Data之@Query中的org.hibernate.hql.internal.QueryExecu...

  • HQL数据查询基础

    了解HQL HQL定义 HQL:Hibernate Query Language, Hibernate查询语句 H...

  • Hibernate之HQL

    1. 对于HQL而言,都是基于对象进行查询的 Query query = session.createQuery(...

  • HQL操作之-DDL命令

    HQL操作之-DDL命令 参考:https://cwiki.apache.org/confluence/displ...

  • Hibernate框架学习---HQL数据查询基础

    HQL定义 HQL全称为(Hibernate Query Language) HQL是面向对象的查询语言(即以面向...

网友评论

      本文标题:HQL聚集计算之窗口篇

      本文链接:https://www.haomeiwen.com/subject/lxzwvctx.html