美文网首页
Hive开窗函数

Hive开窗函数

作者: Yobhel | 来源:发表于2020-06-07 10:46 被阅读0次

    1. 介绍

    普通聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。
    因此,普通聚合函数每组(Group by)只有一个返回值,而开窗函数则可以为窗口中的每行都返回一个值。

    1.1 基础结构

    分析函数(如:sum(), max(), row_number()...) + 窗口子句(over函数)
    

    1.2 over 函数

    over(partition by [column_n] order by [column_m])先按照column_n分区,相同的column_n分为一区,每个分区根据column_m排序(默认升序)。

    1.3 测试数据

    建表并插入数据

    -- 建表
    create table student_scores(
      id int,
      studentId int,
      language int,
      math int,
      english int,
      classId string,
      departmentId string
    );
    -- 写入数据
    insert into table student_scores values
      (1,111,68,69,90,'class1','department1'),
      (2,112,73,80,96,'class1','department1'),
      (3,113,90,74,75,'class1','department1'),
      (4,114,89,94,93,'class1','department1'),
      (5,115,99,93,89,'class1','department1'),
      (6,121,96,74,79,'class2','department1'),
      (7,122,89,86,85,'class2','department1'),
      (8,123,70,78,61,'class2','department1'),
      (9,124,76,70,76,'class2','department1'),
      (10,211,89,93,60,'class1','department2'),
      (11,212,76,83,75,'class1','department2'),
      (12,213,71,94,90,'class1','department2'),
      (13,214,94,94,66,'class1','department2'),
      (14,215,84,82,73,'class1','department2'),
      (15,216,85,74,93,'class1','department2'),
      (16,221,77,99,61,'class2','department2'),
      (17,222,80,78,96,'class2','department2'),
      (18,223,79,74,96,'class2','department2'),
      (19,224,75,80,78,'class2','department2'),
      (20,225,82,85,63,'class2','department2');
    

    1.4 窗口含义

    select studentId,math,departmentId,classId,
    -- 符合所有条件的行作为窗口,这里符合department1的有9个
    count(math) over() as count1,
    -- 按照classId分组的所有行作为窗口
    count(math) over(partition by classId) as count2,
    -- 按照classId分组、按照math排序的所有行作为窗口
    count(math) over(partition by classId order by math) as count3,
    -- 按照classId分组、按照math排序,当前行向前1行向后2行的行作为窗口
    count(math) over(partition by classId order by math rows between 1 preceding and 2 following) as count4,
    -- 按照classId分组、按照math排序,当前行向后所有行作为窗口
    count(math) over(partition by classId order by math rows between current row and unbounded following) as count5
    from student_scores where departmentId='department1';
    

    结果:

    studentId    math    departmentId    classId    count1    count2    count3    count4    count5
    111          69      department1     class1     9         5         1         3         5
    113          74      department1     class1     9         5         2         4         4
    112          80      department1     class1     9         5         3         4         3
    115          93      department1     class1     9         5         4         3         2
    114          94      department1     class1     9         5         5         2         1
    124          70      department1     class2     9         4         1         3         4
    121          74      department1     class2     9         4         2         4         3
    123          78      department1     class2     9         4         3         3         2
    122          86      department1     class2     9         4         4         2         1
    

    结果解析:

    studentId=115,
    count1为departmentId=department1的行数为9,
    count2为分区class1中的行数5,
    count3为分区class1中math<=93的行数4,
    count4为分区class1中math值向前1行和向后2行的行数3,
    count5为分区class1中当前math值到class1分区结束的行数2。
    

    上面可以看到:如果不指定ROWS BETWEEN,默认统计窗口是从起点到当前行
    关键是ROWS BETWEEN,也叫做window子句
    PRECEDING:向前
    FOLLOWING:向后
    CURRENT ROW:当前行
    UNBOUNDED:无边界,UNBOUNDED PRECEDING表示从最前面的起点开始,UNBOUNDED FOLLOWING表示到最后面的终点

    开窗函数可以粗略地分为两类:聚合开窗函数和排序开窗函数。

    2. 聚合开窗函数

    2.1 sum 函数

    select studentId,math,departmentId,classId,
    sum(math) over() as sum1,
    sum(math) over(partition by classId) as sum2,
    sum(math) over(partition by classId order by math) as sum3,
    sum(math) over(partition by classId order by math rows between 1 preceding and 2 following) as sum4,
    sum(math) over(partition by classId order by math rows between current row and unbounded following) as sum5
    from student_scores where departmentId='department1';
    
    -- 结果解析:类似count()函数
    

    min(),max(),avg()都与 count()类似

    2.2 first_value 开窗函数

    作用:返回分区中的第一个值

    select studentId,math,departmentId,classId,
    first_value(math) over(partition by classId order by math) as first_value1,
    first_value(math) over(partition by classId order by math rows between 1 preceding and 2 following) as first_value2
    from student_scores where departmentId='department1';
    

    结果:

    studentid    math    departmentid    classid    first_value1    first_value2
    111          69      department1     class1     69              69
    113          74      department1     class1     69              69
    112          80      department1     class1     69              74
    115          93      department1     class1     69              80
    114          94      department1     class1     69              93
    124          70      department1     class2     70              70
    121          74      department1     class2     70              70
    123          78      department1     class2     70              74
    122          86      department1     class2     70              78
    

    结果解析:

    studentId=122,
    first_value1为分区class2中按照math排序的第一个值70,
    first_value2为分区class2中按照math排序后当前行向前1行向后2行区间的第一个值78。
    

    2.3 last_vlaue 开窗函数

    作用:返回分区最后一个值

    hive> select * from my_table;
    OK
    1       3       10      A
    2       NULL    10      NULL
    3       5       10      A
    4       NULL    10      NULL
    5       NULL    10      NULL
    6       2       10      A
    Time taken: 0.06 seconds, Fetched: 6 row(s)
    
    • 您需要做的就是在上一行和当前行之间的窗口上滑动,并找到最近的非空值。 LAST_VALUE可窗口函数有一个参数,可以将空值忽略为boolean。 LAST_VALUE(<field>,<ignore_nulls> as boolean);
    SELECT
        COALESCE(`date`, LAST_VALUE(`date`, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
        COALESCE(number, LAST_VALUE(number, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
        COALESCE(estimate, LAST_VALUE(estimate, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
        COALESCE(client, LAST_VALUE(client, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
    FROM my_table;
    

    结果将是

    OK
    1       3       10      A
    2       3       10      A
    3       5       10      A
    4       5       10      A
    5       5       10      A
    6       2       10      A
    Time taken: 19.177 seconds, Fetched: 6 row(s)
    

    2.4 lag 开窗函数

    作用:LAG(col, n, DEFAULT)用于统计窗口内向上第 n 行的值
    col:列名
    n:向上 n 行,[可选,默认为 1]
    DEFAULT:当向上 n 行为 NULL 时,取默认值;如果不指定,则为 NULL

    select studentId,math,departmentId,classId,
    lag(math,2,60) over(partition by classId order by math) as lag1,
    lag(math,2) over(partition by classId order by math) as lag2
    from student_scores where departmentId='department1';
    

    结果:

    studentid    math    departmentid    classid    lag1    lag2
    111          69      department1     class1     60      NULL
    113          74      department1     class1     60      NULL
    112          80      department1     class1     69      69
    115          93      department1     class1     74      74
    114          94      department1     class1     80      80
    124          70      department1     class2     60      NULL
    121          74      department1     class2     60      NULL
    123          78      department1     class2     70      70
    122          86      department1     class2     74      74
    

    结果解析:

    studentId=113,
    lag1为分区class1按照math排序后当前行向上2行的值NULL,但是设置了DEFUALT,所以为60,
    lag2因为没有设置DEFAULT,所以为NULL。
    

    2.5 lead 开窗函数

    作用:LEAD(col, n, DEFAULT)LAG相反,用于统计窗口内向下 n 行的值
    col:列名
    n:向下 n 行,[可选,默认为 1]
    DEFAULT:当向下 n 行为 NULL 时,取默认值;如果不指定,则为 NULL

    2.6 cume_dist 开窗函数

    作用:计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:(小于等于当前值的行数) / (分区内总行数)

    select studentId,math,departmentId,classId,
    cume_dist() over(order by math) as cume_dist1,
    cume_dist() over(order by math desc) as cume_dist2,
    cume_dist() over(partition by classId order by math) as cume_dist3
    from student_scores where departmentId='department1';
    

    结果:

    studentid    math    departmentid    classid    cume_dist1            cume_dist2            cume_dist3
    111          69      department1     class1     0.1111111111111111    1.0                   0.2
    113          74      department1     class1     0.4444444444444444    0.7777777777777778    0.4
    112          80      department1     class1     0.6666666666666666    0.4444444444444444    0.6
    115          93      department1     class1     0.8888888888888888    0.2222222222222222    0.8
    114          94      department1     class1     1.0                   0.1111111111111111    1.0
    124          70      department1     class2     0.2222222222222222    0.8888888888888888    0.25
    121          74      department1     class2     0.4444444444444444    0.7777777777777778    0.5
    123          78      department1     class2     0.5555555555555556    0.5555555555555556    0.75
    122          86      department1     class2     0.7777777777777778    0.3333333333333333    1.0
    

    结果解析:

    studentId=115,
    cume_dist1=小于等于93的行数8/总行数9=0.8888888888888888
    cume_dist2=大于等于93的行数2/总行数9=0.2222222222222222
    cume_dist3=class1分区内小于等于93的行数4/总行数5=0.8
    

    3. 排序开窗函数

    3.1 row_number 开窗函数

    作用:row_number() over([partition by col1] [order by col2])开窗函数是基于over子句中order by列的一个排名。在窗口或分区内从 1 开始排序,即使遇到col2相等时,名次依旧增加。例如:有两条记录相等,但一个是第一,一个是第二。

    select id,studentId,language,math,english,classId,departmentId,
    row_number() over (order by math) as row_number1
    from student_scores;
    

    结果:

    id    studentid    language    math    english    classid    departmentid    row_number1
    1     111          68          69      90         class1     department1     1
    9     124          76          70      76         class2     department1     2
    18    223          79          74      96         class2     department2     3
    15    216          85          74      93         class1     department2     4
    3     113          90          74      75         class1     department1     5
    6     121          96          74      79         class2     department1     6
    8     123          70          78      61         class2     department1     7
    17    222          80          78      96         class2     department2     8
    19    224          75          80      78         class2     department2     9
    2     112          73          80      96         class1     department1     10
    14    215          84          82      73         class1     department2     11
    11    212          76          83      75         class1     department2     12
    20    225          82          85      63         class2     department2     13
    7     122          89          86      85         class2     department1     14
    5     115          99          93      89         class1     department1     15
    10    211          89          93      60         class1     department2     16
    12    213          71          94      90         class1     department2     17
    4     114          89          94      93         class1     department1     18
    13    214          94          94      66         class1     department2     19
    16    221          77          99      61         class2     department2     20
    
    studentId=223, 216, 113, 121的math都为74,但是row_number1的值是递增的
    

    3.2 rank 开窗函数

    作用:rank() over([partition by col1] [order by col2]),当遇到col2相等时,名次相同,但是下一个 col2 值的名次递增 N(N 是重复的次数)。例如:有两条记录是并列第一,下一个是第三,没有第二。

    select id,studentId,language,math,english,classId,departmentId,
    dense_rank() over (order by math) as rank1
    from student_scores;
    

    结果:

    id    studentid    language    math    english    classid    departmentid    rank1
    1     111          68          69      90         class1     department1     1
    9     124          76          70      76         class2     department1     2
    18    223          79          74      96         class2     department2     3
    15    216          85          74      93         class1     department2     3
    3     113          90          74      75         class1     department1     3
    6     121          96          74      79         class2     department1     3
    8     123          70          78      61         class2     department1     7
    17    222          80          78      96         class2     department2     7
    19    224          75          80      78         class2     department2     9
    2     112          73          80      96         class1     department1     9
    14    215          84          82      73         class1     department2     11
    11    212          76          83      75         class1     department2     12
    20    225          82          85      63         class2     department2     13
    7     122          89          86      85         class2     department1     14
    5     115          99          93      89         class1     department1     15
    10    211          89          93      60         class1     department2     15
    12    213          71          94      90         class1     department2     17
    4     114          89          94      93         class1     department1     17
    13    214          94          94      66         class1     department2     17
    16    221          77          99      61         class2     department2     20
    
    从第3行到第6行,重复出现了4次,所以第7行rank1为3+4=7
    

    3.3 dense_rank 开窗函数

    作用:dense_rank() over([partition by col1] [order by col2])rank类似,当遇到col2相等时,名次同样相等,不同的是,下一个 col2 值的名次+1,而不是+N。

    select id,studentId,language,math,english,classId,departmentId,
    dense_rank() over (order by math) as dense_rank1
    from student_scores;
    

    结果:

    id    studentid    language    math    english    classid    departmentid    dense_rank1
    1     111          68          69      90         class1     department1     1
    9     124          76          70      76         class2     department1     2
    18    223          79          74      96         class2     department2     3
    15    216          85          74      93         class1     department2     3
    3     113          90          74      75         class1     department1     3
    6     121          96          74      79         class2     department1     3
    8     123          70          78      61         class2     department1     4
    17    222          80          78      96         class2     department2     4
    19    224          75          80      78         class2     department2     5
    2     112          73          80      96         class1     department1     5
    14    215          84          82      73         class1     department2     6
    11    212          76          83      75         class1     department2     7
    20    225          82          85      63         class2     department2     8
    7     122          89          86      85         class2     department1     9
    5     115          99          93      89         class1     department1     10
    10    211          89          93      60         class1     department2     10
    12    213          71          94      90         class1     department2     11
    4     114          89          94      93         class1     department1     11
    13    214          94          94      66         class1     department2     11
    16    221          77          99      61         class2     department2     12
    
    从第3行到第6行math值相等,dense_rank1为3,所以第7行dense_rank1为4
    

    3.4 ntile 开窗函数

    作用:ntile(N) over([partition by col1] [order by col2]),将分区中的数据按照顺序划分为 N 片,返回当前片的值。
    注 1:如果切片分布不均匀,默认增加第一个切片的分布
    注 2:ntile不支持ROWS BETWEEN

    select id,studentId,language,math,english,classId,departmentId,
    ntile(2) over(partition by departmentid order by math) as ntile1
    from student_scores;
    

    结果:

    id    studentid    language    math    english    classid    departmentid    ntile1
    1     111          68          69      90         class1     department1     1
    9     124          76          70      76         class2     department1     1
    6     121          96          74      79         class2     department1     1
    3     113          90          74      75         class1     department1     1
    8     123          70          78      61         class2     department1     1
    2     112          73          80      96         class1     department1     2
    7     122          89          86      85         class2     department1     2
    5     115          99          93      89         class1     department1     2
    4     114          89          94      93         class1     department1     2
    15    216          85          74      93         class1     department2     1
    18    223          79          74      96         class2     department2     1
    17    222          80          78      96         class2     department2     1
    19    224          75          80      78         class2     department2     1
    14    215          84          82      73         class1     department2     1
    11    212          76          83      75         class1     department2     1
    20    225          82          85      63         class2     department2     2
    10    211          89          93      60         class1     department2     2
    12    213          71          94      90         class1     department2     2
    13    214          94          94      66         class1     department2     2
    16    221          77          99      61         class2     department2     2
    

    3.5 percent_rank 开窗函数

    作用:percent_rank() over([partition by col1] [order by col2]),计算给定行的百分比排名。分组内当前行的 RANK 值-1/分组内总行数-1,可以用来计算超过了百分之多少的人。

    select id,studentid,language,math,english,classid,departmentid,
    percent_rank() over(order by math) as rn1
    from student_scores;
    

    结果:

    id    studentid    language    math    english    classid    departmentid    rn1
    1     111          68          69      90         class1     department1     0.0
    9     124          76          70      76         class2     department1     0.05263157894736842
    18    223          79          74      96         class2     department2     0.10526315789473684
    15    216          85          74      93         class1     department2     0.10526315789473684
    3     113          90          74      75         class1     department1     0.10526315789473684
    6     121          96          74      79         class2     department1     0.10526315789473684
    8     123          70          78      61         class2     department1     0.3157894736842105
    17    222          80          78      96         class2     department2     0.3157894736842105
    19    224          75          80      78         class2     department2     0.42105263157894735
    2     112          73          80      96         class1     department1     0.42105263157894735
    14    215          84          82      73         class1     department2     0.5263157894736842
    11    212          76          83      75         class1     department2     0.5789473684210527
    20    225          82          85      63         class2     department2     0.631578947368421
    7     122          89          86      85         class2     department1     0.6842105263157895
    5     115          99          93      89         class1     department1     0.7368421052631579
    10    211          89          93      60         class1     department2     0.7368421052631579
    12    213          71          94      90         class1     department2     0.8421052631578947
    4     114          89          94      93         class1     department1     0.8421052631578947
    13    214          94          94      66         class1     department2     0.8421052631578947
    16    221          77          99      61         class2     department2     1.0
    
    第二行,rn1的值为(2-1)/(20-1)=0.05263157894736842
    

    相关文章

      网友评论

          本文标题:Hive开窗函数

          本文链接:https://www.haomeiwen.com/subject/lupazhtx.html