hive窗口函数总结

作者: 愤怒的谜团 | 来源:发表于2019-10-10 16:48 被阅读0次

    一:前言

    根据官网的介绍,hive推出的窗口函数功能是对hive sql的功能增强,确实目前用于离线数据分析逻辑日趋复杂,很多场景都需要用到。以下就是对hive窗口函数的一个总结附上案例。

    二:理解下什么是WINDOW子句(灵活控制窗口的子集)

    PRECEDING:往前
    FOLLOWING:往后
    CURRENT ROW:当前行
    UNBOUNDED:起点(一般结合PRECEDING,FOLLOWING使用)
    UNBOUNDED PRECEDING 表示该窗口最前面的行(起点)
    UNBOUNDED FOLLOWING:表示该窗口最后面的行(终点)
    比如说:
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(表示从起点到当前行)
    ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING(表示往前2行到往后1行)
    ROWS BETWEEN 2 PRECEDING AND 1 CURRENT ROW(表示往前2行到当前行)
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING(表示当前行到终点)
    官网有一段话列出了哪些窗口函数是不支持window子句的,如下图所示:


    image.png

    三:准备需要演示的数据

    insert overwrite table dw_tmp.window_function_temp
    select 
    split(detail,',')[0] as uname
    ,split(detail,',')[1] as create_time
    ,split(detail,',')[2] as pv
    from
    (
        select
        concat('测试用户,2019-10-02,7
        #测试用户,2019-10-05,4
        #测试用户,2019-10-07,5
        #测试用户,2019-10-03,6
        #测试用户,2019-10-04,3
        #测试用户,2019-10-01,3
        #测试用户,2019-10-06,4') as ct_str
    ) t
    lateral view explode(split(ct_str,'#')) t2 as detail;
    
    测试数据.png

    四:Windowing functions

    1.LEAD(col,n,DEFAULT) 用于统计窗口内往下第n行值第一个参数为列名,第二个参数为往下第n行(可选,默认为1,不可为负数),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

    2.LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值第一个参数为列名,第二个参数为往上第n行(可选,默认为1,不可为负数),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

    select 
    uname
    ,create_time
    ,pv
    ,lead(pv,1,-9999) over (partition by uname order by create_time) as lead_1_pv
    ,lag(pv,1,-9999) over (partition by uname order by create_time) as lag_1_pv
    from dw_tmp.window_function_temp;
    
    image.png

    3.FIRST_VALUE取分组内排序后,截止到当前行,第一个值,这最多需要两个参数。第一个参数是您想要第一个值的列,第二个(可选)参数必须是false默认为布尔值的布尔值。如果设置为true,则跳过空值。

    4.LAST_VALUE取分组内排序后,截止到当前行,最后一个值,这最多需要两个参数。第一个参数是您想要第一个值的列,第二个(可选)参数必须是false默认为布尔值的布尔值。如果设置为true,则跳过空值。

    select 
    uname
    ,create_time
    ,pv
    ,first_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as first_value_pv
    ,last_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as last_value_pv
    from dw_tmp.window_function_temp;
    
    image.png

    让我们加上window子句来观察一下变化,虽然FIRST_VALUE和LAST_VALUE不常于与window子句结合使用。

    select 
    uname
    ,create_time
    ,pv
    ,first_value(pv) over (partition by uname order by create_time) as first_value_pv
    ,first_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as window_first_value_pv
    ,last_value(pv) over (partition by uname order by create_time) as last_value_pv
    ,last_value(pv) over (partition by uname order by create_time rows between unbounded preceding and current row) as window_last_value_pv
    from dw_tmp.window_function_temp;
    
    image.png

    五:aggregates functions

    1.COUNT
    2.SUM
    3.MIN
    4.MAX
    5.AVG
    目前支持这五种带有聚合意义的窗口函数,以常用SUM举例。

    select 
    uname
    ,create_time
    ,pv
    ,SUM(pv) over (partition by uname order by create_time) as sum_pv_1 --默认情况
    ,SUM(pv) over (partition by uname order by create_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as sum_pv_2 --表示从起点到当前行
    ,SUM(pv) over (partition by uname) as sum_pv_3 --表示窗口内所有行
    ,SUM(pv) over (partition by uname order by create_time ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as sum_pv_4 --表示起点到终点
    ,SUM(pv) over (partition by uname order by create_time ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING) as sum_pv_5 --表示前2行到后面1行
    from dw_tmp.window_function_temp;
    
    image.png

    从结果当中其实可以得到结论,默认情况就是从起点到当前行,不带order by语句其实就是表示窗口内全部行都参与聚合处理,这里其实还有其他用法,读者可以自行尝试一下。

    六:Analytics functions

    1.ROW_NUMBER
    从1开始,按照顺序,生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列;通常用于获取分组内排序第一的记录;获取一个session中的第一条refer等。
    2.RANK
    生成数据项在分组中的排名,排名相等会在名次中留下空位。
    3.DENSE_RANK
    生成数据项在分组中的排名,排名相等会在名次中不会留下空位。
    4.CUME_DIST
    CUME_DIST 小于等于当前值的行数/分组内总行数
    5.PERCENT_RANK
    PERCENT_RANK 分组内当前行的RANK值-1/分组内总行数-1
    6.NTILE
    NTILE(n) 用于将分组数据按照顺序切分成n片,返回当前切片值,如果切片不均匀,默认增加第一个切片的分布。NTILE不支持ROWS BETWEEN
    以上是带有分析功能的窗口函数,使用的频率没有上面两类高,但是也是需要掌握的。

    我们先对1-3三种分析窗口函数进行演示

    select 
    uname
    ,create_time
    ,pv
    ,ROW_NUMBER() over (partition by uname order by pv) as row_number_pv_1
    ,RANK() over (partition by uname order by pv) as row_number_pv_2
    ,DENSE_RANK() over (partition by uname order by pv) as row_number_pv_3
    from dw_tmp.window_function_temp;
    
    image.png

    第4-5种:

    select 
    uname
    ,create_time
    ,pv
    ,CUME_DIST() over (partition by uname order by pv) as CUME_DIST_pv_
    ,PERCENT_RANK() over (partition by uname order by pv) as PERCENT_RANK_pv_
    from dw_tmp.window_function_temp;
    
    image.png

    第六种:NTILE

    select 
    uname
    ,create_time
    ,pv
    ,NTILE(2) over (partition by uname order by pv) as NTILE_pv_1
    ,NTILE(3) over (partition by uname order by pv) as NTILE_pv_2
    ,NTILE(4) over (partition by uname order by pv) as NTILE_pv_3
    from dw_tmp.window_function_temp;
    
    image.png

    相关文章

      网友评论

        本文标题:hive窗口函数总结

        本文链接:https://www.haomeiwen.com/subject/wvqapctx.html