Hive Window Functions

Author: 米小河123 | Published: 2020-07-02 17:09

    I. Use cases:

    • Sorting within partitions
    • Dynamic GROUP BY (aggregating without collapsing rows)
    • Top N
    • Running (cumulative) calculations

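    II. Function overview

    1. Window functions:

    first_value: after sorting within the partition, returns the first value up to the current row;
    last_value: after sorting within the partition, returns the last value up to the current row (with the default frame this is the current row's own value, or that of its last peer when the sort key has ties);
    lead(col, n, default): returns the value of col from the nth row after the current row in the window. The first argument is the column, the second is the offset n (optional, default 1), and the third is the default value, returned when there is no row n rows ahead (NULL if not specified);
    lag(col, n, default): the opposite of lead; returns the value of col from the nth row before the current row in the window. The first argument is the column, the second is the offset n (optional, default 1), and the third is the default value, returned when there is no row n rows back (NULL if not specified).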
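
The offset and default arguments are easiest to see on a tiny table. The sketch below does not run on Hive: it assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in, on the assumption that both engines follow the SQL standard for these four functions. The three-row order_detail table and its values are made up for illustration.

```python
import sqlite3

# Hypothetical mini version of order_detail; the values are invented.
rows = [("u1", 10), ("u2", 20), ("u3", 30)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, sales INTEGER)")
conn.executemany("INSERT INTO order_detail VALUES (?, ?)", rows)

query = """
SELECT user_id,
       lead(user_id)           OVER (ORDER BY sales) AS next_user,
       lag(user_id, 2, 'none') OVER (ORDER BY sales) AS two_back,
       first_value(user_id)    OVER (ORDER BY sales) AS first_user,
       last_value(user_id)     OVER (ORDER BY sales) AS last_so_far,
       last_value(user_id)     OVER (ORDER BY sales
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_overall
FROM order_detail
ORDER BY sales
"""
offsets = conn.execute(query).fetchall()
for row in offsets:
    print(row)
```

With the default frame, last_so_far simply echoes the current row; the explicit frame on last_overall is what makes last_value return the last row of the whole partition.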

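    2. The OVER clause

    1) OVER with the standard aggregate functions COUNT, SUM, MIN, MAX and AVG
    2) OVER with a PARTITION BY clause on one or more partitioning columns of any primitive type
    3) OVER with PARTITION BY and ORDER BY on one or more partitioning and/or ordering columns
    4) OVER with a window specification. Window specifications support the following formats: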

    (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
    (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
    

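    When ORDER BY is specified without a window clause, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

    When both ORDER BY and the window clause are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

    The OVER clause supports the following functions, but they cannot be combined with a window clause:
    Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.
    Lead and Lag functions.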

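    3. Analytic functions

    row_number(): generates a sequence number, starting at 1, for the rows of a partition in sort order. Typical uses: ranking rows by pv in descending order within each partition, fetching the top-1 row of each partition, or fetching the first record of a session.
    rank(): the rank of the row within its partition; tied rows share a rank and leave gaps in the sequence.
    dense_rank(): the rank of the row within its partition; tied rows share a rank and leave no gaps in the sequence.
    cume_dist: (number of rows with a value less than or equal to the current value) / (total rows in the partition). For example, the share of people whose salary is at or below the current salary.
    percent_rank: (rank of the current row within the partition - 1) / (total rows in the partition - 1).
    ntile(n): splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows do not divide evenly, the extra rows go to the earliest buckets, one each, starting with the first. ntile does not support ROWS BETWEEN; for example, this is not supported: ntile(2) over(partition by cookieid order by createtime rows between 3 preceding and current row)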

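    -- Hive 2.1.0 and later support DISTINCT in windowed aggregates
    COUNT(DISTINCT a) OVER (PARTITION BY c)
    
    -- Hive 2.2.0 adds DISTINCT support together with ORDER BY and a window specification
    COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
    
    -- Hive 2.1.0 and later support aggregate functions inside the OVER clause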
    SELECT rank() OVER (ORDER BY sum(b))
    FROM t
    GROUP BY a
    ;
    

    4. Example queries on the test dataset


    -- COUNT, SUM, MIN, MAX, AVG
    select 
        user_id,
        user_type,
        sales,
        -- default frame: partition start through the current row (RANGE, so rows tied on sales are included)
        sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc) AS sales_1,
        -- partition start through the current row, counted by ROWS; differs from sales_1 only when sales has ties
        sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sales_2,
        -- current row + the 3 preceding rows
        sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS sales_3,
        -- current row + the 3 preceding rows + the 1 following row
        sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS sales_4,
        -- current row + all following rows
        sum(sales) OVER(PARTITION BY user_type ORDER BY sales asc ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS sales_5,
        -- all rows in the partition
        SUM(sales) OVER(PARTITION BY user_type) AS sales_6
    from 
        order_detail
    order by 
        user_type,
        sales,
        user_id
    ;
    -- Notes:
    -- the results depend on ORDER BY, which is ascending by default;
    -- with ORDER BY but no ROWS BETWEEN, the frame defaults to partition start through the current row;
    -- without ORDER BY, all values in the partition are summed;
    
    

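    The key is understanding ROWS BETWEEN, also called the window clause:
    PRECEDING: rows before the current row
    FOLLOWING: rows after the current row
    CURRENT ROW: the current row
    UNBOUNDED: no bound (the start or end of the partition)
    UNBOUNDED PRECEDING: from the start of the partition
    UNBOUNDED FOLLOWING: to the end of the partition
    COUNT, AVG, MIN and MAX are used the same way as SUM.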
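
The difference between the default RANGE frame and an explicit ROWS frame only shows up when the sort key has ties. The sketch below illustrates this with Python's built-in sqlite3 module (SQLite 3.25 or later) standing in for Hive, on the assumption that both follow the SQL standard for frames. The table contents are invented, with two rows tied on sales = 20.

```python
import sqlite3

# Made-up data: u2 and u3 tie on sales = 20.
rows = [("u1", 10), ("u2", 20), ("u3", 20), ("u4", 30)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, sales INTEGER)")
conn.executemany("INSERT INTO order_detail VALUES (?, ?)", rows)

query = """
SELECT sales,
       -- ORDER BY without a frame: RANGE, so tied rows are summed together
       sum(sales) OVER (ORDER BY sales) AS range_sum,
       -- explicit ROWS frame: a strict row-by-row running total
       sum(sales) OVER (ORDER BY sales
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum,
       -- no ORDER BY and no frame: the whole partition
       sum(sales) OVER () AS total
FROM order_detail
ORDER BY sales, rows_sum
"""
frames = conn.execute(query).fetchall()
for row in frames:
    print(row)
```

Both tied rows get range_sum = 50, while rows_sum steps through 30 and then 50. Which of the two tied rows receives 30 is not defined, so the final ORDER BY includes rows_sum to keep the printed order stable.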

    -- first_value and last_value
    select 
        user_id,
        user_type,
        ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,  
        first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
        first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
        last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
        last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
    from 
        order_detail;
    
    -- lead and lag
    select 
        user_id,device_id,
        lead(device_id) over (order by sales) as default_after_one_line,
        lag(device_id) over (order by sales) as default_before_one_line,
        lead(device_id,2) over (order by sales) as after_two_line,
        lag(device_id,2,'abc') over (order by sales) as before_two_line
    from 
        order_detail;
    
    -- RANK, ROW_NUMBER, DENSE_RANK
    select 
        user_id,user_type,sales,
        RANK() over (partition by user_type order by sales desc) as r,
        ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
        DENSE_RANK() over (partition by user_type order by sales desc) as dr
    from
        order_detail;  
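
To see how the three ranking functions treat ties, and how row_number gives a top-N per partition, here is a small sketch. It uses Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for Hive, assuming the standard semantics carry over. The six rows are invented for illustration.

```python
import sqlite3

# Made-up data: u2 and u3 tie on sales within user_type 'new'.
rows = [
    ("u1", "new", 30), ("u2", "new", 20), ("u3", "new", 20), ("u4", "new", 10),
    ("u5", "old", 50), ("u6", "old", 40),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, user_type TEXT, sales INTEGER)")
conn.executemany("INSERT INTO order_detail VALUES (?, ?, ?)", rows)

ranking_query = """
SELECT user_type, sales,
       rank()       OVER (PARTITION BY user_type ORDER BY sales DESC) AS r,
       dense_rank() OVER (PARTITION BY user_type ORDER BY sales DESC) AS dr,
       row_number() OVER (PARTITION BY user_type ORDER BY sales DESC) AS rn
FROM order_detail
ORDER BY user_type, rn
"""
ranks = conn.execute(ranking_query).fetchall()

# Top 1 per user_type: number the rows in a subquery, then filter on rn.
top1_query = """
SELECT user_type, user_id
FROM (
    SELECT user_type, user_id,
           row_number() OVER (PARTITION BY user_type ORDER BY sales DESC) AS rn
    FROM order_detail
) t
WHERE rn = 1
ORDER BY user_type
"""
top1 = conn.execute(top1_query).fetchall()

for row in ranks:
    print(row)
print(top1)
```

For the tied rows, rank gives 2, 2 and then jumps to 4, dense_rank gives 2, 2 and continues with 3, and row_number keeps counting 2, 3, 4.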
    
    -- NTILE
    
    select 
        user_type,sales,
        -- split each partition into 2 buckets
        NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
        -- split each partition into 3 buckets
        NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
        -- split each partition into 4 buckets
        NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
        -- split all rows into 4 buckets
        NTILE(4) OVER(ORDER BY sales) AS all_nt4
    from 
        order_detail
    order by 
        user_type,
        sales
    ;
    
    -- user IDs in the top 20% by sales
    select
        user_id
    from
    (
        select 
            user_id,
            NTILE(5) OVER(ORDER BY sales desc) AS nt
        from 
            order_detail
    )A
    where nt=1;
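
The bucket sizes produced by ntile, and the top-20% filter, can be checked on a small table. This sketch uses Python's built-in sqlite3 module (SQLite 3.25 or later) in place of Hive, assuming both assign the extra rows to the earliest buckets. The ten users and their sales values are invented.

```python
import sqlite3

# Made-up data: ten users with distinct sales 10, 20, ..., 100.
rows = [(f"u{i}", i * 10) for i in range(1, 11)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, sales INTEGER)")
conn.executemany("INSERT INTO order_detail VALUES (?, ?)", rows)

# 10 rows into 3 buckets: the extra row goes to the first bucket (sizes 4, 3, 3).
nt3 = [
    bucket
    for (bucket,) in conn.execute(
        "SELECT ntile(3) OVER (ORDER BY sales) AS nt3 FROM order_detail ORDER BY sales"
    )
]

# 10 rows into 5 buckets by descending sales: bucket 1 holds the top 20%.
top20_query = """
SELECT user_id
FROM (
    SELECT user_id, sales,
           ntile(5) OVER (ORDER BY sales DESC) AS nt
    FROM order_detail
) a
WHERE nt = 1
ORDER BY sales DESC
"""
top20 = [user_id for (user_id,) in conn.execute(top20_query)]

print(nt3)
print(top20)
```

Note that ntile cuts by row count, so when several rows tie on sales at a bucket boundary, which of them land in bucket 1 is arbitrary.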
    
    -- CUME_DIST, PERCENT_RANK
    
    select 
        user_id,user_type,sales,
        -- no PARTITION BY: all rows form a single partition
        CUME_DIST() OVER(ORDER BY sales) AS cd1,
        -- partitioned by user_type
        CUME_DIST() OVER(PARTITION BY user_type ORDER BY sales) AS cd2
    from 
        order_detail;
    
    
    select 
        user_type,sales,
        -- total rows in the partition
        SUM(1) OVER(PARTITION BY user_type) AS s,
        -- rank value
        RANK() OVER(ORDER BY sales) AS r,
        PERCENT_RANK() OVER(ORDER BY sales) AS pr,
        -- within the partition
        PERCENT_RANK() OVER(PARTITION BY user_type ORDER BY sales) AS prg
    from 
        order_detail;
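
The two formulas, cume_dist = (rows with value <= current) / (total rows) and percent_rank = (rank - 1) / (total rows - 1), can be checked by recomputing them by hand. The sketch below does so with Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for Hive, assuming both follow the standard definitions. The four rows are invented, with a tie on sales = 20.

```python
import sqlite3

# Made-up data: two rows tie on sales = 20.
rows = [("u1", 10), ("u2", 20), ("u3", 20), ("u4", 40)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_detail (user_id TEXT, sales INTEGER)")
conn.executemany("INSERT INTO order_detail VALUES (?, ?)", rows)

query = """
SELECT sales,
       cume_dist()    OVER (ORDER BY sales) AS cd,
       percent_rank() OVER (ORDER BY sales) AS pr,
       rank()         OVER (ORDER BY sales) AS r,
       count(*)       OVER ()               AS n
FROM order_detail
ORDER BY sales
"""
dist = conn.execute(query).fetchall()

# Recompute both values from their definitions for comparison.
manual = []
for sales, cd, pr, r, n in dist:
    rows_le = sum(1 for _, s in rows if s <= sales)  # rows with value <= current
    manual.append((rows_le / n, (r - 1) / (n - 1)))
    print(sales, cd, pr, manual[-1])
```

The tied rows share cume_dist = 3/4 and percent_rank = 1/3, because both functions are defined in terms of peers and rank rather than row position.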
    

    Article link: https://www.haomeiwen.com/subject/gwprqktx.html