美文网首页
Hive 窗口函数

Hive 窗口函数

作者: 歌哥居士 | 来源:发表于2019-03-30 12:33 被阅读0次

    Select 窗口函数

    over():指定分析函数的数据窗口大小,这个数据窗口大小随行的变化而变化。
    CURRENT ROW:当前行
    n PRECEDING:往前n行(包含其间所有)
    n FOLLOWING:往后n行(包含其间所有)
    UNBOUNDED PRECEDING:表示从前面的起点开始
    UNBOUNDED FOLLOWING:表示到后面的终点开始
    LAG(col,n):往前第n行数据
    LEAD(col,n):往后第n行数据
    NTILE(n):分n组,并给分组编号

    示例准备 --------------------------------------

    business.txt

    jack,2017-01-01,10
    tony,2017-01-02,15
    jack,2017-02-03,23
    tony,2017-01-04,29
    jack,2017-01-05,46
    jack,2017-04-06,42
    tony,2017-01-07,50
    jack,2017-01-08,55
    mart,2017-04-08,62
    mart,2017-04-09,68
    neil,2017-05-10,12
    mart,2017-04-11,75
    neil,2017-06-12,80
    mart,2017-04-13,94

    hive> create table business(name string, orderdate string, cost int)
    row format delimited
    fields terminated by ',';
    hive> load data local inpath '/home/user000/data/business.txt' into table business;

    查询在2017年4月购买过的顾客及总人数

    错误:hive> select name,count(1) from business where substring(orderdate,1,7) = '2017-04' group by name;
    +-------+------+--+
    | name | _c1 |
    +-------+------+--+
    | jack | 1 |
    | mart | 4 |
    +-------+------+--+
    正确:hive> select name,count(1) over() from business where substring(orderdate,1,7) = '2017-04' group by name;
    +-------+---------+--+
    | name | _wcol0 |
    +-------+---------+--+
    | mart | 2 |
    | jack | 2 |
    +-------+---------+--+
    解释:group by name后窗口是2个,所以加一个over指定窗口,然后count就能得到正确答案。
    如果不加,每个name下有各自的记录,count是分别计算每个name下的记录。

    查询顾客的购买明细及月购买总额

    hive(default)> select *,sum(cost) over(partition by month(orderdate)) from business;
    hive(default)> select *,sum(cost) over(distribute by month(orderdate)) from business;
    +----------------+---------------------+----------------+---------+--+
    | business.name | business.orderdate | business.cost | _wcol0 |
    +----------------+---------------------+----------------+---------+--+
    | jack | 2017-01-01 | 10 | 205 |
    | jack | 2017-01-08 | 55 | 205 |
    | tony | 2017-01-07 | 50 | 205 |
    | jack | 2017-01-05 | 46 | 205 |
    | tony | 2017-01-04 | 29 | 205 |
    | tony | 2017-01-02 | 15 | 205 |
    | jack | 2017-02-03 | 23 | 23 |
    | mart | 2017-04-13 | 94 | 341 |
    | jack | 2017-04-06 | 42 | 341 |
    | mart | 2017-04-11 | 75 | 341 |
    | mart | 2017-04-09 | 68 | 341 |
    | mart | 2017-04-08 | 62 | 341 |
    | neil | 2017-05-10 | 12 | 12 |
    | neil | 2017-06-12 | 80 | 80 |
    +----------------+---------------------+----------------+---------+--+
    解释:partition by orderdate会根据日期分区,使用partition by month(orderdate)按月分区。

    上述的场景,要将cost按照日期进行累加

    hive(default)> select *,sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW) from business;
    +----------------+---------------------+----------------+---------+--+
    | business.name | business.orderdate | business.cost | _wcol0 |
    +----------------+---------------------+----------------+---------+--+
    | jack | 2017-01-01 | 10 | 10 |
    | tony | 2017-01-02 | 15 | 25 |
    | tony | 2017-01-04 | 29 | 54 |
    | jack | 2017-01-05 | 46 | 100 |
    | tony | 2017-01-07 | 50 | 150 |
    | jack | 2017-01-08 | 55 | 205 |
    | jack | 2017-02-03 | 23 | 228 |
    | jack | 2017-04-06 | 42 | 270 |
    | mart | 2017-04-08 | 62 | 332 |
    | mart | 2017-04-09 | 68 | 400 |
    | mart | 2017-04-11 | 75 | 475 |
    | mart | 2017-04-13 | 94 | 569 |
    | neil | 2017-05-10 | 12 | 581 |
    | neil | 2017-06-12 | 80 | 661 |
    +----------------+---------------------+----------------+---------+--+

    按名字统计,每个人的累加:
    +----------------+---------------------+----------------+---------+--+
    | business.name | business.orderdate | business.cost | _wcol0 |
    +----------------+---------------------+----------------+---------+--+
    | jack | 2017-01-01 | 10 | 10 |
    | jack | 2017-01-05 | 46 | 56 |
    | jack | 2017-01-08 | 55 | 111 |
    | jack | 2017-02-03 | 23 | 134 |
    | jack | 2017-04-06 | 42 | 176 |
    | mart | 2017-04-08 | 62 | 62 |
    | mart | 2017-04-09 | 68 | 130 |
    | mart | 2017-04-11 | 75 | 205 |
    | mart | 2017-04-13 | 94 | 299 |
    | neil | 2017-05-10 | 12 | 12 |
    | neil | 2017-06-12 | 80 | 92 |
    | tony | 2017-01-02 | 15 | 15 |
    | tony | 2017-01-04 | 29 | 44 |
    | tony | 2017-01-07 | 50 | 94 |
    +----------------+---------------------+----------------+---------+--+

    查询顾客上次的购买时间

    hive(default)> select name,orderdate,lag(orderdate,1) over(distribute by name sort by orderdate) from business;
    +-------+-------------+-------------+--+
    | name | orderdate | _wcol0 |
    +-------+-------------+-------------+--+
    | jack | 2017-01-01 | NULL |
    | jack | 2017-01-05 | 2017-01-01 |
    | jack | 2017-01-08 | 2017-01-05 |
    | jack | 2017-02-03 | 2017-01-08 |
    | jack | 2017-04-06 | 2017-02-03 |
    | mart | 2017-04-08 | NULL |
    | mart | 2017-04-09 | 2017-04-08 |
    | mart | 2017-04-11 | 2017-04-09 |
    | mart | 2017-04-13 | 2017-04-11 |
    | neil | 2017-05-10 | NULL |
    | neil | 2017-06-12 | 2017-05-10 |
    | tony | 2017-01-02 | NULL |
    | tony | 2017-01-04 | 2017-01-02 |
    | tony | 2017-01-07 | 2017-01-04 |
    +-------+-------------+-------------+--+

    lead示例:
    hive(default)> select name,orderdate,lead(orderdate,1) over(distribute by name sort by orderdate) from business;
    +-------+-------------+-------------+--+
    | name | orderdate | _wcol0 |
    +-------+-------------+-------------+--+
    | jack | 2017-01-01 | 2017-01-05 |
    | jack | 2017-01-05 | 2017-01-08 |
    | jack | 2017-01-08 | 2017-02-03 |
    | jack | 2017-02-03 | 2017-04-06 |
    | jack | 2017-04-06 | NULL |
    | mart | 2017-04-08 | 2017-04-09 |
    | mart | 2017-04-09 | 2017-04-11 |
    | mart | 2017-04-11 | 2017-04-13 |
    | mart | 2017-04-13 | NULL |
    | neil | 2017-05-10 | 2017-06-12 |
    | neil | 2017-06-12 | NULL |
    | tony | 2017-01-02 | 2017-01-04 |
    | tony | 2017-01-04 | 2017-01-07 |
    | tony | 2017-01-07 | NULL |
    +-------+-------------+-------------+--+

    查询前20%的订单信息

    ntile(5),然后取第一组,就是20%了。
    hive(default)> select name,orderdate,cost,ntile(5) over(sort by orderdate) from business;
    +-------+-------------+-------+---------+--+
    | name | orderdate | cost | _wcol0 |
    +-------+-------------+-------+---------+--+
    | jack | 2017-01-01 | 10 | 1 |
    | tony | 2017-01-02 | 15 | 1 |
    | tony | 2017-01-04 | 29 | 1 |
    | jack | 2017-01-05 | 46 | 2 |
    | tony | 2017-01-07 | 50 | 2 |
    | jack | 2017-01-08 | 55 | 2 |
    | jack | 2017-02-03 | 23 | 3 |
    | jack | 2017-04-06 | 42 | 3 |
    | mart | 2017-04-08 | 62 | 3 |
    | mart | 2017-04-09 | 68 | 4 |
    | mart | 2017-04-11 | 75 | 4 |
    | mart | 2017-04-13 | 94 | 4 |
    | neil | 2017-05-10 | 12 | 5 |
    | neil | 2017-06-12 | 80 | 5 |
    +-------+-------------+-------+---------+--+

    hive(default)> select * from
    (select name,orderdate,cost,ntile(5) over(sort by orderdate) gid from business ) t
    where gid = 1;
    +---------+--------------+---------+--------+--+
    | t.name | t.orderdate | t.cost | t.gid |
    +---------+--------------+---------+--------+--+
    | jack | 2017-01-01 | 10 | 1 |
    | tony | 2017-01-02 | 15 | 1 |
    | tony | 2017-01-04 | 29 | 1 |
    +---------+--------------+---------+--------+--+

    RANK() DENSE_RANK() ROW_NUMBER()

    RANK():排序相同时会重复,总数不会变,例如排序结果是 1 1 3 4
    DENSE_RANK():排序相同时,总数会减少,例如排序结果是 1 1 2 3
    ROW_NUMBER():行号

    相关文章

      网友评论

          本文标题:Hive 窗口函数

          本文链接:https://www.haomeiwen.com/subject/kmmxvqtx.html