美文网首页
Hive分析窗口函数(一) SUM,AVG,MIN,MAX

Hive分析窗口函数(一) SUM,AVG,MIN,MAX

作者: 尼小摩 | 来源:发表于2018-06-19 14:19 被阅读20次

    Hive中提供了越来越多的分析函数,用于完成负责的统计分析。
    今天先看几个基础的,SUM、AVG、MIN、MAX。

    用于实现分组内所有和连续累积的统计。

    数据准备

    cookie1,2015-04-10,1
    cookie1,2015-04-11,5
    cookie1,2015-04-12,7
    cookie1,2015-04-13,3
    cookie1,2015-04-14,2
    cookie1,2015-04-15,4
    cookie1,2015-04-16,4
    
    
    CREATE EXTERNAL TABLE lxw1234 (
    cookieid string,
    createtime string,
    pv INT
    ) ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    stored as textfile location '/tmp/lxw11/';
    
    
    hive (default)> SELECT cookieid, createtime, pv  FROM lxw1234;
    OK
    cookieid    createtime  pv
    cookie1 2015-04-10  1
    cookie1 2015-04-11  5
    cookie1 2015-04-12  7
    cookie1 2015-04-13  3
    cookie1 2015-04-14  2
    cookie1 2015-04-15  4
    cookie1 2015-04-16  4
    

    SUM — 注意,结果和ORDER BY相关,默认为升序

    1. 默认为从起点到当前行
    hive (default)>  SELECT cookieid, createtime, pv, SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1 FROM lxw1234;
     
    Total MapReduce CPU Time Spent: 4 seconds 110 msec
    OK
    cookieid    createtime  pv  pv1
    cookie1 2015-04-10  1   1
    cookie1 2015-04-11  5   6
    cookie1 2015-04-12  7   13
    cookie1 2015-04-13  3   16
    cookie1 2015-04-14  2   18
    cookie1 2015-04-15  4   22
    cookie1 2015-04-16  4   26
    
    2. 从起点到当前行,结果同pv1
    hive (default)>  SELECT cookieid, createtime, pv, SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING  AND CURRENT ROW)AS pv2  FROM lxw1234;
    
    
    Total MapReduce CPU Time Spent: 4 seconds 70 msec
    OK
    cookieid    createtime  pv  pv2
    cookie1 2015-04-10  1   1
    cookie1 2015-04-11  5   6
    cookie1 2015-04-12  7   13
    cookie1 2015-04-13  3   16
    cookie1 2015-04-14  2   18
    cookie1 2015-04-15  4   22
    cookie1 2015-04-16  4   26
    
    3. 分组内所有行
    hive (default)>  SELECT cookieid, createtime, pv, SUM(pv) OVER(PARTITION BY cookieid) as pv3  FROM lxw1234;
    
    Total MapReduce CPU Time Spent: 4 seconds 160 msec
    OK
    cookieid    createtime  pv  pv3
    cookie1 2015-04-16  4   26
    cookie1 2015-04-15  4   26
    cookie1 2015-04-14  2   26
    cookie1 2015-04-13  3   26
    cookie1 2015-04-12  7   26
    cookie1 2015-04-11  5   26
    cookie1 2015-04-10  1   26
    
    4. 当前行+往前3行
    hive (default)>  SELECT cookieid, createtime, pv, SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as pv4  FROM lxw1234;
    
    Total MapReduce CPU Time Spent: 4 seconds 50 msec
    OK
    cookieid    createtime  pv  pv4
    cookie1 2015-04-10  1   1
    cookie1 2015-04-11  5   6
    cookie1 2015-04-12  7   13
    cookie1 2015-04-13  3   16
    cookie1 2015-04-14  2   17
    cookie1 2015-04-15  4   16
    cookie1 2015-04-16  4   13
    
    5. 当前行+往前3行+往后1行
    hive (default)>  SELECT cookieid, createtime, pv, SUM(PV) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5 from lxw1234;
    
    Total MapReduce CPU Time Spent: 4 seconds 90 msec
    OK
    cookieid    createtime  pv  pv5
    cookie1 2015-04-10  1   6
    cookie1 2015-04-11  5   13
    cookie1 2015-04-12  7   16
    cookie1 2015-04-13  3   18
    cookie1 2015-04-14  2   21
    cookie1 2015-04-15  4   20
    cookie1 2015-04-16  4   13
    
    6. 当前行+往后所有行
    hive (default)>  SELECT cookieid, createtime, pv,  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6  from lxw1234;
    
    Total MapReduce CPU Time Spent: 4 seconds 0 msec
    OK
    cookieid    createtime  pv  pv6
    cookie1 2015-04-10  1   26
    cookie1 2015-04-11  5   25
    cookie1 2015-04-12  7   20
    cookie1 2015-04-13  3   13
    cookie1 2015-04-14  2   10
    cookie1 2015-04-15  4   8
    cookie1 2015-04-16  4   4
    
    

    pv1: 分组内从起点到当前行的pv累积,如,11号的pv1=10号的pv+11号的pv, 12号=10号+11号+12号
    pv2: 同pv1
    pv3: 分组内(cookie1)所有的pv累加
    pv4: 分组内当前行+往前3行,如,11号=10号+11号, 12号=10号+11号+12号, 13号=10号+11号+12号+13号, 14号=11号+12号+13号+14号
    pv5: 分组内当前行+往前3行+往后1行,如,14号=11号+12号+13号+14号+15号=5+7+3+2+4=21
    pv6: 分组内当前行+往后所有行,如,13号=13号+14号+15号+16号=3+2+4+4=13,14号=14号+15号+16号=2+4+4=10

    如果不指定ROWS BETWEEN,默认为从起点到当前行;
    如果不指定ORDER BY,则将分组内所有值累加;

    关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
    PRECEDING:往前
    FOLLOWING:往后
    CURRENT ROW:当前行
    UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

    其他AVG,MIN,MAX,和SUM用法一样。

    AVG

    hive (default)> SELECT 
    cookieid,
     createtime, 
    pv, 
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS avg1,  -- 默认为从起点到当前行
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg2,  --从起点到当前行,结果同pv1 
    AVG(pv) OVER(PARTITION BY cookieid) AS pv3, --分组内所有行
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行
    AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW  AND UNBOUNDED FOLLOWING) AS pv6  ---当前行+往后所有行  
    from lxw1234;
    
    cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6 
    -----------------------------------------------------------------------------
    cookie1 2015-04-10      1       1.0     1.0     3.7142857142857144      1.0     3.0     3.7142857142857144
    cookie1 2015-04-11      5       3.0     3.0     3.7142857142857144      3.0     4.333333333333333       4.166666666666667
    cookie1 2015-04-12      7       4.333333333333333       4.333333333333333       3.7142857142857144      4.333333333333333       4.0     4.0
    cookie1 2015-04-13      3       4.0     4.0     3.7142857142857144      4.0     3.6     3.25
    cookie1 2015-04-14      2       3.6     3.6     3.7142857142857144      4.25    4.2     3.3333333333333335
    cookie1 2015-04-15      4       3.6666666666666665      3.6666666666666665      3.7142857142857144      4.0     4.0     4.0
    cookie1 2015-04-16      4       3.7142857142857144      3.7142857142857144      3.7142857142857144      3.25    3.25    4.0
    

    MIN

    SELECT cookieid,
    createtime,
    pv,
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 
    MIN(pv) OVER(PARTITION BY cookieid) AS pv3,                                                                                                                             --分组内所有行
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,   --当前行+往前3行
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,    --当前行+往前3行+往后1行
    MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6   ---当前行+往后所有行  
    FROM lxw1234;
     
    cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6 
    -----------------------------------------------------------------------------
    cookie1 2015-04-10      1       1       1       1       1       1       1
    cookie1 2015-04-11      5       1       1       1       1       1       2
    cookie1 2015-04-12      7       1       1       1       1       1       2
    cookie1 2015-04-13      3       1       1       1       1       1       2
    cookie1 2015-04-14      2       1       1       1       2       2       2
    cookie1 2015-04-15      4       1       1       1       2       2       4
    cookie1 2015-04-16      4       1       1       1       2       2       4
    

    MAX

    SELECT cookieid,
    createtime,
    pv,
    MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
    MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1 
    MAX(pv) OVER(PARTITION BY cookieid) AS pv3,                             --分组内所有行
    MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4,   --当前行+往前3行
    MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5,    --当前行+往前3行+往后1行
    MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6   ---当前行+往后所有行  
    FROM lxw1234;
     
    cookieid createtime     pv      pv1     pv2     pv3     pv4     pv5      pv6 
    -----------------------------------------------------------------------------
    cookie1 2015-04-10      1       1       1       7       1       5       7
    cookie1 2015-04-11      5       5       5       7       5       7       7
    cookie1 2015-04-12      7       7       7       7       7       7       7
    cookie1 2015-04-13      3       7       7       7       7       7       4
    cookie1 2015-04-14      2       7       7       7       7       7       4
    cookie1 2015-04-15      4       7       7       7       7       7       4
    cookie1 2015-04-16      4       7       7       7       4       4       4
    

    转载:http://lxw1234.com/archives/2015/04/176.htm

    相关文章

      网友评论

          本文标题:Hive分析窗口函数(一) SUM,AVG,MIN,MAX

          本文链接:https://www.haomeiwen.com/subject/cklaeftx.html