美文网首页
级联求和

级联求和

作者: pengtoxen | 来源:发表于2019-08-20 19:28 被阅读0次

    前提

    最近在学习hive,碰到了级联求和的问题.经过一番思考学习,现在做些学习笔记.

    需求

    原始数据表

    访客 月份 访问次数
    A 2015-01 5
    A 2015-01 15
    B 2015-01 5
    A 2015-01 8
    B 2015-01 25
    A 2015-01 5
    A 2015-02 4
    A 2015-02 6
    B 2015-02 10
    B 2015-02 5

    根据上面的数据表输出每个用户每个月份的访问次数,并且每个月统计总的访问次数.最后的输出格式如下

    需要输出报表

    访客 月份 月访问总计 累计访问总计
    A 2015-01 33 33
    A 2015-02 10 43
    B 2015-01 30 30
    B 2015-02 15 45

    实现步骤

    1.创建hive表

    create table t_access_times(username string,month string,cnt int)
    row format delimited fields terminated by ',';
    

    2.准备数据 access.log

    A,2015-01,5
    A,2015-01,15
    B,2015-01,5
    A,2015-01,8
    B,2015-01,25
    A,2015-01,5
    A,2015-02,4
    A,2015-02,6
    B,2015-02,10
    B,2015-02,5
    

    3.加载数据到表中

    load data local inpath '/home/hadoop/access.log' into table t_access_times;
    

    4.自join方式

    • 先求每个用户每个月的访问总次数
    +-----------+----------+---------+--+ 
    | username  |  month   | cnt   |
    +-----------+----------+---------+--+
    | A         | 2015-01  | 33      |
    | A         | 2015-02  | 10      |
    | B         | 2015-01  | 30      |
    | B         | 2015-02  | 15      |
    +-----------+----------+---------+--+
    
    • 将月总次数表 自己连接自己(自join)
    select A.*,B.* FROM
    (select username,month,sum(cnt) as cnt from t_access_times group by username,month) A 
    inner join 
    (select username,month,sum(cnt) as cntfrom t_access_times group by username,month) B
    on
    A.username=B.username
    where B.month <= A.month;
    
    +-------------+----------+-----------+-------------+----------+--------
    | A.username  | A.month  | A.cnt| B.username  | B.month  | B.cnt  |
    +-------------+----------+-----------+-------------+----------+--------
    | A           | 2015-01  | 33        | A           | 2015-01  | 33        |
    | A           | 2015-02  | 10        | A           | 2015-01  | 33        |
    | A           | 2015-02  | 10        | A           | 2015-02  | 10        |
    | B           | 2015-01  | 30        | B           | 2015-01  | 30        |
    | B           | 2015-02  | 15        | B           | 2015-01  | 30        |
    | B           | 2015-02  | 15        | B           | 2015-02  | 15        |
    +-------------+----------+-----------+-------------+----------+--------
    

    刚开始这里不是很明白为什么加上where B.month >= A.month的条件,这样有什么意义?其实这是为后面的统计做准备.

    现在来讲讲这个自join是怎么产生这样的数据的.
    hive的表连接我没有研究过,这里暂时我用mysql的连接来举例说明.我姑且认为它们的实现原理的是一样的.

    1. 从表A中读入一行数据R;
    2. 从数据行R中,取出username字段和where条件到B表中去查找;
    3. 在B表中找到满足条件的行,跟R组成一行,作为结果集的一部分;
    4. 重复执行步骤1到3,直到表A的末尾循环结束;

    在这里,两个表都做了一次全表扫描,所以总的扫描行数是 4 + 4 = 8;
    内存中的判断次数是 4 * 4 = 16;

    • 最终的sql语句
    select A.username,A.month,max(A.cnt) as cnt,sum(B.cnt) as accumulate 
    from 
    (select username,month,sum(cnt) as cntfrom t_access_times group  by username,month) A  
    inner join 
    (select username,month,sum(cnt) as cntfrom t_access_times group by username,month) B 
    on 
    A.username=B.username 
    where B.month <= A.month 
    group by A.username,A.month 
    order by A.username,A.month;
    
    --最终结果为:
    +-------------+----------+---------+-------------+--+
    | A.username  | A.month  | cnt| accumulate  |
    +-------------+----------+---------+-------------+--+
    | A           | 2015-01  | 33      | 33          |
    | A           | 2015-02  | 10      | 43          |
    | B           | 2015-01  | 30      | 30          |
    | B           | 2015-02  | 15      | 45          |
    +-------------+----------+---------+-------------+--+
    
    

    5.窗口函数

    还有一种方式也可以实现需求,那就是窗口函数

    select
    t.username,
    t.month,
    t.cnt,
    sum(t.cnt) over(partition by t.username order by t.username,
    t.month rows between unbounded preceding and current row) as accumlate
    from(
    select 
    username,month,
    sum(cnt) as cnt
    from t_access_times group by username,month) t
    ;
    
    --最终结果为:
    +-------------+----------+---------+-------------+--+
    | A.username  | A.month  | cnt| accumulate  |
    +-------------+----------+---------+-------------+--+
    | A           | 2015-01  | 33      | 33          |
    | A           | 2015-02  | 10      | 43          |
    | B           | 2015-01  | 30      | 30          |
    | B           | 2015-02  | 15      | 45          |
    +-------------+----------+---------+-------------+--+
    

    相关文章

      网友评论

          本文标题:级联求和

          本文链接:https://www.haomeiwen.com/subject/tgkmsctx.html