美文网首页SQL
HiveSQL核心技能之表连接

HiveSQL核心技能之表连接

作者: 让数据告诉你 | 来源:发表于2021-01-28 17:27 被阅读0次

    目标:
    1、掌握HQL中的各种连接及其组合使用;
    2、掌握数据分析中业务指标思路转换的技巧;
    3、区分好full join 和 union all 的使用场景;
    4、在多表连接时,注意各种细节和业务逻辑;
    5、复杂表连接要学会分步骤处理

    一、基础知识:

    1、内连接(inner join,返回两个表的交集)

    需注意:
    1、表连接时,必须进行重命名;
    2、on后面使用的连接条件必须起到唯一键值的作用(有时会有多个字段组合);
    3、inner可省略不写,效果是一样的
    4、表连接时不能使用 a join b join c这种方式,不然会极度浪费电脑的资源和延长查询时间,要在子查询的表里先做好筛选之后在连接;

    1)找出在2019年购买后又退款的用户(记得要去重)

    select a.user_name
    from 
        (select distinct user_name
        from user_trade where year(dt)='2019') as a
    join 
        (select distinct user_name
        from user_refund where year(dt)='2019') as b
    on a.user_name=b.user_name
    
    

    注意:一定要先去重,再做表连接,养成良好的习惯(虽然可以先连接再去重,但是那么做会使执行效率很低)

    2)在2017年和2018年都购买的用户

    select a.user_name
    from
        (select distinct user_name from user_trade where year(dt)=2017) as a
    join 
        (select distinct user_name from user_trade where year(dt)=2018) as b
    on a.user_name=b.user_name;
    
    

    3)在2017年、2018年、2019年都有交易的用户

    select a.user_name
        from (select distinct user_name from trade_2017)a
        join (select distinct user_name from trade_2018)b
        on a.user_name=b.user_name
     join (select distinct user_name from trade_2019)c
     on a.user_name=c.user_name
    注意:要先去重再做表连接
    
    

    2、左连接(left join,查询在表1但不在表2中的数据经常使用左连接的 is NULL)

    进行左连接之后,以左表为全集,返回能够匹配上的右边表的匹配结果,没有匹配上的则显示NULL。

    拓展:
    right join:以右表为全集,返回能够匹配上的左边表的匹配结果,没有匹配上的则显示NULL,可以由left join改写出同样的结果。

    4)在2019年购买,但是没有退款的用户

    select a.user_name
    from (select distinct user_name 
            from user_trade 
            where year(dt)=2019)a
    left join 
        (select distinct user_name 
            from user_refund 
            where year(dt)=2019)b
    on a.user_name=b.user_name
    where b.user_name is null
    
    

    5)在2019年由购买的用户的学历分布

    extra2["education"]
    
    select b.education,count(a.user_name)
    from 
    (select distinct user_name
    from 
    user_trade where year(dt)=2019)a
    left join 
    (select distinct user_name
        ,extra2["education"] as education
    from 
    user_info)b
    on a.user_name=b.user_name
    group by b.education
    
    注意:extra2["education"] 可以换成 get_json_object(extra1,"$.education")
    
    

    6)在2017年和2018年都购买,但是没有在2019年购买的用户

    方法一:
    
    select a.user_name
    from 
    (select distinct user_name
    from user_trade where year(dt)=2017)a
    join
    (select distinct user_name
    from user_trade where year(dt)=2018)b
    on a.user_name=b.user_name
    left join
    (select distinct user_name
    from user_trade where year(dt)=2019)c
    on a.user_name=c.user_name
    where c.user_name is null
    
    
    方法二:
    
    select a.user_name
    from 
        (select distinct user_name
        from trade_2017)a
    join
        (select distinct user_name
        from trade_2018)b
    on a.user_name=b.user_name
    left join
        (select distinct user_name
        from trade_2019)c
    on a.user_name=c.user_name
    where c.user_name is null
    
    

    3、全连接(full join,应该在提取全部行名称的场景比较多)

    select *
    from user_list_1 a
    full join user_list_2 b
    on a.user_id=b.user_id
    
    

    查询两个表的所有用户时使用full join是一个比较好的方法(需要用到coalesce函数:

    select coalesce(a.user_name,b.user_name)
    from user_list_1 a
    full join user_list_2 b
    on a.user_id=b.user_id
    
    

    注:coalesce函数,coalesce(expression1,expression2,...,expression n),依次参考各参数表达式,遇到非null值即停止并返回该值,如果所有的表达式都是空值,最终将返回一个空值。

    4、表合并(union all,利用占位符来做查询筛选可以很好的提高效率)

    注:表合并时字段名称必须一致,字段顺序必须一致,而且不用填写连接条件

    7)2017-2019年由交易的所有用户数

    写法一:
    
    select count(distinct a.user_name),
            count(a.user_name)
    from 
       ( select distinct user_name from trade_2017
        union all
        select distinct user_name from trade_2018
        union all
        select distinct user_name from trade_2019) a
    
    写法二:
    
    select count(distinct a.user_name),
            count(a.user_name)
    from 
       ( select user_name from trade_2017
        union 
        select user_name from trade_2018
        union 
        select  user_name from trade_2019) a
    
    

    union all 和 union 的区别:
    union all 不会去重,不会排序,效率较快;union 会去重且排序,效率较慢。
    如果表很大时,推荐先去重,再进行 union all ,不能直接对表进行 union all,不然效率很慢。

    8)2019年每个用户的支付和退款金额汇总

    写法一(少了子查询里的分组计算步骤,执行效率更高一些,拼接的时候可以使用这种方式,表连接的时候不能用这种方式):
    
    select a.user_name,sum(a.pay_amount),sum(a.refund_amount)
    from(
        (select user_name,pay_amount,0 as refund_amount
            from user_trade 
            where year(dt)=2019)
    union all
        (select user_name,0 as pay_amount,refund_amount
            from user_refund 
            where year(dt)=2019))a
    group by a.user_name
    
    
    写法二:
    
    select a.user_name,sum(a.pay_amount),sum(a.refund_amount)
    from(
        select  user_name
                ,sum(pay_amount)as pay_amount
                ,0 as refund_amount
            from user_trade
            where year(dt)=2019
            group by user_name
    union all
        select  user_name,
                0 as pay_amount,
                sum(refund_amount) as refund_amount
            from user_refund 
            where year(dt)=2019
            group by user_name)a
    group by a.user_name
    
    

    也可以使用 full join 的方式:

    select coalesce(a.user_name,b.user_name)
            ,if(a.payamount is null,0,payamount)
            ,if(b.refundamount is null,0,refundamount)
    from 
         (  select user_name,sum(pay_amount) as payamount
            from user_trade 
            where year(dt)=2019
            group by user_name
            ) a
     full join 
          (  select user_name,sum(refund_amount) as refundamount
            from user_refund
            where year(dt)=2019
            group by user_name
            ) b
      on a.user_name=b.user_name
    
    

    9)2019年每个支付用户的支付金额和退款金额

    写法一(错误写法,由于左连接会出现多次连接的情况,因此要先去重(或分组)之后在进行连接,不然会出现重复连接的情况:
    
    select a.user_name,sum(a.pay_amount),sum(b.refund_amount)
    from
        (select user_name,pay_amount
            from user_trade 
            where year(dt)=2019)a
    left join 
        (select user_name,refund_amount
            from user_refund 
            where year(dt)=2019)b
    on  a.user_name=b.user_name
    group by a.user_name
    
    
    写法二:
    
    select a.user_name,a.pay_amount,b.refund_amount
    from
        (select user_name,sum(pay_amount)as pay_amount
            from user_trade 
            where year(dt)=2019
            group by user_name)a
    left join 
         (select user_name,sum(refund_amount)as refund_amount
            from user_refund
            where year(dt)=2019
            group by user_name)b
    on  a.user_name=b.user_name
    
    

    10)首次激活时间在2017年,但是一直没有支付的用户年龄段分布

        1)
    select a.age_type,count(a.user_name)
    from
        (select user_name,
                case
                when age<20 then '20岁以下'
                when age<30 then '20-30岁'
                when age<40 then '30-40岁'
                else '40岁以上'
                end as age_type
        from user_info
        where year(firstactivetime)=2017)a
    left join
        (select distinct user_name
        from user_trade where dt>0)b
    on a.user_name=b.user_name
    where b.user_name is null
    group by a.age_type
    
    

    步骤总结:
    1、先筛选出年份为2017注册的用户;
    2、没有支付的人;
    3、年龄段分布
    注意:由于age也是在user_info的表格里,第三步用的字段需要在第一步进行预处理,所以在限制时间的时候需要同时对年龄段进行预处理,这样在第三步的时候才会由年龄段这个字段;需要注意对 case when 的字段进行重命名才能进行后续的操作

        2)
    select a.age_type,count(distinct a.user_name)
    from
        (select distinct user_name,
                case
                when age<20 then '20岁以下'
                when age<30 then '20-30岁'
                when age<40 then '30-40岁'
                else '40岁以上'
                end as age_type
        from user_info
        where year(firstactivetime)=2017
        )a
    left join
        (select  distinct user_name
        from user_trade 
        where year(dt) in ('2017','2018','2019')
        )b
    on a.user_name=b.user_name
    where b.user_name is null
    group by a.age_type
    
    

    11)2018、2019年交易的用户,其激活时间段分布

    写法一:
    
    select b.hour,count(a.user_name)
    from
       ( select distinct user_name
        from user_trade
        where  year(dt) in(2018,2019)
        )a
    left join
       ( select distinct user_name,hour(firstactivetime) as hour
        from user_info
        )b
    on a.user_name=b.user_name
    group by b.hour
    
    
    写法二:
    
    select hour(firstactivetime)
            ,count(distinct a.user_name)
    from
      ( select distinct user_name
        from trade_2018
    union all
        select distinct user_name
        from trade_2019
        ) as a
    left join 
        user_info as b
    on a.user_name=b.user_name
    group by hour(firstactivetime)
    
    

    步骤总结:
    1. 取出2018和2019年所有的交易用户的交集
    2. 取出所有用户的激活时间
    3. 统计时间分布

    相关文章

      网友评论

        本文标题:HiveSQL核心技能之表连接

        本文链接:https://www.haomeiwen.com/subject/folxtltx.html