目标:
1、掌握HQL中的各种连接及其组合使用;
2、掌握数据分析中业务指标思路转换的技巧;
3、区分好full join 和 union all 的使用场景;
4、在多表连接时,注意各种细节和业务逻辑;
5、复杂表连接要学会分步骤处理
一、基础知识:
1、内连接(inner join,返回两个表的交集)
需注意:
1、表连接时,必须进行重命名;
2、on后面使用的连接条件必须起到唯一键值的作用(有时会有多个字段组合);
3、inner可省略不写,效果是一样的
4、表连接时不能使用 a join b join c这种方式,不然会极度浪费电脑的资源和延长查询时间,要在子查询的表里先做好筛选之后在连接;
1)找出在2019年购买后又退款的用户(记得要去重)
select a.user_name
from
(select distinct user_name
from user_trade where year(dt)='2019') as a
join
(select distinct user_name
from user_refund where year(dt)='2019') as b
on a.user_name=b.user_name
注意:一定要先去重,再做表连接,养成良好的习惯(虽然可以先连接再去重,但是那么做会使执行效率很低)
2)在2017年和2018年都购买的用户
select a.user_name
from
(select distinct user_name from user_trade where year(dt)=2017) as a
join
(select distinct user_name from user_trade where year(dt)=2018) as b
on a.user_name=b.user_name;
3)在2017年、2018年、2019年都有交易的用户
select a.user_name
from (select distinct user_name from trade_2017)a
join (select distinct user_name from trade_2018)b
on a.user_name=b.user_name
join (select distinct user_name from trade_2019)c
on a.user_name=c.user_name
注意:要先去重再做表连接
2、左连接(left join,查询在表1但不在表2中的数据经常使用左连接的 is NULL)
进行左连接之后,以左表为全集,返回能够匹配上的右边表的匹配结果,没有匹配上的则显示NULL。
拓展:
right join:以右表为全集,返回能够匹配上的左边表的匹配结果,没有匹配上的则显示NULL,可以由left join改写出同样的结果。
4)在2019年购买,但是没有退款的用户
select a.user_name
from (select distinct user_name
from user_trade
where year(dt)=2019)a
left join
(select distinct user_name
from user_refund
where year(dt)=2019)b
on a.user_name=b.user_name
where b.user_name is null
5)在2019年由购买的用户的学历分布
extra2["education"]
select b.education,count(a.user_name)
from
(select distinct user_name
from
user_trade where year(dt)=2019)a
left join
(select distinct user_name
,extra2["education"] as education
from
user_info)b
on a.user_name=b.user_name
group by b.education
注意:extra2["education"] 可以换成 get_json_object(extra1,"$.education")
6)在2017年和2018年都购买,但是没有在2019年购买的用户
方法一:
select a.user_name
from
(select distinct user_name
from user_trade where year(dt)=2017)a
join
(select distinct user_name
from user_trade where year(dt)=2018)b
on a.user_name=b.user_name
left join
(select distinct user_name
from user_trade where year(dt)=2019)c
on a.user_name=c.user_name
where c.user_name is null
方法二:
select a.user_name
from
(select distinct user_name
from trade_2017)a
join
(select distinct user_name
from trade_2018)b
on a.user_name=b.user_name
left join
(select distinct user_name
from trade_2019)c
on a.user_name=c.user_name
where c.user_name is null
3、全连接(full join,应该在提取全部行名称的场景比较多)
select *
from user_list_1 a
full join user_list_2 b
on a.user_id=b.user_id
查询两个表的所有用户时使用full join是一个比较好的方法(需要用到coalesce函数:
select coalesce(a.user_name,b.user_name)
from user_list_1 a
full join user_list_2 b
on a.user_id=b.user_id
注:coalesce函数,coalesce(expression1,expression2,...,expression n),依次参考各参数表达式,遇到非null值即停止并返回该值,如果所有的表达式都是空值,最终将返回一个空值。
4、表合并(union all,利用占位符来做查询筛选可以很好的提高效率)
注:表合并时字段名称必须一致,字段顺序必须一致,而且不用填写连接条件
7)2017-2019年由交易的所有用户数
写法一:
select count(distinct a.user_name),
count(a.user_name)
from
( select distinct user_name from trade_2017
union all
select distinct user_name from trade_2018
union all
select distinct user_name from trade_2019) a
写法二:
select count(distinct a.user_name),
count(a.user_name)
from
( select user_name from trade_2017
union
select user_name from trade_2018
union
select user_name from trade_2019) a
union all 和 union 的区别:
union all 不会去重,不会排序,效率较快;union 会去重且排序,效率较慢。
如果表很大时,推荐先去重,再进行 union all ,不能直接对表进行 union all,不然效率很慢。
8)2019年每个用户的支付和退款金额汇总
写法一(少了子查询里的分组计算步骤,执行效率更高一些,拼接的时候可以使用这种方式,表连接的时候不能用这种方式):
select a.user_name,sum(a.pay_amount),sum(a.refund_amount)
from(
(select user_name,pay_amount,0 as refund_amount
from user_trade
where year(dt)=2019)
union all
(select user_name,0 as pay_amount,refund_amount
from user_refund
where year(dt)=2019))a
group by a.user_name
写法二:
select a.user_name,sum(a.pay_amount),sum(a.refund_amount)
from(
select user_name
,sum(pay_amount)as pay_amount
,0 as refund_amount
from user_trade
where year(dt)=2019
group by user_name
union all
select user_name,
0 as pay_amount,
sum(refund_amount) as refund_amount
from user_refund
where year(dt)=2019
group by user_name)a
group by a.user_name
也可以使用 full join 的方式:
select coalesce(a.user_name,b.user_name)
,if(a.payamount is null,0,payamount)
,if(b.refundamount is null,0,refundamount)
from
( select user_name,sum(pay_amount) as payamount
from user_trade
where year(dt)=2019
group by user_name
) a
full join
( select user_name,sum(refund_amount) as refundamount
from user_refund
where year(dt)=2019
group by user_name
) b
on a.user_name=b.user_name
9)2019年每个支付用户的支付金额和退款金额
写法一(错误写法,由于左连接会出现多次连接的情况,因此要先去重(或分组)之后在进行连接,不然会出现重复连接的情况:
select a.user_name,sum(a.pay_amount),sum(b.refund_amount)
from
(select user_name,pay_amount
from user_trade
where year(dt)=2019)a
left join
(select user_name,refund_amount
from user_refund
where year(dt)=2019)b
on a.user_name=b.user_name
group by a.user_name
写法二:
select a.user_name,a.pay_amount,b.refund_amount
from
(select user_name,sum(pay_amount)as pay_amount
from user_trade
where year(dt)=2019
group by user_name)a
left join
(select user_name,sum(refund_amount)as refund_amount
from user_refund
where year(dt)=2019
group by user_name)b
on a.user_name=b.user_name
10)首次激活时间在2017年,但是一直没有支付的用户年龄段分布
1)
select a.age_type,count(a.user_name)
from
(select user_name,
case
when age<20 then '20岁以下'
when age<30 then '20-30岁'
when age<40 then '30-40岁'
else '40岁以上'
end as age_type
from user_info
where year(firstactivetime)=2017)a
left join
(select distinct user_name
from user_trade where dt>0)b
on a.user_name=b.user_name
where b.user_name is null
group by a.age_type
步骤总结:
1、先筛选出年份为2017注册的用户;
2、没有支付的人;
3、年龄段分布
注意:由于age也是在user_info的表格里,第三步用的字段需要在第一步进行预处理,所以在限制时间的时候需要同时对年龄段进行预处理,这样在第三步的时候才会由年龄段这个字段;需要注意对 case when 的字段进行重命名才能进行后续的操作
2)
select a.age_type,count(distinct a.user_name)
from
(select distinct user_name,
case
when age<20 then '20岁以下'
when age<30 then '20-30岁'
when age<40 then '30-40岁'
else '40岁以上'
end as age_type
from user_info
where year(firstactivetime)=2017
)a
left join
(select distinct user_name
from user_trade
where year(dt) in ('2017','2018','2019')
)b
on a.user_name=b.user_name
where b.user_name is null
group by a.age_type
11)2018、2019年交易的用户,其激活时间段分布
写法一:
select b.hour,count(a.user_name)
from
( select distinct user_name
from user_trade
where year(dt) in(2018,2019)
)a
left join
( select distinct user_name,hour(firstactivetime) as hour
from user_info
)b
on a.user_name=b.user_name
group by b.hour
写法二:
select hour(firstactivetime)
,count(distinct a.user_name)
from
( select distinct user_name
from trade_2018
union all
select distinct user_name
from trade_2019
) as a
left join
user_info as b
on a.user_name=b.user_name
group by hour(firstactivetime)
步骤总结:
1. 取出2018和2019年所有的交易用户的交集
2. 取出所有用户的激活时间
3. 统计时间分布
网友评论