一、准备工作及背景
1、准备数据库:data
···
create database data charset=utf8;
···
2、准备数据表:orderinfo (539,414数据量)和userinfo(101,535数据量)
source+表1路径
source+表2路径
3、查看表结构:
表orderinfo相关信息 表userinfo相关信息
二、题目要求..
1、统计不同月份的下单人数
思考过程:
select month(paidTime) from orderinfo limit 10,10;
select count(1) ,month(paidTime) as mon from orderinfo group by month(paidTime) ;
select
year(paidTime),
month(paidTime),
count(1)
from orderinfo
group by year(paidTime),month(paidTime);
注意:
不同月份,可能是同一年的不同月份,也有可能是不同年的不同月份,所以在分组的时候需要考虑年份。
结果:
可以利用where去除脏数据(0000-00-00 00:00:00)以及筛选出‘已支付’的用户
select
year(paidTime),
month(paidTime),
count(1)
from orderinfo
where isPaid='已支付'
and
paidTime!='0000-00-00 00:00:00'
group by year(paidTime),month(paidTime);
注意:group by 放在语句的最后
可能存在同一个用户支付多个订单的情况,所以需要用distinct做区分
select
year(paidTime),
month(paidTime),
count(distinct userid)
from orderinfo
where isPaid='已支付'
and
paidTime!='0000-00-00 00:00:00'
group by year(paidTime),month(paidTime);
补充:ctrl+c可以退出运行进程。
2、统计用户三月份的回购率和复购率
复购率:当月购买了多次的用户占当月用户的比例。
回购率:上月购买用户中有多少用户又再次购买。
思考过程:
-
首先计算复购率
(1)先筛选出3月份的消费情况
应该也要考虑年份的,但是我们发现除去脏数据后年份只剩2016年
在用where筛选之前查看一下paidTime的类型(varchar,'3')
select
*
from orderinfo
where isPaid='已支付'
and month(paidTime)='3'
limit 10;
(2)统计每个用户在3月份消费了多少次
思路:group by用户
select
userid,count(1)
from orderinfo
where isPaid='已支付'
and month(paidTime)='3'
group by userid
limit 10;
(3)统计出消费次数大于1的用户数、总用户数以及复购率
select
sum(if(con>1,1,0)) as fugou_cons,
count(1) as userid_cons,
sum(if(con>1,1,0))/count(1) as fugou_rate
from (select
userid,count(1) as con
from orderinfo
where isPaid='已支付'
and month(paidTime)='3'
group by userid
) t;
注意:嵌套的子表需要取别名(比如,t)且不需要加as;否则会报错如下:
小tip:在notepad++中'alt'+鼠标下拉,可以统一选中多行,进而可以按空格产生缩进,便于代码美观易观看。
或者 选择代码,再按‘tab’也可达到同样缩进效果。
-
接着计算回购率
回购率:上月购买用户中有多少用户又再次购买。
思考过程:
先查看下数据中的月份类别
select distinct(month(paidTime)) from orderinfo;
现在是六月,那么这里的上月就是五月,确认了存在五月的数据,可以继续计算。
五月份的回购率=五月用户中6月又再次购买的人数/5月的用户总数
我们关系的不是用户购买的次数,而是用户有没有购买过。
(1)筛选出五月购买的用户
select * from orderinfo
where month(paidTime)='5';
(2)按照userid进行分组并显示购买次数
select userid,count(1) as con from orderinfo
where month(paidTime)='5' group by userid;
(3)统计购买次数大于1 的用户数、总用户数以及回购率
select
count(1) as userid_con,
sum(if(con >1,1,0)) as huigou_con,
sum(if(con >1,1,0))/count(1) as huigou_rate
from
(select userid,count(1) as con from orderinfo
where month(paidTime)='5' group by userid) t;
===
(1)、统计每年每月的用户消费情况
select
userid,
date_format(paidTime,'%Y-%m') as month_dt,
count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m')
limit 10;
注意:我们这里选用date_format()而没有用month(),目的是便于后续判断日期是否相邻。
(2)、相邻月份进行关联,能关联上的用户说明就是回购。
首先写个框架
select
from () a
left join
() b;
其次,自己和自己进行关联,注意用left join,避免遗漏信息
select
*
from (select
userid,
date_format(paidTime,'%Y-%m-%d') as month_dt,
count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) a
left join
(select
userid,
date_format(paidTime,'%Y-%m-%d') as month_dt,
count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) b on a.userid=b.userid
and date_sub(b.month_dt ,interval 1 month)=a.month_dt limit 10;
?但为什么右列都是null值?
(3)统计每个月份的消费人数情况以及得到回购率
框架:
select
a.month_dt,
count(a.userid),
count(b.userid),
count(b.userid)/count(a.userid)
from
()
group by a.month_dt;
完整代码:
select
a.month_dt,
count(a.userid),
count(b.userid),
count(b.userid)/count(a.userid)
from
(select
userid,
date_format(paidTime,'%Y-%m-%d') as month_dt,
count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) a
left join
(select
userid,
date_format(paidTime,'%Y-%m-%d') as month_dt,
count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) b on a.userid=b.userid
and date_sub(b.month_dt,interval 1 month)=a.month_dt
group by a.month_dt;
存在疑问:
1、date_sub()函数是对日期格式有特殊要求吗?
具体情形:
如果我将paidTime格式转换为'%Y-%m'再进行date_sub(b.month_dt,interval 1 month)与a表日期进行是否相等判断,则匹配的全为null值。
但是参考答案中将paidTime格式转换为'%Y-%m-01'再进行匹配则不全为null值,但是这个‘01’的设置在视频讲解中也提到说不是影响因素,只是强制设置所有的记录的日为01,所以就不清楚这个错误是为什么出现?
解决问题:
date_sub(b.month_dt,interval 1 month)中的日期最少要精确到日,到月的输出Null值
2、表Orderinfo中paidTime的约束是字符串类型(varchar),而date_format (日期,字符串格式)函数的作用是将日期转换为字符串格式,那用date_format(paidTime,'%Y-%m-01'),是把paidTime当成日期类型了吗?还是说我理解错了?
解决问题:
date_format (日期,字符串格式)函数并没有规定这个字段的类型,只要是合法的日期就行
总结:还是需要再练练
注意:count()函数在统计时并不计算null值
3、统计男女用户消费频次是否有差异
思考过程:
(1)统计不同性别下的用户的消费次数
思考过程:
(1)首先通过left join合并一下表userinfo和表orderinfo
select * from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;
查看一下sex的类别
select distinct sex from userinfo;
类别为:'男','女',字符串空''
(1)我们需要统计不同性别下的消费情况(次数)
select * from
(select *,o.userid as o_id from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' ;
出现ERROR 1060 (42S21): Duplicate column name 'userid',已经做好记录并解决了(见:https://www.jianshu.com/p/0c0df1939ee4)。
select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id
from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;
select * from
(select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' ;
select * from
(select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' limit 10 ;
#去除sex为空的字段以及根据sex和id进行分组
select o_id as all_id ,sex,count(1) from
(select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by sex,o_id limit 20;
#修改子查询的写法,并改变分组的顺序为id,sex
select orderid as all_id ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by orderid,sex limit 20;
#注意:group by 后面变量的前后顺序不同,得到的结果也会发生改变。
#去掉'已支付'的限制条件
select orderid as all_id ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.sex!='' group by orderid,sex limit 20;
#把left join 换成inner join
select orderid as all_id ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o inner join userinfo as u on u.userid=o.userid) t
where t.sex!='' group by orderid,sex limit 20;
/*发现分组的依据id以及查询的id使用错误,应该使用userid 而不是orderid,所以还是用回之前的left join,看下结果:*/
select userid ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.sex!='' group by userid,sex limit 20;
#接着再次加上'已支付'的限制条件
select userid ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by userid,sex limit 20;
这样我们就统计了不同性别下的用户消费次数
接着,对性别做一个消费次数平均计算
同样构建框架:
select
from ()
group by sex;
详细代码:
select
sex,
avg(count(1)) as avg_con
from (select userid ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by userid,sex) a
group by sex;
但出现错误:
ERROR 1111 (HY000): Invalid use of group function
这个问题还有待进一步了解。
但是修改avg()后运行成功,
完整代码如下:
select
sex,
avg(cons) as avg_con
from (select userid ,sex,count(1) as cons from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by userid,sex) a
group by sex;
错误记录:
1、
select *,o.userid as o_id from
(select * from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' ;
ERROR 1060 (42S21): Duplicate column name 'userid'
解决办法:
其实这个问题的本质是子查询:
select * from orderinfo as o left join userinfo as u on u.userid=o.userid;
中出现重复字段名,如下图:
所以,我们在查询的时候需要对字段进行重命名,如下:
select * ,o.userid as o_id, u.userid as u_id from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;
但是这样是在查询原有全部信息的同时又增加了重新命名的两列,如何在查询全部信息的同时更改其中某个字段的名称呢?
目前的解决办法是:
自己手动写出需要查询的字段的名称,如下:
select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id
from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;
这样似乎有些蠢笨,等发现有简单方法后再来更新。
方法更新:
如下:
select a.*,b.sex,b.birth
from orderidfo as a left join userinfo as b on a.userid=b.userid limit 10;
这个的思路是:
这样写就不用修改名称了,a.*查询字段多的表,字段少的表直接手写
错误2:
select
sex,
avg(count(1)) as avg_con
from (select userid ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by userid,sex) a
group by sex;
ERROR 1111 (HY000): Invalid use of group function
4、统计多次消费的用户,第一次和最后一次消费间隔是多少天
即产品的生命周期。
(1)统计出多次消费的用户
select userid from orderinfo
where isPaid='已支付'
group by userid
having count(1)>1;
(2)取出第一次和最后一次的时间
select
userid ,
min(paidTime),
max(paidTime),
max(paidTime)-min(paidTime)
from orderinfo
where isPaid='已支付'
group by userid
having count(1)>1 limit 10;
注意:日期类型不能用简单的算术运算,应该用日期特有的运算(datediff(日期1,日期2))
select
userid ,
min(paidTime),
max(paidTime),
datediff(max(paidTime),min(paidTime) )
from orderinfo
where isPaid='已支付'
group by userid
having count(1)>1 limit 10;
5、统计不同年龄段,用户的消费金额是否有差异
首先计算年龄
select userid,2020-year(birth) as age from userinfo limit 10;
注意:计算年龄可以考虑timestampdiff(year,birth,now())函数,而且该函数未过生日会减一
其中,第一个参数year是返回的类型,也可以是month,hour
select userid,birth,now(),timestampdiff(year,birth,now()) as age
from userinfo limit 10;
注意:系统设置的默认时间为'1900-01-01'(因为发现年龄居然存在一百多岁的),以及存在脏数据‘ 0000-00-00’
所以需要进行筛选:
select userid,birth,now(),timestampdiff(year,birth,now()) as age
from userinfo where year(birth)>1900 limit 30;
年龄如何分段?0-10:1,11-20:2,21-30:3 利用case when
补充:向上取整函数ceil()
select
userid,
birth,
now(),
ceil(timestampdiff(year,birth,now())/10) as age
from userinfo
where year(birth)>1900
limit 10;
疑问:
year(birth)>1900 和birth>'1901-00-00'的区别??
统计消费金额
select price from orderinfo limit 10;
消费金额如何分段?
分段之后通过两个表的userid 连接起来,关联订单信息,获取不同年龄段的一个消费频次和消费金额
框架:
select
from orderinfo a
inner join() b on a.userid=b.userid;
详细代码:
select
userid,
age,
count(1) as con,
sum(price) as prices
from orderinfo a
inner join(select
userid,
birth,
now(),
ceil(timestampdiff(year,birth,now())/10) as age
from userinfo
where year(birth)>1900) b on a.userid=b.userid
group by userid,age;
修改如下:
select
a.userid,
age,
count(1) as con,
sum(price) as prices
from orderinfo a
inner join(select
userid,
birth,
now(),
ceil(timestampdiff(year,birth,now())/10) as age
from userinfo
where year(birth)>1900) b on a.userid=b.userid
group by a.userid,age limit 30;
(3)再对年龄分层进行聚合,得到不同年龄层的消费情况
框架:
select
age,
avg(con),
avg(prices)
from
() a
group by age ;
完整代码:
select
age,
avg(con),
avg(prices)
from
(select
a.userid,
age,
count(1) as con,
sum(price) as prices
from orderinfo a
inner join(select
userid,
birth,
now(),
ceil(timestampdiff(year,birth,now())/10) as age
from userinfo
where year(birth)>1900) b on a.userid=b.userid
group by a.userid,age) a
group by age ;
错误记录:
错误3、ERROR 1052 (23000): Column 'userid' in field list is ambiguous
6、统计消费的二八法则,消费的top20%用户,贡献了多少消费额
(1)统计每个用户的消费金额,并进行降序排序
select
userid,
price,
sum(price) as total_price
from orderinfo
where isPaid='已支付'
group by userid limit 20;
(2)统计用户总数以及总消费金额
框架:
select
count(1) as con,
sum(total_price) as all_price
from () a;
完整代码:
select
count(1) as con,
sum(total_price) as all_price
from (select
userid,
price,
sum(price) as total_price
from orderinfo
where isPaid='已支付'
group by userid) a;
(3)取出前20%的用户进行金额统计。
select
count(1) as con,
sum(total_price) as all_price
from(
select
userid,
price,
sum(price) as total_price
from orderinfo
where isPaid='已支付'
group by userid
order by total_price desc
limit 17000) b;
我们发现20%的用户贡献了80%的业务量。
网友评论