美文网首页zhaoyqiu的数据分析进阶之路2.0
D7-利用Mysql进行数据分析实例

D7-利用Mysql进行数据分析实例

作者: 喝奶茶不加奶茶 | 来源:发表于2020-06-30 20:47 被阅读0次

    一、准备工作及背景
    1、准备数据库:data
    ···
    create database data charset=utf8;
    ···
    2、准备数据表:orderinfo (539,414数据量)和userinfo(101,535数据量)

    source+表1路径
    source+表2路径
    

    3、查看表结构:


    表orderinfo相关信息 表userinfo相关信息

    二、题目要求..
    1、统计不同月份的下单人数
    思考过程:

    select month(paidTime) from orderinfo limit 10,10;
    
    select count(1) ,month(paidTime)  as mon from orderinfo  group by month(paidTime) ;
    
    
    select 
    year(paidTime),
    month(paidTime),
    count(1)
    from  orderinfo
    group by  year(paidTime),month(paidTime);
    

    注意:
    不同月份,可能是同一年的不同月份,也有可能是不同年的不同月份,所以在分组的时候需要考虑年份。
    结果:


    可以利用where去除脏数据(0000-00-00 00:00:00)以及筛选出‘已支付’的用户

    select 
    year(paidTime),
    month(paidTime),
    count(1)
    from  orderinfo
    where isPaid='已支付' 
    and
    paidTime!='0000-00-00 00:00:00'
    group by  year(paidTime),month(paidTime);
    

    注意:group by 放在语句的最后


    可能存在同一个用户支付多个订单的情况,所以需要用distinct做区分

    select 
    year(paidTime),
    month(paidTime),
    count(distinct userid)
    from  orderinfo
    where isPaid='已支付' 
    and
    paidTime!='0000-00-00 00:00:00'
    group by  year(paidTime),month(paidTime);
    

    补充:ctrl+c可以退出运行进程。
    2、统计用户三月份的回购率和复购率
    复购率:当月购买了多次的用户占当月用户的比例。
    回购率:上月购买用户中有多少用户又再次购买。
    思考过程:

    • 首先计算复购率
      (1)先筛选出3月份的消费情况
      应该也要考虑年份的,但是我们发现除去脏数据后年份只剩2016年

      在用where筛选之前查看一下paidTime的类型(varchar,'3')
    select 
    *
    from  orderinfo
    where isPaid='已支付'
    and month(paidTime)='3'
    limit 10;
    

    (2)统计每个用户在3月份消费了多少次
    思路:group by用户

    select 
    userid,count(1)
    from  orderinfo
    where isPaid='已支付'
    and month(paidTime)='3'
    group by userid
    limit 10;
    

    (3)统计出消费次数大于1的用户数、总用户数以及复购率

    select 
         sum(if(con>1,1,0)) as fugou_cons,
         count(1) as userid_cons,
         sum(if(con>1,1,0))/count(1) as fugou_rate
    from  (select 
           userid,count(1) as con
           from  orderinfo
           where isPaid='已支付'
           and month(paidTime)='3'
           group by userid
         ) t;
    

    注意:嵌套的子表需要取别名(比如,t)且不需要加as;否则会报错如下:


    小tip:在notepad++中'alt'+鼠标下拉,可以统一选中多行,进而可以按空格产生缩进,便于代码美观易观看。
    或者 选择代码,再按‘tab’也可达到同样缩进效果。

    • 接着计算回购率
      回购率:上月购买用户中有多少用户又再次购买。
      思考过程:

    先查看下数据中的月份类别

    select distinct(month(paidTime)) from orderinfo;
    

    现在是六月,那么这里的上月就是五月,确认了存在五月的数据,可以继续计算。
    五月份的回购率=五月用户中6月又再次购买的人数/5月的用户总数

    我们关系的不是用户购买的次数,而是用户有没有购买过。
    (1)筛选出五月购买的用户

    select * from orderinfo 
    where month(paidTime)='5';
    

    (2)按照userid进行分组并显示购买次数

    select userid,count(1) as con from orderinfo 
    where month(paidTime)='5' group by userid;
    

    (3)统计购买次数大于1 的用户数、总用户数以及回购率

    select 
        count(1) as userid_con,
        sum(if(con >1,1,0)) as huigou_con,
        sum(if(con >1,1,0))/count(1) as huigou_rate
    from 
        (select userid,count(1) as con from orderinfo 
    where month(paidTime)='5' group by userid) t;
    

    ===
    (1)、统计每年每月的用户消费情况

    select 
        userid,
        date_format(paidTime,'%Y-%m') as month_dt,
        count(1) as con
    from orderinfo
    where isPaid='已支付'
    group by userid,date_format(paidTime,'%Y-%m') 
    limit 10;
    

    注意:我们这里选用date_format()而没有用month(),目的是便于后续判断日期是否相邻。
    (2)、相邻月份进行关联,能关联上的用户说明就是回购。
    首先写个框架

    select 
    from () a
    left join
    () b;
    

    其次,自己和自己进行关联,注意用left join,避免遗漏信息

    select 
    *
    from (select 
        userid,
        date_format(paidTime,'%Y-%m-%d') as month_dt,
        count(1) as con
    from orderinfo
    where isPaid='已支付'
    group by userid,date_format(paidTime,'%Y-%m-%d')) a
    left join
    (select 
        userid,
        date_format(paidTime,'%Y-%m-%d') as month_dt,
        count(1) as con
    from orderinfo
    where isPaid='已支付'
    group by userid,date_format(paidTime,'%Y-%m-%d')) b on a.userid=b.userid 
    and date_sub(b.month_dt  ,interval 1 month)=a.month_dt  limit 10; 
    

    ?但为什么右列都是null值?

    (3)统计每个月份的消费人数情况以及得到回购率
    框架:

    select
    a.month_dt,
        count(a.userid),
        count(b.userid),
        count(b.userid)/count(a.userid)
    from
    ()
    group by a.month_dt;
    

    完整代码:

    select
        a.month_dt,
        count(a.userid),
        count(b.userid),
        count(b.userid)/count(a.userid)
    from
    (select 
        userid,
        date_format(paidTime,'%Y-%m-%d') as month_dt,
        count(1) as con
    from orderinfo
    where isPaid='已支付'
    group by userid,date_format(paidTime,'%Y-%m-%d')) a
    left join
    (select 
        userid,
        date_format(paidTime,'%Y-%m-%d') as month_dt,
        count(1) as con
    from orderinfo
    where isPaid='已支付'
    group by userid,date_format(paidTime,'%Y-%m-%d')) b on a.userid=b.userid 
    and date_sub(b.month_dt,interval 1 month)=a.month_dt
    group by a.month_dt;
    

    存在疑问:
    1、date_sub()函数是对日期格式有特殊要求吗?
    具体情形:
    如果我将paidTime格式转换为'%Y-%m'再进行date_sub(b.month_dt,interval 1 month)与a表日期进行是否相等判断,则匹配的全为null值。
    但是参考答案中将paidTime格式转换为'%Y-%m-01'再进行匹配则不全为null值,但是这个‘01’的设置在视频讲解中也提到说不是影响因素,只是强制设置所有的记录的日为01,所以就不清楚这个错误是为什么出现?
    解决问题:
    date_sub(b.month_dt,interval 1 month)中的日期最少要精确到日,到月的输出Null值



    2、表Orderinfo中paidTime的约束是字符串类型(varchar),而date_format (日期,字符串格式)函数的作用是将日期转换为字符串格式,那用date_format(paidTime,'%Y-%m-01'),是把paidTime当成日期类型了吗?还是说我理解错了?
    解决问题:
    date_format (日期,字符串格式)函数并没有规定这个字段的类型,只要是合法的日期就行
    总结:还是需要再练练

    注意:count()函数在统计时并不计算null值

    3、统计男女用户消费频次是否有差异
    思考过程:
    (1)统计不同性别下的用户的消费次数

    思考过程:
    (1)首先通过left join合并一下表userinfo和表orderinfo

    select * from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;
    

    查看一下sex的类别

    select distinct sex from userinfo;
    

    类别为:'男','女',字符串空''

    (1)我们需要统计不同性别下的消费情况(次数)

    select * from
    (select *,o.userid as o_id from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where t.isPaid='已支付' ;
    

    出现ERROR 1060 (42S21): Duplicate column name 'userid',已经做好记录并解决了(见:https://www.jianshu.com/p/0c0df1939ee4)。

    select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
    from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;
    
    select * from
    (select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
    from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where t.isPaid='已支付' ;
    
    select * from
    (select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
    from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where t.isPaid='已支付' limit 10 ;
    
    #去除sex为空的字段以及根据sex和id进行分组
    select o_id as all_id ,sex,count(1) from
    (select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
    from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where t.isPaid='已支付' and t.sex!='' group by sex,o_id limit 20;
    
    #修改子查询的写法,并改变分组的顺序为id,sex
    select orderid as all_id ,sex,count(1) from
    (select o.*,u.sex,u.birth 
    from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where t.isPaid='已支付' and t.sex!='' group by orderid,sex limit 20;
    #注意:group by 后面变量的前后顺序不同,得到的结果也会发生改变。
    
    #去掉'已支付'的限制条件
    select orderid as all_id ,sex,count(1) from
    (select o.*,u.sex,u.birth 
    from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where  t.sex!='' group by orderid,sex limit 20;
    
    #把left join 换成inner join
    select orderid as all_id ,sex,count(1) from
    (select o.*,u.sex,u.birth 
    from orderinfo as o inner join userinfo as u on u.userid=o.userid) t
    where  t.sex!='' group by orderid,sex limit 20;
    
    /*发现分组的依据id以及查询的id使用错误,应该使用userid 而不是orderid,所以还是用回之前的left join,看下结果:*/
    select userid ,sex,count(1) from
    (select o.*,u.sex,u.birth 
    from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where  t.sex!='' group by userid,sex limit 20;
    
    #接着再次加上'已支付'的限制条件
    select userid ,sex,count(1) from
    (select o.*,u.sex,u.birth 
    from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where  t.isPaid='已支付' and t.sex!='' group by userid,sex limit 20;
    

    这样我们就统计了不同性别下的用户消费次数
    接着,对性别做一个消费次数平均计算
    同样构建框架:

    select 
    
    from ()
    group by sex;
    

    详细代码:

    select
        sex,
        avg(count(1)) as avg_con
    from (select userid ,sex,count(1) from
    (select o.*,u.sex,u.birth
    from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where  t.isPaid='已支付' and t.sex!='' group by userid,sex) a
    group by sex;
    

    但出现错误:



    ERROR 1111 (HY000): Invalid use of group function
    这个问题还有待进一步了解。
    但是修改avg()后运行成功,
    完整代码如下:

    select 
    sex,
    avg(cons) as avg_con
    from (select userid ,sex,count(1) as cons from
    (select o.*,u.sex,u.birth 
     from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where  t.isPaid='已支付' and t.sex!='' group by userid,sex) a
    group by sex;
    

    错误记录:
    1、

    select *,o.userid as o_id from
     (select * from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where t.isPaid='已支付' ;
    

    ERROR 1060 (42S21): Duplicate column name 'userid'
    解决办法:
    其实这个问题的本质是子查询:

    select * from orderinfo as o left join userinfo as u on u.userid=o.userid;
    

    中出现重复字段名,如下图:


    所以,我们在查询的时候需要对字段进行重命名,如下:

    select * ,o.userid as o_id, u.userid as u_id from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;
    

    但是这样是在查询原有全部信息的同时又增加了重新命名的两列,如何在查询全部信息的同时更改其中某个字段的名称呢?

    目前的解决办法是:
    自己手动写出需要查询的字段的名称,如下:

     select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
     from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;
    

    这样似乎有些蠢笨,等发现有简单方法后再来更新。

    方法更新:
    如下:

    select a.*,b.sex,b.birth 
    from orderidfo as a left join userinfo as b on a.userid=b.userid limit 10;
    

    这个的思路是:
    这样写就不用修改名称了,a.*查询字段多的表,字段少的表直接手写

    错误2:

    select
        sex,
        avg(count(1)) as avg_con
    from (select userid ,sex,count(1) from
    (select o.*,u.sex,u.birth
    from orderinfo as o left join userinfo as u on u.userid=o.userid) t
    where  t.isPaid='已支付' and t.sex!='' group by userid,sex) a
    group by sex;
    

    ERROR 1111 (HY000): Invalid use of group function


    4、统计多次消费的用户,第一次和最后一次消费间隔是多少天
    即产品的生命周期。
    (1)统计出多次消费的用户

    select userid from orderinfo
    where  isPaid='已支付'
    group by userid
    having count(1)>1;
    

    (2)取出第一次和最后一次的时间

    select 
        userid ,
        min(paidTime),
        max(paidTime),
        max(paidTime)-min(paidTime) 
    from orderinfo
    where  isPaid='已支付'
    group by userid
    having count(1)>1 limit 10;
    

    注意:日期类型不能用简单的算术运算,应该用日期特有的运算(datediff(日期1,日期2))

    select 
        userid ,
        min(paidTime),
        max(paidTime),
        datediff(max(paidTime),min(paidTime) )
    from orderinfo
    where  isPaid='已支付'
    group by userid
    having count(1)>1 limit 10;
    

    5、统计不同年龄段,用户的消费金额是否有差异
    首先计算年龄

    select userid,2020-year(birth) as age from userinfo limit 10;
    

    注意:计算年龄可以考虑timestampdiff(year,birth,now())函数,而且该函数未过生日会减一
    其中,第一个参数year是返回的类型,也可以是month,hour

    select userid,birth,now(),timestampdiff(year,birth,now()) as age
    from userinfo limit 10;
    

    注意:系统设置的默认时间为'1900-01-01'(因为发现年龄居然存在一百多岁的),以及存在脏数据‘ 0000-00-00’


    所以需要进行筛选:

    select userid,birth,now(),timestampdiff(year,birth,now()) as age
    from userinfo where year(birth)>1900 limit 30;
    

    年龄如何分段?0-10:1,11-20:2,21-30:3 利用case when
    补充:向上取整函数ceil()

    select 
        userid,
        birth,
        now(),
        ceil(timestampdiff(year,birth,now())/10) as age
    from userinfo 
    where year(birth)>1900 
    limit 10;
    
    

    疑问:
    year(birth)>1900 和birth>'1901-00-00'的区别??

    统计消费金额

    select  price from orderinfo limit 10;
    

    消费金额如何分段?
    分段之后通过两个表的userid 连接起来,关联订单信息,获取不同年龄段的一个消费频次和消费金额
    框架:

    select
     
    from orderinfo a 
    inner join() b on a.userid=b.userid;
    
    

    详细代码:

    select
        userid,
        age,
        count(1) as con,
        sum(price) as prices
    from orderinfo a 
    inner join(select 
        userid,
        birth,
        now(),
        ceil(timestampdiff(year,birth,now())/10) as age
    from userinfo 
    where year(birth)>1900) b on a.userid=b.userid
    group by userid,age;
    

    修改如下:

    select
        a.userid,
        age,
        count(1) as con,
        sum(price) as prices
    from orderinfo a 
    inner join(select 
        userid,
        birth,
        now(),
        ceil(timestampdiff(year,birth,now())/10) as age
    from userinfo 
    where year(birth)>1900) b on a.userid=b.userid
    group by a.userid,age  limit 30;
    

    (3)再对年龄分层进行聚合,得到不同年龄层的消费情况
    框架:

    select 
        age,
        avg(con),
        avg(prices)
    from 
    () a
    group by age ;
    

    完整代码:

    select 
        age,
        avg(con),
        avg(prices)
    from 
    (select
        a.userid,
        age,
        count(1) as con,
        sum(price) as prices
    from orderinfo a 
    inner join(select 
        userid,
        birth,
        now(),
        ceil(timestampdiff(year,birth,now())/10) as age
    from userinfo 
    where year(birth)>1900) b on a.userid=b.userid
    group by a.userid,age) a
    group by age ;
    

    错误记录:
    错误3、ERROR 1052 (23000): Column 'userid' in field list is ambiguous


    6、统计消费的二八法则,消费的top20%用户,贡献了多少消费额
    (1)统计每个用户的消费金额,并进行降序排序

    select 
        userid,
        price,
        sum(price)  as total_price
    from orderinfo
    where isPaid='已支付'
    group by userid limit 20;
    

    (2)统计用户总数以及总消费金额
    框架:

    select 
    count(1) as  con,
    sum(total_price)  as all_price
    from () a;
    

    完整代码:

    select 
    count(1) as  con,
    sum(total_price)  as all_price
    from (select 
        userid,
        price,
        sum(price)  as total_price
    from orderinfo
    where isPaid='已支付'
    group by userid) a;
    

    (3)取出前20%的用户进行金额统计。

    select
        count(1) as  con,
        sum(total_price)  as all_price
    from(
    select 
        userid,
        price,
        sum(price)  as total_price
    from orderinfo
    where isPaid='已支付'
    group by userid 
    order by total_price desc 
    limit 17000) b;
    

    我们发现20%的用户贡献了80%的业务量。


    相关文章

      网友评论

        本文标题:D7-利用Mysql进行数据分析实例

        本文链接:https://www.haomeiwen.com/subject/txcofktx.html