D7-利用Mysql进行数据分析实例

作者: 喝奶茶不加奶茶 | 来源:发表于2020-06-30 20:47 被阅读0次

D7-利用Mysql进行数据分析实例
2019-10-12
《利用Python进行数据分析》PDF高清完整版-免费下载
关联分析（一）
学习python数据分析必看，《利用Python进行数据分析》
GEO2R工具和使用方法介绍
Kaggle: video_games_sales数据集探索性分
利用Python进行数据分析 - 准备工作
5/05day47_Mysql性能优化
基于Linux的MySQL操作实例（软件安装，mysql基本操作

一、准备工作及背景
1、准备数据库：data
···
create database data charset=utf8;
···
2、准备数据表：orderinfo （539,414数据量）和userinfo（101,535数据量）

source+表1路径
source+表2路径

3、查看表结构：

表orderinfo相关信息

表userinfo相关信息

二、题目要求..
1、统计不同月份的下单人数
思考过程：

select month(paidTime) from orderinfo limit 10,10;

select count(1) ,month(paidTime)  as mon from orderinfo  group by month(paidTime) ;


select 
year(paidTime),
month(paidTime),
count(1)
from  orderinfo
group by  year(paidTime),month(paidTime);

注意：
不同月份，可能是同一年的不同月份，也有可能是不同年的不同月份,所以在分组的时候需要考虑年份。
结果：

可以利用where去除脏数据（0000-00-00 00:00:00）以及筛选出‘已支付’的用户

select 
year(paidTime),
month(paidTime),
count(1)
from  orderinfo
where isPaid='已支付' 
and
paidTime!='0000-00-00 00:00:00'
group by  year(paidTime),month(paidTime);

注意：group by 放在语句的最后

可能存在同一个用户支付多个订单的情况，所以需要用distinct做区分

select 
year(paidTime),
month(paidTime),
count(distinct userid)
from  orderinfo
where isPaid='已支付' 
and
paidTime!='0000-00-00 00:00:00'
group by  year(paidTime),month(paidTime);

补充：ctrl+c可以退出运行进程。
2、统计用户三月份的回购率和复购率
复购率：当月购买了多次的用户占当月用户的比例。
回购率：上月购买用户中有多少用户又再次购买。
思考过程：

首先计算复购率
（1）先筛选出3月份的消费情况
应该也要考虑年份的，但是我们发现除去脏数据后年份只剩2016年

在用where筛选之前查看一下paidTime的类型（varchar,'3'）

select 
*
from  orderinfo
where isPaid='已支付'
and month(paidTime)='3'
limit 10;

(2)统计每个用户在3月份消费了多少次
思路：group by用户

select 
userid,count(1)
from  orderinfo
where isPaid='已支付'
and month(paidTime)='3'
group by userid
limit 10;

(3)统计出消费次数大于1的用户数、总用户数以及复购率

select 
     sum(if(con>1,1,0)) as fugou_cons,
     count(1) as userid_cons,
     sum(if(con>1,1,0))/count(1) as fugou_rate
from  (select 
       userid,count(1) as con
       from  orderinfo
       where isPaid='已支付'
       and month(paidTime)='3'
       group by userid
     ) t;

注意：嵌套的子表需要取别名（比如，t）且不需要加as；否则会报错如下:

小tip：在notepad++中'alt'+鼠标下拉，可以统一选中多行，进而可以按空格产生缩进，便于代码美观易观看。
或者选择代码，再按‘tab’也可达到同样缩进效果。

接着计算回购率
回购率：上月购买用户中有多少用户又再次购买。
思考过程：

先查看下数据中的月份类别

select distinct(month(paidTime)) from orderinfo;

现在是六月，那么这里的上月就是五月，确认了存在五月的数据，可以继续计算。
五月份的回购率=五月用户中6月又再次购买的人数/5月的用户总数

我们关系的不是用户购买的次数，而是用户有没有购买过。
（1）筛选出五月购买的用户

select * from orderinfo 
where month(paidTime)='5';

（2）按照userid进行分组并显示购买次数

select userid,count(1) as con from orderinfo 
where month(paidTime)='5' group by userid;

（3）统计购买次数大于1 的用户数、总用户数以及回购率

select 
    count(1) as userid_con,
    sum(if(con >1,1,0)) as huigou_con,
    sum(if(con >1,1,0))/count(1) as huigou_rate
from 
    (select userid,count(1) as con from orderinfo 
where month(paidTime)='5' group by userid) t;

===
（1）、统计每年每月的用户消费情况

select 
    userid,
    date_format(paidTime,'%Y-%m') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m') 
limit 10;

注意：我们这里选用date_format()而没有用month(),目的是便于后续判断日期是否相邻。
（2）、相邻月份进行关联，能关联上的用户说明就是回购。
首先写个框架

select 
from () a
left join
() b;

其次，自己和自己进行关联，注意用left join，避免遗漏信息

select 
*
from (select 
    userid,
    date_format(paidTime,'%Y-%m-%d') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) a
left join
(select 
    userid,
    date_format(paidTime,'%Y-%m-%d') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) b on a.userid=b.userid 
and date_sub(b.month_dt  ,interval 1 month)=a.month_dt  limit 10;

？但为什么右列都是null值?

(3)统计每个月份的消费人数情况以及得到回购率
框架：

select
a.month_dt,
    count(a.userid),
    count(b.userid),
    count(b.userid)/count(a.userid)
from
()
group by a.month_dt;

完整代码：

select
    a.month_dt,
    count(a.userid),
    count(b.userid),
    count(b.userid)/count(a.userid)
from
(select 
    userid,
    date_format(paidTime,'%Y-%m-%d') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) a
left join
(select 
    userid,
    date_format(paidTime,'%Y-%m-%d') as month_dt,
    count(1) as con
from orderinfo
where isPaid='已支付'
group by userid,date_format(paidTime,'%Y-%m-%d')) b on a.userid=b.userid 
and date_sub(b.month_dt,interval 1 month)=a.month_dt
group by a.month_dt;

存在疑问：
1、date_sub()函数是对日期格式有特殊要求吗？
具体情形：
如果我将paidTime格式转换为'%Y-%m'再进行date_sub(b.month_dt,interval 1 month)与a表日期进行是否相等判断，则匹配的全为null值。
但是参考答案中将paidTime格式转换为'%Y-%m-01'再进行匹配则不全为null值，但是这个‘01’的设置在视频讲解中也提到说不是影响因素，只是强制设置所有的记录的日为01，所以就不清楚这个错误是为什么出现？
解决问题：
date_sub(b.month_dt,interval 1 month)中的日期最少要精确到日，到月的输出Null值

2、表Orderinfo中paidTime的约束是字符串类型（varchar），而date_format (日期，字符串格式)函数的作用是将日期转换为字符串格式，那用date_format(paidTime,'%Y-%m-01')，是把paidTime当成日期类型了吗？还是说我理解错了？
解决问题：
date_format (日期，字符串格式)函数并没有规定这个字段的类型，只要是合法的日期就行
总结：还是需要再练练

注意：count()函数在统计时并不计算null值

3、统计男女用户消费频次是否有差异
思考过程:
(1)统计不同性别下的用户的消费次数

思考过程：
（1）首先通过left join合并一下表userinfo和表orderinfo

select * from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;

查看一下sex的类别

select distinct sex from userinfo;

类别为：'男','女',字符串空''

（1）我们需要统计不同性别下的消费情况（次数）

select * from
(select *,o.userid as o_id from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' ;

出现ERROR 1060 (42S21): Duplicate column name 'userid',已经做好记录并解决了（见：https://www.jianshu.com/p/0c0df1939ee4）。

select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;

select * from
(select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' ;

select * from
(select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' limit 10 ;

#去除sex为空的字段以及根据sex和id进行分组
select o_id as all_id ,sex,count(1) from
(select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by sex,o_id limit 20;

#修改子查询的写法，并改变分组的顺序为id，sex
select orderid as all_id ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' and t.sex!='' group by orderid,sex limit 20;
#注意：group by 后面变量的前后顺序不同，得到的结果也会发生改变。

#去掉'已支付'的限制条件
select orderid as all_id ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.sex!='' group by orderid,sex limit 20;

#把left join 换成inner join
select orderid as all_id ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o inner join userinfo as u on u.userid=o.userid) t
where  t.sex!='' group by orderid,sex limit 20;

/*发现分组的依据id以及查询的id使用错误，应该使用userid 而不是orderid，所以还是用回之前的left join，看下结果：*/
select userid ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.sex!='' group by userid,sex limit 20;

#接着再次加上'已支付'的限制条件
select userid ,sex,count(1) from
(select o.*,u.sex,u.birth 
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.isPaid='已支付' and t.sex!='' group by userid,sex limit 20;

这样我们就统计了不同性别下的用户消费次数
接着，对性别做一个消费次数平均计算
同样构建框架：

select 

from ()
group by sex;

详细代码：

select
    sex,
    avg(count(1)) as avg_con
from (select userid ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.isPaid='已支付' and t.sex!='' group by userid,sex) a
group by sex;

但出现错误：

ERROR 1111 (HY000): Invalid use of group function
这个问题还有待进一步了解。
但是修改avg()后运行成功，
完整代码如下：

select 
sex,
avg(cons) as avg_con
from (select userid ,sex,count(1) as cons from
(select o.*,u.sex,u.birth 
 from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.isPaid='已支付' and t.sex!='' group by userid,sex) a
group by sex;

错误记录：
1、

select *,o.userid as o_id from
 (select * from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where t.isPaid='已支付' ;

ERROR 1060 (42S21): Duplicate column name 'userid'
解决办法：
其实这个问题的本质是子查询：

select * from orderinfo as o left join userinfo as u on u.userid=o.userid;

中出现重复字段名，如下图：

所以，我们在查询的时候需要对字段进行重命名，如下：

select * ,o.userid as o_id, u.userid as u_id from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;

但是这样是在查询原有全部信息的同时又增加了重新命名的两列，如何在查询全部信息的同时更改其中某个字段的名称呢？

目前的解决办法是：
自己手动写出需要查询的字段的名称，如下：

 select o.orderid,o.userid as o_id, o.isPaid ,o.price,o.paidTime,u.sex,u.birth,u.userid as u_id 
 from orderinfo as o left join userinfo as u on u.userid=o.userid limit 10;

这样似乎有些蠢笨，等发现有简单方法后再来更新。

方法更新：
如下：

select a.*,b.sex,b.birth 
from orderidfo as a left join userinfo as b on a.userid=b.userid limit 10;

这个的思路是：
这样写就不用修改名称了，a.*查询字段多的表，字段少的表直接手写

错误2：

select
    sex,
    avg(count(1)) as avg_con
from (select userid ,sex,count(1) from
(select o.*,u.sex,u.birth
from orderinfo as o left join userinfo as u on u.userid=o.userid) t
where  t.isPaid='已支付' and t.sex!='' group by userid,sex) a
group by sex;

ERROR 1111 (HY000): Invalid use of group function

4、统计多次消费的用户，第一次和最后一次消费间隔是多少天
即产品的生命周期。
（1）统计出多次消费的用户

select userid from orderinfo
where  isPaid='已支付'
group by userid
having count(1)>1;

(2)取出第一次和最后一次的时间

select 
    userid ,
    min(paidTime),
    max(paidTime),
    max(paidTime)-min(paidTime) 
from orderinfo
where  isPaid='已支付'
group by userid
having count(1)>1 limit 10;

注意：日期类型不能用简单的算术运算，应该用日期特有的运算(datediff(日期1，日期2))

select 
    userid ,
    min(paidTime),
    max(paidTime),
    datediff(max(paidTime),min(paidTime) )
from orderinfo
where  isPaid='已支付'
group by userid
having count(1)>1 limit 10;

5、统计不同年龄段，用户的消费金额是否有差异
首先计算年龄

select userid,2020-year(birth) as age from userinfo limit 10;

注意：计算年龄可以考虑timestampdiff(year,birth,now())函数，而且该函数未过生日会减一
其中，第一个参数year是返回的类型，也可以是month，hour

select userid,birth,now(),timestampdiff(year,birth,now()) as age
from userinfo limit 10;

注意：系统设置的默认时间为'1900-01-01'（因为发现年龄居然存在一百多岁的）,以及存在脏数据‘ 0000-00-00’

所以需要进行筛选：

select userid,birth,now(),timestampdiff(year,birth,now()) as age
from userinfo where year(birth)>1900 limit 30;

年龄如何分段？0-10:1,11-20:2,21-30:3 利用case when
补充：向上取整函数ceil()

select 
    userid,
    birth,
    now(),
    ceil(timestampdiff(year,birth,now())/10) as age
from userinfo 
where year(birth)>1900 
limit 10;

疑问：
year(birth)>1900 和birth>'1901-00-00'的区别？？

统计消费金额

select  price from orderinfo limit 10;

消费金额如何分段？
分段之后通过两个表的userid 连接起来，关联订单信息，获取不同年龄段的一个消费频次和消费金额
框架：

select
 
from orderinfo a 
inner join() b on a.userid=b.userid;

详细代码：

select
    userid,
    age,
    count(1) as con,
    sum(price) as prices
from orderinfo a 
inner join(select 
    userid,
    birth,
    now(),
    ceil(timestampdiff(year,birth,now())/10) as age
from userinfo 
where year(birth)>1900) b on a.userid=b.userid
group by userid,age;

修改如下：

select
    a.userid,
    age,
    count(1) as con,
    sum(price) as prices
from orderinfo a 
inner join(select 
    userid,
    birth,
    now(),
    ceil(timestampdiff(year,birth,now())/10) as age
from userinfo 
where year(birth)>1900) b on a.userid=b.userid
group by a.userid,age  limit 30;

(3)再对年龄分层进行聚合，得到不同年龄层的消费情况
框架：

select 
    age,
    avg(con),
    avg(prices)
from 
() a
group by age ;

完整代码：

select 
    age,
    avg(con),
    avg(prices)
from 
(select
    a.userid,
    age,
    count(1) as con,
    sum(price) as prices
from orderinfo a 
inner join(select 
    userid,
    birth,
    now(),
    ceil(timestampdiff(year,birth,now())/10) as age
from userinfo 
where year(birth)>1900) b on a.userid=b.userid
group by a.userid,age) a
group by age ;

错误记录：
错误3、ERROR 1052 (23000): Column 'userid' in field list is ambiguous

6、统计消费的二八法则，消费的top20%用户，贡献了多少消费额
(1)统计每个用户的消费金额，并进行降序排序

select 
    userid,
    price,
    sum(price)  as total_price
from orderinfo
where isPaid='已支付'
group by userid limit 20;

(2)统计用户总数以及总消费金额
框架：

select 
count(1) as  con,
sum(total_price)  as all_price
from () a;

完整代码：

select 
count(1) as  con,
sum(total_price)  as all_price
from (select 
    userid,
    price,
    sum(price)  as total_price
from orderinfo
where isPaid='已支付'
group by userid) a;

(3)取出前20%的用户进行金额统计。

select
    count(1) as  con,
    sum(total_price)  as all_price
from(
select 
    userid,
    price,
    sum(price)  as total_price
from orderinfo
where isPaid='已支付'
group by userid 
order by total_price desc 
limit 17000) b;

我们发现20%的用户贡献了80%的业务量。

D7-利用Mysql进行数据分析实例
一、准备工作及背景1、准备数据库：data···create database data charset=utf8...
2019-10-12
原文请看利用Python进行数据分析(10) pandas基础: 处理缺失数据利用Python进行数据分析(10...
《利用Python进行数据分析》PDF高清完整版-免费下载
《利用Python进行数据分析》PDF高清完整版-免费下载《利用Python进行数据分析》PDF高清完整版-免费...
关联分析（一）
参考：利用mlxtend进行数据关联分析
学习python数据分析必看，《利用Python进行数据分析》
利用Python进行数据分析pdf: 讲述了从pandas库的数据分析工具开始利用高性能工具、matpIotlib...
GEO2R工具和使用方法介绍
实例数据定义分组可通过下载Download full table 进行后续分析
Kaggle: video_games_sales数据集探索性分
注：该分析报告的数据源自Kaggle中video_games_sales数据集，该分析利用Python进行数据...
利用Python进行数据分析 - 准备工作
概述该系列是关于《利用Python进行数据分析》的学习笔记。涉及内容主要介绍如何利用Python来进行数据控...
5/05day47_Mysql性能优化
回顾 MySQL索引今日目标一 MySQL性能 1.1 分析-数据库查询效率低下我们进入公司进行项目...
基于Linux的MySQL操作实例（软件安装，mysql基本操作
基于Linux的MySQL操作实例（软件安装，mysql基本操作，mysql数据类型，建表、插入数据操作）前言 ...