一、项目概述
1. 成果预览
分析概述:本文是对Adventure项目的总结,记录项目分析与实现过程,主要任务是对MySQL数据库中原始数据通过pandas和numpy聚合、分析,并使用聚合后数据进行dashboard的搭建,为业务、运营人员提供自主分析工具。在聚合后数据的基础上,汇报2019年11月自行车销售情况,为精细化运营提供数据支持,精准定位目标客户群体。
dashboard预览:dashboard看板连接
主页PPT预览:
2. 背景介绍
2.1 公司背景
Adventure Works Cycles是基于微软AdventureWorks示例数据库的虚拟公司,是一家大型跨国制造公司。该公司生产金属和复合材料自行车,并在全国进行销售。2019年12月业务部门需要向上级汇报11月份的自行车销售情况,为精细化运营提供数据支持,精准定位目标客户群体。
2.2 产品介绍
产品分为三个大类:自行车(公司主要产品)、服装、配件
- 自行车:分为三种类型,公路自行车、山地自行车、旅游自行车
- 服装:包括帽子、手套、短裤、背心、袜子、运动衫
- 配件:包括头盔、挡泥板、水袋、清洁工、瓶子和笼子、自行车架、自行车看台、车胎和内胎
二、数据搭建
1. 指标搭建
1.1 查看数据
adventure Works Cycles公司的mysql数据库中有三张表:
-
ods_sales_orders:订单明细表(一个订单表示销售一个产品),主要包括每个订单的详细信息:订单主键、订单日期、客户编号、产品编号、产品名、产品类别、产品子类、产品单价这些字段
订单明细表 -
ods_customer:每日新增用户表,本表与“ods_sales_orders 订单明细表”可通过客户编号字段关联,本表记录每日新增客户系信,包括日期、出生日期、性别、婚否、年收入以及详细地理位置:省份、城市、地址、区域和经纬度等字段
-
dim_date_df:日期维度表,本表可通过日期字段与上述两张表关联,本表通过1和0对是否是当日、当月、当年、等字段的记录,可在后续对表的聚合加工中,对订单表与用户表进行不同时间维度查询
日期维度表
1.2 确定需求
结合业务要求和现有数据,确定数据指标,为后续dashboard搭建和相关报告输出,提供完善数据。对MySQL数据库中三张表数据进行聚合和各维度关联等操作,完成下面三张表格:
- dw_order_by_day:每日环比表
- 本表是对每日数据的一个汇总聚合,通过对ods_sales_orders(订单明细表)中的“create_date”字段进行聚合,获得每日总销售金额、总销量、并计算出平均订单价
- 设定每日销售量与销售金额目标值,便于后续dashboard展示目标完成情况
- 通过“create_date”字段关联dim_date_df(日期维度表),方便后续对日、月、季度、年等时间维度的销售情况进行聚合展示
- 计算每日销售金额与昨日销售金额环比
- dw_customer_order:时间-地区-产品聚合表
- 提取当日维度下ods_sales_orders(订单明细表)和ods_customer(每日新增用户表),并通过“customer_key”字段关联两张表
- 提取订单主键、订单日期、客户编号、产品名、产品子类、产品类别、产品单价、所在区域、所在省份、所在城市这些字段,使用groupby方法对订单日期、产品名、产品子类、产品类别、所在区域、所在省份、所在城市的逐级聚合
- 通过“create_date”字段关联dim_date_df(日期维度表),方便后续对日、月、季度、年等时间维度产品信息和地区信息的聚合展示
- 获得相应的订单总量、客户总量和销售总金额,完成对当日时间维度下产品和地区的聚合,可用于dashboard地区维度的切片和地图可视化
- dw_amount_diff:当日维度表
- 提取生成的的dw_order_by_day(每日环比表)的数据
- 通过“is_today”、“is_yesterday”、“is_current_year”、“is_last_year”等这些字段中的值:1和0来判断时间
- 对当日、昨日、今年、去年等的销售金额,销售量和客单价聚合起来,完成当日同比、昨日同比 、本月同比、本季度同比以及本年同比
- 使用“amount”、“order“、”avg“标记销售金额,销售量和客单价的同比结果,方便横向提取数据
2. python处理数据
使用python,连接服务器中MySQL数据库,通过关联、聚合、计算以及一些格式的转换等步骤,完成上述指标的搭建,并将生成的表格上传至MySQL数据库。详细代码如下:
2.1 生成dw_order_by_day表
导入相关模块,连接数据库
import pandas as pd
import pymysql
import random
from sqlalchemy import create_engine
pymysql.install_as_MySQLdb()
import datetime
import warnings
warnings.filterwarnings("ignore")
# 连接adventure_ods库
adventure_ods = create_engine('mysql://root:password@localhost:3306/adventure_ods?charset=gbk')
# 连接adventure_dw库
adventure_dw = create_engine('mysql://root:password@localhost:3306/adventure_dw?charset=gbk')
step1:读取ods_sales_orders(订单明细表),生成sum_amount_order(销量订单聚合表),求总销售金额与客单价
# 1、读取ods_sales_orders(订单明细表)
ods_sales_orders = pd.read_sql_query("select * from ods_sales_orders ", con=adventure_ods)
# 2、根据create_date分组求总销量及客户数量
sum_amount_order = ods_sales_orders.groupby(by='create_date').agg({'unit_price': sum, 'customer_key': pd.Series.nunique}).reset_index()
# 3、修改对应列名(unit_price→sum_amount', 'customer_key'→ 'sum_order)
sum_amount_order.rename(columns = {'unit_price': 'sum_amount', 'customer_key': 'sum_order'},inplace = True)
# 4、新增amount_div_order客单价列
sum_amount_order['amount_div_order'] = sum_amount_order['sum_amount'] / sum_amount_order['sum_order']
step2:利用空列表及循环生成对应随机值,与销量订单聚合表合并形成sum_amount_order_goal(销量订单聚合目标表)
sum_amount_goal_list = []
sum_order_goal_list = []
# 转为list类型,遍历每个日期
create_date_list = list(sum_amount_order['create_date'])
for i in create_date_list:
# 1.生成一个在[0.85,1.1]随机数
a = random.uniform(0.85,1.1)
b = random.uniform(0.85,1.1)
# 2.对应日期下生成总金额(sum_amount)*a的列
amount_goal = list(sum_amount_order[sum_amount_order['create_date'] == i]['sum_amount'])[0] * a
# 3.对应日期下生成总订单数(sum_order)*a的列
order_goal = list(sum_amount_order[sum_amount_order['create_date'] == i]['sum_order'])[0] * b
# 4.将生成的目标值加入空列表
sum_amount_goal_list.append(amount_goal)
sum_order_goal_list.append(order_goal)
# 5.合并sum_amount_order表与刚生成的目标值列,形成sum_amount_order_goal
sum_amount_order_goal = pd.concat([sum_amount_order, pd.DataFrame({'sum_amount_goal': sum_amount_goal_list, 'sum_order_goal': sum_order_goal_list})], axis=1)
step3:读取dim_date_df日期维度表
dim_date_df = """
select create_date,
is_current_year,
is_last_year,
is_yesterday,
is_today,
is_current_month,
is_current_quarter
from dim_date_df"""
dim_date_df = pd.read_sql_query(dim_date_df, con=adventure_ods)
step4:进行数据的融合,生成dw_order_by_day表
# 1.转化create_date格式为标准日期格式
sum_amount_order_goal['create_date'] = sum_amount_order_goal['create_date'].apply(lambda x: x.strftime('%Y-%m-%d'))
# 2.通过主键create_date连接日期维度,输出:dw_order_by_day每日环比表
dw_order_by_day = pd.merge(sum_amount_order_goal, dim_date_df, on='create_date', how='inner')
# 3.在sum_amount列基础上,增添环比数据列amount_diff
dw_order_by_day['amount_diff'] = dw_order_by_day['sum_amount'].pct_change()
dw_order_by_day['amount_diff'].fillna(0,inplace=True)
step5:删除旧的dw_order_by_day(每日环比表),存储新的dw_order_by_day
# 因为删除插入更新操作没有返回值,程序会抛出ResourceClosedError,并终止程序。使用try捕捉此异常。
try:
# 作业代码:删除旧表数据
pd.read_sql_query("Truncate table dw_order_by_day", con=adventure_dw)
except Exception as e:
print("删除旧的dw_order_by_day表,error:{}".format(e))
# 替换原先的dw_order_by_day
dw_order_by_day.to_sql('dw_order_by_day', con=adventure_dw, if_exists='replace', index=False)
2.2 生成dw_customer_order表
导入相关模块,连接数据库
import pandas as pd
import pymysql
from sqlalchemy import create_engine
pymysql.install_as_MySQLdb()
import datetime
import warnings
warnings.filterwarnings("ignore")
# 连接adventure_ods库
adventure_ods = create_engine('mysql://root:password@localhost:3306/adventure_ods?charset=gbk')
# 连接adventure_dw库
adventure_dw = create_engine('mysql://root:password@localhost:3306/adventure_dw?charset=gbk')
step1:读取最新日期的ods_sales_orders(订单明细表)
ods_sales_orders = \
'''select sales_order_key,
create_date,
customer_key,
english_product_name,
cpzl_zw,
cplb_zw,
unit_price
from ods_sales_orders where create_date =(
select create_date
from dim_date_df
order by create_date desc
limit 1) '''
ods_sales_orders = pd.read_sql_query(ods_sales_orders, con=adventure_ods)
ods_sales_orders.info()
step2:读取每日新增用户表ods_customer
ods_customer = '''
select customer_key,
chinese_territory,
chinese_province,
chinese_city
from ods_customer'''
ods_customer = pd.read_sql_query(ods_customer, con=adventure_ods)
ods_customer.info()
step3:读取日期维度表dim_date_df
dim_date_df = """
select create_date,
is_current_year,
is_last_year,
is_yesterday,
is_today,
is_current_month,
is_current_quarter
from dim_date_df"""
dim_date_df = pd.read_sql_query(dim_date_df, con=adventure_ods)
dim_date_df.info()
step4:进行数据的聚合
# 将customer_key由int类型转为str类型
ods_customer['customer_key'] = ods_customer['customer_key'].map(lambda x:str(x))
# 通过客户id连接表
sales_customer_order = pd.merge(ods_sales_orders, ods_customer, left_on='customer_key', right_on='customer_key', how='left')
# 1.提取订单主键/订单日期/客户编号/产品名/产品子类/产品类别/产品单价/所在区域/所在省份/所在城市
sales_customer_order = sales_customer_order[['sales_order_key', 'create_date', 'customer_key', 'english_product_name',\
'cpzl_zw','cplb_zw','unit_price', 'chinese_territory', 'chinese_province', 'chinese_city']]
# 2.形成按照 订单日期/产品名/产品子类/产品类别/所在区域/所在省份/所在城市的逐级聚合表,获得订单总量/客户总量/销售总金额
sum_customer_order = sales_customer_order.groupby(['create_date','english_product_name','cpzl_zw','cplb_zw','chinese_territory',\
'chinese_province', 'chinese_city'],as_index=False).\
agg({'sales_order_key':pd.Series.nunique,'customer_key':pd.Series.nunique,'unit_price':sum})
sum_customer_order.rename(columns = {'sales_order_key':'order_num','customer_key':'customer_num',\
'unit_price':'sum_amount','english_product_name':'product_name'},inplace=True)
# 3.转化订单日期为字符型格式
sum_customer_order['create_date'] = sum_customer_order['create_date'].apply(lambda x: x.strftime('%Y-%m-%d'))
# 4.获取当日日期维度
dw_customer_order = pd.merge(sum_customer_order, dim_date_df, on='create_date', how='inner')
step5:进行数据的存储
dw_customer_order.to_sql('dw_customer_order', con=adventure_dw,if_exists='replace', index=False)
2.3 生成dw_amount_diff表
导入相关模块,连接数据库
import pandas as pd
import pymysql
from sqlalchemy import create_engine
pymysql.install_as_MySQLdb()
import datetime
import warnings
warnings.filterwarnings("ignore")
# 连接adventure_dw库
adventure_dw = create_engine('mysql://root:password@localhost:3306/adventure_dw?charset=gbk')
step1:读取dw_order_by_day表
dw_order_by_day = pd.read_sql_query("select * from dw_order_by_day", con=adventure_dw)
dw_order_by_day['create_date'] = pd.to_datetime(dw_order_by_day['create_date'])
step2:求取各阶段的总金额
"""当天"""
# 当天的总金额
today_amount = dw_order_by_day[dw_order_by_day['is_today'] == 1]['sum_amount'].sum()
# 去年同期的日期维度
before_year_today = list(dw_order_by_day[dw_order_by_day['is_today'] == 1]['create_date'] + datetime.timedelta(days=-365))
# 去年同期总金额
before_year_today_amount = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_today)]['sum_amount'].sum()
"""昨天"""
# 昨天的总金额
yesterday_amount = dw_order_by_day[dw_order_by_day['is_yesterday'] == 1]['sum_amount'].sum()
# 去年同期的日期维度
before_year_yesterday = list(dw_order_by_day[dw_order_by_day['is_yesterday'] == 1]['create_date'] + datetime.timedelta(days=-365))
# 去年同期总金额
before_year_yesterday_amount = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_yesterday)]['sum_amount'].sum()
"""当前月份"""
# 当月的总金额
month_amount = dw_order_by_day[dw_order_by_day['is_current_month'] == 1]['sum_amount'].sum()
# 去年同期的日期维度
before_year_month = list(dw_order_by_day[dw_order_by_day['is_current_month'] == 1]['create_date'] + datetime.timedelta(days=-365))
# 去年同期总金额
before_year_month_amount = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_month)]['sum_amount'].sum()
"""当前季度"""
# 当季的总金额
quarter_amount = dw_order_by_day[dw_order_by_day['is_current_quarter'] == 1]['sum_amount'].sum()
# 去年同期的日期维度
before_year_quarter = list(dw_order_by_day[dw_order_by_day['is_current_quarter'] == 1]['create_date'] + datetime.timedelta(days=-365))
# 去年同期总金额
before_year_quarter_amount = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_quarter)]['sum_amount'].sum()
"""当前年份"""
# 当年的总金额
year_amount = dw_order_by_day[dw_order_by_day['is_current_year'] == 1]['sum_amount'].sum()
# 去年同期的日期维度
before_year_year = list(dw_order_by_day[dw_order_by_day['is_current_year'] == 1]['create_date'] + datetime.timedelta(days=-365))
# 去年同期总金额
before_year_year_amount = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_year)]['sum_amount'].sum()
step3:求取各阶段的总客户数
"""当天"""
# 当天的总客户数
today_order = dw_order_by_day[dw_order_by_day['is_today'] == 1]['sum_order'].sum()
# 去年同期总客户数
before_year_today_order = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_today)]['sum_order'].sum()
"""昨天"""
# 昨天的总客户数
yesterday_order = dw_order_by_day[dw_order_by_day['is_yesterday'] == 1]['sum_order'].sum()
# 去年同期总客户数
before_year_yesterday_order = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_yesterday)]['sum_order'].sum()
"""当前月份"""
# 当月的总客户数
month_order = dw_order_by_day[dw_order_by_day['is_current_month'] == 1]['sum_order'].sum()
# 去年同期总客户数
before_year_month_order = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_month)]['sum_order'].sum()
"""当前季度"""
# 当季的总客户数
quarter_order = dw_order_by_day[dw_order_by_day['is_current_quarter'] == 1]['sum_order'].sum()
# 去年同期总客户数
before_year_quarter_order = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_quarter)]['sum_order'].sum()
"""当前年份"""
# 当年的总客户数
year_order = dw_order_by_day[dw_order_by_day['is_current_year'] == 1]['sum_order'].sum()
# 去年同期总客户数
before_year_year_order = dw_order_by_day[dw_order_by_day['create_date'].isin(before_year_year)]['sum_order'].sum()
step4:求取各阶段的总金额、订单数的同期对比数据
amount_dict = {'today_diff': [today_amount / before_year_today_amount - 1,
today_order / before_year_today_order - 1,
(today_amount / today_order) / (before_year_today_amount /before_year_today_order) - 1],
'yesterday_diff': [yesterday_amount / before_year_yesterday_amount - 1,
yesterday_order / before_year_yesterday_order - 1,
(yesterday_amount / yesterday_order) / (before_year_yesterday_amount / before_year_yesterday_order) - 1],
'month_diff': [month_amount / before_year_month_amount - 1,
month_order / before_year_month_order - 1,
(month_amount / month_order) / (before_year_month_amount / before_year_month_order) - 1],
'quarter_diff': [quarter_amount / before_year_quarter_amount - 1,
quarter_order / before_year_quarter_order - 1,
(quarter_amount / quarter_order) / (before_year_quarter_amount / before_year_quarter_order) - 1],
'year_diff': [year_amount / before_year_year_amount - 1,
year_order / before_year_year_order - 1,
(year_amount / year_order) / (before_year_year_amount / before_year_year_order) - 1],
'flag': ['amount', 'order', 'avg']}
dw_amount_diff = pd.DataFrame(amount_dict)
step5:删除旧的dw_amount_diff表,存储新的dw_amount_diff表
# 因为删除插入更新操作没有返回值,程序会抛出ResourceClosedError,并终止程序。使用try捕捉此异常。
try:
pd.read_sql_query("Truncate table dw_amount_diff", con=adventure_dw)
except Exception as e:
print("删除旧的dw_amount_diff表,error:{}".format(e))
# 存储新的dw_amount_diff表
dw_amount_diff.to_sql('dw_amount_diff', con=adventure_dw, if_exists='replace', index=False)
3. 搭建Dashboard
3.1 可视化看板布置
- 可视化工具:这里用到的可视化工具有切片器、KPI、卡片图、表、环形图、地图等。
- 筛选器:这里用于日期、区域等字段的筛选。
- 书签窗格:这里将按钮和书签结合使用,用于制作导航栏和动态图表。
3.2 报表展示
1、主页
- 展示今日销售金额和目标完成情况、客户量和客单价,以及同比情况
- 产品今日详细销售情况
- 环形图直观表示各个品类订单销量的百分比
- 通过地区和城市切片,可以动态展示不同地区和城市的上述内容
- 通过地图动态展示各个城市销售订单量
2、趋势详情页
- 通过日期筛选器筛选出最近一段时间的销售趋势
- 动态展示最近14天销售金额和目标销售金额的趋势
- 动态展示最近14天顾客量和目标顾客量的趋势
- 动态展示最近14天客单价的趋势
3、地区详情页
- 对区域和省份进行切片
- 通过切片展示不同地区订单详细情况
三、报告输出
1. 指标搭建
1.1 目的
- 根据需求,需要输出2019.11月自行车销售情况,为精细化运营提供数据支持,能精准的定位目标客户群体;
- 制定销售策略,调整产品结构,才能保持高速增长,获取更多的收益,占领更多市场份额;
- 报告通过对整个公司的自行车销量持续监测和分析,掌握公司自行车销售状况、走势的变化,为客户制订、调整和检查销售策略,完善产品结构提供依据。
1.2 数据来源
- dw_customer_order:产品销售信息事实表
- ods_customer:每天新增客户信息表
- dim_date_df:日期表
- ods_sales_orders:订单明细表
1.3 指标制定
- 从整体的角度:分析2019.1—2019.11自行车整体销售表现
- 从地域的角度:分析11月每个区域销售量表现、11月TOP10城市销售量表现
- 从产品的角度:分析11月类别产品销售量表现、11月细分产品销售量表现
- 热销产品:分析11月TOP10产品销量榜、11月TOP10销量增速榜
- 从用户的角度:分析11月用户年龄分布及每个年龄段产品购买喜好、11月男女用户比例及产品购买喜好
2. python加工数据
代码如下:
#导入模块
import pandas as pd
import numpy as np
import pymysql
pymysql.install_as_MySQLdb()
from sqlalchemy import create_engine
part1:自行车整体销售表现
'''1. 从数据库读取源数据:dw_customer_order'''
# 读取dw_customer_order
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_ods')
gather_customer_order=pd.read_sql_query("select * from dw_customer_order",con = engine)
# 查看源数据前5行,观察数据,判断数据是否正常识别
gather_customer_order.head()
# 查看源数据类型
gather_customer_order.info()
# 利用create_date字段增加create_year_month月份字段
gather_customer_order['create_year_month'] = gather_customer_order["create_date"].apply(lambda x: x.strftime('%Y-%m'))
# 筛选产品类型cplb_zw中的自行车作为新的gather_customer_order
gather_customer_order = gather_customer_order[gather_customer_order['cplb_zw']=="自行车"]
'''2. 自行车整体销售量表现'''
# 取消科学计数法
pd.set_option('float_format', lambda x: '%.6f' % x)
#每月订单数量和销售金额,用groupby创建一个新的对象,需要将order_num、sum_amount求和,按日期降序排序,方便计算环比
overall_sales_performance = gather_customer_order.groupby("create_year_month").agg({"order_num":"sum","sum_amount":"sum"}).\
sort_values(by = "create_year_month",ascending = False).reset_index()
# 新增一列order_num_diff,此为每月自行车销售订单量环比,本月与上月对比,例如本期2019-02月销售额与上一期2019-01月销售额做对比
order_num_diff = list(-(overall_sales_performance.order_num.diff()/overall_sales_performance.order_num))
# 删除列表中第一个元素
order_num_diff.pop(0)
# 将0新增到列表末尾
order_num_diff.append(0)
# 将环比转化为DataFrame
overall_sales_performance = pd.concat([overall_sales_performance,pd.DataFrame(order_num_diff)],axis=1)
# 新增一列sum_amount_diff,此为每月自行车销售金额环比,最后形成按照日期升序排列
sum_amount_diff = list(-(overall_sales_performance.sum_amount.diff()/overall_sales_performance.sum_amount))
sum_amount_diff.pop(0)
sum_amount_diff.append(0)
sum_amount_diff
# 将环比转化为DataFrame
overall_sales_performance = pd.concat([overall_sales_performance,pd.DataFrame(sum_amount_diff)],axis = 1)
# 销量环比字段名order_diff,销售金额环比字段名sum_amount_diff
overall_sales_performance.columns = ['create_year_month', 'order_num', 'sum_amount', 'order_diff','sum_amount_diff']
# 按照日期排序,升序
overall_sales_performance = overall_sales_performance.sort_values(by = "create_year_month",ascending = True)
'''3. 将最终的overall_sales_performance的DataFrame存入数据库中'''
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_dw')
overall_sales_performance.to_sql('pt_overall_sale_performance_1_cc',con = engine,if_exists='replace')
overall_sales_performance
part2:2019年11月自行车地域销售表现
'''1. 源数据dw_customer_order,数据清洗筛选10月11月数据'''
# 筛选10、11月的自行车数据,赋值变量为gather_customer_order_10_11
gather_customer_order_10_11 = gather_customer_order[gather_customer_order["create_year_month"].isin(['2019-10','2019-11'])]
# 查看10月、11月的自行车订单数量
len(gather_customer_order_10_11)
'''2. 2019年11月自行车区域销售量表现'''
#按照区域、月分组,订单量求和,销售金额求和
gather_customer_order_10_11_group = gather_customer_order_10_11.groupby(["chinese_territory","create_year_month"]).\
agg({"order_num":"sum","sum_amount":"sum"}).reset_index()
#将区域存为列表
region_list = list(gather_customer_order["chinese_territory"].unique())
# pct_change()当前元素与先前元素的相差百分比,求不同区域10月11月环比
order_x = pd.Series([])
amount_x = pd.Series([])
for i in region_list:
a = gather_customer_order_10_11_group.loc[gather_customer_order_10_11_group['chinese_territory']==i]['order_num'].pct_change()
b = gather_customer_order_10_11_group.loc[gather_customer_order_10_11_group['chinese_territory']==i]['sum_amount'].pct_change()
order_x = order_x.append(a)
amount_x = amount_x.append(b)
gather_customer_order_10_11_group['order_diff'] = order_x
gather_customer_order_10_11_group['amount_diff'] = amount_x
# 填充NaN值
gather_customer_order_10_11_group.fillna(value = 0,inplace = True)
# 将数据存入数据库
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_dw')
gather_customer_order_10_11_group.to_sql('pt_bicy_november_territory_2_cc',con = engine,if_exists='replace')
gather_customer_order_10_11_group
'''3. 2019年11月自行车销售量TOP10城市环比'''
#筛选11月自行车交易数据
gather_customer_order_11 = gather_customer_order[(gather_customer_order["create_year_month"]=='2019-11')]
# 将gather_customer_order_11按照chinese_city城市分组,求和销售数量order_num,最终查看11月自行车销售数量前十城市,赋予变量gather_customer_order_city_head
gather_customer_order_city_11 = gather_customer_order_11.groupby("chinese_city").agg({"order_num":"sum"}).reset_index().\
sort_values(by = "order_num",ascending = False)
#11月自行车销售数量前十城市
gather_customer_order_city_head = gather_customer_order_city_11.head(10)
#筛选销售前十城市的10月11月自行车销售数据
gather_customer_order_10_11_head = gather_customer_order_10_11[gather_customer_order_10_11["chinese_city"].isin(list(gather_customer_order_city_head.chinese_city))]
#分组计算前十城市,自行车销售数量销售金额
gather_customer_order_city_10_11 = gather_customer_order_10_11_head.groupby(["chinese_city","create_year_month"]).\
agg({"order_num":"sum","sum_amount":"sum"}).reset_index()
#计算前十城市环比
city_top_list = gather_customer_order_city_head["chinese_city"]
order_top_x = pd.Series([])
amount_top_x = pd.Series([])
for i in city_top_list:
a=gather_customer_order_city_10_11.loc[gather_customer_order_city_10_11['chinese_city']==i]['order_num'].pct_change()
b=gather_customer_order_city_10_11.loc[gather_customer_order_city_10_11['chinese_city']==i]['sum_amount'].pct_change()
order_top_x = order_top_x.append(a)
amount_top_x = amount_top_x.append(b)
#order_diff销售数量环比,amount_diff销售金额环比
gather_customer_order_city_10_11['order_diff']=order_top_x
gather_customer_order_city_10_11['amount_diff']=amount_top_x
# 填充NaN值
gather_customer_order_city_10_11.fillna(value = 0,inplace = True)
# 11月销售数量前十城市的销售数量、销售金额环比。存入数据库
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_dw')
gather_customer_order_city_10_11.to_sql('pt_bicy_november_october_city_3_cc',con = engine,if_exists='replace')
gather_customer_order_city_10_11
part3:2019年11月自行车产品销售表现
# 细分市场销量表现
'''1. 数据源 dw_customer_order'''
#查看数据源
gather_customer_order.head(5)
'''2.细分市场销量表现'''
#求每个月自行车累计销售数量
gather_customer_order_group_month = gather_customer_order.groupby("create_year_month").agg({"order_num":"sum"}).reset_index()
#合并自行车销售信息表+自行车每月累计销售数量表
order_num_proportion = pd.merge(gather_customer_order,gather_customer_order_group_month,how = "left",on = "create_year_month")
#计算自行车销量/自行车每月销量占比
order_num_proportion['order_proportion'] = order_num_proportion['order_num_x']/order_num_proportion['order_num_y']
#重命名sum_month_order:自行车每月销售量
order_num_proportion = order_num_proportion.rename(columns = {"order_num_x":"order_num","order_num_y":"sum_month_order"})
#将每月自行车销售信息存入数据库
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_dw')
order_num_proportion.to_sql('pt_bicycle_product_sales_month_4_cc',con = engine,if_exists = 'replace')
order_num_proportion
'''3. 公路/山地/旅游自行车细分市场表现'''
#查看自行车有那些产品子类
gather_customer_order['cpzl_zw'].unique()
'''3.1 公路自行车细分市场销量表现'''
gather_customer_order_road = gather_customer_order[gather_customer_order['cpzl_zw'] == '公路自行车']
#求公路自行车不同型号产品销售数量
gather_customer_order_road_month = gather_customer_order_road.groupby(by = ['create_year_month','product_name']).agg({"order_num":"sum"}).reset_index()
gather_customer_order_road_month['cpzl_zw'] = '公路自行车'
#每个月公路自行车累计销售数量
gather_customer_order_road_month_sum = gather_customer_order_road_month.groupby("create_year_month").agg({"order_num":"sum"}).reset_index()
#合并公路自行车gather_customer_order_road_month与每月累计销售数量
#用于计算不同型号产品的占比
gather_customer_order_road_month = pd.merge(gather_customer_order_road_month,gather_customer_order_road_month_sum,how='left',on='create_year_month')
'''3.2 山地自行车细分市场销量表现'''
# 筛选
gather_customer_order_Mountain = gather_customer_order[gather_customer_order['cpzl_zw'] == '山地自行车']
#求山地自行车不同型号产品销售数量
gather_customer_order_Mountain_month = gather_customer_order_Mountain.groupby(by = ['create_year_month','product_name']).agg({"order_num":"sum"}).reset_index()
gather_customer_order_Mountain_month['cpzl_zw'] = '山地自行车'
#每个月山地自行车累计销售数量
gather_customer_order_Mountain_month_sum = gather_customer_order_Mountain_month.groupby("create_year_month").agg({"order_num":"sum"}).reset_index()
#合并山地自行车hz_customer_order_Mountain_month与每月累计销售数量
#用于计算不同型号产品的占比
gather_customer_order_Mountain_month = pd.merge(gather_customer_order_Mountain_month,gather_customer_order_Mountain_month_sum,how='left',on='create_year_month')
'''3.3 旅游自行车细分市场销量表现'''
gather_customer_order_tour = gather_customer_order[gather_customer_order['cpzl_zw'] == '旅游自行车']
#求旅游自行车不同型号产品销售数量
gather_customer_order_tour_month = gather_customer_order_tour.groupby(by = ['create_year_month','product_name']).agg({"order_num":"sum"}).reset_index()
gather_customer_order_tour_month['cpzl_zw'] = '旅游自行车'
gather_customer_order_tour_month_sum = gather_customer_order_tour_month.groupby("create_year_month").agg({"order_num":"sum"}).reset_index()
gather_customer_order_tour_month = pd.merge(gather_customer_order_tour_month,gather_customer_order_tour_month_sum,how='left',on='create_year_month')
'''3.4 将山地自行车、旅游自行车、公路自行车每月销量信息合并'''
gather_customer_order_month = pd.concat([gather_customer_order_road_month,gather_customer_order_Mountain_month,gather_customer_order_tour_month])
gather_customer_order_month
#各类自行车,销售量占每月自行车总销售量比率
gather_customer_order_month['order_num_proportio'] = gather_customer_order_month['order_num_x']/gather_customer_order_month['order_num_y']
# 重命名:order_month_product当月产品累计销量,sum_order_month当月自行车总销量
gather_customer_order_month.rename(columns = {'order_num_x':'order_month_product','order_num_y':'sum_order_month'})
#将数据存入数据库
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_dw')
gather_customer_order_month.to_sql('pt_bicycle_product_sales_order_month_4_cc',con = engine,if_exists = 'replace')
gather_customer_order_month
'''4. 计算2019年11月自行车环比'''
#计算11月环比,先筛选10月11月数据
gather_customer_order_month_10_11 = gather_customer_order_month[gather_customer_order_month.create_year_month.isin(['2019-10','2019-11'])]
#排序。将10月11月自行车销售信息排序
gather_customer_order_month_10_11 = gather_customer_order_month_10_11.sort_values(by = ['product_name','create_year_month'])
product_name = list(gather_customer_order_month_10_11.product_name.drop_duplicates())
'''5. 计算自行车销售数量环比'''
order_top_x = pd.Series([])
for i in product_name:
a = gather_customer_order_month_10_11.loc[gather_customer_order_month_10_11['product_name']==i]["order_num_x"].pct_change()
order_top_x = order_top_x.append(a)
gather_customer_order_month_10_11['order_num_diff'] = order_top_x
gather_customer_order_month_10_11.fillna(value = 0,inplace = True)
gather_customer_order_month_10_11.rename(columns = {"order_num_x":"order_month_product","order_num_y":"sum_order_month"},inplace = True)
#筛选出11月自行车数据
gather_customer_order_month_11 = gather_customer_order_month_10_11[gather_customer_order_month_10_11['create_year_month'] == '2019-11']
'''6. 计算2019年1月至11月产品累计销量'''
#筛选2019年1月至11月自行车数据,这里使用的是‘~’取对立面
gather_customer_order_month_1_11 = gather_customer_order_month[~gather_customer_order_month['create_year_month'].\
isin(['2019-12','2020-01','2020-02','2020-03','2020-04','2020-05','2020-06','2020-07','2020-08','2020-09','2020-10','2020-11'])]
gather_customer_order_month_1_11.rename(columns = {"order_num_x":"order_month_product","order_num_y":"sum_order_month"},inplace = True)
#计算2019年1月至11月自行车累计销量
gather_customer_order_month_1_11_sum = gather_customer_order_month_1_11.groupby(by = 'product_name').agg({'order_month_product':'sum'}).reset_index()
#重命名sum_order_1_11:1-11月产品累计销量
gather_customer_order_month_1_11_sum = gather_customer_order_month_1_11_sum.rename(columns = {'order_month_product':'sum_order_1_11'})
'''7. 2019年11月自行车产品销量、环比、累计销量'''
#按相同字段product_name产品名,合并两张表
gather_customer_order_month_11 = pd.merge(gather_customer_order_month_11,gather_customer_order_month_1_11_sum,how='left',on='product_name')
''' 8. 存入数据库'''
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_dw')
gather_customer_order_month_11.to_sql('pt_bicycle_product_sales_order_month_11_cc',con = engine, if_exists = 'replace')
gather_customer_order_month_11
part4:用户行为分析
# 读取数据库客户信息表
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_ods?charset=gbk')
df_customer = pd.read_sql_query("select customer_key,birth_date,gender,marital_status from ods_customer where create_date < '2019-12-1'",con = engine)
# 查看数据类型
df_customer.info()
# 将customer_key由int类型转为str类型
df_customer['customer_key'] = df_customer.customer_key.map(lambda x:str(x))
# 读取数据库销售订单表
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_ods?charset=gbk')
df_sales_orders_11 = pd.read_sql_query("select * from ods_sales_orders where create_date>='2019-11-1' and create_date<'2019-12-1'",con = engine)
# pd.merge,how = 'inner join'
sales_customer_order_11 = pd.merge(df_sales_orders_11,df_customer,on='customer_key',how='inner')
# 根据sales_customer_order_11['birth_date'],获取客人的年份作为新的一列,以字符串类型存储
birth_year = sales_customer_order_11.birth_date.astype(str).apply(lambda x: x.split('-')[0]).rename('birth_year')
sales_customer_order_11 = pd.concat([sales_customer_order_11,birth_year],axis = 1)
'''1. 用户年龄分析'''
#修改出生年为int数据类型
sales_customer_order_11['birth_year'] = sales_customer_order_11.birth_year.astype(int)
# 计算用户年龄
sales_customer_order_11['customer_age'] = 2019 - sales_customer_order_11['birth_year']
# 利用customer_age字段,进行年龄分层,
#年龄分层1
listBins = [30,35,40,45,50,55,60,65]
listLabels = ['31-35','36-40','41-45','46-50','51-55','56-60','61-65']
#新增'age_level'分层区间列
sales_customer_order_11['age_level'] = pd.cut(sales_customer_order_11['customer_age'],bins=listBins, labels=listLabels, include_lowest=False)
#筛选销售订单为自行车的订单信息
df_customer_order_bycle = sales_customer_order_11.loc[sales_customer_order_11['cplb_zw'] == '自行车']
# 计算年龄比率
df_customer_order_bycle['age_level_rate'] = 1/df_customer_order_bycle['age_level'].count()
# 年龄分层2
# 将年龄分为3个层次,分别为'<=29'、'30-39'、'>=40'
listBins = [0,29,39,65]
listLabels = ['<=29','30-39','>=40']
df_customer_order_bycle['age_level2'] = pd.cut(sales_customer_order_11['customer_age'],bins=listBins, labels=listLabels, include_lowest=False)
# 求每个年龄段人数
age_level2_count = df_customer_order_bycle.groupby(by = 'age_level2').sales_order_key.count().reset_index()
'''2. 用户性别'''
gender_count = df_customer_order_bycle.groupby(by = 'gender').cplb_zw.count().reset_index()
df_customer_order_bycle = pd.merge(df_customer_order_bycle,age_level2_count,on = 'age_level2').rename(columns = {'sales_order_key_y':'age_level2_count'})
df_customer_order_bycle['age_level2_rate'] = 1/df_customer_order_bycle['age_level2_count']
df_customer_order_bycle = pd.merge(df_customer_order_bycle,gender_count,on = 'gender').rename(columns = {'cplb_zw_y':'gender_count'})
df_customer_order_bycle['gender_rate'] = 1/df_customer_order_bycle['gender_count']
#df_customer_order_bycle 将11月自行车用户存入数据库
#存入数据库
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_dw')
df_customer_order_bycle.to_sql('pt_user_behavior_november_cc',con = engine,if_exists='replace', index=False)
df_customer_order_bycle
part5:2019年11月热品销售分析
'''1. 11月产品销量TOP10产品,销售数量及环比'''
# 筛选11月数据
gather_customer_order_11 = gather_customer_order.loc[gather_customer_order['create_year_month'] == '2019-11']
# 按照销量降序,取出TOP10产品
customer_order_11_top10 = gather_customer_order_11.groupby(by = 'product_name').order_num.count().reset_index().sort_values(by = 'order_num',ascending = False).head(10)
# TOP10销量产品信息
list(customer_order_11_top10['product_name'])
# 计算TOP10销量及环比
# 查看11月环比数据
gather_customer_order_month_10_11.head()
customer_order_month_10_11 = gather_customer_order_month_10_11[['create_year_month','product_name','order_month_product','cpzl_zw','order_num_diff']]
customer_order_month_10_11 = customer_order_month_10_11[customer_order_month_10_11['product_name'].isin(list(customer_order_11_top10['product_name']))]
customer_order_month_10_11['category'] = '本月TOP10销量'
'''2. 11月增速TOP10产品,销售数量及环比'''
customer_order_month_11 = gather_customer_order_month_10_11.loc[gather_customer_order_month_10_11['create_year_month'] == '2019-11'].sort_values(by = 'order_num_diff',ascending = False).head(10)
customer_order_month_11_top10_seep = gather_customer_order_month_10_11.loc[gather_customer_order_month_10_11['product_name'].isin(list(customer_order_month_11['product_name']))]
customer_order_month_11_top10_seep = customer_order_month_11_top10_seep[['create_year_month','product_name','order_month_product','cpzl_zw','order_num_diff']]
customer_order_month_11_top10_seep['category'] = '本月TOP10增速'
# 合并TOP10销量表customer_order_month_10_11,TOP10增速customer_order_month_11_top10_seep
#axis = 0按照行维度合并,axis = 1按照列维度合并
hot_products_11 = pd.concat([customer_order_month_10_11,customer_order_month_11_top10_seep],axis = 0)
#存入数据库
engine = create_engine('mysql+pymysql://root:password@localhost:3306/adventure_dw')
hot_products_11.to_sql('pt_hot_products_november_cc',con = engine,if_exists='replace', index=False)
hot_products_11
3. PPT汇报输出
通过上述数据加工,将存入mysql的数据接入powerbi与Excel,分别绘制图像与表格,通过以下5个方面,完成线上自行车业务分析报告:
- 整体销售表现
- 地域销售表现
- 产品销售表现
- 热品销售表现
- 用户行为分析
网友评论