我们可以用Python中的numpy、pandas、matplotlib等包对数据进行可视化分析。
一、数据分析的基本步骤
image欢迎加入新手技术交流基地:1004391443 群里有大牛解答,有资源,有源码,学不学的会就看你了!
二、提出问题
一切的数据分析目的是为了解决问题,问题明确了才能为后续的分析过程确定了方向,帮助我们有效的选取数据,分析研究。
本案的分析目标是从销售数据中分析出以下业务指标:
1)月均消费次数
2)月均消费金额
3)客单价
4)消费趋势
三、理解问题
1、导入数据
需提前安装numpy 、pandas包
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
import numpy as np
import pandas as pd
</pre>
加载数据文件
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
x1= pd.ExcelFile(r'D:\数据分析\朝阳医院2018年销售数据.xlsx', dtype='object')
</pre>
读取数据表
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF =x1.parse('Sheet1',dtype='object')
</pre>
2、查看数据基本状况
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF.head(10)
</pre>
image查看数据的基本大小
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF.shape
</pre>
(6578, 7)
查看数据数据表中的每项的基本类型
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF.dtypes
</pre>
image用描述函数查看各项的基本状况
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF.describe()
</pre>
image通过以上操作我们可以了解到,该项目的有七项基本项目(购药时间、社保卡号、商品编码、商品名称、销售数量、应收金额、实收金额),项目行数为6578。
四、数据清洗
获得数据后,不能马上进行数据分析。往往第一手获得数据并不符合我们的数据分析的要求,而且数据表格可能会含有缺失值、异常值等,这使得我们数据分析产生偏差。这就要求我们在数据分析前需要对数据进行清洗。而且在数据分析中,有大约60%的时间花在数据清洗过程。
数据清洗步骤:
1.选择子集
2.列名重命名
3.缺失数据处理
4.数据类型转换
5.数据排序
6.异常值处理
1、选择子集
有事遇到项目较多,分析中只需用到某几项,我们需要选择数据集的子集作为研究对象。
本案不需要选择子集,下面示范怎么选择子集(一般方法):
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF1=salesDF.loc[0:10,'社保卡号':'销售数量']
salesDF1
</pre>
image2、列名重命名
如果数据列名符合习惯名称和数据分析,这就需要对列名进行重命名。
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
#要用花括号,购药时间->销售时间
namechang={'购药时间':'销售时间'}
salesDF.rename(columns = namechang,inplace=True)
#参数inplace=True表示覆盖元数据集
</pre>
改好后,再次查看数据
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF.head()
</pre>
image3、缺失数据处理
对于缺失数据,如果缺失数据较少时我们可以直接删除数据。如果缺失数据较多,可以通过建立模型插值来填充数据。
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
#查看缺失值的数量
salesDF[salesDF[['销售时间','社保卡号']].isnull().values == True]
</pre>
image序号6574因为销售时间和社保卡号都缺失,所以出现了两次,需要去掉重复数据。
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
NaDF = salesDF[salesDF[['销售时间','社保卡号']].isnull().values ==True].drop_duplicates()
NaDF
</pre>
image查看缺失值数量
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
NaDF.shape
</pre>
(3, 7)
查看原数据规模
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF.shape
</pre>
(6578, 7)
删除缺失值
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF = salesDF.dropna(subset=['销售时间','社保卡号'],how = 'any')
#how='any' 在给定的任何一列中有缺失值就删除,
#how='all' 在给定的所有列中都有缺失值就删除。
</pre>
数据删除后并不能自动更新数据序号,如未能更新会导致后续数据合并等操作出错。
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
#重新更新序号,使得序号修改为从0到N按顺序的索引值
salesDF=salesDF.reset_index(drop=True)
</pre>
4、数据类型转换
把数量、金额项目从字符串类型转换为浮点型类型。
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF['销售数量'] = salesDF['销售数量'].astype('float')
salesDF['应收金额'] = salesDF['应收金额'].astype('float')
salesDF['实收金额'] = salesDF['实收金额'].astype('float')
print('转换后的数据类型:\n',salesDF.dtypes)
</pre>
image把日期项目从字符串类型转换为日期类型
销售日期中含有日期和星期,这里只需用到日期,此需要对数据分割。
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
#日期转换
def dateChange(dateLaw):
dateList = [] #建立空列表
for i in dateLaw:
#例如2018-01-01 星期五,分割后为:2018-01-01
str = i.split(' ')[0] #按空格分割,取第一列
dateList.append(str)
dateChangeOut = pd.Series(dateList) #数组化
return dateChangeOut
dateChangeOut = dateChange(salesDF['销售时间'])
dateChangeOut
</pre>
0 2018-01-01
1 2018-01-02
2 2018-01-06
3 2018-01-11
4 2018-01-15
5 2018-01-20
6 2018-01-31
7 2018-02-17
8 2018-02-22
9 2018-02-24
10 2018-03-05
11 2018-03-05
12 2018-03-05
13 2018-03-07
14 2018-03-09
15 2018-03-15
16 2018-03-15
17 2018-03-15
18 2018-03-20
19 2018-03-22
20 2018-03-23
21 2018-03-24
22 2018-03-24
23 2018-03-28
24 2018-03-29
25 2018-04-05
26 2018-04-07
27 2018-04-13
28 2018-04-22
29 2018-05-01
...
6545 2018-04-05
6546 2018-04-05
6547 2018-04-09
6548 2018-04-10
6549 2018-04-10
6550 2018-04-10
6551 2018-04-12
6552 2018-04-13
6553 2018-04-13
6554 2018-04-14
6555 2018-04-15
6556 2018-04-15
6557 2018-04-15
6558 2018-04-15
6559 2018-04-16
6560 2018-04-17
6561 2018-04-18
6562 2018-04-21
6563 2018-04-22
6564 2018-04-24
6565 2018-04-25
6566 2018-04-25
6567 2018-04-25
6568 2018-04-26
6569 2018-04-26
6570 2018-04-27
6571 2018-04-27
6572 2018-04-27
6573 2018-04-27
6574 2018-04-28
Length: 6575, dtype: object
替换原销售时间数据
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF['销售时间'] = dateChangeOut
</pre>
查看替换情况
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF.head()
</pre>
image在转换后查询是否产生新的缺失值
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
salesDF['销售时间'].isnull().any()
</pre>
False
没有产生新的缺失值
把销售时间的数据类型转换为日期型
<pre style="margin: 0px; padding: 0px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 18px; line-height: inherit; font-family: inherit; vertical-align: baseline; word-break: break-word; color: rgb(93, 93, 93); letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial;">
dateOut=pd.to_datetime(salesDF['销售时间'], format = '%Y-%m-%d', errors='coerce')
#format 是原始数据中日期的格式
#errors='cperce' 如果原始数据不符合日期的格式,这输出值为NaT
dateOut
</pre>
网友评论