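1. How to learn data analysis
Over the past two days I talked with each student individually, and everyone had plenty of ideas:
1) Some students want learning methods and approaches, or reference material they can follow step by step, so that they can study effectively on their own;
2) Others care more about Python itself: they want to learn it well and use it for more concrete analyses.
In short, the requests fall into two areas: analytical thinking and tool usage. These are two sides of the same coin: an analysis approach needs concrete tools to be carried out, and a tool only shows its value when guided by an analysis approach.
Python is not the only analysis tool (the introductory course mainly covers Python).
R, SPSS, SAS, Tableau and others are also common (the advanced course mainly covers R and Tableau).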
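1.1 Learning through projects
The most effective way to learn both is to work on concrete projects.
Through practice you absorb analysis approaches and methods, and through practice you pick up the details of Python (numpy, pandas, matplotlib).
Sources of projects:
1) Udacity Nanodegree projects
2) Kaggle (English)
3) 科赛 / Kesci (Chinese)
4) Competitions (Alibaba, JD, Kaggle, Kesci, and so on)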
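1.2 Learning analytical thinking
1) Find a dataset yourself, pose questions, and answer them (the most effective approach)
2) Read other people's analysis reports (both Kaggle and Kesci host professional-quality reports)
3) Recommended books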
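1.3 Learning Python
1) Python basics
For Python basics, any reliable online resource will do: Udacity (the full course covers Python in depth) or other MOOCs.
(I originally worked through this tutorial to learn the basics: How to Think Like a Computer Scientist)
2) Scientific computing packages (numpy, pandas, matplotlib)
Pick up their usage gradually through project work, learn to read the reference documentation, and make good use of search engines and Stack Overflow.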
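2. Key points review
2.1 Using Jupyter
Kernels available for Jupyter include Python, Julia, R, Go, Matlab, SAS, and others.
Command mode:
- Insert a cell: A inserts a cell above the current one; B inserts a cell below it
- Copy cell: C
- Paste cell: V
- Cut cell: X (also handy as a delete shortcut)
Edit mode:
- Run the current cell: control + return
  (shift + return, which also moves to the next cell, and command + return work too; pick one and stick with it)
- Run all cells:
  Kernel => Restart & Run All
  Cell => Run All / Run All Above / Run All Below
- Code completion: tab
- Show documentation: shift + tab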
2.2 Python basics
1) Basic data types:
Number (int, float, complex) - numeric types
String - text
## integer
a = 10
## floating point
b = 5.1
## complex number (two equivalent ways to write it)
c = 3 + 5j
c = complex(3, 5)
print(c)
# output
(3+5j)
## string
test = 'hello'
test[0]
# output
'h'
test.upper()
# output
'HELLO'
2) Compound data types
List - list
Tuple - tuple
Set - set
Dictionary - dictionary
my_list1 = ['hello', 'data', 10, 20]
my_list2 = list('abcd')
print(my_list2)
# output
['a', 'b', 'c', 'd']
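The list example above covers only one of the four compound types. As a minimal sketch of the other three, the snippet below uses made-up values and variable names (`point`, `cities`, `pm_by_city`) that are purely illustrative and do not come from the course material.

```python
# Illustrative values only; these names are not from the course material.
point = (3, 4)                                    # tuple: ordered and immutable
cities = {'Beijing', 'Shanghai', 'Beijing'}       # set: duplicates are dropped
pm_by_city = {'Beijing': 95.0, 'Shanghai': 53.0}  # dict: key -> value (made-up numbers)

x, y = point                  # a tuple can be unpacked into variables
pm_by_city['Chengdu'] = 80.0  # a dict can grow after it is created

print(x, y)
print(len(cities))
print(sorted(pm_by_city))     # sorting a dict yields its keys in order
```

A rough rule of thumb: use a tuple for a fixed record, a set for membership tests and de-duplication, and a dict for lookups by key.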
3) Loops
for loop: any iterable object can be traversed with a for loop, for example strings, lists, tuples, sets, and dictionaries.
for i in 'abcd':
    print(i)
# output
a
b
c
d
for i in ['zhao','qian','sun','li']:
    print(i)
# output
zhao
qian
sun
li
for a, b in enumerate(['zhao','qian','sun','li']):
    print(a, b)
# output
0 zhao
1 qian
2 sun
3 li
enumerate(iterable, start=0)
Return an enumerate object. iterable must be a sequence, an iterator, or some other object which supports iteration.
(Source: Python documentation, Built-in Functions: enumerate())
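The `start` parameter in the signature above is not exercised by the earlier example. A small sketch, reusing the same list of names, shows how it shifts the counter so that numbering begins at 1; the `numbered` list is just an illustrative helper.

```python
names = ['zhao', 'qian', 'sun', 'li']

numbered = []
# start=1 makes the counter begin at 1 instead of the default 0
for rank, name in enumerate(names, start=1):
    numbered.append('{}. {}'.format(rank, name))

print(numbered)
```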
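4) while loop
while loop: the loop body runs as long as the loop condition holds.
## infinite loop (never terminates; interrupt the kernel to stop it)
while True:
    print(1)
## loop with a termination condition: prints 0 through 9
flag = 0
while flag < 10:
    print(flag)
    flag += 1
Summary: a for loop suits cases where the object to iterate over or the number of iterations is known; a while loop suits cases where only the loop condition is known.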
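To make the summary above concrete, here is a small sketch contrasting the two cases. The numbers are arbitrary: the for loop sums a known range, while the while loop keeps halving a value until a condition fails, so the iteration count is not written down in advance.

```python
# for: the number of iterations is known up front (1 through 5)
total = 0
for n in range(1, 6):
    total += n

# while: only the stopping condition is known; count how many
# integer halvings it takes to bring 100 down to 1
value = 100
steps = 0
while value > 1:
    value //= 2
    steps += 1

print(total, steps)
```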
5) List comprehensions
A list comprehension is syntactic sugar that does much the same job as a for loop that builds a list.
## for loop
names = ['zhao','qian','sun','li']
tmp = []
for name in names:
    tmp.append(name.upper())
print(tmp)
# output
['ZHAO', 'QIAN', 'SUN', 'LI']
## simple list comprehension
[name.upper() for name in names]
# output
['ZHAO', 'QIAN', 'SUN', 'LI']
## list comprehension with an if clause
[i for i in range(10) if i%2 == 0]
# output
[0, 2, 4, 6, 8]
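The last form above combines a filter with the loop. As a sketch of how it maps back to an ordinary loop, the snippet below writes the same filter-and-transform step both ways; the length threshold of 3 is an arbitrary choice for illustration.

```python
names = ['zhao', 'qian', 'sun', 'li']

# explicit loop: filter with if, then transform and append
long_names_loop = []
for name in names:
    if len(name) > 3:
        long_names_loop.append(name.upper())

# the same logic as a comprehension: [transform for item in iterable if condition]
long_names_comp = [name.upper() for name in names if len(name) > 3]

print(long_names_comp)
```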
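3. Project walkthrough
The project's logic is straightforward and uses only two kinds of plots: box plots and bar charts. The box plot summarizes a single variable (PM), while the bar chart summarizes two variables (PM vs. time).
3.1 Data wrangling
1) Keep PM_US Post and the other required columns
2) Add a city column
3) Convert season from numeric codes to season names
4) Merge the data for the five cities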
import pandas as pd

files = ['BeijingPM20100101_20151231.csv',
         'ChengduPM20100101_20151231.csv',
         'GuangzhouPM20100101_20151231.csv',
         'ShanghaiPM20100101_20151231.csv',
         'ShenyangPM20100101_20151231.csv']
out_columns = ['No', 'year', 'month', 'day', 'hour', 'season', 'PM_US Post']
# collect one cleaned DataFrame per file
frames = []
for val in files:
    df = pd.read_csv(val)
    # keep only the columns of interest
    df = df[out_columns].copy()
    # create a city column from the file name, e.g. 'BeijingPM...' -> 'Beijing'
    df['city'] = val.split('P')[0]
    # map season codes to season names
    df['season'] = df['season'].map({1:'Spring', 2:'Summer', 3:'Autumn', 4: 'Winter'})
    frames.append(df)
# merge all files into one; pd.concat replaces DataFrame.append,
# which was removed in pandas 2.0
df_all_cities = pd.concat(frames)
# replace the space in column names with '_'
df_all_cities.columns = [c.replace(' ', '_') for c in df_all_cities.columns]
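The wrangling steps above need the five CSV files to run. As a self-contained sketch of the same pattern (add a city column, map season codes, concatenate, rename columns), the snippet below substitutes two tiny in-memory tables with made-up readings. The `raw` dictionary and its values are assumptions for illustration only, and `ignore_index=True` is an optional choice that renumbers the rows.

```python
import pandas as pd

# Two tiny stand-in tables (made-up values) in place of the city CSV files.
raw = {
    'Beijing': pd.DataFrame({'season': [1, 4], 'PM_US Post': [59.0, 303.0]}),
    'Shanghai': pd.DataFrame({'season': [2, 3], 'PM_US Post': [41.0, 52.0]}),
}
season_names = {1: 'Spring', 2: 'Summer', 3: 'Autumn', 4: 'Winter'}

frames = []
for city, df in raw.items():
    df = df.copy()                                 # leave the input untouched
    df['city'] = city                              # add the city column
    df['season'] = df['season'].map(season_names)  # codes -> names
    frames.append(df)

# Concatenate once at the end rather than growing a DataFrame inside the loop.
combined = pd.concat(frames, ignore_index=True)
combined.columns = [c.replace(' ', '_') for c in combined.columns]
print(combined)
```

Collecting the pieces in a list and calling `pd.concat` once avoids copying the accumulated data on every iteration.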
3.2 Data filtering
1) Conditional filtering: filter_data()
2) Conditional filtering plus summary statistics: reading_stats()
1) Conditional filtering
def filter_data(data, condition):
    """
    Remove rows that do not match the condition provided.
    Takes a DataFrame as input and returns a filtered DataFrame.
    The condition should be a string of the following format:
      '<field> <op> <value>'
    where the following operations are valid: >, <, >=, <=, ==, !=
    Examples: "year >= 2012", "city == 'Shanghai'"
    """
    # Only want to split on first two spaces separating field from operator and
    # operator from value: spaces within value should be retained.
    field, op, value = condition.split(" ", 2)
    # check if field is valid
    if field not in data.columns.values:
        raise Exception("'{}' is not a feature of the dataframe. Did you spell something wrong?".format(field))
    # convert value into number or strip excess quotes if string
    try:
        value = float(value)
    except ValueError:
        value = value.strip("\'\"")
    # get booleans for filtering
    if op == ">":
        matches = data[field] > value
    elif op == "<":
        matches = data[field] < value
    elif op == ">=":
        matches = data[field] >= value
    elif op == "<=":
        matches = data[field] <= value
    elif op == "==":
        matches = data[field] == value
    elif op == "!=":
        matches = data[field] != value
    else:  # catch invalid operation codes
        raise Exception("Invalid comparison operator. Only >, <, >=, <=, ==, != allowed.")
    # keep only the matching rows and renumber the index from 0
    data = data[matches].reset_index(drop = True)
    return data
## usage example
filter_data(df_all_cities, "city == 'Shanghai'").head()
# output
   No  year  month  day  hour  season  PM_US_Post      city
0   1  2010      1    1     0  Winter         NaN  Shanghai
1   2  2010      1    1     1  Winter         NaN  Shanghai
2   3  2010      1    1     2  Winter         NaN  Shanghai
3   4  2010      1    1     3  Winter         NaN  Shanghai
4   5  2010      1    1     4  Winter         NaN  Shanghai
## usage example
filter_data(df_all_cities, "year >= 2012").head()
# output
      No  year  month  day  hour  season  PM_US_Post     city
0  17521  2012      1    1     0  Winter       303.0  Beijing
1  17522  2012      1    1     1  Winter       215.0  Beijing
2  17523  2012      1    1     2  Winter       222.0  Beijing
3  17524  2012      1    1     3  Winter        85.0  Beijing
4  17525  2012      1    1     4  Winter        38.0  Beijing
How filtering works in pandas:
## produces a pandas.Series of True/False values
(df_all_cities['season'] == 'Spring').head()
# output
0    False
1    False
2    False
3    False
4    False
Name: season, dtype: bool
## use that boolean Series to select rows
df_all_cities[df_all_cities['season'] == 'Spring'].head()
# output
        No  year  month  day  hour  season  PM_US_Post     city
1416  1417  2010      3    1     0  Spring        59.0  Beijing
1417  1418  2010      3    1     1  Spring        42.0  Beijing
1418  1419  2010      3    1     2  Spring        35.0  Beijing
1419  1420  2010      3    1     3  Spring        35.0  Beijing
1420  1421  2010      3    1     4  Spring        29.0  Beijing
Summary: pandas offers two common ways to filter rows. One is the [] operator with a boolean Series (as in the examples above); the other is df.query() (the method used more often in the course).
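The summary above mentions `df.query()` without showing it. The following sketch applies both styles to a small made-up table and checks that they select the same rows; the data values are assumptions for illustration, not readings from the project dataset.

```python
import pandas as pd

# Made-up rows standing in for the combined city data.
df = pd.DataFrame({
    'city': ['Beijing', 'Beijing', 'Shanghai', 'Shanghai'],
    'year': [2011, 2013, 2011, 2013],
    'PM_US_Post': [90.0, 100.0, 50.0, 55.0],
})

# Style 1: boolean masks combined with & (each comparison needs parentheses)
by_mask = df[(df['city'] == 'Shanghai') & (df['year'] >= 2012)]

# Style 2: the same condition written as a query string
by_query = df.query("city == 'Shanghai' and year >= 2012")

print(by_query)
```

Both forms keep the original row labels; `filter_data()` additionally resets the index.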
2) Return the filtered dataset and print summary statistics
import matplotlib.pyplot as plt
import seaborn

def reading_stats(data, filters = None, verbose = True):
    """
    Report number of readings and average PM2.5 readings for data points that meet
    specified filtering criteria.
    """
    # avoid a mutable default argument
    filters = filters or []
    n_data_all = data.shape[0]
    # Apply filters to data
    for condition in filters:
        ## call filter_data() once per condition
        data = filter_data(data, condition)
    # Compute number of data points that met the filter criteria.
    n_data = data.shape[0]
    # Compute statistics for PM 2.5 readings.
    pm_mean = data['PM_US_Post'].mean()
    # .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
    pm_qtiles = data['PM_US_Post'].quantile([.25, .5, .75]).to_numpy()
    # Report computed statistics if verbosity is set to True (default).
    if verbose:
        if filters:
            print('There are {:d} readings ({:.2f}%) matching the filter criteria.'.format(n_data, 100. * n_data / n_data_all))
        else:
            print('There are {:d} readings in the dataset.'.format(n_data))
        print('The average readings of PM 2.5 is {:.2f} ug/m^3.'.format(pm_mean))
        print('The median readings of PM 2.5 is {:.2f} ug/m^3.'.format(pm_qtiles[1]))
        print('25% of readings of PM 2.5 are smaller than {:.2f} ug/m^3.'.format(pm_qtiles[0]))
        print('25% of readings of PM 2.5 are larger than {:.2f} ug/m^3.'.format(pm_qtiles[2]))
        seaborn.boxplot(x = data['PM_US_Post'], showfliers=False)
        plt.title('Boxplot of PM 2.5 of filtered data')
        plt.xlabel('PM_US Post (ug/m^3)')
    # Return the filtered data
    return data
## usage example
df_test = reading_stats(df_all_cities, ["city == 'Shanghai'", "year >= 2012"])
# output
There are 35064 readings (13.34%) matching the filter criteria.
The average readings of PM 2.5 is 52.88 ug/m^3.
The median readings of PM 2.5 is 41.00 ug/m^3.
25% of readings of PM 2.5 are smaller than 26.00 ug/m^3.
25% of readings of PM 2.5 are larger than 67.00 ug/m^3.
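The mean and quartiles printed above come from `Series.mean()` and `Series.quantile()`. As a minimal sketch on five made-up readings (not project data), the snippet below shows what those two calls return and how the three quartiles are unpacked.

```python
import pandas as pd

# Five made-up readings, chosen so the quartiles are easy to check by hand.
readings = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

pm_mean = readings.mean()
# quantile() with a list returns a Series; to_numpy() turns it into an array
q25, q50, q75 = readings.quantile([.25, .5, .75]).to_numpy()

print('mean = {:.2f}'.format(pm_mean))
print('25% / 50% / 75% quantiles = {:.2f} / {:.2f} / {:.2f}'.format(q25, q50, q75))
```

Note that `mean()` and `quantile()` skip NaN values by default, which matters here because the real PM column contains missing readings.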

Drawing a box plot with seaborn is straightforward:
## horizontal box plot, outliers hidden
seaborn.boxplot(x = df_all_cities['PM_US_Post'], showfliers=False);

## outliers are shown by default
seaborn.boxplot(x = df_all_cities['PM_US_Post']);

## vertical box plot, outliers hidden
seaborn.boxplot(y = df_all_cities['PM_US_Post'], showfliers=False);

References:
seaborn.boxplot usage examples
seaborn official documentation
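The `showfliers` option above controls whether points beyond the whiskers are drawn. As a rough sketch of what counts as such a point, the snippet below computes the quartiles and applies the conventional 1.5 x IQR rule with numpy. This assumes the default whisker setting; the sample values are made up, and the variable names are illustrative only.

```python
import numpy as np

# Made-up sample with one obviously extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# The box spans the first to the third quartile; the line inside is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Conventional whisker limits: 1.5 * IQR beyond the box (assumed default).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are the "fliers" that showfliers=False hides.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, outliers)
```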
3.3 Exploratory data analysis and visualization
def univariate_plot(data, key = '', color = 'grey'):
    """
    Plot average PM 2.5 readings, given a feature of interest
    """
    # Check if the key exists
    if not key:
        raise Exception("No key has been provided. Make sure you provide a variable on which to plot the data.")
    if key not in data.columns.values:
        raise Exception("'{}' is not a feature of the dataframe. Did you spell something wrong?".format(key))
    # Create plot: mean PM 2.5 per group, drawn as a bar chart
    plt.figure(figsize=(8,6))
    data.groupby(key)['PM_US_Post'].mean().plot(kind = 'bar', color = color)
    plt.ylabel('PM 2.5 (ug/m^3)')
    plt.title('Average PM 2.5 Reading by {:s}'.format(key), fontsize =14)
    plt.show()
    return None
## Shanghai data, 2012 through 2015
df_test = reading_stats(df_all_cities, ["city == 'Shanghai'", "year >= 2012"])
## group df_test by month and take the mean of each numeric column
## (numeric_only=True skips the text columns season and city, which recent pandas versions cannot average)
df_test.groupby('month').mean(numeric_only=True)
# output
                 No         year        day  hour  PM_US_Post
month
1      31050.500000  2013.500000  16.000000  11.5   80.847251
2      31645.137168  2013.486726  14.628319  11.5   59.084941
3      32472.500000  2013.500000  16.000000  11.5   59.375595
4      33204.500000  2013.500000  15.500000  11.5   55.371645
5      33936.500000  2013.500000  16.000000  11.5   52.226216
6      34668.500000  2013.500000  15.500000  11.5   40.923131
7      35400.500000  2013.500000  16.000000  11.5   32.380491
8      36144.500000  2013.500000  16.000000  11.5   27.385921
9      36876.500000  2013.500000  15.500000  11.5   32.543085
10     37608.500000  2013.500000  16.000000  11.5   42.177273
11     38340.500000  2013.500000  15.500000  11.5   64.351839
12     39072.500000  2013.500000  16.000000  11.5   85.853311
## select only the PM_US_Post column before aggregating
df_test.groupby('month')['PM_US_Post'].mean()
# output
month
1     80.847251
2     59.084941
3     59.375595
4     55.371645
5     52.226216
6     40.923131
7     32.380491
8     27.385921
9     32.543085
10    42.177273
11    64.351839
12    85.853311
Name: PM_US_Post, dtype: float64
## plot the aggregated data as a bar chart ('bar')
df_test.groupby('month')['PM_US_Post'].mean().plot(kind = 'bar', color = 'steelblue')

## plot the aggregated data as a line chart ('line')
df_test.groupby('month')['PM_US_Post'].mean().plot(kind = 'line', color = 'steelblue')
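The group-then-aggregate step above can be tried without the project data. This sketch uses four made-up rows (the values are assumptions for illustration) and selects the column before calling `mean()`, so the text column never enters the aggregation.

```python
import pandas as pd

# Four made-up rows: two readings in month 1, two in month 2.
df = pd.DataFrame({
    'month': [1, 1, 2, 2],
    'season': ['Winter'] * 4,
    'PM_US_Post': [80.0, 90.0, 50.0, 70.0],
})

# Split rows by month, pick one column, then average within each group.
monthly_mean = df.groupby('month')['PM_US_Post'].mean()
print(monthly_mean)
```

The result is a Series indexed by month, which is why `.plot(kind='bar')` can use the group labels directly as the x axis.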

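4. Other data analysis examples
Case study: analysis of The Movie Database (TMDb) movie dataset

Question: which are the 20 most popular movies?


# Horizontal bar chart of the 20 movies with the highest popularity.
# Sorting in ascending order and taking the last 20 rows puts the most popular movie at the top of the chart.
df.sort_values('popularity', ascending=True)[-20:].set_index('original_title')['popularity'].plot(
    kind='barh', figsize=(12,8))
plt.title('Top 20 Most Popular Movies')
plt.xlabel('Popularity')
plt.ylabel('Movie Title');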
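The selection step in the chart above (sort, then keep the top rows) can also be written with `DataFrame.nlargest()`. The sketch below uses four made-up movies in place of the TMDb data; the titles and popularity scores are placeholders, and it picks the top 2 rather than the top 20 to keep the example small.

```python
import pandas as pd

# Placeholder titles and scores standing in for the TMDb columns.
movies = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'popularity': [1.5, 9.0, 4.2, 7.3],
})

# nlargest returns the n rows with the highest values, highest first.
top2 = movies.nlargest(2, 'popularity').set_index('original_title')['popularity']
print(top2)
```

For a horizontal bar chart, reversing the result (for example `top2[::-1]`) puts the highest value at the top, which matches the ascending sort used in the original one-liner.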
