2018-8-11

Author: katemiao | Published 2018-08-11 19:39


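I. How to learn data analysis

Over the past two days I talked with each student individually, and everyone had clear ideas:
1) Some students want learning methods and a line of approach, or reference material, so that they can follow a map and study effectively on their own;
2) Others care more about Python itself and want to learn it well enough to carry out more concrete analyses.

In short, the requests fall into two groups: analytical thinking and tool usage. These are two sides of the same coin: analytical thinking needs concrete tools to be carried out, and a tool only shows its value when guided by analytical thinking.

Analysis tools are not limited to Python (which the introductory course focuses on);
other common ones include R, SPSS, SAS, and Tableau (the advanced course mainly covers R and Tableau).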

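1.1 Learning through projects

The most effective way to learn is to work on a concrete project that exercises both sides.

Practice is where you absorb analytical thinking and methods, and where you master the details of Python (numpy, pandas, matplotlib).

Project sources:
1) Udacity Nanodegree projects
2) Kaggle (English)
3) Kesci (科赛, Chinese)
4) Competitions (Alibaba, JD, Kaggle, Kesci, and so on)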

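1.2 Learning analytical thinking

1) Find a dataset yourself, pose questions, and answer them (the most effective approach)
2) Read other people's analysis reports (Kaggle and Kesci both host professional-quality reports)
3) Recommended book:

Lean Analytics (《精益数据分析》)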

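1.3 Learning Python

1) Python basics

For the basics, any reliable online resource will do: Udacity (the full course covers Python in depth) or another MOOC.
(I originally worked through this tutorial to learn the basics: How to Think Like a Computer Scientist)

2) Scientific computing packages (numpy, pandas, matplotlib)

Pick up the relevant usage gradually through project work, learn to read the reference documentation, and make good use of search engines and Stack Overflow.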

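II. Knowledge review

2.1 Using Jupyter

Kernels available in Jupyter: Python, Julia, R, Go, Matlab, SAS, and others

Command mode:

Insert a cell: A inserts a cell above the current one; B inserts a cell below it
Copy a cell: C
Paste a cell: V
Cut a cell: X (also works as a delete shortcut)

Edit mode:

Run the current cell: control + return
(shift + return and command + return also work; get used to one of them)
Run all cells:
Kernel => Restart & Run All
Cell => Run All / Run All Above / Run All Below
Complete code: tab
Show documentation: shift + tab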

2.2 Python basics

1) Basic data types:
Number (int, float, complex) - numeric types
String - text

## integer variable
a = 10
## floating-point variable
b = 5.1
## complex number (two equivalent ways to create one)
c = 3 + 5j
c = complex(3, 5)
print(c)
# output
(3+5j)

## string
test = 'hello'
test[0]
# output
'h'

test.upper()
# output
'HELLO'

2) Composite data types
List
Tuple
Set
Dictionary

my_list1 = ['hello', 'data', 10, 20]

my_list2 = list('abcd')
print(my_list2)
# output
['a', 'b', 'c', 'd']
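
Only the list is demonstrated above. As a minimal sketch of the other three composite types, the following may help; the variable names (`point`, `tags`, `ages`) and values are made up for illustration and are not from the course material.

```python
# Tuple: an immutable, ordered sequence
point = (3, 5)
x, y = point            # a tuple can be unpacked into separate variables

# Set: an unordered collection of unique items
tags = set(['data', 'python', 'data'])   # the duplicate 'data' is dropped

# Dictionary: a mapping from keys to values
ages = {'zhao': 30, 'qian': 25}
ages['sun'] = 41        # add a new key-value pair

print(x, y)
print(len(tags))
print(sorted(ages))     # sorted() over a dict returns its keys in sorted order
```

Tuples suit fixed groups of values, sets suit membership tests and de-duplication, and dictionaries suit lookups by key.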

3) Loops

for loop: any iterable object can be looped over, for example strings, lists, tuples, sets, and dictionaries.

for i in 'abcd':
    print(i)
# output
a
b
c
d

for i in ['zhao','qian','sun','li']:
    print(i)
# output
zhao
qian
sun
li

for a, b in enumerate(['zhao','qian','sun','li']):
    print(a, b)
# output
0 zhao
1 qian
2 sun
3 li

enumerate(iterable, start=0)
Return an enumerate object. iterable must be a sequence, an iterator, or some other object which supports iteration.

built-in functions : enumerate()
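
The `start` parameter in the signature above is not used in the earlier example. A small sketch of what it changes, reusing the same list of names (the variable `numbered` is introduced here only for illustration):

```python
names = ['zhao', 'qian', 'sun', 'li']

# start=1 makes the counter begin at 1 instead of the default 0
numbered = list(enumerate(names, start=1))

for rank, name in numbered:
    print(rank, name)
```

This is handy when the counter is shown to people as a rank or a row number instead of being used as a list index.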

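4) while loops
while loop: the loop body runs for as long as the loop condition holds.

## infinite loop (never terminates; interrupt the kernel to stop it)
while True:
    print(1)

## loop with a terminating condition
flag = 0
while flag < 10:
    print(flag)
    flag += 1

Summary: a for loop suits cases where the objects to iterate over, or the number of iterations, are known in advance; a while loop suits cases where only the loop condition is known.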
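
To make the contrast between the two loop types concrete, here is a small sketch with made-up numbers (the names `readings`, `level`, and `steps` are hypothetical): the for loop walks a list whose contents are known, while the while loop runs until a condition stops holding.

```python
# for: the items to visit are known in advance
readings = [35, 80, 120, 60]
total = 0
for value in readings:
    total += value

# while: only the stopping condition is known in advance
# (keep halving until the value drops below 10)
level = 120
steps = 0
while level >= 10:
    level = level / 2
    steps += 1

print(total, steps)
```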

5) List comprehensions
A list comprehension is very similar to a for loop; it is syntactic sugar for building a list.

## for loop
names = ['zhao','qian','sun','li']
tmp = []
for name in names:
    tmp.append(name.upper())

print(tmp)
# output
['ZHAO', 'QIAN', 'SUN', 'LI']


## simple list comprehension
[name.upper() for name in names]
# output
['ZHAO', 'QIAN', 'SUN', 'LI']


## list comprehension with an if clause
[i for i in range(10) if i%2 == 0]
# output
[0, 2, 4, 6, 8]
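
The two forms above can also be combined. The following sketch (with a made-up length condition) transforms and filters in one expression and shows the equivalent explicit loop next to it:

```python
names = ['zhao', 'qian', 'sun', 'li']

# Transform and filter in one expression:
# upper-case only the names longer than two letters
long_names = [name.upper() for name in names if len(name) > 2]

# Equivalent explicit loop
tmp = []
for name in names:
    if len(name) > 2:
        tmp.append(name.upper())

print(long_names)
```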

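III. Project walkthrough

The project's approach is very clear and uses only two kinds of plot: box plots and bar charts. The box plot summarizes a single variable (PM), and the bar chart summarizes two variables together (PM vs. time).

3.1 Data wrangling

1) Keep PM_US Post and the other needed columns
2) Add a city column
3) Convert the season codes
4) Merge the data for the five cities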

import pandas as pd

files = ['BeijingPM20100101_20151231.csv',
       'ChengduPM20100101_20151231.csv',
       'GuangzhouPM20100101_20151231.csv',
       'ShanghaiPM20100101_20151231.csv',
       'ShenyangPM20100101_20151231.csv']

out_columns = ['No', 'year', 'month', 'day', 'hour', 'season', 'PM_US Post']

# collect one cleaned dataframe per city
frames = []

# iterate over the different files
for val in files:
    df = pd.read_csv(val)
    # keep only the needed columns; copy so that new columns can be added safely
    df = df[out_columns].copy()
    # create a city column from the file name, e.g. 'BeijingPM...' -> 'Beijing'
    df['city'] = val.split('P')[0]
    # map season codes to names
    df['season'] = df['season'].map({1:'Spring', 2:'Summer', 3:'Autumn', 4: 'Winter'})
    frames.append(df)

# merge all files into one (DataFrame.append was removed in pandas 2.0)
df_all_cities = pd.concat(frames)

# replace the space in variable names with '_'
df_all_cities.columns = [c.replace(' ', '_') for c in df_all_cities.columns]
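
The CSV files themselves are not included here, so the following is only a sketch of the same four steps on made-up data. The two in-memory "files", their values, and the `extra` column are invented for illustration; only the file-name pattern, the column list, and the season mapping come from the code above.

```python
import io
import pandas as pd

# Two tiny made-up CSV texts standing in for the real city files
fake_files = {
    'BeijingPM20100101_20151231.csv':
        "No,year,month,day,hour,season,PM_US Post,extra\n"
        "1,2010,1,1,0,4,129.0,43\n"
        "2,2010,3,1,0,1,59.0,50\n",
    'ShanghaiPM20100101_20151231.csv':
        "No,year,month,day,hour,season,PM_US Post,extra\n"
        "1,2010,1,1,0,4,,60\n",
}

out_columns = ['No', 'year', 'month', 'day', 'hour', 'season', 'PM_US Post']
season_names = {1: 'Spring', 2: 'Summer', 3: 'Autumn', 4: 'Winter'}

frames = []
for name, text in fake_files.items():
    # 1) keep only the needed columns
    df = pd.read_csv(io.StringIO(text))[out_columns].copy()
    # 2) add a city column taken from the file name
    df['city'] = name.split('P')[0]
    # 3) convert season codes to names
    df['season'] = df['season'].map(season_names)
    frames.append(df)

# 4) merge everything and tidy the column names
toy = pd.concat(frames, ignore_index=True)
toy.columns = [c.replace(' ', '_') for c in toy.columns]
print(toy)
```

Unlike the project code, this sketch passes `ignore_index=True`, so the merged frame gets a fresh 0..n-1 index instead of keeping each file's own row numbers.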

3.2 Data filtering

1) Filter by condition: filter_data()
2) Filter by condition and report summary statistics: reading_stats()

1) Filtering by condition

def filter_data(data, condition):
    """
    Remove rows that do not match the condition provided.
    Takes a pandas DataFrame as input and returns a filtered DataFrame.
    The condition should be a single string of the following format:
      '<field> <op> <value>'
    where the following operations are valid: >, <, >=, <=, ==, !=

    Example: "duration < 15" or "start_city == 'San Francisco'"
    """

    # Only want to split on first two spaces separating field from operator and
    # operator from value: spaces within value should be retained.
    field, op, value = condition.split(" ", 2)
    
    # check if field is valid
    if field not in data.columns.values :
        raise Exception("'{}' is not a feature of the dataframe. Did you spell something wrong?".format(field))

    # convert value into number or strip excess quotes if string
    try:
        value = float(value)
    except ValueError:
        value = value.strip("\'\"")

    # get booleans for filtering
    if op == ">":
        matches = data[field] > value
    elif op == "<":
        matches = data[field] < value
    elif op == ">=":
        matches = data[field] >= value
    elif op == "<=":
        matches = data[field] <= value
    elif op == "==":
        matches = data[field] == value
    elif op == "!=":
        matches = data[field] != value
    else: # catch invalid operation codes
        raise Exception("Invalid comparison operator. Only >, <, >=, <=, ==, != allowed.")
    
    # filter data and outcomes
    data = data[matches].reset_index(drop = True)
    return data

## Example usage
filter_data(df_all_cities, "city == 'Shanghai'").head()
# output
   No  year  month  day  hour  season  PM_US_Post      city
0   1  2010      1    1     0  Winter         NaN  Shanghai
1   2  2010      1    1     1  Winter         NaN  Shanghai
2   3  2010      1    1     2  Winter         NaN  Shanghai
3   4  2010      1    1     3  Winter         NaN  Shanghai
4   5  2010      1    1     4  Winter         NaN  Shanghai

## Example usage
filter_data(df_all_cities, "year >= 2012").head()
# output
      No  year  month  day  hour  season  PM_US_Post     city
0  17521  2012      1    1     0  Winter       303.0  Beijing
1  17522  2012      1    1     1  Winter       215.0  Beijing
2  17523  2012      1    1     2  Winter       222.0  Beijing
3  17524  2012      1    1     3  Winter        85.0  Beijing
4  17525  2012      1    1     4  Winter        38.0  Beijing
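
The if/elif chain in filter_data() can also be expressed as a lookup table. The sketch below is a hypothetical compact variant, not the course's implementation: `filter_data_compact` and `OPS` are names introduced here, and the toy dataframe is made up. It accepts the same '<field> <op> <value>' condition format.

```python
import operator
import pandas as pd

# Map each operator string to the matching function from the operator module
OPS = {'>': operator.gt, '<': operator.lt, '>=': operator.ge,
       '<=': operator.le, '==': operator.eq, '!=': operator.ne}

def filter_data_compact(data, condition):
    """Hypothetical compact variant of filter_data(); same condition format."""
    # Split on the first two spaces only, so spaces inside the value survive
    field, op, value = condition.split(" ", 2)
    if field not in data.columns:
        raise KeyError("'{}' is not a column of the dataframe.".format(field))
    if op not in OPS:
        raise ValueError("Invalid comparison operator: {}".format(op))
    # Convert the value to a number, or strip the quotes if it is a string
    try:
        value = float(value)
    except ValueError:
        value = value.strip("'\"")
    matches = OPS[op](data[field], value)
    return data[matches].reset_index(drop=True)

toy = pd.DataFrame({'city': ['Beijing', 'Shanghai', 'Shanghai'],
                    'year': [2011, 2011, 2013]})
result = filter_data_compact(toy, "year >= 2012")
print(result)
```

A dictionary of operators keeps the validation and the comparison in one place, at the cost of being slightly less obvious to readers new to the `operator` module.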

How to filter data in pandas:

## Produce a pandas.Series of True/False values
(df_all_cities['season'] == 'Spring').head()
# output
0    False
1    False
2    False
3    False
4    False
Name: season, dtype: bool


## Filter using the result above
df_all_cities[df_all_cities['season'] == 'Spring'].head()
# output
        No  year  month  day  hour  season  PM_US_Post     city
1416  1417  2010      3    1     0  Spring        59.0  Beijing
1417  1418  2010      3    1     1  Spring        42.0  Beijing
1418  1419  2010      3    1     2  Spring        35.0  Beijing
1419  1420  2010      3    1     3  Spring        35.0  Beijing
1420  1421  2010      3    1     4  Spring        29.0  Beijing

Summary: there are two ways to filter data in pandas: the [] operator (as in the examples above) and df.query() (the method used more often in the course).
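
Since df.query() is mentioned but not shown, here is a small sketch that applies the same two-condition filter both ways on a made-up dataframe (the values in `toy` are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'city': ['Beijing', 'Shanghai', 'Shanghai'],
                    'year': [2011, 2011, 2013],
                    'PM_US_Post': [90.0, 40.0, 55.0]})

# Boolean indexing: combine conditions with & and wrap each one in parentheses
by_mask = toy[(toy['city'] == 'Shanghai') & (toy['year'] >= 2012)]

# The same filter written as a query string
by_query = toy.query("city == 'Shanghai' and year >= 2012")

print(by_mask.equals(by_query))
```

The query string tends to read more naturally for several conditions, while boolean indexing works with any expression that produces a True/False Series.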

2) Return the filtered dataset and report summary statistics

import matplotlib.pyplot as plt
import seaborn

def reading_stats(data, filters = None, verbose = True):
    """
    Report number of readings and average PM2.5 readings for data points that meet
    specified filtering criteria.
    """

    n_data_all = data.shape[0]

    # Apply filters to data (None means no filtering)
    for condition in filters or []:
        ## call filter_data() once per condition
        data = filter_data(data, condition)

    # Compute number of data points that met the filter criteria.
    n_data = data.shape[0]

    # Compute statistics for PM 2.5 readings.
    pm_mean = data['PM_US_Post'].mean()
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    pm_qtiles = data['PM_US_Post'].quantile([.25, .5, .75]).to_numpy()
    
    # Report computed statistics if verbosity is set to True (default).
    if verbose:
        if filters:
            print('There are {:d} readings ({:.2f}%) matching the filter criteria.'.format(n_data, 100. * n_data / n_data_all))
        else:
            print('There are {:d} readings in the dataset.'.format(n_data))

        print('The average readings of PM 2.5 is {:.2f} ug/m^3.'.format(pm_mean))
        print('The median readings of PM 2.5 is {:.2f} ug/m^3.'.format(pm_qtiles[1]))
        print('25% of readings of PM 2.5 are smaller than {:.2f} ug/m^3.'.format(pm_qtiles[0]))
        print('25% of readings of PM 2.5 are larger than {:.2f} ug/m^3.'.format(pm_qtiles[2]))
        seaborn.boxplot(x = data['PM_US_Post'], showfliers=False)
        plt.title('Boxplot of PM 2.5 of filtered data')
        plt.xlabel('PM_US Post (ug/m^3)')

    # Return the filtered data
    return data

## Applying the function
df_test = reading_stats(df_all_cities, ["city == 'Shanghai'", "year >= 2012"])

There are 35064 readings (13.34%) matching the filter criteria.
The average readings of PM 2.5 is 52.88 ug/m^3.
The median readings of PM 2.5 is 41.00 ug/m^3.
25% of readings of PM 2.5 are smaller than 26.00 ug/m^3.
25% of readings of PM 2.5 are larger than 67.00 ug/m^3.

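[Figure: box plot of PM 2.5 for the filtered data]

Drawing a box plot with seaborn is straightforward:

## horizontal box plot
seaborn.boxplot(x = df_all_cities['PM_US_Post'], showfliers=False);

[Figure: horizontal box plot of PM_US_Post without outliers]

## outliers are shown by default
seaborn.boxplot(x = df_all_cities['PM_US_Post']);

[Figure: horizontal box plot of PM_US_Post with outliers]

## vertical box plot
seaborn.boxplot(y = df_all_cities['PM_US_Post'], showfliers=False);

[Figure: vertical box plot of PM_US_Post without outliers]

seaborn.boxplot usage examples
seaborn official documentation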
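
To see what the box and the outlier points represent, the quartiles can be computed by hand. This is a sketch on made-up readings, assuming the common 1.5 x IQR rule for flagging outliers (which, to my knowledge, matches seaborn's default `whis` setting); the names `pm`, `lower`, and `upper` are introduced here for illustration.

```python
import pandas as pd

# Made-up readings with one obviously extreme value
pm = pd.Series([20, 25, 30, 35, 40, 45, 50, 300])

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = pm.quantile([.25, .5, .75])
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the box are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = pm[(pm < lower) | (pm > upper)]

print(q1, median, q3)
print(outliers.tolist())
```

Passing showfliers=False, as in the calls above, only hides these points in the plot; it does not remove them from the data.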

3.3 Exploratory data analysis and visualization

def univariate_plot(data, key = '', color = 'grey'):
    """
    Plot average PM 2.5 readings, given a feature of interest
    """
    
    # Check if the key exists
    if not key:
        raise Exception("No key has been provided. Make sure you provide a variable on which to plot the data.")
    if key not in data.columns.values :
        raise Exception("'{}' is not a feature of the dataframe. Did you spell something wrong?".format(key))

    # Create plot
    plt.figure(figsize=(8,6))
    data.groupby(key)['PM_US_Post'].mean().plot(kind = 'bar', color = color)
    plt.ylabel('PM 2.5 (ug/m^3)')
    plt.title('Average PM 2.5 Reading by {:s}'.format(key), fontsize =14)
    plt.show()
    return None

## Shanghai data for 2012-2015
df_test = reading_stats(df_all_cities, ["city == 'Shanghai'", "year >= 2012"])

## Group df_test by month and take the mean of the numeric columns
## (numeric_only=True skips the text columns season and city)
df_test.groupby('month').mean(numeric_only=True)
# output
                 No         year        day  hour  PM_US_Post
month
1      31050.500000  2013.500000  16.000000  11.5   80.847251
2      31645.137168  2013.486726  14.628319  11.5   59.084941
3      32472.500000  2013.500000  16.000000  11.5   59.375595
4      33204.500000  2013.500000  15.500000  11.5   55.371645
5      33936.500000  2013.500000  16.000000  11.5   52.226216
6      34668.500000  2013.500000  15.500000  11.5   40.923131
7      35400.500000  2013.500000  16.000000  11.5   32.380491
8      36144.500000  2013.500000  16.000000  11.5   27.385921
9      36876.500000  2013.500000  15.500000  11.5   32.543085
10     37608.500000  2013.500000  16.000000  11.5   42.177273
11     38340.500000  2013.500000  15.500000  11.5   64.351839
12     39072.500000  2013.500000  16.000000  11.5   85.853311

## Select only the PM_US_Post column
df_test.groupby('month')['PM_US_Post'].mean()
# output
month
1     80.847251
2     59.084941
3     59.375595
4     55.371645
5     52.226216
6     40.923131
7     32.380491
8     27.385921
9     32.543085
10    42.177273
11    64.351839
12    85.853311
Name: PM_US_Post, dtype: float64

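## Plot the aggregated data as a bar chart ('bar')
df_test.groupby('month')['PM_US_Post'].mean().plot(kind = 'bar', color = 'steelblue')

[Figure: bar chart of mean PM_US_Post by month]

## Plot the aggregated data as a line chart ('line')
df_test.groupby('month')['PM_US_Post'].mean().plot(kind = 'line', color = 'steelblue')

[Figure: line chart of mean PM_US_Post by month]

pandas.DataFrame.plot reference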
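
The group-then-aggregate step is the core of this section, so here is a sketch of it on a tiny made-up dataframe (the values in `toy` are invented). It also shows `agg()`, which is not used in the project code, as one way to get several statistics per group at once.

```python
import pandas as pd

toy = pd.DataFrame({'month': [1, 1, 2, 2, 2],
                    'season': ['Winter'] * 5,
                    'PM_US_Post': [80.0, 100.0, 50.0, 60.0, 70.0]})

# Select the column first, then aggregate: only PM_US_Post is averaged
monthly_mean = toy.groupby('month')['PM_US_Post'].mean()
print(monthly_mean)

# Several statistics per group at once with agg()
summary = toy.groupby('month')['PM_US_Post'].agg(['mean', 'median', 'count'])
print(summary)
```

Either result can be passed straight to .plot(kind='bar'), because the group keys become the index and therefore the x-axis labels.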

IV. Other data analysis examples

A case study on The Movie Database (TMDb) movie dataset

image.png

Question: which are the 20 most popular movies?

image.png

# Draw a horizontal bar chart of the 20 movies with the highest popularity
df.sort_values('popularity', ascending=True)[-20:].set_index('original_title')['popularity'].plot(
    kind='barh', figsize=(12,8))

plt.title('Top 20 Most Popular Movies')
plt.xlabel('Popularity')
plt.ylabel('Movie Title');

[Figure: horizontal bar chart of the top 20 most popular movies]
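
As an alternative to sorting and slicing, pandas also offers `nlargest`. The sketch below uses a made-up four-row `movies` dataframe in place of the TMDb data; only the column names `original_title` and `popularity` are taken from the code above.

```python
import pandas as pd

movies = pd.DataFrame({'original_title': ['A', 'B', 'C', 'D'],
                       'popularity': [3.2, 9.1, 0.5, 7.4]})

# nlargest returns the rows with the highest values, largest first
top2 = movies.nlargest(2, 'popularity')

# Reverse the order so the most popular title ends up at the top of a barh plot,
# since barh draws the first row at the bottom
ranked = top2.set_index('original_title')['popularity'][::-1]
print(ranked)
```

Calling ranked.plot(kind='barh') would then give the same layout as the sort-and-slice version, with the most popular title on top.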
