2018-8-11

Author: katemiao | Published 2018-08-11 19:39

1. How to Learn Data Analysis

Over the past two days I have talked with every student, and everyone has plenty of ideas:
1) Some students would like learning methods and ideas, or reference material, so they can follow a map and self-study effectively;
2) Others care more about Python itself and want to master it and use it for more concrete analyses.

To sum up, the requests fall into two areas: the analytical approach on one side and the use of tools on the other. These are really two sides of the same thing: an analytical approach has to be carried out with concrete tools, and the tools only show their value when guided by an analytical approach.

The analysis tools are not limited to Python (the introductory course focuses mainly on Python);
other common ones are R, SPSS, SAS, Tableau, and so on (the advanced course focuses mainly on R and Tableau).

1.1 Learning Through Projects

The most effective way to learn is to work on both aspects through concrete projects:

get a feel for analysis approaches and methods through practice, and master the details of Python (numpy, pandas, matplotlib) through practice.

Project sources:
1) Udacity Nanodegree projects
2) Kaggle (English)
3) Kesci (Chinese)
4) Competitions of all kinds (Alibaba, JD, Kaggle, Kesci, etc.)

1.2 Learning the Analytical Approach

1) Find a dataset yourself, pose questions, and answer them (this works best)
2) Read other people's analysis reports (Kaggle and Kesci both host professional reports)
3) Recommended reading:

Lean Analytics (《精益数据分析》)

1.3 Learning Python

1) Python fundamentals

For the Python fundamentals, any reputable online resource will do: Udacity (the official course covers Python in depth) or other MOOCs.
(I worked through the basics myself with this tutorial: How to Think Like a Computer Scientist.)

2) The scientific computing packages (numpy, pandas, matplotlib)

Pick up their usage gradually through project work, learn to read the reference documentation, and make good use of search engines (Stack Overflow).

2. Knowledge Points Review

2.1 Using Jupyter

Kernels available in Jupyter: Python, Julia, R, Go, Matlab, SAS, and so on.

Command mode:

  • Insert a cell: A inserts a cell above the current one; B inserts a cell below it
  • Copy a cell: C
  • Paste a cell: V
  • Cut a cell: X (also works as a shortcut for deleting a cell)

Edit mode:

  • Run the current cell: control + return
    (shift + return and command + return also work; pick one and stick with it)
  • Run all cells:
    Kernel => Restart & Run All
    Cell => Run All / Run All Above / Run All Below
  • Code completion: the Tab key
  • View documentation: shift + tab (see the note below)
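
Tab completion and shift + tab are notebook shortcuts, but the same documentation is also available from ordinary Python code via help(); the cell below is a minimal sketch, using pd.read_csv purely as an arbitrary example of a function to inspect:

## viewing documentation from a code cell
import pandas as pd

help(pd.read_csv)      # plain Python: prints the full docstring
# pd.read_csv?         # IPython/Jupyter equivalent: shows the signature and docstring in a help pane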

2.2 Python Basics

1) Basic data types:
Number (int, float, complex) - numeric types
String - text strings

## integer
a = 10
## float
b = 5.1
## complex number
c = 3 + 5j
c = complex(3, 5)
print(c)
# output
(3+5j)

## string
test = 'hello'
test[0]
# output
'h'

test.upper()
# output
'HELLO'

2) Compound data types
List
Tuple
Set
Dictionary

my_list1 = ['hello', 'data', 10, 20]    # a list can mix element types

my_list2 = list('abcd')                 # list() builds a list from any iterable
print(my_list2)
# output
['a', 'b', 'c', 'd']
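
The example above only covers lists; for completeness, a quick sketch of the other three compound types listed:

## tuple: ordered and immutable
my_tuple = ('hello', 'data', 10, 20)
print(my_tuple[0])                  # hello

## set: unordered, duplicates removed
my_set = set('abcda')
print(my_set)                       # e.g. {'a', 'b', 'c', 'd'} (order not guaranteed)

## dictionary: key -> value mapping
my_dict = {'name': 'zhao', 'score': 90}
print(my_dict['name'])              # zhao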

3) for loops

for loop: any iterable object can be traversed with a for loop, e.g. strings, lists, tuples, sets, dictionaries.

for i in 'abcd':
    print(i)
# output
a
b
c
d

for i in ['zhao','qian','sun','li']:
    print(i)
# output
zhao
qian
sun
li

for a, b in enumerate(['zhao','qian','sun','li']):
    print(a, b)
# output
0 zhao
1 qian
2 sun
3 li

enumerate(iterable, start=0)
Return an enumerate object. iterable must be a sequence, an iterator, or some other object which supports iteration.

Built-in function: enumerate()
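
Since start defaults to 0 in the signature above, passing a different start is the easiest way to get 1-based numbering; a minimal sketch:

## enumerate with a custom starting index
for a, b in enumerate(['zhao', 'qian', 'sun', 'li'], start=1):
    print(a, b)
# output
1 zhao
2 qian
3 sun
4 li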

4) while loops
A while loop keeps executing the loop body as long as the loop condition holds.

## infinite loop (interrupt the kernel or press Ctrl+C to stop it)
while True:
    print(1)

## loop guarded by a counter
flag = 0
while flag < 10:
    print(flag)
    flag += 1

Summary: for loops suit cases where the object being looped over or the number of iterations is known; while loops suit cases where only the stopping condition is known (see the sketch below).
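
As a minimal sketch of that distinction, the same sum over 0-9 can be written both ways:

## for: the range of iteration is known up front
total = 0
for i in range(10):
    total += i
print(total)
# output
45

## while: only the stopping condition is known
total = 0
i = 0
while i < 10:
    total += i
    i += 1
print(total)
# output
45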

5) List comprehensions
A list comprehension is closely related to a for loop; it is syntactic sugar for building a list.

## with a for loop
names = ['zhao','qian','sun','li']
tmp = []
for name in names:
    tmp.append(name.upper())
    
print(tmp)
# output
['ZHAO', 'QIAN', 'SUN', 'LI']


## the equivalent simple list comprehension
[name.upper() for name in names]
# output
['ZHAO', 'QIAN', 'SUN', 'LI']


## a list comprehension with an if clause
[i for i in range(10) if i%2 == 0]
# output
[0, 2, 4, 6, 8]

3. Project Walkthrough

The overall approach of the project is very clear and uses only two kinds of plot: the boxplot and the bar chart. The boxplot is used only to summarize a single variable (PM), while the bar chart summarizes a two-variable relationship (PM vs. time).

3.1 Data Wrangling

1) Keep the PM_US Post column (and a few other columns)
2) Add a city column
3) Recode season
4) Concatenate the data of the five cities

import pandas as pd

files = ['BeijingPM20100101_20151231.csv',
         'ChengduPM20100101_20151231.csv',
         'GuangzhouPM20100101_20151231.csv',
         'ShanghaiPM20100101_20151231.csv',
         'ShenyangPM20100101_20151231.csv']

out_columns = ['No', 'year', 'month', 'day', 'hour', 'season', 'PM_US Post']

# create an empty dataframe
df_all_cities = pd.DataFrame()

# iterate over the different city files
for inx, val in enumerate(files):
    df = pd.read_csv(val)
    df = df[out_columns]
    # create a city column from the file name
    df['city'] = val.split('P')[0]
    # map the numeric season codes to names
    df['season'] = df['season'].map({1: 'Spring', 2: 'Summer', 3: 'Autumn', 4: 'Winter'})
    # append each file and merge all files into one
    df_all_cities = df_all_cities.append(df)

# replace the space in variable names with '_'
df_all_cities.columns = [c.replace(' ', '_') for c in df_all_cities.columns]
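
Note that DataFrame.append was deprecated and removed in pandas 2.0; on a recent pandas the same merge can be written with pd.concat. A minimal sketch under that assumption, reusing files and out_columns from above:

## pandas >= 2.0: collect the per-city frames in a list and concatenate once
frames = []
for val in files:
    df = pd.read_csv(val)[out_columns]
    df['city'] = val.split('P')[0]
    df['season'] = df['season'].map({1: 'Spring', 2: 'Summer', 3: 'Autumn', 4: 'Winter'})
    frames.append(df)

df_all_cities = pd.concat(frames, ignore_index=True)
df_all_cities.columns = [c.replace(' ', '_') for c in df_all_cities.columns]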

3.2 Filtering the Data

1) Filter on conditions: filter_data()
2) Filter on conditions and report summary statistics: reading_stats()

1) Filtering on conditions

def filter_data(data, condition):
    """
    Remove elements that do not match the condition provided.
    Takes a data list as input and returns a filtered list.
    Conditions should be a list of strings of the following format:
      '<field> <op> <value>'
    where the following operations are valid: >, <, >=, <=, ==, !=
    
    Example: ["duration < 15", "start_city == 'San Francisco'"]
    """

    # Only want to split on first two spaces separating field from operator and
    # operator from value: spaces within value should be retained.
    field, op, value = condition.split(" ", 2)
    
    # check if field is valid
    if field not in data.columns.values :
        raise Exception("'{}' is not a feature of the dataframe. Did you spell something wrong?".format(field))

    # convert value into number or strip excess quotes if string
    try:
        value = float(value)
    except ValueError:
        value = value.strip("\'\"")

    # get booleans for filtering
    if op == ">":
        matches = data[field] > value
    elif op == "<":
        matches = data[field] < value
    elif op == ">=":
        matches = data[field] >= value
    elif op == "<=":
        matches = data[field] <= value
    elif op == "==":
        matches = data[field] == value
    elif op == "!=":
        matches = data[field] != value
    else: # catch invalid operation codes
        raise Exception("Invalid comparison operator. Only >, <, >=, <=, ==, != allowed.")
    
    # filter data and outcomes
    data = data[matches].reset_index(drop = True)
    return data
## example usage
filter_data(df_all_cities, "city == 'Shanghai'").head()
# output
    No year month day  hour season  PM_US_Post  city
0   1   2010    1   1   0   Winter  NaN Shanghai
1   2   2010    1   1   1   Winter  NaN Shanghai
2   3   2010    1   1   2   Winter  NaN Shanghai
3   4   2010    1   1   3   Winter  NaN Shanghai
4   5   2010    1   1   4   Winter  NaN Shanghai

## example usage
filter_data(df_all_cities, "year >= 2012").head()
# output
    No  year    month   day hour    season  PM_US_Post  city
0   17521   2012    1   1   0   Winter  303.0   Beijing
1   17522   2012    1   1   1   Winter  215.0   Beijing
2   17523   2012    1   1   2   Winter  222.0   Beijing
3   17524   2012    1   1   3   Winter  85.0    Beijing
4   17525   2012    1   1   4   Winter  38.0    Beijing

How filtering works in pandas:

## produces a boolean pandas.Series
(df_all_cities['season'] == 'Spring').head()
# output
0    False
1    False
2    False
3    False
4    False
Name: season, dtype: bool


## use the boolean Series above to filter the rows
df_all_cities[df_all_cities['season'] == 'Spring'].head()
# output
   No   year    month   day hour    season  PM_US_Post  city
1416    1417    2010    3   1   0   Spring  59.0    Beijing
1417    1418    2010    3   1   1   Spring  42.0    Beijing
1418    1419    2010    3   1   2   Spring  35.0    Beijing
1419    1420    2010    3   1   3   Spring  35.0    Beijing
1420    1421    2010    3   1   4   Spring  29.0    Beijing

Summary: one way to filter data in pandas is the [] operator (as in the examples above); the other is df.query(), which is the method used more often in the course (see the sketch below).
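
As a minimal sketch of the second form (assuming df_all_cities has been built as above), the same filters can be written with df.query():

## equivalent filter with df.query(); column names are referenced directly inside the expression
df_all_cities.query("season == 'Spring'").head()

## several conditions can be combined in one expression
df_all_cities.query("city == 'Shanghai' and year >= 2012").head()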

2) Returning the filtered dataset and reporting summary statistics

import seaborn
import matplotlib.pyplot as plt

def reading_stats(data, filters = [], verbose = True):
    """
    Report number of readings and average PM2.5 readings for data points that meet
    specified filtering criteria.
    """

    n_data_all = data.shape[0]

    # Apply filters to data
    for condition in filters:
        ## apply filter_data() once for every condition
        data = filter_data(data, condition)

    # Compute number of data points that met the filter criteria.
    n_data = data.shape[0]

    # Compute statistics for PM 2.5 readings.
    pm_mean = data['PM_US_Post'].mean()
    pm_qtiles = data['PM_US_Post'].quantile([.25, .5, .75]).to_numpy()

    # Report computed statistics if verbosity is set to True (default).
    if verbose:
        if filters:
            print('There are {:d} readings ({:.2f}%) matching the filter criteria.'.format(n_data, 100. * n_data / n_data_all))
        else:
            print('There are {:d} readings in the dataset.'.format(n_data))

        print('The average reading of PM 2.5 is {:.2f} ug/m^3.'.format(pm_mean))
        print('The median reading of PM 2.5 is {:.2f} ug/m^3.'.format(pm_qtiles[1]))
        print('25% of readings of PM 2.5 are smaller than {:.2f} ug/m^3.'.format(pm_qtiles[0]))
        print('25% of readings of PM 2.5 are larger than {:.2f} ug/m^3.'.format(pm_qtiles[2]))
        seaborn.boxplot(x = data['PM_US_Post'], showfliers=False)
        plt.title('Boxplot of PM 2.5 of filtered data')
        plt.xlabel('PM_US Post (ug/m^3)')

    # Return the filtered dataframe
    return data
## example usage
df_test = reading_stats(df_all_cities, ["city == 'Shanghai'", "year >= 2012"])

There are 35064 readings (13.34%) matching the filter criteria.
The average reading of PM 2.5 is 52.88 ug/m^3.
The median reading of PM 2.5 is 41.00 ug/m^3.
25% of readings of PM 2.5 are smaller than 26.00 ug/m^3.
25% of readings of PM 2.5 are larger than 67.00 ug/m^3.


[Figure: boxplot of PM 2.5 of the filtered data]

Drawing a boxplot with seaborn is very simple:

## horizontal boxplot, outliers hidden
seaborn.boxplot(x = df_all_cities['PM_US_Post'], showfliers=False);
[Figure: horizontal boxplot of PM_US_Post without outliers]
## outliers are shown by default
seaborn.boxplot(x = df_all_cities['PM_US_Post']);
[Figure: horizontal boxplot of PM_US_Post with outliers]
## vertical boxplot
seaborn.boxplot(y = df_all_cities['PM_US_Post'], showfliers=False);
[Figure: vertical boxplot of PM_US_Post without outliers]

seaborn.boxplot usage examples
seaborn official documentation

3.3 Exploratory Data Analysis and Visualization

def univariate_plot(data, key = '', color = 'grey'):
    """
    Plot average PM 2.5 readings, given a feature of interest
    """
    
    # Check if the key exists
    if not key:
        raise Exception("No key has been provided. Make sure you provide a variable on which to plot the data.")
    if key not in data.columns.values :
        raise Exception("'{}' is not a feature of the dataframe. Did you spell something wrong?".format(key))

    # Create plot
    plt.figure(figsize=(8,6))
    data.groupby(key)['PM_US_Post'].mean().plot(kind = 'bar', color = color)
    plt.ylabel('PM 2.5 (ug/m^3)')
    plt.title('Average PM 2.5 Reading by {:s}'.format(key), fontsize =14)
    plt.show()
    return None
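
For example, the helper above could be called directly on the merged dataframe; a small usage sketch, purely to illustrate the interface (it is not called in the snippets below, which use groupby and plot directly):

## average PM 2.5 by season across all five cities
univariate_plot(df_all_cities, key = 'season', color = 'steelblue')
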
## Shanghai data from 2012 to 2015
df_test = reading_stats(df_all_cities, ["city == 'Shanghai'", "year >= 2012"])

## group df_test by month and take the mean of each column
df_test.groupby('month').mean()
# output
    No  year    day hour    PM_US_Post
month                   
1   31050.500000    2013.500000 16.000000   11.5    80.847251
2   31645.137168    2013.486726 14.628319   11.5    59.084941
3   32472.500000    2013.500000 16.000000   11.5    59.375595
4   33204.500000    2013.500000 15.500000   11.5    55.371645
5   33936.500000    2013.500000 16.000000   11.5    52.226216
6   34668.500000    2013.500000 15.500000   11.5    40.923131
7   35400.500000    2013.500000 16.000000   11.5    32.380491
8   36144.500000    2013.500000 16.000000   11.5    27.385921
9   36876.500000    2013.500000 15.500000   11.5    32.543085
10  37608.500000    2013.500000 16.000000   11.5    42.177273
11  38340.500000    2013.500000 15.500000   11.5    64.351839
12  39072.500000    2013.500000 16.000000   11.5    85.853311

## select only the PM_US_Post column
df_test.groupby('month').mean()['PM_US_Post']
# output
month
1     80.847251
2     59.084941
3     59.375595
4     55.371645
5     52.226216
6     40.923131
7     32.380491
8     27.385921
9     32.543085
10    42.177273
11    64.351839
12    85.853311
Name: PM_US_Post, dtype: float64

## after grouping, plot the result as a bar chart ('bar')
df_test.groupby('month').mean()['PM_US_Post'].plot(kind = 'bar', color = 'steelblue')
[Figure: bar chart of average monthly PM 2.5]
## after grouping, plot the result as a line chart ('line')
df_test.groupby('month').mean()['PM_US_Post'].plot(kind = 'line', color = 'steelblue')
[Figure: line chart of average monthly PM 2.5]

pandas.DataFrame.plot reference

4. Other Data Analysis Examples

An analysis of The Movie Database (TMDb) movie dataset


Question: what are the top 20 most popular movies?


# plot a horizontal bar chart of the 20 movies with the highest popularity
df.sort_values('popularity', ascending=True)[-20:].set_index('original_title')['popularity'].plot(
    kind='barh', figsize=(12,8))

plt.title('Top 20 Most Popular Movies')
plt.xlabel('Popularity')
plt.ylabel('Movie Title');
[Figure: horizontal bar chart of the top 20 most popular movies]
