2. End-to-End Machine Learning P

Author: 奉先 | Published: 2020-06-21 12:49

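1. Background

This chapter belongs to the first part of the book. It walks through a complete machine learning project to give an overall picture before going into details.
Imagine you are a newly hired data scientist at a real estate company. The project uses the "California Housing Prices" dataset, shown in the figure below: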

California housing prices

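A machine learning project generally breaks down into the following major steps:

Looking at the big picture

2. Look at the Big Picture

The overall goal of the project is to build a model of housing prices in California from California census data. The data includes metrics such as the population, median income, and median housing price for each block group in California (a block group is roughly comparable to a neighborhood committee district in China). Given this data and all the other metrics, the model should predict the median housing price in any district.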

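2.1 Business Objective

Positioning and purpose of the project

As the figure shows, the model takes district data and predicts housing prices. Those prices are then fed into an investment analysis model, which produces investment decisions. In other words, the housing price prediction model is an intermediate stage in a data processing pipeline.
Now to design the system. This is clearly a supervised learning task, and more specifically a regression task. Plain batch learning is enough to solve it.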

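2.2 Performance Measure

The second question is which performance measure to use.
A brief note on Euclidean distance: RMSE corresponds to the Euclidean (l2) norm of the error vector, while MAE corresponds to the l1 norm.
The higher the norm index, the more the measure focuses on large values and neglects small ones. This is why RMSE is more sensitive to outliers than MAE.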
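The effect of an outlier on the two measures can be checked with a small experiment. The following is a minimal sketch, not code from the original post: the helper names `rmse` and `mae` and all the numbers are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: the l2 norm of the error vector, scaled by sqrt(m)
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    # Mean absolute error: the l1 norm of the error vector, scaled by m
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))

# Illustrative values only (not taken from the housing dataset)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_close = np.array([110.0, 190.0, 310.0, 390.0])    # every error has size 10
y_outlier = np.array([110.0, 190.0, 310.0, 500.0])  # last error has size 100

rmse_close, mae_close = rmse(y_true, y_close), mae(y_true, y_close)
rmse_outlier, mae_outlier = rmse(y_true, y_outlier), mae(y_true, y_outlier)

print("no outlier:   RMSE =", rmse_close, " MAE =", mae_close)
print("with outlier: RMSE =", rmse_outlier, " MAE =", mae_outlier)
```

When all errors have the same size, the two measures agree. With a single large error, RMSE grows faster than MAE because squaring gives the large error more weight.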

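3. Get the Data

The project is written in Python 3 and needs the usual packages: numpy, pandas, matplotlib, and so on. The development tool is Jupyter. Create a working directory anywhere you like; all the code that follows is created and run in Jupyter from that directory.

First, write a function that downloads the experiment data from a remote location and extracts it: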

import os
import tarfile
import urllib.request   # "import urllib" alone does not load the request submodule

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
# Local directory that receives the data
HOUSING_PATH = os.path.join("datasets", "housing")
# Remote download URL
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Unlike os.mkdir, os.makedirs creates intermediate directories (like mkdir -p).
    # With exist_ok=False (the default) it raises FileExistsError if the target directory already exists.
    os.makedirs(housing_path, exist_ok=True)
    # datasets/housing/housing.tgz
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # Download the archive
    urllib.request.urlretrieve(housing_url, tgz_path)
    # Extract the archive; the with-block closes the file afterwards
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()

After loading the dataset with pandas, the following methods are commonly used to get a quick overview of the data:

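import pandas as pd
import matplotlib.pyplot as plt

housing_data = pd.read_csv("datasets/housing/housing.csv")
# Show the first 5 rows
housing_data.head()

# ocean_proximity is a categorical attribute; value_counts() returns the number of rows per category, similar to a GROUP BY count
housing_data["ocean_proximity"].value_counts()

# Column-level basic information (dtype, non-null count)
housing_data.info()

# Column-level summary statistics (mean, min, median, max, ...)
housing_data.describe()

# Tell Jupyter to render matplotlib figures inline using its own backend
%matplotlib inline
# Calling hist() on the whole dataset plots one histogram per numeric attribute
# bins: if an integer, it is the number of bars, and each bar's range is derived from the data range:
#       bar width = (x.max() - x.min()) / bins
#       if a sequence, it gives the bin edges directly
# figsize: size of the whole figure in inches, as (width, height)
housing_data.hist(bins=50, figsize=(20,15))
plt.show()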

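Histograms plotted with Matplotlib

Next, create a test set, typically 20% of the full dataset (less if the dataset is very large). Holding out a test set makes it possible to detect whether the model has overfit the training set. The code below shows two ways to pick the test data from the full dataset: random selection and a stable, hash-based selection: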

import numpy as np
# The zlib module provides compression and decompression functions. crc32 computes the
# CRC (cyclic redundancy check) of the data; the result is a 32-bit integer.
# To get the same value across all Python versions and platforms, use crc32(data) & 0xffffffff
from zlib import crc32

# test_ratio is in [0, 1]: the fraction of the full dataset that goes to the test set.
# This approach picks rows at random, so each run yields a different test set.
def random_split_train_test(data, test_ratio):
    # np.random.permutation: given an integer n, returns 0..n-1 in random order.
    #                        given a 1-D sequence, returns a shuffled copy of it.
    #                        given a multi-dimensional array, shuffles along the first axis only.
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_data_seq = shuffled_indices[:test_set_size]     # first test_ratio fraction
    train_data_seq = shuffled_indices[test_set_size:]    # the remainder
    test_data_set = data.iloc[test_data_seq]
    train_data_set = data.iloc[train_data_seq]
    # Return (train, test), the same order as the other split functions below
    return train_data_set, test_data_set

# Hash each row's ID (here, the row number) to decide whether the row belongs to the test set
# or the training set. Repeated runs then always produce the same test set.
def test_set_check(identifier, test_ratio=0.5):
    # True if the row belongs to the test set, False otherwise
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Add an "index" column to the DataFrame whose value is the row number
housing_with_id = housing_data.reset_index()
fix_train_set, fix_test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
len(housing_data), len(fix_train_set), len(fix_test_set)

# Scikit-Learn's built-in splitter, similar to random_split_train_test above
from sklearn.model_selection import train_test_split

# test_size is the fraction of the full dataset used for testing; random_state is the random seed
sk_train_set, sk_test_set = train_test_split(housing_data, test_size=0.2, random_state=42)
len(housing_data), len(sk_train_set), len(sk_test_set)
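One property of the hash-based split is worth checking: when new rows are appended to the dataset, rows that were already in the test set stay there, and no old training row moves into it. The sketch below illustrates this with plain integer IDs instead of the housing data. The helper name `in_test_set` and the ID ranges are assumptions made for the example.

```python
from zlib import crc32
import numpy as np

def in_test_set(identifier, test_ratio):
    # Hash the ID to a 32-bit integer; IDs whose hash falls in the lowest
    # test_ratio fraction of the 32-bit range are assigned to the test set.
    return (crc32(np.int64(identifier)) & 0xffffffff) < test_ratio * 2**32

old_ids = np.arange(1000)   # IDs of the original dataset
new_ids = np.arange(1500)   # the same IDs plus 500 newly appended rows

old_test = {int(i) for i in old_ids if in_test_set(i, 0.2)}
new_test = {int(i) for i in new_ids if in_test_set(i, 0.2)}

# Every row that was in the test set before is still in it
still_in_test = old_test <= new_test
# Old rows that entered the test set only after the update (should be none)
moved = {i for i in new_test if i < 1000} - old_test

print(len(old_test), len(new_test), still_in_test, len(moved))
```

The assignment depends only on the ID itself, so this works only if IDs are stable: new rows must be appended at the end and existing rows must never be deleted or reordered.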

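Sampling is often not that simple. For example, in a population study where the overall male/female ratio is 53% to 47%, the sample should preserve that ratio as closely as possible; otherwise the sample is likely to be biased. This is called stratified sampling.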

# Bucket median_income into ranges and store the result in a new column, income_cat, with values 1-5
housing_data["income_cat"] = pd.cut(housing_data["median_income"],
                                    bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
housing_data["income_cat"].hist()

# Stratified sampling with Scikit-Learn
from sklearn.model_selection import StratifiedShuffleSplit
# n_splits is the number of train/test pairs to generate (default 10); a single pair is enough here
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing_data, housing_data["income_cat"]):
    strat_train_set = housing_data.loc[train_index]
    strat_test_set = housing_data.loc[test_index]
# Category proportions in the stratified test set
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

# Drop income_cat from the stratified sets. The column was added above only to stratify on
# and to check that the sampled proportions match the full dataset; it has no other use.
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

strat_test_set.head()
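To see what stratification buys, compare the category proportions of a purely random test set and a stratified one against the full dataset. The sketch below uses synthetic categories rather than the housing data, so the probabilities and sizes are assumptions. It also uses the `stratify` parameter of `train_test_split`, which is a shorter alternative to `StratifiedShuffleSplit` for a single split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic stand-in for income_cat: five categories with uneven frequencies
categories = rng.choice([1, 2, 3, 4, 5], size=10_000,
                        p=[0.04, 0.32, 0.35, 0.18, 0.11])
data = pd.DataFrame({"income_cat": categories})

# Purely random split versus a split stratified on income_cat
_, random_test = train_test_split(data, test_size=0.2, random_state=42)
_, strat_test = train_test_split(data, test_size=0.2, random_state=42,
                                 stratify=data["income_cat"])

def proportions(frame):
    # Share of each category, ordered by category label
    return frame["income_cat"].value_counts(normalize=True).sort_index()

compare = pd.DataFrame({
    "overall": proportions(data),
    "random": proportions(random_test),
    "stratified": proportions(strat_test),
})
# Relative deviation from the overall proportions, in percent
compare["random_err_%"] = 100 * (compare["random"] / compare["overall"] - 1)
compare["strat_err_%"] = 100 * (compare["stratified"] / compare["overall"] - 1)
print(compare)
```

The stratified column should track the overall proportions almost exactly, while the random column typically drifts, most visibly for the rarest category.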

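4. Explore and Visualize the Data

4.1 Visualizing Geographic Data (Latitude and Longitude)

So far we have only glanced at the data. This section takes a closer look at the training set. First, plot the data as a scatter plot based on its geographic information: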

# Work on a copy of the training set
housing_train_copy = strat_train_set.copy()
housing_train_copy.head()

# Simple scatter plot straight from the DataFrame
housing_train_copy.plot(kind="scatter", y="latitude", x="longitude", alpha=0.1)

# Plot with more parameters; see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html
housing_train_copy.plot(
        kind="scatter",
        x="longitude",
        y="latitude",
        alpha=0.4,
        s=housing_train_copy["population"]/100,   # marker size (area) scales with the population column
        label="population",
        figsize=(10,7),   # size of the figure
        c="median_house_value",   # marker color encodes the house value
        cmap=plt.get_cmap("jet"),
        colorbar=True
)
plt.legend()

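Scatter plot by latitude and longitude

Scatter plot of latitude and longitude against house value

4.2 Correlations Between Numeric Attributes

The dataset is not large, so the correlation matrix can be computed directly with the pandas corr method.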
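As a minimal sketch of that call, the example below computes the correlation matrix for a tiny made-up table with housing-style column names (the values are invented, not taken from the dataset). Text columns are excluded first, since correlation is only defined for numeric data.

```python
import pandas as pd

# Tiny invented sample using the dataset's column names
sample = pd.DataFrame({
    "median_income": [1.5, 2.0, 3.5, 5.0, 8.0],
    "housing_median_age": [40, 35, 20, 25, 10],
    "median_house_value": [80_000, 110_000, 190_000, 260_000, 450_000],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR OCEAN", "NEAR BAY"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr_matrix = sample.select_dtypes(include="number").corr()

# How strongly each attribute correlates with the target, strongest first
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```

On the real training set the same two lines apply to `housing_train_copy`. Values close to 1 or -1 indicate a strong linear relationship; values near 0 indicate no linear relationship, although a nonlinear one may still exist.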

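5. Prepare the Data for Machine Learning Algorithms

It is best to write the data preparation steps as functions, so they can be reused later or applied quickly to a new dataset.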
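The post does not show its preparation code, so the following is only a sketch of what such a function might look like. The function name `prepare_housing`, the choice of median imputation, and one-hot encoding with `pd.get_dummies` are all assumptions made for illustration.

```python
import pandas as pd

def prepare_housing(df, label_column="median_house_value"):
    """Hypothetical helper: separate the labels, fill gaps, encode text columns."""
    labels = df[label_column].copy()
    features = df.drop(columns=[label_column])

    # Fill missing numeric values with each column's median
    numeric_cols = features.select_dtypes(include="number").columns
    features[numeric_cols] = features[numeric_cols].fillna(
        features[numeric_cols].median()
    )

    # One-hot encode the remaining text columns (e.g. ocean_proximity)
    features = pd.get_dummies(features)
    return features, labels

# Tiny invented sample with one missing value
sample = pd.DataFrame({
    "total_bedrooms": [100.0, None, 300.0],
    "median_income": [2.0, 3.0, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
    "median_house_value": [100_000, 200_000, 150_000],
})
X, y = prepare_housing(sample)
print(X)
```

One caveat with this simple version: it computes the medians from whatever frame it receives. In a real project the medians and the set of categories should be learned from the training set and then reused unchanged on the test set.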
