100 Days of ML Code Guide - Day 1: Data Preprocessing

Author: TJH_KYC | Published 2018-09-13 22:57

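    Preface

    • Avik Jain started a project on GitHub called 100-Days-Of-ML-Code. Being simple, approachable, and systematic, it quickly became hugely popular.
    • A struggling student was thrilled at the sight: "This was made for me!" He rushed off to study it, only to find that, short as the code was, there was still plenty he could not follow.
    • The top student looked at him with pity and sighed: "You can't follow even this? Fine, let me walk you through it in detail."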

    Preview: infographic (Chinese)

    Day 1.jpg

    Walkthrough

    Step 1: Import the libraries

    # 1.Importing the required libraries
    import numpy as np
    import pandas as pd
    
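    • Import the two standard libraries; the `as` part defines a short alias for use in later code.
    • NumPy is an extension package for numerical computing; pandas handles data processing.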

    Step 2: Import the dataset

    # 2.Importing the Dataset
    dataset = pd.read_csv("https://raw.githubusercontent.com/Avik-Jain/100-Days-Of-ML-Code/master/datasets/Data.csv")
    dataset.head()
    # type(dataset.iloc[:,:-1])
    # type(dataset.iloc[:,:-1].values)
    X = dataset.iloc[:,:-1].values
    y = dataset.iloc[:,3].values
    
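    • pd.read_csv() loads the dataset, from either a local path or a URL.
    • dataset.head() shows the first 5 rows of the dataset.
    • The two commented-out type() calls show that dataset.iloc[:,:-1] is a DataFrame, while dataset.iloc[:,:-1].values is an ndarray.
    • The values of all rows and every column except the last go into the matrix X; the values of all rows of the last column (index 3) go into the vector y.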
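
The difference between a DataFrame slice and its `.values` can be checked on a tiny table. The sketch below uses a made-up four-column frame; the column names and values are assumptions for illustration, not the actual contents of Data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns plus a label column.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, np.nan],
    "Salary": [72000.0, np.nan, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1]      # all rows, every column except the last
X_toy = features.values          # the same data as a NumPy array
y_toy = toy.iloc[:, -1].values   # all rows, last column only

print(type(features).__name__)
print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)
```

With four columns, `iloc[:, 3]` and `iloc[:, -1]` select the same column, so either form yields the label vector.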

    Step 3: Handle missing data

    # 3.Handling the missing data
    # Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(missing_values=np.nan,strategy="mean")
    Z = imputer.fit(X[:,1:3])
    Z.statistics_
    X[:,1:3] = Z.transform(X[:,1:3])
    # X[:,1:3] = imputer.fit_transform(X[:,1:3])
    # print(X[:,1:3])
    
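    • The four-step scikit-learn preprocessing pattern, IIFT, applied to this example:
      I for Importing: import a class, here Imputer (replaced by SimpleImputer in scikit-learn 0.22 and later);
      I for Instantiating: create an instance of the class, here named imputer;
      F for Fitting: feed data to the instance, which computes statistics from it (here the column means);
      T for Transforming: apply those statistics to the data (here filling each missing value with its column mean).
    • F and T can be done in a single step; see the commented-out fit_transform line.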
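
The four steps can be traced on a small array. This is a minimal sketch on made-up numbers, and it assumes a recent scikit-learn, where `SimpleImputer` from `sklearn.impute` takes the place of the older `Imputer` class.

```python
import numpy as np
from sklearn.impute import SimpleImputer   # I: import the class

# Made-up numeric columns with one missing value each.
data = np.array([[1.0, 10.0],
                 [np.nan, 20.0],
                 [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")  # I: instantiate
imputer.fit(data)                    # F: learn the column means, ignoring NaN
print(imputer.statistics_)           # the fitted column means
filled = imputer.transform(data)     # T: replace each NaN with its column mean
print(filled)

# fit_transform performs F and T in one call and gives the same result.
filled_one_step = SimpleImputer(strategy="mean").fit_transform(data)
print(np.array_equal(filled, filled_one_step))
```

Keeping fit and transform separate is what later allows the same fitted statistics to be reused on new data.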

    Step 4: Encode categorical data and create dummy variables

    # 4.Encoding categorical data
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    labelencoder_X = LabelEncoder()
    X[:,0] = labelencoder_X.fit_transform(X[:,0])
    print(X[:,0])
    labelencoder_y = LabelEncoder()
    y = labelencoder_y.fit_transform(y)
    print(y)
    
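    # Creating a dummy variable
    # categorical_features was removed in scikit-learn 0.22, so column 0 is encoded on its own
    onehotencoder = OneHotEncoder()
    country = onehotencoder.fit_transform(X[:,[0]])
    type(X)
    type(country)
    # Put the dense dummy columns first, followed by the remaining numeric columns
    X = np.hstack([country.toarray(), X[:,1:]]).astype(float)
    type(X)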
    
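    • LabelEncoder converts the string values of a categorical variable into integers; OneHotEncoder one-hot encodes them into dummy variables.
    • In scikit-learn versions before 0.20, OneHotEncoder could not encode string columns directly, so LabelEncoder first converts them to integers; newer versions accept strings directly.
    • toarray() converts the sparse matrix returned by OneHotEncoder into an ndarray. The three type() calls show this: ndarray, a SciPy sparse matrix (coo_matrix in the original run), then ndarray.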
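
The two encoders can be compared on a single made-up category column. The sketch below is illustrative only; the country names are assumed sample values.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up category column.
countries = np.array(["France", "Spain", "Germany", "Spain"])

# LabelEncoder maps each distinct string to an integer (classes sorted alphabetically).
codes = LabelEncoder().fit_transform(countries)
print(codes)

# OneHotEncoder expects 2-D input and returns a SciPy sparse matrix by default.
encoded = OneHotEncoder().fit_transform(codes.reshape(-1, 1))
print(sparse.issparse(encoded))

# toarray() turns the sparse result into a dense ndarray of 0/1 columns.
dummies = encoded.toarray()
print(dummies)
```

The integer codes imply an ordering (Spain > Germany > France) that has no meaning for country names, which is why the feature column is expanded into one 0/1 column per category.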

    Step 5: Split the dataset into a training set and a test set

    # 5.Splitting the dataset into test set and training set
    from sklearn.model_selection import train_test_split
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
    
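    • Note the arguments passed to train_test_split() and the order of the variables on the left of the equals sign: first for X and then for y, the training set comes before the test set.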
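
The return order and the effect of test_size can be confirmed on made-up data. In this sketch the arrays `X_demo` and `y_demo` are hypothetical; only the call itself mirrors the tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, and matching labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The return order is fixed: X train, X test, y train, y test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)   # (8, 2) (2, 2)
print(y_tr.shape, y_te.shape)   # (8,) (2,)
```

Rows of X and y are shuffled together, so each feature row stays paired with its own label, and a fixed random_state makes the split repeatable.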

    Step 6: Feature scaling

    # 6.Feature scaling
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.transform(X_test)
    print(X_train);print(X_test)
    
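    • This is again the IIFT pattern from sklearn.preprocessing.
    • fit_transform is used on the training set and only transform on the test set because the scaler must learn its mean and standard deviation from the training data alone; the test set is then scaled with those same statistics, so no information from it leaks into preprocessing.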
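
A one-feature example makes the fit-versus-transform distinction visible. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature training and test sets.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean and std from train only
test_scaled = scaler.transform(test)         # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 2.0 maps to 0 because it equals the training mean, not because of anything computed from the test set itself.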

    Summary

    • numpy, pandas, and sklearn are libraries.
    • sklearn.preprocessing and sklearn.model_selection are modules within the sklearn library.
    • Imputer(), LabelEncoder(), OneHotEncoder(), and StandardScaler() are classes, which are instantiated before use; train_test_split() is a plain function, not a method.
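
The distinction between modules, classes, functions, and methods can be checked with the standard library's `inspect` module. This is a small sketch, assuming a current scikit-learn install; each call is expected to print True.

```python
import inspect

import sklearn.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

print(inspect.ismodule(sklearn.preprocessing))   # a module inside the sklearn library
print(inspect.isclass(StandardScaler))           # a class: instantiate it before use
print(inspect.isfunction(train_test_split))      # a plain function: call it directly

scaler = StandardScaler()                        # an instance of the class
print(inspect.ismethod(scaler.fit_transform))    # a method is bound to an instance
```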

    Review

    Finally, here is the complete code, along with the English infographic, for review:

    # -*- coding: utf-8 -*-
    """
    Created on Thu Sep 13 20:33:51 2018
    
    @author: wongz
    """
    # 1.Importing the required libraries
    import numpy as np
    import pandas as pd
    
    # 2.Importing the Dataset
    dataset = pd.read_csv("https://raw.githubusercontent.com/Avik-Jain/100-Days-Of-ML-Code/master/datasets/Data.csv")
    dataset.head()
    # type(dataset.iloc[:,:-1])
    # type(dataset.iloc[:,:-1].values)
    X = dataset.iloc[:,:-1].values
    y = dataset.iloc[:,3].values
    
    # 3.Handling the missing data
    # Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(missing_values=np.nan,strategy="mean")
    Z = imputer.fit(X[:,1:3])
    Z.statistics_
    X[:,1:3] = Z.transform(X[:,1:3])
    # X[:,1:3] = imputer.fit_transform(X[:,1:3])
    # print(X[:,1:3])
    
    # 4.Encoding categorical data
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    labelencoder_X = LabelEncoder()
    X[:,0] = labelencoder_X.fit_transform(X[:,0])
    print(X[:,0])
    labelencoder_y = LabelEncoder()
    y = labelencoder_y.fit_transform(y)
    print(y)
    
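    # Creating a dummy variable
    # categorical_features was removed in scikit-learn 0.22, so column 0 is encoded on its own
    onehotencoder = OneHotEncoder()
    country = onehotencoder.fit_transform(X[:,[0]])
    type(X)
    type(country)
    # Put the dense dummy columns first, followed by the remaining numeric columns
    X = np.hstack([country.toarray(), X[:,1:]]).astype(float)
    type(X)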
    
    # 5.Splitting the dataset into test set and training set
    from sklearn.model_selection import train_test_split
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
    
    # 6.Feature scaling
    from sklearn.preprocessing import StandardScaler
    sc_X = StandardScaler()
    X_train = sc_X.fit_transform(X_train)
    X_test = sc_X.transform(X_test)
    print(X_train);print(X_test)
    
    Day 1 EN.jpg

    Article link: https://www.haomeiwen.com/subject/ddpggftx.html