Machine Learning in Practice (2): Predicting House Prices

Author: 柳叶刀与小鼠标 | Published 2018-10-21 19:03

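    Previous article

    Machine Learning in Practice (1): Predicting House Prices with Linear Regression - Jianshu
    https://www.jianshu.com/p/0b66f1c4cc2d

    This article works systematically through the data preprocessing that has to happen before any model is trained.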

    # -*- coding: utf-8 -*-
    """
    Created on Sun Oct 21 14:37:15 2018
    
    @author: Administrator
    """
    
    # Spyder/IPython only: uncomment to clear the workspace and the console.
    # %reset -f
    # %clear
    
    # In[*]
    ########## Step 1: import packages
    # In[*]
    from sklearn.model_selection import cross_val_score
    from sklearn import linear_model
    from sklearn import metrics
    import matplotlib.pyplot as plt
    import pandas as pd
    import matplotlib
    import numpy as np
    import seaborn as sns
    import os
    from scipy.stats import skew
    from scipy.stats import pearsonr
    os.chdir("C:\\Users\\Administrator\\Desktop\\all")
    
    # In[*]
    ########## Step 2: load the data
    # In[*]
    train = pd.read_csv('train.csv',header = 0,index_col=0)
    test  = pd.read_csv('test.csv',header = 0,index_col=0)
    
    all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                          test.loc[:,'MSSubClass':'SaleCondition']))
    

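    The first two steps import the packages and load the data.


    The combined data has roughly 80 columns and about 3,000 observations. Some attributes are numeric columns and others are string columns.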
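A quick way to check this mix of column types is to split the columns by dtype and count missing values. The sketch below uses a small hypothetical frame (`toy`, with column names borrowed from the dataset purely for illustration) rather than the real `all_data`:

```python
import pandas as pd

# Hypothetical stand-in for all_data; the values are made up for illustration.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RL", "RM"],
    "LotFrontage": [65.0, None, 68.0],
})

# Split the columns into numeric and non-numeric (string) groups.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
string_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Count missing values per column.
missing_per_col = toy.isna().sum()

print(toy.shape)
print(numeric_cols, string_cols)
print(missing_per_col)
```

The same three calls should work on the real `all_data` frame, assuming it has been built as in step 2.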

    # In[*]
    # Step 3: compare the raw target with its log(price + 1) transform
    
    matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
    prices = pd.DataFrame({"price":train["SalePrice"],
                           "log(price + 1)":np.log1p(train["SalePrice"])})
    prices.hist()
    
    
    # In[*]
    # Step 4: log-transform the target and the skewed predictors
    #log transform the target:
    train["SalePrice"] = np.log1p(train["SalePrice"])
    
    #log transform skewed numeric features:
    numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
    
    skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) 
    
    skewed_feats = skewed_feats[skewed_feats > 0.75]
    skewed_feats = skewed_feats.index
    all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
    

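    The main purpose of this step is to apply a log(x + 1) transform to the numeric features that are strongly skewed (skewness above 0.75, measured on the training set) rather than approximately normally distributed.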
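To see what the 0.75 threshold does in practice, here is a minimal sketch on synthetic data. The frame `toy` and its two columns are hypothetical: one column is drawn from a symmetric normal distribution, the other from a right-skewed log-normal distribution.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(0)

# Hypothetical data: one roughly symmetric column, one right-skewed column.
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=50.0, scale=5.0, size=5000),
    "skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

# Sample skewness per column, ignoring missing values as in the article.
skewness = toy.apply(lambda col: skew(col.dropna()))

# Keep only the columns whose skewness exceeds the 0.75 threshold.
selected = skewness[skewness > 0.75].index.tolist()

# Apply log(1 + x) to the selected columns only.
transformed = toy.copy()
transformed[selected] = np.log1p(transformed[selected])
skew_after = skew(transformed["skewed"])

print(skewness)
print("selected:", selected)
print("skewness of 'skewed' after log1p:", skew_after)
```

Only the skewed column is selected, and its skewness drops after the transform, while the symmetric column is left untouched. `np.log1p` is used instead of `np.log` so that zero values map to zero rather than to negative infinity.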

    # In[*]
    # Step 5: one-hot encode the string variables and fill missing values
    # In[*]
    all_data = pd.get_dummies(all_data)
    all_data = all_data.fillna(all_data.mean())
    # In[*]
    # Step 6: split the combined data back into training and test sets
    # In[*]
    #creating matrices for sklearn:
    X_train = all_data[:train.shape[0]]
    X_test = all_data[train.shape[0]:]
    y = train.SalePrice
    

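    Key points of the preprocessing:
    1. Transform the skewed numeric features with log(x + 1), which brings their distributions closer to normal.
    2. Create dummy variables for the categorical features.
    3. Replace missing numeric values (NaN) with the mean of their respective columns.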
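Points 2 and 3 can be illustrated on a tiny example. This is a sketch on a hypothetical two-column frame (`toy`), not on the real data:

```python
import pandas as pd

# Hypothetical frame: one numeric column with a gap, one string column.
toy = pd.DataFrame({
    "LotFrontage": [60.0, None, 80.0],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Point 2: each string column becomes one indicator column per category.
encoded = pd.get_dummies(toy)

# Point 3: remaining NaN values are replaced by the mean of their column.
filled = encoded.fillna(encoded.mean())

print(filled)
```

The string column `Street` becomes `Street_Grvl` and `Street_Pave`, and the missing `LotFrontage` value becomes 70.0, the mean of 60.0 and 80.0. Note that with the default `dummy_na=False`, a missing value in a string column gets no indicator of its own: all of that column's dummies are simply zero for that row.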

    Full code:

    # -*- coding: utf-8 -*-
    """
    Created on Sun Oct 21 14:37:15 2018
    
    @author: Administrator
    """
    
    # Spyder/IPython only: uncomment to clear the workspace and the console.
    # %reset -f
    # %clear
    
    # In[*]
    ########## Step 1: import packages
    # In[*]
    from sklearn.model_selection import cross_val_score
    from sklearn import linear_model
    from sklearn import metrics
    import matplotlib.pyplot as plt
    import pandas as pd
    import matplotlib
    import numpy as np
    import seaborn as sns
    import os
    from scipy.stats import skew
    from scipy.stats import pearsonr
    os.chdir("C:\\Users\\Administrator\\Desktop\\all")
    
    # In[*]
    ########## Step 2: load the data
    # In[*]
    train = pd.read_csv('train.csv',header = 0,index_col=0)
    test  = pd.read_csv('test.csv',header = 0,index_col=0)
    
    all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                          test.loc[:,'MSSubClass':'SaleCondition']))
    # In[*]
    
    #Data preprocessing:
    
    #We're not going to do anything fancy here:
    
    #First I'll transform the skewed numeric features by taking log(feature + 1) - 
    #this will make the features more normal
    #Create Dummy variables for the categorical features
    #Replace the numeric missing values (NaN's) with the mean of their respective columns
    
    # In[*]
    # Step 3: compare the raw target with its log(price + 1) transform
    # In[*]
    matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
    prices = pd.DataFrame({"price":train["SalePrice"],
                           "log(price + 1)":np.log1p(train["SalePrice"])})
    prices.hist()
    
    
    # In[*]
    # Step 4: log-transform the target and the skewed predictors
    # In[*]
    #log transform the target:
    train["SalePrice"] = np.log1p(train["SalePrice"])
    
    #log transform skewed numeric features:
    numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
    
    skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) 
    
    skewed_feats = skewed_feats[skewed_feats > 0.75]
    
    
    skewed_feats = skewed_feats.index
    all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
    
    # In[*]
    # Step 5: one-hot encode the string variables and fill missing values
    # In[*]
    all_data = pd.get_dummies(all_data)
    all_data = all_data.fillna(all_data.mean())
    # In[*]
    # Step 6: split the combined data back into training and test sets
    # In[*]
    #creating matrices for sklearn:
    X_train = all_data[:train.shape[0]]
    X_test = all_data[train.shape[0]:]
    y = train.SalePrice
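
Because `y` now holds log(1 + SalePrice), any model fitted on it predicts on the log scale, and its predictions need to be mapped back with `np.expm1` before they can be read as prices. A minimal sketch of the round trip, using made-up prices rather than values from the dataset:

```python
import numpy as np

# Hypothetical sale prices, for illustration only.
prices = np.array([100000.0, 208500.0, 755000.0])

# Forward transform, as applied to train["SalePrice"] above.
y_log = np.log1p(prices)

# Inverse transform: expm1(x) = exp(x) - 1 undoes log1p.
recovered = np.expm1(y_log)

print(y_log)
print(recovered)
```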
    
    

    Next article

    Machine Learning in Practice (3): Predicting House Prices with Lasso Regression - Jianshu
    https://www.jianshu.com/p/ccfa1d0b792a
