Introductory Machine Learning Datasets, 2: Boston House Prices

Author: ac619467fef3 | Published: 2019-02-10 08:00

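    sklearn ships a small house price dataset with 13 features. The dataset used in this article has 79 features. Two models are fitted to it, linear regression and random forest, and they are evaluated with two metrics, R2 and MSE (reported as its square root, RMSE). In these experiments the random forest performs better. One likely reason is that bagging makes the random forest more resistant to overfitting; another is that with this many features, a decision-tree-based model can reach better results.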

    Data overview

    The Boston house price dataset can be loaded from sklearn in an already preprocessed form.

    import sklearn
    import numpy as np
    # Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
    from sklearn.datasets import load_boston
    np.set_printoptions(suppress=True)
    boston = load_boston()
    
    print("data shape:{}".format(boston.data.shape))
    print("target shape:{}".format(boston.target.shape))
    print("line head 5:\n{}".format(boston.data[:5]))
    print("target head 5:\n{}".format(boston.target[:5]))
    

    Output:

    data shape:(506, 13)
    target shape:(506,)
    line head 5:
    [[  0.00632  18.        2.31      0.        0.538     6.575    65.2
        4.09      1.      296.       15.3     396.9       4.98   ]
     [  0.02731   0.        7.07      0.        0.469     6.421    78.9
        4.9671    2.      242.       17.8     396.9       9.14   ]
     [  0.02729   0.        7.07      0.        0.469     7.185    61.1
        4.9671    2.      242.       17.8     392.83      4.03   ]
     [  0.03237   0.        2.18      0.        0.458     6.998    45.8
        6.0622    3.      222.       18.7     394.63      2.94   ]
     [  0.06905   0.        2.18      0.        0.458     7.147    54.2
        6.0622    3.      222.       18.7     396.9       5.33   ]]
    target head 5:
    [24.  21.6 34.7 33.4 36.2]
    

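    Any simple model can handle this data; see the article below for an example.
    https://www.jianshu.com/p/f828eae005a1?utm_campaign=haruki&utm_content=note&utm_medium=reader_share&utm_source=weixin_timeline&from=timeline
    There is also a second dataset in CSV format with 80 columns (its column names match the Kaggle House Prices data, which describes homes in Ames, Iowa). The rest of this article works with that dataset.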

    Boston house price dataset

    Data preprocessing

    Loading the data

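    import pandas as pd

    # read_csv loads a CSV file; index_col=0 marks the first column as the Id index
    train_df = pd.read_csv("/Users/wangsen/ai/03/9day_discuz/firstDiscuz/02_houseprice/data/train.csv",index_col=0)
    test_df = pd.read_csv("/Users/wangsen/ai/03/9day_discuz/firstDiscuz/02_houseprice/data/test.csv",index_col=0)
    print(train_df.info())
    ##print(train_df.describe().T)
    # value_counts tallies each value; unique lists the distinct values
    print(train_df['MSSubClass'].value_counts())
    print(train_df['MSSubClass'].unique())
    
    # Set the label aside so that only the 79 feature columns are concatenated
    train_target = train_df.pop('SalePrice')
    train_len = train_df.shape[0]
    
    # Preprocess the training and test sets together
    all_df = pd.concat((train_df,test_df),axis=0)
    print(all_df.shape)
    print(all_df['MSSubClass'].value_counts())
    print(all_df['MSSubClass'].unique())
    print(pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass'))
    
    # First five rows: the raw value next to its one-hot columns, transposed for readability
    print(pd.concat((all_df['MSSubClass'][:5],pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass')[:5]),axis=1).T)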
    
    • df.info() shows how many columns there are, with the non-null count and dtype of each
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1460 entries, 1 to 1460
    Data columns (total 80 columns):
    MSSubClass       1460 non-null int64
    MSZoning         1460 non-null object
    LotFrontage      1201 non-null float64
    LotArea          1460 non-null int64
    Street           1460 non-null object
    Alley            91 non-null object
    LotShape         1460 non-null object
    LandContour      1460 non-null object
    Utilities        1460 non-null object
    LotConfig        1460 non-null object
    LandSlope        1460 non-null object
    Neighborhood     1460 non-null object
    Condition1       1460 non-null object
    Condition2       1460 non-null object
    BldgType         1460 non-null object
    HouseStyle       1460 non-null object
    OverallQual      1460 non-null int64
    OverallCond      1460 non-null int64
    YearBuilt        1460 non-null int64
    YearRemodAdd     1460 non-null int64
    RoofStyle        1460 non-null object
    RoofMatl         1460 non-null object
    Exterior1st      1460 non-null object
    Exterior2nd      1460 non-null object
    MasVnrType       1452 non-null object
    MasVnrArea       1452 non-null float64
    ExterQual        1460 non-null object
    ExterCond        1460 non-null object
    Foundation       1460 non-null object
    BsmtQual         1423 non-null object
    BsmtCond         1423 non-null object
    BsmtExposure     1422 non-null object
    BsmtFinType1     1423 non-null object
    BsmtFinSF1       1460 non-null int64
    BsmtFinType2     1422 non-null object
    BsmtFinSF2       1460 non-null int64
    BsmtUnfSF        1460 non-null int64
    TotalBsmtSF      1460 non-null int64
    Heating          1460 non-null object
    HeatingQC        1460 non-null object
    CentralAir       1460 non-null object
    Electrical       1459 non-null object
    1stFlrSF         1460 non-null int64
    2ndFlrSF         1460 non-null int64
    LowQualFinSF     1460 non-null int64
    GrLivArea        1460 non-null int64
    BsmtFullBath     1460 non-null int64
    BsmtHalfBath     1460 non-null int64
    FullBath         1460 non-null int64
    HalfBath         1460 non-null int64
    BedroomAbvGr     1460 non-null int64
    KitchenAbvGr     1460 non-null int64
    KitchenQual      1460 non-null object
    TotRmsAbvGrd     1460 non-null int64
    Functional       1460 non-null object
    Fireplaces       1460 non-null int64
    FireplaceQu      770 non-null object
    GarageType       1379 non-null object
    GarageYrBlt      1379 non-null float64
    GarageFinish     1379 non-null object
    GarageCars       1460 non-null int64
    GarageArea       1460 non-null int64
    GarageQual       1379 non-null object
    GarageCond       1379 non-null object
    PavedDrive       1460 non-null object
    WoodDeckSF       1460 non-null int64
    OpenPorchSF      1460 non-null int64
    EnclosedPorch    1460 non-null int64
    3SsnPorch        1460 non-null int64
    ScreenPorch      1460 non-null int64
    PoolArea         1460 non-null int64
    PoolQC           7 non-null object
    Fence            281 non-null object
    MiscFeature      54 non-null object
    MiscVal          1460 non-null int64
    MoSold           1460 non-null int64
    YrSold           1460 non-null int64
    SaleType         1460 non-null object
    SaleCondition    1460 non-null object
    SalePrice        1460 non-null int64
    dtypes: float64(3), int64(34), object(43)
    memory usage: 923.9+ KB
    
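    • pd.get_dummies applies dummy encoding (also called one-hot encoding) to discrete features. pandas has no separate fit and transform steps, so the training and test sets are concatenated first, which gives both the same dummy columns.
      Take the first column, MSSubClass, as an example. Its value distribution can be inspected with unique() or value_counts().
      pd.get_dummies(all_df['MSSubClass'], prefix='MSSubClass')
      This encodes a single column. The distinct values are sorted and each gets its own indicator column: the smallest value, 20, becomes [1,0,0,...], the second smallest, 30, becomes [0,1,0,...]. The first five rows, transposed, look like this: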
    Id               1   2   3   4   5
    MSSubClass      60  20  60  70  60
    MSSubClass_20    0   1   0   0   0
    MSSubClass_30    0   0   0   0   0
    MSSubClass_40    0   0   0   0   0
    MSSubClass_45    0   0   0   0   0
    MSSubClass_50    0   0   0   0   0
    MSSubClass_60    1   0   1   0   1
    MSSubClass_70    0   0   0   1   0
    MSSubClass_75    0   0   0   0   0
    MSSubClass_80    0   0   0   0   0
    MSSubClass_85    0   0   0   0   0
    MSSubClass_90    0   0   0   0   0
    MSSubClass_120   0   0   0   0   0
    MSSubClass_150   0   0   0   0   0
    MSSubClass_160   0   0   0   0   0
    MSSubClass_180   0   0   0   0   0
    MSSubClass_190   0   0   0   0   0
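
Why the training and test sets are concatenated before encoding can be seen on a toy example. This is a minimal sketch; the two small frames below are hypothetical stand-ins for train_df and test_df, not rows from the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the training and test frames.
train = pd.DataFrame({"MSSubClass": [60, 20, 60]})
test = pd.DataFrame({"MSSubClass": [20, 70]})

# Encoding each frame on its own produces different column sets,
# because each frame only sees the values it happens to contain.
train_cols = list(pd.get_dummies(train["MSSubClass"], prefix="MSSubClass").columns)
test_cols = list(pd.get_dummies(test["MSSubClass"], prefix="MSSubClass").columns)

# Encoding the concatenated frame gives every row the same columns.
both = pd.concat((train, test), axis=0)
all_cols = list(pd.get_dummies(both["MSSubClass"], prefix="MSSubClass").columns)

print(train_cols)
print(test_cols)
print(all_cols)
```

A model fitted on the separately encoded training frame would have no column for the value 70, which is the mismatch the concatenation avoids.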
    
    • Check for missing values
    all_dummy_df = pd.get_dummies(all_df)
    # print(all_dummy_df.head())
    print(all_dummy_df.shape)
    print(all_dummy_df.isnull().sum().sort_values(ascending=False))
    
    all_df shape:(2919, 79)
    LotFrontage              486
    GarageYrBlt              159
    MasVnrArea                23
    BsmtFullBath               2
    BsmtHalfBath               2
    BsmtFinSF1                 1
    BsmtFinSF2                 1
    BsmtUnfSF                  1
    TotalBsmtSF                1
    GarageArea                 1
    GarageCars                 1
    Condition1_RRNe            0
    Condition1_RRNn            0
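
One detail worth checking here: when pd.get_dummies is applied to a whole DataFrame, it encodes only object and category columns by default, so an integer-coded category such as MSSubClass stays numeric. If it is meant to be treated as a category, casting it to str first is one option. A minimal sketch on toy data (the values are hypothetical):

```python
import pandas as pd

# Toy frame: one integer-coded category and one string category.
df = pd.DataFrame({
    "MSSubClass": [60, 20, 60],
    "Street": ["Pave", "Grvl", "Pave"],
})

# By default only the string column is expanded into dummy columns.
default_cols = list(pd.get_dummies(df).columns)

# Casting the integer codes to str makes get_dummies expand them too.
df_cast = df.copy()
df_cast["MSSubClass"] = df_cast["MSSubClass"].astype(str)
cast_cols = list(pd.get_dummies(df_cast).columns)

print(default_cols)
print(cast_cols)
```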
    
    • Fill missing values with the column mean
    mean_cols = all_dummy_df.mean()
    all_dummy_df = all_dummy_df.fillna(mean_cols)
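
What fillna does when it is given a Series of column means can be checked on a small frame. This is a sketch with made-up values; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical values).
df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "GarageCars": [2.0, 1.0, np.nan],
})

# df.mean() skips NaN and returns one mean per column;
# fillna matches those means to columns by name.
mean_cols = df.mean()
filled = df.fillna(mean_cols)

print(mean_cols)
print(filled)
```

The article computes the means over the training and test rows together. Computing them on the training rows only is a common alternative when the test set must stay unseen.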
    
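    • Model training
    # train_len is the number of training rows and train_target is the SalePrice column,
    # both set aside before the training and test sets were concatenated
    dummy_train_df = all_dummy_df[:train_len]
    
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score
    lr = LinearRegression()
    # The model must be fitted before score() can be called; score() returns R^2
    lr.fit(dummy_train_df,train_target)
    print("Training-set score:{}".format(lr.score(dummy_train_df,train_target)))
    
    # 20 folds, matching the 20 scores printed below
    kf = KFold(n_splits=20, shuffle=True)
    score_ndarray = cross_val_score(lr, dummy_train_df, train_target, cv=kf)
    print(score_ndarray)
    print(score_ndarray.mean())
    

    Output:

    Training-set score:0.9332679645484127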
    [ 0.88669507 -1.54529853  0.90133954  0.84720817  0.86750469  0.92840145
      0.8299786   0.91205312  0.92400129  0.91065317  0.55149449  0.87645062
      0.48737113  0.82570995  0.91949504  0.8890254   0.79646233  0.94457746
      0.65656125  0.91573777]
    0.7162711000424071
    

    score

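    LinearRegression's score is R^2. The model reaches 0.93 on the training set, but the cross-validation scores average only about 0.72, with one fold as low as -1.55. That gap indicates the model is overfitting.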
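
The same pattern, a high training score next to a much lower cross-validated score, can be reproduced on synthetic data. This is a minimal sketch, not the article's experiment: the data are random numbers with few rows and many features, chosen as an assumption to make overfitting easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: 60 rows, 40 features, only the first feature matters.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 40))
y = X[:, 0] * 3.0 + rng.normal(scale=2.0, size=60)

lr = LinearRegression()
lr.fit(X, y)
train_r2 = lr.score(X, y)  # R^2 on the data the model was fitted to

# Each fold is scored on rows the model did not see during fitting.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(lr, X, y, cv=kf)
cv_r2 = cv_scores.mean()

print("training R^2:", train_r2)
print("cross-validated R^2:", cv_r2)
```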

    R2
    See the R2 source code: github
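
As a quick check of what the score means, R^2 can be computed by hand as 1 - SS_res / SS_tot and compared with sklearn.metrics.r2_score. The arrays below are made-up numbers for illustration. The second prediction shows how R^2 becomes negative when a model does worse than always predicting the mean, which is how a fold can score -1.55.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# Residual sum of squares and total sum of squares around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

# A prediction that is worse than the constant mean gives a negative R^2.
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
r2_bad = r2_score(y_true, y_bad)

print(r2_manual, r2_sklearn, r2_bad)
```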

    cross_val_score: cross-validation error

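    R^2 does not directly express how large the errors are, so the two models, linear regression and random forest, are compared by MSE instead. The code takes the square root, so the reported numbers are RMSE values in the same units as the sale price.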
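
To make the units concrete, here is a small worked example of MSE and its square root. The prices are invented for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical sale prices and predictions.
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 330000.0])

# MSE is the mean of the squared errors, in squared price units.
mse = mean_squared_error(y_true, y_pred)

# RMSE is its square root, back in plain price units.
rmse = np.sqrt(mse)

print("MSE:", mse)
print("RMSE:", rmse)
```

The errors are 10000, 10000 and 30000, so the MSE is (1e8 + 1e8 + 9e8) / 3 and the RMSE comes out at roughly 19149.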

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score
    lr = LinearRegression()
    
    kf = KFold(n_splits=5, shuffle=True)
    # neg_mean_squared_error returns the negated MSE; negate it and take the square root to get RMSE per fold
    score_ndarray = np.sqrt(-cross_val_score(lr, dummy_train_df, train_target, cv=kf,scoring="neg_mean_squared_error"))
    print(score_ndarray.mean())
    
    from sklearn.ensemble import RandomForestRegressor
    clf = RandomForestRegressor(n_estimators=200, max_features=3)
    score_ndarray = np.sqrt(-cross_val_score(clf, dummy_train_df, train_target, cv=kf,scoring="neg_mean_squared_error"))
    print(score_ndarray.mean())
    
    # RMSE of each model on the training set itself
    from sklearn.metrics import mean_squared_error
    clf.fit(dummy_train_df,train_target)
    train_predict = clf.predict(dummy_train_df)
    print("Random forest training RMSE:",np.sqrt(mean_squared_error(train_target,train_predict)))
    
    lr.fit(dummy_train_df,train_target)
    train_predict = lr.predict(dummy_train_df)
    print("Linear regression training RMSE:",np.sqrt(mean_squared_error(train_target,train_predict)))
    
    Random forest training RMSE: 13134.86059929929
    Linear regression training RMSE: 20514.990603536615
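
The two figures above are measured on the rows the models were fitted to. For a random forest in particular, that training error is usually well below the error on unseen rows. The sketch below illustrates the gap on synthetic data from make_regression; the data, sizes and random_state values are assumptions for illustration and the numbers say nothing about the house price data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem with some noise.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# RMSE per fold on held-out rows, then averaged.
neg_mse = cross_val_score(rf, X, y, cv=kf, scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse).mean()

# RMSE on the same rows the forest was fitted to.
rf.fit(X, y)
train_rmse = np.sqrt(mean_squared_error(y, rf.predict(X)))

print("training RMSE:", train_rmse)
print("cross-validated RMSE:", cv_rmse)
```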
    

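    Summary

    The random forest gives better results than linear regression here: its RMSE is about 13,135 against about 20,515. Both figures are measured on the training set, so the cross-validated RMSE that the code also prints is the fairer basis for comparing the two models.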
