Case Study 1: Kaggle House Price Prediction

Author: 粉红狐狸_dhf | Published 2020-05-30 19:56

    Source video: https://www.bilibili.com/video/BV19b411z73K?p=2
    Dataset: https://pan.baidu.com/s/1yZ1QuLaO6lz7sic40UHGvg
    Extraction code: 0wbt

    1 Reading the data

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    df_train=pd.read_csv('E:/jupyter_lab/leetcode/data/house-prices-kaggle/train.csv',index_col=0)
    df_test=pd.read_csv('E:/jupyter_lab/leetcode/data/house-prices-kaggle/test.csv',index_col=0)  
    #index_col=0 uses the first column (Id) as the index
    
    print('train.shape:',df_train.shape)
    print('test.shape:',df_test.shape)
    

    2 Processing the target y

    Regression targets should be as close to normally distributed as possible. Transform y with log1p = log(x + 1), and map predictions back with expm1.

    df_y=pd.DataFrame({'y_':df_train.SalePrice,'y_log1p':np.log1p(df_train.SalePrice)})
    df_y.hist()
    
    [Figure: histograms of SalePrice before and after the log1p transform]
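As a quick sanity check (a standalone sketch with made-up prices, not values from the dataset), the two transforms invert each other exactly:

```python
# np.log1p and np.expm1 are exact inverses, so predictions made on the
# log scale can be mapped back to prices without loss.
import numpy as np

prices = np.array([100000.0, 250000.0, 755000.0])  # toy sale prices
logged = np.log1p(prices)      # log(1 + x), numerically stable near 0
restored = np.expm1(logged)    # exp(x) - 1, the inverse transform
round_trip_ok = np.allclose(prices, restored)
```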

    3 Merging train and test for preprocessing

    • categorical features: one-hot encode with pd.get_dummies()
    • numerical features: handle missing values, skew, etc.

    (1) Handling categorical features

    y=df_train.pop('SalePrice')
    all_df=pd.concat((df_train,df_test),axis=0)
    #axis=0 stacks the two frames vertically (row-wise)
    all_df.shape #(2919, 79)
    
    # MSSubClass is a class label despite its numeric values, so treat it as categorical by casting to str
    print('MSSubClass.type:',all_df.MSSubClass.dtypes)
    all_df.MSSubClass=all_df.MSSubClass.astype(str)
    all_df.MSSubClass.value_counts() #count occurrences of each level
    
    #one-hot encode every categorical (object-dtype) column
    dummy_all_df=pd.get_dummies(all_df)
    dummy_all_df.head()
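On a hypothetical two-row frame, pd.get_dummies behaves like this (the column names below are illustrative only, not the real dataset):

```python
# pd.get_dummies turns each level of a string column into its own 0/1
# indicator column, while numeric columns pass through unchanged.
import pandas as pd

toy = pd.DataFrame({'MSSubClass': ['20', '60'], 'LotArea': [8450, 9600]})
dummies = pd.get_dummies(toy)
cols = sorted(dummies.columns)  # LotArea kept, MSSubClass expanded per level
```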
    

    (2) Handling numerical features

    Fill missing values with the column means.

    #columns sorted by missing-value count, descending
    dummy_all_df.isnull().sum().sort_values(ascending=False).head()
    
    #fill NaNs with each column's mean
    cols_mean=dummy_all_df.mean()
    dummy_all_df=dummy_all_df.fillna(cols_mean)
    dummy_all_df.isnull().sum().sum()
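A minimal sketch of the mean-fill step on toy data (not the actual house-price columns): passing a Series of per-column means to fillna fills each column's NaNs with that column's own mean.

```python
# DataFrame.fillna accepts a Series indexed by column name, so each
# column is filled independently with its own mean.
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, np.nan]})
filled = toy.fillna(toy.mean())  # a's NaN -> 2.0, b's NaN -> 15.0
remaining_nans = filled.isnull().sum().sum()
```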
    

    Standardize the numerical columns. This step is optional, but regression models usually benefit from it.

    #select all numerical (non-object) columns; after get_dummies this is effectively every column
    numerical_cols=dummy_all_df.columns[dummy_all_df.dtypes != 'object']
    
    numerical_cols_mean=dummy_all_df.loc[:,numerical_cols].mean()
    numerical_cols_std=dummy_all_df.loc[:,numerical_cols].std()
    dummy_all_df.loc[:,numerical_cols]=(dummy_all_df.loc[:,numerical_cols]-numerical_cols_mean)/numerical_cols_std
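A toy check of the standardization formula above (note that pandas .std() uses the sample standard deviation, ddof=1):

```python
# After (x - mean) / std, a column has mean ~0 and (sample) std ~1.
import pandas as pd

col = pd.Series([1.0, 2.0, 3.0, 4.0])       # made-up values
scaled = (col - col.mean()) / col.std()     # same formula as in the text
mean_ok = abs(scaled.mean()) < 1e-12
std_ok = abs(scaled.std() - 1.0) < 1e-12
```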
    

    4 建立模型

    Split the combined frame back into train and test.

    dummy_train=dummy_all_df.loc[df_train.index]
    dummy_test=dummy_all_df.loc[df_test.index]
    dummy_train.shape
    

    (1) Ridge Regression

    Ridge regression is least squares plus an L2 penalty: a biased estimator that gives up a little fitting accuracy to reduce model complexity, which is what regularization means in practice. The penalty term grows with model complexity, so when the loss is small (the fit is very tight) the complexity tends to be high, and the penalty pushes it back down at the cost of some accuracy, preventing overfitting. (L2 norm: the square root of the sum of squared parameters; L1 norm: the sum of the parameters' absolute values.)
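The two norms from the parenthetical above, computed on a made-up weight vector:

```python
# L2 norm: sqrt of the sum of squares (Ridge penalizes its square);
# L1 norm: sum of absolute values (the Lasso penalty).
import numpy as np

w = np.array([3.0, -4.0])          # hypothetical model weights
l2 = np.sqrt(np.sum(w ** 2))       # sqrt(9 + 16) = 5.0
l1 = np.sum(np.abs(w))             # 3 + 4 = 7.0
```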

    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score #cross-validation scoring
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    y_train=df_y.y_log1p
    x_train=dummy_train.values
    x_test=dummy_test.values
    #convert the DataFrames to NumPy arrays, as expected by sklearn
    
    #sweep the regularization strength alpha
    alphas=np.logspace(-2,3,50)# 0.01~1000
    test_scores=[]
    for alpha in alphas:
        clf=Ridge(alpha)
        test_score=np.sqrt(-cross_val_score(clf,x_train,y_train,cv=10,scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))
    
    plt.plot(alphas,test_scores)
    plt.title('Alpha vs CV Error')
    plt.show()
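Once the sweep finishes, the best alpha can be read off as the one with the lowest mean CV error. A sketch, with toy stand-in values for the `alphas` and `test_scores` arrays the loop above produces:

```python
# Pick the alpha that minimizes mean CV RMSE. Values here are made up
# purely to illustrate the argmin lookup.
import numpy as np

alphas = np.array([0.1, 1.0, 10.0])
test_scores = [0.20, 0.15, 0.18]
best_alpha = alphas[int(np.argmin(test_scores))]  # lowest error wins
```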
    
    (2) RandomForestRegressor (random forest regression)
    from sklearn.ensemble import RandomForestRegressor
    
    
    max_features=[.1,.3,.5,.7,.9,.99]
    test_scores=[]
    for max_feature in max_features :
        rlf=RandomForestRegressor(n_estimators=200,max_features=max_feature)
        test_score=np.sqrt(-cross_val_score(rlf,x_train,y_train,cv=5,scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))
    
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    plt.plot(max_features,test_scores)
    plt.title('Max_features vs CV Error')
    plt.show()
    

    The loss will not necessarily shrink from one model to the next; the point of this article is the analysis workflow. Feel free to substitute your own models.

    5 Ensemble models

    Combine the tuned models in the spirit of stacking (here, by simply averaging their predictions).

    ridge=Ridge(alpha=400)
    rlf=RandomForestRegressor(max_features=0.3)
    
    ridge.fit(x_train,y_train)
    rlf.fit(x_train,y_train)
    
    y_ridge=np.expm1(ridge.predict(x_test))
    y_rlf=np.expm1(rlf.predict(x_test))
    
    #a full ensemble would feed the models' predictions into a second-stage learner; here we simply average them
    y_en=(y_ridge+y_rlf)/2 #pseudo-ensemble: plain averaging
    
    #submission format
    submission_df=pd.DataFrame({'Id':df_test.index,'SalePrice':y_en})
    submission_df.head()
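To actually submit, the frame is written to CSV without the pandas index. A sketch with toy ids and prices standing in for df_test.index and y_en:

```python
# Kaggle expects exactly two columns, Id and SalePrice, and no index
# column; index=False drops the row index from the output.
import io
import pandas as pd

submission_df = pd.DataFrame({'Id': [1461, 1462],
                              'SalePrice': [120000.0, 155000.0]})
buf = io.StringIO()                      # in-memory file for illustration
submission_df.to_csv(buf, index=False)
first_line = buf.getvalue().splitlines()[0]
```

In practice you would write to a real file, e.g. `submission_df.to_csv('submission.csv', index=False)` (the file name is an assumption; Kaggle accepts any name).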
    

    6 More advanced ensemble models

    (1) Bagging: base models are trained in parallel and their predictions are combined by voting/averaging.

    Base estimator: Ridge(15), the model tuned earlier.

    from sklearn.ensemble import BaggingRegressor  
    from sklearn.model_selection import cross_val_score
    
    ridge=Ridge(15) #base estimator
    
    params=[1,10,15,20,25,30,40]
    test_scores=[]
    
    for param in params :
        rlf=BaggingRegressor(n_estimators=param,base_estimator=ridge) #base_estimator defaults to a decision tree
        test_score=np.sqrt(-cross_val_score(rlf,x_train,y_train,cv=5,scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))
    
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    plt.plot(params,test_scores)
    plt.title('Params vs CV Error')
    plt.show()
    
    (2) Boosting: sequential; samples that are hard to predict are passed on to the next base estimator.
    from sklearn.ensemble import AdaBoostRegressor  
    from sklearn.model_selection import cross_val_score
    
    ridge=Ridge(15) #base estimator
    
    params=[1,3,5,7,9,10,11,12,15]
    test_scores=[]
    
    for param in params :
        rlf=AdaBoostRegressor(n_estimators=param,base_estimator=ridge) #base_estimator defaults to a decision tree
        test_score=np.sqrt(-cross_val_score(rlf,x_train,y_train,cv=5,scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))
    
    plt.plot(params,test_scores)
    plt.title('Params vs CV Error')
    plt.show()
    
    (3) XGBoost: an improved boosting implementation.
    from xgboost import XGBRegressor
    import warnings
    warnings.filterwarnings("ignore")
    
    params=[1,3,5,7,9,10,11,12,15]
    test_scores=[]
    
    for param in params :
        rlf=XGBRegressor(max_depth=param) #sweep the tree depth; the sklearn wrapper rejects positional arguments
        test_score=np.sqrt(-cross_val_score(rlf,x_train,y_train,cv=5,scoring='neg_mean_squared_error'))
        test_scores.append(np.mean(test_score))
    
    plt.plot(params,test_scores)
    plt.title('Params vs CV Error')
    plt.show()
    
    [Figure: param vs CV error for XGBoost]

    The XGBoost predictions give the best result here, which shows the power of ensemble methods.
