180605_Stacked_Regression

作者: 郑磊_4135 | 来源:发表于2018-06-05 14:52 被阅读0次

    regression to predict House Price

    The feature of data engineering includes as followings:

    • Imputing missing values by proceeding sequentially through the data
    • transforming some numerical variables that seem really categorical
    • Label Encoding some categorical variables
    • Box-Cox Transformation of skewed features rather than log-transformation. This will give you a slightly better result both on leader board and cross-validation
    • Getting dummy variables for categorical features

    Then we choose many base models (mostly sklearn based models + sklearn API of DMLC's XGBoost and Microsoft's LightGBM), cross-validate them on the data before stacking/ensembling them.

    The key here is to make the linear models robust to outliers. This improved the result both on LB and cross-validation

    # linear algebra
    import numpy as np
    import pandas as pd
    
    # plot
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    
    # set plot style
    color = sns.color_palette()
    sns.set_style('darkgrid')
    
    # ignore warnings
    import warnings
    def ignore_warn(*arg, **kwargs):
        pass
    warnings.warn = ignore_warn # ignore annoying warning from sklearn and seaborn
    
    # stats function
    from scipy import stats
    from scipy.stats import norm, skew
    
    # set option
    pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
    
    from subprocess import check_output
    print(check_output(["ls","../input"]).decode("utf8"))
    

    [1] input data using pd

    [2] get data info

    print("the train data size before dropping Id feature is : {}".format(train.shape))
    print("the test data size before dropping Id feature is : {}".format(test.shape))
    
    # save and drop Id column
    train_Id = train['Id']
    test_Id = test['Id']
    
    train.drop('Id', axis = 1, inlace = True)
    test.drop('Id', axis = 1, inlace = True)
    

    [3]Data Cleaning

    [3.1] Outliers

    Documentation for the Ames Housing data indicates that there are outliers present in the training data.

    The followings is the approach to find out outliers and deal with outliers (We usually delete them).

    Let's explore these outliers

    fig, ax = plt.subplots()
    ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
    plt.ylabel('SalePrice', fontsize  = 13)
    plt.xlabel('GrLivArea', fontsize  = 13)
    plt.show()
    

    After identified the outliers, we can safely delete them

    train = train.drop(train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index)
    fig, ax = plt.subplots()
    ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
    plt.ylabel('SalePrice', fontsize  = 13)
    plt.xlabel('GrLivArea', fontsize  = 13)
    plt.show()
    
    [3.2] Dependent variable

    We have to guarantee that DV is normal distributed. So we need to run a description analysis and draw a norm fit graph.

    # histogram with norm fitted lines
    sns.distplot(train['SalePrice'], fit = norm)
    
    # get distribution info
    (mu, sigma) = norm.fit(train['SalePrice'])
    print('\n mu = {:.2f} and sigma ={:.2f}\n'.format(mu, sigma))
    
    # draw the distribution
    plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma =${:.2f})'.format(mu, sigma)], loc = 'best')
    plt.ylabel('Frequency')
    plt.title('SalePrice Distribution')
    
    #get also the QQ-plot
    fig = plt.figure()
    res = stats.probplot(train['SalePrice'], plot = plt)
    plt.show()
    

    After review the QQplot and distribution, we have to decide whether conduct a log transformation or not.

    train['SalePrice'] = np.log1p(train['SalePrice'])
    
    # check distritbuion corrected results
     sns.distplot(train['SalePrice'], fit = norm)
    
    # get distribution info
    (mu, sigma) = norm.fit(train['SalePrice'])
    print('\n mu = {:.2f} and sigma ={:.2f}\n'.format(mu, sigma))
    
    # draw the distribution
    plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma =${:.2f})'.format(mu, sigma)], loc = 'best')
    plt.ylabel('Frequency')
    plt.title('SalePrice Distribution')
    
    #get also the QQ-plot
    fig = plt.figure()
    res = stats.probplot(train['SalePrice'], plot = plt)
    plt.show()
    

    [4] Feature engineering

    first concatenate the train and test data in the same dataframe

    ntrain = train.shape[0]
    ntest = test.shape[0]
    
    y_train = train.SalePrice.values
    
    all_data = pd.concat((train, test)).reset_index(drop = True)
    all_data.drop(['SalePrice'], axis = 1, inplace = True)
    print("all_data size is: {}".format(all_data.shape))
    
    [4.1]Missing data
    all_data_na = all_data.isnull().sum() / len(all_data) * 100
    all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending = False)
    
    missing_data = pd.DataFrame('Missing Ration': all_Data_na)
    missing_data.head()
    

    Draw missing plot

    fig, ax = plt.subplots(figsize = (15, 20))
    plt.xticks(rotation = '90')
    sns.barplot(x = all_data_na.index, y = all_data_na)
    plt.xlabel("Features", fontsize = 15)
    plt.ylabel("Percent of missing values", fontsize = 15)
    plt.title("Percent of missing data by feature", fontsize = 15)
    plt.show()
    

    Imputing missing values

    # impute missing values by “None”
    for col in ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"]:
            all_data[col] = all_data[col].fillna("None")
    
    '''
    since the area of each street connected to house property most likely 
    have a similar area to other houses in its neighborhood, we can fill in 
    missing values by the median LotFrontage of the neighbourhood
    '''
    all_data["LotFrontage"] = all_data.groupby("Neignborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median()))
    
    # impute missing values by 0
    cols = ["GarageType", "GarageFinish", "GarageQual", "GarageCond"]
    for col in cols:
            all_data[col] = all_data[col].fillna(0)
    
    # impute missing values by mode 
    all_data["Electrical"] = all_data["Electrical"].fillna(all_data["Electrical"].mode()[0])
    

    check is there any remaining missing value

    all_data_na = all_data.isnull().sum() / len(all_data) * 100
    all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending = False)
    
    missing_data = pd.DataFrame('Missing Ration': all_Data_na)
    missing_data.head()
    
    [4.2] Correlation matrix (heatmap)
    corrmat = train.corr()
    plt.subplots(figsize = (12,9))
    sns.heatmap(corrmat, vmax =0.9, square = True)
    

    selected important variables based on heat map

    transforming some numerical variables that are really categorical

    #MSSubClass = the building class
    all_data["MSSubClass"] = all_data["MSSubClass"].apply(str)
    
    #changing OverallCond into a categorical variable
    all_data["OverallCond"] = all_data["OverallCond"].astype(str)
    
    #year and month sold are transformed into categorical features
    all_data["YrSold"] = all_data["YrSold"].astype(str)
    all_data["MoSold"] = all_data["MoSold"].astype(str)
    

    label encoding some categorical variables that may contain information in their ordering set

    from sklearn.preprocessing import LabelEncoder
    cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
            'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
            'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
            'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
            'YrSold', 'MoSold')
    # process columns, apply LabelEncoder to categorical features
    for c in cols:
        lbl = LabelEncoder() 
        lbl.fit(list(all_data[c].values)) 
        all_data[c] = lbl.transform(list(all_data[c].values))
    
    # shape        
    print('Shape all_data: {}'.format(all_data.shape))
    

    Adding one more important feature

    Since area related features are very important to determine house prices, we add one more feature which is the total area of basement, first and second floor areas of each house

    # Adding total sqfootage feature 
    all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
    

    Skewed features

    numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
    
    # Check the skew of all numerical features
    skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
    print("\nSkew in numerical features: \n")
    skewness = pd.DataFrame({'Skew' :skewed_feats})
    skewness.head(10)
    

    Box Cox Transformation for highly skewed features

    • we usually use the scipy function boxcox1p which computes the Box-Cox transformation of $1 + x$
    • setting $\lambda = 0$ is equivalent to log1p used above for the target variable
    skewness = skewness[abs(skewness) > 0.75]
    print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))
    
    from scipy.special import boxcox1p
    skewed_features = skewness.index
    lam = 0.15
    for feat in skewed_features:
        #all_data[feat] += 1
        all_data[feat] = boxcox1p(all_data[feat], lam)
        
    #all_data[skewed_features] = np.log1p(all_data[skewed_features])
    

    ** getting dummy categorical features**

    all_data = pd.get_dummies(all_data)
    print(all_data.shape)
    

    getting the new train and test sets

    train = all_data[:ntrain]
    test = all_data[ntrain:]
    

    Modelling

    #libraries
    from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
    from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler
    from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
    from sklearn.model_selection import KFold, cross_val_score, train_test_split
    from sklearn.metrics import mean_squared_error
    import xgboost as xgb
    import lightgbm as lgb
    

    Define a cross validation strategy

    We use the cross_val_score function of Sklearn. However this function has not a shuffle attribut, we add then one line of code, in order to shuffle the dataset prior to cross-validation

    #Validation function
    n_folds = 5
    
    def rmsle_cv(model):
        kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
        rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
        return(rmse)
    
    Base Model
    • LASSO Regression :

    This model may be very sensitive to outliers. So we need to made it more robust on them. For that we use the sklearn's Robustscaler() method on pipeline

    lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
    
    • Elastic Net Regression :
      again made robust to outliers
    ENet = make_pipeline(RobustScaler(), 
            ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
    
    • Kernel Ridge Regression :
    KRR = KernelRidge(alpha=0.6, 
                      kernel='polynomial', 
                      degree=2, coef0=2.5)
    
    • Gradient Boosting Regression:
      With huber loss that makes it robust to outliers
    GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                       max_depth=4, max_features='sqrt',
                                       min_samples_leaf=15, min_samples_split=10, 
                                       loss='huber', random_state =5)
    
    • XGBoost:
    model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                                 learning_rate=0.05, max_depth=3, 
                                 min_child_weight=1.7817, n_estimators=2200,
                                 reg_alpha=0.4640, reg_lambda=0.8571,
                                 subsample=0.5213, silent=1,
                                 random_state =7, nthread = -1)
    
    • LightGBM:
    model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                                  learning_rate=0.05, n_estimators=720,
                                  max_bin = 55, bagging_fraction = 0.8,
                                  bagging_freq = 5, feature_fraction = 0.2319,
                                  feature_fraction_seed=9, bagging_seed=9,
                                  min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
    
    Base models scores

    Let's see how these base models perform on the data by evaluating the cross-validation rmsle error

    score = rmsle_cv(lasso)
    print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
    

    Stacking models

    Simplest Stacking approach : Averaging base models

    We begin with this simple approach of averaging base models. We build a new class to extend scikit-learn with our model and also to laverage encapsulation and code reuse (inheritance)

    Averaged base models class

    class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
        def __init__(self, models):
            self.models = models
            
        # we define clones of the original models to fit the data in
        def fit(self, X, y):
            self.models_ = [clone(x) for x in self.models]
            
            # Train cloned base models
            for model in self.models_:
                model.fit(X, y)
    
            return self
        
        #Now we do the predictions for cloned models and average them
        def predict(self, X):
            predictions = np.column_stack([
                model.predict(X) for model in self.models_
            ])
            return np.mean(predictions, axis=1) 
    

    Averaged base models score

    We just average four models here ENet, GBoost, KRR and lasso. Of course we could easily add more models in the mix.

    averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))
    
    score = rmsle_cv(averaged_models)
    print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
    

    Wow ! It seems even the simplest stacking approach really improve the score . This encourages us to go further and explore a less simple stacking approch.

    Less simple Stacking : Adding a Meta-model

    In this approach, we add a meta-model on averaged base models and use the out-of-folds predictions of these base models to train our meta-model.

    The procedure, for the training part, may be described as follows:

    1. Split the total training set into two disjoint sets (here train and .holdout )

    2. Train several base models on the first part (train)

    3. Test these base models on the second part (holdout)

    4. Use the predictions from 3) (called out-of-folds predictions) as the inputs, and the correct responses (target variable) as the outputs to train a higher level learner called meta-model.

    The first three steps are done iteratively . If we take for example a 5-fold stacking , we first split the training data into 5 folds. Then we will do 5 iterations. In each iteration, we train every base model on 4 folds and predict on the remaining fold (holdout fold).

    So, we will be sure, after 5 iterations , that the entire data is used to get out-of-folds predictions that we will then use as new feature to train our meta-model in the step 4.

    For the prediction part , We average the predictions of all base models on the test data and used them as meta-features on which, the final prediction is done with the meta-model.

    QBuDOjs.jpg image5.gif

    On this gif, the base models are algorithms 0, 1, 2 and the meta-model is algorithm 3. The entire training dataset is A+B (target variable y known) that we can split into train part (A) and holdout part (B). And the test dataset is C.

    B1 (which is the prediction from the holdout part) is the new feature used to train the meta-model 3 and C1 (which is the prediction from the test dataset) is the meta-feature on which the final prediction is done.

    Stacking averaged Models Class

    class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
        def __init__(self, base_models, meta_model, n_folds=5):
            self.base_models = base_models
            self.meta_model = meta_model
            self.n_folds = n_folds
       
        # We again fit the data on clones of the original models
        def fit(self, X, y):
            self.base_models_ = [list() for x in self.base_models]
            self.meta_model_ = clone(self.meta_model)
            kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
            
            # Train cloned base models then create out-of-fold predictions
            # that are needed to train the cloned meta-model
            out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
            for i, model in enumerate(self.base_models):
                for train_index, holdout_index in kfold.split(X, y):
                    instance = clone(model)
                    self.base_models_[i].append(instance)
                    instance.fit(X[train_index], y[train_index])
                    y_pred = instance.predict(X[holdout_index])
                    out_of_fold_predictions[holdout_index, i] = y_pred
                    
            # Now train the cloned  meta-model using the out-of-fold predictions as new feature
            self.meta_model_.fit(out_of_fold_predictions, y)
            return self
       
        #Do the predictions of all base models on the test data and use the averaged predictions as 
        #meta-features for the final prediction which is done by the meta-model
        def predict(self, X):
            meta_features = np.column_stack([
                np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
                for base_models in self.base_models_ ])
            return self.meta_model_.predict(meta_features)
    

    Stacking Averaged models Score

    To make the two approaches comparable (by using the same number of models) , we just average Enet KRR and Gboost, then we add lasso as meta-model.

    stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                     meta_model = lasso)
    
    score = rmsle_cv(stacked_averaged_models)
    print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
    

    We get again a better score by adding a meta learner

    Ensembling StackedRegressor, XGBoost and LightGBM

    We add XGBoost and LightGBM to the StackedRegressor defined previously.

    We first define a rmsle evaluation function

    def rmsle(y, y_pred):
        return np.sqrt(mean_squared_error(y, y_pred))
    

    Final Training and Prediction

    StackedRegressor:

    stacked_averaged_models.fit(train.values, y_train)
    stacked_train_pred = stacked_averaged_models.predict(train.values)
    stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
    print(rmsle(y_train, stacked_train_pred))
    

    XGBoost:

    model_xgb.fit(train, y_train)
    xgb_train_pred = model_xgb.predict(train)
    xgb_pred = np.expm1(model_xgb.predict(test))
    print(rmsle(y_train, xgb_train_pred))
    

    LightGBM:

    model_lgb.fit(train, y_train)
    lgb_train_pred = model_lgb.predict(train)
    lgb_pred = np.expm1(model_lgb.predict(test.values))
    print(rmsle(y_train, lgb_train_pred))
    

    RMSE on the entire Train data when averagin

    print('RMSLE score on train data:')
    print(rmsle(y_train,stacked_train_pred*0.70 +
                   xgb_train_pred*0.15 + lgb_train_pred*0.15 ))
    

    Ensemble prediction:

    ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15
    

    Submission

    sub = pd.DataFrame()
    sub['Id'] = test_ID
    sub['SalePrice'] = ensemble
    sub.to_csv('submission.csv',index=False)
    

    If you found this notebook helpful or you just liked it , some upvotes would be very much appreciated - That will keep me motivated to update it on a regular basis :-)

    相关文章

      网友评论

        本文标题:180605_Stacked_Regression

        本文链接:https://www.haomeiwen.com/subject/tecpsftx.html