Kaggle | Exercise 7: Categorical Variables

Author: 十二支箭 | Published 2020-04-21 01:13

    By encoding categorical variables, you'll obtain your best results thus far!

    Setup

    The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

    # Set up code checking
    import os
    if not os.path.exists("../input/train.csv"):
        os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
        os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
    from learntools.core import binder
    binder.bind(globals())
    from learntools.ml_intermediate.ex3 import *
    print("Setup Complete")
    

    In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.

    Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    # Read the data
    X = pd.read_csv('../input/train.csv', index_col='Id') 
    X_test = pd.read_csv('../input/test.csv', index_col='Id')
    
    # Remove rows with missing target, separate target from predictors
    X.dropna(axis=0, subset=['SalePrice'], inplace=True)
    y = X.SalePrice
    X.drop(['SalePrice'], axis=1, inplace=True)
    
    # To keep things simple, we'll drop columns with missing values
    cols_with_missing = [col for col in X.columns if X[col].isnull().any()] 
    X.drop(cols_with_missing, axis=1, inplace=True)
    X_test.drop(cols_with_missing, axis=1, inplace=True)
    
    # Break off validation set from training data
    X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                          train_size=0.8, test_size=0.2,
                                                          random_state=0)
    

    Notice that the dataset contains both numerical and categorical variables. You'll need to encode the categorical data before training a model.

    To compare different models, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    
    # function for comparing different approaches
    def score_dataset(X_train, X_valid, y_train, y_valid):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X_train, y_train)
        preds = model.predict(X_valid)
        return mean_absolute_error(y_valid, preds)
    
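    As a quick sanity check on what score_dataset() returns: the MAE is just the average absolute difference between predictions and true values. A toy example, separate from the exercise:

    from sklearn.metrics import mean_absolute_error
    
    # |100 - 90| = 10 and |200 - 220| = 20, so the mean absolute error is 15
    print(mean_absolute_error([100, 200], [90, 220]))  # 15.0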

    Step 1: Drop columns with categorical data

    You'll get started with the most straightforward approach. Use the code cell below to preprocess the data in X_train and X_valid to remove columns with categorical data. Set the preprocessed DataFrames to drop_X_train and drop_X_valid, respectively.

    # Fill in the lines below: drop columns in training and validation data
    drop_X_train = X_train.select_dtypes(exclude=['object'])
    drop_X_valid = X_valid.select_dtypes(exclude=['object'])
    
    # Check your answers
    step_1.check()
    

    Hint: Use the select_dtypes() method to drop all columns with the object dtype.

    DataFrame.select_dtypes(include=None, exclude=None)
    
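    A toy illustration of the hint, on a made-up two-column frame:

    import pandas as pd
    
    # 'Street' holds strings (object dtype); 'LotArea' is numeric
    df = pd.DataFrame({'Street': ['Pave', 'Grvl'], 'LotArea': [8450, 9600]})
    print(df.select_dtypes(exclude=['object']).columns.tolist())  # ['LotArea']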

    Run the next code cell to get the MAE for this approach:

    print("MAE from Approach 1 (Drop categorical variables):")
    print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
    

    Step 2: Label encoding

    Before jumping into label encoding, we'll investigate the dataset. Specifically, we'll look at the 'Condition2' column. The code cell below prints the unique entries in both the training and validation sets.

    print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
    print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())
    
    # Output
    Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']
    Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']
    

    Here, .unique() is the pandas method that returns the distinct values in a column (much like converting a list to a set): it outputs the categories a variable contains, without repeats.
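
    A toy illustration of .unique() and its counting sibling .nunique() (used later in Step 3):

    import pandas as pd
    
    s = pd.Series(['Norm', 'Feedr', 'Norm', 'PosN'])
    print(s.unique())   # ['Norm' 'Feedr' 'PosN'] -- distinct values, in order of first appearance
    print(s.nunique())  # 3 -- the number of distinct values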

    If you now write code to:

    • fit a label encoder to the training data, and then
    • use it to transform both the training and validation data,

    you'll get an error. Can you see why this is the case? (You'll need to use the above output to answer this question.)
    If you fit and transform a label encoder right now, you'll get an error. Why?
    (Hint: are there values that appear in the validation set but not in the training set?)
    Fitting a label encoder to a column in the training data creates an integer label for each unique category that appears in that column. If the validation data contains categories that do not appear in the training data, the encoder will throw an error, because those categories were never assigned an integer. Notice that the 'Condition2' column in the validation data contains the values 'RRAn' and 'RRNn', which do not appear in the training data; thus, if we try to use a label encoder with scikit-learn, the code will raise an error.
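
    A minimal reproduction of that failure, on toy data rather than the actual columns:

    from sklearn.preprocessing import LabelEncoder
    
    encoder = LabelEncoder()
    encoder.fit(['Norm', 'PosA', 'Feedr'])   # categories seen during training
    encoder.transform(['Norm', 'RRAn'])      # ValueError: y contains previously unseen labels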

    This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue. For instance, you can write a custom label encoder to deal with new categories. The simplest approach, however, is to drop the problematic categorical columns.
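
    For instance, here is a minimal sketch of such a custom encoder; the class name SafeLabelEncoder and the -1 fallback are choices made for this illustration, not anything from scikit-learn:

    import pandas as pd
    
    class SafeLabelEncoder:
        """Toy label encoder that maps categories unseen at fit time to -1."""
        def fit(self, values):
            # Assign an integer to each unique training category
            self.mapping_ = {cat: i for i, cat in enumerate(sorted(set(values)))}
            return self
        def transform(self, values):
            # Unseen categories fall back to -1 instead of raising an error
            return pd.Series(values).map(lambda v: self.mapping_.get(v, -1)).to_numpy()
    
    enc = SafeLabelEncoder().fit(['Norm', 'Feedr'])
    print(enc.transform(['Norm', 'RRAn']))  # [ 1 -1] -- 'RRAn' was never seen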

    Run the code cell below to save the problematic columns to a Python list bad_label_cols. Likewise, columns that can be safely label encoded are stored in good_label_cols.

    # All categorical columns (the classic [x for x in ... if ...] comprehension again)
    object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
    
    # Columns that can be safely label encoded: compare the actual values, so think in sets
    good_label_cols = [col for col in object_cols if 
                       set(X_train[col]) == set(X_valid[col])]
            
    # Problematic columns that will be dropped from the dataset: all categorical columns minus the good ones (set difference)
    bad_label_cols = list(set(object_cols)-set(good_label_cols))
            
    print('Categorical columns that will be label encoded:', good_label_cols)
    print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
    

    (Less than half written at this point; to be continued.)

    Use the next code cell to label encode the data in X_train and X_valid. Set the preprocessed DataFrames to label_X_train and label_X_valid, respectively.

    • We have provided code below to drop the categorical columns in bad_label_cols from the dataset.
    • You should label encode the categorical columns in good_label_cols.
      Now we apply label encoding; here the columns in bad_label_cols are simply dropped from the dataset.
    from sklearn.preprocessing import LabelEncoder
    
    # Drop categorical columns that will not be encoded
    label_X_train = X_train.drop(bad_label_cols, axis=1)
    label_X_valid = X_valid.drop(bad_label_cols, axis=1)
    
    # Apply label encoder (note the loop: each column is fit and transformed in turn)
    label_encoder = LabelEncoder()
    for col in good_label_cols:
        label_X_train[col] = label_encoder.fit_transform(label_X_train[col])
        label_X_valid[col] = label_encoder.transform(label_X_valid[col])
        
    # Check your answer
    step_2.b.check()
    

    Run the next code cell to get the MAE for this approach.

    print("MAE from Approach 2 (Label Encoding):") 
    print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
    
    MAE from Approach 2 (Label Encoding):
    17575.291883561644
    

    Step 3: Investigating cardinality

    So far, you've tried two different approaches to dealing with categorical variables. And, you've seen that encoding categorical data yields better results than removing columns from the dataset.

    Soon, you'll try one-hot encoding. Before then, there's one additional topic we need to cover. Begin by running the next code cell without changes.

    # Get number of unique entries in each column with categorical data
    object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
    d = dict(zip(object_cols, object_nunique))
    
    # Print number of unique entries by column, in ascending order
    sorted(d.items(), key=lambda x: x[1])
    
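    The map/zip one-liner above can be written more plainly as a dict comprehension; both build the same column-to-cardinality mapping:

    # Equivalent to the two lines above
    d = {col: X_train[col].nunique() for col in object_cols}
    sorted(d.items(), key=lambda x: x[1])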

    .nunique() is the pandas method that counts a column's unique values, i.e. how many distinct categories it contains.

    Output:

    [('Street', 2),
     ('Utilities', 2),
     ('CentralAir', 2),
     ('LandSlope', 3),
     ('PavedDrive', 3),
     ('LotShape', 4),
     ('LandContour', 4),
     ('ExterQual', 4),
     ('KitchenQual', 4),
     ('MSZoning', 5),
     ('LotConfig', 5),
     ('BldgType', 5),
     ('ExterCond', 5),
     ('HeatingQC', 5),
     ('Condition2', 6),
     ('RoofStyle', 6),
     ('Foundation', 6),
     ('Heating', 6),
     ('Functional', 6),
     ('SaleCondition', 6),
     ('RoofMatl', 7),
     ('HouseStyle', 8),
     ('Condition1', 9),
     ('SaleType', 9),
     ('Exterior1st', 15),
     ('Exterior2nd', 16),
     ('Neighborhood', 25)]
    

    The output above shows, for each column with categorical data, the number of unique values in the column. For instance, the 'Street' column in the training data has two unique values: 'Grvl' and 'Pave', corresponding to a gravel road and a paved road, respectively.

    We refer to the number of unique entries of a categorical variable as the cardinality of that categorical variable. For instance, the 'Street' variable has cardinality 2.

    Use the output above to answer the questions below.

    # Fill in the line below: How many categorical variables in the training data
    # have cardinality greater than 10?
    high_cardinality_numcols = 3
    
    # Fill in the line below: How many columns are needed to one-hot encode the 
    # 'Neighborhood' variable in the training data?
    num_cols_neighborhood = 25
    
    # Check your answers
    step_3.a.check()
    

    To one-hot encode a variable, we need one column for each unique entry.

    For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset. For this reason, we typically will only one-hot encode columns with relatively low cardinality. Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.

    As an example, consider a dataset with 10,000 rows, and containing one categorical column with 100 unique entries.

    • If this column is replaced with the corresponding one-hot encoding, how many entries are added to the dataset?
    • If we instead replace the column with the label encoding, how many entries are added?
      Use your answers to fill in the lines below.

    Hint: To calculate how many entries are added to the dataset through the one-hot encoding, begin by calculating how many entries are needed to encode the categorical variable (by multiplying the number of rows by the number of columns in the one-hot encoding). Then, to obtain how many entries are added to the dataset, subtract the number of entries in the original column. Here that is 10,000 rows × 100 one-hot columns = 1,000,000 entries, minus the 10,000 entries in the original column, giving 990,000 entries added. Label encoding replaces each value in place, so it adds 0 entries.

    # Fill in the line below: How many entries are added to the dataset by 
    # replacing the column with a one-hot encoding?
    OH_entries_added = 1e4*100-1e4
    
    # Fill in the line below: How many entries are added to the dataset by
    # replacing the column with a label encoding?
    label_entries_added = 0
    
    # Check your answers
    step_3.b.check()
    

    Step 4: One-hot encoding

    In this step, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.

    Run the code cell below without changes to set low_cardinality_cols to a Python list containing the columns that will be one-hot encoded. Likewise, high_cardinality_cols contains a list of categorical columns that will be dropped from the dataset.

    # Columns that will be one-hot encoded
    low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
    
    # Columns that will be dropped from the dataset
    high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
    
    print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
    print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
    
    Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']
    
    Categorical columns that will be dropped from the dataset: ['Neighborhood', 'Exterior2nd', 'Exterior1st']
    

    Use the next code cell to one-hot encode the data in X_train and X_valid. Set the preprocessed DataFrames to OH_X_train and OH_X_valid, respectively.

    • The full list of categorical columns in the dataset can be found in the Python list object_cols.
    • You should only one-hot encode the categorical columns in low_cardinality_cols. All other categorical columns should be dropped from the dataset.

    The next code cell is VERY IMPORTANT!

    from sklearn.preprocessing import OneHotEncoder
    
    # Use as many lines of code as you need!
    # Apply one-hot encoder to each column with categorical data
    # (fit on the training columns, then reuse the fitted encoder on validation)
    onehotencoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
    OH_cols_train = pd.DataFrame(onehotencoder.fit_transform(X_train[low_cardinality_cols]))
    OH_cols_valid = pd.DataFrame(onehotencoder.transform(X_valid[low_cardinality_cols]))
    
    # One-hot encoding removed the index; put it back
    OH_cols_train.index = X_train.index
    OH_cols_valid.index = X_valid.index
    
    # Remove all categorical columns (they will be replaced with the one-hot
    # encoding); what remains are the numeric columns
    num_cols_train = X_train.drop(object_cols, axis=1)
    num_cols_valid = X_valid.drop(object_cols, axis=1)
    
    # Add one-hot encoded columns to numerical features (note how pd.concat
    # stitches the two DataFrames together side by side with axis=1)
    OH_X_train = pd.concat([OH_cols_train, num_cols_train], axis=1)
    OH_X_valid = pd.concat([OH_cols_valid, num_cols_valid], axis=1)
    
    # Check your answer
    step_4.check()
    

    The code cell above is VERY IMPORTANT!

    Run the next code cell to get the MAE for this approach.

    print("MAE from Approach 3 (One-Hot Encoding):") 
    print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
    
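    One detail in the Step 4 cell deserves a closer look: handle_unknown='ignore'. A toy demonstration of what it buys us (note that newer scikit-learn releases rename the sparse flag to sparse_output):

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    
    # An encoder fitted on two categories produces two columns; a category it
    # has never seen encodes to an all-zeros row instead of raising an error.
    enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
    enc.fit(pd.DataFrame({'Condition2': ['Norm', 'Feedr']}))
    print(enc.transform(pd.DataFrame({'Condition2': ['RRAn']})))  # [[0. 0.]]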

    Step 5: Generate test predictions and submit your results

    After you complete Step 4, if you'd like to use what you've learned to submit your results to the leaderboard, you'll need to preprocess the test data before generating predictions.

    This step is completely optional, and you do not need to submit results to the leaderboard to successfully complete the exercise.

    Check out the previous exercise if you need help with remembering how to join the competition or save your results to CSV. Once you have generated a file with your results, follow the instructions below:

    Test-set preprocessing code: to be continued.
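
    In the meantime, here is one possible sketch of that preprocessing, reusing the onehotencoder fitted in Step 4. The mode/mean fills are a convenience chosen for this sketch, because the test set still has missing values in columns that were complete in the training data:

    from sklearn.ensemble import RandomForestRegressor
    
    # One-hot encode the test categoricals with the SAME fitted encoder;
    # fill categorical NaNs with each column's most frequent value first
    X_test_cat = X_test[low_cardinality_cols].fillna(X_test[low_cardinality_cols].mode().iloc[0])
    OH_cols_test = pd.DataFrame(onehotencoder.transform(X_test_cat))
    OH_cols_test.index = X_test.index
    
    # Keep the numeric columns, filling remaining NaNs with column means
    num_cols_test = X_test.drop(object_cols, axis=1)
    num_cols_test = num_cols_test.fillna(num_cols_test.mean())
    OH_X_test = pd.concat([OH_cols_test, num_cols_test], axis=1)
    
    # Fit on the preprocessed training data and predict on the test set
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(OH_X_train, y_train)
    preds_test = model.predict(OH_X_test)
    
    # Save predictions in the competition's submission format
    output = pd.DataFrame({'Id': OH_X_test.index, 'SalePrice': preds_test})
    output.to_csv('submission.csv', index=False)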

    1. Begin by clicking on the blue Save Version button in the top right corner of this window. This will generate a pop-up window.
    2. Ensure that the Save and Run All option is selected, and then click on the blue Save button.
    3. This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (...) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
    4. Click on the Output tab on the right of the screen. Then, click on the Submit to Competition button to submit your results to the leaderboard.

    You have now successfully submitted to the competition!

    If you want to keep working to improve your performance, select the blue Edit button in the top right of the screen. Then you can change your model and repeat the process. There's a lot of room to improve your model, and you will climb up the leaderboard as you work.

    Keep going

    With missing value handling and categorical encoding, your modeling process is getting complex. This complexity gets worse when you want to save your model to use in the future. The key to managing this complexity is something called pipelines.

    Learn to use pipelines to preprocess datasets with categorical variables, missing values and any other messiness your data throws at you.
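
    As a small preview, here is a sketch under this notebook's assumptions: it reuses X_train, y_train, X_valid, y_valid, and low_cardinality_cols from above, and skips an imputer only because the columns with missing values were already dropped:

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder
    
    # Keep only the columns we intend to use: low-cardinality categoricals
    # plus the numeric features
    numeric_cols = [col for col in X_train.columns if X_train[col].dtype != 'object']
    my_cols = low_cardinality_cols + numeric_cols
    
    # One object now owns the whole preprocess-then-model flow
    pipeline = Pipeline(steps=[
        ('preprocess', ColumnTransformer(
            transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'),
                           low_cardinality_cols)],
            remainder='passthrough')),   # numeric columns pass through unchanged
        ('model', RandomForestRegressor(n_estimators=100, random_state=0)),
    ])
    pipeline.fit(X_train[my_cols], y_train)
    print(mean_absolute_error(y_valid, pipeline.predict(X_valid[my_cols])))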
