Kaggle|Exercise6:Missing Values[

Author: 十二支箭 | Published 2020-04-15 23:53

    A standardized machine learning workflow from the official Kaggle site.
    Now it's your turn to test your new knowledge of missing values handling. You'll probably find it makes a big difference.

    Setup

    The questions will give you feedback on your work. Run the following cell to set up the feedback system.

    # Set up code checking
    import os
    if not os.path.exists("../input/train.csv"):
        os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
        os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
    from learntools.core import binder
    binder.bind(globals())
    from learntools.ml_intermediate.ex2 import *
    print("Setup Complete")
    

    In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.

    Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    # Read the data
    X_full = pd.read_csv('../input/train.csv', index_col='Id')
    X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
    
    # Remove rows with missing target, separate target from predictors
    X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
    y = X_full.SalePrice
    X_full.drop(['SalePrice'], axis=1, inplace=True)
    
    # To keep things simple, we'll use only numerical predictors
    X = X_full.select_dtypes(exclude=['object'])
    X_test = X_test_full.select_dtypes(exclude=['object'])
    
    # Break off validation set from training data
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                          random_state=0)
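
The split sizes can be checked without the data files. The sketch below is only an illustration: it assumes the full training file has 1460 rows (a figure not stated in this post) and splits a plain list of row positions with the same parameters as the cell above.

```python
from sklearn.model_selection import train_test_split

# Assumption: the full training file has 1460 rows (not stated in this post)
n_rows = 1460
row_ids = list(range(n_rows))

# Same split parameters as in the cell above
train_ids, valid_ids = train_test_split(row_ids, train_size=0.8, test_size=0.2,
                                        random_state=0)

# Number of rows that end up in the training and validation parts
print(len(train_ids), len(valid_ids))
```

Under that assumption, the training part has the same number of rows as the X_train shape reported in Step 1.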
    

    Printing the first several rows (for example with X_train.head()) already shows a few missing values. In the next step, you'll obtain a more comprehensive understanding of the missing values in the dataset.

    Step 1: Preliminary investigation

    Run the code cell below without changes.

    # Shape of training data (num_rows, num_columns)
    print(X_train.shape)
    
    # Number of missing values in each column of training data
    missing_val_count_by_column = (X_train.isnull().sum())
    print(missing_val_count_by_column[missing_val_count_by_column > 0])
    
    (1168, 36)
    LotFrontage    212
    MasVnrArea       6
    GarageYrBlt     58
    dtype: int64
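
To see what the isnull().sum() pattern does, here is a minimal sketch on a made-up DataFrame. The column names a, b, c and the values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are made up
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# isnull() marks each missing cell as True; sum() counts the True values per column
toy_missing = toy.isnull().sum()

# Keep only the columns that have at least one missing value
print(toy_missing[toy_missing > 0])
```

Column b has no missing values, so the boolean filter leaves it out of the printed result.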
    

    Part A

    Use the above output to answer the questions below.

    # Fill in the line below: How many rows are in the training data?
    num_rows = 1168
    
    # Fill in the line below: How many columns in the training data
    # have missing values?
    num_cols_with_missing = 3
    
    # Fill in the line below: How many missing entries are contained in 
    # all of the training data?
    tot_missing = 276
    
    # Check your answers
    step_1.a.check()
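
The three answers can also be derived rather than typed in by hand. The sketch below copies the printed shape and per-column counts into a small Series (the variable names shape and missing_counts are made up here) and recomputes each answer.

```python
import pandas as pd

# Values copied from the printed output of Step 1
shape = (1168, 36)
missing_counts = pd.Series({"LotFrontage": 212, "MasVnrArea": 6, "GarageYrBlt": 58})

num_rows = shape[0]                          # rows in the training data
num_cols_with_missing = len(missing_counts)  # columns with at least one missing value
tot_missing = int(missing_counts.sum())      # missing entries over all columns

print(num_rows, num_cols_with_missing, tot_missing)
```

In the notebook itself, the same quantities could presumably be read from X_train.shape[0] and from missing_val_count_by_column instead of hard-coded copies.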
    

    Part B

    Considering your answers above, what do you think is likely the best approach to dealing with the missing values?
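    How should a strategy for handling missing values be chosen for a given dataset?
    Does the dataset have many missing values, or only a few? Would ignoring the missing values throw away a lot of useful information?

    This training set has 1168 rows and 36 columns; the missing values fall in 3 columns, for a total of 276 missing entries.

    Relatively few values are missing: even the column with the most missing entries is missing less than 20% of its values (212 < 1168 × 20% = 233.6). Dropping those columns is therefore unlikely to work well, because it would discard a lot of valuable data, so imputation will probably perform better.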
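
The 20% comparison can be made explicit as a rough check. The sketch below reuses the counts printed in Step 1 and computes the share of missing entries per column and over the whole table; the variable names are made up for the illustration.

```python
import pandas as pd

# Values copied from the printed output of Step 1
num_rows, num_cols = 1168, 36
missing_counts = pd.Series({"LotFrontage": 212, "MasVnrArea": 6, "GarageYrBlt": 58})

# Share of missing entries in each affected column
missing_share = missing_counts / num_rows
print((missing_share * 100).round(1))

# Share of missing entries over every cell of the training data
overall_share = missing_counts.sum() / (num_rows * num_cols)
print(round(overall_share * 100, 2))
```

Even the worst column stays below the 20% mark, and missing cells are well under 1% of the whole table, which is the basis for preferring imputation over dropping columns here.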

    To compare different approaches to dealing with missing values, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    
    # Function for comparing different approaches
    def score_dataset(X_train, X_valid, y_train, y_valid):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X_train, y_train)
        preds = model.predict(X_valid)
        return mean_absolute_error(y_valid, preds)
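
As a usage illustration only, the sketch below calls score_dataset() on synthetic data. The function is repeated so the block runs on its own; the toy features area and rooms, the target formula, and all toy_ names are hypothetical and have nothing to do with the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Same definition as in the exercise, repeated so this sketch is self-contained
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Hypothetical toy data: two numeric features and a noisy linear target
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"area": rng.uniform(50, 250, 200),
                      "rooms": rng.integers(1, 8, 200)})
toy_y = 1000 * toy_X["area"] + 5000 * toy_X["rooms"] + rng.normal(0, 10000, 200)

tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, train_size=0.8,
                                          test_size=0.2, random_state=0)

# Lower MAE means the predictions are closer to the true values on average
toy_mae = score_dataset(tr_X, va_X, tr_y, va_y)
print("MAE on toy data:", toy_mae)
```

Because the model and its random_state are fixed inside the function, differences in MAE between two runs come from how the input data was preprocessed, which is what makes the function useful for comparing approaches.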
    

    Step 2: Drop columns with missing values

    In this step, you'll preprocess the data in X_train and X_valid to remove columns with missing values. Set the preprocessed DataFrames to reduced_X_train and reduced_X_valid, respectively.
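
The post stops before giving a solution to this step. One common way to drop columns with missing values is sketched below on tiny made-up frames; toy_train and toy_valid are hypothetical stand-ins for X_train and X_valid, and this is not necessarily the answer the exercise checker expects.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train and X_valid
toy_train = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                          "LotFrontage": [65.0, np.nan, 68.0],
                          "YearBuilt": [2003, 1976, 2001]})
toy_valid = pd.DataFrame({"LotArea": [9550, 14260],
                          "LotFrontage": [60.0, 84.0],
                          "YearBuilt": [1915, 2000]})

# Names of the columns that have at least one missing value in the training frame
cols_with_missing = [col for col in toy_train.columns
                     if toy_train[col].isnull().any()]

# Drop the same columns from both frames so their columns stay aligned
reduced_toy_train = toy_train.drop(cols_with_missing, axis=1)
reduced_toy_valid = toy_valid.drop(cols_with_missing, axis=1)

print(list(reduced_toy_train.columns))
```

The columns are chosen from the training frame only and then removed from both frames, so the model sees the same features at fit time and at prediction time.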

    To be continued


Article link: https://www.haomeiwen.com/subject/yrjcvhtx.html