A standardized machine learning workflow from the official Kaggle site.
Now it's your turn to test your new knowledge of missing values handling. You'll probably find it makes a big difference.
Setup
The questions will give you feedback on your work. Run the following cell to set up the feedback system.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex2 import *
print("Setup Complete")
In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.
Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)
You can already see a few missing values in the first several rows. In the next step, you'll obtain a more comprehensive understanding of the missing values in the dataset.
Step 1: Preliminary investigation
Run the code cell below without changes.
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
(1168, 36)
LotFrontage 212
MasVnrArea 6
GarageYrBlt 58
dtype: int64
Part A
Use the above output to answer the questions below.
# Fill in the line below: How many rows are in the training data?
num_rows = 1168
# Fill in the line below: How many columns in the training data have missing values?
num_cols_with_missing = 3
# Fill in the line below: How many missing entries are contained in
# all of the training data?
tot_missing = 276
# Check your answers
step_1.a.check()
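The fill-in answers above can also be computed directly from the missing-value summary rather than read off the printed output. A minimal self-contained sketch of the same computation on a toy DataFrame (column names borrowed from the exercise; the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train (values are illustrative only)
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "MasVnrArea":  [196.0, 0.0, np.nan, 350.0],
    "YearBuilt":   [2003, 1976, 2001, 1915],
})

# Rows in the training data
num_rows = df.shape[0]

# Missing values per column, then columns with any missing, then the total
missing_by_col = df.isnull().sum()
num_cols_with_missing = (missing_by_col > 0).sum()
tot_missing = missing_by_col.sum()

print(num_rows, num_cols_with_missing, tot_missing)  # 4 2 3
```

On the real X_train the same three expressions yield 1168, 3, and 276 (212 + 6 + 58).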
Part B
Considering your answers above, what do you think is likely the best approach to dealing with the missing values?
Given the characteristics of the data, which strategy for handling missing values should we choose?
Does the dataset have many missing values, or only a small fraction? If we simply discard the missing values, will we lose a large amount of useful information?
This dataset has 1168 rows and 36 columns; the missing entries are spread across 3 columns, for a total of 276 missing values.
Since missing values are relatively rare here (even the column with the most missing values is missing fewer than 20% of its entries: 212 < 1168 × 20%), we can expect that dropping columns will not work well, because we would throw away a lot of valuable data. Imputation is therefore likely the better approach.
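As a preview of the imputation approach discussed above, here is a minimal sketch using scikit-learn's SimpleImputer on an invented two-column frame (column names mirror the exercise; the values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a missing entry in each column
X = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0],
                  "GarageYrBlt": [2000.0, 1995.0, np.nan, 2005.0]})

# Replace each missing value with the column mean
imputer = SimpleImputer(strategy="mean")
X_imputed = pd.DataFrame(imputer.fit_transform(X),
                         columns=X.columns, index=X.index)

print(X_imputed)  # NaNs replaced by 70.0 and 2000.0 respectively
```

Note that fit_transform returns a NumPy array, so the column names and index must be restored by hand, as in the tutorial.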
To compare different approaches to dealing with missing values, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) of a random forest model.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
Step 2: Drop columns with missing values
In this step, you'll preprocess the data in X_train and X_valid to remove columns with missing values. Set the preprocessed DataFrames to reduced_X_train and reduced_X_valid, respectively.
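One way to approach this step, sketched on a toy train/validation split (variable names follow the exercise; the toy data is invented):

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the exercise's X_train / X_valid
X_train = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.0],
                        "YearBuilt":   [2003, 1976, 2001]})
X_valid = pd.DataFrame({"LotFrontage": [60.0, 70.0],
                        "YearBuilt":   [1990, 1960]})

# Identify columns with any missing values in the TRAINING data only,
# then drop those columns from both splits so they stay aligned
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print(list(reduced_X_train.columns))  # ['YearBuilt']
```

The column list is computed from X_train alone so that the validation set gets exactly the same columns, even if its missing values fall elsewhere.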
To be continued