A standardized machine learning workflow from the official Kaggle site.
Now it's your turn to test your new knowledge of missing values handling. You'll probably find it makes a big difference.
Setup
The questions will give you feedback on your work. Run the following cell to set up the feedback system.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex2 import *
print("Setup Complete")
In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.
Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)
You can already see a few missing values in the first several rows. In the next step, you'll obtain a more comprehensive understanding of the missing values in the dataset.
Step 1: Preliminary investigation
Run the code cell below without changes.
# Shape of training data (num_rows, num_columns)
print(X_train.shape)
# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
(1168, 36)
LotFrontage 212
MasVnrArea 6
GarageYrBlt 58
dtype: int64
Part A
Use the above output to answer the questions below.
# Fill in the line below: How many rows are in the training data?
num_rows = 1168
# Fill in the line below: How many columns in the training data have missing values?
num_cols_with_missing = 3
# Fill in the line below: How many missing entries are contained in
# all of the training data?
tot_missing = 276
# Check your answers
step_1.a.check()
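The fill-in answers above can also be computed directly from the missing-value summary rather than read off the printed output. A minimal self-contained sketch of the same computation on a toy DataFrame (column names borrowed from the exercise; the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X_train (values are illustrative only)
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "MasVnrArea":  [196.0, 0.0, np.nan, 350.0],
    "YearBuilt":   [2003, 1976, 2001, 1915],
})

# Rows in the training data
num_rows = df.shape[0]

# Missing values per column, then columns with any missing, then the total
missing_by_col = df.isnull().sum()
num_cols_with_missing = (missing_by_col > 0).sum()
tot_missing = missing_by_col.sum()

print(num_rows, num_cols_with_missing, tot_missing)  # 4 2 3
```

On the real X_train the same three expressions yield 1168, 3, and 276 (212 + 6 + 58).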
Part B
Considering your answers above, what do you think is likely the best approach to dealing with the missing values?
Given the characteristics of the data, which strategy for handling missing values should we choose?
Does the dataset have many missing values, or only a small fraction? If we simply discard the missing values, will we lose a large amount of useful information?
This dataset has 1168 rows and 36 columns; the missing entries are spread across 3 columns, for a total of 276 missing values.
Since missing values are relatively rare here (even the column with the most missing values is missing fewer than 20% of its entries: 212 < 1168 × 20%), we can expect that dropping columns will not work well, because we would throw away a lot of valuable data. Imputation is therefore likely the better approach.
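As a preview of the imputation approach discussed above, here is a minimal sketch using scikit-learn's SimpleImputer on an invented two-column frame (column names mirror the exercise; the values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with a missing entry in each column
X = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0],
                  "GarageYrBlt": [2000.0, 1995.0, np.nan, 2005.0]})

# Replace each missing value with the column mean
imputer = SimpleImputer(strategy="mean")
X_imputed = pd.DataFrame(imputer.fit_transform(X),
                         columns=X.columns, index=X.index)

print(X_imputed)  # NaNs replaced by 70.0 and 2000.0 respectively
```

Note that fit_transform returns a NumPy array, so the column names and index must be restored by hand, as in the tutorial.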
To compare different approaches to dealing with missing values, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) of a random forest model.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
Step 2: Drop columns with missing values
In this step, you'll preprocess the data in X_train and X_valid to remove columns with missing values. Set the preprocessed DataFrames to reduced_X_train and reduced_X_valid, respectively.
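One way to approach this step, sketched on a toy train/validation split (variable names follow the exercise; the toy data is invented):

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the exercise's X_train / X_valid
X_train = pd.DataFrame({"LotFrontage": [65.0, np.nan, 68.0],
                        "YearBuilt":   [2003, 1976, 2001]})
X_valid = pd.DataFrame({"LotFrontage": [60.0, 70.0],
                        "YearBuilt":   [1990, 1960]})

# Identify columns with any missing values in the TRAINING data only,
# then drop those columns from both splits so they stay aligned
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print(list(reduced_X_train.columns))  # ['YearBuilt']
```

The column list is computed from X_train alone so that the validation set gets exactly the same columns, even if its missing values fall elsewhere.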
To be continued