Kaggle|Exercise8|Pipelines

Author: 十二支箭 | Published: 2020-04-24 01:06

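    The most convenient thing about a pipeline is that it wraps and manages every step as a single streamlined workflow (streaming workflows with pipelines), so the parameters learned during fitting can easily be reused on a new dataset, such as the test set.
    See https://zhuanlan.zhihu.com/p/42368821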
    In this exercise, you will use pipelines to improve the efficiency of your machine learning code.
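To make the "reuse learned parameters on new data" point concrete, here is a minimal sketch on toy data. The numbers are made up, and `LinearRegression` is swapped in only to keep the example small; the exercise itself uses a random forest.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy training data with one missing value (hypothetical numbers)
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

# The imputer learned its fill value from the training data only
learned_fill = pipe.named_steps["imputer"].statistics_[0]

# On new data, the missing entry is filled with the training mean,
# not with a statistic computed from the new data
X_new = np.array([[np.nan], [10.0]])
preds_new = pipe.predict(X_new)
```

Calling `pipe.predict` runs the already-fitted imputer and then the model, so no preprocessing code has to be repeated for the new data.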

    Setup

    The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

    # Set up code checking
    import os
    if not os.path.exists("../input/train.csv"):
        os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
        os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
    from learntools.core import binder
    binder.bind(globals())
    from learntools.ml_intermediate.ex4 import *
    print("Setup Complete")
    

    You will work with data from the Housing Prices Competition for Kaggle Learn Users.

    Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    # Read the data
    X_full = pd.read_csv('../input/train.csv', index_col='Id')
    X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
    
    # Remove rows with missing target, separate target from predictors
    X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
    y = X_full.SalePrice
    X_full.drop(['SalePrice'], axis=1, inplace=True)
    
    # Break off validation set from training data
    X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                    train_size=0.8, test_size=0.2,
                                                                    random_state=0)
    
    # "Cardinality" means the number of unique values in a column
    # Select categorical columns with relatively low cardinality (convenient but arbitrary)
    categorical_cols = [cname for cname in X_train_full.columns if
                        X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]
    
    # Select numerical columns
    numerical_cols = [cname for cname in X_train_full.columns if 
                    X_train_full[cname].dtype in ['int64', 'float64']]
    
    # Keep selected columns only
    my_cols = categorical_cols + numerical_cols
    X_train = X_train_full[my_cols].copy()
    X_valid = X_valid_full[my_cols].copy()
    X_test = X_test_full[my_cols].copy()
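The column selection above can be tried on a small hand-made table. This is only a sketch: the `OwnerName` column and the `max_unique` threshold of 3 are hypothetical (the exercise uses 10), and the string columns are created with an explicit `object` dtype so the `dtype == "object"` check behaves the same across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    # 2 unique values: low cardinality
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype="object"),
    # 4 unique values: too many for the toy threshold below
    "OwnerName": pd.Series(["Ann", "Bob", "Cy", "Dee"], dtype="object"),
    # integer column
    "LotArea": [8450, 9600, 11250, 9550],
    # float column with a missing value
    "LotFrontage": [65.0, 80.0, None, 60.0],
})

max_unique = 3  # hypothetical threshold for this toy table

# Categorical columns with few distinct values
low_card_cols = [c for c in df.columns
                 if df[c].dtype == "object" and df[c].nunique() < max_unique]

# Numerical columns
num_cols = [c for c in df.columns if df[c].dtype in ["int64", "float64"]]
```

High-cardinality text columns such as `OwnerName` end up in neither list, which is why the exercise drops them before one-hot encoding.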
    

    The next code cell uses code from the tutorial to preprocess the data and train a model (data preprocessing and modeling). Run this code without changes.

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    
    # Preprocessing for numerical data
    numerical_transformer = SimpleImputer(strategy='constant')
    
    # Preprocessing for categorical data
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])  
    
    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
    
    # Define model 
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    
    # Bundle preprocessing and modeling code in a pipeline
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', model)
                         ])
    
    # Preprocessing of training data, fit model 
    clf.fit(X_train, y_train)
    
    # Preprocessing of validation data, get predictions
    preds = clf.predict(X_valid)
    
    print('MAE:', mean_absolute_error(y_valid, preds))
    
    # Output:
    MAE: 17861.780102739725
    

    The code yields a value around 17862 for the mean absolute error (MAE). In the next step, you will amend the code to do better.
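For reference, MAE is just the average of the absolute differences between predictions and true values. A small sketch with made-up prices, comparing a manual calculation with `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sale prices and predictions
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# Absolute errors are 10000, 10000 and 20000; MAE is their mean
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
```

An MAE of about 17862 therefore means the validation predictions are off by roughly that many dollars on average.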

    Step 1: Improve the performance

    Part A

    Now, it's your turn! In the code cell below, define your own preprocessing steps and random forest model. Fill in values for the following variables:

    • numerical_transformer
    • categorical_transformer
    • model

    To pass this part of the exercise, you need only define valid preprocessing steps and a random forest model.

    # Preprocessing for numerical data (author's note: isn't this just imputing missing values? Not sure what else is meant.)
    numerical_transformer = SimpleImputer(strategy='median') # Your code here: impute numerical variables with the median
    
    # Preprocessing for categorical data: two parts, imputation and encoding, which can be bundled in a pipeline
    categorical_transformer = Pipeline(steps=[
        ('imputer',SimpleImputer(strategy='most_frequent')),
        ('onehot',OneHotEncoder(handle_unknown='ignore',sparse=False))]) # Your code here: added sparse=False for dense output (the argument is named sparse_output in scikit-learn 1.2+)
    
    # Bundle preprocessing for numerical and categorical data with ColumnTransformer
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
    
    # Define model
    model = RandomForestRegressor(n_estimators=100, random_state=0) # Your code here
    
    # Check your answer
    step_1.a.check()
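The only change to the numerical preprocessing in Part A is the imputation strategy. A minimal sketch on a made-up column shows how `'median'` differs from the baseline's `'constant'`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numerical column with a missing value and an outlier (hypothetical numbers)
col = np.array([[1.0], [3.0], [np.nan], [100.0]])

# 'median' fills with the median of the observed values (1, 3, 100)
median_filled = SimpleImputer(strategy="median").fit_transform(col)

# 'constant' with the default fill_value fills numerical data with 0
constant_filled = SimpleImputer(strategy="constant").fit_transform(col)
```

Here the median fill is 3 while the constant fill is 0. Which one helps more depends on the data, so it is worth comparing the resulting MAE rather than assuming one is better.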
    

    TO BE CONTINUED====================

    Part B

    Run the code cell below without changes.

    To pass this step, you need to have defined a pipeline in Part A that achieves lower MAE than the code above. You're encouraged to take your time here and try out many different approaches, to see how low you can get the MAE! (If your code does not pass, please amend the preprocessing steps and model in Part A.)

    # Bundle preprocessing and modeling code in a pipeline
    my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('model', model)
                                 ])
    
    # Preprocessing of training data, fit model 
    my_pipeline.fit(X_train, y_train)
    
    # Preprocessing of validation data, get predictions
    preds = my_pipeline.predict(X_valid)
    
    # Evaluate the model
    score = mean_absolute_error(y_valid, preds)
    print('MAE:', score)
    
    # Check your answer
    step_1.b.check()
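One way to "try out many different approaches" is to wrap pipeline construction in a helper and loop over settings. The sketch below is an assumption-laden illustration, not the exercise's solution: it uses synthetic data from `make_regression` with artificially injected missing values, and `score_pipeline` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic regression data with about 10% of the entries set to NaN
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

def score_pipeline(strategy, n_estimators):
    """Fit an imputer + random forest pipeline and return the validation MAE."""
    pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy=strategy)),
        ("model", RandomForestRegressor(n_estimators=n_estimators, random_state=0)),
    ])
    pipe.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, pipe.predict(X_va))

# Try every combination and keep the one with the lowest MAE
results = {}
for strategy in ["mean", "median"]:
    for n_estimators in [10, 50]:
        results[(strategy, n_estimators)] = score_pipeline(strategy, n_estimators)

best = min(results, key=results.get)
print("Best setting:", best, "MAE:", results[best])
```

The same loop could be applied to the housing data by swapping in `X_train`, `X_valid`, and the `ColumnTransformer` from Part A.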
    

    Step 2: Generate test predictions

    Now, you'll use your trained model to generate predictions with the test data.

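    # Preprocessing of test data, get predictions
    preds_test = my_pipeline.predict(X_test) # Your code here: the most convenient thing about a pipeline is that it applies the same operations to the test set as to the training set, without repeating any code.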
    
    # Check your answer
    step_2.check()
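Part of what makes the single `predict` call work on the test set is `handle_unknown='ignore'`: a category that never appeared in training does not raise an error. A minimal sketch with made-up categories, using object arrays rather than the housing data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test values for one categorical column
train = np.array([["Pave"], ["Grvl"], ["Pave"]], dtype=object)
test = np.array([["Grvl"], ["Dirt"]], dtype=object)  # "Dirt" was never seen in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; convert it to a dense array for inspection
dense = encoder.transform(test).toarray()

# Learned categories are sorted: ['Grvl', 'Pave']
categories = list(encoder.categories_[0])
```

The unseen value `"Dirt"` is encoded as a row of zeros, so the model still receives an input of the expected width.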
    

    Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.

    # Save test predictions to file
    output = pd.DataFrame({'Id': X_test.index,
                           'SalePrice': preds_test})
    output.to_csv('submission.csv', index=False)
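To see what the saved file looks like, the same `DataFrame` construction can be written to an in-memory buffer instead of disk. The ids and prices below are made up:

```python
import io
import pandas as pd

# Hypothetical test-set ids and predictions
ids = pd.Index([1461, 1462, 1463], name="Id")
preds = [120000.0, 155000.5, 180000.0]

output = pd.DataFrame({"Id": ids, "SalePrice": preds})

# index=False keeps the row index out of the file, leaving only Id and SalePrice
buffer = io.StringIO()
output.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
```

The first line is the header `Id,SalePrice`, followed by one row per test house.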
    
