Kaggle|Courses|Pipelines

Author: 十二支箭 | Published: 2020-04-22 18:32

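    The pipeline mechanism.
    A pipeline bundles the preprocessing and modeling steps, which keeps code simpler and better organized. Some data scientists do not use pipelines, but they bring some important benefits:
    -Cleaner code: accounting for the data at every preprocessing step gets messy. With a pipeline, you do not have to track the data manually at each step.
    -Easier to productionize: it can be hard to move a model from a prototype to something deployable at scale. We will not go into the many related concerns here, but pipelines can help.
    -More options for model validation: for example, cross-validation.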
    Step 1: Define Preprocessing Steps
    Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps.
    The code below:

    -imputes missing values in numerical data, and

    -imputes missing values and applies a one-hot encoding to categorical data.

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    
    # Preprocessing for numerical data: fill missing values with a constant
    # (fill_value defaults to 0 for numeric columns)
    numerical_transformer = SimpleImputer(strategy='constant')
    
    # Preprocessing for categorical data
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
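    # Bundle preprocessing for numerical and categorical data
    # (numerical_cols and categorical_cols are lists of column names,
    # assumed to have been defined when the data was loaded)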
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
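
    The snippet above refers to numerical_cols and categorical_cols without defining them. As a minimal sketch, assuming a small made-up DataFrame (X_toy and its column names are hypothetical, not the course data), the following shows one possible way to derive the two lists by dtype and what the fitted preprocessor returns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values in every column
X_toy = pd.DataFrame({
    'rooms': [3.0, np.nan, 2.0, 4.0],
    'area': [80.0, 120.0, np.nan, 95.0],
    'type': pd.Series(['house', 'unit', np.nan, 'house'], dtype=object),
})

# One way to build the column lists (an assumption): split by dtype
numerical_cols = X_toy.select_dtypes(include='number').columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude='number').columns.tolist()

# Same transformers as in the text
numerical_transformer = SimpleImputer(strategy='constant')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols),
])

# Fit on the toy data and transform it in one call
X_out = preprocessor.fit_transform(X_toy)
print(numerical_cols, categorical_cols)
print(X_out.shape)  # (4, 4): 2 numeric columns + 2 one-hot columns
```

    The two numeric columns come through with their missing values set to 0, and the single categorical column expands into one column per category seen during fitting.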
    
    

    Step 2: Define the Model

    Next, we define a random forest model with the familiar RandomForestRegressor class.

    from sklearn.ensemble import RandomForestRegressor
    
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    

    Step 3: Create and Evaluate the Pipeline

    Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:

    • With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)
    • With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)

    from sklearn.metrics import mean_absolute_error
    
    # Bundle preprocessing and modeling code in a pipeline
    my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('model', model)
                                 ])
    
    # Preprocessing of training data, fit model 
    my_pipeline.fit(X_train, y_train)
    
    # Preprocessing of validation data, get predictions
    preds = my_pipeline.predict(X_valid)
    
    # Evaluate the model
    score = mean_absolute_error(y_valid, preds)
    print('MAE:', score)
    
    
    MAE: 160679.18917034855
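
    For contrast, the manual steps that the pipeline replaces might look roughly like the sketch below. It is only an illustration under assumptions: the toy data and the column names area and type are hypothetical, so the printed MAE has nothing to do with the score above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not the course dataset)
X_train = pd.DataFrame({
    'area': [80.0, np.nan, 60.0, 95.0, 70.0, 110.0],
    'type': pd.Series(['house', 'unit', np.nan, 'house', 'unit', 'house'], dtype=object),
})
y_train = [300, 420, 210, 350, 260, 400]
X_valid = pd.DataFrame({
    'area': [np.nan, 100.0],
    'type': pd.Series(['unit', 'townhouse'], dtype=object),
})
y_valid = [280, 390]

num_imputer = SimpleImputer(strategy='constant')
cat_imputer = SimpleImputer(strategy='most_frequent')
encoder = OneHotEncoder(handle_unknown='ignore')

# Training data: every transformer is fitted and applied by hand
num_train = num_imputer.fit_transform(X_train[['area']])
cat_train = encoder.fit_transform(cat_imputer.fit_transform(X_train[['type']])).toarray()
train_ready = np.hstack([num_train, cat_train])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train_ready, y_train)

# Validation data: the same steps again, this time with transform() only
num_valid = num_imputer.transform(X_valid[['area']])
cat_valid = encoder.transform(cat_imputer.transform(X_valid[['type']])).toarray()
valid_ready = np.hstack([num_valid, cat_valid])

preds = model.predict(valid_ready)
print('MAE:', mean_absolute_error(y_valid, preds))
```

    Each transformer has to be fitted on the training data and then re-applied, in the same order, to the validation data. Forgetting one of those calls, or calling fit_transform() on the validation data by mistake, is the kind of error a pipeline rules out.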
    

    Conclusion

    Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.
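
    The validation benefit mentioned at the start can be sketched with cross_val_score. This is a minimal example on synthetic data (the data-generating formula and column names are made up for illustration): because the whole pipeline is passed in, the imputers and the encoder are re-fitted on the training part of every fold.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data, purely hypothetical
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    'area': rng.uniform(50, 150, size=n),
    'type': pd.Series(rng.choice(['house', 'unit'], size=n), dtype=object),
})
y = 3000 * X['area'] + np.where(X['type'] == 'house', 50000, 0) + rng.normal(0, 10000, size=n)

# Knock out a few values so the imputers have something to do
X.loc[::7, 'area'] = np.nan
X.loc[::11, 'type'] = np.nan

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='constant'), ['area']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['type']),
])
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negative MAE, so multiply by -1 to get MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5,
                              scoring='neg_mean_absolute_error')
print('MAE per fold:', scores)
print('Average MAE:', scores.mean())
```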

    Your Turn

    Try a pipeline in the next exercise to apply advanced data preprocessing techniques and improve your predictions!

    Article link: https://www.haomeiwen.com/subject/felmihtx.html