Kaggle|Courses|Pipelines

Author: 十二支箭 | Published: 2020-04-22 18:32


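The pipeline mechanism.
Pipelines bundle preprocessing and modeling steps together, which keeps code simpler and better organized. Some data scientists do not use pipelines, but they offer several important benefits:
-Cleaner code: keeping track of the data at every preprocessing step gets messy. With a pipeline, you do not have to track it manually at each step.
-Easier to productionize: moving a model from a prototype to something deployable at scale can be hard. We will not cover the many related concerns here, but pipelines can help.
-More options for model validation: cross-validation, for example.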
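To make the last benefit concrete, here is a minimal sketch of cross-validating a whole pipeline. The synthetic data, the mean imputation strategy, and the choice of 5 folds are assumptions made for illustration and are not part of the original article. Because the imputer lives inside the pipeline, it is refit on the training portion of each fold, so no information from the held-out fold leaks into preprocessing.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (an assumption for illustration): 60 rows, 3 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)
X[::10, 1] = np.nan  # introduce some missing values in the second feature

# Imputation and model bundled together
pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', RandomForestRegressor(n_estimators=20, random_state=0)),
])

# scikit-learn reports negated MAE so that higher is better; flip the sign back
scores = -cross_val_score(pipeline, X, y, cv=5,
                          scoring='neg_mean_absolute_error')
print('MAE per fold:', scores)
print('Mean MAE:', scores.mean())
```

Without a pipeline, the same check would require repeating the imputation by hand inside every fold.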
Step 1: Define Preprocessing Steps
Similar to how a pipeline bundles together preprocessing and modeling steps, we use the ColumnTransformer class to bundle together different preprocessing steps.
The code below:

-imputes missing values in numerical data, and

-imputes missing values and applies a one-hot encoding to categorical data.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
# (strategy='constant' fills missing numeric values with 0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

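# Bundle preprocessing for numerical and categorical data.
# numerical_cols and categorical_cols are lists of column names,
# assumed to have been defined when the data was loaded.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])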
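The snippet above relies on numerical_cols and categorical_cols, which the article does not define. The sketch below shows one possible way to build them and to check what the preprocessor produces. The toy DataFrame and its column names (area, rooms, heating) are hypothetical and are not from the course dataset, and selecting columns by dtype is only one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with hypothetical column names (not the course dataset)
X_toy = pd.DataFrame({
    'area': [50.0, np.nan, 80.0, 120.0],
    'rooms': [2, 3, np.nan, 5],
    'heating': ['gas', np.nan, 'electric', 'gas'],
})

# Numeric columns by dtype; everything else is treated as categorical
numerical_cols = X_toy.select_dtypes(include='number').columns.tolist()
categorical_cols = [c for c in X_toy.columns if c not in numerical_cols]

# Same transformers as in the article
numerical_transformer = SimpleImputer(strategy='constant')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Two numeric columns plus one one-hot column per heating category
X_prepared = preprocessor.fit_transform(X_toy)
print(numerical_cols, categorical_cols)
print(X_prepared.shape)
```

In a real dataset it is usually worth also limiting the categorical columns to those with few distinct values, since one-hot encoding adds one column per category.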

Step 2: Define the Model

Next, we define a random forest model with the familiar RandomForestRegressor class.

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

Step 3: Create and Evaluate the Pipeline

Finally, we use the Pipeline class to define a pipeline that bundles the preprocessing and modeling steps. There are a few important things to notice:

-With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)
-With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)

from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocess the training data and fit the model
my_pipeline.fit(X_train, y_train)

# Preprocess the validation data and get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)


MAE: 160679.18917034855
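For contrast, the manual workflow that the pipeline replaces might look roughly like the sketch below. The toy data and its column names (area, heating) are hypothetical, so the printed MAE is unrelated to the score above. Each transformer is fitted on the training data and then reapplied, in the same order, to the validation data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OneHotEncoder

# Toy data with hypothetical column names (not the course dataset)
X_train = pd.DataFrame({
    'area': [50.0, np.nan, 80.0, 120.0, 65.0, 90.0],
    'heating': ['gas', 'gas', 'electric', np.nan, 'electric', 'gas'],
})
y_train = pd.Series([100.0, 110.0, 160.0, 240.0, 130.0, 180.0])
X_valid = pd.DataFrame({'area': [70.0, np.nan], 'heating': ['gas', 'oil']})
y_valid = pd.Series([140.0, 150.0])

num_cols, cat_cols = ['area'], ['heating']

# Separate transformers that must each be fitted and tracked by hand
num_imputer = SimpleImputer(strategy='constant')
cat_imputer = SimpleImputer(strategy='most_frequent')
encoder = OneHotEncoder(handle_unknown='ignore')

# Training data: fit and transform every step, then stitch the parts together
train_num = num_imputer.fit_transform(X_train[num_cols])
train_cat = encoder.fit_transform(
    cat_imputer.fit_transform(X_train[cat_cols])).toarray()
train_ready = np.hstack([train_num, train_cat])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train_ready, y_train)

# Validation data: transform only (no refitting), in exactly the same order
valid_num = num_imputer.transform(X_valid[num_cols])
valid_cat = encoder.transform(
    cat_imputer.transform(X_valid[cat_cols])).toarray()
valid_ready = np.hstack([valid_num, valid_cat])

preds = model.predict(valid_ready)
print('MAE:', mean_absolute_error(y_valid, preds))
```

Forgetting one of the validation-side transforms, or calling fit_transform there instead of transform, is exactly the kind of mistake the pipeline version rules out.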

Conclusion

Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing.

Your Turn

Use a pipeline in the next exercise to use advanced data preprocessing techniques and improve your predictions!

Article link: https://www.haomeiwen.com/subject/felmihtx.html
