Introduction
The underlying principle of ensemble methods is that there is a strength to be found in unity. By combining multiple methods, each with its own pros and cons, more powerful models can be created.
The main reason for writing this article is not to explain how stacking works, but to demonstrate how you can use scikit-learn v0.22 to simplify stacking pipelines and create interesting models.
- Stacking
Although there are many great resources that introduce stacking, let me quickly get you up to speed.
Stacking is a technique that takes several regression or classification models and uses their predictions as the input for a meta-classifier/regressor.
In essence, stacking is an ensemble learning technique, much like Random Forests, that improves prediction quality by combining typically weaker models.
The image above gives a basic overview of the principle of stacking. A stack typically consists of many weak base learners, or a few stronger ones. The meta learner then learns from the prediction outputs of each base learner.
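To make the principle concrete, here is a minimal sketch of stacking done by hand: each base learner produces out-of-fold predictions via cross-validation, and the meta learner is trained on those predictions rather than on the raw features. (The dataset and model choices here are illustrative, not part of the original article.)

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Out-of-fold class-probability predictions from each base learner,
# so the meta learner never sees predictions made on training folds
rf_preds = cross_val_predict(RandomForestClassifier(random_state=42),
                             X, y, cv=5, method="predict_proba")
knn_preds = cross_val_predict(KNeighborsClassifier(),
                              X, y, cv=5, method="predict_proba")

# Stack the base-learner outputs side by side as meta features
meta_features = np.hstack([rf_preds, knn_preds])

# The meta learner is trained on predictions, not on the raw features
meta_learner = LogisticRegression(max_iter=1000).fit(meta_features, y)
```

This is exactly the bookkeeping (folds, out-of-fold predictions, feature stacking) that the new scikit-learn classes handle for you.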
- Sklearn Stacking
Although there are many packages that can be used for stacking, such as mlxtend and vecstack, this article covers the newly added stacking regressors and classifiers in the new release of scikit-learn.
First, we need to make sure to upgrade Scikit-learn to version 0.22:
pip install --upgrade scikit-learn
The first model that we are going to make is a classifier that can predict the species of flowers. The model is relatively simple: we use a Random Forest and k-Nearest Neighbors as our base learners and a Logistic Regression as our meta learner.
Coding stacking models can be quite tricky, as you have to account for the folds you want to generate and for cross-validation at different steps. Fortunately, the new scikit-learn version makes it possible to create the model shown above in just a few lines of code:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
# Create Base Learners
base_learners = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
]
# Initialize Stacking Classifier with the Meta Learner
clf = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
# Split the data, fit the stack, and score on the held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
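The regression counterpart, `StackingRegressor`, works the same way. Here is a minimal sketch on the diabetes dataset; the dataset and the choice of base learners and meta learner are illustrative assumptions, not taken from the article:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Base learners mirror the classification example; Ridge is the meta learner
base_learners = [
    ('rf', RandomForestRegressor(n_estimators=10, random_state=42)),
    ('knn', KNeighborsRegressor(n_neighbors=5))
]
reg = StackingRegressor(estimators=base_learners, final_estimator=Ridge())
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))  # R^2 on the held-out test set
```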