Bonus
Introduction
This bonus chapter demonstrates the usage of a pipeline in conjunction with a grid search to automate the modelling process. Again, we are utilizing the bank marketing data. However, this time around we streamline the following:
- Data preprocessing
- Model evaluation
- Hyperparameter tuning
- Model selection
- Re-training the model on the entire dataset
... basically every step we took in the "Data Science in Practice" block. Moreover, with a pipeline and grid search, we can easily evaluate additional model types and apply a more sophisticated way to evaluate their performance.
Tip
This chapter serves as an additional outlook for further topics you could explore, targeting your curiosity. Some concepts and techniques used in this chapter were not covered in this course. We won't explain them in much detail here, as they are beyond the scope of this course. Nonetheless, they could prove valuable for your future machine learning journey.
If you're still around, great! Let's get started with some code.
Quickstart
If you just need a blueprint for your next project, here's the whole thing.
Code
- Open the bank_model project (from the Data Science in Practice block).
- Copy and execute the code.
- Done!
If you want to know more about the individual parts, keep reading.
Plan of attack
We start by defining a bunch of things:
- Custom transformer, for data imputation and returning a DataFrame
- ColumnTransformer for preprocessing the data
- Pipeline to combine all steps
- Grid defining all models and parameters to be evaluated
- Grid search to find the best model
Then we simply need to apply the pipeline and grid search to the data. Finally, we save the best model.
Implementation
1. Custom transformer
We start by defining a custom transformer that imputes missing values in a DataFrame.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer


class DataFrameImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy="most_frequent"):
        self.strategy = strategy
        self.imputer = SimpleImputer(
            strategy=strategy, missing_values="unknown"
        )

    def fit(self, X, y=None):
        self.imputer.fit(X)
        return self

    def transform(self, X):
        return pd.DataFrame(
            self.imputer.transform(X), columns=X.columns, index=X.index
        )
The custom transformer has to implement the fit() and transform() methods.
The imputer does not need the target variable y, so fit() accepts it only for compatibility with the pipeline and "ignores" it by defining it as y=None.
If you want to know more about custom transformers or even custom estimators (models), check out these resources:
Tip
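DataFrameImputer returns a pandas DataFrame, whereas a plain SimpleImputer returns a NumPy array without column names. Keeping the DataFrame allows us to easily chain the imputation step together with our trusted ColumnTransformer, which selects columns by name, within a pipeline. In this case, that's the whole purpose of the custom transformer.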
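To see why the wrapper is useful, here is a minimal sketch on a made-up two-column frame (the values are illustrative and not taken from the bank data). It compares the raw output of SimpleImputer with the same output wrapped back into a DataFrame, which is what DataFrameImputer does in transform().

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values); "unknown" marks a missing entry.
toy = pd.DataFrame(
    {
        "job": ["admin", "unknown", "admin", "technician"],
        "marital": ["married", "single", "unknown", "single"],
    },
    dtype=object,
)

imputer = SimpleImputer(strategy="most_frequent", missing_values="unknown")

# By default, SimpleImputer returns a NumPy array: the column names are gone.
raw = imputer.fit_transform(toy)
print(type(raw).__name__)  # ndarray

# Wrapping the array restores the column names and the index.
imputed = pd.DataFrame(raw, columns=toy.columns, index=toy.index)
print(imputed)
```

With the column names restored, a later step can still refer to "job" or "marital" by name.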
2. ColumnTransformer
Speaking of the ColumnTransformer, it simply stays the same as before.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)

preprocessor = ColumnTransformer(
    transformers=[
        (
            "nominal",
            OneHotEncoder(handle_unknown="ignore"),
            [
                "default",
                "housing",
                "loan",
                "contact",
                "poutcome",
                "job",
                "marital",
            ],
        ),
        (
            "ordinal",
            OneHotEncoder(handle_unknown="ignore"),
            ["month", "day_of_week", "education"],
        ),
        (
            "binning",
            KBinsDiscretizer(n_bins=5, strategy="uniform", encode="onehot"),
            ["age", "campaign", "pdays", "previous"],
        ),
        (
            "zscore",
            StandardScaler(),
            [
                "emp.var.rate",
                "cons.price.idx",
                "cons.conf.idx",
                "euribor3m",
                "nr.employed",
            ],
        ),
    ]
)
3. Pipeline
A pipeline is a sequence of steps where each step is a tuple containing a name and a transformer/estimator.
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    [
        ("imputer", DataFrameImputer()),
        ("preprocessor", preprocessor),
        ("variance", VarianceThreshold(threshold=0.0)),
        ("classifier", None),
    ]
)
Our pipeline consists of the following sequential steps:
- "imputer" - Impute missing values
- "preprocessor" - Apply all further preprocessing steps
- "variance" - Remove features with zero variance (removes all constant features)
- "classifier" - Apply a classifier (to be defined later)
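As a small illustration of the "variance" step, the following sketch applies VarianceThreshold to a made-up matrix whose first column is constant. The toy values are assumptions for demonstration only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (made up): the first column is constant.
X_toy = np.array(
    [
        [0, 1, 5],
        [0, 2, 5],
        [0, 3, 6],
    ]
)

selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X_toy)

# Boolean mask of the columns that were kept.
print(selector.get_support())  # [False  True  True]
print(X_reduced.shape)  # (3, 2)
```

In the pipeline, this guards against columns that carry no information, for example a one-hot column that is zero for every row.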
Tip
You can modify pipelines to your liking. For example, you could add another feature selection step. Or what about applying a PCA and then a classifier? The possibilities are endless!
4. Grid
Next, we define a grid with all models and hyperparameters to be evaluated.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

grid = [
    {
        "classifier": [
            RandomForestClassifier(random_state=42, class_weight="balanced")
        ],
        "classifier__n_estimators": [100, 200],
        "classifier__max_depth": [5, 10],
        "classifier__min_samples_leaf": [1, 2],
    },
    {
        "classifier": [SVC(random_state=42, class_weight="balanced")],
        "classifier__C": [0.1, 1, 10],
    },
    {"classifier": [LogisticRegression(class_weight="balanced")]},
    {
        "classifier": [MLPClassifier(random_state=42, max_iter=1_000)],
    },
]
The grid contains four different models:
- Random Forest
- Support Vector Machine (not discussed in this course)
- Logistic Regression
- Multi-layer Perceptron (Neural Network - not discussed in this course)
We will evaluate all these models and each hyperparameter combination.
Info
Names in the grid dictionary must match the names in the pipeline ("classifier"). The double underscore "__" is used to indicate that the parameter belongs to the classifier in the pipeline.
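To make the size of the grid and the naming rule concrete, here is a minimal sketch. It rebuilds the grid under the assumed names demo_grid and demo_pipe (a reduced stand-in pipeline with only the "classifier" step), counts the candidates with ParameterGrid, and applies one candidate by hand.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Reduced stand-in for the pipeline: only the final step matters here.
demo_pipe = Pipeline([("classifier", None)])

demo_grid = [
    {
        "classifier": [
            RandomForestClassifier(random_state=42, class_weight="balanced")
        ],
        "classifier__n_estimators": [100, 200],
        "classifier__max_depth": [5, 10],
        "classifier__min_samples_leaf": [1, 2],
    },
    {
        "classifier": [SVC(random_state=42, class_weight="balanced")],
        "classifier__C": [0.1, 1, 10],
    },
    {"classifier": [LogisticRegression(class_weight="balanced")]},
    {"classifier": [MLPClassifier(random_state=42, max_iter=1_000)]},
]

# Each dictionary is expanded separately: 2 * 2 * 2 + 3 + 1 + 1 candidates.
candidates = list(ParameterGrid(demo_grid))
print(len(candidates))  # 13

# Applying one candidate by hand: "classifier" replaces the step itself,
# "classifier__C" is then forwarded to that step's C parameter.
demo_pipe.set_params(
    **{"classifier": SVC(random_state=42), "classifier__C": 10}
)
print(demo_pipe.named_steps["classifier"].C)  # 10
```

With 13 candidates and 5 cross-validation folds, a grid search fits 65 models, plus one final refit of the best candidate.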
5. Grid search
Finally, we define the grid search. That's where we put everything together.
from sklearn.model_selection import GridSearchCV, StratifiedKFold

search = GridSearchCV(
    pipe,
    grid,
    n_jobs=-1,  # (1)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=654),  # (2)
    scoring=["balanced_accuracy", "roc_auc"],
    refit="balanced_accuracy",
    verbose=2,
)
1. Use all available CPU cores (n_jobs=-1). This speeds up the process significantly.
2. We use a stratified k-fold cross-validation with 5 splits. Each fold preserves the percentage of samples for each class.
Basically, we are evaluating all models and hyperparameters using a (stratified) k-fold cross-validation (read more about cross-validation here). The StratifiedKFold thus replaces our simple train_test_split().
To evaluate the models, we are calculating two performance metrics: balanced accuracy and ROC AUC (scoring=["balanced_accuracy", "roc_auc"]).
The best model is selected based on the balanced accuracy (refit="balanced_accuracy") and then retrained on the entire dataset!
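The effect of stratification can be checked on a small, made-up target. This sketch assumes an imbalanced toy vector with 10 % positives and prints the share of positives in every test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy target (made up): 90 negatives, 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for splitting

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=654)

positive_rates = []
for train_idx, test_idx in cv.split(X_toy, y_toy):
    # Share of positives in the held-out fold.
    positive_rates.append(float(y_toy[test_idx].mean()))

# Every test fold keeps the overall 10 % share of positives.
print(positive_rates)
```

A plain, unstratified split could by chance produce folds with very few positives, which makes scores on imbalanced data less reliable.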
Info
The grid search eliminates the need to compare models manually, it performs
hyperparameter tuning, and it selects the best model for us. Lastly, we
won't even have to re-train it on the entire dataset, as the grid search
already does that for us!
Application
With all things defined, we simply need to apply the grid search to the data.
# load data
data = pd.read_csv("data/bank-merged.csv")
X, y = data.drop(columns="y"), data["y"]

# fit the grid search
search.fit(X, y)

print(
    f"Best score: {search.best_score_}\n"
    f"Best estimator: {search.best_params_}"
)
Best score: 0.7407486855434444
Best estimator: {
'classifier': RandomForestClassifier(class_weight='balanced', random_state=42),
'classifier__max_depth': 5, 'classifier__min_samples_leaf': 1, 'classifier__n_estimators': 200
}
Again, a random forest is the best model.
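To look beyond the single best candidate, the fitted search exposes all results in cv_results_. The sketch below uses synthetic data and a deliberately small grid (both are stand-ins, not the bank marketing setup) to show which columns appear when several metrics are scored.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the bank marketing set.
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

demo_search = GridSearchCV(
    Pipeline([("classifier", None)]),
    [
        {"classifier": [LogisticRegression(max_iter=1_000)]},
        {
            "classifier": [DecisionTreeClassifier(random_state=0)],
            "classifier__max_depth": [2, 4],
        },
    ],
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring=["balanced_accuracy", "roc_auc"],
    refit="balanced_accuracy",
)
demo_search.fit(X_demo, y_demo)

# One row per candidate; with several metrics, columns are suffixed per metric.
results = pd.DataFrame(demo_search.cv_results_)
columns = [
    "params",
    "mean_test_balanced_accuracy",
    "mean_test_roc_auc",
    "rank_test_balanced_accuracy",
]
print(results[columns].sort_values("rank_test_balanced_accuracy"))
```

The same table is available from the real search object after fitting, which helps to judge how close the runner-up models are.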
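To predict new data, call the predict() method of the fitted grid search (search.predict()), which uses the best model retrained on the entire dataset.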
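A minimal sketch of prediction and saving, again on synthetic stand-in data. Using joblib for persistence and the file name best_model.joblib are assumptions; the text above only states that the best model is saved.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data; not the bank marketing set.
X_demo, y_demo = make_classification(n_samples=200, random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1_000),
    {"C": [0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="balanced_accuracy",
)
demo_search.fit(X_demo, y_demo)

# predict() is forwarded to best_estimator_, the model refitted on all data.
X_new = X_demo[:5]  # stand-in for unseen rows
predictions = demo_search.predict(X_new)

# Persist only the refitted model (the file name is a placeholder).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "best_model.joblib"
    joblib.dump(demo_search.best_estimator_, model_path)
    restored = joblib.load(model_path)

print(predictions)
print((restored.predict(X_new) == predictions).all())  # True
```

Because best_estimator_ is the whole pipeline in the real setup, the saved file would contain the preprocessing steps as well as the classifier.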
That's it! You've automated the whole modelling process.