3. End-to-End
Introduction
We distill all relevant code blocks from the previous two chapters into one cohesive notebook: an end-to-end example of fitting a machine learning model on the bank marketing data set. Finally, we will save the model to disk.
Tip
The notebook we will create can serve as a reference point for your future data science projects.
So start by creating yet another notebook.
📁 bank_model/
├── 📁 .venv/
├── 📁 data/
│   └── 📄 bank-merged.csv
├── 📄 preparation.ipynb
├── 📄 modelling.ipynb
└── 📄 end-to-end.ipynb
Previously...
In the previous chapters, we:
- Loaded the data
- Defined techniques to impute (`SimpleImputer`) and preprocess (`ColumnTransformer`) the data
- Split the data into train and test sets
- Applied the imputation and preprocessing techniques to the data
- Evaluated different model types and concluded that a `RandomForestClassifier` is the best model (we found) for this task
- Fit and evaluated the random forest
Here are the bullet points distilled in one code block:
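Your exact block may look slightly different depending on how you wrote the previous notebooks. As a reference, here is a minimal sketch under a few assumptions (the target column is named `y`, missing values are imputed with the most frequent value, categorical features are one-hot encoded, and the forest uses default hyperparameters); substitute your own code where it differs.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# load the merged data set
data = pd.read_csv("data/bank-merged.csv")
y = data.pop("y")  # assumption: the target column is named "y"
X = data

# imputation technique for missing values
impute = SimpleImputer(strategy="most_frequent")

# preprocessing: one-hot encode categorical columns, pass the rest through
categorical = X.select_dtypes(include="object").columns
preprocessor = ColumnTransformer(
    [("categorical", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical)],
    remainder="passthrough",
)

# encoder for the target labels ("no"/"yes" -> 0/1)
encoder = LabelEncoder()

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit the transformers on the training data, then transform both sets
X_train = impute.fit_transform(X_train)
X_train = pd.DataFrame(X_train, columns=data.columns)
X_train = preprocessor.fit_transform(X_train)
y_train = encoder.fit_transform(y_train)

X_test = impute.transform(X_test)
X_test = pd.DataFrame(X_test, columns=data.columns)
X_test = preprocessor.transform(X_test)
y_test = encoder.transform(y_test)

# fit the random forest and evaluate it on the test set
forest = RandomForestClassifier(random_state=42)
forest.fit(X_train, y_train)
print(balanced_accuracy_score(y_test, forest.predict(X_test)))
```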
Copy and execute the block
Since the code block is nothing new, simply copy and execute it. If everything went smoothly, you should see the balanced accuracy score printed.
Re-fit on whole data set
Previously, we split our data into train and test sets. Using the test set, we were able to estimate the performance of our model; that is the whole purpose of the test set.

Now, our goal is to save the trained model for future use. In practice, we want to leverage the power of the whole data set, so we re-fit the model on all available data.
```python
# preprocess the whole data set
X = impute.transform(X)
X = pd.DataFrame(X, columns=data.columns)
X = preprocessor.transform(X)

# encode target
y = encoder.transform(y)
```
To preprocess the whole data set, we can reuse the `impute` and `preprocessor` objects. We only need to transform the data and encode the target. Lastly, we fit the model on the whole data set. It's as simple as:
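```python
# re-fit the forest on the whole, preprocessed data set
forest.fit(X, y)
```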
Info
Note that we can simply call `fit()` again; this "overwrites" the previously fitted model and fits it once more, this time on the whole data set.
The `forest` is now fitted on the whole data set. That's it! We have our final model, which we will save to disk.
Model persistence
To save the model to disk, we can use `pickle`, which is part of Python's standard library. With `pickle`, you can save nearly any Python object and load it back later.

The name `pickle` comes from the concept of "pickling" in food preservation. Similarly, `pickle` is used to "preserve" Python objects.
Simple example
For example, we can save any object, such as a simple `list`:
```python
import pickle

simple_list = [1, 2, 3, 4, 5]

# open a file in write-binary mode and serialize the list into it
with open("list.pkl", "wb") as file:
    pickle.dump(simple_list, file)
```
Let's break down the code block:
- We open a new file named `list.pkl`; `.pkl` is just a common extension for `pickle` files.
- The file is opened in write-binary mode (`"wb"`), as pickle files are binary files.
- We use `pickle.dump()` to save the object `simple_list` to the file.
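Loading the object back works just as easily with `pickle.load()`:

```python
# read the list back from disk ("rb" = read-binary mode)
with open("list.pkl", "rb") as file:
    restored_list = pickle.load(file)

print(restored_list)  # [1, 2, 3, 4, 5]
```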
Info
You can delete `list.pkl`; it was just an example.
Save the model
Let's extend this knowledge to save our model. Unfortunately, it's not just a matter of saving the `forest` object. First, we look at the steps we need to take to make a prediction for a new client:
The prediction process
```mermaid
graph TD;
    A[New client data] --> B[Impute potential missing values: <code>impute.transform</code>];
    B --> C[Preprocess data: <code>preprocessor.transform</code>];
    C --> D[Make predictions: <code>forest.predict</code>];
    D --> E[Transform prediction to yes or no: <code>encoder.inverse_transform</code>];
```
To get our prediction process working, we need to save all objects involved:

- `impute`
- `preprocessor`
- `encoder`
- `forest`

We can save all these objects in one file using a simple `dict`:
```python
# bundle all objects needed for prediction into one dictionary
model = {
    "imputer": impute,
    "preprocessor": preprocessor,
    "forest": forest,
    "target-encoder": encoder,
}

# serialize the dictionary to disk
with open("bank-model.pkl", "wb") as file:
    pickle.dump(model, file)
```
Load the model
Create a new notebook, which we will use to test the saved model. Use the following code block to load the model `dict`; `"rb"` stands for read-binary mode.
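```python
import pickle

# load the saved model dictionary in read-binary mode
with open("bank-model.pkl", "rb") as file:
    model = pickle.load(file)
```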
Danger
Do not download and load `pickle` files from the internet unless you trust the source. Since `pickle` can execute arbitrary code while loading, it can be a security risk.
Predictions
Let's run the prediction process. Assume the bank contacted another client with the following attributes:
```python
import pandas as pd

client = pd.DataFrame(
    {
        "id": 155611,
        "age": 54,
        "default": None,
        "housing": "no",
        "loan": "no",
        "contact": "cellular",
        "month": "aug",
        "day_of_week": "tue",
        "campaign": 3,
        "pdays": 999,
        "previous": 0,
        "poutcome": "nonexistent",
        "emp.var.rate": -2.9,
        "cons.price.idx": 92.201,
        "cons.conf.idx": -31.4,
        "euribor3m": 0.878,
        "nr.employed": 5087.2,
        "job": "retired",
        "marital": "divorced",
        "education": "professional.course",
    },
    index=[0],
)
```
Does the client subscribe to a term deposit?
Make a new prediction
Predict if the client will subscribe to a term deposit.

- Use the above code snippet to create a new observation `client`.
- Use all objects in the dictionary `model` to make a prediction.

Hint: To make a prediction, simply implement the prediction process illustrated in the graph above.
Try to solve the task on your own. For completeness, we provide one possible solution.
Info
```python
def predict(model, client):
    # preprocess the client data
    X = model["imputer"].transform(client)
    X = pd.DataFrame(X, columns=client.columns)
    X = model["preprocessor"].transform(X)
    # make a prediction
    prediction = model["forest"].predict(X)
    # inverse transform (0, 1) to ("no", "yes")
    prediction = model["target-encoder"].inverse_transform(prediction)
    return prediction
```
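Calling the function with the loaded `model` and the new `client` observation answers the question:

```python
# predict whether the client subscribes to a term deposit ("yes"/"no")
print(predict(model, client))
```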
Conclusion
Across three chapters, we successfully reached our end goal: to build a machine learning model on the bank marketing data set. We ended up with a random forest model with a balanced accuracy of 74.45%.

The saved model can be deployed in a production environment. The prediction process is straightforward and can easily be applied to new clients.
Congratulations, you have reached the end of this practical guide! 🎉 You are now well-equipped to tackle your own data science projects.
Outlook
There are many more avenues to explore in the data science/machine learning landscape:
Model deployment
Learn how to deploy a model in a production environment. This can be done with a REST API, a web application or a mobile application (among others).
Start with a web framework, which is a great way to serve your model.
Model persistence with onnx
The Open Neural Network Exchange (ONNX) format provides an interesting alternative to `pickle`. ONNX allows you to convert your trained models into a standardized format that can be run efficiently across different platforms and programming languages.
For example, `onnx` allows you to build the model in Python and deploy it with JavaScript.
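To give a first impression, here is a sketch of how the fitted forest might be exported with the `skl2onnx` package. This is our assumption of a typical workflow, not something covered in this guide, and `skl2onnx` must be installed separately (`pip install skl2onnx`):

```python
import numpy as np
from skl2onnx import to_onnx  # assumption: installed via `pip install skl2onnx`

# convert the fitted forest; one preprocessed sample defines the input signature
# (assumes X is the dense, numeric feature matrix produced above)
onx = to_onnx(forest, np.asarray(X[:1], dtype=np.float32))

# write the ONNX model to disk
with open("bank-model.onnx", "wb") as file:
    file.write(onx.SerializeToString())
```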
Start with:
Expand your model toolkit
We covered a selection of different model types, yet there are many more to explore. `scikit-learn` offers further models for classification, regression, clustering, and dimensionality reduction.

Since you're already familiar with `scikit-learn`, applying these models is straightforward.
Start with:
Advanced pipeline techniques
`scikit-learn` offers more sophisticated modelling workflows through pipelines. In a bonus chapter, we explore advanced techniques for hyperparameter tuning, custom transformers, and more.