1. Data preparation
Introduction
To achieve our end goal, we have to carefully analyze and preprocess the data. We will start by exploring the data set, handling missing values, and identifying attribute types, and then proceed to preprocessing techniques.
Info
To build a proper machine learning model for the bank marketing data set, we need to channel all the knowledge we have obtained so far!
Create a new notebook or script.
Data
We start by loading the data.
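Assuming the data set is available as a CSV file (the file name below is a placeholder; adjust it to your copy), this boils down to:

import pandas as pd

# load the bank marketing data set into a DataFrame
data = pd.read_csv("bank_marketing.csv")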
A look at the data
It's always a good idea to take a look at the data before proceeding.
- Check the shape of the data.
- Display the first few rows.
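A quick check with pandas could look like this:

# number of rows and columns
print(data.shape)

# first few rows of the data set
print(data.head())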
Missing values
In the data preprocessing chapter we discussed missing values. Recall that in this specific data set, the missing values are a bit more hidden. They are encoded as "unknown". So let's replace these values with None.
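One way to do that with pandas (the dict form makes the mapping "unknown" to None explicit):

# make the hidden missing values explicit: "unknown" -> None
data = data.replace({"unknown": None})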
With a cleaned data set, we can now proceed to the next step - data exploration.
Attribute types
Start by checking the attribute types.
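How exactly is up to you; a common starting point (an assumption, not necessarily the call used in the course) is:

# data types as stored by pandas, plus non-null counts
data.info()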
Attribute types
Again, look at the data set. The task is to identify which attribute types are generally present in the data set. Answer the following quiz question.
If you need a refresher on attribute types, check out the appropriate section.
Which attribute types are present in the data set?
Feature description
With a broad overview in place, let's explore the different features/attributes more in-depth. Since we are dealing with quite a few features, they have been grouped into categories.
- Client Demographics

  Demographic information about each client such as the education level (high school, university, etc.).

  | Variable | Description |
  | --- | --- |
  | id | Client identifier (we will ignore the identifier) |
  | age | Age |
  | job | Type of occupation |
  | marital | Marital status |
  | education | Education level |

- Financial Status

  Does the client have a housing or personal loan, etc.

  | Variable | Description |
  | --- | --- |
  | default | Credit default status |
  | housing | Housing loan status |
  | loan | Personal loan status |

- Campaign Information

  Remember, bank clients were contacted by phone. Some were contacted multiple times over the span of multiple campaigns.

  | Variable | Description |
  | --- | --- |
  | contact | Contact type |
  | month | Last contact month |
  | day_of_week | Last contact day |
  | campaign | Number of contacts in current campaign |
  | pdays | Days since last contact from previous campaign |
  | previous | Number of contacts before this campaign |
  | poutcome | Outcome of previous campaign |

- Economic Indicators

  Some economic indicators at the time of the contact like the current interest rate (Euribor rate).

  | Variable | Description |
  | --- | --- |
  | emp.var.rate | Employment variation rate |
  | cons.price.idx | Consumer price index |
  | cons.conf.idx | Consumer confidence index |
  | euribor3m | Euribor 3-month rate |
  | nr.employed | Number of employees |
Info
Lastly, one column remains - "y". This column is the target: whether a customer subscribed to a term deposit (1) or not (0).
With a better understanding of the features at hand, we can proceed to the next step: assigning attribute types to the columns. Doing so will help us later to pick the appropriate preprocessing steps.
Assigning attribute types
Assigning attributes
Assign an attribute type to each column. Look at the data, go over each column/attribute, and add the column name to one of the three empty lists. Disregard the unique identifier "id" and the target "y".
For example (part of the solution):
- "age" is a "measurable" quantity and expressed as a number, thus it is a numerical attribute.
- The next attribute "default" is clearly categorical, with its unique values ["no", None, "yes"]. Since the attribute has no meaningful order, it is nominal.
Resulting so far in:
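The list names below are an assumption; use whatever names the task template provides.

# partial assignment based on the two examples above
numerical = ["age"]
nominal = ["default"]
ordinal = []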
Now, go ahead and assign all of the remaining attributes.
Danger
Since the attribute assignment is crucial, we strongly urge you to solve the task. It will help your understanding of the data set and the next steps.
Check your solution with the answer below and correct any mistakes you've made.
Info
The solution is as follows (column names are ordered according to data):
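A plausible full assignment is sketched below; whether month and day_of_week count as ordinal or nominal is debatable (both end up one-hot encoded anyway), so your course solution may group them differently.

# one plausible assignment, ordered by column position in data
numerical = [
    "age", "campaign", "pdays", "previous",
    "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed",
]
nominal = ["job", "marital", "default", "housing", "loan", "contact", "poutcome"]
ordinal = ["education", "month", "day_of_week"]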
Visualizing the data
To get an even better understanding of the data, we can visualize it. For convenience, we will use the built-in plotting capabilities of pandas.
Tip
If you want to know more on visualizing different attribute types, visit the Frequency Distribution chapter of the Statistics course.
For example, we can plot numerical attributes like "campaign" as a box plot.
import matplotlib.pyplot as plt
data.plot(
kind="box", y="campaign", title="Number of contacts in current campaign"
)
plt.show()
Info
As you might have noticed, you need to install matplotlib.
Or how about a pie chart for nominal attributes like "marital"?
# first, count the occurrence of each category
marital_count = data["marital"].value_counts()
marital_count.plot(kind="pie", autopct="%1.0f%%", title="Marital status") # (1)!
plt.show()
- The autopct parameter is used to display the percentage on the pie chart.
Visualize
Pick two more attributes of your choice and plot them.
- Choose a numerical attribute and plot it as a histogram.
- Select a nominal or ordinal attribute and plot it as a bar chart.
Use these pandas resources if you're having trouble:
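One possible solution, assuming you pick "age" and "job" (any other fitting columns work just as well):

# histogram of a numerical attribute
data.plot(kind="hist", y="age", bins=20, title="Age distribution")
plt.show()

# bar chart of a nominal attribute: count the categories first, then plot
data["job"].value_counts().plot(kind="bar", title="Job types")
plt.show()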
Info
It's crucial to visualize your data before diving into further analysis. Visualizations can help you understand the distribution, identify patterns, and detect anomalies or outliers in your data. This step ensures that you have a clear understanding of your data, which is essential for making informed decisions in your analysis process.
Preprocessing
Now that we have a better understanding of the data, we can proceed to the preprocessing steps. Depending on the attribute type, we will apply different techniques.
Since we are dealing with a mixed data set, we will keep things relatively simple and plan our approach accordingly:
- For nominal attributes, we apply one-hot encoding.
- For ordinal attributes, we use one-hot encoding as well.
- For numerical attributes, we follow two strategies:
    - Create bins for age, campaign, pdays, and previous.
    - Apply Z-Score normalization to the remaining features: emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, and nr.employed.
Info
Nominal and ordinal attributes are categorical and require one-hot encoding to be suitable for machine learning algorithms.
We are creating bins for age, campaign, pdays, and previous, since these features have a large number of outliers. By binning these features, we can try to reduce the impact of outliers and noise in the data.
Z-Score normalization is applied to the remaining numerical features to ensure that features don't have a larger impact on the model just because of their larger magnitude.
To apply these preprocessing steps, we have to look for the corresponding scikit-learn classes.

| Preprocessing technique | Corresponding scikit-learn class |
| --- | --- |
| One-hot encoding | OneHotEncoder |
| Binning | KBinsDiscretizer |
| Z-Score normalization (standardization) | StandardScaler |
Tip
All these techniques and classes were previously introduced in the Data preprocessing chapter.
Just like in the Data preprocessing chapter, we could apply each technique one at a time, e.g.:
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer
nominal_encoder = OneHotEncoder()
nominal_encoder.fit_transform(data[nominal])
ordinal_encoder = OneHotEncoder()
ordinal_encoder.fit_transform(data[ordinal])
binning = KBinsDiscretizer(n_bins=5, strategy="uniform")
binning.fit_transform(data[["age", "campaign", "pdays", "previous"]])
# and so on...
... the above approach is perfectly fine by itself, but we can do better! First, though, we need to get the term information leakage out of the way - a common pitfall in machine learning/data science projects.
Information leakage
To explain the term information leakage, let's look at an example.
Information leakage
Assume we want to predict the target "y" based on the features "emp.var.rate" and "euribor3m". First, we apply Z-Score normalization to these features.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features = scaler.fit_transform(
data[["emp.var.rate", "euribor3m"]]
)
As always, we split the data into a training and a test set to later evaluate the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
features, data["y"], test_size=0.2, random_state=42
)
Now, we are already dealing with information leakage. Put simply - the train set X_train already "knows" something about the test set X_test.
Why?
Remember the definition of Z-Score normalization - it calculates the mean and standard deviation of the data set. If we calculate these values on the whole data set (in our case data), just like we did above, X_train contains information about X_test. Thus, the test set is no longer a good representation of unseen data, and any scores calculated with the test set are no longer a good indicator of the model's performance.
This is a common pitfall in machine learning! To prevent information leakage, we have to split the data before applying any preprocessing steps.
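A minimal sketch of the leakage-free order - split first, fit the scaler on the training data only, then reuse it on the test data:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# split first, so the test set stays truly unseen
X_train, X_test, y_train, y_test = train_test_split(
    data[["emp.var.rate", "euribor3m"]], data["y"], test_size=0.2, random_state=42
)

scaler = StandardScaler()
# mean and standard deviation are computed from the training set only ...
X_train_scaled = scaler.fit_transform(X_train)
# ... and merely applied to the test set
X_test_scaled = scaler.transform(X_test)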
With information leakage in mind, we introduce a more elegant way to apply multiple preprocessing steps.
ColumnTransformer
Not the kind of transformer you are expecting, but cool nonetheless! 🤖
Since we do not want to apply each preprocessing step one at a time, we simply bundle them.
The ColumnTransformer is a class in scikit-learn that allows us to bundle our preprocessing steps together. This way, we can apply all transformations in one go.
First, we import all necessary classes:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, StandardScaler
Next, we can already instantiate our transformer. We define the exact same steps as we did in written form at the beginning of this section. Note that the ColumnTransformer takes a list of tuples.
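Based on the breakdown below, the transformer could be defined roughly like this; the column lists nominal and ordinal come from the attribute assignment above, and the bin count of 5 is an assumption carried over from the earlier example:

preprocessor = ColumnTransformer(
    transformers=[
        # one-hot encode the nominal columns
        ("nominal", OneHotEncoder(), nominal),
        # one-hot encode the ordinal columns
        ("ordinal", OneHotEncoder(), ordinal),
        # bin the outlier-heavy numerical columns
        ("binning", KBinsDiscretizer(n_bins=5, strategy="uniform", encode="onehot"), ["age", "campaign", "pdays", "previous"]),  # (1)!
        # Z-Score normalize the remaining numerical columns
        ("zscore", StandardScaler(), ["emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"]),
    ]
)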
- Conveniently, we can create categories (bins) with the KBinsDiscretizer and directly apply one-hot encoding with encode="onehot".
Let's break it down:
- Our instance preprocessor has 4 steps, named nominal, ordinal, binning, and zscore.
- Each step is defined as a tuple, with the first element being the name of the step, the second element the preprocessing technique, and the third element a list of columns to apply the technique to.
- By default, all columns which are not specified in the ColumnTransformer will be dropped! See the remainder parameter in the docs.
So far, we have only defined the necessary preprocessing steps but haven't applied them just yet (that's part of the next chapter).
Detour: Didn't we forget something?
We completely neglected the missing values in the data set. Thus, we still need to handle them with an imputation technique.
Tip
During the development process of a data science project, you will often find yourself jumping back and forth between different steps. This is perfectly normal and part of the process. Seldom will you follow a linear path from start to finish.
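The check itself is not shown here; a line along these lines (an assumption) counts the missing values per column:

# number of missing values per column
data.isna().sum()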
If you execute the above line, you will see that we still have many missing values in a couple of columns. No worries, we can easily handle them with:
from sklearn.impute import SimpleImputer
impute = SimpleImputer(strategy="most_frequent", missing_values=None)
The SimpleImputer
lets us fill in missing values with the most frequent
value in the respective column. But why did we choose this specific strategy?
Why do we plan to fill missing values with the most frequent value (the mode) and not the mean or median?
Info
You might wonder why we didn't include the imputation step in the ColumnTransformer. The reason is that passing the same column to more than one step leads to issues, since the ColumnTransformer runs its steps in parallel rather than sequentially.
Recap
In this chapter, we started our practical data science project by exploring
the bank marketing data set further. We handled missing values and identified
attribute types. We then visualized the data to get a better understanding of
the features. During our discussion of appropriate preprocessing methods,
we discovered the term information leakage and how to prevent it.
Finally, we introduced the ColumnTransformer
to bundle preprocessing
steps together.
Code recap
This time around, we also do a code recap. The essential findings in this chapter can be distilled to:
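A condensed sketch of those findings; the file name is a placeholder, and nominal and ordinal are the column lists assigned earlier in this chapter:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, StandardScaler

# load the data and make the hidden missing values explicit
data = pd.read_csv("bank_marketing.csv")  # placeholder file name
data = data.replace({"unknown": None})

# impute missing values with the most frequent value per column
impute = SimpleImputer(strategy="most_frequent", missing_values=None)

# bundle all preprocessing steps in a single ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("nominal", OneHotEncoder(), nominal),
        ("ordinal", OneHotEncoder(), ordinal),
        ("binning", KBinsDiscretizer(n_bins=5, strategy="uniform", encode="onehot"), ["age", "campaign", "pdays", "previous"]),
        ("zscore", StandardScaler(), ["emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"]),
    ]
)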
In the next chapter we will apply the preprocessing steps to a train and test split. Subsequently, we fit the first model.