Frequency Distribution
Two variables, \(X\) and \(Y\), are observed on the same \(n\) elements. The raw data list consists of the tuples \((x_1, y_1), \dots, (x_n, y_n)\). The possible values are \(a_1, \dots, a_k\) for \(X\) and \(b_1, \dots, b_m\) for \(Y\). The absolute frequency states how often a combination \((a_i, b_j)\) occurs in the raw data.
drinks = ['small', 'small', 'small', 'medium', 'medium', 'medium', 'large']
kcal = [ 123, 154, 123, 201, 201, 201, 434]
In this example,
- \( X \): drinks
  - \( k \): 3
  - \( a_1, \dots, a_k \): ['small', 'medium', 'large']
- \( Y \): kcal
  - \( m \): 4
  - \( b_1, \dots, b_m \): [123, 154, 201, 434]
- \( n \): 7
- \((x_1, y_1), \dots, (x_n, y_n)\): ('small', 123), ('small', 154), ..., ('large', 434)
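The quantities in this list can be derived directly from the two raw lists. The following is a minimal sketch using only the Python standard library; the helper names pairs, a_values, and b_values are chosen for illustration and do not appear in the original example.

```python
# Raw data from the example above
drinks = ['small', 'small', 'small', 'medium', 'medium', 'medium', 'large']
kcal = [123, 154, 123, 201, 201, 201, 434]

# Raw data list: the tuples (x_1, y_1), ..., (x_n, y_n)
pairs = list(zip(drinks, kcal))

# Distinct values of X (in order of first appearance) and of Y (sorted)
a_values = list(dict.fromkeys(drinks))
b_values = sorted(set(kcal))

n = len(pairs)     # number of elements
k = len(a_values)  # number of distinct values of X
m = len(b_values)  # number of distinct values of Y

print(a_values, k)
print(b_values, m)
print(n, pairs[:2], pairs[-1])
```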
Representation of frequencies can be done in the form of a table or a graphic. In tabular form, the so-called crosstab (or contingency table) is commonly used. For graphical representation, a histogram (or 2D bar chart) is suitable. It is important that the data remain the focal point and are presented as accurately and objectively as possible, avoiding distortions like 3D effects or shadows. Titles, axis labels, legends, the data source, and the time of data collection should always be clearly indicated.
Definition: Bivariate Frequency
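Absolute frequency of the combination \( (a_i, b_j) \): \( h_{ij} \), the number of tuples in the raw data list that equal \( (a_i, b_j) \)
Relative frequency of the combination \( (a_i, b_j) \): \( f_{ij} = \frac{h_{ij}}{n} \)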
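As a hedged illustration of absolute and relative bivariate frequencies, the sketch below counts the combinations in the drinks example and divides each count by \(n\). The names h and f mirror the symbols \(h_{ij}\) and \(f_{ij}\); using Counter and Fraction is a choice made for this sketch, not a method prescribed by the text.

```python
from collections import Counter
from fractions import Fraction

drinks = ['small', 'small', 'small', 'medium', 'medium', 'medium', 'large']
kcal = [123, 154, 123, 201, 201, 201, 434]
n = len(drinks)

# Absolute frequency h_ij: how often each combination (a_i, b_j) occurs
h = Counter(zip(drinks, kcal))

# Relative frequency f_ij = h_ij / n, kept as exact fractions
f = {combo: Fraction(count, n) for combo, count in h.items()}

for combo, count in h.items():
    print(combo, count, f[combo])
```

Combinations that never occur, such as ('large', 123), are simply absent from h and have frequency 0.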
Histogram
Histograms are also suitable for bivariate data to represent frequency. A specific type of representation is the density heatmap, which can also be generated using plotly.
import plotly.express as px
df = px.data.tips()
fig = px.density_heatmap(df, x="total_bill", y="tip")
fig.show()
Both absolute and relative frequencies can be visualized using this method (by adding the parameter histnorm='percent'). Multidimensional histograms can also be created using the Python package matplotlib.
Example: Plotly Heatmap
Code
from ucimlrepo import fetch_ucirepo
import plotly.express as px
# Fetch the dataset (id=468: Online Shoppers Purchasing Intention)
# https://archive.ics.uci.edu/dataset/468
shoppers = fetch_ucirepo(id=468)
# Features as a pandas DataFrame
data = shoppers.data.features
# Create a density heatmap
fig = px.density_heatmap(data, x="Region", y="VisitorType")
# Adjust the plot
fig.update_layout(
    xaxis_title_text='Region',
    yaxis_title_text='Visitor Type',
    title=dict(
        text='<b><span style="font-size: 10pt">Density Heatmap</span> <br> <span style="font-size:5">Data: online_shoppers_purchasing_intention_dataset; variable: Region, VisitorType</span></b>',
    ),
)
# Show the plot
fig.show()
Crosstab (Contingency Table)
The joint distribution of discrete features with few categories can be represented using contingency tables. If a feature has many distinct values, they first need to be grouped into classes.
import pandas as pd
import plotly.express as px
df = px.data.tips()
pd.crosstab(df['sex'], df['day'])
These tables can display both absolute and relative frequencies (by adding the parameter normalize=True). It is important to note that contingency tables treat both variables at the nominal scale level only, even if a variable could be measured at a higher level (ordinal or numerical).
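The effect of normalize=True can be checked on the small drinks example from the beginning of this section. This is a minimal sketch; the DataFrame and the names absolute and relative are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'drinks': ['small', 'small', 'small', 'medium', 'medium', 'medium', 'large'],
    'kcal': [123, 154, 123, 201, 201, 201, 434],
})

# Absolute frequencies h_ij
absolute = pd.crosstab(df['drinks'], df['kcal'])

# Relative frequencies f_ij = h_ij / n
relative = pd.crosstab(df['drinks'], df['kcal'], normalize=True)

print(absolute)
print(relative.round(3))
```

The cells of the absolute table add up to \(n = 7\), while the cells of the relative table add up to 1.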
Marginal Frequencies refer to the row and column totals added to a table. The row totals are the marginal frequencies of the variable \(X\), calculated as \( h_{i.} = h_{i1} + \dots + h_{im} \) for \( i = 1, \dots, k \). The column totals are the marginal frequencies of the variable \(Y\), given by \( h_{.j} = h_{1j} + \dots + h_{kj} \) for \( j = 1, \dots, m \). In Python, you just need to add the parameter margins=True.
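A short sketch of margins=True on the drinks example: pandas appends a row and a column labeled 'All' (its default margin name) that hold the column and row totals. The DataFrame and variable names are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'drinks': ['small', 'small', 'small', 'medium', 'medium', 'medium', 'large'],
    'kcal': [123, 154, 123, 201, 201, 201, 434],
})

# Crosstab of absolute frequencies with row and column totals
table = pd.crosstab(df['drinks'], df['kcal'], margins=True)

# Row totals h_i. (marginal frequencies of X) are in the 'All' column
row_totals = table['All']

# Column totals h_.j (marginal frequencies of Y) are in the 'All' row
col_totals = table.loc['All']

print(table)
```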
Marginal Distribution refers to the collection of all marginal frequencies of a variable, that is, its simple frequencies without considering the second variable. In absolute frequencies, this gives \(h_{1.}, h_{2.}, \dots, h_{k.}\) for \(X\) and \(h_{.1}, h_{.2}, \dots, h_{.m}\) for \(Y\).
Absolute Frequency
Definition: Absolute Crosstab
Crosstab of the absolute frequencies
\[ \begin{array}{c|ccc|c} & b_1 & \dots & b_m & \sum \\ \hline a_1 & h_{11} & \dots & h_{1m} & h_{1.} \\ a_2 & h_{21} & \dots & h_{2m} & h_{2.} \\ \vdots & \vdots & & \vdots & \vdots \\ a_k & h_{k1} & \dots & h_{km} & h_{k.} \\ \hline \sum & h_{.1} & \dots & h_{.m} & n \end{array} \]
- \( a_i \): Values of \( X \) with \( i = 1, \dots, k \)
- \( b_j \): Values of \( Y \) with \( j = 1, \dots, m \)
- \( h_{ij} \): The absolute frequency of the combination \( (a_i, b_j) \)
- \( h_{1.}, \dots, h_{k.} \): The marginal frequencies of \( X \)
- \( h_{.1}, \dots, h_{.m} \): The marginal frequencies of \( Y \)
- \( n \): Total number of elements
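As a worked illustration, counting the seven tuples of the drinks and kcal lists from the beginning of this section (rows and columns ordered as in the value lists given there) yields the following absolute crosstab:

\[ \begin{array}{c|cccc|c} & 123 & 154 & 201 & 434 & \sum \\ \hline \text{small} & 2 & 1 & 0 & 0 & 3 \\ \text{medium} & 0 & 0 & 3 & 0 & 3 \\ \text{large} & 0 & 0 & 0 & 1 & 1 \\ \hline \sum & 2 & 1 & 3 & 1 & 7 \end{array} \]

Each row total and each column total is the sum of the cells in that row or column, and both sets of totals add up to \( n = 7 \).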
Example: Absolute Crosstab Website Visitors
Crosstab of the absolute Frequencies
Relative Frequency
Definition: Relative Crosstab
Crosstab of the relative frequencies
\[ \begin{array}{c|ccc|c} & b_1 & \dots & b_m & \sum \\ \hline a_1 & f_{11} & \dots & f_{1m} & f_{1.} \\ a_2 & f_{21} & \dots & f_{2m} & f_{2.} \\ \vdots & \vdots & & \vdots & \vdots \\ a_k & f_{k1} & \dots & f_{km} & f_{k.} \\ \hline \sum & f_{.1} & \dots & f_{.m} & 1 \end{array} \]
- \( a_i \): Values of \( X \) with \( i = 1, \dots, k \)
- \( b_j \): Values of \( Y \) with \( j = 1, \dots, m \)
- \( f_{ij} = \frac{h_{ij}}{n} \): The relative frequency of the combination \( (a_i, b_j) \)
- \( f_{i.} = \frac{h_{i.}}{n} \): The relative marginal frequencies of \( X \)
- \( f_{.j} = \frac{h_{.j}}{n} \): The relative marginal frequencies of \( Y \)
Example: Relative Crosstab Website Visitors
Crosstab of the relative Frequencies in [%]
Code
from ucimlrepo import fetch_ucirepo
import pandas as pd
# Fetch the dataset (id=468: Online Shoppers Purchasing Intention)
# https://archive.ics.uci.edu/dataset/468
shoppers = fetch_ucirepo(id=468)
# Features as a pandas DataFrame
data = shoppers.data.features
# Create a crosstab of the relative frequencies with margins, in percent
pd.crosstab(data['VisitorType'], data['Region'], margins=True, normalize='all') * 100
Conditional Frequency
Joint absolute and relative frequencies are not well suited for assessing the relationship between variables. For example, the frequencies of the regions for New_Visitors and Returning_Visitors cannot be compared directly because the two groups differ in size. The conditional relative frequency makes this comparison possible by relating each count to the size of its group. In pd.crosstab(), this is done by adding normalize='index' (each row is divided by its row total) or normalize='columns' (each column is divided by its column total).
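The two normalization options can be compared on the small drinks example from the beginning of this section. This is a sketch under the assumption that drinks is the row variable and kcal the column variable; the names by_drink and by_kcal are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'drinks': ['small', 'small', 'small', 'medium', 'medium', 'medium', 'large'],
    'kcal': [123, 154, 123, 201, 201, 201, 434],
})

# Condition on the row variable: every row sums to 1
by_drink = pd.crosstab(df['drinks'], df['kcal'], normalize='index')

# Condition on the column variable: every column sums to 1
by_kcal = pd.crosstab(df['drinks'], df['kcal'], normalize='columns')

print(by_drink.round(3))
print(by_kcal.round(3))
```

Although 'small' with 123 kcal occurs twice and 'large' with 434 kcal only once, the row-wise conditional frequencies are 2/3 and 1, because the 'small' group contains three observations and the 'large' group only one.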
Definition: Conditional Frequency
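Conditional Frequency Distribution of \( Y \) given \( X = a_i \): \( f_Y(b_j \mid a_i) = \frac{h_{ij}}{h_{i.}} \) for \( j = 1, \dots, m \)
Conditional Frequency Distribution of \( X \) given \( Y = b_j \): \( f_X(a_i \mid b_j) = \frac{h_{ij}}{h_{.j}} \) for \( i = 1, \dots, k \)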
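A worked example, assuming the usual convention that the conditional relative frequency of \( Y \) given \( X = a_i \) divides each cell \( h_{ij} \) by its row total \( h_{i.} \): in the drinks example from the beginning of this section, take \( a_1 = \text{small} \) with \( h_{1.} = 3 \) and the \( Y \) values 123, 154, 201, 434. Then

\[ \frac{h_{11}}{h_{1.}} = \frac{2}{3}, \quad \frac{h_{12}}{h_{1.}} = \frac{1}{3}, \quad \frac{h_{13}}{h_{1.}} = \frac{0}{3} = 0, \quad \frac{h_{14}}{h_{1.}} = \frac{0}{3} = 0 \]

These values add up to 1, as every conditional distribution must.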
Example: Conditional Frequency of Website Visitors
Crosstab of the Conditional Frequencies for given Visitor Types in [%]
Code
from ucimlrepo import fetch_ucirepo
import pandas as pd
# Fetch the dataset (id=468: Online Shoppers Purchasing Intention)
# https://archive.ics.uci.edu/dataset/468
shoppers = fetch_ucirepo(id=468)
# Features as a pandas DataFrame
data = shoppers.data.features
# Create a crosstab conditioned on the visitor type (rows sum to 100 %)
print(pd.crosstab(data['VisitorType'], data['Region'], margins=True, normalize='index') * 100)
Example: Conditional Frequency of Website Visitors
Crosstab of the Conditional Frequencies for given Region in [%]
Code
from ucimlrepo import fetch_ucirepo
import pandas as pd
# Fetch the dataset (id=468: Online Shoppers Purchasing Intention)
# https://archive.ics.uci.edu/dataset/468
shoppers = fetch_ucirepo(id=468)
# Features as a pandas DataFrame
data = shoppers.data.features
# Create a crosstab conditioned on the region (columns sum to 100 %)
print(pd.crosstab(data['VisitorType'], data['Region'], margins=True, normalize='columns') * 100)
Recap
- Frequencies in the bivariate case describe how often a combination of two values occurs.
- As in the univariate case, a distinction between absolute and relative frequency is made.
- 2D histograms or contingency tables can be used for representation.
- Relationships between variables are not easily identified in either absolute or relative contingency tables.
- The conditional frequency examines the frequency distribution of one variable while fixing the second variable.
Tasks
Task: Bivariate Frequency
Use the following dataset:
from ucimlrepo import fetch_ucirepo
# fetch dataset
cars = fetch_ucirepo(id=9)
# https://archive.ics.uci.edu/dataset/9/auto+mpg
# data (as pandas dataframes)
data = cars.data.features
# Show the first 5 rows
data.head()
- Generate a 2D histogram for the variables origin and horsepower (think about attribute types, title, and labeling of the axes). Interpret the results.
- Calculate the crosstab of the absolute frequencies for the variables origin and cylinders.
- Calculate the conditional crosstab of the relative frequencies to answer the following question: How are the cylinders distributed within each origin?