Skip to content

Frequency Distribution

A list \( X \) consists of \( n \) elements \( x_1, \dots, x_n \). Within this list, \( X \) contains \( k \) distinct values (\( a_1, \dots, a_k \)). The frequency refers to how often a specific value \( a_k \) appears in \( X \).

drinks = ['small', 'small', 'small', 'medium', 'medium', 'medium', 'large']

In this example,

  • \( X \): drinks
  • \( n \): 7
  • \( x_1, \dots, x_n \): ['small', 'small', 'small', 'medium', 'medium', 'medium', 'large']
  • \( k \): 3
  • \( a_1, \dots, a_k \): ['small', 'medium', 'large']

In the case of a nominal scale, \( k \) is equal to the number of categories, with \( k \) typically much smaller than \( n \). For a metric scale, there are often only a few identical values, meaning \( k \) is approximately equal to \( n \).

The representation of frequencies can be done in the form of a table or a graphical format. When a frequency distribution is depicted as a bar chart, it is referred to as a histogram.

import plotly.express as px

df = px.data.tips()
fig = px.histogram(df, x="total_bill")
fig.show()

It is important that the data remains the focal point and is presented as accurately and objectively as possible, avoiding distortions such as 3D effects or shadows. Titles, axis labels, legends, the data source, and the time of data collection should always be clearly indicated.

Definition: Frequency

Absolute Frequency of the value \( a_j \)

\[ h(a_j) = h_j \]

Relative Frequency of the value \( a_j \)

\[ f(a_j) = f_j = \frac{h_j}{n} \]

Absolute Frequency Distribution: \( h_1, \dots, h_k \)

Relative Frequency Distribution: \( f_1, \dots, f_k \)

Code

For the upcoming analysis, the following data will be used:

# Import Libraries
import pandas as pd

# Import Data
data = pd.read_csv('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_2023.csv')

Nominal Scale

For nominally scaled variables, the values correspond to the possible categories. The internal order of these categories is not relevant in the substantive analysis.

Example: Graphical Representation of Nominal Variables
  • Code
    import plotly.express as px
    
    # Generate Histogram
    fig = px.histogram(
        data, 
        x="surface",
    )
    
    # Adjust the plot
    fig.update_layout(
        xaxis_title_text='Surface',
        yaxis_title_text='Absolute Frequency',
        title=dict(
                text='<b><span style="font-size: 10pt">Nominal Variable: Histogram</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: surface</span></b>',
            ),
    )
    
    # Show the plot
    fig.show()
    
  • Code
    import plotly.express as px
    
    # Generate Pie Chart
    fig = px.pie(
        data,
        names="surface",
    )
    
    # Adjust the plot
    fig.update_layout(
        title=dict(
                text='<b><span style="font-size: 10pt">Nominal Variable: Pie Chart</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: surface</span></b>',
            ),
    )
    fig.update_traces(textposition='outside', textinfo='percent+label')
    
    # Show the plot
    fig.show()
    

Ordinal Scale

For ordinally scaled variables, the values also correspond to the possible categories. However, the internal order of these categories is relevant in the substantive analysis. The values should always be presented in either ascending or descending order. In order to tell Python the correct order, we need to define it first

round_order = ['F', 'SF', 'QF', 'R16', 'R32', 'R64', 'R128', 'RR']

Afterwards we can use this order in the histogram

fig = px.histogram(
                data, 
                x="round",
                category_orders={"round": round_order}
                )

and for the calculation of the corsstable

data['round'] = pd.Categorical(data['round'], categories=round_order, ordered=True)

Example: Graphical Representation of Ordinal Variables
  • Histogram WITHOUT Order


    Code
    import plotly.express as px
    
    # Generate Histogram
    fig = px.histogram(
        data, 
        x="round",
    )
    # Adjust the plot
    fig.update_layout(
        title=dict(
                text='<b><span style="font-size: 10pt">Ordinal Variable: NO Order</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: round</span></b>',
            ),
        xaxis_title_text='Round',
        yaxis_title_text='Absolute Frequency',
    )
    # Show the plot
    fig.show()
    
  • Histogram WITH Order


    Code
    import plotly.express as px
    
    # Define the order of the ordinal variable
    data_ord = data.copy()
    round_order = ['F', 'SF', 'QF', 'R16', 'R32', 'R64', 'R128', 'RR']  # Define order of the rounds
    
    # HISTOGRAM sorted
    # Generate Histogram
    fig = px.histogram(
        data_ord, 
        x="round",
        category_orders={"round": round_order[::-1]},
    )
    
    # Adjust the plot
    fig.update_layout(
        title=dict(
                text='<b><span style="font-size: 10pt">Ordinal Variable: WITH Order</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: round</span></b>',
            ),
        xaxis_title_text='Round',
        yaxis_title_text='Absolute Frequency',
    )
    
    # Show the plot
    fig.show()
    
  • Table WITHOUT Order


    round Absoulte Frequency Relative Frequency [%]
    F 68 2.278
    QF 256 8.57
    R128 416 13.93
    R16 512 17.15
    R32 880 29.47
    R64 432 14.47
    RR 286 9.58
    SF 136 4.55
    Code
    # FREQUENCY TABLE unsorted
    # Generate table with absolute and relative frequencies
    absolutefrequency_ord_unsort = pd.crosstab(index=data['round'], columns='Absoulte Frequency')
    relativefrequency_ord_unsort = pd.crosstab(index=data['round'], columns='Relative Frequency [%]',normalize=True)*100
    
    # Combine the tables
    frequencytable_ord_unsort = pd.concat([absolutefrequency_ord_unsort, relativefrequency_ord_unsort], axis=1).reset_index()
    frequencytable_ord_unsort.columns.name = None
    
    # Show table
    print(frequencytable_ord_unsort)
    
  • Table WITH Order


    Round Absoulte Frequency Relative Frequency [%]
    F 68 2.28
    SF 136 4.55
    QF 256 8.57
    R16 512 17.15
    R32 880 29.47
    R64 432 14.47
    R128 416 13.93
    RR 286 9.58
    Code
    # FREQUENCY TABLE sorte
    round_order = ['F', 'SF', 'QF', 'R16', 'R32', 'R64', 'R128', 'RR']  # Define order of the roundsd
    data['round'] = pd.Categorical(data['round'], categories=round_order, ordered=True)  # Define order of the rounds
    
    # Generate table with absolute and relative frequencies
    absolutefrequency_ord_sort = pd.crosstab(index=data['round'], columns='Absoulte Frequency')
    relativefrequency_ord_sort = pd.crosstab(index=data['round'], columns='Relative Frequency [%]',normalize=True)*100
    
    # Combine the tables
    frequencytable_ord_sort = pd.concat([absolutefrequency_ord_sort, relativefrequency_ord_sort], axis=1).reset_index()
    frequencytable_ord_sort.columns.name = None
    
    # Show table
    print(frequencytable_ord_sort)
    

In the case of ordinally scaled variables, a cumulative absolute or relative frequency can also be calculated. The cumulative absolute frequency indicates how often a reference value (or category) has not been exceeded. The cumulative relative frequency is this number divided by the total number of observations.

To calculate the cumulative relative frequency in the histogram, we need to add the lines:

fig = px.histogram(
                data, 
                x="round",
                category_orders={"round": round_order[::-1]},
                cumulative=True,
                histnorm="percent"
                )
To do the same for the crosstab we need to add:

freq_rel_cum = pd.crosstab(
                index=data['round'], 
                columns='Relative Frequency',
                normalize=True
                ).cumsum()
Example: Cumulative Frequency of Ordinal Variables
  • Histogram (Abolute, Cumulative)


    Code
    import plotly.express as px
    
    # HISTOGRAM sorted Cumulative Absolute
    round_order = ['F', 'SF', 'QF', 'R16', 'R32', 'R64', 'R128', 'RR']  # Define order of the roundsd
    data['round'] = pd.Categorical(data['round'], categories=round_order, ordered=True)  # Define order of the rounds
    
    # Generate Histogram
    fig = px.histogram(
        data, 
        x="round",
        category_orders={"round": round_order[::-1]},
        cumulative=True,
    )
    
    # Adjust the plot
    fig.update_layout(
        xaxis_title_text='Round',
        yaxis_title_text='Absolute Frequency',
        title=dict(
                text='<b><span style="font-size: 10pt">Ordinal Variable: Cumulated</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: round</span></b>',
            ),
    )
    
    # Show the plot
    fig.show()
    
  • Histogram (Relative, Cumulative)


    Code
    import plotly.express as px
    # HISTOGRAM sorted cumulated Relative
    round_order = ['F', 'SF', 'QF', 'R16', 'R32', 'R64', 'R128', 'RR']  # Define order of the roundsd
    data['round'] = pd.Categorical(data['round'], categories=round_order, ordered=True)  # Define order of the rounds
    
    # Generate Histogram
    fig = px.histogram(
        data, 
        x="round",
        category_orders={"round": round_order[::-1]},
        cumulative=True,
        histnorm="percent"
    )
    
    # Adjust the plot
    fig.update_layout(
        xaxis_title_text='Round',
        yaxis_title_text='Relative Frequency [%]',
        title=dict(
                text='<b><span style="font-size: 10pt">Ordinal Variable: Cumulated</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: round</span></b>',
            ),
    )
    
    # Show the plot
    fig.show()
    
  • Table (Abolute, Cumulative)


    Round Absoulte Frequency Absoulte Frequency Cumulated
    F 68 68
    SF 136 204
    QF 256 460
    R16 512 972
    R32 880 1852
    R64 432 2284
    R128 416 2700
    RR 286 2986
    Code
    # FREQUENCY TABLE sorted cumulated absolute
    round_order = ['F', 'SF', 'QF', 'R16', 'R32', 'R64', 'R128', 'RR']  # Define order of the roundsd
    data['round'] = pd.Categorical(data['round'], categories=round_order, ordered=True)  # Define order of the rounds
    
    # Generate table with absolute and relative frequencies
    absolutefrequency_ord_sort = pd.crosstab(index=data['round'], columns='Absoulte Frequency')
    relativefrequency_ord_sort = pd.crosstab(index=data['round'], columns='Relative Frequency [%]',normalize=True)*100
    
    # Combine the tables
    frequencytable_ord_sort = pd.concat([absolutefrequency_ord_sort, relativefrequency_ord_sort], axis=1).reset_index()
    frequencytable_ord_sort.columns.name = None
    
    frequencytable_ord_sort['Absoulte Frequency Cumulated'] = frequencytable_ord_sort['Absoulte Frequency'].cumsum()
    frequencytable_ord_sort.drop(columns='Relative Frequency [%]', inplace=True)
    print(frequencytable_ord_sort)
    
  • Table (Relative, Cumulative)


    Round Relative Frequency [%] Relative Frequency Cumulated
    F 2.28 2.28
    SF 4.55 6.83
    QF 8.57 15.41
    R16 17.15 32.55
    R32 29.47 62.02
    R64 14.47 76.49
    R128 13.93 90.42
    RR 9.58 100
    Code
    # FREQUENCY TABLE sorted cumulated relative
    round_order = ['F', 'SF', 'QF', 'R16', 'R32', 'R64', 'R128', 'RR']  # Define order of the roundsd
    data['round'] = pd.Categorical(data['round'], categories=round_order, ordered=True)  # Define order of the rounds
    
    # Generate table with absolute and relative frequencies
    absolutefrequency_ord_sort = pd.crosstab(index=data['round'], columns='Absoulte Frequency')
    relativefrequency_ord_sort = pd.crosstab(index=data['round'], columns='Relative Frequency [%]',normalize=True)*100
    
    # Combine the tables
    frequencytable_ord_sort = pd.concat([absolutefrequency_ord_sort, relativefrequency_ord_sort], axis=1).reset_index()
    frequencytable_ord_sort.columns.name = None
    
    frequencytable_ord_sort['Relative Frequency Cumulated '] = frequencytable_ord_sort['Relative Frequency [%]'].cumsum()
    frequencytable_ord_sort.drop(columns='Absoulte Frequency', inplace=True)
    print(frequencytable_ord_sort)
    

Numeric Scale

When the number of values \( k \) for a metrically scaled variable is small, it can be presented in the same way as an ordinal scale. However, when \( k \) is large, the representation can become cluttered and lose clarity.

Example: Few and Many Numeric Values
  • Numeric Variable with Few of Values


    Code
    import plotly.express as px
    # HISTOGRAM Small Number of Values
    # Generate Histogram
    fig = px.histogram(
        data, 
        x="draw_size",
    )
    
    # Adjust the plot
    fig.update_layout(
        xaxis_title_text='Draw Size',
        yaxis_title_text='Absolute Frequency',
        title=dict(
                text='<b><span style="font-size: 10pt">Small Number of Values</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: draw_size</span></b>',
            ),
    )
    
    # Show the plot
    fig.show()
    
  • Numeric Variable with Many Values


    Code
    import plotly.express as px
    # HISTOGRAM Large Number of Values
    # Generate Histogram
    fig = px.bar(
        data['winner_rank_points'].value_counts().reset_index(), 
        x='winner_rank_points', 
        y='count'
        )
    
    # Adjust the width of the bars
    fig.update_traces(width=50)
    
    # Adjust the plot
    fig.update_layout(
        xaxis_title_text='Winner Rank Points',
        yaxis_title_text='Absolute Frequency',
        title=dict(
                text='<b><span style="font-size: 10pt">Large Number of Values</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: winner_rank_points</span></b>',
            ),
    )
    
    # Show the plot
    fig.show()
    

In such cases, categories (intervals or bins) should be created to reduce the number of displayed values, making the data easier to interpret. This can be done automatically e.g. by px.histogram() or manually:

data['points_cat'] = pd.cut(
                        data['winner_rank_points'], 
                        bins=range(0,int(data['winner_rank_points'].max()),100), 
                        right=False)
Example: Numeric Attribute Binning
  • Automatic Binning


    Code
    import plotly.express as px
    
    # HISTOGRAM Large Number of Values
    # Generate Histogram
    fig = px.histogram(
        data, 
        x="winner_rank_points",
    )
    
    # Adjust the plot
    fig.update_layout(
        xaxis_title_text='Winner Rank Points',
        yaxis_title_text='Absolute Frequency',
        title=dict(
                text='<b><span style="font-size: 10pt">Automatic Binning</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: winner_rank_points</span></b>',
            ),
    )
    
    # Show the plot
    fig.show()
    
  • Manual Binning


    Code
    import plotly.express as px
    # Binning of the Data
    data['points_cat'] = pd.cut(data['winner_rank_points'], bins=range(0,int(data['winner_rank_points'].max()),100), right=False) # 100 Bins between 0 and the maximum value of winner_rank_points
    # data_num['points_cat'] = pd.cut(data_num['winner_rank_points'], bins=[0, 60, 120, 180, 240, 300, 360, 420, 480, 540, 600, 660, 720, 780, 840, 900, 960, 1020, 1080, 1140, 1200], right=False) # Custom Bins
    
    # Count the values in each bin
    points_cat_count = data['points_cat'].value_counts().sort_index()
    points_cat_count.index = points_cat_count.index.astype(str)
    
    # Generate Bar Chart
    fig = px.bar(
        points_cat_count,
        )
    
    
    # Adjust the plot
    fig.update_layout(
        xaxis_title_text='Winner Rank Points',
        yaxis_title_text='Absolute Frequency',
        showlegend=False,
        title=dict(
                text='<b><span style="font-size: 10pt">Manual Binning</span> <br> <span style="font-size:5">Data: atp_matches_2023.csv; variable: winner_rank_points</span></b>',
            ),
    )
    
    # Show the plot
    fig.show()
    

Tables and charts are well-suited for providing an overview of the data. However, in some cases, it is beneficial to further condense the information within the data to reduce complexity. Nevertheless, care must be taken not to oversimplify, as this could lead to misleading interpretations. There are several key metrics available for further reducing complexity. These are typically divided into measures of central tendency and measures of dispersion.

Recap

  • Data should always be the focus, with an unbiased representation.
  • Frequencies indicate how often a particular value occurs.
  • Relative frequency (%) is the absolute frequency divided by the total number of observations.
  • The form of representation depends on the scale level of the variable.
  • In general, both tables and charts can be used.
  • Cumulative frequencies show how often a reference value has not been exceeded.

Tasks

Task: Frequency Distribution

Use the following dataset:

from ucimlrepo import fetch_ucirepo 

# fetch dataset 
cars = fetch_ucirepo(id=9) 
# https://archive.ics.uci.edu/dataset/9/auto+mpg

# data (as pandas dataframes) 
data = cars.data.features
data = data.join(cars.data.ids)

# Show the first 5 rows
data.head()
Work on the following task:

  1. Analyse the Dataset
    • Look at the website of the dataset and get familiar
  2. Generate the following plot (think about attribute types, title, labeling of the axes)
    • Histogram | Absolute Frequency | Variable: origin
    • Bar Chart | Absoulte Frequency | no binning | Variable: weight
    • Histogram | Absoulte Frequency | automatic binning | Variable: weight
    • Histogram | Relative Frequency | cumulated | Variable: hoursepower
    • Pie Chart | Relative Frequency | Variable: cylinders