Measures of Dispersion
Measures of dispersion characterize a distribution by indicating how data is spread around an average value. These metrics describe the variability or heterogeneity of the data.
Info
Some formulas differ between samples and populations (e.g., variance), so results may differ slightly depending on which formula is used.
Range
The range, denoted as \( R \), is the difference between the largest and smallest value in a dataset. However, in the presence of extremely large or small outliers, the range can provide a distorted view of the data's variability.
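Definition: Range
\[ R = x_{\max} - x_{\min} \]
with \( x_1, x_2, \dots, x_N \) representing a set of \( N \) values of a metric variable \( X \), where \( x_{\max} \) and \( x_{\min} \) are the largest and smallest of these values.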
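As a minimal sketch of this definition, the range can be computed with Python's built-in `max` and `min`. The temperature values below are made up for illustration and are not the values from the example table; the helper name `value_range` is also an assumption.

```python
def value_range(values):
    """Return the range R = max - min of a non-empty sequence of numbers."""
    return max(values) - min(values)

# Hypothetical temperatures in °C (illustrative only)
temps = [12.5, 14.0, 15.5, 17.0, 18.5, 20.0, 21.5]
print('Range:', value_range(temps))

# A single extreme value changes the range drastically
temps_with_outlier = temps + [45.0]
print('Range with outlier:', value_range(temps_with_outlier))
```

The second call shows why the range is sensitive to outliers: one unusual value determines the result on its own.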
Example: Range of the Temperature
Given a table with 14 temperature values in °C, the goal is to calculate the range of the distribution.
Solution
The temperature range is \(18.7^\circ C\).
Interquartile Range
The interquartile range (IQR) is the difference between the third and first quartile.
It describes the spread of the middle 50% of the data, providing a measure of variability that is less sensitive to outliers than the range.
Definition: Interquartile Range
\[ \mathrm{IQR} = Q_3 - Q_1 \]
where \( Q_1 \) and \( Q_3 \) are the first and third quartiles of a dataset with \( N \) values of a variable \( X \).
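The following sketch computes the IQR with `statistics.quantiles` from the standard library (Python 3.8 or newer). The data and the helper name `iqr` are hypothetical. Several conventions exist for computing quartiles, so different tools (or the two `method` options shown here) can return slightly different values for the same data.

```python
import statistics

def iqr(values, method='exclusive'):
    """Return Q3 - Q1 using statistics.quantiles with n=4 (quartiles)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method=method)
    return q3 - q1

# Hypothetical data (illustrative only)
values = [1, 2, 3, 4, 5, 6, 7, 8]
print('IQR (exclusive):', iqr(values))
print('IQR (inclusive):', iqr(values, method='inclusive'))

# One extreme value changes the IQR only slightly, unlike the range
print('IQR with outlier:', iqr(values + [100]))
```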
Example: IQR of the Temperature
Given a table with 14 temperature values in °C, the goal is to calculate the IQR of the distribution.
Solution
The IQR of the temperature is \(11^\circ C\).
Variance
Variance \( \sigma^2 \) is the mean of the squared deviations from the average. It indicates how spread out a distribution is.
import statistics
data = [1, 2, 1, 2, 3, 4, 1, 100, 1, 2, 2]
print('Variance: ', statistics.variance(data))  # sample: divides by N - 1
print('Population Variance: ', statistics.pvariance(data))  # population: divides by N
Definition: Variance
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 \]
with \( x_1, x_2, \dots, x_N \) representing a set of \( N \) values of a metric variable \( X \) and \( \bar{x} \) their mean. This formula applies to the entire population. For a sample, the sum is divided by \( N - 1 \) instead of \( N \).
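To make the difference between the two divisors concrete, here is a small sketch that implements the definition by hand and cross-checks it against the `statistics` module. The function name `variance_from_definition` and its `sample` flag are assumptions made for this illustration.

```python
import math
import statistics

def variance_from_definition(values, sample=False):
    """Mean squared deviation from the mean; divides by N - 1 if sample=True."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)

data = [1, 2, 1, 2, 3, 4, 1, 100, 1, 2, 2]
population_var = variance_from_definition(data)
sample_var = variance_from_definition(data, sample=True)
print('Population variance:', population_var)
print('Sample variance:', sample_var)

# Cross-check against the standard library
print(math.isclose(population_var, statistics.pvariance(data)))
print(math.isclose(sample_var, statistics.variance(data)))
```

Because the sample formula divides by the smaller number \( N - 1 \), the sample variance is always at least as large as the population variance of the same values.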
Example: Variance of the Temperature
Given a table with 14 temperature values in °C, the goal is to calculate the variance of the distribution.
Solution
The variance of the temperature is \(33.12^\circ C^2\).
Standard Deviation
The standard deviation \( \sigma \) describes a "typical" deviation from the mean. It indicates how spread out a distribution is.
import statistics
data = [1, 2, 1, 2, 3, 4, 1, 100, 1, 2, 2]
print('Standard Deviation: ', statistics.stdev(data))  # sample: based on N - 1
print('Population Standard Deviation: ', statistics.pstdev(data))  # population: based on N
A small \( \sigma \) suggests that the data tends to be close to the mean, while a large \( \sigma \) indicates that the data is spread over a wide range of values.
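Definition: Standard Deviation
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \]
with \( x_1, x_2, \dots, x_N \) representing a set of \( N \) values of a metric variable \( X \), \( \bar{x} \) their mean, and \( \sigma^2 \) being the corresponding variance.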
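The sketch below illustrates how \( \sigma \) separates two series that share the same mean. Both series are hypothetical, and the population formula is used to match the definition above.

```python
import math
import statistics

# Two hypothetical series with the same mean (10) but different spread
narrow = [9, 10, 10, 10, 11]
wide = [0, 5, 10, 15, 20]

for name, values in [('narrow', narrow), ('wide', wide)]:
    sigma = math.sqrt(statistics.pvariance(values))  # sigma = sqrt(variance)
    print(name, 'mean:', statistics.mean(values), 'sigma:', round(sigma, 2))
```

Unlike the variance, \( \sigma \) has the same unit as the data, which makes it easier to interpret.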
Example: Standard Deviation of the Temperature
Given a table with 14 temperature values in °C, the goal is to calculate the standard deviation of the distribution.
Solution
The standard deviation is \(5.76^\circ C\); this is the typical deviation of the temperature values from the mean.
Coefficient of Variation
Variances and standard deviations of different data series are difficult to compare when the series have different units or scales. The coefficient of variation (\(c_v\)) solves this problem by relating the standard deviation to the mean. It is often referred to as the relative standard deviation.
Definition: Coefficient of Variation
\[ c_v = \frac{\sigma}{\bar{x}} \]
with \( x_1, x_2, \dots, x_N \) representing a set of \( N \) values of a metric variable \( X \), \( \sigma \) being the corresponding standard deviation, and \( \bar{x} \) the mean.
Example: Pizza Prices
You are given a table of pizza prices in New York listed in two currencies, dollars and pesos.
dollar = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]
pesos = [18.81, 37.62, 56.43, 56.43, 94.05, 112.86, 131.67, 150.48, 169.29, 206.91]
The goal is to calculate the coefficient of variation for both data series.
Solution
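One possible way to carry out the calculation is sketched below. It uses the sample standard deviation (`statistics.stdev`) and assumes that each peso price is the dollar price multiplied by 18.81, as the first listed price pair suggests. The helper name `coefficient_of_variation` and the variable `rate` are assumptions made for this illustration.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

dollar = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]

# Assumption: 1 dollar = 18.81 pesos, as the first listed price pair suggests
rate = 18.81
pesos = [round(price * rate, 2) for price in dollar]

cv_dollar = coefficient_of_variation(dollar)
cv_pesos = coefficient_of_variation(pesos)
print('CV dollar:', round(cv_dollar, 3))
print('CV pesos: ', round(cv_pesos, 3))
```

Under this assumption both series have the same coefficient of variation, because multiplying every value by a constant scales the standard deviation and the mean by the same factor. The standard deviations themselves differ by the factor 18.81, which is why they cannot be compared directly.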
Recap
- Measures of dispersion characterize a distribution by describing how data is spread around a central value.
- The IQR describes the spread of the middle 50% of the data.
- Variance indicates how wide a distribution is.
- Interpreting variance can be challenging because its units are squared.
- For this reason, standard deviation is a more suitable measure for interpretation.
- To better compare standard deviations across datasets, the coefficient of variation is used.
- There are different formulas for variance depending on whether the entire population or a sample is being analyzed.
Tasks
Task: Measures of Dispersion
Use the following dataset:
from ucimlrepo import fetch_ucirepo
# fetch dataset
cars = fetch_ucirepo(id=9)
# https://archive.ics.uci.edu/dataset/9/auto+mpg
# data (as pandas dataframes)
data = cars.data.features
data = data.join(cars.data.ids)
# Show the first 5 rows
data.head()
- For the attribute acceleration, calculate the following measures (use the sample formula, not the population formula):
  - Range
  - IQR (compare to the boxplot from the section Measures of Central Tendency)
  - Variance
  - Standard Deviation
  - CV
Task: Weight of Euro Coins
Download the following dataset from this page and load it into your notebook.
# Website: https://jse.amstat.org/v14n2/datasets.aerts.html
# Dataset: https://jse.amstat.org/datasets/euroweight.dat.txt
# Description: https://jse.amstat.org/datasets/euroweight.txt
import pandas as pd
import numpy as np
# Load the dataset
data = pd.read_csv('Daten/euroweight.dat.txt', sep='\t', header=None, index_col=0, names=['Weight', 'Batch'])
# Display the first few rows
data.head()
As the Head of Quality Control at the European Central Bank (ECB), you are responsible, among other duties, for the quality management of 1-Euro coins. You have therefore tasked an employee with selecting a random sample of 2,000 coins. (Dataset: 'euroweight.dat.txt', Unit: grams)
- Calculate the average weight of the coins.
- Determine the corresponding standard deviation and interpret its significance.
- Create a histogram. Ensure that all axes are labeled and the chart is properly titled.