ANOVA
ANOVA, which stands for Analysis of Variance, is a statistical method used to determine whether there are statistically significant differences between the means of three or more groups. Unlike the t-test, which examines whether there is a difference between two groups, ANOVA is used to assess differences among multiple group means simultaneously by comparing the variances within each group to the variances between the groups. The primary goal of ANOVA is to determine whether any of those differences are statistically significant.
Types of ANOVA
- One-Way ANOVA: analyzes the impact of a single independent variable (factor) with three or more levels on a continuous dependent variable.
  Example: A factory manager uses One-Way ANOVA to compare the average production times of widgets produced by Machine A, Machine B, and Machine C.
- Two-Way ANOVA: evaluates the effects of two independent variables (factors) simultaneously and examines the interaction between them on a continuous dependent variable.
  Example: A researcher employs Two-Way ANOVA to study the effects of different fertilizers and watering schedules on plant growth.
- Repeated Measures ANOVA: assesses the effects of one or more factors when the same subjects are measured multiple times under different conditions.
  Example: A psychologist uses Repeated Measures ANOVA to evaluate participants' stress levels before, during, and after a meditation intervention.
- MANOVA (Multivariate ANOVA): extends ANOVA by analyzing the effects of one or more independent variables on two or more dependent variables simultaneously.
  Example: An educator applies MANOVA to investigate how teaching methods and class sizes influence both student performance and engagement levels.
One-Way ANOVA
One-Way ANOVA is a statistical method used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. Unlike the t-test, which compares the means of two groups, One-Way ANOVA can handle multiple groups simultaneously, making it a powerful tool for analyzing variations within and between groups.
Example One-Way ANOVA in Production Scenario
In a factory, there are three different machines used to assemble electronic components: Machine A, Machine B, and Machine C. The production manager wants to know if the type of machine impacts the production time of the components.
The key concepts of One-Way ANOVA include:
- Factor: The independent variable that categorizes the data. In One-Way ANOVA, there is only one factor.
- Levels: The different categories or groups within the factor.
- Dependent Variable: The outcome or response variable that is measured.
- Null Hypothesis (H0): Assumes that all group means are equal.
- Alternative Hypothesis (H1): Assumes that at least one group mean is different.
Example One-Way ANOVA in Production Scenario
- Factor: Machine Types
- Levels: Machine A, Machine B, Machine C
- Dependent Variable: Production Time
- Null Hypothesis (H0): There is no difference in the average assembly time among the three machines.
  \[ H_0 : \mu_A = \mu_B = \mu_C \]
- Alternative Hypothesis (H1): At least one group mean is different.
When to Use One-Way ANOVA
One-Way ANOVA is appropriate when you want to:
- Compare the means of three or more independent groups.
- Assess the impact of a single categorical factor on a continuous dependent variable.
- Determine if at least one group mean significantly differs from the others.
Approach
At its core, ANOVA partitions the total variance in the data into components attributable to different sources. Here's a simplified breakdown:
- Hypothesis Definition
  - Null Hypothesis (\( H_0 \)): All group means are equal.
  - Alternative Hypothesis (\( H_a \)): At least one group mean is different.
- Sum of Squares (SS)
  - Total Sum of Squares (SS Total): a measure of the total variability in the data.
    \[ SS_{Total} = \sum_{i=1}^{N} (Y_i - \overline{Y})^2 \]
    with:
    - \( Y_i \) = individual observations
    - \( \overline{Y} \) = grand mean
    - \( N \) = total number of observations
  - Between-Group Sum of Squares (SS Between): a measure of the variability due to the factor (e.g., different treatments).
    \[ SS_{Between} = \sum_{j=1}^{k} n_j (\overline{Y}_j - \overline{Y})^2 \]
    with:
    - \( \overline{Y}_j \) = mean of group \( j \)
    - \( n_j \) = number of observations in group \( j \)
    - \( k \) = number of groups
  - Within-Group Sum of Squares (SS Within): a measure of the variability within each group.
    \[ SS_{Within} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (Y_{ij} - \overline{Y}_j)^2 \]
- Degrees of Freedom (df)
  \[ df_{Total} = N - 1 \quad | \quad df_{Between} = k - 1 \quad | \quad df_{Within} = N - k \]
  with:
  - \( N \) = total number of observations
  - \( k \) = number of groups
- Mean Squares (MS)
  \[ MS_{Between} = \frac{SS_{Between}}{df_{Between}} \quad | \quad MS_{Within} = \frac{SS_{Within}}{df_{Within}} \]
- F-Statistic
  \[ F = \frac{MS_{Between}}{MS_{Within}} \]
  Interpretation: A higher F-value indicates greater evidence against the null hypothesis. The p-value is derived from the F-distribution and determines the statistical significance of the results.
Constructing the ANOVA Table
| Source of Variation | Sum of Squares (SS) | Degrees of Freedom (df) | Mean Square (MS) | F-Statistic | p-Value |
|---|---|---|---|---|---|
| Between Groups | \( SS_{Between} \) | \( k - 1 \) | \( MS_{Between} \) | \( F \) | \( p \) |
| Within Groups | \( SS_{Within} \) | \( N - k \) | \( MS_{Within} \) | | |
| Total | \( SS_{Total} \) | \( N - 1 \) | | | |
Interpretation of Results: As we have seen before with the t-test and F-test, if the p-value is smaller than \( \alpha \) (commonly 0.05), we reject \( H_0 \). If \( p > \alpha \), we fail to reject \( H_0 \).
If the ANOVA is significant, determine which specific groups differ using post-hoc tests like the Tukey HSD test to control for multiple comparisons.
Example One-Way ANOVA in Production Scenario
Let’s consider a realistic production scenario. A factory uses three different machines (Machine A, Machine B, and Machine C) to assemble electronic components. The production manager wants to know if the machine type affects the assembly time.
The production times (in minutes) for 5 widgets from each machine:

```python
# Production time in minutes
A = [12, 14, 13, 15, 14]  # Machine A
B = [16, 18, 17, 19, 18]  # Machine B
C = [11, 10, 12, 11, 10]  # Machine C
```
Manual Calculation
- Calculate Group Means:

```python
import numpy as np

k = 3  # number of groups
Y_mean_A = np.mean(A)
Y_mean_B = np.mean(B)
Y_mean_C = np.mean(C)
# Averaging the group means gives the grand mean here because all groups have equal size
Y_mean = (Y_mean_A + Y_mean_B + Y_mean_C) / k
print(f"Mean of Machine A: {Y_mean_A} minutes")
print(f"Mean of Machine B: {Y_mean_B} minutes")
print(f"Mean of Machine C: {Y_mean_C} minutes")
print(f"Mean of Production Time: {Y_mean} minutes")
```
- Sum of Squares (SS):

```python
# Calculate Sum of Squares Between Groups
SSB = (len(A) * (Y_mean_A - Y_mean) ** 2
       + len(B) * (Y_mean_B - Y_mean) ** 2
       + len(C) * (Y_mean_C - Y_mean) ** 2)
print(f"Sum of Squares Between Groups: {SSB}")

# Calculate Sum of Squares Within Groups
SSW = (sum((np.array(A) - Y_mean_A) ** 2)
       + sum((np.array(B) - Y_mean_B) ** 2)
       + sum((np.array(C) - Y_mean_C) ** 2))
print(f"Sum of Squares Within Groups: {SSW}")

# Calculate Sum of Squares Total
SST = SSB + SSW
print(f"Sum of Squares Total: {SST}")
```
- Degrees of Freedom:
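A short sketch of this step; the group data are restated so the snippet runs on its own:

```python
# Production times from the example above
A = [12, 14, 13, 15, 14]
B = [16, 18, 17, 19, 18]
C = [11, 10, 12, 11, 10]

N = len(A) + len(B) + len(C)  # total number of observations (15)
k = 3                         # number of groups

df_between = k - 1  # 2
df_within = N - k   # 12
df_total = N - 1    # 14
print(df_between, df_within, df_total)  # 2 12 14
```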
- Mean Squares:
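A sketch of the mean-squares step, recomputing the sums of squares so it runs standalone:

```python
import numpy as np

# Production times from the example above
A = np.array([12, 14, 13, 15, 14])
B = np.array([16, 18, 17, 19, 18])
C = np.array([11, 10, 12, 11, 10])

grand_mean = np.concatenate([A, B, C]).mean()
SSB = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in (A, B, C))  # 116.8
SSW = sum(((g - g.mean()) ** 2).sum() for g in (A, B, C))            # 13.2

MSB = SSB / 2   # MS_Between = SS_Between / df_Between (k - 1 = 2)
MSW = SSW / 12  # MS_Within = SS_Within / df_Within (N - k = 12)
print(MSB, MSW)  # approximately 58.4 and 1.1
```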
- F-Statistic:
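The resulting F-statistic, with the intermediate quantities recomputed so the snippet is self-contained:

```python
import numpy as np

# Production times from the example above
A = np.array([12, 14, 13, 15, 14])
B = np.array([16, 18, 17, 19, 18])
C = np.array([11, 10, 12, 11, 10])

grand_mean = np.concatenate([A, B, C]).mean()
MSB = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in (A, B, C)) / 2  # MS Between
MSW = sum(((g - g.mean()) ** 2).sum() for g in (A, B, C)) / 12           # MS Within

F = MSB / MSW
print(f"F-statistic: {F:.2f}")  # approximately 53.09
```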
- Determine p-Value: Using an F-distribution table or statistical software with \( df_1 = df_{Between} = 2 \) and \( df_2 = df_{Within} = 12 \), an F-value of 53.1 is highly significant (p < 0.001).
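Instead of a table, the p-value can be read off the F-distribution's survival function in `scipy.stats`; a sketch:

```python
from scipy.stats import f

df_between, df_within = 2, 12  # degrees of freedom from the example
F = 53.09                      # F-statistic computed above

# p-value = P(F_dist >= F) under H0, i.e. the upper tail of the F-distribution
p_value = f.sf(F, df_between, df_within)
print(f"p-value: {p_value:.2e}")  # far below 0.001
```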
Automatic Calculation
To calculate the F-statistic and the p-value of the ANOVA automatically, we can use the `f_oneway` function from the `scipy.stats` library:
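A minimal sketch of this call, with the group data restated so the snippet runs on its own:

```python
from scipy.stats import f_oneway

# Production times from the example above
A = [12, 14, 13, 15, 14]  # Machine A
B = [16, 18, 17, 19, 18]  # Machine B
C = [11, 10, 12, 11, 10]  # Machine C

f_stat, p_value = f_oneway(A, B, C)
print(f"F-statistic: {f_stat:.2f}")  # matches the manual result of about 53.09
print(f"p-value: {p_value:.6f}")
```

The returned values match the manual calculation up to rounding.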
- Interpretation: Since the p-value is less than 0.05, we reject the null hypothesis. There are significant differences in production times among the three machines.
- Post-Hoc Analysis: To identify which specific machines differ, we can perform a Tukey HSD test, revealing that:
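One way to run this test is `scipy.stats.tukey_hsd` (available in recent SciPy versions); a sketch with the example data:

```python
from scipy.stats import tukey_hsd  # requires a recent SciPy version

A = [12, 14, 13, 15, 14]  # group 0: Machine A
B = [16, 18, 17, 19, 18]  # group 1: Machine B
C = [11, 10, 12, 11, 10]  # group 2: Machine C

result = tukey_hsd(A, B, C)
print(result)  # pairwise statistics, p-values and confidence intervals
```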
>>> Output
```
Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison  Statistic  p-value  Lower CI  Upper CI
 (0 - 1)      -4.000    0.000    -5.770    -2.230
 (0 - 2)       2.800    0.003     1.030     4.570
 (1 - 0)       4.000    0.000     2.230     5.770
 (1 - 2)       6.800    0.000     5.030     8.570
 (2 - 0)      -2.800    0.003    -4.570    -1.030
 (2 - 1)      -6.800    0.000    -8.570    -5.030
```
- Machine A vs. Machine B: Significant difference
- Machine A vs. Machine C: Significant difference
- Machine B vs. Machine C: Significant difference
- Conclusion: All three machines have significantly different production times, with Machine B (mean = 17.6 minutes) being the slowest and Machine C (mean = 10.8 minutes) being the fastest.
Assumptions of One-Way ANOVA
Before performing One-Way ANOVA, ensure that your data meet the following assumptions:
- Independence of Observations: The samples are independent of each other.
- Normality: The data in each group are approximately normally distributed.
- Homogeneity of Variances: The variance among the groups should be approximately equal.
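These assumptions can be checked in code, for example with the Shapiro-Wilk and Levene tests from `scipy.stats`; a sketch using the production data from the example above:

```python
from scipy.stats import shapiro, levene

A = [12, 14, 13, 15, 14]
B = [16, 18, 17, 19, 18]
C = [11, 10, 12, 11, 10]

# Normality per group (Shapiro-Wilk): p > 0.05 gives no evidence against normality
for name, group in (("A", A), ("B", B), ("C", C)):
    stat, p = shapiro(group)
    print(f"Machine {name}: W = {stat:.3f}, p = {p:.3f}")

# Homogeneity of variances (Levene): p > 0.05 suggests roughly equal variances
stat, p = levene(A, B, C)
print(f"Levene: statistic = {stat:.3f}, p = {p:.3f}")
```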
Task
Task: Student Performance
Download the following dataset and load it into your notebook using the Python package `openml`:
```python
# Dataset: https://www.openml.org/search?type=data&status=active&id=43098
import openml

dataset = openml.datasets.get_dataset(43098)
df, _, _, _ = dataset.get_data(dataset_format="dataframe", target=None)
df.head()
```
It contains the academic performance of 1000 students in different subjects. Answer the following questions using Python:
- Are the `math.score` results of different `race.ethnicity` groups significantly different (\( \alpha = 5\% \))?
- Are the `reading.score` results of different `lunch` groups significantly different?
For both questions, proceed as follows:
- Perform an ANOVA to answer both questions. Formulate an H0 and perform:
  - Manual calculation of the ANOVA
  - Automatic calculation of the ANOVA using `scipy.stats`
- Do both results match? Interpret the results.