Comparing Multiple Means with ANOVA

Author

Arvind V.

Published

March 28, 2023

Modified

May 21, 2024

Abstract
ANOVA to investigate how frogspawn hatching time varies with temperature.

Setting up R Packages

knitr::opts_chunk$set(tidy = TRUE)
library(tidyverse) # Tidy data processing 
library(ggformula) # Formula based plots
library(mosaic) # Data inspection and Statistical Inference 
library(broom) # Tidy outputs from Statistical Analyses
library(infer) # Statistical Inference
library(patchwork) # Arranging Plots

Introduction

Suppose we have three sales strategies on our website, to sell a certain product, say men’s shirts. We have observations of customer website interactions over several months. How do we know which strategy makes people buy the fastest ?

If there is a University course that is offered in parallel in three different classrooms, is there a difference between the average marks obtained by students in each of the classrooms?

In each case we have a set of observations in each category: Interaction Time vs Sales Strategy in the first example, and Student Marks vs Classroom in the second. We can take mean scores in each category and decide to compare them. How do we make the comparisons? One way would be to compare them pair-wise. But with this rapidly becomes intractable and also dangerous: with increasing number of groups, the number of mean-comparisons becomes very large \(N\choose 2\) and with each comparison the possibility of some difference showing up, just by chance, increases! And we end up making the wrong inference and perhaps the wrong decision.

The trick is of course to make comparisons all at once and ANOVA is the technique that allows us to do just that. In this tutorial, we will compare the Hatching Time of frog spawn1, at three different lab temperatures.

In this tutorial, our research question is:

Research Question

How does frogspawn hatching time vary with different temperature settings?

Workflow: Read the Data

Download the data by clicking the button below.

Data Folder

Save the CSV in a subfolder titled “data” inside your R work folder.

frogs_orig <- read_csv("data/frogs.csv")
frogs_orig

Our response variable is the hatching Time. Our explanatory variable is a factor, Temperature, with 3 levels: 13°C, 18°C and 25°C. Different samples of spawn were subject to each of these temperatures respectively.

Workflow: Clean the Data

The data is badly organized, with a separate column for each Temperature, and a common column for Sample ID. There are NA entries since not all samples of spawn can be subject to all temperatures. (E.g. Sample ID #1 was maintained at 13°C).

We will first stack up the Temperature** columns into a single column, separate that into pieces and then retain just the number part (13, 18, 25), getting rid of the word Temperature from the column titles. Then the remaining numerical column with temperatures (13, 18, 25) will be converted into a factor.

We will use pivot_longer()and separate_wider_regex() to achieve this. [See this animation for pivot_longer(): https://haswal.github.io/pivot/ ]

frogs_orig %>%
  pivot_longer(
    .,
    cols = starts_with("Temperature"),
    cols_vary = "fastest",
    # new in pivot_longer
    names_to = "Temp",
    values_to = "Time"
  ) %>%
  drop_na() %>%
  
  separate_wider_regex(
    cols = Temp,
  # knock off the unnecessary "Temperature" word everywhere
  # Just keep the digits thereafter
    patterns = c("Temperature",
                 TempFac = "\\d+"),
    cols_remove = TRUE
  ) %>%
  
  # Convert Temp into TempFac, a 3-level factor
  mutate(TempFac = factor(
    x = TempFac,
    levels = c(13, 18, 25),
    labels = c("13", "18", "25")
  )) %>%
  rename("Id" = `Frogspawn sample id`) -> frogs_long

frogs_long
frogs_long %>% count(TempFac)

So we have cleaned up our data and have 20 samples for Hatching Time per TempFac setting.

Workflow: EDA

Let us plot some histograms and boxplots of Hatching Time:

# Set graph theme
theme_set(new = theme_custom())
#

gf_histogram(
  data = frogs_long,
  ~ Time,
  fill = ~ TempFac,
  stat = "count",
  alpha = 0.5
) %>%
  gf_vline(xintercept = ~ mean(Time)) %>%
  gf_labs(x = "Hatching Time", y = "Count") %>%
  gf_text(7 ~ (mean(Time) + 2),
          label = "Overall Mean") %>%
  gf_refine(guides(fill = guide_legend(title = "Temperature level (°C)")))
###
gf_boxplot(data = frogs_long,
           Time ~ TempFac,
           fill = ~ TempFac,
           alpha = 0.5) %>%
  gf_vline(xintercept = ~ mean(Time)) %>%
  gf_labs(x = "Temperature", y = "Hatching Time",
          caption = "X-axis is reversed") %>%
  gf_refine(scale_x_discrete(limits = rev),
            guides(fill = guide_legend(title = "Temperature level (°C)"))) 

The histograms look well separated and the box plots also show very little overlap. So we can reasonably hypothesize that Temperature has a significant effect on Hatching Time.

One more slightly esoteric plot: Jitter/Scatter with a new categorical x-axis offered by the ggprism package:

# Set graph theme
theme_set(new = theme_custom())
#

library(ggprism)
gf_jitter(
  frogs_long,
  Time ~ TempFac,
  color = ~ TempFac,
  xlab = "Temperature as Factor",
  ylab = "Hatching Time",
  caption = "Using `ggprism` package"
) %>%
  gf_theme(theme_prism(base_family = theme_get()$font$family)) %>%
  gf_refine(theme(legend.position = "none"),
            scale_x_discrete(guide = "prism_bracket"))

Let’s go ahead with our ANOVA test.

Workflow: ANOVA

We will first execute the ANOVA test with code and evaluate the results. Then we will do an intuitive walk through of the process and finally, hand-calculate entire analysis for clear understanding.

Workflow: Checking ANOVA Assumptions

ANOVA makes 3 fundamental assumptions:

  1. Data (and errors) are normally distributed.
  2. Variances are equal.
  3. Observations are independent.

We can check these using checks and graphs.

Checks for Normality

The shapiro.wilk test tests if a vector of numeric data is normally distributed and rejects the hypothesis of normality when the p-value is less than or equal to 0.05. 

shapiro.test(x = frogs_long$Time)

    Shapiro-Wilk normality test

data:  frogs_long$Time
W = 0.92752, p-value = 0.001561

The p-value is very low and we cannot reject the (alternative) hypothesis that the overall data is not normal. How about normality at each level of the factor?

frogs_grouped <- frogs_long %>% 
  group_by(TempFac) %>% 
  nest(.key = "list") # naming the nested column "list"

# Checking if we can purrr
frogs_grouped %>% 
  purrr::pluck("list", 1) %>% 
  dplyr::select(Time) %>% 
  as_vector() %>% 
  shapiro.test(.)

    Shapiro-Wilk normality test

data:  .
W = 0.88954, p-value = 0.02638
# OK now we are set for group-wise Shapiro-Wilk testing with purrr:

frogs_grouped %>% 
  mutate(shaptest = 
           purrr::map(.x = list, # Column name is "list"
                      .f = \(.x) select(.data = .x, 
                                        Time) %>% 
                                 as_vector() %>% 
                                 shapiro.test(.)),
         
         params = 
           purrr::map(.x = shaptest,
                      .f = \(.x) broom::tidy(.x))) %>% 
  
  select(TempFac, params) %>% 
  unnest(cols = params)
#### Using `dplyr::group_modify()
frogs_long %>%
  group_by(TempFac) %>%
  group_modify( ~ .x %>%
                  select(Time) %>%
                  
                  as_vector() %>% shapiro.test() %>%
                  broom::tidy())

The shapiro.wilk test makes a NULL Hypothesis that the data are normally distributed and estimates the probability that this could have happened by chance. Except for TempFac = 18 the p-values are less than 0.05 and we can reject the NULL hypothesis that each of these is normally distributed. Perhaps this is a sign that we need more than 20 samples per factor level. Let there be more frogs !!!

We can also check the residuals post-model:

# Set graph theme
theme_set(new = theme_custom())
#

frogs_anova$residuals %>% 
  as_tibble() %>% 
  gf_dhistogram(~ value,data = .) %>% 
  gf_fitdistr() 
frogs_anova$residuals %>%
  as_tibble() %>% 
  gf_qq(~ value, data = .) %>% 
  gf_qqstep() %>% 
  gf_qqline() 
shapiro.test(frogs_anova$residuals)


    Shapiro-Wilk normality test

data:  frogs_anova$residuals
W = 0.94814, p-value = 0.01275

Unsurprisingly, the residuals are also not normally distributed either.

Check for Similar Variance

Response data with different variances at different levels of an explanatory variable are said to exhibit heteroscedasticity. This violates one of the assumptions of ANOVA.

To check if the Time readings are similar in variance across levels of TempFac, we can use the Levene Test, or since our per-group observations are not normally distributed, a non-parametric rank-based Fligner-Killeen Test. The NULL hypothesis is that the data are with similar variances. The tests assess how probable this is with the given data assuming this NULL hypothesis:

frogs_long %>% 
  group_by(TempFac) %>% 
  summarise(variance = var(Time))

# Not too different...OK on with the test
fligner.test(Time ~ TempFac, data = frogs_long)


DescTools::LeveneTest(Time ~ TempFac, data = frogs_long)

    Fligner-Killeen test of homogeneity of variances

data:  Time by TempFac
Fligner-Killeen:med chi-squared = 0.53898, df = 2, p-value = 0.7638

It seems that there is no cause for concern here; the data do not have significantly different variances.

Independent Observations

This is an experiment design concern; the way the data is gathered must be specified such that data for each level of the factors ( factor combinations if there are more than one) should be independent.

Workflow: Effect Size

The simplest way to find the actual effect sizes detected by an ANOVA test is to use (paradoxically) the summary.lm() command:

tidy_anova <- 
  frogs_anova %>% 
  summary.lm() %>% 
  broom::tidy()
tidy_anova

It may take a bit of effort to understand this. First the TempFac is arranged in order of levels, and the mean at the \(TempFac = 13\) is titled Intercept. That is \(26.3\). The other two means for levels \(18\) and \(25\) are stated as differences from this intercept, \(-5.3\) and \(-10.1\) respectively. The p.value for all these effect sizes is well below the desired confidence level of \(0.05\).

Standard Errors

Observe that the std.error for the intercept is \(0.257\) while that for TempFac18 and TempFac25 is \(0.257 \times \sqrt2 = 0.363\) since the latter are differences in means, while the former is a single mean. The Variance of a difference is the sum of the individual variances, which are equal here.

We can easily plot bar-chart with error bars for the effect size:

# Set graph theme
theme_set(new = theme_custom())
#
tidy_anova %>% 
  mutate(hi = estimate + std.error,
         lo = estimate - std.error) %>% 
  gf_hline(data = ., yintercept = 0, 
           colour ="grey", 
           linewidth = 2) %>% 
  gf_col(estimate ~ term, 
         fill = "grey", 
         color = "black",
         width = 0.15) %>% 
  gf_errorbar(hi + lo ~ term,
              color = "blue",
              width = 0.2) %>% 
  gf_point(estimate ~ term,
           color = "red", 
           size = 3.5) %>% 
  gf_refine(scale_x_discrete("Temp Values", 
                             labels = c("13°C", "18°C", "25°C")))

If we want an “absolute value” plot for effect size, it needs just a little bit of work:

# Merging group averages with `std.error`
# Set graph theme
theme_set(new = theme_custom())
#

frogs_long %>% 
  group_by(TempFac) %>% 
  summarise(mean = mean(Time)) %>% 
  cbind(std.error = tidy_anova$std.error) %>% 
  mutate(hi = mean + std.error,
         lo = mean - std.error) %>% 
  gf_hline(data = ., yintercept = 0, 
           colour ="grey", 
           linewidth = 2) %>% 
  gf_col(mean ~ TempFac, 
         fill = "grey", 
         color = "black", width = 0.15) %>% 
  gf_errorbar(hi + lo ~ TempFac,
                color = "blue",
                width =0.2) %>% 
  gf_point(mean ~ TempFac, 
           color = "red", 
           size = 3.5) %>% 
  gf_refine(scale_x_discrete("Temp Values", 
                             labels = c("13°C", "18°C", "25°C")))

In both graphs, note the difference in the error-bar heights.

The ANOVA test does not tell us that the “treatments” (i.e. levels of TempFac) are equally effective. We need to use a multiple comparison procedure to arrive at an answer to that question. We compute the pair-wise differences in effect-size:

frogs_anova %>% stats::TukeyHSD()
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Time ~ TempFac, data = frogs_long)

$TempFac
       diff        lwr       upr p adj
18-13  -5.3  -6.175224 -4.424776     0
25-13 -10.1 -10.975224 -9.224776     0
25-18  -4.8  -5.675224 -3.924776     0

We see that each of the pairwise differences in effect-size is significant, with p = 0 !

Using other packages

Workflow: ANOVA using Permutation Tests

We wish to establish the significance of the effect size due to each of the levels in TempFac. From the normality tests conducted earlier we see that except at one level of TempFac, the times are are not normally distributed. Hence we opt for a Permutation Test to check for significance of effect.

As remarked in Ernst8, the non-parametric permutation test can be both exact and also intuitively easier for students to grasp. Permutations are easily executed in R, using packages such as mosaic9.

We proceed with a Permutation Test for TempFac. We shuffle the levels (13, 18, 25) randomly between the Times and repeat the ANOVA test each time and calculate the F-statistic. The Null distribution is the distribution of the F-statistic over the many permutations and the p-value is given by the proportion of times the F-statistic equals or exceeds that observed.

We will use mosaic first, and also try with infer.

Conclusions

We have discussed ANOVA as a means of modelling the effects of a Categorical variable on a Continuous (Quant) variable. ANOVA can be carried out using the standard formula aov when assumptions on distributions, variances, and independence are met. Permutation ANOVA tests can be carried out when these assumptions do not quite hold.

Two-Way ANOVA

What if we have two Categorical variables as predictors? We then need to perform a Two-Way ANOVA analysis, where we look at the predictors individually (main effects) and together (interaction effects). Here too, we need to verify if the number of observations are balanced across all combinations of factors of the two Qualitative predictors. There are three different classical approaches (Type1, Type2 and Type3 ANOVA) for testing hypotheses in ANOVA for unbalanced designs, as they are called. (Langsrud 2003).

Informative Hypothesis Testing: Models which incorporate a priori Beliefs

Note that when we specified our research question, we had no specific hypothesis about the means, other than that they might be different. In many situations, we may have reason to believe in the relative “ordering” of the means for different levels of the Categorical variable. The one-sided t-test is the simplest example (e.g., \(\mu_1 >= 0\) and \(\mu_1 >= \mu_2\)); this readily extends to the multi-parameter setting, where more than one inequality constraint can be imposed on the parameters (e.g., \(\mu_1 <= \mu_2 <= \mu_3\).
It is possible to incorporate these beliefs into the ANOVA model, using what is called as informative hypothesis testing, which have certain advantages compared to unconstrained models. The R package called restriktor has the capability to develop such models with beliefs.

References

  1. The ANOVA tutorial at Our Coding Club
  2. Antoine Soetewey. How to: one-way ANOVA by hand. https://statsandr.com/blog/how-to-one-way-anova-by-hand/
  3. ANOVA in R - Stats and R https://statsandr.com/blog/anova-in-r/
  4. Michael Crawley.(2013) The R Book,second edition. Chapter 11.
  5. David C Howell, Permutation Tests for Factorial ANOVA Designs
  6. Marti Anderson, Permutation tests for univariate or multivariate analysis of variance and regression
  7. Judd, Charles M., Gary H. McClelland, and Carey S. Ryan.(2017). “Introduction to Data Analysis.” In, 1–9. Routledge. https://doi.org/10.4324/9781315744131-1.
  8. Patil, I. (2021). Visualizations with statistical details: The ‘ggstatsplot’ approach. Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
  9. Langsrud, Øyvind. (2003). ANOVA for unbalanced data: Use type II instead of type III sums of squares. Statistics and Computing. 13. 163-167. https://doi.org/10.1023/A:1023260610025. https://www.researchgate.net/publication/220286726_ANOVA_for_unbalanced_data_Use_type_II_instead_of_type_III_sums_of_squares
  10. Kim TK. (2017). Understanding one-way ANOVA using conceptual figures. Korean J Anesthesiol. 2017 Feb;70(1):22-26. doi: 10.4097/kjae.2017.70.1.22. Epub 2017 Jan 26. PMID: 28184262; PMCID: PMC5296382.
  11. Anova – Type I/II/III SS explained.https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
R Package Citations
Package Version Citation
DescTools 0.99.54 Signorell (2024)
ggprism 1.0.5 Dawson (2024)
ggstatsplot 0.12.3 Patil (2021)
ggtext 0.1.2 Wilke and Wiernik (2022)
restriktor 0.5.30 Vanbrabant and Kuiper (2023)
supernova 3.0.0 Blake et al. (2024)
Blake, Adam, Jeff Chrabaszcz, Ji Son, and Jim Stigler. 2024. supernova: Judd, McClelland, & Ryan Formatting for ANOVA Output. https://CRAN.R-project.org/package=supernova.
Dawson, Charlotte. 2024. ggprism: A ggplot2 Extension Inspired by GraphPad Prism. https://CRAN.R-project.org/package=ggprism.
Langsrud, Øyvind. 2003. Statistics and Computing 13 (2): 163–67. https://doi.org/10.1023/a:1023260610025.
Patil, Indrajeet. 2021. Visualizations with statistical details: The ggstatsplot approach.” Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.
Signorell, Andri. 2024. DescTools: Tools for Descriptive Statistics. https://CRAN.R-project.org/package=DescTools.
Vanbrabant, Leonard, and Rebecca Kuiper. 2023. restriktor: Restricted Statistical Estimation and Inference for Linear Models. https://CRAN.R-project.org/package=restriktor.
Wilke, Claus O., and Brenton M. Wiernik. 2022. ggtext: Improved Text Rendering Support for ggplot2. https://CRAN.R-project.org/package=ggtext.
Back to top

Footnotes

  1. The ANOVA tutorial at Our Coding Club.↩︎

  2. Spinoza: Ethics Geometrically Demonstrated: spinoza1665.pdf (earlymoderntexts.com)↩︎

  3. https://www.openintro.org/go/?id=anova-supplement&referrer=/book/ahss/index.php↩︎

  4. Pruim R, Kaplan DT, Horton NJ (2017). “The mosaic Package: Helping Students to ‘Think with Data’ Using R.” The R Journal, 9(1), 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.↩︎

  5. mosaic::xpf() gives both a graph and the probabilities.↩︎

  6. ggplot2 Based Plots with Statistical Details • ggstatsplot https://indrajeetpatil.github.io/ggstatsplot/↩︎

  7. https://github.com/UCLATALL/supernova↩︎

  8. Ernst, Michael D. 2004. “Permutation Methods: A Basis for Exact Inference.” Statistical Science 19 (4): 676–85. doi:10.1214/088342304000000396.↩︎

  9. Pruim R, Kaplan DT, Horton NJ (2017). “The mosaic Package: Helping Students to ‘Think with Data’ Using R.” The R Journal, 9(1), 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.↩︎

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {Comparing {Multiple} {Means} with {ANOVA}},
  date = {2023-03-28},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/130-ThreeMeansOrMore/ANOVA.html},
  langid = {en},
  abstract = {ANOVA to investigate how frogspawn hatching time varies
    with temperature.}
}
For attribution, please cite this work as:
V., Arvind. 2023. “Comparing Multiple Means with ANOVA.” March 28, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/130-ThreeMeansOrMore/ANOVA.html.