Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors
  1. Teaching
  2. Data Viz and Analytics
  3. Statistical Inference
  4. Comparing Multiple Means with ANOVA
  • Teaching
    • Data Viz and Analytics
      • Tools
        • Introduction to R and RStudio
        • Introduction to Radiant
        • Introduction to Orange
      • Descriptive Analytics
        • Data
        • Graphs
        • Summaries
        • Counts
        • Quantities
        • Groups
        • Densities
        • Groups and Densities
        • Change
        • Proportions
        • Parts of a Whole
        • Evolution and Flow
        • Ratings and Rankings
        • Surveys
        • Time
        • Space
        • Networks
        • Experiments
        • Miscellaneous Graphing Tools, and References
      • Statistical Inference
        • 🧭 Basics of Statistical Inference
        • 🎲 Samples, Populations, Statistics and Inference
        • Basics of Randomization Tests
        • 🃏 Inference for a Single Mean
        • 🃏 Inference for Two Independent Means
        • 🃏 Inference for Comparing Two Paired Means
        • Comparing Multiple Means with ANOVA
        • Inference for Correlation
        • 🃏 Testing a Single Proportion
        • 🃏 Inference Test for Two Proportions
      • Inferential Modelling
        • Modelling with Linear Regression
        • Modelling with Logistic Regression
        • 🕔 Modelling and Predicting Time Series
      • Predictive Modelling
        • 🐉 Intro to Orange
        • ML - Regression
        • ML - Classification
        • ML - Clustering
      • Prescriptive Modelling
        • 📐 Intro to Linear Programming
        • 💭 The Simplex Method - Intuitively
        • 📅 The Simplex Method - In Excel
      • Workflow
        • Facing the Abyss
        • I Publish, therefore I Am
      • Using AI in Analytics
        • Case Studies
          • Demo:Product Packaging and Elderly People
          • Ikea Furniture
          • Movie Profits
          • Gender at the Work Place
          • Heptathlon
          • School Scores
          • Children's Games
          • Valentine’s Day Spending
          • Women Live Longer?
          • Hearing Loss in Children
          • California Transit Payments
          • Seaweed Nutrients
          • Coffee Flavours
          • Legionnaire’s Disease in the USA
          • Antarctic Sea ice
          • William Farr's Observations on Cholera in London
      • TRIZ for Problem Solvers
        • I am Water
        • I am What I yam
        • Birds of Different Feathers
        • I Connect therefore I am
        • I Think, Therefore I am
        • The Art of Parallel Thinking
        • A Year of Metaphoric Thinking
        • TRIZ - Problems and Contradictions
        • TRIZ - The Unreasonable Effectiveness of Available Resources
        • TRIZ - The Ideal Final Result
        • TRIZ - A Contradictory Language
        • TRIZ - The Contradiction Matrix Workflow
        • TRIZ - The Laws of Evolution
        • TRIZ - Substance Field Analysis, and ARIZ
      • Math Models for Creative Coders
        • Maths Basics
          • Vectors
          • Matrix Algebra Whirlwind Tour
          • content/courses/MathModelsDesign/Modules/05-Maths/70-MultiDimensionGeometry/index.qmd
        • Tech
          • Tools and Installation
          • Adding Libraries to p5.js
          • Using Constructor Objects in p5.js
        • Geometry
          • Circles
          • Complex Numbers
          • Fractals
          • Affine Transformation Fractals
          • L-Systems
          • Kolams and Lusona
        • Media
          • Fourier Series
          • Additive Sound Synthesis
          • Making Noise Predictably
          • The Karplus-Strong Guitar Algorithm
        • AI
          • Working with Neural Nets
          • The Perceptron
          • The Multilayer Perceptron
          • MLPs and Backpropagation
          • Gradient Descent
        • Projects
          • Projects
      • Tech for Creative Education
        • 🧭 Using Idyll
        • 🧭 Using Apparatus
        • 🧭 Using g9.js
      • Literary Jukebox: In Short, the World
        • Italy - Dino Buzzati
        • France - Guy de Maupassant
        • Japan - Hisaye Yamamoto
        • Peru - Ventura Garcia Calderon
        • Russia - Maxim Gorky
        • Egypt - Alifa Rifaat
        • Brazil - Clarice Lispector
        • England - V S Pritchett
        • Russia - Ivan Bunin
        • Czechia - Milan Kundera
        • Sweden - Lars Gustaffsson
        • Canada - John Cheever
        • Ireland - William Trevor
        • USA - Raymond Carver
        • Italy - Primo Levi
        • India - Ruth Prawer Jhabvala
        • USA - Carson McCullers
        • Zimbabwe - Petina Gappah
        • India - Bharati Mukherjee
        • USA - Lucia Berlin
        • USA - Grace Paley
        • England - Angela Carter
        • USA - Kurt Vonnegut
        • Spain-Merce Rodoreda
        • Israel - Ruth Calderon
        • Israel - Etgar Keret
    • Posts
    • Blogs and Talks

    On this page

    • Setting up R Packages
    • Introduction
    • Workflow: Read the Data
    • Workflow: Clean the Data
    • Workflow: EDA
    • Workflow: ANOVA
    • Stating the Model
    • Workflow: Checking ANOVA Assumptions
      • Checks for Normality
      • Check for Similar Variance
      • Independent Observations
    • Workflow: Effect Size
      • Using other packages
    • Workflow: ANOVA using Permutation Tests
    • Wait, But Why?
    • Conclusions
    • Your Turn
    • References
    1. Teaching
    2. Data Viz and Analytics
    3. Statistical Inference
    4. Comparing Multiple Means with ANOVA

    Comparing Multiple Means with ANOVA

    Published

    March 28, 2023

    Modified

    June 11, 2025

    Abstract
    ANOVA to investigate how frogspawn hatching time varies with temperature.

    Setting up R Packages

    library(tidyverse) # Tidy data processing
    library(ggformula) # Formula based plots
    library(mosaic) # Data inspection and Statistical Inference
    library(broom) # Tidy outputs from Statistical Analyses
    library(infer) # Statistical Inference, Permutation/Bootstrap
    library(patchwork) # Arranging Plots
    library(ggprism) # Interesting Categorical Axes
    library(supernova) # Beginner-Friendly ANOVA Tables

    Plot Theme

    Show the Code
    # https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto
    
    # Chunk options
    knitr::opts_chunk$set(
      fig.width = 7,
      fig.asp = 0.618, # Golden Ratio
      # out.width = "80%",
      fig.align = "center"
    )
    ### Ggplot Theme
    ### https://rpubs.com/mclaire19/ggplot2-custom-themes
    
    theme_custom <- function() {
      font <- "Roboto Condensed" # assign font family up front
    
      theme_classic(base_size = 14) %+replace% # replace elements we want to change
    
        theme(
          panel.grid.minor = element_blank(), # strip minor gridlines
          text = element_text(family = font),
          # text elements
          plot.title = element_text( # title
            family = font, # set font family
            # size = 20,               #set font size
            face = "bold", # bold typeface
            hjust = 0, # left align
            # vjust = 2                #raise slightly
            margin = margin(0, 0, 10, 0)
          ),
          plot.subtitle = element_text( # subtitle
            family = font, # font family
            # size = 14,                #font size
            hjust = 0,
            margin = margin(2, 0, 5, 0)
          ),
          plot.caption = element_text( # caption
            family = font, # font family
            size = 8, # font size
            hjust = 1
          ), # right align
    
          axis.title = element_text( # axis titles
            family = font, # font family
            size = 10 # font size
          ),
          axis.text = element_text( # axis text
            family = font, # axis family
            size = 8
          ) # font size
        )
    }
    
    # Set graph theme
    theme_set(new = theme_custom())
    #

    Introduction

    Suppose we have three sales strategies on our website, to sell a certain product, say men’s shirts. We have observations of customer website interactions over several months. How do we know which strategy makes people buy the fastest ?

    If there is a University course that is offered in parallel in three different classrooms, is there a difference between the average marks obtained by students in each of the classrooms?

    In each case we have a set of Quant observations in each Qual category: Interaction Time vs Sales Strategy in the first example, and Student Marks vs Classroom in the second. We can take mean scores in each category and decide to compare them. How do we make the comparisons? One way would be to compare them pair-wise, doing as many t-tests as there are pairs. But with this rapidly becomes intractable and also dangerous: with increasing number of groups, the number of mean-comparisons becomes very large (N2) and with each comparison the possibility of some difference showing up, just by chance, increases! And we end up making the wrong inference and perhaps the wrong decision. The trick is of course to make comparisons all at once and ANOVA is the technique that allows us to do just that.

    In this tutorial, we will compare the Hatching Time of frog spawn1, at three different lab temperatures.

    In this tutorial, our research question is:

    NoteResearch Question

    Based on the sample dataset at hand, how does frogspawn hatching time vary with different temperature settings?

    Workflow: Read the Data

    Download the data by clicking the button below.

    ImportantData Folder

    Save the CSV in a subfolder titled “data” inside your R work folder.

    frogs_orig <- read_csv("data/frogs.csv")
    frogs_orig
    ABCDEFGHIJ0123456789
    Frogspawn sample id
    <dbl>
    Temperature13
    <dbl>
    Temperature18
    <dbl>
    Temperature25
    <dbl>
    124NANA
    2NA21NA
    3NANA18
    426NANA
    5NA22NA
    6NANA14
    727NANA
    8NA22NA
    9NANA15
    1027NANA
    Next
    123456
    Previous
    1-10 of 60 rows

    Our response variable is the hatching Time. Our explanatory variable is a factor, Temperature, with 3 levels: 13°C, 18°C and 25°C. Different samples of spawn were subject to each of these temperatures respectively.

    Workflow: Clean the Data

    The data is in wide-format, with a separate column for each Temperature, and a common column for Sample ID. This is good for humans, but poor for a computer: there are NA entries since not all samples of spawn can be subject to all temperatures. (E.g. Sample ID #1 was maintained at 13°C, and there are NAs in the other two columns, which we don’t need).

    We will first stack up the Temperature columns into a single column, separate that into pieces and then retain just the number part (13, 18, 25), getting rid of the word Temperature from the column titles. Then the remaining numerical column with temperatures (13, 18, 25) will be converted into a factor.

    We will use pivot_longer()and separate_wider_regex() to achieve this. [See this animation for pivot_longer(): https://haswal.github.io/pivot/ ]

    frogs_orig %>%
      pivot_longer(
        .,
        cols = starts_with("Temperature"),
        cols_vary = "fastest",
        # new in pivot_longer
        names_to = "Temp",
        values_to = "Time"
      ) %>%
      drop_na() %>%
      ##
      separate_wider_regex(
        cols = Temp,
        # knock off the unnecessary "Temperature" word
        # Just keep the digits thereafter
        patterns = c("Temperature", TempFac = "\\d+"),
        cols_remove = TRUE
      ) %>%
      # Convert Temp into TempFac, a 3-level factor
      mutate(TempFac = factor(
        x = TempFac,
        levels = c(13, 18, 25),
        labels = c("13", "18", "25")
      )) %>%
      rename("Id" = `Frogspawn sample id`) -> frogs_long
    frogs_long
    ##
    frogs_long %>% count(TempFac)
    ABCDEFGHIJ0123456789
    Id
    <dbl>
    TempFac
    <fct>
    Time
    <dbl>
    11324
    21821
    32518
    41326
    51822
    62514
    71327
    81822
    92515
    101327
    Next
    1234
    ...
    6
    Previous
    1-10 of 60 rows
    ABCDEFGHIJ0123456789
    TempFac
    <fct>
    n
    <int>
    1320
    1820
    2520
    3 rows

    So we have cleaned up our data and have 20 samples for Hatching Time per TempFac setting.

    Workflow: EDA

    Let us plot some histograms and boxplots of Hatching Time:

    # Set graph theme
    theme_set(new = theme_custom())
    ##
    gf_histogram(~Time,
      fill = ~TempFac,
      data = frogs_long, alpha = 0.5
    ) %>%
      gf_vline(xintercept = ~ mean(Time)) %>%
      gf_labs(
        title = "Histograms of Hatching Time Distributions vs Temperature",
        x = "Hatching Time", y = "Count"
      ) %>%
      gf_text(7 ~ (mean(Time) + 2),
        label = "Overall Mean"
      ) %>%
      gf_refine(guides(fill = guide_legend(title = "Temperature level (°C)")))

    # Set graph theme
    theme_set(new = theme_custom())
    ##
    gf_boxplot(
      data = frogs_long,
      Time ~ TempFac,
      fill = ~TempFac,
      alpha = 0.5
    ) %>%
      gf_vline(xintercept = ~ mean(Time)) %>%
      gf_labs(
        title = "Boxplots of Hatching Time Distributions vs Temperature",
        x = "Temperature", y = "Hatching Time",
        caption = "Using ggprism"
      ) %>%
      gf_refine(
        scale_x_discrete(guide = "prism_bracket"),
        guides(fill = guide_legend(title = "Temperature level (°C)"))
      )

    The histograms look well separated and the box plots also show very little overlap. So we can reasonably hypothesize that Temperature has a significant effect on Hatching Time.

    Let’s go ahead with our ANOVA test.

    Workflow: ANOVA

    We will first execute the ANOVA test with code and evaluate the results. Then we will do an intuitive walkthrough of the process and finally, hand-calculate entire analysis for clear understanding. For now, a little faith!

    • Code
    • Intuitive
    • Frogs Demonstrated

    R offers a very simple command aov to execute an ANOVA test: Note the familiar formula of stating the variables:

    frogs_anova <- aov(Time ~ TempFac, data = frogs_long)

    This creates an ANOVA model object, called frogs_anova. We can examine the ANOVA model object best with a package called supernova2:

    # library(supernova)
    # Set graph theme
    theme_set(new = theme_custom())
    #
    supernova::pairwise(frogs_anova,
      correction = "Bonferroni", # Try "Tukey"
      alpha = 0.05, # 95% CI calculation
      var_equal = TRUE, # We'll see
      plot = TRUE
    )

    
      group_1 group_2    diff pooled_se       t    df   lower  upper p_adj
      <chr>   <chr>     <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl> <dbl>
    1 18      13       -5.300     0.257 -20.608    57  -5.861 -4.739 .0000
    2 25      13      -10.100     0.257 -39.272    57 -10.661 -9.539 .0000
    3 25      18       -4.800     0.257 -18.664    57  -5.361 -4.239 .0000

    This table + error-bar plot gives us a clear comparison between each pair of the three groups of observations defined by TempFac. The differences in spawn hatching Time between each pair of TempFac settings are given by the diff column. Also shown are the confidence intervals for each of these differences (none of which include 0); the p-values for each of these differences is also negligible. Thus we can conclude that the effect of temperature on hatching time is significant.

    Note

    To find which specific value of TempFac has the most effect will require pairwise comparison of the group means, using a standard t-test. The confidence level for such repeated comparisons will need what is called Bonferroni correction3 to prevent us from detecting a significant (pair-wise) difference simply by chance. To do this we take α=0.05, the confidence level used and divide it by K, the number of pair-wise comparisons we intend to make. This new value is used to decide on the significance of the estimated parameter. So the pairwise comparisons in our current data will have to use α/3=0.0166 as the confidence level. The supernova::pairwise() function did this for us very neatly!

    There are also other ways, such as the “Tukey correction” for multiple tests.

    All that is very well, but what is happening under the hood of the aov() command?

    Consider a data set with a single Quant and a single Qual variable. The Qual variable has two levels, the Quant data has 20 observations per Qual level.

    ABCDEFGHIJ0123456789
    index
    <int>
    qual
    <chr>
    quant
    <dbl>
    1A2.7419169
    2A-1.1293963
    3A0.7262568
    4A1.2657252
    5A0.8085366
    6A-0.2122490
    7A3.0230440
    8A-0.1893181
    9A4.0368474
    10A-0.1254282
    Next
    1234
    Previous
    1-10 of 40 rows

    All Data: In Fig A, the horizontal black line is the overall mean of quant, denoted as μtot. The vertical black lines to the points show the departures of each point from this overall mean. The sum of squares of these vertical black lines in Fig A is called the Total Sum of Squares (SST).

    (1)SST=Σ(y−μtot)2

    Grouped Data: In Fig B, the horizontal green and red lines are the means of the individual groups, respectively μA and μB. The green and red vertical lines are the departures, or errors, of each point from its own group-mean. The sum of the squares of the green and red lines is called the Total Error Sum of Squares (SSE).

    (2)SSE=Σ[(y−μA)2]+Σ(y−μB)2]

    Improvement: We take the difference in the squared error sums:

    (3)SSA=SST−SSE

    SSA is called the Treatment Sum of Squares, the “improvement” in going from believing in one mean to believing in two.

    Improvement Ratio: SSA/SSE might now help us decide whether two means are better than one.

    Let us compute these numbers for our toy dataset:

    demo_anova <- aov(quant ~ qual, data = toydata)
    supernova::supernova(demo_anova)
     Analysis of Variance Table (Type III SS)
     Model: quant ~ qual
    
                                   SS df      MS       F   PRE     p
     ----- --------------- | -------- -- ------- ------- ----- -----
     Model (error reduced) |  823.407  1 823.407 139.356 .7857 .0000
     Error (from model)    |  224.529 38   5.909                    
     ----- --------------- | -------- -- ------- ------- ----- -----
     Total (empty model)   | 1047.935 39  26.870                    

    What do we see?

    1. All Data: SST=1047.935.
    2. Grouped Data: SSE=224.529.
    3. Improvement: SSA=SST−SSE = 823.407.
    4. Improvement Ratio: Before we set up this ratio, we must realize that each of these measures uses a different number of observations! So the comparison is done after scaling each of SSA and SSE by the number of observations influencing them. (a sort of per capita, or average, squared error, an idea we saw when we defined Standard Errors): Fstat=SSA/dfSSASSE/dfSSE, where dfSSA=1 and dfSSE=38 are respectively the degrees of freedom in SSA and SSE.
    5. Large Enough Ratio?: The value of the F-statistic from the table above is 823.4075.909=139.356. Is this ratio big enough? F-statistic is compared with a critical value of the F-critical to help us decide. (Here, it is.)
    6. Belief: So we now believe in the idea of two means.
    7. Back to Mean Differences: Finally, in order to find which of the means is significantly different from others (if there are more than two!), we need to make a pair-wise comparison of the means, applying the Bonferroni correction as stated before. This means we divide the critical p.value we expect by the number of comparisons we make between levels of the Qual variable. supernova did this for us in the error-bar plot above.
    ImportantWhy “ANOVA”?

    When divide each of SSA and SSE by their degrees of freedom, this gives us a ratio of variances, the F-statistic. And so we are in effect deciding if means are significantly different by analyzing (a ratio of) variances! Hence the name, AN-alysis O-f VA-riance, ANOVA.

    So this may seem like a great Hero’s Journey, where we start with means and differences, go into sums of squares, differences and comparisons of error ratios, and return to the means where we started, only to know them properly now.

    Now that we understand what aov() is doing, let us hand-calculate the numbers for our frogs dataset and check. Let us visualize our calculations first.

    SST

    SST

    SSE

    SSE

    Let us get the ready table from supernova first, and then systematically calculate all numbers with understanding:

    supernova::supernova(frogs_anova)
     Analysis of Variance Table (Type III SS)
     Model: Time ~ TempFac
    
                                   SS df      MS       F   PRE     p
     ----- --------------- | -------- -- ------- ------- ----- -----
     Model (error reduced) | 1020.933  2 510.467 385.897 .9312 .0000
     Error (from model)    |   75.400 57   1.323                    
     ----- --------------- | -------- -- ------- ------- ----- -----
     Total (empty model)   | 1096.333 59  18.582                    

    Here are the SST, SSE, and the SSA:

    # Calculate overall sum squares SST
    frogs_overall <- frogs_long %>%
      summarise(
        overall_mean_time = mean(Time),
        # Overall mean across all readings
        # The Black Line
    
        SST = sum((Time - overall_mean_time)^2),
        n = n()
      ) # Always do this with `summarise`
    
    frogs_overall
    ABCDEFGHIJ0123456789
    overall_mean_time
    <dbl>
    SST
    <dbl>
    n
    <int>
    21.166671096.33360
    1 row
    ##
    SST <- frogs_overall$SST
    SST
    [1] 1096.333
    # Calculate sums of square errors *within* each group
    # with respect to individual group means
    frogs_within_groups <- frogs_long %>%
      group_by(TempFac) %>%
      summarise(
        grouped_mean_time = mean(Time), # The Coloured Lines
        grouped_variance_time = var(Time),
        group_error_squares = sum((Time - grouped_mean_time)^2),
        n = n()
      )
    frogs_within_groups
    ##
    frogs_SSE <- frogs_within_groups %>%
      summarise(SSE = sum(group_error_squares))
    ##
    SSE <- frogs_SSE$SSE
    SSE
    ABCDEFGHIJ0123456789
    TempFac
    <fct>
    grouped_mean_time
    <dbl>
    grouped_variance_time
    <dbl>
    group_error_squares
    <dbl>
    n
    <int>
    1326.31.27368424.220
    1821.01.26315824.020
    2516.21.43157927.220
    3 rows
    [1] 75.4
    SST
    SSE
    SSA <- SST - SSE
    SSA
    [1] 1096.333
    [1] 75.4
    [1] 1020.933

    We have SST=1096, SSE=75.4 and therefore SSA=1020.9.

    In order to calculate the F-Statistic, we need to compute the variances, using these sum of squares. We obtain variances by dividing by their Degrees of Freedom:

    Fstat=SSA/dfSSASSE/dfSSE

    where dfSSA and dfSSE are respectively the degrees of freedom in SSA and SSE.

    Let us calculate these Degrees of Freedom.

    With k=3 levels in the factor TempFac, and n=20 points per level, SST clearly has degree of freedom kn−1=3∗20 −1=59, since it uses all observations but loses one degree to calculate the global mean. (If each level did not have the same number of points n, we simply take all observations less one as the degrees of freedom for SST).

    SSE has k∗(n−1)=3∗(20−1)=57 as degrees of freedom, since each of the k groups there are n observations and each group loses one degree to calculate its own group mean.

    And therefore SSA, being their difference, has kn−1−k∗(n−1)=k−1=2 degrees of freedom.

    These are, of course, as shown in the df column in the supernova tabel above. We can still calculate these in R, for the sake of method and clarity (and pedantry):

    # Error Sum of Squares SSE
    df_SSE <- frogs_long %>%
      # Takes into account "unbalanced" situations
      # Where groups are not equal in size
      group_by(TempFac) %>%
      summarise(per_group_df_SSE = n() - 1) %>%
      summarise(df_SSE = sum(per_group_df_SSE)) %>%
      as.numeric()
    
    
    ## Overall Sum of Squares SST
    df_SST <- frogs_long %>%
      summarise(df_SST = n() - 1) %>%
      as.integer()
    
    
    # Treatment Sum of Squares SSA
    k <- length(unique(frogs_long$TempFac))
    df_SSA <- k - 1

    The degrees of freedom for the quantities are:

    df_SST
    df_SSE
    df_SSA
    [1] 59
    [1] 57
    [1] 2

    Now we are ready to compute the F-statistic: dividing each sum-of-squares byt its degrees of freedom gives us variances which we will compare, using the F-statistic as a ratio:

    # Finally F_Stat!
    # Combine the sum-square_error for each level of the factor
    # Weighted by degrees of freedom **per level**
    # Which are of course equal here ;-D
    
    MSE <- frogs_within_groups %>%
      summarise(mean_square_error = sum(group_error_squares / df_SSE)) %>%
      as.numeric()
    MSE
    [1] 1.322807
    ##
    MSA <- SSA / df_SSA # This is OK
    MSA
    [1] 510.4667
    ##
    F_stat <- MSA / MSE
    F_stat
    [1] 385.8966

    The F-stat is compared with a critical value of the F-statistic, F_crit which is computed using the formula for the f-distribution in R. As with our hypothesis tests, we set the significance level to α=0.95, but here with the Bonferroni correction, and quote the two relevant degrees of freedom as parameters to qf() which computes the critical F value F_critical as a quartile:

    F_crit <-
      qf(
        p = (1 - 0.05 / 3), # Significance level is 5% + Bonferroni Correction
        df1 = df_SSA, # Numerator degrees of freedom
        df2 = df_SSE # Denominator degrees of freedom
      )
    F_crit
    F_stat
    [1] 4.403048
    [1] 385.8966

    The F_crit value can also be seen in a plot4,5:

    # Set graph theme
    theme_set(new = theme_custom())
    #
    mosaic::xpf(
      q = F_crit,
      df1 = df_SSA, df2 = df_SSE, method = "gg",
      log.p = FALSE, lower.tail = TRUE,
      return = "plot"
    ) %>%
      gf_vline(xintercept = F_crit) %>%
      gf_label(0.75 ~ 5,
        label = "F_critical",
        inherit = F, show.legend = F
      ) %>%
      gf_labs(
        title = "F distribution for Frogs Data",
        subtitle = "F_critical = 4.403"
      )

    Any value of F more than the F_crit occurs with smaller probability than 0.05/3=0.017. Our F_stat is much higher than F_crit, by orders of magnitude! And so we can say with confidence that Temperature has a significant effect on spawn Time.

    And that is how ANOVA computes!

    Stating the Model

    And supernova gives us a nice linear equation relating Hatching_Time to TempFac:

    supernova::equation(frogs_anova)
    Fitted equation:
    Time = 26.3 + -5.3*TempFac18 + -10.1*TempFac25 + e

    TempFac18 and TempFac25 are binary {0,1} coded variables, representing the test situation. e is the remaining error. The equation models the means at each value of TempFac.

    Workflow: Checking ANOVA Assumptions

    ANOVA makes 3 fundamental assumptions:

    1. Data (and errors) are normally distributed.
    2. Variances are equal.
    3. Observations are independent.

    We can check these using checks and graphs.

    Checks for Normality

    The shapiro.wilk test tests if a vector of numeric data is normally distributed and rejects the hypothesis of normality when the p-value is less than or equal to 0.05. 

    shapiro.test(x = frogs_long$Time)
    
        Shapiro-Wilk normality test
    
    data:  frogs_long$Time
    W = 0.92752, p-value = 0.001561

    The p-value is very low and we cannot reject the (alternative) hypothesis that the overall data is not normal. How about normality at each level of the factor?

    frogs_long %>%
      group_by(TempFac) %>%
      group_modify(~ .x %>%
        select(Time) %>%
        as_vector() %>%
        shapiro.test() %>%
        broom::tidy())
    ABCDEFGHIJ0123456789
    TempFac
    <fct>
    statistic
    <dbl>
    p.value
    <dbl>
    130.88954260.02637682
    180.92544250.12614802
    250.89789470.03766278
    3 rows | 1-3 of 4 columns

    The shapiro.wilk test makes a NULL Hypothesis that the data are normally distributed and estimates the probability that the given data could have happened by chance. Except for TempFac = 18 the p.values are less than 0.05 and we can reject the NULL hypothesis that each of these is normally distributed. Perhaps this is a sign that we need more than 20 samples per factor level. Let there be more frogs !!! இன்னும தவளைகள் வேண்டும்!! !!

    We can also check the residuals post-model:

    # Set graph theme
    theme_set(new = theme_custom())
    #
    frogs_anova$residuals %>%
      as_tibble() %>%
      gf_dhistogram(~value, data = .) %>%
      gf_fitdistr()
    ##
    frogs_anova$residuals %>%
      as_tibble() %>%
      gf_qq(~value, data = .) %>%
      gf_qqstep() %>%
      gf_qqline()
    ##
    shapiro.test(frogs_anova$residuals)

    
        Shapiro-Wilk normality test
    
    data:  frogs_anova$residuals
    W = 0.94814, p-value = 0.01275

    Unsurprisingly, the residuals are also not normally distributed either.

    Check for Similar Variance

    Response data with different variances at different levels of an explanatory variable are said to exhibit heteroscedasticity. This violates one of the assumptions of ANOVA.

    To check if the Time readings are similar in variance across levels of TempFac, we can use the Levene Test, or since our per-group observations are not normally distributed, a non-parametric rank-based Fligner-Killeen Test. The NULL hypothesis is that the data are with similar variances. The tests assess how probable this is with the given data assuming this NULL hypothesis:

    frogs_long %>%
      group_by(TempFac) %>%
      summarise(variance = var(Time))
    # Not too different...OK on with the test
    DescTools::LeveneTest(Time ~ TempFac, data = frogs_long)
    ##
    fligner.test(Time ~ TempFac, data = frogs_long)
    ABCDEFGHIJ0123456789
    TempFac
    <fct>
    variance
    <dbl>
    131.273684
    181.263158
    251.431579
    3 rows
    ABCDEFGHIJ0123456789
     
     
    Df
    <int>
    F value
    <dbl>
    Pr(>F)
    <dbl>
    group20.39310340.6767746
    57NANA
    2 rows
    
        Fligner-Killeen test of homogeneity of variances
    
    data:  Time by TempFac
    Fligner-Killeen:med chi-squared = 0.53898, df = 2, p-value = 0.7638

    It seems that there is no cause for concern here; the data do not have significantly different variances.

    Independent Observations

    This is an experiment design concern; the way the data is gathered must be specified such that data for each level of the factors ( factor combinations if there are more than one) should be independent.

    Workflow: Effect Size

    The simplest way to find the actual effect sizes detected by an ANOVA test is something we have already done, with the supernova package: Here is the table and plot again:

    # Set graph theme
    theme_set(new = theme_custom())
    #
    frogs_supernova <-
      supernova::pairwise(frogs_anova,
        plot = TRUE,
        alpha = 0.05,
        correction = "Bonferroni"
      )

    frogs_supernova
    
      group_1 group_2    diff pooled_se       t    df   lower  upper p_adj
      <chr>   <chr>     <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl> <dbl>
    1 18      13       -5.300     0.257 -20.608    57  -5.861 -4.739 .0000
    2 25      13      -10.100     0.257 -39.272    57 -10.661 -9.539 .0000
    3 25      18       -4.800     0.257 -18.664    57  -5.361 -4.239 .0000

    This table, the plot, and the equation we set up earlier all give us the sense of how the TempFac affects Time. The differences are given pair-wise between levels of the Qual factor, TempFac, and the standard error has been declared in pooled fashion (all groups together).

    We can also use (paradoxically) the summary.lm() command:

    tidy_anova <-
      frogs_anova %>%
      summary.lm() %>%
      broom::tidy()
    tidy_anova
    ABCDEFGHIJ0123456789
    term
    <chr>
    estimate
    <dbl>
    std.error
    <dbl>
    statistic
    <dbl>
    p.value
    <dbl>
    (Intercept)26.30.2571777102.263942.781059e-66
    TempFac18-5.30.3637041-14.572287.081214e-21
    TempFac25-10.10.3637041-27.769828.187867e-35
    3 rows

    It may take a bit of effort to understand this. First the TempFac is arranged in order of levels, and the mean at the TempFac=13 is titled Intercept. That is 26.3. The other two means for levels 18 and 25 are stated as differences from this intercept, −5.3 and −10.1 respectively. The p.value for all these effect sizes is well below the desired confidence level of 0.05.

    NoteStandard Errors

    Observe that the std.error for the intercept is 0.257 while that for TempFac18 and TempFac25 is 0.257×2=0.363 since the latter are differences in means, while the former is a single mean. The Variance of a difference is the sum of the individual variances, which are equal here.

    We can easily plot bar-chart with error bars for the effect size:

    # Set graph theme
    theme_set(new = theme_custom())
    #
    tidy_anova %>%
      mutate(
        hi = estimate + std.error,
        lo = estimate - std.error
      ) %>%
      gf_hline(
        data = ., yintercept = 0,
        colour = "grey",
        linewidth = 2
      ) %>%
      gf_col(estimate ~ term,
        fill = "grey",
        color = "black",
        width = 0.15
      ) %>%
      gf_errorbar(hi + lo ~ term,
        color = "blue",
        width = 0.2
      ) %>%
      gf_point(estimate ~ term,
        color = "red",
        size = 3.5
      ) %>%
      gf_refine(scale_x_discrete("Temp Values",
        labels = c("13°C", "18°C", "25°C")
      ))

    If we want an “absolute value” plot for effect size, it needs just a little bit of work:

    # Merging group averages with `std.error`
    # Set graph theme
    theme_set(new = theme_custom())
    #
    
    frogs_long %>%
      group_by(TempFac) %>%
      summarise(mean = mean(Time)) %>%
      cbind(std.error = tidy_anova$std.error) %>%
      mutate(
        hi = mean + std.error,
        lo = mean - std.error
      ) %>%
      gf_hline(
        data = ., yintercept = 0,
        colour = "grey",
        linewidth = 2
      ) %>%
      gf_col(mean ~ TempFac,
        fill = "grey",
        color = "black", width = 0.15
      ) %>%
      gf_errorbar(hi + lo ~ TempFac,
        color = "blue",
        width = 0.2
      ) %>%
      gf_point(mean ~ TempFac,
        color = "red",
        size = 3.5
      ) %>%
      gf_refine(scale_x_discrete("Temp Values",
        labels = c("13°C", "18°C", "25°C")
      ))

    In both graphs, note the difference in the error-bar heights.

    The ANOVA test does not tell us that the “treatments” (i.e. levels of TempFac) are equally effective. We need to use a multiple comparison procedure to arrive at an answer to that question. We compute the pair-wise differences in effect-size:

    frogs_anova %>% stats::TukeyHSD()
      Tukey multiple comparisons of means
        95% family-wise confidence level
    
    Fit: aov(formula = Time ~ TempFac, data = frogs_long)
    
    $TempFac
           diff        lwr       upr p adj
    18-13  -5.3  -6.175224 -4.424776     0
    25-13 -10.1 -10.975224 -9.224776     0
    25-18  -4.8  -5.675224 -3.924776     0

    We see that each of the pairwise differences in effect-size is significant, with p = 0 !

    Using other packages

    • Using ggstatsplot
    • Using supernova

    There is a very neat package called ggstatsplot6 that allows us to plot very comprehensive statistical graphs. Let us quickly do this:

    # Set graph theme
    theme_set(new = theme_custom())
    #
    library(ggstatsplot)
    frogs_long %>%
      ggstatsplot::ggbetweenstats(
        x = TempFac, y = Time,
        title = "ANOVA : Frogs Spawn Time vs Temperature Setting"
      )

    We can also obtain crisp-looking anova tables from the new supernova package 7, which is based on the methods discussed in Judd et al. Section 14

    library(supernova)
    supernova::supernova(frogs_anova)
    supernova::pairwise(frogs_anova)
     Analysis of Variance Table (Type III SS)
     Model: Time ~ TempFac
    
                                   SS df      MS       F   PRE     p
     ----- --------------- | -------- -- ------- ------- ----- -----
     Model (error reduced) | 1020.933  2 510.467 385.897 .9312 .0000
     Error (from model)    |   75.400 57   1.323                    
     ----- --------------- | -------- -- ------- ------- ----- -----
     Total (empty model)   | 1096.333 59  18.582                    
    
      group_1 group_2    diff pooled_se       q    df   lower  upper p_adj
      <chr>   <chr>     <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl> <dbl>
    1 18      13       -5.300     0.257 -20.608    57  -6.175 -4.425 .0000
    2 25      13      -10.100     0.257 -39.272    57 -10.975 -9.225 .0000
    3 25      18       -4.800     0.257 -18.664    57  -5.675 -3.925 .0000

    The supernova table clearly shows the reduction the Sum of Squares as we go from a NULL (empty) model to a full ANOVA model.

    Workflow: ANOVA using Permutation Tests

    We wish to establish the significance of the effect size due to each of the levels in TempFac. From the normality tests conducted earlier we see that except at one level of TempFac, the times are are not normally distributed. Hence we opt for a Permutation Test to check for significance of effect.

    As remarked in Ernst8, the non-parametric permutation test can be both exact and also intuitively easier for students to grasp.

    We proceed with a Permutation Test for TempFac. We shuffle the levels (13, 18, 25) randomly between the Times and repeat the ANOVA test each time and calculate the F-statistic. The Null distribution is the distribution of the F-statistic over the many permutations and the p-value is given by the proportion of times the F-statistic equals or exceeds that observed.

    We will use infer to do this: We calculate the observed F-stat with infer, which also has a very direct, if verbose, syntax for doing permutation tests:

    observed_infer <-
      frogs_long %>%
      specify(Time ~ TempFac) %>%
      hypothesise(null = "independence") %>%
      calculate(stat = "F")
    observed_infer
    ABCDEFGHIJ0123456789
    stat
    <dbl>
    385.8966
    1 row

    We see that the observed F-Statistic is of course 385.8966 as before. Now we use infer to generate a NULL distribution using permutation of the factor TempFac:

    null_dist_infer <- frogs_long %>%
      specify(Time ~ TempFac) %>%
      hypothesise(null = "independence") %>%
      generate(reps = 4999, type = "permute") %>%
      calculate(stat = "F")
    ##
    null_dist_infer
    ABCDEFGHIJ0123456789
    replicate
    <int>
    stat
    <dbl>
    11.9359049289
    22.9119835126
    30.3322413952
    40.1998254799
    51.9329404889
    61.0130820818
    71.0409851565
    80.9629891561
    92.4882971338
    100.1366969114
    Next
    123456
    ...
    500
    Previous
    1-10 of 4,999 rows
    ##
    null_dist_infer %>%
      visualise(method = "simulation") +
      shade_p_value(obs_stat = observed_infer$stat, direction = "right") +
      scale_x_continuous(trans = "log10", expand = c(0, 0)) +
      coord_cartesian(xlim = c(0.2, 500), clip = "off") +
      annotation_logticks(outside = FALSE) +
      theme_custom()

    As seen, the infer based permutation test also shows that the permutationally generated F-statistics are nowhere near that which was observed. The effect of TempFac is very strong.

    Wait, But Why?

    • In marketing, design, or business research, similar quantities may be measured across different locations, or stores, or categories of people, for instance.
    • ANOVA is the tool to decide if the Quant variable has differences across the Qual categories.
    • This approach can be extended to more than one Qual variable, and also if there is another Quant variable in the mix.

    Conclusions

    We have discussed ANOVA as a means of modelling the effects of a Categorical variable on a Continuous (Quant) variable. ANOVA can be carried out using the standard formula aov when assumptions on distributions, variances, and independence are met. Permutation ANOVA tests can be carried out when these assumptions do not quite hold.

    NoteTwo-Way ANOVA

    What if we have two Categorical variables as predictors?

    We then need to perform a Two-Way ANOVA analysis, where we look at the predictors individually (main effects) and together (interaction effects). Here too, we need to verify if the number of observations are balanced across all combinations of factors of the two Qualitative predictors. There are three different classical approaches (Type1, Type2, and Type3 ANOVA) for testing hypotheses in ANOVA for unbalanced designs, as they are called. (Langsrud 2003).

    NoteInformative Hypothesis Testing: Models which incorporate a priori Beliefs

    Note that when we specified our research question, we had no specific hypothesis about the means, other than that they might be different. In many situations, we may have reason to believe in the relative “ordering” of the means for different levels of the Categorical variable. The one-sided t-test is the simplest example (e.g., μ1>=0 and μ1>=μ2); this readily extends to the multi-parameter setting, where more than one inequality constraint can be imposed on the parameters (e.g., μ1<=μ2<=μ3.
    It is possible to incorporate these beliefs into the ANOVA model, using what is called as informative hypothesis testing, which have certain advantages compared to unconstrained models. The R package called restriktor has the capability to develop such models with beliefs.

    Your Turn

    1. Try the simple datasets at https://www.performingmusicresearch.com/datasets/

    2. Can you try to ANOVA-analyse the datasets we dealt with in plotting Groups with Boxplots?

    References

    1. The ANOVA tutorial at Our Coding Club
    2. Antoine Soetewey. How to: one-way ANOVA by hand. https://statsandr.com/blog/how-to-one-way-anova-by-hand/
    3. ANOVA in R - Stats and R https://statsandr.com/blog/anova-in-r/
    4. Michael Crawley.(2013) The R Book,second edition. Chapter 11.
    5. David C Howell, Permutation Tests for Factorial ANOVA Designs
    6. Marti Anderson, Permutation tests for univariate or multivariate analysis of variance and regression
    7. Judd, Charles M., Gary H. McClelland, and Carey S. Ryan.(2017). “Introduction to Data Analysis.” In, 1–9. Routledge. https://doi.org/10.4324/9781315744131-1.
    8. Patil, I. (2021). Visualizations with statistical details: The ‘ggstatsplot’ approach. Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
    9. Langsrud, Øyvind. (2003). ANOVA for unbalanced data: Use type II instead of type III sums of squares. Statistics and Computing. 13. 163-167. https://doi.org/10.1023/A:1023260610025. https://www.researchgate.net/publication/220286726_ANOVA_for_unbalanced_data_Use_type_II_instead_of_type_III_sums_of_squares
    10. Kim TK. (2017). Understanding one-way ANOVA using conceptual figures. Korean J Anesthesiol. 2017 Feb;70(1):22-26. https://ekja.org/upload/pdf/kjae-70-22.pdf
    11. Anova – Type I/II/III SS explained.https://mcfromnz.wordpress.com/2011/03/02/anova-type-iiiiii-ss-explained/
    12. Bidyut Ghosh (Aug 28, 2017). One-way ANOVA in R. https://datascienceplus.com/one-way-anova-in-r/
    R Package Citations
    Package Version Citation
    DescTools 0.99.60 Signorell (2025)
    ggprism 1.0.6 Dawson (2025)
    ggstatsplot 0.13.1 Patil (2021)
    ggtext 0.1.2 Wilke and Wiernik (2022)
    restriktor 0.6.10 Vanbrabant and Kuiper (2024)
    supernova 3.0.0 Blake et al. (2024)
    Blake, Adam, Jeff Chrabaszcz, Ji Son, and Jim Stigler. 2024. supernova: Judd, McClelland, & Ryan Formatting for ANOVA Output. https://doi.org/10.32614/CRAN.package.supernova.
    Dawson, Charlotte. 2025. ggprism: A “ggplot2” Extension Inspired by “GraphPad Prism”. https://doi.org/10.32614/CRAN.package.ggprism.
    Langsrud, Øyvind. 2003. Statistics and Computing 13 (2): 163–67. https://doi.org/10.1023/a:1023260610025.
    Patil, Indrajeet. 2021. “Visualizations with statistical details: The ‘ggstatsplot’ approach.” Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.
    Signorell, Andri. 2025. DescTools: Tools for Descriptive Statistics. https://doi.org/10.32614/CRAN.package.DescTools.
    Vanbrabant, Leonard, and Rebecca Kuiper. 2024. restriktor: Restricted Statistical Estimation and Inference for Linear Models. https://doi.org/10.32614/CRAN.package.restriktor.
    Wilke, Claus O., and Brenton M. Wiernik. 2022. ggtext: Improved Text Rendering Support for “ggplot2”. https://doi.org/10.32614/CRAN.package.ggtext.
    Back to top

    Footnotes

    1. The ANOVA tutorial at Our Coding Club.↩︎

    2. https://github.com/UCLATALL/supernova↩︎

    3. https://www.openintro.org/go/?id=anova-supplement↩︎

    4. Pruim R, Kaplan DT, Horton NJ (2017). “The mosaic Package: Helping Students to ‘Think with Data’ Using R.” The R Journal, 9(1), 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.↩︎

    5. mosaic::xpf() gives both a graph and the probabilities.↩︎

    6. ggplot2 Based Plots with Statistical Details • ggstatsplot https://indrajeetpatil.github.io/ggstatsplot/↩︎

    7. https://github.com/UCLATALL/supernova↩︎

    8. Ernst, Michael D. 2004. “Permutation Methods: A Basis for Exact Inference.” Statistical Science 19 (4): 676–85. doi:10.1214/088342304000000396.↩︎

    Citation

    BibTeX citation:
    @online{2023,
      author = {},
      title = {Comparing {Multiple} {Means} with {ANOVA}},
      date = {2023-03-28},
      url = {https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/130-ThreeMeansOrMore/},
      langid = {en},
      abstract = {ANOVA to investigate how frogspawn hatching time varies
        with temperature.}
    }
    
    For attribution, please cite this work as:
    “Comparing Multiple Means with ANOVA.” 2023. March 28, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/130-ThreeMeansOrMore/.
    🃏 Inference for Comparing Two Paired Means
    Inference for Correlation

    License: CC BY-SA 2.0

    Website made with ❤️ and Quarto, by Arvind V.

    Hosted by Netlify .