Inference for Two Independent Means

Author

Arvind V

Published

November 22, 2022

Modified

May 23, 2024

Setting up R Packages

library(tidyverse)
library(mosaic) # Our go-to package
library(ggformula)
library(infer) # An alternative package for inference using tidy data
library(broom) # Clean test results in tibble form
library(skimr) # data inspection
library(explore) # New, Easy package for Stats Test and Viz, and other things
library(resampledata3) # Datasets from Chihara and Hesterberg's book
library(openintro) # datasets
library(gt) # for tables

Introduction

flowchart TD
    A[Inference for Two Independent Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
    B --> C{OK?}
    C -->|Yes, both\n Parametric| D[t.test]
    D <-->F[Linear Model\n Method] 
    C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction]
    W<-->F
    C -->|No\n Non-Parametric| E[wilcox.test]
    E <--> G[Linear Model\n with\n Signed-Ranks]
    C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
    P <--> Q[Linear Model\n with Signed-Rank\n with Permutation]

flowchart TD
    A[Inference for Two Independent Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
    B --> C{OK?}
    C -->|Yes, both\n Parametric| D[t.test]
    D <-->F[Linear Model\n Method] 
    C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction]
    W<-->F
    C -->|No\n Non-Parametric| E[wilcox.test]
    E <--> G[Linear Model\n with\n Signed-Ranks]
    C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
    P <--> Q[Linear Model\n with Signed-Rank\n with Permutation]
 

Case Study #1: A Simple Data set with Two Quant Variables

Let us look at the MathAnxiety dataset from the package resampledata. Here we have โ€œanxietyโ€ scores for boys and girls, for different components of mathematics.

Inspecting and Charting Data

data("MathAnxiety")
MathAnxiety
MathAnxiety_inspect <- inspect(MathAnxiety)
MathAnxiety_inspect$categorical
MathAnxiety_inspect$quantitative

We have ~600 data entries, and with 4 Quant variables; Age,AMAS, RCMAS, and AMAS; and two Qual variables, Gender and Grade. A simple dataset, with enough entries to make it worthwhile to explore as our first example.

Research Question

Is there a difference between boy and girl โ€œanxietyโ€ levels for AMAS (test) in the MathAnxiety dataset?

First, histograms, densities and counts of the variable we are interested in, after converting data into long format:

# Set graph theme
theme_set(new = theme_custom())
#
MathAnxiety %>% 
  gf_density(~ AMAS,
              fill = ~ Gender,
              alpha = 0.5,
              title = "") 
MathAnxiety %>% 
  gf_boxplot(AMAS ~ Gender,
              fill = ~ Gender,
              alpha = 0.5,
              title = "") 
MathAnxiety %>% count(Gender)
MathAnxiety %>% group_by(Gender) %>% summarise(mean = mean(AMAS))

The distributions for anxiety scores for boys and girls overlap considerably and are very similar, though the boxplot for boys shows a significant outlier. Are they close to being normal distributions too? We should check.

A. Check for Normality

Statistical tests for means usually require a couple of checks1 2:

  • Are the data normally distributed?
  • Are the data variances similar?

Let us complete a check for normality: the shapiro.wilk test checks whether a Quant variable is from a normal distribution; the NULL hypothesis is that the data are from a normal distribution. We will also look at Q-Q plots for both variables:

# Set graph theme
theme_set(new = theme_custom())
#

MathAnxiety %>%
  gf_density( ~ AMAS,
              fill = ~ Gender,
              alpha = 0.5,
              title = "Math Anxiety scores for boys and girls") %>%
  gf_facet_grid(~ Gender) %>% 
  gf_fitdistr(dist = "dnorm") 
MathAnxiety%>% 
  gf_qq(~ AMAS, color = ~ Gender) %>% 
  gf_qqline(ylab = "Math Anxiety Score..are they Normally Disributed?") %>%
  gf_facet_wrap(~ Gender)  # independent y-axis
boys_AMAS <- MathAnxiety %>% filter(Gender== "Boy") %>% select(AMAS)
girls_AMAS <- MathAnxiety %>% filter(Gender== "Girl") %>% select(AMAS)

shapiro.test(boys_AMAS$AMAS)
shapiro.test(girls_AMAS$AMAS)


    Shapiro-Wilk normality test

data:  boys_AMAS$AMAS
W = 0.99043, p-value = 0.03343


    Shapiro-Wilk normality test

data:  girls_AMAS$AMAS
W = 0.99074, p-value = 0.07835

The distributions for anxiety scores for boys and girls are almost normal, visually speaking. With the Shapiro-Wilk test we find that the scores for girls are normally distributed, but the boys scores are not so.

Note

The p.value obtained in the shapiro.wilk test suggests the chances of the data, given the Assumption that they are normally distributed.

We see that MathAnxiety contains discrete-level scores for anxiety for the two variables (for Boys and Girls) anxiety scores. The boys score has a significant outlier, which we saw earlier and perhaps that makes that variable lose out, perhaps narrowly.

B. Check for Variances

Let us check if the two variables have similar variances: the var.test does this for us, with a NULL hypothesis that the variances are not significantly different:

var.test(AMAS ~ Gender, data = MathAnxiety, 
         conf.int = TRUE,conf.level = 0.95) %>% 
  broom::tidy()
qf(0.975,275,322)
[1] 1.254823

The variances are quite similar as seen by the \(p.value = 0.82\). We also saw it visually when we plotted the overlapped distributions earlier.

Conditions:
  1. The two variables are not both normally distributed.
  2. The two variances are significantly similar.

Hypothesis

Based on the graphs, how would we formulate our Hypothesis? We wish to infer whether there is any difference in the mean anxiety score between Girls and Boys in the dataset MathAnxiety. So accordingly:

\[ H_0: \mu_{Boys} = \mu_{Girls}\\ \] \[ H_a: \mu_{Girls} \ne \mu_{Boys}\\ \]

Observed and Test Statistic

What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of scores non-zero? We use the diffmean function:

obs_diff_amas <- diffmean(AMAS ~ Gender, data = MathAnxiety) 
obs_diff_amas
diffmean 
  1.7676 

Girlsโ€™ AMAS anxiety scores are, on average, \(1.76\) points higher than those for Boys.

On Observed Difference Estimates

Different tests here will show the difference as positive or negative, but with the same value! This depends upon the way the factor variable Gender is used, i.e. Boy-Girl or Girl-Boyโ€ฆ

Inference

All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t_test(AMAS ~ Gender, 
               data = MathAnxiety) %>% 
  broom::tidy() %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "t.test")

wilcox.test(AMAS ~ Gender, 
               data = MathAnxiety) %>% 
  broom::tidy() %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "wilcox.test")

lm(AMAS ~ Gender, 
   data = MathAnxiety) %>% 
  broom::tidy(conf.int = TRUE,
              conf.level = 0.95) %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Linear Model with Original Data")

lm(rank(AMAS) ~ Gender, 
   data = MathAnxiety) %>% 
  broom::tidy(conf.int = TRUE,
              conf.level = 0.95) %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Linear Model with Ranked Data")
t.test
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high method alternative
-1.7676 21.16718 22.93478 -3.291843 0.001055808 580.2004 -2.822229 -0.7129706 Welch Two Sample t-test two.sided
wilcox.test
statistic p.value method alternative
37483 0.0007736219 Wilcoxon rank sum test with continuity correction two.sided
Linear Model with Original Data
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 21.16718 0.3641315 58.130602 5.459145e-248 20.4520482 21.882317
GenderGirl 1.76760 0.5364350 3.295087 1.042201e-03 0.7140708 2.821129
Linear Model with Ranked Data
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 278.04644 9.535561 29.158898 6.848992e-117 259.31912 296.7738
GenderGirl 47.64559 14.047696 3.391701 7.405210e-04 20.05668 75.2345
Back to top

As we can see, all tests are in agreement that there is a significant effect of Gender on the AMAS anxiety scores!

Case Study #2: Youth Risk Behavior Surveillance System (YRBSS) survey

Every two years, the Centers for Disease Control and Prevention in the USA conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from highschoolers (9th through 12th grade), to analyze health patterns. We will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.Type this in your console: help(yrbss)

Inspecting and Charting Data

data(yrbss)
yrbss
yrbss_inspect <- inspect(yrbss)
yrbss_inspect$categorical
yrbss_inspect$quantitative

We have 13K data entries, and with 13 different variables, some Qual and some Quant. Many entries are missing too, typical of real-world data and something we will have to account for in our computations. The meaning of each variable can be found by bringing up the help file.Type this in your console: help(yrbss)

In this Case Study, our research question is:

Research Question

Does weight of highschoolers in this dataset vary with gender?

Inspecting and Charting Data

First, histograms, densities and counts of the variable we are interested in:

# Set graph theme
theme_set(new = theme_custom())
#

yrbss_select_gender <- yrbss %>% 
  select(weight, gender) %>% 
  drop_na(weight) # Sadly dropping off NA data
###
yrbss_select_gender %>%
  gf_density( ~ weight,
              fill = ~ gender,
              alpha = 0.5,
              title = "Highschoolers' Weights by Gender") 
###
yrbss_select_gender %>%
  gf_boxplot(weight ~ gender,
              fill = ~ gender,
              alpha = 0.5,
              title = "Highschoolers' Weights by Gender") 
###
yrbss_select_gender %>% count(gender)

Overlapped Distribution plot shows some difference in the means; and the Boxplots show visible difference in the medians.

A. Check for Normality

As stated before, statistical tests for means usually require a couple of checks:

  • Are the data normally distributed?
  • Are the data variances similar?

Let us also complete a visual check for normality, with plots since we cannot do a shapiro.test:

Shapiro-Wilks Test

The longest data it can take (in R) is 5000. Since our data is longer, we will cannot use this procedure and have to resort to visual means.

Let us plot frequency distribution and Q-Q plots6 for both variables.

# Set graph theme
theme_set(new = theme_custom())
#

male_student_weights <- yrbss_select_gender %>% 
  filter(gender == "male") %>% 
  select(weight)
female_student_weights <- yrbss_select_gender %>% 
  filter(gender == "female") %>% 
  select(weight)
#shapiro.test(male_student_weights$weight)
#shapiro.test(female_student_weights$weight)

yrbss_select_gender %>%
  gf_density( ~ weight,
              fill = ~ gender,
              alpha = 0.5,
              title = "Highschoolers' Weights by Gender") %>%
  gf_facet_grid(~ gender) %>% 
  gf_fitdistr(dist = "dnorm") 
yrbss_select_gender %>% 
  gf_qq(~ weight | gender) %>% 
  gf_qqline(ylab = "scores") 

Distributions are not too close to normalโ€ฆperhaps a hint of a rightward skew, suggesting that there are some obese students.

No real evidence (visually) of the variables being normally distributed.

B. Check for Variances

Let us check if the two variables have similar variances: the var.test does this for us, with a NULL hypothesis that the variances are not significantly different:

var.test( weight ~  gender, data = yrbss_select_gender, 
          conf.int = TRUE,
          conf.level = 0.95) %>% 
  broom::tidy()

#qf(0.975,6164, 6413)

The p.value being so small, we are able to reject the NULL Hypothesis that the variances of weight are nearly equal across the two exercise regimes.

Conditions
  1. The two variables are not normally distributed.
  2. The two variances are also significantly different.

This means that the parametric t.test must be eschewed in favour of the non-parametric wilcox.test. We will use that, and also attempt linear models with rank data, and a final permutation test.

Hypothesis

Based on the graphs, how would we formulate our Hypothesis? We wish to infer whether there is difference in mean weight across gender. So accordingly:

\[ H_0: \mu_{male} = \mu_{female}\\ \\\ H_a: \mu_{male} \ne \mu_{female}\ \]

Observed and Test Statistic

What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of scores non-zero? We use the diffmean function, from mosaic:

obs_diff_gender <- diffmean(weight ~ gender, data = yrbss_select_gender) 

obs_diff_gender
diffmean 
11.70089 

Inference

All Tests Together

We can put all the test results together to get a few more insights about the tests:

wilcox.test(weight ~ gender, data = yrbss_select_gender, 
            conf.int = TRUE, 
            conf.level = 0.95) %>% 
  broom::tidy() %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "wilcox.test")

lm(rank(weight) ~ gender, 
   data = yrbss_select_gender) %>% 
  broom::tidy(conf.int = TRUE,
              conf.level = 0.95) %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Linear Model with Ranked Data")
wilcox.test
estimate statistic p.value conf.low conf.high method alternative
-11.33999 10808212 0 -11.34003 -10.87994 Wilcoxon rank sum test with continuity correction two.sided
Linear Model with Ranked Data
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 4836.157 42.52745 113.71848 0 4752.797 4919.517
gendermale 2851.246 59.55633 47.87478 0 2734.507 2967.986

The wilcox.test and the linear model with rank data offer the same results. This is of course not surprising!

Case Study #3: Weight vs Exercise in the YRBSS Survey

Finally, consider the possible relationship between a highschoolerโ€™s weight and their physical activity.

First, letโ€™s create a new variable physical_3plus, which will be coded as either โ€œyesโ€ if the student is physically active for at least 3 days a week, and โ€œnoโ€ if not. Recall that we have several missing data in that column, so we will (sadly) drop these before generating the new variable:

yrbss_select_phy <- yrbss %>% 
  drop_na(physically_active_7d, weight) %>% 
  mutate(physical_3plus = if_else(physically_active_7d >= 3, "yes", "no"),
         physical_3plus = factor(physical_3plus, 
                                 labels = c("yes", "no"),
                                 levels = c("yes", "no"))) %>% 
  select(weight,physical_3plus)

# Let us check
yrbss_select_phy%>% count(physical_3plus)
Research Question

Does weight vary based on whether students exercise on more or less than 3 days a week? (physically_active_7d >= 3 days)

Inspecting and Charting Data

We can make distribution plots for weight by physical_3plus:

# Set graph theme
theme_set(new = theme_custom())
#
gf_boxplot(weight ~ physical_3plus, 
          fill = ~ physical_3plus,
          data = yrbss_select_phy, xlab = "Days of Exercise >=3 ")

###
gf_density(~ weight,
          fill = ~ physical_3plus,
          data = yrbss_select_phy)

The box plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following to first group the data by the physical_3plus variable, and then calculate the mean weight in these groups using the mean function while ignoring missing values by setting the na.rm argument to TRUE.

yrbss_select_phy %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))

There is an observed difference, but is this difference large enough to deem it โ€œstatistically significantโ€? In order to answer this question we will conduct a hypothesis test. But before that a few more checks on the data:

A. Check for Normality

As stated before, statistical tests for means usually require a couple of checks:

  • Are the data normally distributed?
  • Are the data variances similar?

Let us also complete a visual check for normality,with plots since we cannot do a shapiro.test:

yrbss_select_phy %>%
  gf_density( ~ weight,
              fill = ~ physical_3plus,
              alpha = 0.5,
              title = "Highschoolers' Weights by Exercise Frequency") %>%
  gf_facet_grid(~ physical_3plus) %>% 
  gf_fitdistr(dist = "dnorm") 

Again, not normally distributedโ€ฆ

We can plot Q-Q plots for both variables, and also compare both data with normally-distributed data generated with the same means and standard deviations:

yrbss_select_phy %>% 
  gf_qq(~ weight | physical_3plus , color = ~ physical_3plus) %>% 
  gf_qqline(ylab = "Weight") 

The QQ-plots confirm that he tow data variables are not normally distributed.

B. Check for Variances

Let us check if the two variables have similar variances: the var.test does this for us, with a NULL hypothesis that the variances are not significantly different:

var.test( weight ~ physical_3plus, data = yrbss_select_phy, 
          conf.int = TRUE,
          conf.level = 0.95) %>% 
  broom::tidy()

# Critical F value
qf(0.975,4021, 8341)
[1] 1.054398

The p.value states the probability of the data being what it is, assuming the NULL hypothesis that variances were similar. It being so small, we are able to reject this NULL Hypothesis that the variances of weight are nearly equal across the two exercise frequencies. (Compare the statistic in the var.test with the critical F-value)

Conditions
  1. The two variables are not normally distributed.
  2. The two variances are also significantly different.

Hence we will have to use non-parametric tests to infer if the means are similar.

Hypothesis

Based on the graphs, how would we formulate our Hypothesis? We wish to infer whether there is difference in mean weight across physical_3plus. So accordingly:

\[ H_0: \mu_{physical-3plus-Yes} = \mu_{physical-3plus-No}\\ \\\ H_a: \mu_{physical-3plus-Yes} \ne \mu_{physical-3plus-No}\\ \]

Observed and Test Statistic

What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of scores non-zero? We use the diffmean function, from mosaic:

obs_diff_phy <- diffmean(weight ~ physical_3plus, data = yrbss_select_phy) 

obs_diff_phy
 diffmean 
-1.774584 

Inference

::: {.panel-tabset .nav-pills style=โ€œbackground: whitesmoke;โ€}

Using parametric t.test

Well, the variables are not normally distributed, and the variances are significantly different so a standard t.test is not advised. We can still try:

mosaic::t_test(weight ~ physical_3plus,
               var.equal = FALSE, # Welch Correction
               data = yrbss_select_phy) %>% 
  broom::tidy()

The p.value is \(8.9e-08\) ! And the Confidence Interval is clear of \(0\). So the t.test gives us good reason to reject the Null Hypothesis that the means are similar. But can we really believe this, given the non-normality of data?

Using non-parametric paired Wilcoxon test

However, we have seen that the data variables are not normally distributed. So a Wilcoxon Test, using signed-ranks, is indicated: (recall the model!)

# For stability reasons, it may be advisable to use rounded data or to set digits.rank = 7, say, 
# such that determination of ties does not depend on very small numeric differences (see the example).

wilcox.test(weight ~ physical_3plus,
            conf.int = TRUE,
            conf.level = 0.95,
            data = yrbss_select_phy) %>% 
  broom::tidy()

The nonparametric wilcox.test also suggests that the means for weight across physical_3plus are significantly different.

Using the Linear Model Interpretation

We can apply the linear-model-as-inference interpretation to the ranked data data to implement the non-parametric test as a Linear Model:

\[ lm(rank(weight) \sim physical.3plus) = \beta_0 + \beta_1 \times physical.3plus \\ H_0: \beta_1 = 0\\ \\\ H_a: \beta_1 \ne 0\\ \]

lm(rank(weight) ~ physical_3plus, 
   data = yrbss_select_phy) %>% 
  broom::tidy(conf.int = TRUE,
              conf.level = 0.95)

Here too, the linear model using rank data arrives at a conclusion similar to that of the Mann-Whitney U test.

Using Permutation Tests

For this last Case Study, we will do this in two ways, just for fun: one using our familiar mosaic package, and the other using the package infer.

But first, we need to initialize the test, which we will save as obs_diff_**.

obs_diff_infer <- yrbss_select_phy %>%
  infer::specify(weight ~ physical_3plus) %>%
  infer::calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff_infer
obs_diff_mosaic <- mosaic::diffmean(~ weight | physical_3plus, data = yrbss_select_phy)
obs_diff_mosaic
obs_diff_phy
 diffmean 
-1.774584 
 diffmean 
-1.774584 
Important

Note that obs_diff_infer is a 1 X 1 dataframe; obs_diff_mosaic is a scalar!!

Your Turn

  1. Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.

  2. Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.

  3. Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who donโ€™t.

  4. Now, a non-inference task: Determine the number of different options there are in the dataset for the hours_tv_per_school_day there are.

  5. Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your \(\alpha\) level, and conclude in context.


Conclusion

We have learnt how to perform inference for independent means. We have looked at the conditions that make the regular t.test possible, and learnt what to do if the conditions of normality and equal variance are not met. We have also looked at how these tests can be understood as manifestations of the linear model, with data and sign-ranked data. It should also be fairly clear now that we can test for the equivalence of two paired means, using a very simple permutation tests. Given computing power, we can always mechanize this test very quickly to get our results. And that performing this test yields reliable results without having to rely on any assumption relating to underlying distributions and so on.

References

  1. Randall Pruim, Nicholas J. Horton, Daniel T. Kaplan, Start Teaching with R
  2. https://bcs.wiley.com/he-bcs/Books?action=index&itemId=111941654X&bcsId=11307
  3. https://statsandr.com/blog/wilcoxon-test-in-r-how-to-compare-2-groups-under-the-non-normality-assumption/
R Package Citations
Package Version Citation
explore 1.3.0 @explore
infer 1.0.7 @infer
openintro 2.4.0 @openintro
resampledata 0.3.1 @resampledata
TeachHist 0.2.1 @TeachHist
TeachingDemos 2.13 @TeachingDemos