Inference for Two Independent Means

Author

Arvind V

Published

November 22, 2022

Modified

May 23, 2024

Setting up R Packages

library(tidyverse)
library(mosaic) # Our go-to package
library(ggformula)
library(infer) # An alternative package for inference using tidy data
library(broom) # Clean test results in tibble form
library(skimr) # data inspection
library(explore) # New, Easy package for Stats Test and Viz, and other things
library(resampledata3) # Datasets from Chihara and Hesterberg's book
library(openintro) # datasets
library(gt) # for tables

Introduction

flowchart TD
    A[Inference for Two Independent Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
    B --> C{OK?}
    C -->|Yes, both\n Parametric| D[t.test]
    D <-->F[Linear Model\n Method] 
    C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction]
    W<-->F
    C -->|No\n Non-Parametric| E[wilcox.test]
    E <--> G[Linear Model\n with\n Signed-Ranks]
    C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
    P <--> Q[Linear Model\n with Signed-Rank\n with Permutation]

flowchart TD
    A[Inference for Two Independent Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
    B --> C{OK?}
    C -->|Yes, both\n Parametric| D[t.test]
    D <-->F[Linear Model\n Method] 
    C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction]
    W<-->F
    C -->|No\n Non-Parametric| E[wilcox.test]
    E <--> G[Linear Model\n with\n Signed-Ranks]
    C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
    P <--> Q[Linear Model\n with Signed-Rank\n with Permutation]

Case Study #1: A Simple Data set with Two Quant Variables

Let us look at the MathAnxiety dataset from the package resampledata. Here we have “anxiety” scores for boys and girls, for different components of mathematics.

Inspecting and Charting Data

data("MathAnxiety")
MathAnxiety
MathAnxiety_inspect <- inspect(MathAnxiety)
MathAnxiety_inspect$categorical
MathAnxiety_inspect$quantitative

We have ~600 data entries, and with 4 Quant variables; Age,AMAS, RCMAS, and AMAS; and two Qual variables, Gender and Grade. A simple dataset, with enough entries to make it worthwhile to explore as our first example.

Research Question

Is there a difference between boy and girl “anxiety” levels for AMAS (test) in the MathAnxiety dataset?

First, histograms, densities and counts of the variable we are interested in, after converting data into long format:

# Set graph theme
theme_set(new = theme_custom())
#
MathAnxiety %>% 
  gf_density(~ AMAS,
              fill = ~ Gender,
              alpha = 0.5,
              title = "") 
MathAnxiety %>% 
  gf_boxplot(AMAS ~ Gender,
              fill = ~ Gender,
              alpha = 0.5,
              title = "") 
MathAnxiety %>% count(Gender)
MathAnxiety %>% group_by(Gender) %>% summarise(mean = mean(AMAS))

The distributions for anxiety scores for boys and girls overlap considerably and are very similar, though the boxplot for boys shows a significant outlier. Are they close to being normal distributions too? We should check.

A. Check for Normality

Statistical tests for means usually require a couple of checks¹ ²:

Are the data normally distributed?
Are the data variances similar?

Let us complete a check for normality: the shapiro.wilk test checks whether a Quant variable is from a normal distribution; the NULL hypothesis is that the data are from a normal distribution. We will also look at Q-Q plots for both variables:

# Set graph theme
theme_set(new = theme_custom())
#

MathAnxiety %>%
  gf_density( ~ AMAS,
              fill = ~ Gender,
              alpha = 0.5,
              title = "Math Anxiety scores for boys and girls") %>%
  gf_facet_grid(~ Gender) %>% 
  gf_fitdistr(dist = "dnorm") 
MathAnxiety%>% 
  gf_qq(~ AMAS, color = ~ Gender) %>% 
  gf_qqline(ylab = "Math Anxiety Score..are they Normally Disributed?") %>%
  gf_facet_wrap(~ Gender)  # independent y-axis
boys_AMAS <- MathAnxiety %>% filter(Gender== "Boy") %>% select(AMAS)
girls_AMAS <- MathAnxiety %>% filter(Gender== "Girl") %>% select(AMAS)

shapiro.test(boys_AMAS$AMAS)
shapiro.test(girls_AMAS$AMAS)


    Shapiro-Wilk normality test

data:  boys_AMAS$AMAS
W = 0.99043, p-value = 0.03343


    Shapiro-Wilk normality test

data:  girls_AMAS$AMAS
W = 0.99074, p-value = 0.07835

The distributions for anxiety scores for boys and girls are almost normal, visually speaking. With the Shapiro-Wilk test we find that the scores for girls are normally distributed, but the boys scores are not so.

Note

The p.value obtained in the shapiro.wilk test suggests the chances of the data, given the Assumption that they are normally distributed.

We see that MathAnxiety contains discrete-level scores for anxiety for the two variables (for Boys and Girls) anxiety scores. The boys score has a significant outlier, which we saw earlier and perhaps that makes that variable lose out, perhaps narrowly.

B. Check for Variances

Let us check if the two variables have similar variances: the var.test does this for us, with a NULL hypothesis that the variances are not significantly different:

var.test(AMAS ~ Gender, data = MathAnxiety, 
         conf.int = TRUE,conf.level = 0.95) %>% 
  broom::tidy()
qf(0.975,275,322)

[1] 1.254823

The variances are quite similar as seen by the \(p.value = 0.82\). We also saw it visually when we plotted the overlapped distributions earlier.

Conditions:

The two variables are not both normally distributed.
The two variances are significantly similar.

Hypothesis

Based on the graphs, how would we formulate our Hypothesis? We wish to infer whether there is any difference in the mean anxiety score between Girls and Boys in the dataset MathAnxiety. So accordingly:

\[ H_0: \mu_{Boys} = \mu_{Girls}\\ \] \[ H_a: \mu_{Girls} \ne \mu_{Boys}\\ \]

Observed and Test Statistic

What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of scores non-zero? We use the diffmean function:

obs_diff_amas <- diffmean(AMAS ~ Gender, data = MathAnxiety) 
obs_diff_amas

diffmean 
  1.7676

Girls’ AMAS anxiety scores are, on average, \(1.76\) points higher than those for Boys.

On Observed Difference Estimates

Different tests here will show the difference as positive or negative, but with the same value! This depends upon the way the factor variable Gender is used, i.e. Boy-Girl or Girl-Boy…

Inference

Since the data are not both normally distributed, though the variances similar, we typically cannot use a parametric t.test. However, we can still examine the results:

mosaic::t_test(AMAS ~ Gender, 
               data = MathAnxiety) %>% 
  broom::tidy()

The p.value is \(0.001\) ! And the Confidence Interval does not straddle \(0\). So the t.test gives us good reason to reject the Null Hypothesis that the means are similar and that there is a significant difference between Boys and Girls when it comes to AMAS anxiety. But can we really believe this, given the non-normality of data?

Since the data variables do not satisfy the assumption of being normally distributed, and though the variances are similar, we use the classical wilcox.testType help(wilcox.test) in your Console., which implements what we need here: the Mann-Whitney U test:³

The Mann-Whitney test as a test of mean ranks. It first ranks all your values from high to low, computes the mean rank in each group, and then computes the probability that random shuffling of those values between two groups would end up with the mean ranks as far apart as, or further apart, than you observed. No assumptions about distributions are needed so far. (emphasis mine)

\[ mean(rank(AMAS_{Girls})) - mean(rank(AMAS_{Boys})) = \beta_1;\\ \] \[ H_0: \beta_1 = 0;\\ \\ \] \[ H_a: \beta_1 \ne 0 \]

wilcox.test(AMAS ~ Gender, data = MathAnxiety, 
            conf.int = TRUE, 
            conf.level = 0.95) %>% 
  broom::tidy()

The p.value is very similar, \(0.00077\), and again the Confidence Interval does not straddle \(0\), and we are hence able to reject the NULL hypothesis that the means are equal and accept the alternative hypothesis that there is a significant difference in mean anxiety scores between Boys and Girls.

We can apply the linear-model-as-inference interpretation to the ranked data data to implement the non-parametric test as a Linear Model:

\[ lm(rank(AMAS) \sim gender) = \beta_0 + \beta_1 * gender \] \[ H_0: \beta_1 = 0\\ \] \[ H_a: \beta_1 \ne 0\\ \]

lm(rank(AMAS) ~ Gender, 
   data = MathAnxiety) %>% 
  broom::tidy(conf.int = TRUE,
              conf.level = 0.95)

Dummy Variables in lm

Note how the Qual variable was used here in Linear Regression! The Gender variable was treated as a binary “dummy” variable⁴.

Here too we see that the p.value for the slope term (“GenderGirl”) is significant at \(7.4*10^{-4}\).

We saw from the diagram created by Allen Downey that there is only one test⁵! We will now use this philosophy to develop a technique that allows us to mechanize several Statistical Models in that way, with nearly identical code.

We pretend that Gender has not effect on the AMAS anxiety scores. If this is our position, then the Gender labels are essentially meaningless, and we can pretend that any AMAS score belongs to a Boy or a Girl. This means we can mosaic::shuffle (permute) the Gender labels and see how uncommon our real data is. And we do not have to really worry about whether the data are normally distributed, or if their variances are nearly equal.

Important

The “pretend” position is exactly the NULL Hypothesis!! The “uncommon” part is the p.value under NULL!!

# Set graph theme
theme_set(new = theme_custom())
#
null_dist_amas <- 
  do(9999) * diffmean(data = MathAnxiety, AMAS ~ shuffle(Gender))
null_dist_amas
###

gf_histogram(data = null_dist_amas, ~ diffmean, bins = 25) %>%
  gf_vline(xintercept = obs_diff_amas, colour = "red") 
###
gf_ecdf(data = null_dist_amas, ~ diffmean) %>%
  gf_vline(xintercept = obs_diff_amas, colour = "red") 
###
1-prop1(~ diffmean <= obs_diff_amas, data = null_dist_amas)

prop_TRUE 
    7e-04

Clearly the observed_diff_amas is much beyond anything we can generate with permutations with gender! And hence there is a significant difference in weights across gender!

All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t_test(AMAS ~ Gender, 
               data = MathAnxiety) %>% 
  broom::tidy() %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "t.test")

wilcox.test(AMAS ~ Gender, 
               data = MathAnxiety) %>% 
  broom::tidy() %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "wilcox.test")

lm(AMAS ~ Gender, 
   data = MathAnxiety) %>% 
  broom::tidy(conf.int = TRUE,
              conf.level = 0.95) %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Linear Model with Original Data")

lm(rank(AMAS) ~ Gender, 
   data = MathAnxiety) %>% 
  broom::tidy(conf.int = TRUE,
              conf.level = 0.95) %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Linear Model with Ranked Data")

t.test
estimate	estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-1.7676	21.16718	22.93478	-3.291843	0.001055808	580.2004	-2.822229	-0.7129706	Welch Two Sample t-test	two.sided

wilcox.test
statistic	p.value	method	alternative
37483	0.0007736219	Wilcoxon rank sum test with continuity correction	two.sided

Linear Model with Original Data
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	21.16718	0.3641315	58.130602	5.459145e-248	20.4520482	21.882317
GenderGirl	1.76760	0.5364350	3.295087	1.042201e-03	0.7140708	2.821129

Linear Model with Ranked Data
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	278.04644	9.535561	29.158898	6.848992e-117	259.31912	296.7738
GenderGirl	47.64559	14.047696	3.391701	7.405210e-04	20.05668	75.2345

Linear Model with Ranked Data
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	4836.157	42.52745	113.71848	0	4752.797	4919.517
gendermale	2851.246	59.55633	47.87478	0	2734.507	2967.986

Package	Version	Citation
explore	1.3.0	@explore
infer	1.0.7	@infer
openintro	2.4.0	@openintro
resampledata	0.3.1	@resampledata
TeachHist	0.2.1	@TeachHist
TeachingDemos	2.13	@TeachingDemos

Inference for Two Independent Means

Setting up R Packages

Introduction

Case Study #1: A Simple Data set with Two Quant Variables

Inspecting and Charting Data

A. Check for Normality

B. Check for Variances

Hypothesis

Observed and Test Statistic

Inference

All Tests Together

Footnotes

Case Study #2: Youth Risk Behavior Surveillance System (YRBSS) survey

Inspecting and Charting Data

Inspecting and Charting Data

A. Check for Normality

B. Check for Variances

Hypothesis

Observed and Test Statistic

Inference

All Tests Together

Case Study #3: Weight vs Exercise in the YRBSS Survey

Inspecting and Charting Data

A. Check for Normality

B. Check for Variances

Hypothesis

Observed and Test Statistic

Inference

Using parametric `t.test`

Using non-parametric paired Wilcoxon test

Using the Linear Model Interpretation

Using Permutation Tests

Your Turn

Conclusion

References

R Package Citations

Setting up R Packages

Introduction

Case Study #1: A Simple Data set with Two Quant Variables

Inspecting and Charting Data

A. Check for Normality

B. Check for Variances

Hypothesis

Observed and Test Statistic

Inference

All Tests Together

Footnotes

Case Study #2: Youth Risk Behavior Surveillance System (YRBSS) survey

Inspecting and Charting Data

Inspecting and Charting Data

A. Check for Normality

B. Check for Variances

Hypothesis

Observed and Test Statistic

Inference

All Tests Together

Case Study #3: Weight vs Exercise in the YRBSS Survey

Inspecting and Charting Data

A. Check for Normality

B. Check for Variances

Hypothesis

Observed and Test Statistic

Inference

Using parametric t.test

Using non-parametric paired Wilcoxon test

Using the Linear Model Interpretation

Using Permutation Tests

Your Turn

Conclusion

References

R Package Citations

Using parametric `t.test`