Inference for Two Independent Means

Author

Arvind Venkatadri

Published

November 22, 2022

Modified

July 29, 2025

Setting up R Packages

knitr::opts_chunk$set(echo = TRUE, message = TRUE, warning = TRUE, fig.align = "center")
library(tidyverse)
library(mosaic) # Our go-to package
library(infer) # An alternative package for inference using tidy data
library(broom) # Clean test results in tibble form
library(skimr) # data inspection

library(resampledata) # Datasets from Chihara and Hesterberg's book
library(openintro) # datasets
library(gt) # for tables

Introduction

flowchart TD
    A[Inference for Independent Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
    B --> C{OK?}
    C -->|Yes, both\n Parametric| D[t.test]
    D <-->F[Linear Model\n Method] 
    C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction]
    W<-->F
    C -->|No\n Non-Parametric| E[wilcox.test]
    E <--> G[Linear Model\n with\n Signed-Ranks]
    C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
    P <--> Q[Linear Model\n with Signed-Rank\n with Permutation]

Case Study #1: A Simple Data set with Two Quant Variables

Research Question

TBD

Inspecting and Charting Data

A. Check for Normality

Statistical tests for means usually require a couple of checks¹ ²:

Are the data normally distributed?
Are the data variances similar?

Let us also complete a check for normality: shapiro.test (the Shapiro-Wilk test) checks whether a Quant variable comes from a normal distribution; the NULL hypothesis is that the data are from a normal distribution.
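A minimal sketch of this check on simulated data may help; the vectors `x_normal` and `x_skewed` are made up for illustration and are not part of the case study:

```r
set.seed(42)
x_normal <- rnorm(200, mean = 50, sd = 10)  # drawn from a normal distribution
x_skewed <- rexp(200, rate = 1 / 50)        # strongly right-skewed

# Null hypothesis: the sample comes from a normal distribution
res_normal <- shapiro.test(x_normal)
res_skewed <- shapiro.test(x_skewed)

res_normal$p.value  # usually large for normal data: no reason to reject
res_skewed$p.value  # very small: reject normality
```

A small p-value is evidence against normality; a large one only means that the test found no evidence against it.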

B. Check for Variances

Conditions:

The two variables are not normally distributed.
The two variances are also significantly different.

Hypothesis

Observed and Test Statistic

Inference

Type help(yrbss) in your Console.

Type help(wilcox.test) in your Console.

Case Study #2: Youth Risk Behavior Surveillance System (YRBSS) survey

Every two years, the Centers for Disease Control and Prevention in the USA conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from highschoolers (9th through 12th grade), to analyze health patterns. We will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

Inspecting and Charting Data

data(yrbss)
yrbss
yrbss_inspect <- inspect(yrbss)
yrbss_inspect$categorical
yrbss_inspect$quantitative

age <int>	gender <chr>	grade <chr>	hispanic <chr>	race <chr>
14	female	9	not	Black or African American
14	female	9	not	Black or African American
15	female	9	hispanic	Native Hawaiian or Other Pacific Islander
15	female	9	not	Black or African American
15	female	9	not	Black or African American
15	female	9	not	Black or African American
15	female	9	not	Black or African American
14	male	9	not	Black or African American
15	male	9	not	Black or African American
15	male	10	not	Black or African American

name <chr>	class <chr>	levels <int>	n <int>	missing <int>
gender	character	2	13571	12
grade	character	5	13504	79
hispanic	character	2	13352	231
race	character	5	10778	2805
helmet_12m	character	6	13272	311
text_while_driving_30d	character	8	12665	918
hours_tv_per_school_day	character	7	13245	338
school_night_hours_sleep	character	7	12335	1248

	name <chr>	class <chr>	min <dbl>	Q1 <dbl>
1	age	integer	12.00	15.00
2	height	numeric	1.27	1.60
3	weight	numeric	29.94	56.25
4	physically_active_7d	integer	0.00	2.00
5	strength_training_7d	integer	0.00	0.00

We have about 13K data entries and 13 different variables, some Qual and some Quant. Many entries are missing too, which is typical of real-world data and something we will have to account for in our computations. The meaning of each variable can be found by bringing up the help file.

In this tutorial, our research question is:

Research Question

Does weight of highschoolers in this dataset vary with gender?

Inspecting and Charting Data

First, histograms and densities of the variable we are interested in:

yrbss_select_gender <- yrbss %>%
  select(weight, gender, physically_active_7d) %>%
  drop_na(weight) # Sadly dropping off NA data

yrbss_select_gender %>%
  gf_density(~weight,
    fill = ~gender,
    alpha = 0.5,
    title = "Highschoolers' Weights by Gender"
  ) %>%
  gf_theme(theme_classic())
yrbss_select_gender %>%
  gf_boxplot(weight ~ gender,
    fill = ~gender,
    alpha = 0.5,
    title = "Highschoolers' Weights by Gender"
  ) %>%
  gf_theme(theme_classic())

The overlapped density plot shows some difference in the means, and the boxplots show a visible difference in the medians.

As stated before, statistical tests for means usually require a couple of checks:

Are the data normally distributed?
Are the data variances similar?

Let us also complete a visual check for normality with plots, since we cannot do a shapiro.test:

Shapiro-Wilk Test

The longest data it can take (in R) is 5000. Since our data is longer, we cannot use this procedure and have to resort to visual means.

male_student_weights <- yrbss_select_gender %>%
  filter(gender == "male") %>%
  select(weight)
female_student_weights <- yrbss_select_gender %>%
  filter(gender == "female") %>%
  select(weight)
# shapiro.test(male_student_weights$weight)
# shapiro.test(female_student_weights$weight)
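The size limit can be seen directly in a small sketch. The vector `weights_sim` is simulated, and testing a random subsample is only one possible workaround (an assumption here, not something the tutorial itself does):

```r
set.seed(1)
weights_sim <- rnorm(6000, mean = 68, sd = 17)  # simulated, longer than 5000

# shapiro.test() stops with an error when n > 5000
too_long <- tryCatch(
  shapiro.test(weights_sim),
  error = function(e) conditionMessage(e)
)
too_long  # the error message about the sample size

# Possible workaround: test a random subsample of at most 5000 values
sub_test <- shapiro.test(sample(weights_sim, size = 5000))
sub_test$p.value
```

With samples this large, even tiny departures from normality tend to give small p-values, which is another reason to look at plots.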

yrbss_select_gender %>%
  gf_density(~weight,
    fill = ~gender,
    alpha = 0.5,
    title = "Highschoolers' Weights by Gender"
  ) %>%
  gf_facet_grid(~gender) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_theme(theme_classic())

The distributions are not close to normal: there is perhaps a hint of a rightward skew, suggesting that there are some obese students.

We can plot Q-Q plots³ for both variables, and also compare both data with normally-distributed data generated with the same means and standard deviations:

yrbss_select_gender %>%
  gf_qq(~ weight | gender) %>%
  gf_qqline(ylab = "Weight") %>%
  gf_theme(theme_classic())

No real evidence (visually) of the variables being normally distributed.

Let us check if the two variables have similar variances: the var.test does this for us, with a NULL hypothesis that the two variances are equal (that is, their ratio is 1):

var.test(weight ~ gender,
  data = yrbss_select_gender,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()

# qf(0.975,6164, 6413)

estimate <dbl>	num.df <int>	den.df <int>	statistic <dbl>	p.value <dbl>
0.703976	6164	6413	0.703976	1.065068e-43

The p.value being so small, we are able to reject the NULL Hypothesis that the variances of weight are equal across the two genders.
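To see where the statistic and p-value of the F-test come from, here is a sketch on simulated data (the groups `g1` and `g2` are invented for illustration) that reproduces the var.test result by hand:

```r
set.seed(123)
g1 <- rnorm(300, mean = 60, sd = 10)  # simulated group 1
g2 <- rnorm(350, mean = 70, sd = 15)  # simulated group 2

f_stat <- var(g1) / var(g2)           # ratio of the sample variances
df1 <- length(g1) - 1
df2 <- length(g2) - 1

# Two-sided p-value from the F distribution
p_lower <- pf(f_stat, df1, df2)
p_two_sided <- 2 * min(p_lower, 1 - p_lower)

# Should agree with var.test()
vt <- var.test(g1, g2)
c(by_hand = f_stat, var_test = unname(vt$statistic))
c(by_hand = p_two_sided, var_test = vt$p.value)
```

The estimate and the statistic in the tidy output are the same number because both are this ratio of sample variances.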

Conditions

The two variables are not normally distributed.
The two variances are also significantly different.

This means that the parametric t.test must be eschewed in favour of the non-parametric wilcox.test. We will use that, and also attempt linear models with rank data, and a final permutation test.

Hypothesis

Based on the graphs, how would we formulate our Hypothesis? We wish to infer whether there is difference in mean weight across gender. So accordingly:

$H_{0}: \mu_{male} = \mu_{female}$

$H_{a}: \mu_{male} \neq \mu_{female}$

Observed and Test Statistic

What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of weights non-zero? We use the diffmean function, from mosaic:

obs_diff_gender <- diffmean(weight ~ gender, data = yrbss_select_gender)

obs_diff_gender

diffmean 
11.70089

Inference

Since the data variables do not satisfy the assumption of being normally distributed, and the variances are significantly different, we use the classical wilcox.test, which implements what we need here: the Mann-Whitney U test:⁴

The Mann-Whitney test as a test of mean ranks. It first ranks all your values from high to low, computes the mean rank in each group, and then computes the probability that random shuffling of those values between two groups would end up with the mean ranks as far apart as, or further apart, than you observed. No assumptions about distributions are needed so far. (emphasis mine)
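The ranking step described in the quote can be sketched in base R on two small simulated groups (`a` and `b` are made up for illustration). Note that rank() ranks from low to high; the direction does not matter for the test:

```r
set.seed(7)
a <- round(rnorm(12, mean = 70, sd = 10), 1)  # simulated group A
b <- round(rnorm(15, mean = 60, sd = 10), 1)  # simulated group B

r <- rank(c(a, b))                  # rank all values together
rank_a <- r[seq_along(a)]
rank_b <- r[-seq_along(a)]

mean(rank_a) - mean(rank_b)         # difference in mean ranks

# U (reported as W by wilcox.test) from the rank sum of group A
u_a <- sum(rank_a) - length(a) * (length(a) + 1) / 2
wt <- wilcox.test(a, b, exact = FALSE)
c(by_hand = u_a, wilcox_test = unname(wt$statistic))
```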

We will use the formula interface of wilcox.test. Our model would be:

$mean(rank(Weight_{male})) - mean(rank(Weight_{female})) = \beta_{0}$

$H_{0}: \beta_{0} = 0; \quad H_{a}: \beta_{0} \neq 0$

wilcox.test(weight ~ gender,
  data = yrbss_select_gender,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()

estimate <dbl>	statistic <dbl>	p.value <dbl>	conf.low <dbl>	conf.high <dbl>
-11.33999	10808212	0	-11.34003	-10.87994

The p.value is negligible and we are able to reject the NULL hypothesis that the means are equal.

We can apply the linear-model-as-inference interpretation to the ranked data to implement the non-parametric test as a Linear Model:

$lm(rank(weight) \sim gender) = \beta_{0} + \beta_{1} \times gender$

$H_{0}: \beta_{1} = 0; \quad H_{a}: \beta_{1} \neq 0$

# Create a sign-rank function
# signed_rank <- function(x) {sign(x) * rank(abs(x))}

lm(rank(weight) ~ gender,
  data = yrbss_select_gender
) %>%
  broom::tidy(
    conf.int = TRUE,
    conf.level = 0.95
  )

term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	4836.157	42.52745	113.71848	0
gendermale	2851.246	59.55633	47.87478	0

Dummy Variables in lm

Note how the Qual variable was used here in Linear Regression! The gender variable was treated as a binary “dummy” variable⁵.
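A tiny sketch shows the dummy coding at work. The data frame `toy` below is invented for illustration; model.matrix() displays the 0/1 column that lm() builds from the Qual variable:

```r
toy <- data.frame(
  weight = c(55, 60, 72, 80),
  gender = c("female", "female", "male", "male")
)

# Design matrix that lm() builds internally: gendermale is 0 or 1
X <- model.matrix(weight ~ gender, data = toy)
X

fit <- lm(weight ~ gender, data = toy)
coef(fit)
# (Intercept) is the mean weight of the reference level ("female");
# gendermale is the difference mean(male) - mean(female)
```

The slope on the dummy variable is therefore just the difference between the two group means, which is why the linear model can stand in for a two-sample test.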

We saw from the diagram created by Allen Downey that there is only one test⁶! We will now use this philosophy to develop a technique that allows us to mechanize several Statistical Models in that way, with nearly identical code. For the specific data at hand, we shuffle the gender labels across the students, which breaks any association between gender and weight, and recompute the test statistic (the difference in mean weight between the two groups) for each shuffle. Repeating this many times gives the null distribution of the test statistic, against which we compare the observed difference. We will follow this method in the code below:

null_dist_weight <-
  do(9999) * diffmean(data = yrbss_select_gender, weight ~ shuffle(gender))
null_dist_weight
gf_histogram(data = null_dist_weight, ~diffmean, bins = 25) %>%
  gf_vline(xintercept = obs_diff_gender, colour = "red") %>%
  gf_theme(theme_classic())
gf_ecdf(data = null_dist_weight, ~diffmean) %>%
  gf_vline(xintercept = obs_diff_gender, colour = "red") %>%
  gf_theme(theme_classic())
prop1(~ diffmean <= obs_diff_gender, data = null_dist_weight)

diffmean <dbl>
2.968491e-01
2.065851e-02
2.334340e-02
-3.646964e-01
1.712191e-01
5.583873e-01
-4.461974e-01
2.235745e-01
-1.144226e-01
4.410358e-02

prop_TRUE 
        1

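Clearly obs_diff_gender is far beyond anything we can generate by permuting gender: every one of the 9999 shuffled differences is smaller than the observed one (prop_TRUE is 1), so the one-sided permutation p-value is about 1/10000. Hence there is a significant difference in weights across gender!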
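The same shuffle-and-recompute logic can be written in base R. This is only a sketch on simulated data: the data frame `toy` and the helper `diff_in_means` are hypothetical names, not part of mosaic.

```r
set.seed(2024)
toy <- data.frame(
  weight = c(rnorm(40, mean = 72, sd = 12), rnorm(40, mean = 62, sd = 12)),
  gender = rep(c("male", "female"), each = 40)
)

# Hypothetical helper: difference in mean weight, male minus female
diff_in_means <- function(y, g) mean(y[g == "male"]) - mean(y[g == "female"])

obs <- diff_in_means(toy$weight, toy$gender)

# Shuffle the labels many times and recompute the statistic each time
null_diffs <- replicate(
  4999,
  diff_in_means(toy$weight, sample(toy$gender))
)

# One-sided p-value, counting the observed value itself (as prop1() does)
p_value <- (sum(null_diffs >= obs) + 1) / (length(null_diffs) + 1)
p_value
```

Adding 1 to the numerator and denominator keeps the p-value from being exactly zero when no shuffled difference reaches the observed one.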

We can put all the test results together to get a few more insights about the tests:

wilcox.test(weight ~ gender,
  data = yrbss_select_gender,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "wilcox.test")

lm(rank(weight) ~ gender,
  data = yrbss_select_gender
) %>%
  broom::tidy(
    conf.int = TRUE,
    conf.level = 0.95
  ) %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model with Ranked Data")

wilcox.test
estimate	statistic	p.value	conf.low	conf.high	method	alternative
-11.33999	10808212	0	-11.34003	-10.87994	Wilcoxon rank sum test with continuity correction	two.sided

Linear Model with Ranked Data
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	4836.157	42.52745	113.71848	0	4752.797	4919.517
gendermale	2851.246	59.55633	47.87478	0	2734.507	2967.986

The wilcox.test and the linear model with rank data offer the same results. This is of course not surprising!

Case Study #3: Weight vs Exercise in the YRBSS Survey

Next, consider the possible relationship between a highschooler’s weight and their physical activity.

First, let’s create a new variable physical_3plus, which will be coded as either “yes” if the student is physically active for at least 3 days a week, and “no” if not. Recall that we have several missing data in that column, so we will (sadly) drop these before generating the new variable:

yrbss_select_phy <- yrbss %>%
  drop_na(physically_active_7d, weight) %>%
  mutate(
    physical_3plus = if_else(physically_active_7d >= 3, "yes", "no"),
    physical_3plus = factor(physical_3plus,
      labels = c("yes", "no"),
      levels = c("yes", "no")
    )
  ) %>%
  select(weight, physical_3plus)

# Let us check
yrbss_select_phy %>% count(physical_3plus)

physical_3plus <fct>	n <int>
yes	8342
no	4022

Research Question

Does weight vary based on whether or not students exercise on at least 3 days a week? (physically_active_7d >= 3 days)

Inspecting and Charting Data

We can make distribution plots for weight by physical_3plus:

gf_boxplot(weight ~ physical_3plus,
  fill = ~physical_3plus,
  data = yrbss_select_phy, xlab = "Days of Exercise >=3 "
) %>%
  gf_theme(theme_classic())

gf_density(~weight,
  fill = ~physical_3plus,
  data = yrbss_select_phy
) %>%
  gf_theme(theme_classic())

The box plots show how the medians of the two distributions compare. We can also compare the means: we first group the data by the physical_3plus variable, and then calculate the mean weight in each group using the mean function, ignoring missing values by setting the na.rm argument to TRUE.

yrbss_select_phy %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))

physical_3plus <fct>	mean_weight <dbl>
yes	68.44847
no	66.67389

There is an observed difference, but is this difference large enough to deem it “statistically significant”? In order to answer this question we will conduct a hypothesis test. But before that a few more checks on the data:

As stated before, statistical tests for means usually require a couple of checks:

Are the data normally distributed?
Are the data variances similar?

Let us also complete a visual check for normality with plots, since we cannot do a shapiro.test:

yrbss_select_phy %>%
  gf_density(~weight,
    fill = ~physical_3plus,
    alpha = 0.5,
    title = "Highschoolers' Weights by Exercise Frequency"
  ) %>%
  gf_facet_grid(~physical_3plus) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_theme(theme_classic())

Again, not normally distributed…

We can plot Q-Q plots for both variables, and also compare both data with normally-distributed data generated with the same means and standard deviations:

yrbss_select_phy %>%
  gf_qq(~ weight | physical_3plus, color = ~physical_3plus) %>%
  gf_qqline(ylab = "Weight") %>%
  gf_theme(theme_classic())

The QQ-plots confirm that the two data variables are not normally distributed.

Let us check if the two variables have similar variances: the var.test does this for us, with a NULL hypothesis that the two variances are equal (that is, their ratio is 1):

var.test(weight ~ physical_3plus,
  data = yrbss_select_phy,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()

# Critical F value
qf(0.975, 4021, 8341)

estimate <dbl>	num.df <int>	den.df <int>	statistic <dbl>	p.value <dbl>
0.8728201	8341	4021	0.8728201	4.390179e-07

[1] 1.054398

The p.value states the probability of observing a variance ratio at least as extreme as the one in our data, assuming the NULL hypothesis that the variances are equal. It being so small, we are able to reject this NULL Hypothesis that the variances of weight are equal across the two exercise frequencies. (Compare the statistic in the var.test with the critical F-value: since the statistic is less than 1, compare its reciprocal, 1/0.8728 ≈ 1.146, with 1.054.)
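The comparison with the critical value can be spelled out as a short sketch. The statistic and degrees of freedom are copied from the var.test output above; the variable names are only illustrative:

```r
f_obs <- 0.8728201   # statistic reported by var.test above
df_num <- 8341       # numerator degrees of freedom
df_den <- 4021       # denominator degrees of freedom

# Two-sided rejection region at the 5% level
crit_lower <- qf(0.025, df_num, df_den)
crit_upper <- qf(0.975, df_num, df_den)
c(lower = crit_lower, upper = crit_upper)

# qf(0.975, 4021, 8341) is the reciprocal of the lower critical value
all.equal(1 / crit_lower, qf(0.975, df_den, df_num))

f_obs < crit_lower  # TRUE means the statistic falls in the rejection region
```

Because the observed ratio is below 1, it is the lower critical value (or, equivalently, the reciprocal comparison) that matters here.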

Conditions

The two variables are not normally distributed.
The two variances are also significantly different.

Hence we will have to use non-parametric tests to infer if the means are similar.

Hypothesis

Based on the graphs, how would we formulate our Hypothesis? We wish to infer whether there is difference in mean weight across physical_3plus. So accordingly:

$H_{0}: \mu_{physical-3plus-Yes} = \mu_{physical-3plus-No}$

$H_{a}: \mu_{physical-3plus-Yes} \neq \mu_{physical-3plus-No}$

Observed and Test Statistic

obs_diff_phy <- diffmean(weight ~ physical_3plus, data = yrbss_select_phy)

obs_diff_phy

 diffmean 
-1.774584

Using parametric t.test
Using non-parametric Wilcoxon test

Well, the variables are not normally distributed, and the variances are significantly different so a standard t.test is not advised. We can still try:

mosaic::t_test(weight ~ physical_3plus,
  var.equal = FALSE, # Welch Correction
  data = yrbss_select_phy
) %>%
  broom::tidy()

estimate <dbl>	estimate1 <dbl>	estimate2 <dbl>	statistic <dbl>	p.value <dbl>
1.774584	68.44847	66.67389	5.353003	8.907531e-08

The p.value is $8.9 \times 10^{-8}$, and the Confidence Interval is clear of $0$. So the t.test gives us good reason to reject the Null Hypothesis that the means are similar. But can we really believe this, given the non-normality of data?
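For reference, the Welch-corrected statistic can be computed by hand. The sketch below uses simulated groups `x` and `y` (not the YRBSS data) and checks the result against t.test:

```r
set.seed(99)
x <- rnorm(120, mean = 68, sd = 17)  # simulated group 1
y <- rnorm(80, mean = 66, sd = 14)   # simulated group 2

se2_x <- var(x) / length(x)          # squared standard error of mean(x)
se2_y <- var(y) / length(y)          # squared standard error of mean(y)

t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

p_value <- 2 * pt(-abs(t_stat), df = df_welch)

wt <- t.test(x, y, var.equal = FALSE)
c(t = t_stat, df = df_welch, p = p_value)
```

The correction leaves the variances unpooled and lowers the degrees of freedom, which is what var.equal = FALSE requests.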

However, we have seen that the data variables are not normally distributed. So a Wilcoxon Test, using ranks, is indicated: (recall the model!)

# For stability reasons, it may be advisable to use rounded data or to set digits.rank = 7, say,
# such that determination of ties does not depend on very small numeric differences (see the example).

wilcox.test(weight ~ physical_3plus,
  conf.int = TRUE,
  conf.level = 0.95,
  data = yrbss_select_phy
) %>%
  broom::tidy()

estimate <dbl>	statistic <dbl>	p.value <dbl>	conf.low <dbl>	conf.high <dbl>
2.269967	18314392	1.262977e-16	1.819992	2.720077

The nonparametric wilcox.test also suggests that the means for weight across physical_3plus are significantly different.

Using the Linear Model Interpretation

We can apply the linear-model-as-inference interpretation to the ranked data to implement the non-parametric test as a Linear Model:

$lm(rank(weight) \sim physical.3plus) = \beta_{0} + \beta_{1} \times physical.3plus$

$H_{0}: \beta_{1} = 0; \quad H_{a}: \beta_{1} \neq 0$

lm(rank(weight) ~ physical_3plus,
  data = yrbss_select_phy
) %>%
  broom::tidy(
    conf.int = TRUE,
    conf.level = 0.95
  )

term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	6366.9438	38.96362	163.407391	0.000000e+00
physical_3plusno	-566.9972	68.31527	-8.299715	1.151496e-16

Here too, the linear model using rank data arrives at a conclusion similar to that of the Mann-Whitney U test.

Using Permutation Tests

We will do this in two ways, just for fun: one using mosaic and the other using infer.

But first, we need to compute the observed test statistic, which we will save as obs_diff_infer and obs_diff_mosaic.

obs_diff_infer <- yrbss_select_phy %>%
  infer::specify(weight ~ physical_3plus) %>%
  infer::calculate(stat = "diff in means", order = c("yes", "no"))
obs_diff_infer
obs_diff_mosaic <- mosaic::diffmean(~ weight | physical_3plus, data = yrbss_select_phy)
obs_diff_mosaic
obs_diff_phy

stat <dbl>
1.774584

 diffmean 
-1.774584

 diffmean 
-1.774584

Important

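Note that obs_diff_infer is a 1 X 1 dataframe; obs_diff_mosaic is a scalar!! Note also the opposite signs: with order = c("yes", "no"), infer computes yes - no, whereas diffmean computes no - yes.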
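The practical consequence is that the number has to be pulled out of the data frame before it can be used in arithmetic or comparisons. A base-R sketch, using stand-in objects with the values printed above (`obs_df` and `obs_scalar` are illustrative names, and a plain data.frame stands in for the tibble that infer returns):

```r
# Stand-ins for the two objects (values copied from the output above)
obs_df <- data.frame(stat = 1.774584)   # like obs_diff_infer
obs_scalar <- c(diffmean = -1.774584)   # like obs_diff_mosaic

# Pull the number out of the 1 x 1 data frame before doing arithmetic
obs_from_df <- obs_df$stat[1]

# Same magnitude, opposite sign (yes - no versus no - yes)
all.equal(obs_from_df, -unname(obs_scalar))
```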

Inference Using mosaic

We already have the observed difference, obs_diff_mosaic. Now we generate the null distribution using permutation, with mosaic:

null_dist_mosaic <- do(999) * diffmean(~ weight | shuffle(physical_3plus), data = yrbss_select_phy)

We can also generate the histogram of the null distribution, compare that with the observed difference, and compute the p-value and confidence intervals:

gf_histogram(~diffmean, data = null_dist_mosaic) %>%
  gf_vline(xintercept = obs_diff_mosaic, colour = "red")

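# Proportion of shuffled differences that differ from the observed one.
# For a two-sided p-value, use abs(diffmean) >= abs(obs_diff_mosaic) instead.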
prop(~ diffmean != obs_diff_mosaic, data = null_dist_mosaic)

prop_TRUE 
        1

# Central 95% interval of the null distribution
mosaic::cdata(~diffmean, p = 0.95, data = null_dist_mosaic)

	lower <dbl>	upper <dbl>	central.p <dbl>
2.5%	-0.6019895	0.614377	0.95

Your Turn

Calculate a 95% confidence interval for the average height in meters (height) and interpret it in context.
Calculate a new confidence interval for the same parameter at the 90% confidence level. Comment on the width of this interval versus the one obtained in the previous exercise.
Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.
Now, a non-inference task: Determine the number of different options there are in the dataset for hours_tv_per_school_day.
Come up with a research question evaluating the relationship between height or weight and sleep. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Report the statistical results, and also provide an explanation in plain language. Be sure to check all assumptions, state your $α$ level, and conclude in context.
