🃏 Inference for Comparing Two Paired Means

Author

Arvind V

Published

November 10, 2022

Modified

May 21, 2024

Setting up R Packages

knitr::opts_chunk$set(echo = FALSE,message = TRUE, warning = TRUE, fig.align = "center")
options(scipen=5)
library(tidyverse)
library(mosaic)
library(broom)
library(MKinfer) # Confidence Interval Computation
library(resampledata) ### Datasets from Chihara and Hesterberg's book 
library(gt) # for tables

What is Paired Data?

Sometimes the data is collected on the same set of individual categories, e.g. scores by sport persons in two separate tournaments, or sales of identical items in two separate locations of a chain store. Or say the number of customers in the morning and in the evening, at a set of store locations. In this case we treat the two sets of observations as paired, since they correspond to the same set of observable entities. This is how some experiments give us paired data.

We would naturally be interested in the differences in means across these two sets. In this module, we will examine tests for this purpose.

Introduction to Inference for Paired Data

flowchart TD
    A[Inference for Paired Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
    B --> C{OK?}
    C -->|Yes, both\n Parametric| D[t.test with paired=TRUE]
    D <-->F[Linear Model\n Method] 
    C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction, paired =TRUE]
    W<-->F
    C -->|No\n Non-Parametric| E[wilcox.test with paired = TRUE]
    E <--> G[Linear Model\n with\n Signed-Ranks]
    C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
    P <--> Q[Linear Model\n with Signed-Rank\n with Permutation]

We will now use a couple to case studies to traverse all the possible pathways in the Workflow above.

Case Study #1: Results from a Diving Championship

Here we have swimming records across a Semi-Final and a Final:

Inspecting and Charting Data

data("Diving2017")
Diving2017
Diving2017_inspect <- inspect(Diving2017)
Diving2017_inspect$categorical
Diving2017_inspect$quantitative

The data is made up of paired observations per swimmer, one for the semi-final and one for the final race. There are 12 swimmers and therefore 12 paired records. How can we quickly visualize this data?

First, histograms and densities of the two variables at hand:

# Set graph theme
theme_set(new = theme_custom())
#
Diving2017_long <- Diving2017 %>%
  pivot_longer(
    cols = c(Final, Semifinal),
    names_to = "race",
    values_to = "scores"
  )

Diving2017_long %>%
  gf_density( ~ scores,
              fill = ~ race,
              alpha = 0.5,
              title = "Diving Scores") %>%
  gf_facet_grid( ~ race) %>%
  gf_fitdistr(dist = "dnorm")
###
Diving2017_long %>%
  gf_col(
    fct_reorder(Name, scores) ~ scores,
    fill = ~ race,
    alpha = 0.5,
    position = "dodge",
    xlab = "Scores",
    ylab = "Name",
    title = "Diving Scores"
  ) 
###
Diving2017_long %>%
  gf_boxplot(
    scores ~ race,
    fill = ~ race,
    alpha = 0.5,
    xlab = "Race",
    ylab = "Scores",
    title = "Diving Scores"
  )

We see that:

The data are not normally distributed. With just such few readings (n < 30) it was just possible…more readings would have helped. We will verify this aspect formally very shortly.
There is no immediately identifiable trend in score changes from one race to the other.

A. Check for Normality

Let us also complete a check for normality: the shapiro.wilk test checks whether a Quant variable is from a normal distribution; the NULL hypothesis is that the data are from a normal distribution.

shapiro.test(Diving2017$Final)
shapiro.test(Diving2017$Semifinal)


    Shapiro-Wilk normality test

data:  Diving2017$Final
W = 0.9184, p-value = 0.273


    Shapiro-Wilk normality test

data:  Diving2017$Semifinal
W = 0.86554, p-value = 0.05738

Hmmm….the Shapiro-Wilk test suggests that both scores are normally distributed, though Semifinal is probably marginally so.

Can we check this with plots? We can plot Q-Q plots for both variables, and also compare both data with normally-distributed data generated with the same means and standard deviations:

# Set graph theme
theme_set(new = theme_custom())
#
set.rseed(1234)
Diving2017 %>% 
  mutate(Final_norm = rnorm(n = 12, 
                          mean = mean(Final), 
                          sd = sd(Final)),

         Semifinal_norm = rnorm(n = 12, 
                                      mean = mean(Semifinal), 
                                      sd = sd(Semifinal))) %>% 
  pivot_longer(cols = 
                 c(Semifinal, Final, Semifinal_norm,Final_norm),
               names_to = "score_type", values_to = "value") %>% 
  gf_boxplot(value ~ score_type, fill = ~ score_type, 
             show.legend = FALSE)
###
Diving2017_long %>% 
  gf_qq(~ scores | race) %>% 
  gf_qqline(ylab = "scores")

While the boxplots are not very evocative, we see in the QQ-plots that the Final scores are closer to the straight line than the Semifinal scores. But it is perhaps still hard to accept the data as normally distributed…hmm.

B. Check for Variances

Let us check if the two variables have similar variances: the var.test does this for us, with a NULL hypothesis that the variances are not significantly different:

var.test(scores ~ race, data = Diving2017_long,
         conf.int = TRUE,conf.level = 0.95) %>% 
  broom::tidy()
qf(0.975,11,11)

[1] 3.473699

The variances are not significantly different, as seen by the \(p.value = 0.08\).

So to summarise our data checks:
- data are normally distributed
- variances are not significantly different

Hypothesis

Based on the graph, how would we formulate our Hypothesis? We wish to infer whether there is any change in performance, per swimmer between the two races. So accordingly:

\[ H_0: \mu_{semifinal} = \mu_{final}\\ \]

\[ H_a: \mu_{semifinal} \ne \mu_{final}\ \]

Observed and Test Statistic

What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of scores non-zero? We use the diffmean function, with the argument only.2=FALSE to allow for paired data:

obs_diff_swim <- diffmean(scores ~ race, data = Diving2017_long, 
                          only.2 = FALSE) 

# Can use this also
# formula method is better for permutation test!
# obs_diff_swim <- mean(~ (Final - Semifinal), data = Diving2017)

obs_diff_swim

diffmean 
 -11.975

Inference

Since the data variables satisfy the assumption of being normally distributed, and the variances are not significantly different, we may attempt the classical t.test with paired data. (we will use the mosaic variant). Type help(t.test) in your Console. Our model would be:

\[ mean(Final(i) - Semifinal(i)) = \beta_0 \\ \]

And that \[ H_0: \beta_0 = 0;\\ \\ H_a: \beta_0 \ne 0 \]

mosaic::t.test(x = Diving2017$Semifinal, 
               y = Diving2017$Final,
               paired = TRUE, var.equal = FALSE) %>% broom::tidy()

The confidence interval spans the zero value, and the p.value is a high \(0.259\), so there is no reason to accept alternative hypothesis that the means are different. Hence we say that there is no evidence of a difference between SemiFinal and Final scores.

Well, we might consider ( based on knowledge of the sport ) that at least one of the variables does not meet the normality criteria, and though their variances are not significantly different. So we would attempt a non-parametric Wilcoxon test, that uses the signed-rank of the paired data differences, instead of the data variables. Our model would be:

\[ mean(\ sign.rank[\ Final(i) - Semifinal(i)\ ]\ ) = \beta_0 \\ \]

\[ H_0 : \beta_0 = 0;\\ H_a: \beta_0 \ne 0 \]

wilcox.test(x = Diving2017$Semifinal, 
            y = Diving2017$Final,
               paired = TRUE,conf.int = TRUE,conf.level = 0.95) %>% broom::tidy()

Here also with the p.value being \(0.3804\), we have no reason to accept the Alternative Hypothesis. The parametric t.test and the non-parametric wilcox.test agree in their inferences.

We can apply the linear-model-as-inference interpretation both to the original data and to the sign.rank data:

\[ lm(y_i - x_i \sim 1) = \beta_0\\ \\ and\\ lm(\ sign.rank[\ Final(i) - Semifinal(i)\ ] \sim 1) = \beta_0\\ \]

And the Hypothesis for both interpretations would be:
\[ H_0: \beta_0 = 0\\ \\\ H_a: \beta_0 \ne 0\\ \]

lm(Semifinal - Final ~ 1, data = Diving2017) %>% 
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

# Create a sign-rank function
signed_rank <- function(x) {sign(x) * rank(abs(x))}

lm(signed_rank(Semifinal - Final) ~ 1, 
                   data = Diving2017) %>% 
  broom::tidy(conf.int = TRUE,conf.level = 0.95)

We observe that using the linear model method for the original scores and the sign-rank scores both sdo not permit us to reject the \(H_0\) Null Hypothesis, since p.values are high, and the confidence.intervals straddle \(0\).

We saw from the diagram created by Allen Downey that there is only one test! We will now use this philosophy to develop a technique that allows us to mechanize several Statistical Models in that way, with nearly identical code. For the specific data at hand, we need to shuffle the records between Semifinal and Final on a per Swimmer basis and take the test statistic (difference between the two swim records for each swimmer). Another way to look at this is to take the differences between Semifinal and Final scores and shuffle the differences to either polarity. We will follow this method in the code below:

polarity <- c(rep(1, 6), rep(-1, 6)) 
# 12 +/- 1s, 
# 6 each to make sure there is equal probability
polarity
null_dist_swim <- do(9999) *
  mean(data = Diving2017,
       ~ (Final - Semifinal) * mosaic::resample(polarity, replace = TRUE))

null_dist_swim

 [1]  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1

Let us plot the NULL distribution and compare it with the actual observed differences in the race times:

# Set graph theme
theme_set(new = theme_custom())
#
gf_histogram(data = null_dist_swim, ~ mean) %>%
  gf_vline(xintercept = obs_diff_swim, colour = "red")
###
gf_ecdf(data = null_dist_swim, ~ mean) %>%
  gf_vline(xintercept = obs_diff_swim, colour = "red")
###
prop1(~ mean <= obs_diff_swim, data = null_dist_swim)

prop_TRUE 
   0.1254

Hmm…so by generating 9999 shuffles of score-difference polarities, it does appear that we can not only obtain the current observed difference but even surpass it frequently. So it does seem that there is no difference in means between Semi-Final and Final swimming scores.

All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t.test(x = Diving2017$Semifinal, 
               y = Diving2017$Final,
               paired = TRUE) %>% broom::tidy() %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "t.test")

lm(Semifinal - Final ~ 1, data = Diving2017) %>% 
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>% 
    gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Linear Model")

wilcox.test(x = Diving2017$Semifinal, 
               y = Diving2017$Final,
               paired = TRUE) %>% broom::tidy() %>% 
    gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Wilcoxon test")

lm(signed_rank(Semifinal - Final) ~ 1, 
                   data = Diving2017) %>% 
  broom::tidy(conf.int = TRUE,conf.level = 0.95) %>% 
    gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Linear Model with sign.rank")

t.test
estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-11.975	-1.190339	0.2589684	11	-34.11726	10.16726	Paired t-test	two.sided

Linear Model
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-11.975	10.06016	-1.190339	0.2589684	-34.11726	10.16726

Wilcoxon test
statistic	p.value	method	alternative
27	0.3803711	Wilcoxon signed rank exact test	two.sided

Linear Model with sign.rank
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-2	2.135558	-0.9365236	0.3691097	-6.70033	2.70033

Footnotes

https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/tidyr-pivoting.gif ↩︎

Package	Version	Citation
gt	0.10.1	@gt
infer	1.0.7	@infer
MKinfer	1.2	@MKinfer
openintro	2.4.0	@openintro

🃏 Inference for Comparing Two Paired Means

Setting up R Packages

What is Paired Data?

Introduction to Inference for Paired Data

Case Study #1: Results from a Diving Championship

Inspecting and Charting Data

A. Check for Normality

B. Check for Variances

Hypothesis

Observed and Test Statistic

Inference

All Tests Together

Footnotes

Case Study #2: Walmart vs Target

Inspecting and Charting Data

Hypothesis

Testing for Assumptions on the Data

A. Check for Normality

B. Check for Variances

Inference

All Tests Together

Conclusion

References

R Package Citations