πŸƒ Inference for Comparing Two Paired Means

Author

Arvind V

Published

November 10, 2022

Modified

May 21, 2024

Setting up R Packages

knitr::opts_chunk$set(echo = FALSE,message = TRUE, warning = TRUE, fig.align = "center")
options(scipen=5)
library(tidyverse)
library(mosaic)
library(broom)
library(MKinfer) # Confidence Interval Computation
library(resampledata) ### Datasets from Chihara and Hesterberg's book 
library(gt) # for tables

What is Paired Data?

Sometimes the data is collected on the same set of individual categories, e.g. scores by sport persons in two separate tournaments, or sales of identical items in two separate locations of a chain store. Or say the number of customers in the morning and in the evening, at a set of store locations. In this case we treat the two sets of observations as paired, since they correspond to the same set of observable entities. This is how some experiments give us paired data.

We would naturally be interested in the differences in means across these two sets. In this module, we will examine tests for this purpose.

Introduction to Inference for Paired Data

flowchart TD
    A[Inference for Paired Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
    B --> C{OK?}
    C -->|Yes, both\n Parametric| D[t.test with paired=TRUE]
    D <-->F[Linear Model\n Method] 
    C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction, paired =TRUE]
    W<-->F
    C -->|No\n Non-Parametric| E[wilcox.test with paired = TRUE]
    E <--> G[Linear Model\n with\n Signed-Ranks]
    C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
    P <--> Q[Linear Model\n with Signed-Rank\n with Permutation]
 

We will now use a couple to case studies to traverse all the possible pathways in the Workflow above.

Case Study #1: Results from a Diving Championship

Here we have swimming records across a Semi-Final and a Final:

Inspecting and Charting Data

data("Diving2017")
Diving2017
Diving2017_inspect <- inspect(Diving2017)
Diving2017_inspect$categorical
Diving2017_inspect$quantitative

The data is made up of paired observations per swimmer, one for the semi-final and one for the final race. There are 12 swimmers and therefore 12 paired records. How can we quickly visualize this data?

First, histograms and densities of the two variables at hand:

# Set graph theme
theme_set(new = theme_custom())
#
Diving2017_long <- Diving2017 %>%
  pivot_longer(
    cols = c(Final, Semifinal),
    names_to = "race",
    values_to = "scores"
  )

Diving2017_long %>%
  gf_density( ~ scores,
              fill = ~ race,
              alpha = 0.5,
              title = "Diving Scores") %>%
  gf_facet_grid( ~ race) %>%
  gf_fitdistr(dist = "dnorm")
###
Diving2017_long %>%
  gf_col(
    fct_reorder(Name, scores) ~ scores,
    fill = ~ race,
    alpha = 0.5,
    position = "dodge",
    xlab = "Scores",
    ylab = "Name",
    title = "Diving Scores"
  ) 
###
Diving2017_long %>%
  gf_boxplot(
    scores ~ race,
    fill = ~ race,
    alpha = 0.5,
    xlab = "Race",
    ylab = "Scores",
    title = "Diving Scores"
  ) 

We see that:

  • The data are not normally distributed. With just such few readings (n < 30) it was just possible…more readings would have helped. We will verify this aspect formally very shortly. 
  • There is no immediately identifiable trend in score changes from one race to the other.

A. Check for Normality

Let us also complete a check for normality: the shapiro.wilk test checks whether a Quant variable is from a normal distribution; the NULL hypothesis is that the data are from a normal distribution.

shapiro.test(Diving2017$Final)
shapiro.test(Diving2017$Semifinal)

    Shapiro-Wilk normality test

data:  Diving2017$Final
W = 0.9184, p-value = 0.273


    Shapiro-Wilk normality test

data:  Diving2017$Semifinal
W = 0.86554, p-value = 0.05738

Hmmm….the Shapiro-Wilk test suggests that both scores are normally distributed, though Semifinal is probably marginally so.

Can we check this with plots? We can plot Q-Q plots for both variables, and also compare both data with normally-distributed data generated with the same means and standard deviations:

# Set graph theme
theme_set(new = theme_custom())
#
set.rseed(1234)
Diving2017 %>% 
  mutate(Final_norm = rnorm(n = 12, 
                          mean = mean(Final), 
                          sd = sd(Final)),

         Semifinal_norm = rnorm(n = 12, 
                                      mean = mean(Semifinal), 
                                      sd = sd(Semifinal))) %>% 
  pivot_longer(cols = 
                 c(Semifinal, Final, Semifinal_norm,Final_norm),
               names_to = "score_type", values_to = "value") %>% 
  gf_boxplot(value ~ score_type, fill = ~ score_type, 
             show.legend = FALSE)
###
Diving2017_long %>% 
  gf_qq(~ scores | race) %>% 
  gf_qqline(ylab = "scores")

While the boxplots are not very evocative, we see in the QQ-plots that the Final scores are closer to the straight line than the Semifinal scores. But it is perhaps still hard to accept the data as normally distributed…hmm.

B. Check for Variances

Let us check if the two variables have similar variances: the var.test does this for us, with a NULL hypothesis that the variances are not significantly different:

var.test(scores ~ race, data = Diving2017_long,
         conf.int = TRUE,conf.level = 0.95) %>% 
  broom::tidy()
qf(0.975,11,11)
[1] 3.473699

The variances are not significantly different, as seen by the \(p.value = 0.08\).

So to summarise our data checks:
- data are normally distributed
- variances are not significantly different

Hypothesis

Based on the graph, how would we formulate our Hypothesis? We wish to infer whether there is any change in performance, per swimmer between the two races. So accordingly:

\[ H_0: \mu_{semifinal} = \mu_{final}\\ \]

\[ H_a: \mu_{semifinal} \ne \mu_{final}\ \]

Observed and Test Statistic

What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of scores non-zero? We use the diffmean function, with the argument only.2=FALSE to allow for paired data:

obs_diff_swim <- diffmean(scores ~ race, data = Diving2017_long, 
                          only.2 = FALSE) 

# Can use this also
# formula method is better for permutation test!
# obs_diff_swim <- mean(~ (Final - Semifinal), data = Diving2017)

obs_diff_swim
diffmean 
 -11.975 

Inference

All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t.test(x = Diving2017$Semifinal, 
               y = Diving2017$Final,
               paired = TRUE) %>% broom::tidy() %>% 
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "t.test")

lm(Semifinal - Final ~ 1, data = Diving2017) %>% 
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>% 
    gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Linear Model")

wilcox.test(x = Diving2017$Semifinal, 
               y = Diving2017$Final,
               paired = TRUE) %>% broom::tidy() %>% 
    gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Wilcoxon test")

lm(signed_rank(Semifinal - Final) ~ 1, 
                   data = Diving2017) %>% 
  broom::tidy(conf.int = TRUE,conf.level = 0.95) %>% 
    gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"),cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)) %>% 
  tab_header(title = "Linear Model with sign.rank")
t.test
estimate statistic p.value parameter conf.low conf.high method alternative
-11.975 -1.190339 0.2589684 11 -34.11726 10.16726 Paired t-test two.sided
Linear Model
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -11.975 10.06016 -1.190339 0.2589684 -34.11726 10.16726
Wilcoxon test
statistic p.value method alternative
27 0.3803711 Wilcoxon signed rank exact test two.sided
Linear Model with sign.rank
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -2 2.135558 -0.9365236 0.3691097 -6.70033 2.70033
Back to top

The linear model and the t.test are nearly identical in performance; the p.values are the same. The same is also true of the wilcox.test and the linear model with sign-rank data differences. This is of course not surprising!

Case Study #2: Walmart vs Target

Is there a difference in the price of Groceries sold by the two retailers Target and Walmart? The data set Groceries contains a sample of grocery items and their prices advertised on their respective web sites on one specific day. We will:

  1. Inspect the data set, then explain why this is an example of matched pairs data.
  2. Compute summary statistics of the prices for each store.
  3. Conduct a permutation test to determine whether or not there is a difference in the mean prices.
  4. Create a histogram bar-chart of the difference in prices. What is unusual about Quaker Oats Life cereal?
  5. Redo the hypothesis test without this observation. Would we reach the same conclusion?

Inspecting and Charting Data

data("Groceries")
Groceries <- Groceries %>% 
  mutate(Product = stringr::str_squish(Product)) # Knock off extra spaces
Groceries
Groceries_inspect <- inspect(Groceries)
Groceries_inspect$categorical
Groceries_inspect$quantitative

There are just 30 prices for each vendor….just barely enough to get an idea of what the distribution might be. Let us plot histograms/densities of the two variables that we wish to compare. We will also overlay a Gaussian distribution for comparison:

# Set graph theme
theme_set(new = theme_custom())
#
Groceries_long <- Groceries %>%
  pivot_longer(
    cols = c(Walmart, Target),
    names_to = "store",
    values_to = "prices"
  ) %>% 
  mutate(store = as_factor(store))

Groceries_long %>%
  gf_dhistogram( ~ prices,
              fill = ~ store,
              alpha = 0.5,
              title = "Grocery Costs") %>%
  gf_facet_grid( ~ store) %>%
  gf_fitdistr(dist = "dnorm")
Groceries_long %>%
  gf_density( ~ prices,
              fill = ~ store,
              alpha = 0.5,
              title = "Grocery Costs") %>%
  gf_facet_grid( ~ store) %>%
  gf_fitdistr(dist = "dnorm")

Not close to the Gaussian…there is clearly some skew to the right, with some items being very costly compared to the rest. More when we check the assumptions on data for the tests.

How about price differences, what we are interested in? Let us plot the prices for the products, as box plots after pivoting the data to long form, 1 and as bar charts:

# Set graph theme
theme_set(new = theme_custom())
#
Groceries_long %>% 
  gf_boxplot(prices~ store, fill = ~ store) 
###
Groceries_long %>%
  gf_col(fct_reorder(Product, prices) ~ prices,
    fill = ~ store,
    alpha = 0.5,
    position = "dodge",
    xlab = "Prices",
    ylab = "",
    title = "Groceries Costs"
  ) %>% 
  gf_col(data = Groceries_long %>% filter(Product == "Quaker Oats Life Cereal Original"), fct_reorder(Product, prices) ~ prices, fill = ~store, position = "dodge")

Note

We see that the price difference between Walmart and Target prices is highest for the Product named Quaker Oats Life Cereal Original. Apart from this Product, the rest have no discernible trend either way. Let us check observed statistic (the mean difference in prices)

obs_diff_price <- diffmean(prices ~ store, 
                           data = Groceries_long, 
                           only.2 = FALSE)

# Can also use
# obs_diff_price <-  mean( ~ Walmart - Target, data = Groceries)

obs_diff_price
  diffmean 
0.05666667 

Hypothesis

Based on the graph, how would we formulate our Hypothesis? We wish to infer whether there is any change in prices, per product between the two Store chains. So accordingly:

\[ H_0: \mu_{Walmart} = \mu_{Target}\\ \]

\[ H_a: \mu_{Walmart} \ne \mu_{Target}\ \]

Testing for Assumptions on the Data

There are a few checks we need to make of our data, to decide what test procedure to use.

A. Check for Normality

shapiro.test(Groceries$Walmart)
shapiro.test(Groceries$Target)

    Shapiro-Wilk normality test

data:  Groceries$Walmart
W = 0.78662, p-value = 3.774e-05


    Shapiro-Wilk normality test

data:  Groceries$Target
W = 0.79722, p-value = 5.836e-05

For both tests, we see that the p.value is very small, indicating that the data are unlikely to be normally distributed. This means we cannot apply a standard paired t.test and need to use the non-parametric wilcox.test, that does not rely on the assumption of normality.

B. Check for Variances

Let us check if the two variables have similar variances:

var.test(Groceries$Walmart, Groceries$Target)

    F test to compare two variances

data:  Groceries$Walmart and Groceries$Target
F = 0.97249, num df = 29, denom df = 29, p-value = 0.9406
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4628695 2.0431908
sample estimates:
ratio of variances 
         0.9724868 

It appears from the \(p.value = 0.9\) and the \(Confidence Interval = [0.4629, 2.0432]\) that we cannot reject the NULL Hypothesis that the variances are not significantly different.

Inference

All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t_test(Groceries$Walmart, Groceries$Target,paired = TRUE) %>% 
  broom::tidy() %>%
  gt() %>%
  tab_style(style = list(cell_fill(color = "cyan"),
                         cell_text(weight = "bold")),
            locations = cells_body(columns = p.value)) %>%
  tab_header(title = "t.test")
###
lm(Target - Walmart ~ 1, data = Groceries) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>% gt() %>%
  tab_style(style = list(cell_fill(color = "cyan"),
                         cell_text(weight = "bold")),
            locations = cells_body(columns = p.value))  %>%
  tab_header(title = "Linear Model")
###
wilcox.test(Groceries$Walmart, Groceries$Target,
  digits.rank = 7,
  paired = TRUE,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy() %>% gt() %>%
  tab_style(style = list(cell_fill(color = "palegreen"),
                         cell_text(weight = "bold")),
            locations = cells_body(columns = p.value))  %>%
  tab_header(title = "Wilcoxon Test")
###
lm(signed_rank(Target - Walmart) ~ 1,
   data = Groceries) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>% gt() %>%
  tab_style(style = list(cell_fill(color = "palegreen"),
                         cell_text(weight = "bold")),
            locations = cells_body(columns = p.value)) %>%
  tab_header(title = "Linear Model with Sign.Ranks")
t.test
estimate statistic p.value parameter conf.low conf.high method alternative
-0.05666667 -0.4704556 0.6415488 29 -0.3030159 0.1896825 Paired t-test two.sided
Linear Model
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.05666667 0.1204506 0.4704556 0.6415488 -0.1896825 0.3030159
Wilcoxon Test
estimate statistic p.value conf.low conf.high method alternative
-0.104966 95 0.01431746 -0.1750051 -0.03005987 Wilcoxon signed rank test with continuity correction two.sided
Linear Model with Sign.Ranks
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 8.533333 2.888834 2.953902 0.006167464 2.625004 14.44166

Clearly, the parametric tests do not detect a significant difference in prices, whereas the non-parametric tests do.

Suppose we knock off the Quaker Cereal data item…(note the spaces in the product name)

# Set graph theme
theme_set(new = theme_custom())
#

set.seed(12345)
Groceries_less <- Groceries %>%
  filter(Product != "Quaker Oats Life Cereal Original")

Groceries_less_long <- Groceries_less %>%
  pivot_longer(
    cols = c(Target, Walmart),
    names_to = "store",
    values_to = "prices"
  )

wilcox.test(Groceries_less$Walmart, Groceries_less$Target,paired = TRUE,
  digits.rank = 7,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()
obs_diff_price_less <-
  mean(~ (Target - Walmart), data = Groceries_less)
obs_diff_price_less
polarity_less <- c(rep(1, 15), rep(-1, 14))
# Due to resampling this small bias makes no difference
null_dist_price_less <-
  do(9999) * mean(data = Groceries_less,
                  ~ (Target - Walmart) * resample(polarity_less,
                                                  replace = TRUE))
##
gf_histogram(data = null_dist_price_less, ~ mean) %>%
  gf_vline(xintercept = obs_diff_price_less, colour = "red")
##
mean(null_dist_price_less >= obs_diff_price_less)
[1] 0.1558621

[1] 0.01370137

We see that removing the Quaker Oats product item from the data does give a significant difference in mean prices !!! That one price difference was in the opposite direction compared to the general trend in differences, so when it was removed, we obtained a truer picture of price differences.

Try to do a regular parametric t.test with this reduced data!

Conclusion

We have learnt how to perform inference for paired-means. We have looked at the conditions that make the regular t.test possible, and learnt what to do if the conditions of normality and equal variance are not met. We have also looked at how these tests can be understood as manifestations of the linear model, with data and sign-ranked data. It should also be fairly clear now that we can test for the equivalence of two paired means, using a very simple permutation tests. Given computing power, we can always mechanize this test very quickly to get our results. And that performing this test yields reliable results without having to rely on any assumption relating to underlying distributions and so on.

References

  1. Randall Pruim, Nicholas J. Horton, Daniel T. Kaplan, Start Teaching with R
  2. https://bcs.wiley.com/he-bcs/Books?action=index&itemId=111941654X&bcsId=11307
  3. https://statsandr.com/blog/wilcoxon-test-in-r-how-to-compare-2-groups-under-the-non-normality-assumption/
R Package Citations
Package Version Citation
gt 0.10.1 @gt
infer 1.0.7 @infer
MKinfer 1.2 @MKinfer
openintro 2.4.0 @openintro