🃏 Inference for Comparing Two Paired Means

Published

November 10, 2022

Modified

October 29, 2024

Setting up R Packages

library(tidyverse)
library(mosaic)
library(broom) # Tidy Test data
library(resampledata3) # Datasets from Chihara and Hesterberg's book
library(gt) # for tables

Plot Theme


# https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto

# Chunk options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, # Golden Ratio
  # out.width = "80%",
  fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes

theme_custom <- function() {
  font <- "Roboto Condensed" # assign font family up front

  theme_classic(base_size = 14) %+replace% # replace elements we want to change

    theme(
      panel.grid.minor = element_blank(), # strip minor gridlines
      text = element_text(family = font),
      # text elements
      plot.title = element_text( # title
        family = font, # set font family
        # size = 20,               #set font size
        face = "bold", # bold typeface
        hjust = 0, # left align
        # vjust = 2                #raise slightly
        margin = margin(0, 0, 10, 0)
      ),
      plot.subtitle = element_text( # subtitle
        family = font, # font family
        # size = 14,                #font size
        hjust = 0,
        margin = margin(2, 0, 5, 0)
      ),
      plot.caption = element_text( # caption
        family = font, # font family
        size = 8, # font size
        hjust = 1
      ), # right align

      axis.title = element_text( # axis titles
        family = font, # font family
        size = 10 # font size
      ),
      axis.text = element_text( # axis text
        family = font, # axis family
        size = 8
      ) # font size
    )
}

# Set graph theme
theme_set(new = theme_custom())
#

Introduction to Inference for Paired Data

What is Paired Data?

Sometimes the data is collected on the same set of individual categories, e.g. scores by sport persons in two separate tournaments, or sales of identical items in two separate locations of a chain store. Or say the number of customers in the morning and in the evening, at a set of store locations. In this case we treat the two sets of observations as paired, since they correspond to the same set of observable entities. This is how some experiments give us paired data.

We would naturally be interested in the differences in means across these two sets, which exploits this paired nature. In this module, we will examine tests for this purpose.
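To see why the pairing matters, here is a minimal sketch in base R with made-up numbers (the store counts below are hypothetical and not from any dataset used in this module). The stores differ widely from one another, but each store gains a few customers in the evening. A paired test works on the within-store differences, while an unpaired test treats the two sets as independent samples.

```r
# Hypothetical customer counts at five store locations (illustrative only)
morning <- c(20, 35, 50, 65, 80)
evening <- morning + c(2, 3, 1, 4, 2) # small, consistent evening gain

# Paired test: uses the five within-store differences
paired <- t.test(evening, morning, paired = TRUE)

# Unpaired (Welch) test: ignores which store each count came from
unpaired <- t.test(evening, morning, paired = FALSE)

c(paired = paired$p.value, unpaired = unpaired$p.value)
```

The paired p.value comes out far smaller than the unpaired one, because the large store-to-store variation cancels out when we take differences.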

Workflow for Inference for Paired Means

We will now use a couple of case studies to traverse all the possible pathways in the Workflow above.

Case Study #1: Results from a Diving Championship

Here we have diving scores across a Semi-Final and a Final:

Inspecting and Charting Data

data("Diving2017", package = "resampledata3")
Diving2017
Diving2017_inspect <- inspect(Diving2017)
Diving2017_inspect$categorical
Diving2017_inspect$quantitative


Name <fct>	Country <fct>	Semifinal <dbl>	Final <dbl>
CHEONG Jun Hoong	Malaysia	325.50	397.50
SI Yajie	China	382.80	396.00
REN Qian	China	367.50	391.95
KIM Mi Rae	North Korea	346.00	385.55
WU Melissa	Australia	318.70	370.20
KIM Kuk Hyang	North Korea	360.85	360.00
ITAHASHI Minami	Japan	313.70	357.85
BENFEITO Meaghan	Canada	355.15	331.40
PAMG Pandelela	Malaysia	322.75	322.40
CHAMANDY Olivia	Canada	320.55	307.15


name <chr>	class <chr>	levels <int>	n <int>	missing <int>	distribution <chr>
Name	factor	12	12	0	SI Yajie (8.3%) ...
Country	factor	8	12	0	Canada (16.7%), China (16.7%) ...


	name <chr>	class <chr>	min <dbl>	Q1 <dbl>	median <dbl>	Q3 <dbl>	max <dbl>	mean <dbl>	sd <dbl>
1	Semifinal	numeric	313.70	322.2000	325.625	356.575	382.8	338.500	22.94946
2	Final	numeric	283.35	318.5875	358.925	387.150	397.5	350.475	40.02204

The data is made up of paired observations per diver, one for the Semifinal and one for the Final round. There are 12 divers and therefore 12 paired records. How can we quickly visualize this data?

Let us first make this data into long form:

# Set graph theme
theme_set(new = theme_custom())
#
Diving2017_long <- Diving2017 %>%
  pivot_longer(
    cols = c(Final, Semifinal),
    names_to = "race",
    values_to = "scores"
  )
Diving2017_long


Name <fct>	Country <fct>	race <chr>	scores <dbl>
CHEONG Jun Hoong	Malaysia	Final	397.50
CHEONG Jun Hoong	Malaysia	Semifinal	325.50
SI Yajie	China	Final	396.00
SI Yajie	China	Semifinal	382.80
REN Qian	China	Final	391.95
REN Qian	China	Semifinal	367.50
KIM Mi Rae	North Korea	Final	385.55
KIM Mi Rae	North Korea	Semifinal	346.00
WU Melissa	Australia	Final	370.20
WU Melissa	Australia	Semifinal	318.70

Next, densities, bar charts, and box plots of the two variables at hand:

Diving2017_long %>%
  gf_density(~scores,
    fill = ~race,
    alpha = 0.5,
    title = "Diving Scores"
  ) %>%
  gf_facet_grid(~race) %>%
  gf_fitdistr(dist = "dnorm")

Diving2017_long %>%
  gf_col(
    fct_reorder(Name, scores) ~ scores,
    fill = ~race,
    alpha = 0.5,
    position = "dodge",
    xlab = "Scores",
    ylab = "Name",
    title = "Diving Scores"
  )

Diving2017_long %>%
  gf_boxplot(
    scores ~ race,
    fill = ~race,
    alpha = 0.5,
    xlab = "Race",
    ylab = "Scores",
    title = "Diving Scores"
  )

We see that:

The data do not look normally distributed. With so few readings (n < 30) this is hard to judge by eye; more readings would have helped. We will verify this aspect formally very shortly.
There is no immediately identifiable trend in score changes from the Semifinal to the Final.
Although the two medians appear to be different, the box plots overlap considerably. So one cannot visually conclude that the two sets of scores have different means.

A. Check for Normality

Let us also complete a check for normality: the shapiro.test function performs the Shapiro-Wilk test, which checks whether a Quant variable is from a normal distribution; the NULL hypothesis is that the data are from a normal distribution.

shapiro.test(Diving2017$Final)
shapiro.test(Diving2017$Semifinal)


    Shapiro-Wilk normality test

data:  Diving2017$Final
W = 0.9184, p-value = 0.273


    Shapiro-Wilk normality test

data:  Diving2017$Semifinal
W = 0.86554, p-value = 0.05738

Hmmm…the Shapiro-Wilk test does not reject normality for either set of scores (!!!), though for Semifinal only marginally so, with a p.value of $0.057$.

Can we check this comparison also with plots? We can plot Q-Q plots for both variables, and also compare both data with normally-distributed data generated with the same means and standard deviations:

# Set graph theme
theme_set(new = theme_custom())
#
set.rseed(1234)
Diving2017 %>%
  mutate(
    Final_norm = rnorm(
      n = 12,
      mean = mean(Final),
      sd = sd(Final)
    ),
    Semifinal_norm = rnorm(
      n = 12,
      mean = mean(Semifinal),
      sd = sd(Semifinal)
    )
  ) %>%
  pivot_longer(
    cols =
      c(Semifinal, Final, Semifinal_norm, Final_norm),
    names_to = "score_type", values_to = "value"
  ) %>%
  gf_boxplot(value ~ score_type,
    fill = ~score_type,
    show.legend = FALSE
  ) %>%
  gf_labs(title = "Comparing Data and Normal Boxplots")
###
Diving2017_long %>%
  gf_qq(~ scores | race, size = 2) %>%
  gf_qqline(ylab = "scores", xlab = "theoretical normal")

While the boxplots are not very evocative, we see in the QQ-plots that the Final scores are closer to the straight line than the Semifinal scores. But it is perhaps still hard to accept the data as normally distributed…hmm.

B. Check for Variances

Let us check if the two variables have similar variances: the var.test does this for us, with a NULL hypothesis that the ratio of the two variances is 1:

var.test(scores ~ race,
  data = Diving2017_long,
  ratio = 1, # What we believe
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()


estimate <dbl>	num.df <int>	den.df <int>	statistic <dbl>	p.value <dbl>	conf.low <dbl>	conf.high <dbl>	method <chr>	alternative <chr>
3.041259	11	11	3.041259	0.0783009	0.8755102	10.56442	F test to compare two variances	two.sided

The variances are not significantly different, as seen by the p.value of $0.08$.

So to summarise our data checks:

Conditions

data are normally distributed
variances are not significantly different

Hypothesis

Based on the graph, how would we formulate our Hypothesis? We wish to infer whether there is any change in performance between the two rounds, for the population of divers who might have participated in them. So accordingly:

$H_{0}: \mu_{semifinal} = \mu_{final}$

$H_{a}: \mu_{semifinal} \neq \mu_{final}$

Observed and Test Statistic

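What would be the test statistic we would use? The difference in means. Is the observed difference in the means between the two groups of scores non-zero? We use the diffmean function. For paired data with one row per diver, the difference of the two group means equals the mean of the pairwise differences. (The argument only.2 = FALSE merely lifts diffmean's restriction to exactly two groups; it does not by itself pair the data.)

obs_diff_swim <- diffmean(scores ~ race,
  data = Diving2017_long,
  only.2 = FALSE
) # groups taken in alphabetical order: mean(Semifinal) - mean(Final)

# Can use this also (same sign as diffmean above)
# formula method is better for permutation test!
# obs_diff_swim <- mean(~ (Semifinal - Final), data = Diving2017)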

obs_diff_swim

diffmean 
 -11.975
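For paired data in wide form, the difference of the two column means is the same number as the mean of the pairwise differences. A quick base R check, assuming only the ten divers displayed in the table above (the full dataset has 12, so the value here will differ from the one reported):

```r
# First ten divers as displayed above (subset of the 12 in Diving2017)
semifinal <- c(325.50, 382.80, 367.50, 346.00, 318.70,
               360.85, 313.70, 355.15, 322.75, 320.55)
final <- c(397.50, 396.00, 391.95, 385.55, 370.20,
           360.00, 357.85, 331.40, 322.40, 307.15)

diff_of_means <- mean(semifinal) - mean(final) # what diffmean computes
mean_of_diffs <- mean(semifinal - final)       # mean of pairwise differences

all.equal(diff_of_means, mean_of_diffs)
```

The two agree because the mean is linear. What changes with pairing is not the point estimate but its standard error.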

Inference

Since the data variables satisfy the assumption of being normally distributed, and the variances are not significantly different, we may attempt the classical t.test with paired data (we will use the mosaic variant). Type help(t.test) in your Console. Our model would be:

$mean(Final_{i} - Semifinal_{i}) = \beta_{0}$

And that: $H_{0}: \mu_{final} - \mu_{semifinal} = 0$; $H_{a}: \mu_{final} - \mu_{semifinal} \neq 0$

mosaic::t.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  paired = TRUE, var.equal = FALSE
) %>% broom::tidy()


estimate <dbl>	statistic <dbl>	p.value <dbl>	parameter <dbl>	conf.low <dbl>	conf.high <dbl>	method <chr>	alternative <chr>
-11.975	-1.190339	0.2589684	11	-34.11726	10.16726	Paired t-test	two.sided

The confidence interval spans the zero value, and the p.value is a high $0.259$, so there is no reason to reject the Null Hypothesis in favour of the alternative hypothesis that the means are different. Hence we say that there is no evidence of a difference between Semifinal and Final scores.
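The entries in the t.test table can be reproduced by hand. Here is a small sketch in base R, assuming only summary values reported in this module: the mean difference of -11.975, n = 12 pairs, and the standard error of 10.06016 that the linear-model output reports for the same differences.

```r
d_bar <- -11.975 # mean of the 12 pairwise differences (Semifinal - Final)
se <- 10.06016   # standard error = sd(differences) / sqrt(n)
n <- 12

# Paired t statistic: mean difference divided by its standard error
t_stat <- d_bar / se

# Two-sided p.value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# 95% confidence interval for the mean difference
ci <- d_bar + c(-1, 1) * qt(0.975, df = n - 1) * se

round(c(statistic = t_stat, p.value = p_value, conf.low = ci[1], conf.high = ci[2]), 4)
```

This is nothing more than a one-sample t-test on the differences, which is why the paired test has 11 degrees of freedom rather than 22.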

Well, we might consider (based on knowledge of the sport) that at least one of the variables does not meet the normality criteria, even though their variances are not significantly different. So we would attempt a non-parametric Wilcoxon test, which uses the signed ranks of the paired data differences instead of the differences themselves. Our model would be:

$mean(sign.rank[Final_{i} - Semifinal_{i}]) = \beta_{0}$

$H_{0}: \mu_{final} - \mu_{semifinal} = 0$; $H_{a}: \mu_{final} - \mu_{semifinal} \neq 0$

wilcox.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  mu = 0, # belief
  alternative = "two.sided", # difference either way
  paired = TRUE,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()


estimate <dbl>	statistic <dbl>	p.value <dbl>	conf.low <dbl>	conf.high <dbl>
-11.9625	27	0.3803711	-35.825	12.3

Here also with the p.value being $0.3804$ , we have no reason to accept the Alternative Hypothesis. The parametric t.test and the non-parametric wilcox.test agree in their inferences.
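To make the signed-rank idea concrete, here is a tiny base R sketch on four made-up differences (the numbers are hypothetical, chosen so the ranks are easy to follow). Each difference is replaced by the rank of its absolute value, with the original sign put back; the statistic V reported by wilcox.test is the sum of the positive ranks.

```r
# Signed rank: rank of |x|, carrying the sign of x
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

d <- c(-3, 1, 2, -5) # hypothetical paired differences
sr <- signed_rank(d) # -3  1  2 -4

# Wilcoxon V statistic: sum of the ranks belonging to positive differences
v_by_hand <- sum(sr[sr > 0]) # 3

v_from_test <- unname(wilcox.test(d, mu = 0)$statistic)
c(by_hand = v_by_hand, from_test = v_from_test)
```

Note that the largest difference, -5, contributes a rank of only 4. This is what makes the test less sensitive to extreme values than the t.test.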

We can apply the linear-model-as-inference interpretation both to the original data and to the sign.rank data:

$lm(y_{i} - x_{i} \sim 1) = \beta_{0}$ and $lm(sign.rank[Final_{i} - Semifinal_{i}] \sim 1) = \beta_{0}$

And the Hypothesis for both interpretations would be:
$H_{0}: \beta_{0} = 0$; $H_{a}: \beta_{0} \neq 0$

lm(Semifinal - Final ~ 1, data = Diving2017) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

# Create a sign-rank function
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

lm(signed_rank(Semifinal - Final) ~ 1,
  data = Diving2017
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)


term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	-11.975	10.06016	-1.190339	0.2589684


term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	-2	2.135558	-0.9365236	0.3691097

We observe that using the linear model method for the original scores and for the sign-rank scores both do not permit us to reject the $H_{0}$ Null Hypothesis, since the p.values are high, and the confidence intervals straddle $0$.

For a permutation test on the specific data at hand, we need to shuffle the records between Semifinal and Final on a per-diver basis (paired data!!) and take the test statistic (the mean of the differences between the two scores for each diver). Another way to look at this is to take the differences between Semifinal and Final scores and randomly flip the sign (polarity) of each difference. We will follow this method in the code below:

polarity <- c(rep(1, 6), rep(-1, 6))
# 12 +/- 1s,
# 6 each to make sure there is equal probability
polarity
##
null_dist_swim <- do(4999) *
  mean(
    data = Diving2017,
    ~ (Final - Semifinal) * # take (pairwise) differences
      mosaic::resample(polarity, # Swap polarity randomly
        replace = TRUE
      )
  )
##
null_dist_swim

 [1]  1  1  1  1  1  1 -1 -1 -1 -1 -1 -1


mean <dbl>
1.866667e+00
-3.733333e+00
-1.234167e+01
-2.058333e+00
8.783333e+00
1.296667e+01
-1.440833e+01
1.409167e+01
-1.263333e+01
-1.440833e+01

Let us plot the NULL distribution and compare it with the actual observed difference in scores:

# Set graph theme
theme_set(new = theme_custom())
#
gf_histogram(data = null_dist_swim, ~mean) %>%
  gf_vline(
    xintercept = obs_diff_swim,
    colour = "red",
    linewidth = 1
  )
###
gf_ecdf(data = null_dist_swim, ~mean, linewidth = 1) %>%
  gf_vline(
    xintercept = obs_diff_swim,
    colour = "red",
    linewidth = 1
  )
###
prop1(~ mean <= obs_diff_swim, data = null_dist_swim)

prop_TRUE 
   0.1272

Hmm…so by generating 4999 shuffles of score-difference polarities, it does appear that we can not only obtain the current observed difference but even surpass it frequently (about 13% of the time). So it does seem that there is no evidence of a difference in means between Semi-Final and Final diving scores.
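With so few pairs, the polarity shuffles need not be sampled at random: every possible sign pattern can be listed. The sketch below is a base R alternative to the mosaic code above, not the method used in this module, and it assumes only the ten divers displayed in the data table (so 2^10 = 1024 patterns, and a p.value that will differ from the one reported for all 12).

```r
# First ten divers as displayed above (subset of the 12 in Diving2017)
semifinal <- c(325.50, 382.80, 367.50, 346.00, 318.70,
               360.85, 313.70, 355.15, 322.75, 320.55)
final <- c(397.50, 396.00, 391.95, 385.55, 370.20,
           360.00, 357.85, 331.40, 322.40, 307.15)
d <- final - semifinal
n <- length(d)

# All 2^n ways of assigning +1 / -1 to the n differences
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))

# Mean difference under each sign pattern: the exact NULL distribution
null_means <- as.vector(signs %*% d) / n

obs <- mean(d)
# Two-sided p.value: share of patterns at least as extreme as the observed one
# (the small tolerance guards against floating-point ties)
p_two_sided <- mean(abs(null_means) >= abs(obs) - 1e-9)
p_two_sided
```

Because each sign is drawn independently with probability one half, the NULL distribution is exactly symmetric about zero. Random resampling, as in the do() loop, approximates this same distribution.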

All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  paired = TRUE
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "t.test")

lm(Semifinal - Final ~ 1, data = Diving2017) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model")

wilcox.test(
  x = Diving2017$Semifinal,
  y = Diving2017$Final,
  paired = TRUE
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Wilcoxon test")

lm(signed_rank(Semifinal - Final) ~ 1,
  data = Diving2017
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(cell_fill(color = "palegreen"), cell_text(weight = "bold")),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model with sign.rank")

t.test
estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-11.975	-1.190339	0.2589684	11	-34.11726	10.16726	Paired t-test	two.sided

Linear Model
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-11.975	10.06016	-1.190339	0.2589684	-34.11726	10.16726

Wilcoxon test
statistic	p.value	method	alternative
27	0.3803711	Wilcoxon signed rank exact test	two.sided

Linear Model with sign.rank
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-2	2.135558	-0.9365236	0.3691097	-6.70033	2.70033

The linear model and the t.test are identical in their results here; the estimates and p.values are the same. The wilcox.test and the linear model with sign-rank data differences agree closely, though not exactly (p.values of $0.380$ and $0.369$). This is of course not surprising!
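The agreement between the paired t.test and the intercept-only linear model can be checked in base R alone. A minimal sketch, again assuming only the ten divers displayed in the data table rather than the full 12:

```r
# First ten divers as displayed above (subset of the 12 in Diving2017)
semifinal <- c(325.50, 382.80, 367.50, 346.00, 318.70,
               360.85, 313.70, 355.15, 322.75, 320.55)
final <- c(397.50, 396.00, 391.95, 385.55, 370.20,
           360.00, 357.85, 331.40, 322.40, 307.15)
d <- semifinal - final

# Paired t-test on the two columns
tt <- t.test(semifinal, final, paired = TRUE)

# Intercept-only linear model on the differences
fit <- summary(lm(d ~ 1))$coefficients

rbind(
  t_test = c(statistic = unname(tt$statistic), p.value = tt$p.value),
  lm = c(fit[1, "t value"], fit[1, "Pr(>|t|)"])
)
```

The intercept of lm(d ~ 1) is the mean of the differences, and its t value tests that mean against zero, which is exactly what the paired t.test does.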

Case Study #2: Walmart vs Target

Is there a difference in the price of Groceries sold by the two retailers Target and Walmart? The data set Groceries contains a sample of grocery items and their prices advertised on their respective web sites on one specific day. We will:

Inspect the data set, then explain why this is an example of matched pairs data.
Compute summary statistics of the prices for each store.
Conduct a permutation test to determine whether or not there is a difference in the mean prices.
Create a ~~histogram~~ bar-chart of the difference in prices. What is unusual about Quaker Oats Life cereal?
Redo the hypothesis test without this observation. Would we reach the same conclusion?

Inspecting and Charting Data

data("Groceries")
Groceries <- Groceries %>%
  mutate(Product = stringr::str_squish(Product)) # Knock off extra spaces
Groceries
Groceries_inspect <- inspect(Groceries)
Groceries_inspect$categorical
Groceries_inspect$quantitative


Product <chr>	Size <fct>	Target <dbl>	Walmart <dbl>
Kellogg NutriGrain Bars	8 bars	2.50	2.78
Quaker Oats Life Cereal Original	18oz	3.19	6.01
General Mills Lucky Charms	11.50z	3.19	2.98
Quaker Oats Old Fashioned	18oz	2.82	2.68
Nabisco Oreo Cookies	14.3oz	2.99	2.98
Nabisco Chips Ahoy	13oz	2.64	1.98
Doritos Nacho Cheese Chips	10oz	3.99	2.50
Cheez-it Original Baked	21oz	4.79	4.79
Swiss Miss Hot Chocolate	10 count	1.49	1.28
Tazo Chai Classic Latte Black Tea	32 oz	3.49	2.98


name <chr>	class <chr>	levels <int>	n <int>	missing <int>	distribution <chr>
Product	character	30	30	0	Annie's Macaroni & Cheese (3.3%) ...
Size	factor	24	30	0	18oz (10%), 12oz (6.7%) ...


	name <chr>	class <chr>	min <dbl>	Q1 <dbl>	median <dbl>	Q3 <dbl>	max <dbl>	mean <dbl>	sd <dbl>
1	Target	numeric	0.99	1.8275	2.545	3.140	7.99	2.762333	1.582128
2	Walmart	numeric	1.00	1.7600	2.340	2.955	6.98	2.705667	1.560211

There are just 30 prices for each vendor…just barely enough to get an idea of what the distribution might be. Let us plot the prices for the products, as box plots after pivoting the data to long form,¹ and as bar charts:

## Set graph theme
theme_set(new = theme_custom())
##
Groceries_long <- Groceries %>%
  pivot_longer(
    cols = c(Walmart, Target),
    names_to = "store",
    values_to = "prices"
  ) %>%
  mutate(store = as_factor(store))

Let us plot histograms/densities of the two variables that we wish to compare. We will also overlay a Gaussian distribution for comparison:

Groceries_long %>%
  gf_dhistogram(~prices,
    fill = ~store,
    alpha = 0.5,
    title = "Grocery Costs"
  ) %>%
  gf_facet_grid(~store) %>%
  gf_fitdistr(dist = "dnorm")

Groceries_long %>%
  gf_density(~prices,
    fill = ~store,
    alpha = 0.5,
    title = "Grocery Costs"
  ) %>%
  gf_facet_grid(~store) %>%
  gf_fitdistr(dist = "dnorm")

Not close to the Gaussian…there is clearly some skew to the right, with some items being very costly compared to the rest. More when we check the assumptions on data for the tests.

How about price differences, what we are interested in?

## Set graph theme
theme_set(new = theme_custom())
##
Groceries_long %>%
  gf_boxplot(prices ~ store,
    fill = ~store
  )

## Set graph theme
theme_set(new = theme_custom())
##
Groceries_long %>%
  gf_col(fct_reorder(Product, prices) ~ prices,
    fill = ~store,
    alpha = 0.5,
    position = "dodge",
    xlab = "Prices",
    ylab = "",
    title = "Groceries Costs"
  ) %>%
  gf_col(
    data =
      Groceries_long %>%
        filter(
          Product == "Quaker Oats Life Cereal Original"
        ),
    fct_reorder(Product, prices) ~ prices,
    fill = ~store,
    position = "dodge"
  )

Note

We see that the difference between Walmart and Target prices is highest for the Product named Quaker Oats Life Cereal Original. Apart from this Product, the rest have no discernible trend either way. Let us check the observed statistic (the mean difference in prices):

obs_diff_price <- diffmean(prices ~ store,
  data = Groceries_long,
  only.2 = FALSE
)
# Can also use (same sign as diffmean above: Target - Walmart)
# obs_diff_price <- mean(~ (Target - Walmart), data = Groceries)
obs_diff_price

  diffmean 
0.05666667

Hypothesis

Based on the graph, how would we formulate our Hypothesis? We wish to infer whether there is any change in prices, per product between the two Store chains. So accordingly:

$H_{0}: \mu_{Walmart} = \mu_{Target}$

$H_{a}: \mu_{Walmart} \neq \mu_{Target}$

Testing for Assumptions on the Data

There are a few checks we need to make of our data, to decide what test procedure to use.

A. Check for Normality

shapiro.test(Groceries$Walmart)
shapiro.test(Groceries$Target)


    Shapiro-Wilk normality test

data:  Groceries$Walmart
W = 0.78662, p-value = 3.774e-05


    Shapiro-Wilk normality test

data:  Groceries$Target
W = 0.79722, p-value = 5.836e-05

For both tests, we see that the p.value is very small, indicating that the data are unlikely to be normally distributed. This means we cannot apply a standard paired t.test and need to use the non-parametric wilcox.test, that does not rely on the assumption of normality.

B. Check for Variances

Let us check if the two variables have similar variances:

var.test(Groceries$Walmart, Groceries$Target)


    F test to compare two variances

data:  Groceries$Walmart and Groceries$Target
F = 0.97249, num df = 29, denom df = 29, p-value = 0.9406
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4628695 2.0431908
sample estimates:
ratio of variances 
         0.9724868

It appears from the p.value of $0.94$ and the 95% Confidence Interval of $[0.4629, 2.0432]$, which includes $1$, that we cannot reject the NULL Hypothesis that the two variances are equal.
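The F test output can be rebuilt from the summary table alone. A short base R sketch, assuming only the two standard deviations and n = 30 reported by inspect() above (so the results match the var.test output up to rounding):

```r
sd_walmart <- 1.560211
sd_target <- 1.582128
n <- 30 # items priced at each store

# F statistic: ratio of the two sample variances
f_ratio <- sd_walmart^2 / sd_target^2

# Two-sided p.value from the F distribution with (n - 1, n - 1) df
lower_tail <- pf(f_ratio, df1 = n - 1, df2 = n - 1)
p_value <- 2 * min(lower_tail, 1 - lower_tail)

# 95% confidence interval for the ratio of variances
ci <- f_ratio / qf(c(0.975, 0.025), df1 = n - 1, df2 = n - 1)

round(c(F = f_ratio, p.value = p_value, conf.low = ci[1], conf.high = ci[2]), 4)
```

A ratio close to 1, with an interval comfortably containing 1, is what "similar variances" looks like in this test.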

Inference

Well, the variables are not normally distributed, so a standard t.test is not advised, even if the variances are similar. We can still try:

mosaic::t_test(Groceries$Walmart, Groceries$Target, paired = TRUE) %>%
  broom::tidy()


estimate <dbl>	statistic <dbl>	p.value <dbl>	parameter <dbl>	conf.low <dbl>	conf.high <dbl>	method <chr>	alternative <chr>
-0.05666667	-0.4704556	0.6415488	29	-0.3030159	0.1896825	Paired t-test	two.sided

The p.value is $0.64$ ! And the Confidence Interval straddles $0$ . So the t.test gives us no reason to reject the Null Hypothesis that the means are similar. But can we really believe this, given the non-normality of data?

However, we have seen that the data variables are not normally distributed. So a Wilcoxon Test, using signed-ranks, is indicated: (recall the model!)

# For stability reasons, it may be advisable to use rounded data or to set digits.rank = 7, say,
# such that determination of ties does not depend on very small numeric differences (see the example).

wilcox.test(Groceries$Walmart, Groceries$Target,
  digits.rank = 7, paired = TRUE,
  conf.int = TRUE, conf.level = 0.95
) %>%
  broom::tidy()


estimate <dbl>	statistic <dbl>	p.value <dbl>	conf.low <dbl>	conf.high <dbl>
-0.104966	95	0.01431746	-0.1750051	-0.03005987

The Wilcoxon test result is very interesting: the p.value says there is a significant difference between the two store prices, and the confidence.interval also is unipolar, lying entirely below $0$…

As before we can do the linear model for both the original data and the sign.rank data. The test statistic is again the difference between the two variables:

lm(Target - Walmart ~ 1, data = Groceries) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)

# Create a sign-rank function
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

lm(signed_rank(Target - Walmart) ~ 1,
  data = Groceries
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95)


term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	0.05666667	0.1204506	0.4704556	0.6415488


term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	8.533333	2.888834	2.953902	0.006167464

Very interesting results, but confirming what we saw earlier: the Linear Model with the original data reports no significant difference, but the linear model with sign-ranks suggests there is a significant difference in prices between stores!

Let us perform the pair-wise permutation test on prices, by shuffling the two store names:

# Set graph theme
theme_set(new = theme_custom())
#

polarity <- c(rep(1, 15), rep(-1, 15))
##
null_dist_price <- do(9999) *
  mean(
    data = Groceries,
    ~ (Target - Walmart) *
      resample(polarity, replace = TRUE)
  )
null_dist_price


mean <dbl>
-1.413333e-01
-2.253333e-01
-1.193333e-01
-1.160000e-01
8.733333e-02
-1.406667e-01
-1.200000e-02
-4.800000e-02
8.133333e-02
1.433333e-01

##
gf_histogram(data = null_dist_price, ~mean) %>%
  gf_vline(xintercept = obs_diff_price, colour = "red")

# proportion of shuffled means at least as large as the observed difference
prop1(~ mean >= obs_diff_price, data = null_dist_price)

The observed difference lies well within the NULL distribution, so there does not seem to be any significant difference in mean prices…
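The same polarity-flipping test can be written in base R, with a two-sided p.value. This is a sketch rather than the module's own code: it assumes only the ten grocery items displayed in the data table (the full dataset has 30), the seed is arbitrary, and the add-one correction mirrors what mosaic's prop1 does.

```r
set.seed(2024) # arbitrary seed, for reproducibility only

# First ten items as displayed above (subset of the 30 in Groceries)
target <- c(2.50, 3.19, 3.19, 2.82, 2.99, 2.64, 3.99, 4.79, 1.49, 3.49)
walmart <- c(2.78, 6.01, 2.98, 2.68, 2.98, 1.98, 2.50, 4.79, 1.28, 2.98)
d <- target - walmart
obs <- mean(d)

# Flip the sign of each difference at random, many times over
n_shuffles <- 9999
null_means <- replicate(
  n_shuffles,
  mean(d * sample(c(-1, 1), size = length(d), replace = TRUE))
)

# Two-sided p.value, counting the observed value as one of the shuffles
p_two_sided <- (sum(abs(null_means) >= abs(obs)) + 1) / (n_shuffles + 1)
p_two_sided
```

Sampling from c(-1, 1) with an explicit size gives each difference an equal chance of either polarity, whatever the number of pairs.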

All Tests Together

We can put all the test results together to get a few more insights about the tests:

mosaic::t_test(Groceries$Walmart, Groceries$Target, paired = TRUE) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "cyan"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "t.test")
###
lm(Target - Walmart ~ 1, data = Groceries) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "cyan"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model")
###
wilcox.test(Groceries$Walmart, Groceries$Target,
  digits.rank = 7,
  paired = TRUE,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy() %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "palegreen"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Wilcoxon Test")
###
lm(signed_rank(Target - Walmart) ~ 1,
  data = Groceries
) %>%
  broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
  gt() %>%
  tab_style(
    style = list(
      cell_fill(color = "palegreen"),
      cell_text(weight = "bold")
    ),
    locations = cells_body(columns = p.value)
  ) %>%
  tab_header(title = "Linear Model with Sign.Ranks")

t.test
estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-0.05666667	-0.4704556	0.6415488	29	-0.3030159	0.1896825	Paired t-test	two.sided

Linear Model
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.05666667	0.1204506	0.4704556	0.6415488	-0.1896825	0.3030159

Wilcoxon Test
estimate	statistic	p.value	conf.low	conf.high	method	alternative
-0.104966	95	0.01431746	-0.1750051	-0.03005987	Wilcoxon signed rank test with continuity correction	two.sided

Linear Model with Sign.Ranks
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	8.533333	2.888834	2.953902	0.006167464	2.625004	14.44166

Clearly, the parametric tests do not detect a significant difference in prices, whereas the non-parametric tests do.

Suppose we knock off the Quaker Cereal data item…(note the spaces in the product name)

## Set graph theme
theme_set(new = theme_custom())
##
set.seed(12345)
Groceries_less <- Groceries %>%
  filter(Product != "Quaker Oats Life Cereal Original")
##
Groceries_less_long <- Groceries_less %>%
  pivot_longer(
    cols = c(Target, Walmart),
    names_to = "store",
    values_to = "prices"
  )
##
wilcox.test(Groceries_less$Walmart,
  Groceries_less$Target,
  paired = TRUE, digits.rank = 7,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  broom::tidy()
##
obs_diff_price_less <-
  mean(~ (Target - Walmart), data = Groceries_less)
obs_diff_price_less
polarity_less <- c(rep(1, 15), rep(-1, 14))
# Due to resampling this small bias makes no difference
null_dist_price_less <-
  do(9999) * mean(
    data = Groceries_less,
    ~ (Target - Walmart) * resample(polarity_less, replace = TRUE)
  )
##
gf_histogram(data = null_dist_price_less, ~mean) %>%
  gf_vline(
    xintercept = obs_diff_price_less,
    colour = "red"
  )
##
mean(null_dist_price_less >= obs_diff_price_less)


estimate <dbl>	statistic <dbl>	p.value <dbl>	conf.low <dbl>	conf.high <dbl>
-0.1100204	67	0.003492194	-0.1899856	-0.04503131

[1] 0.1558621

[1] 0.01370137

We see that removing the Quaker Oats product item from the data does give a significant difference in mean prices !!! That one price difference was in the opposite direction compared to the general trend in differences, so when it was removed, we obtained a truer picture of price differences.
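Why can one item sway the mean-based tests so much while the rank-based ones barely notice? A toy illustration in base R with made-up price differences (hypothetical numbers, not the Groceries data): eight small negative differences and one large positive one.

```r
# Signed rank: rank of |x|, carrying the sign of x
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

# Hypothetical price differences: eight small negatives, one large positive
d <- c(-0.10, -0.20, -0.15, -0.30, -0.25, -0.05, -0.12, -0.22, 2.80)

mean(d)             # pulled above zero by the single large value
sum(signed_rank(d)) # -27: the large value counts only as rank 9
mean(d[-9])         # without it, the mean is clearly negative
```

The mean weighs each difference by its size, so a single extreme value can flip its sign. The signed ranks cap that influence at the largest rank, which is why the Wilcoxon test saw the general trend even before the item was removed.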

Try to do a regular parametric t.test with this reduced data!

Wait, But Why?

Conclusion

We have learnt how to perform inference for paired means. We have looked at the conditions that make the regular t.test possible, and learnt what to do if the conditions of normality and equal variance are not met. We have also looked at how these tests can be understood as manifestations of the linear model, with data and sign-ranked data. It should also be fairly clear now that we can test for the equivalence of two paired means using a very simple permutation test. Given computing power, we can always mechanize this test very quickly to get our results. Performing this test also yields reliable results without having to rely on any assumption relating to underlying distributions.

Your Turn

Try the datasets in the openintro package. Use data(package = "openintro") in your Console to list the datasets in the package. Then simply type the name of the dataset in a Quarto chunk (e.g. babynames) to read it.
Same with the resampledata and resampledata3 packages.
Try the datasets in the PairedData package. Yes, install this too, peasants.

References

Paired Independence test with the package infer: https://infer.netlify.app/articles/paired
Randall Pruim, Nicholas J. Horton, Daniel T. Kaplan, Start Teaching with R
https://bcs.wiley.com/he-bcs/Books?action=index&itemId=111941654X&bcsId=11307
https://statsandr.com/blog/wilcoxon-test-in-r-how-to-compare-2-groups-under-the-non-normality-assumption/

R Package Citations

Package	Version	Citation
gt	0.11.1	@gt
infer	1.0.7	@infer
MKinfer	1.2	@MKinfer
openintro	2.5.0	@openintro

Footnotes

¹ https://raw.githubusercontent.com/gadenbuie/tidyexplain/main/images/tidyr-pivoting.gif