knitr::opts_chunk$set(echo = FALSE, message = TRUE, warning = TRUE, fig.align = "center")
options(scipen=5)
library(tidyverse)
library(mosaic)
library(broom)
library(MKinfer) # Confidence Interval Computation
library(resampledata) ### Datasets from Chihara and Hesterberg's book
library(gt) # for tables
Inference for Comparing Two Paired Means
Setting up R Packages
What is Paired Data?
Sometimes data are collected twice on the same set of individuals or entities, e.g. scores by the same athletes in two separate tournaments, sales of identical items in two separate locations of a chain store, or the number of customers in the morning and in the evening at a set of store locations. In such cases we treat the two sets of observations as paired, since they correspond to the same set of observable entities. This is how some experiments give us paired data.
We would naturally be interested in the differences in means across these two sets. In this module, we will examine tests for this purpose.
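To make the idea concrete, here is a minimal sketch with made-up numbers (the store names and counts below are hypothetical, purely for illustration): with paired data, the quantity of interest is the per-entity difference, not the two raw columns.

```r
# Hypothetical paired data: morning vs evening customer counts
# at the same five store locations
stores <- data.frame(
  store   = c("A", "B", "C", "D", "E"),
  morning = c(120, 85, 143, 97, 110),
  evening = c(135, 82, 160, 104, 118)
)
# The paired analysis works on the per-store differences
stores$diff <- stores$evening - stores$morning
mean(stores$diff) # mean of the paired differences
```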
Introduction to Inference for Paired Data
flowchart TD
  A[Inference for Paired Means] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
  B --> C{OK?}
  C -->|Yes, both\n Parametric| D[t.test with paired = TRUE]
  D <--> F[Linear Model\n Method]
  C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction, paired = TRUE]
  W <--> F
  C -->|No\n Non-Parametric| E[wilcox.test with paired = TRUE]
  E <--> G[Linear Model\n with\n Signed-Ranks]
  C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
  P <--> Q[Linear Model\n with Signed-Rank\n with Permutation]
We will now use a couple of case studies to traverse all the possible pathways in the workflow above.
Case Study #1: Results from a Diving Championship
Here we have diving scores from the Semifinal and Final rounds of a championship:
Inspecting and Charting Data
data("Diving2017")
Diving2017
Diving2017_inspect <- inspect(Diving2017)
Diving2017_inspect$categorical
Diving2017_inspect$quantitative
The data consist of paired observations per diver, one score for the Semifinal and one for the Final round. There are 12 divers and therefore 12 paired records. How can we quickly visualize this data?
First, histograms and densities of the two variables at hand:
# Set graph theme
theme_set(new = theme_custom())
#
Diving2017_long <- Diving2017 %>%
pivot_longer(
cols = c(Final, Semifinal),
names_to = "race",
values_to = "scores"
)
Diving2017_long %>%
gf_density( ~ scores,
fill = ~ race,
alpha = 0.5,
title = "Diving Scores") %>%
gf_facet_grid( ~ race) %>%
gf_fitdistr(dist = "dnorm")
###
Diving2017_long %>%
gf_col(
fct_reorder(Name, scores) ~ scores,
fill = ~ race,
alpha = 0.5,
position = "dodge",
xlab = "Scores",
ylab = "Name",
title = "Diving Scores"
)
###
Diving2017_long %>%
gf_boxplot(
scores ~ race,
fill = ~ race,
alpha = 0.5,
xlab = "Race",
ylab = "Scores",
title = "Diving Scores"
)
We see that:
- The data do not look convincingly normally distributed. With so few readings (n = 12) it is hard to tell; more readings would have helped. We will verify this aspect formally very shortly.
- There is no immediately identifiable trend in score changes from one race to the other.
A. Check for Normality
Let us also complete a check for normality: the Shapiro-Wilk test (shapiro.test) checks whether a Quant variable comes from a normal distribution; the NULL hypothesis is that the data are from a normal distribution.
shapiro.test(Diving2017$Final)
shapiro.test(Diving2017$Semifinal)
Shapiro-Wilk normality test
data: Diving2017$Final
W = 0.9184, p-value = 0.273
Shapiro-Wilk normality test
data: Diving2017$Semifinal
W = 0.86554, p-value = 0.05738
Hmmm…the Shapiro-Wilk test does not reject normality for either set of scores, though Semifinal (p = 0.057) passes only marginally.
Can we check this with plots? We can plot Q-Q plots for both variables, and also compare both data with normally-distributed data generated with the same means and standard deviations:
# Set graph theme
theme_set(new = theme_custom())
#
set.rseed(1234)
Diving2017 %>%
mutate(Final_norm = rnorm(n = 12,
mean = mean(Final),
sd = sd(Final)),
Semifinal_norm = rnorm(n = 12,
mean = mean(Semifinal),
sd = sd(Semifinal))) %>%
pivot_longer(cols =
c(Semifinal, Final, Semifinal_norm,Final_norm),
names_to = "score_type", values_to = "value") %>%
gf_boxplot(value ~ score_type, fill = ~ score_type,
show.legend = FALSE)
###
Diving2017_long %>%
gf_qq(~ scores | race) %>%
gf_qqline(ylab = "scores")
While the boxplots are not very evocative, we see in the QQ-plots that the Final
scores are closer to the straight line than the Semifinal
scores. But it is perhaps still hard to accept the data as fully normally distributed. Hmm.
B. Check for Variances
Let us check whether the two variables have similar variances: var.test does this for us, with the NULL hypothesis that the ratio of the two variances is 1, i.e. that the variances are equal:
var.test(scores ~ race, data = Diving2017_long,
conf.int = TRUE,conf.level = 0.95) %>%
broom::tidy()
qf(0.975,11,11)
[1] 3.473699
The variances are not significantly different, as seen by the \(p.value = 0.08\).
So to summarise our data checks:
- the data are consistent with normality (per the Shapiro-Wilk test)
- the variances are not significantly different
Hypothesis
Based on the graphs, how would we formulate our Hypothesis? We wish to infer whether there is any change in performance, per diver, between the two rounds. So accordingly:
\[
H_0: \mu_{semifinal} = \mu_{final}
\]
\[
H_a: \mu_{semifinal} \ne \mu_{final}
\]
Observed and Test Statistic
What would be the test statistic we would use? The difference in means. Is the observed difference in means between the two groups of scores non-zero? We use the mosaic diffmean function, with the argument only.2 = FALSE; the formula interface will also be convenient for the permutation test later:
obs_diff_swim <- diffmean(scores ~ race, data = Diving2017_long,
only.2 = FALSE)
# Can use this also
# formula method is better for permutation test!
# obs_diff_swim <- mean(~ (Final - Semifinal), data = Diving2017)
obs_diff_swim
diffmean
-11.975
Inference
Since the data variables satisfy the assumption of normality, and the variances are not significantly different, we may attempt the classical t.test with paired data (we will use the mosaic variant; type help(t.test) in your Console). Our model would be:
\[
mean(Final(i) - Semifinal(i)) = \beta_0
\]
And that
\[
H_0: \beta_0 = 0; \quad H_a: \beta_0 \ne 0
\]
mosaic::t.test(x = Diving2017$Semifinal,
               y = Diving2017$Final,
               paired = TRUE) %>% broom::tidy()
# var.equal is not needed here: a paired t-test works on the
# single sample of differences, so it has no two-variance assumption
The confidence interval spans zero, and the p.value is a high \(0.259\), so we cannot reject the NULL hypothesis that the means are equal. Hence we say that there is no evidence of a difference between Semifinal and Final scores.
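Note that a paired t-test is mathematically a one-sample t-test on the differences. A quick sketch with toy vectors (made-up numbers, not the Diving2017 data) confirms this equivalence:

```r
# Toy paired samples (hypothetical numbers, for illustration only)
x <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7)
y <- c(5.4, 4.6, 6.3, 5.9, 5.0, 6.1)

p_paired <- t.test(x, y, paired = TRUE)$p.value
p_diff   <- t.test(x - y, mu = 0)$p.value # one-sample t on the differences

all.equal(p_paired, p_diff) # the two p-values are identical
```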
Well, we might consider (based on the Q-Q plots) that at least one of the variables does not meet the normality criterion, even though the variances are not significantly different. In that case we would attempt the non-parametric Wilcoxon test, which uses the signed ranks of the paired differences instead of the data values themselves. Our model would be:
\[
mean(\ sign.rank[\ Final(i) - Semifinal(i)\ ]\ ) = \beta_0
\]
\[
H_0: \beta_0 = 0; \quad H_a: \beta_0 \ne 0
\]
wilcox.test(x = Diving2017$Semifinal,
y = Diving2017$Final,
paired = TRUE,conf.int = TRUE,conf.level = 0.95) %>% broom::tidy()
Here too, with the p.value being \(0.3804\), we cannot reject the NULL Hypothesis. The parametric t.test and the non-parametric wilcox.test agree in their inferences.
We can apply the linear-model-as-inference interpretation both to the original data and to the sign.rank data:
\[
lm(y_i - x_i \sim 1) = \beta_0
\]
and
\[
lm(\ sign.rank[\ Final(i) - Semifinal(i)\ ] \sim 1) = \beta_0
\]
And the Hypothesis for both interpretations would be:
\[
H_0: \beta_0 = 0; \quad H_a: \beta_0 \ne 0
\]
lm(Semifinal - Final ~ 1, data = Diving2017) %>%
broom::tidy(conf.int = TRUE, conf.level = 0.95)
# Create a sign-rank function
signed_rank <- function(x) {sign(x) * rank(abs(x))}
lm(signed_rank(Semifinal - Final) ~ 1,
data = Diving2017) %>%
broom::tidy(conf.int = TRUE,conf.level = 0.95)
We observe that the linear model method, for both the original scores and the sign-rank scores, does not permit us to reject the NULL Hypothesis \(H_0\), since the p.values are high and the confidence intervals straddle \(0\).
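As a quick sanity check on the signed_rank() helper defined above, here is what it does to a toy vector (the numbers are ours, for illustration only): it ranks the magnitudes, then restores the signs.

```r
# sign(x) * rank(abs(x)): rank the absolute values, reattach the signs
signed_rank <- function(x) { sign(x) * rank(abs(x)) }

signed_rank(c(-3, 1, 5, -2))
# abs values 3, 1, 5, 2 get ranks 3, 1, 4, 2; the signs give -3, 1, 4, -2
```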
We saw from the diagram created by Allen Downey that there is only one test! We will now use this philosophy to develop a technique that lets us mechanize several Statistical Models with nearly identical code: the permutation test. For the specific data at hand, we need to shuffle the records between Semifinal and Final on a per-diver basis and compute the test statistic (the difference between the two scores for each diver) each time. Equivalently, we can take the differences between Semifinal and Final scores and randomly flip each difference to either polarity. We will follow this method in the code below:
polarity <- c(rep(1, 6), rep(-1, 6))
# 12 +/- 1s,
# 6 each to make sure there is equal probability
polarity
null_dist_swim <- do(9999) *
mean(data = Diving2017,
~ (Final - Semifinal) * mosaic::resample(polarity, replace = TRUE))
null_dist_swim
[1] 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1
Let us plot the NULL distribution and compare it with the actual observed difference in the scores:
# Set graph theme
theme_set(new = theme_custom())
#
gf_histogram(data = null_dist_swim, ~ mean) %>%
gf_vline(xintercept = obs_diff_swim, colour = "red")
###
gf_ecdf(data = null_dist_swim, ~ mean) %>%
gf_vline(xintercept = obs_diff_swim, colour = "red")
###
prop1(~ mean <= obs_diff_swim, data = null_dist_swim)
Hmm…by generating 9999 random flips of the score-difference polarities, we see that the NULL distribution produces differences as extreme as, or more extreme than, the observed one quite frequently. So it does seem that there is no difference in mean scores between the Semifinal and Final diving rounds.
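Note that prop1() above reports a one-tailed proportion, while our alternative hypothesis is two-sided. A common convention is to double the smaller tail. A self-contained sketch, using a simulated stand-in for the permutation NULL distribution (the numbers below are toy values, not the diving data):

```r
set.seed(42)
null_stats <- rnorm(9999)            # stand-in for a permutation NULL distribution
obs <- -1.19                         # stand-in for an observed test statistic

p_lower <- mean(null_stats <= obs)   # one-tailed proportion, as prop1() computes
p_two   <- 2 * min(p_lower, mean(null_stats >= obs))
p_two                                # two-sided permutation p-value
```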
All Tests Together
We can put all the test results together to get a few more insights about the tests:
mosaic::t.test(x = Diving2017$Semifinal,
y = Diving2017$Final,
paired = TRUE) %>% broom::tidy() %>%
gt() %>%
tab_style(
style = list(cell_fill(color = "cyan"), cell_text(weight = "bold")),
locations = cells_body(columns = p.value)) %>%
tab_header(title = "t.test")
lm(Semifinal - Final ~ 1, data = Diving2017) %>%
broom::tidy(conf.int = TRUE, conf.level = 0.95) %>%
gt() %>%
tab_style(
style = list(cell_fill(color = "cyan"),cell_text(weight = "bold")),
locations = cells_body(columns = p.value)) %>%
tab_header(title = "Linear Model")
wilcox.test(x = Diving2017$Semifinal,
y = Diving2017$Final,
paired = TRUE) %>% broom::tidy() %>%
gt() %>%
tab_style(
style = list(cell_fill(color = "palegreen"),cell_text(weight = "bold")),
locations = cells_body(columns = p.value)) %>%
tab_header(title = "Wilcoxon test")
lm(signed_rank(Semifinal - Final) ~ 1,
data = Diving2017) %>%
broom::tidy(conf.int = TRUE,conf.level = 0.95) %>%
gt() %>%
tab_style(
style = list(cell_fill(color = "palegreen"),cell_text(weight = "bold")),
locations = cells_body(columns = p.value)) %>%
tab_header(title = "Linear Model with sign.rank")
t.test

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -11.975 | -1.190339 | 0.2589684 | 11 | -34.11726 | 10.16726 | Paired t-test | two.sided |

Linear Model

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -11.975 | 10.06016 | -1.190339 | 0.2589684 | -34.11726 | 10.16726 |

Wilcoxon test

| statistic | p.value | method | alternative |
|---|---|---|---|
| 27 | 0.3803711 | Wilcoxon signed rank exact test | two.sided |

Linear Model with sign.rank

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -2 | 2.135558 | -0.9365236 | 0.3691097 | -6.70033 | 2.70033 |