knitr::opts_chunk$set(echo = TRUE,message = TRUE,warning = TRUE, fig.align = "center")
options(digits=2)
library(tidyverse)
library(mosaic)
library(infer)
### Dataset from Chihara and Hesterberg's book (Second Edition)
library(resampledata)
library(openintro) # datasets
library(explore) # New, Easy package for Stats Test and Viz, and other things
Inference for a Single Mean
"The more I love humanity in general, the less I love man in particular." – Fyodor Dostoyevsky, The Brothers Karamazov
Setting up R packages
Introduction
We saw from the diagram created by Allen Downey that there is only one test1! We will now use this philosophy to develop a technique that allows us to mechanize several statistical models in that way, with nearly identical code.
We will use the mosaic package in R to develop our intuition for what are called bootstrap and randomization-based statistical tests. (There is also a more recent package called infer in R which can do pretty much all of this, including visualization. In my opinion, its code is a little too high-level and does not offer quite the detailed insight that the mosaic package does.)
Case Study #1: Toy data
First we will use a toy dataset with three "imaginary" samples, \(x\), \(y\), and \(y2\). Each is normally distributed and made up of 50 observations.
We start by creating a function that allows us to produce samples of a given size (N) with a specified mean (mu) and standard deviation (sd).
We create three variables: x (explanatory) and y, y2 (dependent).
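The rnorm_fixed() helper used below is not part of base R and its definition does not appear in this extract; here is a minimal sketch of one possible definition, which rescales rnorm() output so that the sample mean and sd come out exactly as requested:

```r
# Hypothetical helper: N normal samples with *exact* sample mean and sd
rnorm_fixed <- function(N, mu = 0, sd = 1) {
  x <- rnorm(N)                       # draw N standard-normal values
  x <- (x - mean(x)) / stats::sd(x)   # standardize: sample mean 0, sd 1
  x * sd + mu                         # scale and shift to mu and sd
}
```
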
set.seed(40) # for replication
# Data as vectors ( for t.tests etc)
x <- rnorm_fixed(50, mu = 0.0, sd = 1) #explanatory
y <- rnorm_fixed(50, mu = 0.3, sd = 2) # dependent #1
y2 <- rnorm_fixed(50, mu = 0.5, sd = 1.5) # dependent #2
# Make a tibble with all variables
mydata_wide <- tibble(x = x, y = y, y2 = y2)
# Long form data
mydata_long <-
mydata_wide %>%
pivot_longer(., cols = c(x,y,y2),
names_to = "group",
values_to = "value")
# Long form data with only dependent variables
mydata_long_y <-
mydata_wide %>%
select(-x) %>%
pivot_longer(., cols = c(y,y2),
names_to = "group",
values_to = "value")
mydata_wide
mydata_long
mydata_long_y
"Signed-Rank" Values
Most statistical tests use the actual values of the data variables. However, some non-parametric statistical tests use the data in rank-transformed order. In some cases the signed rank of the data values is used instead of the data itself.
The signed rank is calculated as follows:
1. Take the absolute value of each observation in the sample.
2. Rank these absolute values in increasing order of magnitude; the smallest gets rank = 1, and so on.
3. Give each rank the sign of the original observation (+ or -).
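The signed_rank() function used later in this module is not built into R; following the three steps above, a one-line sketch:

```r
# Hypothetical signed_rank(): rank of |x|, carrying the sign of x
signed_rank <- function(x) sign(x) * rank(abs(x))
```
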
Introduction to Inference for a Single Mean
A series of tests deal with the mean value of a sample. The idea is to evaluate whether that mean is representative of the mean of the underlying population. Depending upon the nature of the (single) variable, the tests that can be used are as follows:
flowchart TD
  A[Inference for Single Mean] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test]
  B --> C{OK?}
  C -->|Yes: Parametric| D[t.test]
  D <--> F[Linear Model with Data]
  C -->|No: Non-Parametric| E[wilcox.test]
  E <--> G[Linear Model with Signed-Ranks of Data]
  C -->|No: Non-Parametric| P[Bootstrap]
  P <--> Q[Linear Model with Signed-Rank with Bootstrap]
Inspecting and Charting Data
# Set graph theme
theme_set(new = theme_custom())
#
mydata_long %>%
gf_density(~ value, group = ~ group, fill = ~ group) %>%
gf_fitdistr(dist = "dnorm") %>%
gf_facet_wrap(vars(group)) %>%
gf_labs(title = "Densities of Original Data Variables",
subtitle ="Compared with Normal Density")
- All variables appear to be centred close to zero
- \(x\) seems to have a lower \(sd\) than the other two
Testing Assumptions in the Data
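Per the flowchart, a normality check with the Shapiro-Wilk test might look like this (a sketch; mydata_wide is the tibble created above):

```r
library(tidyverse)
# Shapiro-Wilk p-value for each toy variable; large p-values are
# consistent with normality (as expected, since the data came from rnorm)
mydata_wide %>%
  summarise(across(everything(), ~ shapiro.test(.x)$p.value))
```
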
Inference
A. Model
A single number predicts \(y\):

\[
y = \beta_0 + \beta_1 x
\]

and further

\[
y = \beta_0
\]

since the second term vanishes: "there is no x", i.e. all the x-values are set to zero in the linear model for a single variable! The NULL hypothesis therefore is:
\[ \ H_0: \beta_0 = 0 \]
Note that if we want the NULL hypothesis to be that the mean is some value \(\mu\) other than zero, i.e.
\[
H_0:\ \beta_0 = \mu \ne 0
\]
we can use the mu = some_number argument of t.test() (and wilcox.test()) in the code; lm() has no such argument, but the same effect is obtained by subtracting \(\mu\) from the data first.
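For instance, a sketch of a test against an arbitrary hypothesized mean of 0.5 (the value is illustrative, not from the analysis above):

```r
# One-sample t-test against a non-zero hypothesized mean (0.5 is arbitrary)
t.test(y, mu = 0.5)
# Equivalent lm formulation: subtract the hypothesized mean first
lm((y - 0.5) ~ 1)
```
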
B. Code
If we compare the t.test with the appropriate lm model:
# A tibble: 1 × 8
  estimate statistic p.value parameter conf.low conf.high method            alternative
     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <chr>             <chr>
1      0.3      1.06   0.294        49   -0.268     0.868 One Sample t-test two.sided
So even though y has a sample mean of 0.3, the confidence interval straddles zero, and hence we cannot reject the NULL hypothesis that the true population, of which y is a sample, has mean = 0.
Since we are dealing with the mean, the sign of the rank becomes important, and so we use the signed rank in the non-parametric single-mean test.
A. Model
\[ signed\_rank(y) = \beta_0 \]
\[ H_0: \beta_0 = 0 \]
B. Code
# Standard Wilcoxon Signed_Rank Test
w1 <- wilcox.test(y)
w1
Wilcoxon signed rank test with continuity correction
data: y
V = 754, p-value = 0.2628
alternative hypothesis: true location is not equal to 0
# Wilcoxon test with lm
w2 <- lm(signed_rank(y) ~ 1 , data = mydata_wide)
w2
Call:
lm(formula = signed_rank(y) ~ 1, data = mydata_wide)
Coefficients:
(Intercept)
4.66
# t-test with signed_rank data
w3 <- t.test(signed_rank(y))
w3
One Sample t-test
data: signed_rank(y)
t = 1.1277, df = 49, p-value = 0.265
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-3.644491 12.964491
sample estimates:
mean of x
4.66
Call:
lm(formula = y ~ 1, data = mydata_wide)
Residuals:
Min 1Q Median 3Q Max
-4.5554 -1.4845 -0.0392 1.5559 4.5119
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3000 0.2828 1.061 0.294
Residual standard error: 2 on 49 degrees of freedom
2.5 % 97.5 %
(Intercept) -0.2683937 0.8683937
The confidence intervals for both the t.test and the lm model are identical.
t-test confidence intervals: -0.2683937, 0.8683937
linear model confidence intervals: -0.2683937, 0.8683937
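The intervals can also be extracted directly (a sketch, using the mydata_wide tibble from above):

```r
# CI from the intercept-only linear model and from the t-test, side by side;
# both use the t-distribution with n - 1 degrees of freedom
confint(lm(y ~ 1, data = mydata_wide))
t.test(y)$conf.int
```
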
TBW
We can plot the y data, both original and signed-ranked, to see where the mean lies in each case. The approximation to the true \(\beta_0\) is good when the number of observations \(n \ge 50\); Lindeløv has a simulation on this. We can also plot the model using lm for both the original data and the signed-rank data.
# t-test results as a tidy tibble, used for plotting below
t1 <- broom::tidy(t.test(y))
t1
# Set graph theme
theme_set(new = theme_custom())
#
p1 <- ggplot(mydata_wide, aes( x = 0, y = y)) +
geom_point(alpha = 0.4) +
geom_segment(data = t1, aes(y = estimate,
yend = estimate,
x = -0.2, xend = 0.2)) +
labs(title = "Student's\n t-Test")
# t-test using linear model
p2 <- ggplot(mydata_wide, aes( x = 0, y = y)) +
geom_point(alpha = 0.4) +
geom_segment(aes(y = lm(y ~ 1)$coefficient,
yend = lm(y ~ 1)$coefficient,
x = -0.2, xend = 0.2)) +
labs(title = "Student's\n t-Test \n using lm")
# Wilcoxon test, using signed-ranks of data
p3 <- ggplot(mydata_wide, aes( x = 0, y = signed_rank(y))) +
geom_point(alpha = 0.4) +
geom_segment(aes(y = mean(signed_rank(y)), yend = mean(signed_rank(y)), x = -0.2, xend = 0.2)) +
labs(title = "Wilcoxon \nSigned-Rank\n Test")
# Wilcoxon test, using signed-ranks of data, and lm
p4 <- ggplot(mydata_wide, aes( x = 0, y = signed_rank(y))) +
geom_point(alpha = 0.4) +
geom_segment(aes(y = lm(signed_rank(y) ~1)$coefficient,
yend = lm(signed_rank(y) ~1)$coefficient,
x = -0.2, xend = 0.2)) +
labs(title = "Wilcoxon \n Signed-Rank \n Test with lm")
patchwork::wrap_plots(p1,p2,p3,p4, nrow = 2, guides = "collect")
All the tests indicate that the mean of y is not significantly different from zero.
Case Study #2: Exam data
Let us now choose a dataset from the openintro package:
data("exam_grades")
exam_grades
Inspecting and Charting Data
Testing Assumptions in the Data
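As with the toy data, we might check normality first. A sketch, assuming exam_grades contains a numeric course_grade column (verify with names(exam_grades)):

```r
library(openintro)
data("exam_grades")
# Shapiro-Wilk normality test on the course grades (column name assumed)
shapiro.test(exam_grades$course_grade)
```
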
Inference
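A one-sample test on the exam data could then follow the same template as Case Study #1. A sketch; both the column name course_grade and the comparison value 80 are assumptions for illustration, not from the source:

```r
library(openintro)
data("exam_grades")
# Parametric: is the mean course grade different from 80?
t.test(exam_grades$course_grade, mu = 80)
# Non-parametric alternative, as in the flowchart
wilcox.test(exam_grades$course_grade, mu = 80)
```
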
Conclusion
TBW
References
- OpenIntro Modern Statistics, Chapter #17
R Package Citations
Footnotes
Citation
@online{v2022,
author = {V, Arvind},
title = {{Inference} for a {Single} {Mean}},
date = {2022-11-10},
url = {https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/100-OneMean/single-mean.html},
langid = {en},
abstract = {Inference Tests to check the significance of a single
Mean}
}