# CRAN Packages
library(tidyverse)
library(mosaic)
library(ggformula)
library(broom)
library(mosaicCore)
library(mosaicData)
library(crosstable) # tabulated summary stats
library(openintro) # datasets and methods
library(resampledata3) # datasets
library(statsExpressions) # datasets and methods
library(ggstatsplot) # special stats plots
library(ggExtra)
# Non-CRAN Packages
# remotes::install_github("easystats/easystats")
library(easystats)
Inference for Correlation
Statistics; Tests; p-value
Setting up R packages
Plot Theme
# https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto
# Chunk options
knitr::opts_chunk$set(
fig.width = 7,
fig.asp = 0.618, # Golden Ratio
# out.width = "80%",
fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes
theme_custom <- function() {
font <- "Roboto Condensed" # assign font family up front
theme_classic(base_size = 14) %+replace% # replace elements we want to change
theme(
panel.grid.minor = element_blank(), # strip minor gridlines
text = element_text(family = font),
# text elements
plot.title = element_text( # title
family = font, # set font family
# size = 20, #set font size
face = "bold", # bold typeface
hjust = 0, # left align
# vjust = 2 #raise slightly
margin = margin(0, 0, 10, 0)
),
plot.subtitle = element_text( # subtitle
family = font, # font family
# size = 14, #font size
hjust = 0,
margin = margin(2, 0, 5, 0)
),
plot.caption = element_text( # caption
family = font, # font family
size = 8, # font size
hjust = 1
), # right align
axis.title = element_text( # axis titles
family = font, # font family
size = 10 # font size
),
axis.text = element_text( # axis text
family = font, # axis family
size = 8
) # font size
)
}
# Set graph theme
theme_set(new = theme_custom())
#
Introduction
Correlation measures how one variable varies with another. One of the basic questions we ask of our data is: does some variable have a significant correlation with another? Does \(y\) vary with \(x\)? A Correlation Test is designed to answer exactly this question. The block diagram below depicts the statistical procedures available to test for the significance of correlation scores between two variables.
Before we begin, let us recap a few basic definitions:
We have already encountered the variance of a variable:
\[ \begin{align*} var_x &= \frac{\sum_{i=1}^{n}(x_i - \mu_x)^2}{(n-1)}\\ where ~ \mu_x &= mean(x)\\ n &= sample\ size \end{align*} \] The standard deviation is:
\[ \sigma_x = \sqrt{var_x}\\ \] The covariance of two variables is defined as:
\[ \begin{align} cov(x,y) &= \frac{\sum_{i = 1}^{n}(x_i - \mu_x)*(y_i - \mu_y)}{n-1}\\ &= \frac{\sum{x_i *y_i}}{n-1} - \frac{\sum{x_i *\mu_y}}{n-1} - \frac{\sum{y_i *\mu_x}}{n-1} + \frac{\sum{\mu_x *\mu_y}}{n-1}\\ &= \frac{\sum{x_i *y_i}}{n-1} - \frac{\sum{\mu_x *\mu_y}}{n-1}\\ \end{align} \]
Hence covariance is the expectation of the product minus the product of the expectations of the two variables.
Note that in both cases we are dealing with mean-centred quantities: the variable minus its mean, \(x_i - \mu_x\), which we have seen when dealing with the CLT and the Gaussian Distribution.
So, finally, the coefficient of correlation between two variables is defined as:
\[ \begin{align} correlation ~ r &= \frac{cov(x,y)}{\sigma_x * \sigma_y} \\ &= \frac{cov(x,y)}{\sqrt{var_x} * \sqrt{var_y}} \end{align} \tag{1}\]
Thus the correlation coefficient is the covariance scaled by the geometric mean of the two variances.
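To make these definitions concrete, here is a small sketch with made-up height vectors that computes the covariance and correlation “by hand” and checks them against R’s built-in cov() and cor():
# Toy vectors (made up) to check the formulas by hand
x <- c(61, 64, 67, 70, 73)
y <- c(63, 65, 66, 68, 72)
n <- length(x)
# Covariance: mean-centred products, divided by (n - 1)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
all.equal(cov_manual, cov(x, y)) # TRUE
# Correlation: covariance scaled by the two standard deviations
r_manual <- cov_manual / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y)) # TRUE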
flowchart TD
  A[Inference for Correlation] -->|Check Assumptions| B[Normality: Shapiro-Wilk Test shapiro.test\n Variances: Fisher F-test var.test]
  B --> C{OK?}
  C -->|Yes, both\n Parametric| D[t.test]
  D <--> F[Linear Model\n Method]
  C -->|Yes, but not variance\n Parametric| W[t.test with\n Welch Correction]
  W <--> F
  C -->|No\n Non-Parametric| E[wilcox.test]
  E <--> G[Linear Model\n with\n Ranked Data]
  C -->|No\n Non-Parametric| P[Bootstrap\n or\n Permutation]
  P <--> Q[Linear Model\n with Ranked Data\n and Permutation]
Case Study #1: Galton’s famous dataset
How can we start, except by using the famous Galton dataset, now part of the mosaicData package?
Workflow: Read and Inspect the Data
data("Galton", package = "mosaicData")
Galton
The variables in this dataset are:
- sex (char): sex of the child
- family (int): an ID for each family
- father (dbl): father’s height in inches
- mother (dbl): mother’s height in inches
- height (dbl): child’s height in inches
- nkids (int): number of children in each family
inspect(Galton)
categorical variables:
name class levels n missing
1 family factor 197 898 0
2 sex factor 2 898 0
distribution
1 185 (1.7%), 166 (1.2%), 66 (1.2%) ...
2 M (51.8%), F (48.2%)
quantitative variables:
name class min Q1 median Q3 max mean sd n missing
1 father numeric 62 68 69.0 71.0 78.5 69.232851 2.470256 898 0
2 mother numeric 58 63 64.0 65.5 70.5 64.084410 2.307025 898 0
3 height numeric 56 64 66.5 69.7 79.0 66.760690 3.582918 898 0
4 nkids integer 1 4 6.0 8.0 15.0 6.135857 2.685156 898 0
So there are several correlations we can explore here: children’s height vs that of father or mother, based on sex. In essence we are replicating Francis Galton’s famous study.
Workflow: Research Questions
- Based on this sample, what can we say about the correlation between a son’s height and a father’s height in the population?
- Based on this sample, what can we say about the correlation between a daughter’s height and a father’s height in the population?
Of course we can formulate more questions, but these are good for now! And since we are going to infer correlations by sex, let us split the dataset into two parts, one for the sons and one for the daughters, and quickly summarise them too:
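A minimal sketch of this split with dplyr; the rename of height to son or daughter is an assumption based on the column names the plots and tests below rely upon:
# Split by sex of the child; rename `height` so each subset
# carries a `son` or `daughter` column
Galton_sons <- Galton %>%
  filter(sex == "M") %>%
  rename(son = height)
Galton_daughters <- Galton %>%
  filter(sex == "F") %>%
  rename(daughter = height)
##
# Quick summaries of the two subsets
favstats(~son, data = Galton_sons)
favstats(~daughter, data = Galton_daughters)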
Workflow: Visualization
Let us first quickly plot a graph that is relevant to each of the two research questions.
# Set graph theme
theme_set(new = theme_custom())
#
Galton_sons %>%
gf_point(son ~ father) %>%
gf_lm() %>%
gf_labs(
title = "Height of Sons vs fathers",
subtitle = "Galton dataset"
)
##
Galton_daughters %>%
gf_point(daughter ~ father) %>%
gf_lm() %>%
gf_labs(
title = "Height of Daughters vs Fathers",
subtitle = "Galton dataset"
)
We might even plot the overall heights together and colour by sex of the child:
# Set graph theme
theme_set(new = theme_custom())
#
Galton %>%
gf_point(height ~ father,
group = ~sex, colour = ~sex
) %>%
gf_lm() %>%
gf_labs(
title = "Height of sons vs fathers",
subtitle = "Galton dataset"
)
So daughters are shorter than sons, generally speaking, and both heights seem related to that of the father.
When we filtered the dataset into two, the filtering by sex of the child also effectively filtered the heights of the father (and mother). This is proper and desired; but think!
Workflow: Assumptions
For the classical correlation tests, we need the variables to be normally distributed. As before, we check this with shapiro.test:
shapiro.test(Galton_sons$father)
shapiro.test(Galton_sons$son)
##
shapiro.test(Galton_daughters$father)
shapiro.test(Galton_daughters$daughter)
Shapiro-Wilk normality test
data: Galton_sons$father
W = 0.98529, p-value = 0.0001191
Shapiro-Wilk normality test
data: Galton_sons$son
W = 0.99135, p-value = 0.008133
Shapiro-Wilk normality test
data: Galton_daughters$father
W = 0.98438, p-value = 0.0001297
Shapiro-Wilk normality test
data: Galton_daughters$daughter
W = 0.99113, p-value = 0.01071
Let us also check the densities and quantile (Q-Q) plots of the heights in the dataset:
# Set graph theme
theme_set(new = theme_custom())
#
Galton %>%
group_by(sex) %>%
gf_density(~height,
group = ~sex,
fill = ~sex
) %>%
gf_fitdistr(dist = "dnorm") %>%
gf_facet_grid(vars(sex)) %>%
gf_labs(title = "Facetted Density Plots")
##
Galton %>%
group_by(sex) %>%
gf_qq(~height,
group = ~sex,
colour = ~sex, size = 0.5
) %>%
gf_qqline(colour = "black") %>%
gf_facet_grid(vars(sex)) %>%
gf_labs(
title = "Facetted QQ Plots",
x = "Theoretical quartiles",
y = "Actual Data"
)
and the father’s heights:
# Set graph theme
theme_set(new = theme_custom())
#
##
Galton %>%
group_by(sex) %>%
gf_density(~father,
group = ~sex, # no this is not weird
fill = ~sex
) %>%
gf_fitdistr(dist = "dnorm") %>%
gf_refine(scale_fill_discrete(name = "Sex of Child")) %>%
gf_facet_grid(vars(sex)) %>%
gf_labs(
title = "Fathers: Facetted Density Plots",
subtitle = "By Sex of Child"
)
Galton %>%
group_by(sex) %>%
gf_qq(~father,
group = ~sex, # no this is not weird
colour = ~sex, size = 0.5
) %>%
gf_qqline(colour = "black") %>%
gf_facet_grid(vars(sex)) %>%
gf_refine(scale_colour_discrete(name = "Sex of Child")) %>%
gf_labs(
title = "Fathers Heights: Facetted QQ Plots",
subtitle = "By Sex of Child",
x = "Theoretical quartiles",
y = "Actual Data"
)
The shapiro.test informs us that the child-related height variables are not normally distributed, though visually there seems nothing much to complain about. Hmmm…
Dads are weird anyway, so we must not expect father heights to be normally distributed.
Workflow: Inference
Let us now see how Correlation Tests can be performed based on this dataset, to infer patterns in the population from which this dataset/sample was drawn.
We will go with classical tests first, and then set up a permutation test that does not need any assumptions.
We perform the Pearson correlation test first, though since the data are not normal we cannot really rely on it. We should therefore use a non-parametric test as well: the Spearman correlation.
# Pearson (built-in test)
cor_son_pearson <- cor.test(son ~ father,
method = "pearson",
data = Galton_sons
) %>%
broom::tidy() %>%
mutate(term = "Pearson Correlation r")
cor_son_pearson
cor_son_spearman <- cor.test(son ~ father, method = "spearman", data = Galton_sons) %>%
broom::tidy() %>%
mutate(term = "Spearman Correlation r")
cor_son_spearman
Both tests state that the correlation between son and father heights is significant.
# Pearson (built-in test)
cor_daughter_pearson <- cor.test(daughter ~ father,
method = "pearson",
data = Galton_daughters
) %>%
broom::tidy() %>%
mutate(term = "Pearson Correlation r")
cor_daughter_pearson
##
cor_daughter_spearman <- cor.test(daughter ~ father, method = "spearman", data = Galton_daughters) %>%
broom::tidy() %>%
mutate(term = "Spearman Correlation r")
cor_daughter_spearman
Again, both tests state that the correlation between daughter and father heights is significant.
What is happening under the hood in cor.test?
To be Written Up!
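Briefly, for the Pearson case: cor.test computes \(r\), converts it into a t-statistic \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) with \(n-2\) degrees of freedom, and reads the p-value off the t-distribution. A minimal by-hand sketch for the daughters:
# Pearson test "by hand" for daughter vs father heights
r <- cor(Galton_daughters$father, Galton_daughters$daughter)
n <- nrow(Galton_daughters)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val <- 2 * pt(-abs(t_stat), df = n - 2)
c(r = r, t = t_stat, p = p_val)
# Should agree with cor.test(daughter ~ father, data = Galton_daughters)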
We can of course use a randomization-based test for correlation. How would we mechanize this? What aspect would we randomize? Correlation is calculated on a vector basis: each individual observation of variable#1 is multiplied by the corresponding observation of variable#2. Look at Equation 1! So we might be able to randomize the order of this multiplication to see how uncommon this particular set of multiplications is. That would give us a p-value to decide if the observed correlation is close to the truth. So, onwards with our friend mosaic:
obs_daughter_corr <- cor(Galton_daughters$father, Galton_daughters$daughter)
obs_daughter_corr
[1] 0.4587605
corr_daughter_null <- do(4999) * cor.test(daughter ~ shuffle(father), data = Galton_daughters) %>%
broom::tidy()
corr_daughter_null
corr_daughter_null %>%
gf_histogram(~estimate, bins = 50) %>%
gf_vline(
xintercept = obs_daughter_corr,
color = "red", linewidth = 1
) %>%
gf_labs(
title = "Permutation Null Distribution",
subtitle = "Daughter Heights vs Father Heights"
)
##
p_value_null <- 2.0 * mean(corr_daughter_null$estimate >= obs_daughter_corr)
p_value_null
[1] 0
We see that with all permutations of father, we are never able to hit the actual obs_daughter_corr! Hence there is a definite correlation between father height and daughter height.
The premise here is that many common statistical tests are special cases of the linear model. A linear model estimates the relationship between a dependent or “response” variable, height, and an explanatory variable or “predictor”, father. It is assumed that the relationship is linear. \(\beta_0\) is the intercept and \(\beta_1\) is the slope of the linear fit, which predicts the value of height based on the value of father.
\[ height = \beta_0 + \beta_1 \times father \] The model for Pearson Correlation tests is exactly the Linear Model:
\[ \begin{aligned} height = \beta_0 + \beta_1 \times father\\ \\ H_0: Null\ Hypothesis => \beta_1 = 0\\ H_a: Alternate\ Hypothesis => \beta_1 \ne 0\\ \end{aligned} \]
Using the linear model method we get:
# Linear Model
lin_son <- lm(son ~ father, data = Galton_sons) %>%
broom::tidy() %>%
mutate(term = c("beta_0", "beta_1")) %>%
select(term, estimate, p.value)
lin_son
##
lin_daughter <- lm(daughter ~ father, data = Galton_daughters) %>%
broom::tidy() %>%
mutate(term = c("beta_0", "beta_1")) %>%
select(term, estimate, p.value)
lin_daughter
Why are the respective \(r\)-s and \(\beta_1\)-s different, though the p-values are suspiciously the same!? Did we miss a factor of \(\frac{sd(son/daughter)}{sd(father)} = ??\) somewhere…??
Let us scale the variables (subtract the mean and divide by the standard deviation, giving z-scores) and re-do the Linear Model with scaled versions of daughter and father:
# Scaled linear model
lin_scaled_galton_daughters <- lm(scale(daughter) ~ 1 + scale(father), data = Galton_daughters) %>%
broom::tidy() %>%
mutate(term = c("beta_0", "beta_1"))
lin_scaled_galton_daughters
Now you’re talking!! The estimate is the same in both the classical test and the linear model! So we conclude:
When both target and predictor have the same standard deviation, the slope from the linear model and the Pearson correlation are the same.
There is this relationship between the slope in the linear model and Pearson correlation:
\[ Slope\ \beta_1 = \frac{sd_y}{sd_x} * r \]
The slope is usually much more interpretable and informative than the correlation coefficient.
- Hence a linear model using scale() for both variables will show slope = r.
Slope_Scaled: 0.4587605 = Correlation: 0.4587605
- Finally, the p-value for the Pearson Correlation and that for the slope in the linear model are the same. Which means we can reject the NULL hypothesis of “no relationship” between daughters’ and fathers’ heights.
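We can verify the slope-correlation relationship numerically for the daughters; a quick sketch:
# beta_1 recovered from r and the two standard deviations
r <- cor(Galton_daughters$father, Galton_daughters$daughter)
slope_from_r <- (sd(Galton_daughters$daughter) / sd(Galton_daughters$father)) * r
slope_from_lm <- coef(lm(daughter ~ father, data = Galton_daughters))["father"]
c(from_r = slope_from_r, from_lm = slope_from_lm) # the two should match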
Can you complete this for the sons?
Wait, But Why?
Conclusion
Your Turn
- Try the datasets in the infer package. Use data(package = "infer") in your Console to list out the datasets it contains. Then simply type the name of the dataset in a Quarto chunk (e.g. babynames) to read it.
- Same with the resampledata and resampledata3 packages.
References
- Common statistical tests are linear models (or: how to teach stats) by Jonas Kristoffer Lindeløv. CheatSheet
- Common statistical tests are linear models: a work through by Steve Doogue
- Jeffrey Walker: Elements of Statistical Modeling for Experimental Biology
- Diez, David M; Barr, Christopher D; Çetinkaya-Rundel, Mine: OpenIntro Statistics
- Modern Statistics with R: From wrangling and exploring data to inference and predictive modelling by Måns Thulin
- Jeffrey Walker: A linear model can be fit to data with continuous, discrete, or categorical x variables
Citation
@online{v.2022,
author = {V., Arvind},
title = {Inference for {Correlation}},
date = {2022-11-25},
url = {https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/150-Correlation/},
langid = {en},
abstract = {Statistical Significance Tests for Correlations between
two Variables}
}