Inference for Correlation

Author

Arvind V.

Published

November 25, 2022

Modified

July 29, 2025

Abstract

Statistical Significance Tests for Correlations between two Variables

Keywords

Statistics; Tests; p-value

Setting up R packages

# CRAN Packages
library(mosaic)
library(ggformula)
library(broom)
library(mosaicCore)
library(mosaicData)
library(crosstable) # tabulated summary stats

library(openintro) # datasets and methods
library(resampledata3) # datasets
library(statsExpressions) # datasets and methods
library(ggstatsplot) # special stats plots
library(ggExtra)

# Non-CRAN Packages
# remotes::install_github("easystats/easystats")
library(easystats)


library(tidyverse) # Tidy Data Processing

Plot Fonts and Theme


library(systemfonts)
library(showtext)
## Clean the slate
systemfonts::clear_local_fonts()
systemfonts::clear_registry()
##
showtext_opts(dpi = 96) # set DPI for showtext
sysfonts::font_add(
  family = "Alegreya",
  regular = "../../../../../../fonts/Alegreya-Regular.ttf",
  bold = "../../../../../../fonts/Alegreya-Bold.ttf",
  italic = "../../../../../../fonts/Alegreya-Italic.ttf",
  bolditalic = "../../../../../../fonts/Alegreya-BoldItalic.ttf"
)

sysfonts::font_add(
  family = "Roboto Condensed",
  regular = "../../../../../../fonts/RobotoCondensed-Regular.ttf",
  bold = "../../../../../../fonts/RobotoCondensed-Bold.ttf",
  italic = "../../../../../../fonts/RobotoCondensed-Italic.ttf",
  bolditalic = "../../../../../../fonts/RobotoCondensed-BoldItalic.ttf"
)
showtext_auto(enable = TRUE) # enable showtext
##
theme_custom <- function() {
  font <- "Alegreya" # assign font family up front

  theme_classic(base_size = 14, base_family = font) %+replace% # replace elements we want to change

    theme(
      text = element_text(family = font), # set base font family

      # text elements
      plot.title = element_text( # title
        family = font, # set font family
        size = 24, # set font size
        face = "bold", # bold typeface
        hjust = 0, # left align
        margin = margin(t = 5, r = 0, b = 5, l = 0)
      ), # margin
      plot.title.position = "plot",
      plot.subtitle = element_text( # subtitle
        family = font, # font family
        size = 14, # font size
        hjust = 0, # left align
        margin = margin(t = 5, r = 0, b = 10, l = 0)
      ), # margin

      plot.caption = element_text( # caption
        family = font, # font family
        size = 9, # font size
        hjust = 1
      ), # right align

      plot.caption.position = "plot", # right align

      axis.title = element_text( # axis titles
        family = "Roboto Condensed", # font family
        size = 12
      ), # font size

      axis.text = element_text( # axis text
        family = "Roboto Condensed", # font family
        size = 9
      ), # font size

      axis.text.x = element_text( # margin for axis text
        margin = margin(5, b = 10)
      )

      # since the legend often requires manual tweaking
      # based on plot content, don't define it here
    )
}

```{r}
#| cache: false
#| code-fold: true
## Set the theme
library(ggplot2) # provides theme_set(); must be attached before this chunk runs
theme_set(new = theme_custom())
```

```{r}
#| cache: false
#| code-fold: true
## Use available fonts in ggplot text geoms too!
library(ggplot2) # provides update_geom_defaults()
update_geom_defaults(geom = "text", new = list(
  family = "Roboto Condensed",
  fontface = "plain", # the text aesthetic is `fontface`, not `face`
  size = 3.5,
  color = "#2b2b2b"
))
```

Introduction

Correlations describe how one variable varies with another. One of the basic questions we would have of our data is: does some variable have a significant correlation with another? Does $y$ vary with $x$? A Correlation Test is designed to answer exactly this question. The block diagram depicts the statistical procedures available to test for the significance of correlation scores between two variables.

Before we begin, let us recap a few basic definitions:

We have already encountered the variance of a variable:

$\begin{aligned} \mathrm{var}_{x} &= \frac{\sum_{i = 1}^{n} (x_{i} - \mu_{x})^{2}}{n - 1} \\ \text{where } \mu_{x} &= \mathrm{mean}(x) \\ n &= \text{sample size} \end{aligned}$

The standard deviation is:

$\sigma_{x} = \sqrt{\mathrm{var}_{x}}$

The covariance of two variables is defined as:

$\begin{aligned} \mathrm{cov}(x, y) &= \frac{\sum_{i = 1}^{n} (x_{i} - \mu_{x}) (y_{i} - \mu_{y})}{n - 1} \\ &= \frac{\sum x_{i} y_{i}}{n - 1} - \frac{\sum x_{i} \mu_{y}}{n - 1} - \frac{\sum y_{i} \mu_{x}}{n - 1} + \frac{\sum \mu_{x} \mu_{y}}{n - 1} \\ &= \frac{\sum x_{i} y_{i}}{n - 1} - \frac{\sum \mu_{x} \mu_{y}}{n - 1} \end{aligned}$

Hence the covariance is, up to the factor $n/(n - 1)$, the mean of the product minus the product of the means of the two variables.

Covariance uses centred values!

Note that in both cases we are dealing with centred values: variable minus its mean, $x_{i} - \mu_{x}$. Dividing these by the standard deviation gives the z-scores that we have seen when dealing with the CLT and the Gaussian Distribution.

So, finally, the coefficient of correlation between two variables is defined as:

$\begin{matrix} (1) & \begin{aligned} \text{correlation } r &= \frac{\mathrm{cov}(x, y)}{\sigma_{x} \sigma_{y}} \\ &= \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}_{x}} \sqrt{\mathrm{var}_{y}}} \end{aligned} \end{matrix}$

Thus the correlation coefficient is the covariance scaled by the geometric mean of the two variances.
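The definitions above can be checked numerically. The following is a minimal sketch in base R, using simulated (hypothetical) data rather than any dataset from this module; the names `x`, `y`, `cov_manual`, `cov_shortcut` and `r_manual` are illustrative only.

```r
# Minimal sketch: covariance and correlation computed "by hand".
# x and y are simulated, hypothetical data (not the Galton heights).
set.seed(42)
n <- 50
x <- rnorm(n, mean = 69, sd = 2.5)
y <- 30 + 0.5 * x + rnorm(n, sd = 2)

# Covariance: sum of products of centred values, divided by (n - 1)
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Shortcut form from the last line of the derivation:
# (sum of x*y minus n * mean(x) * mean(y)), divided by (n - 1)
cov_shortcut <- (sum(x * y) - n * mean(x) * mean(y)) / (n - 1)

# Correlation: covariance divided by the product of the standard deviations
r_manual <- cov_manual / (sd(x) * sd(y))

# Compare with the built-in functions
all.equal(cov_manual, cov(x, y))
all.equal(cov_shortcut, cov(x, y))
all.equal(r_manual, cor(x, y))
```

All three comparisons should agree up to floating-point tolerance, since `cov()` and `cor()` use the same $n - 1$ denominator.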

Case Study #1: Galton’s famous dataset

How can we start, except by using the famous Galton dataset, now part of the mosaicData package?

Workflow: Read and Inspect the Data

data("Galton", package = "mosaicData")
Galton


family <fct>	father <dbl>	mother <dbl>	sex <fct>	height <dbl>	nkids <int>
1	78.5	67.0	M	73.2	4
1	78.5	67.0	F	69.2	4
1	78.5	67.0	F	69.0	4
1	78.5	67.0	F	69.0	4
2	75.5	66.5	M	73.5	4
2	75.5	66.5	M	72.5	4
2	75.5	66.5	F	65.5	4
2	75.5	66.5	F	65.5	4
3	75.0	64.0	M	71.0	2
3	75.0	64.0	F	68.0	2

The variables in this dataset are:

Qualitative Variables

sex(fct): sex of the child
family(fct): an ID for each family

Quantitative Variables

father(dbl): father’s height in inches
mother(dbl): mother’s height in inches
height(dbl): Child’s height in inches
nkids(int): Number of children in each family

inspect(Galton)


categorical variables:  
    name  class levels   n missing
1 family factor    197 898       0
2    sex factor      2 898       0
                                   distribution
1 185 (1.7%), 166 (1.2%), 66 (1.2%) ...        
2 M (51.8%), F (48.2%)                         

quantitative variables:  
    name   class min Q1 median   Q3  max      mean       sd   n missing
1 father numeric  62 68   69.0 71.0 78.5 69.232851 2.470256 898       0
2 mother numeric  58 63   64.0 65.5 70.5 64.084410 2.307025 898       0
3 height numeric  56 64   66.5 69.7 79.0 66.760690 3.582918 898       0
4  nkids integer   1  4    6.0  8.0 15.0  6.135857 2.685156 898       0

So there are several correlations we can explore here: Children’s height vs that of father or mother, based on sex. In essence we are replicating Francis Galton’s famous study.

Workflow: Research Questions

Question 1

Based on this sample, what can we say about the correlation between a son’s height and a father’s height in the population?

Question 2

Based on this sample, what can we say about the correlation between a daughter’s height and a father’s height in the population?

Of course we can formulate more questions, but these are good for now! And since we are going to infer correlations by sex, let us split the dataset into two parts, one for the sons and one for the daughters, and quickly summarise them too:

Galton_sons <- Galton %>%
  dplyr::filter(sex == "M") %>%
  rename("son" = height)
Galton_daughters <- Galton %>%
  dplyr::filter(sex == "F") %>%
  rename("daughter" = height)
dim(Galton_sons)

[1] 465   6

dim(Galton_daughters)

[1] 433   6

Galton_sons %>%
  summarize(across(
    .cols = c(son, father),
    .fns = list(mean = mean, sd = sd)
  ))


son_mean <dbl>	son_sd <dbl>	father_mean <dbl>	father_sd <dbl>
69.22882	2.631594	69.16817	2.299929

Galton_daughters %>%
  summarize(across(
    .cols = c(daughter, father),
    .fns = list(mean = mean, sd = sd)
  ))


daughter_mean <dbl>	daughter_sd <dbl>	father_mean <dbl>	father_sd <dbl>
64.11016	2.37032	69.30231	2.641898

Workflow: Visualization

Let us first quickly plot a graph that is relevant to each of the two research questions.

# Set graph theme
theme_set(new = theme_custom())
#
Galton_sons %>%
  gf_point(son ~ father) %>%
  gf_lm() %>%
  gf_labs(
    title = "Heights of Sons vs Fathers",
    subtitle = "Galton dataset"
  )
##
Galton_daughters %>%
  gf_point(daughter ~ father) %>%
  gf_lm() %>%
  gf_labs(
    title = "Heights of Daughters vs Fathers",
    subtitle = "Galton dataset"
  )

We might even plot the overall heights together and colour by sex of the child:

# Set graph theme
theme_set(new = theme_custom())
#
Galton %>%
  gf_point(height ~ father,
    group = ~sex, colour = ~sex
  ) %>%
  gf_lm() %>%
  gf_refine(scale_color_brewer(palette = "Set1")) %>%
  gf_labs(
    title = "Heights of Children vs Fathers",
    subtitle = "Galton dataset"
  )

So daughters are shorter than sons, generally speaking, and both heights seem related to that of the father.

What did filtering do?

When we filtered the dataset into two, the filtering by sex of the child also effectively filtered the heights of the father (and mother). This is proper and desired; but think!

Workflow: Assumptions

The classical correlation tests assume that the variables are normally distributed. As before, we check this with shapiro.test:

shapiro.test(Galton_sons$father)
shapiro.test(Galton_sons$son)
##
shapiro.test(Galton_daughters$father)
shapiro.test(Galton_daughters$daughter)


    Shapiro-Wilk normality test

data:  Galton_sons$father
W = 0.98529, p-value = 0.0001191


    Shapiro-Wilk normality test

data:  Galton_sons$son
W = 0.99135, p-value = 0.008133


    Shapiro-Wilk normality test

data:  Galton_daughters$father
W = 0.98438, p-value = 0.0001297


    Shapiro-Wilk normality test

data:  Galton_daughters$daughter
W = 0.99113, p-value = 0.01071

Let us also check the densities and quantile (QQ) plots of the heights in the dataset:

# Set graph theme
theme_set(new = theme_custom())
#
Galton %>%
  group_by(sex) %>%
  gf_density(~height,
    group = ~sex,
    fill = ~sex
  ) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_brewer(palette = "Set1")) %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(title = "Facetted Density Plots")
##
Galton %>%
  group_by(sex) %>%
  gf_qq(~height,
    group = ~sex,
    colour = ~sex, size = 0.5
  ) %>%
  gf_qqline(colour = "black") %>%
  gf_refine(scale_color_brewer(palette = "Set1")) %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(
    title = "Facetted QQ Plots",
    x = "Theoretical quantiles",
    y = "Actual Data"
  )

and the father’s heights:

# Set graph theme
theme_set(new = theme_custom())
#
##
Galton %>%
  group_by(sex) %>%
  gf_density(~father,
    group = ~sex, # no this is not weird
    fill = ~sex
  ) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_brewer(name = "Sex of Child", palette = "Set1")) %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(
    title = "Fathers: Facetted Density Plots",
    subtitle = "By Sex of Child"
  )

Galton %>%
  group_by(sex) %>%
  gf_qq(~father,
    group = ~sex, # no this is not weird
    colour = ~sex, size = 0.5
  ) %>%
  gf_qqline(colour = "black") %>%
  gf_facet_grid(vars(sex)) %>%
  gf_refine(scale_colour_brewer(name = "Sex of Child", palette = "Set1")) %>%
  gf_labs(
    title = "Fathers Heights: Facetted QQ Plots",
    subtitle = "By Sex of Child",
    x = "Theoretical quantiles",
    y = "Actual Data"
  )

The shapiro.test p-values are all below 0.05, which tells us that none of the height variables is normally distributed, though visually there seems to be nothing much to complain about. Hmmm…

Dads are weird anyway, so we must not expect father heights to be normally distributed.

Workflow: Inference

Let us now see how Correlation Tests can be performed based on this dataset, to infer patterns in the population from which this dataset/sample was drawn.

We will go with classical tests first, and then set up a permutation test that does not need any assumptions.

We perform the Pearson correlation test first, although strictly speaking the data are not normal, so its result should be read with caution. We therefore also run a non-parametric test, the Spearman correlation.

# Pearson (built-in test)
cor_son_pearson <- cor.test(son ~ father,
  method = "pearson",
  data = Galton_sons
) %>%
  broom::tidy() %>%
  mutate(term = "Pearson Correlation r")
cor_son_pearson


estimate <dbl>	statistic <dbl>	p.value <dbl>	parameter <int>	conf.low <dbl>	conf.high <dbl>	method <chr>	alternative <chr>	term <chr>
0.3913174	9.149788	1.824016e-18	463	0.3114667	0.4656805	Pearson's product-moment correlation	two.sided	Pearson Correlation r

cor_son_spearman <- cor.test(son ~ father, method = "spearman", data = Galton_sons) %>%
  broom::tidy() %>%
  mutate(term = "Spearman Correlation r")
cor_son_spearman


estimate <dbl>	statistic <dbl>	p.value <dbl>	method <chr>	alternative <chr>	term <chr>
0.4063241	9948441	6.51485e-20	Spearman's rank correlation rho	two.sided	Spearman Correlation r

Both tests state that the correlation between son and father is significant.

# Pearson (built-in test)
cor_daughter_pearson <- cor.test(daughter ~ father,
  method = "pearson",
  data = Galton_daughters
) %>%
  broom::tidy() %>%
  mutate(term = "Pearson Correlation r")
cor_daughter_pearson


estimate <dbl>	statistic <dbl>	p.value <dbl>	parameter <int>	conf.low <dbl>	conf.high <dbl>	method <chr>	alternative <chr>	term <chr>
0.4587605	10.7186	6.355655e-24	431	0.3809944	0.5300812	Pearson's product-moment correlation	two.sided	Pearson Correlation r

##
cor_daughter_spearman <- cor.test(daughter ~ father, method = "spearman", data = Galton_daughters) %>%
  broom::tidy() %>%
  mutate(term = "Spearman Correlation r")
cor_daughter_spearman


estimate <dbl>	statistic <dbl>	p.value <dbl>	method <chr>	alternative <chr>	term <chr>
0.43337	7666721	2.982817e-21	Spearman's rank correlation rho	two.sided	Spearman Correlation r

Again both tests state that the correlation between daughter and father is significant.

What is happening under the hood in cor.test?

To be Written Up! But when?
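Until that write-up arrives, here is a rough sketch of what the Pearson branch of `cor.test` computes, assuming the standard textbook formulas: a t-statistic $t = r \sqrt{n - 2} / \sqrt{1 - r^{2}}$ with $n - 2$ degrees of freedom, and a confidence interval from Fisher's z-transformation. The data are simulated and hypothetical, and the variable names are illustrative.

```r
# Sketch of the Pearson branch of cor.test(), on simulated (hypothetical) data
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)

r <- cor(x, y)

# t statistic with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z-transformation (atanh / tanh)
z <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

# Compare with the built-in test
builtin <- cor.test(x, y, method = "pearson")
all.equal(t_stat, unname(builtin$statistic))
all.equal(p_val, builtin$p.value)
all.equal(ci, as.numeric(builtin$conf.int))
```

As a check against the output above: for the daughters, $r = 0.4587605$ and $n = 433$ give $t \approx 10.72$ on 431 degrees of freedom, which matches the `statistic` and `parameter` columns.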

We can of course use a randomization-based test for correlation. How would we mechanize this, and what aspect would we randomize?

Correlation is calculated on a vector basis: each individual observation of variable #1 is multiplied by the corresponding observation of variable #2. Look at Equation 1! So we can randomize the order of this pairing to see how unusual the observed set of products is when there is no association between the variables. That gives us a p-value for judging whether the observed correlation could plausibly have arisen by chance. So, onwards with our friend mosaic:

obs_daughter_corr <- cor(Galton_daughters$father, Galton_daughters$daughter)
obs_daughter_corr

[1] 0.4587605

corr_daughter_null <- do(4999) * cor.test(daughter ~ shuffle(father),
  data = Galton_daughters
) %>%
  broom::tidy()
corr_daughter_null


estimate <dbl>	statistic <dbl>	p.value <dbl>	parameter <int>	conf.low <dbl>	conf.high <dbl>	method <chr>	alternative <chr>	.row <int>	.index <dbl>
3.913872e-02	0.8131639499	0.4165730294	431	-0.0553026530	1.328860e-01	Pearson's product-moment correlation	two.sided	1	1
-6.073446e-03	-0.1260903366	0.8997192136	431	-0.1002534623	8.821444e-02	Pearson's product-moment correlation	two.sided	1	2
-1.652351e-02	-0.3430839064	0.7317026136	431	-0.1105887089	7.783508e-02	Pearson's product-moment correlation	two.sided	1	3
1.354030e-02	0.2811297296	0.7787458108	431	-0.0808001962	1.076403e-01	Pearson's product-moment correlation	two.sided	1	4
-2.576536e-03	-0.0534904498	0.9573659237	431	-0.0967904306	9.168312e-02	Pearson's product-moment correlation	two.sided	1	5
-8.213732e-03	-0.1705272633	0.8646755160	431	-0.1023718883	8.609030e-02	Pearson's product-moment correlation	two.sided	1	6
-7.858073e-02	-1.6364386067	0.1024777819	431	-0.1715477725	1.577347e-02	Pearson's product-moment correlation	two.sided	1	7
-1.701146e-02	-0.3532180973	0.7240976348	431	-0.1110707919	7.734994e-02	Pearson's product-moment correlation	two.sided	1	8
7.708721e-02	1.6051485231	0.1091935524	431	-0.0172756804	1.700890e-01	Pearson's product-moment correlation	two.sided	1	9
-7.880252e-02	-1.6410862172	0.1015090114	431	-0.1717643698	1.555035e-02	Pearson's product-moment correlation	two.sided	1	10

corr_daughter_null %>%
  gf_histogram(~estimate, bins = 50) %>%
  gf_vline(
    xintercept = obs_daughter_corr,
    color = "red", linewidth = 1
  ) %>%
  gf_labs(
    title = "Permutation Null Distribution",
    subtitle = "Daughter Heights vs Father Heights"
  )

##
p_value_null <- 2.0 * mean(corr_daughter_null$estimate >= obs_daughter_corr)
p_value_null

[1] 0

We see that with all 4999 permutations of father, we never reach the actual obs_daughter_corr! The permutation p-value is therefore smaller than about 1/5000, which is strong evidence of a correlation between father height and daughter height.
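The same permutation idea can be sketched in base R without mosaic. This is an illustrative version on simulated, hypothetical heights (the names `father`, `daughter`, `null_corr` and `p_value` are local to this sketch). It differs from the code above in two labelled ways: it compares absolute values for a two-sided test, and it adds 1 to numerator and denominator so that the reported p-value is never exactly zero.

```r
# Base-R sketch of a permutation test for correlation (simulated data)
set.seed(2022)
n <- 200
father <- rnorm(n, mean = 69, sd = 2.5)            # hypothetical heights
daughter <- 36 + 0.4 * father + rnorm(n, sd = 2)   # hypothetical heights

obs_corr <- cor(father, daughter)

# Shuffle one variable to break the pairing, then recompute the correlation
n_perm <- 4999
null_corr <- replicate(n_perm, cor(sample(father), daughter))

# Two-sided p-value; the +1 counts the observed arrangement itself
p_value <- (sum(abs(null_corr) >= abs(obs_corr)) + 1) / (n_perm + 1)
p_value
```

With this convention the smallest attainable p-value is `1 / (n_perm + 1)`, which makes explicit that the resolution of a permutation test is limited by the number of shuffles.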

The premise here is that many common statistical tests are special cases of the linear model. A linear model estimates the relationship between a dependent or “response” variable, height, and an explanatory variable or “predictor”, father. The relationship is assumed to be linear: $\beta_{0}$ is the intercept and $\beta_{1}$ is the slope of the linear fit that predicts the value of height based on the value of father.

$\text{height} = \beta_{0} + \beta_{1} \times \text{father}$

The model for the Pearson correlation test is exactly this linear model:

$\begin{array}{r} \text{height} = \beta_{0} + \beta_{1} \times \text{father} \\ H_{0} : \text{Null Hypothesis} \Rightarrow \beta_{1} = 0 \\ H_{a} : \text{Alternate Hypothesis} \Rightarrow \beta_{1} \neq 0 \end{array}$

Using the linear model method we get:

# Linear Model
lin_son <- lm(son ~ father, data = Galton_sons) %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1")) %>%
  select(term, estimate, p.value)
lin_son


term <chr>	estimate <dbl>	p.value <dbl>
beta_0	38.2589122	2.642076e-26
beta_1	0.4477479	1.824016e-18

##
lin_daughter <- lm(daughter ~ father, data = Galton_daughters) %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1")) %>%
  select(term, estimate, p.value)
lin_daughter


term <chr>	estimate <dbl>	p.value <dbl>
beta_0	35.5852284	2.573249e-34
beta_1	0.4116015	6.355655e-24

Why are the respective $r$-s and $\beta_{1}$-s different, though the p-values are suspiciously the same!? Did we miss a factor of $\frac{sd(\text{son or daughter})}{sd(\text{father})}$ somewhere…??

Let us standardize the variables to mean 0 and standard deviation 1 (subtract the mean and divide by the sd) and re-do the Linear Model with scaled versions of daughter and father:

# Scaled linear model
lin_scaled_galton_daughters <- lm(scale(daughter) ~ 1 + scale(father), data = Galton_daughters) %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1"))
lin_scaled_galton_daughters


term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
beta_0	-1.532454e-14	0.04275097	-3.584606e-13	1.000000e+00
beta_1	4.587605e-01	0.04280043	1.071860e+01	6.355655e-24

Now you’re talking!! The estimate is the same in both the classical test and the linear model! So we conclude:

When both target and predictor have the same standard deviation, the slope from the linear model and the Pearson correlation are the same.
There is this relationship between the slope in the linear model and Pearson correlation:

$\text{Slope } \beta_{1} = \frac{sd_{y}}{sd_{x}} \times r$

The slope is usually much more interpretable and informative than the correlation coefficient.

Hence a linear model using scale() for both variables will show slope = r.

Slope_Scaled: 0.4587605 = Correlation: 0.4587605

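Finally, the p-value for the Pearson correlation and that for the slope in the linear model are the same ($6.36 \times 10^{-24}$; the value $0.0428$ in the table is the standard error of the slope, not a p-value). This p-value is far below $0.05$, so we reject the NULL hypothesis of “no relationship” between daughter-s and father-s heights.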
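The relationship between slope and correlation stated above can be verified directly. The sketch below uses simulated, hypothetical data and illustrative names; `as.numeric(scale(...))` is used only to turn the one-column matrix returned by `scale()` into a plain vector.

```r
# Sketch: slope of a linear model versus Pearson correlation (simulated data)
set.seed(7)
n <- 150
x <- rnorm(n, mean = 69, sd = 2.6)
y <- 35 + 0.4 * x + rnorm(n, sd = 2.1)

r <- cor(x, y)

# Slope on the original scale
slope_raw <- unname(coef(lm(y ~ x))[2])

# Slope after standardizing both variables to mean 0, sd 1
zx <- as.numeric(scale(x))
zy <- as.numeric(scale(y))
slope_scaled <- unname(coef(lm(zy ~ zx))[2])

all.equal(slope_raw, r * sd(y) / sd(x)) # slope = (sd_y / sd_x) * r
all.equal(slope_scaled, r)              # standardized slope = r
```

The numbers reported earlier agree with this: for the daughters, $0.4587605 \times 2.37032 / 2.641898 \approx 0.4116$, which is the unscaled `beta_1`; for the sons, $0.3913174 \times 2.631594 / 2.299929 \approx 0.4477$.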

Can you complete this for the sons?

Case Study #2: Study and Grades

In some cases the LINE assumptions (Linearity, Independence, Normality of residuals, Equal variance) may not hold.

Nonlinear relationships, non-normally distributed data (with large outliers), and ordinal rather than continuous data: these situations necessitate the use of Spearman’s rank correlation. (Ranked, not sign-ranked.)

See the example below: We choose to look at the gpa_study_hours dataset. It has two numeric columns gpa and study_hours:

glimpse(gpa_study_hours)

Rows: 193
Columns: 2
$ gpa         <dbl> 4.000, 3.800, 3.930, 3.400, 3.200, 3.520, 3.680, 3.400, 3.…
$ study_hours <dbl> 10, 25, 45, 10, 4, 10, 24, 40, 10, 10, 30, 7, 15, 60, 10, …

We can plot this:

# Set graph theme
theme_set(new = theme_custom())
#
ggplot(gpa_study_hours, aes(x = study_hours, y = gpa)) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "GPA vs Study Hours",
    subtitle = "Pearson Correlation Test"
  )

Hmm… the data are not normally distributed, and there is a sort of increasing relationship, but is it linear? There is also some evidence of heteroscedasticity, so the LINE assumptions are clearly violated. Pearson correlation would not be the best idea here.

Let us quickly try it anyway, using a Linear Model for the scaled gpa and study_hours variables, from where we get:

# Pearson Correlation as Linear Model
model_gpa <-
  lm(scale(gpa) ~ 1 + scale(study_hours), data = gpa_study_hours)
##
model_gpa %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1")) %>%
  cbind(confint(model_gpa) %>% as_tibble()) %>%
  select(term, estimate, p.value, `2.5 %`, `97.5 %`)


term <chr>	estimate <dbl>	p.value <dbl>	2.5 % <dbl>	97.5 % <dbl>
beta_0	-2.882036e-15	1.00000000	-0.141087199	0.1410872
beta_1	1.330138e-01	0.06517072	-0.008440359	0.2744679

The correlation estimate is $0.133$ ; the p-value is $0.065$ (and the confidence interval includes $0$ ).

Hence we fail to reject the NULL hypothesis that study_hours and gpa have no relationship. But can this be right?

Should we use another test, that does not need the LINE assumptions?

“Signed Rank” Values

Most statistical tests use the actual values of the data variables. However, some non-parametric statistical tests use the data in rank-transformed form. (In some cases the signed rank of the data values is used instead of the data itself.)

Signed Rank is calculated as follows:

Take the absolute value of each observation in a sample.
Rank the observations in order of absolute magnitude: the smallest gets rank = 1, and so on. This gives us ranked data.
Give each rank the sign of the original observation (+ or -). This gives us signed-rank data.

signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}
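As a small worked example of the three steps, the sketch below applies the function to a hypothetical four-element vector (the values are illustrative and chosen to have no ties). The function is repeated so that the snippet runs on its own.

```r
# Signed rank: rank the absolute values, then restore the original signs
signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

x <- c(-3, 1.5, 4, -0.5) # hypothetical observations

rank(x)        # plain ranks:  1 3 4 2
signed_rank(x) # signed ranks: -3 2 4 -1
```

Note how the plain rank treats -3 as the smallest value, while the signed rank treats it as the third-largest in magnitude and keeps its negative sign.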

Plotting Original and Signed Rank Data

Let us see how this might work by comparing data and its signed-rank version…A quick set of plots:

So the means of the ranks of the three separate variables seem to be in the same order as the means of the data variables themselves.

How about associations between data? Do ranks reflect well what the data might?

The slopes are almost identical, $0.25$ for both original data and ranked data for $y1 \sim x$. So maybe ranked and even sign-ranked data could work, and if it can work despite the LINE assumptions not being satisfied, that would be nice!

How does Sign-Rank data work?

TBD: need to add some explanation here.

Spearman correlation = Pearson correlation using the ranks of the data observations. Let’s check how this holds for our x and y1 data:

So the Linear Model for the Ranked Data would be:

$\begin{array}{r} \text{rank}(y) = \beta_{0} + \beta_{1} \times \text{rank}(x) \\ H_{0} : \text{Null Hypothesis} \Rightarrow \beta_{1} = 0 \\ H_{a} : \text{Alternate Hypothesis} \Rightarrow \beta_{1} \neq 0 \end{array}$


Notes:

When ranks are used, the slope of the linear model ($\beta_{1}$) has the same value as the Spearman correlation coefficient ($\rho$): exactly so when there are no ties, and very nearly so otherwise.
Note that the slope from the linear model now has an intuitive interpretation: the number of ranks y changes for each change in rank of x. (Ranks are “independent” of sd.)
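The notes above can be illustrated with a short sketch on simulated, hypothetical data with a monotone but nonlinear trend. The data are continuous, so there are no ties and the three quantities coincide; all names are illustrative.

```r
# Sketch: Spearman correlation as Pearson correlation on ranks (simulated data)
set.seed(11)
n <- 60
x <- rexp(n)                     # skewed and continuous, so no ties
y <- exp(x) + rnorm(n, sd = 0.5) # nonlinear but increasing trend

rho_spearman <- cor(x, y, method = "spearman")

# Pearson correlation computed on the ranks
rho_ranks <- cor(rank(x), rank(y))

# Slope of the linear model fitted to the ranks
slope_ranks <- unname(coef(lm(rank(y) ~ rank(x)))[2])

all.equal(rho_spearman, rho_ranks)
all.equal(slope_ranks, rho_spearman)
```

Without ties, `rank(x)` and `rank(y)` are both permutations of 1 to n and so have the same standard deviation, which is why the slope on ranks equals the correlation of the ranks.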

Example

We examine the cars93 data, where the numeric variables of interest are weight and price.

# Set graph theme
theme_set(new = theme_custom())
#

cars93 %>%
  ggplot(aes(weight, price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, lty = 2) +
  labs(title = "Car Weight and Car Price have a nonlinear relationship") +
  theme_classic()

Let us try a Spearman correlation for these variables, since they are not linearly related and the variance of price is not constant over weight.

# Set graph theme
theme_set(new = theme_custom())
#

cor.test(cars93$price, cars93$weight, method = "spearman") %>% broom::tidy()


estimate <dbl>	statistic <dbl>	p.value <dbl>	method <chr>	alternative <chr>
0.8828317	3073.91	1.066315e-18	Spearman's rank correlation rho	two.sided

# Using linear Model
lm(rank(price) ~ rank(weight), data = cars93) %>% summary()


Call:
lm(formula = rank(price) ~ rank(weight), data = cars93)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.0676  -3.0135   0.7815   3.6926  20.4099 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.22074    2.05894   1.564    0.124    
rank(weight)  0.88288    0.06514  13.554   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.46 on 52 degrees of freedom
Multiple R-squared:  0.7794,    Adjusted R-squared:  0.7751 
F-statistic: 183.7 on 1 and 52 DF,  p-value: < 2.2e-16

# Stats Plot
ggstatsplot::ggscatterstats(
  data = cars93, x = weight,
  y = price,
  type = "nonparametric",
  title = "Cars93: Weight vs Price",
  subtitle = "Spearman Correlation"
)

We see that using the ranks of the price and weight variables, we obtain a Spearman’s $\rho = 0.883$ with a very small p-value. Hence we are able to reject the NULL hypothesis and state that there is a relationship between these two variables. The monotonic relationship is evaluated as a rank correlation of 0.883.

# Other ways using other packages
mosaic::cor_test(gpa ~ study_hours, data = gpa_study_hours) %>%
  broom::tidy() %>%
  select(estimate, p.value, conf.low, conf.high)


estimate <dbl>	p.value <dbl>	conf.low <dbl>	conf.high <dbl>
0.1330138	0.06517072	-0.008383868	0.2691966

statsExpressions::corr_test(
  data = gpa_study_hours,
  x = study_hours,
  y = gpa
)


parameter1 <chr>	parameter2 <chr>	effectsize <chr>	estimate <dbl>	conf.level <dbl>	conf.low <dbl>	conf.high <dbl>	statistic <dbl>	df.error <int>	p.value <dbl>
study_hours	gpa	Pearson correlation	0.1330138	0.95	-0.008383868	0.2691966	1.854768	191	0.06517072

Wait, But Why?

Correlation tests are useful to understand the relationship between two variables, but they do not imply causation. A high correlation does not mean that one variable causes the other to change. It is essential to consider the context and other factors that may influence the relationship.

Conclusion

Correlation tests are a powerful way to understand the relationship between two variables. They can be performed using classical methods like Pearson and Spearman correlation, or using more robust methods like permutation tests. The linear model approach provides a deeper understanding of the relationship, especially when the assumptions of normality and homoscedasticity are met.

Your Turn

Try the datasets in the infer package. Use data(package = "infer") in your Console to list the datasets. Then simply type the name of a dataset in a Quarto chunk (e.g. babynames) to read it.
Same with the resampledata and resampledata3 packages.

References

Common statistical tests are linear models (or: how to teach stats) by Jonas Kristoffer Lindeløv
CheatSheet
Common statistical tests are linear models: a work through by Steve Doogue
Jeffrey Walker “Elements of Statistical Modeling for Experimental Biology”
Diez, David M & Barr, Christopher D & Çetinkaya-Rundel, Mine: OpenIntro Statistics
Modern Statistics with R: From wrangling and exploring data to inference and predictive modelling by Måns Thulin
Jeffrey Walker “A linear-model-can-be-fit-to-data-with-continuous-discrete-or-categorical-x-variables”

Package	Version	Citation
easystats	0.7.4	Lüdecke et al. (2022)
ggExtra	0.10.1	Attali and Baker (2023)
ggstatsplot	0.13.1	Patil (2021b)
openintro	2.5.0	Çetinkaya-Rundel et al. (2024)
resampledata3	1.0	Chihara and Hesterberg (2022)
statsExpressions	1.7.0	Patil (2021a)

Attali, Dean, and Christopher Baker. 2023. ggExtra: Add Marginal Histograms to “ggplot2,” and More “ggplot2” Enhancements. https://doi.org/10.32614/CRAN.package.ggExtra.

Çetinkaya-Rundel, Mine, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno, and Christopher Barr. 2024. openintro: Datasets and Supplemental Functions from “OpenIntro” Textbooks and Labs. https://doi.org/10.32614/CRAN.package.openintro.

Chihara, Laura, and Tim Hesterberg. 2022. Resampledata3: Data Sets for “Mathematical Statistics with Resampling and R” (3rd Ed). https://doi.org/10.32614/CRAN.package.resampledata3.

Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Brenton M. Wiernik, Etienne Bacher, Rémi Thériault, and Dominique Makowski. 2022. “easystats: Framework for Easy Statistical Modeling, Visualization, and Reporting.” CRAN. https://doi.org/10.32614/CRAN.package.easystats.

Patil, Indrajeet. 2021a. “statsExpressions: R Package for Tidy Dataframes and Expressions with Statistical Details.” Journal of Open Source Software 6 (61): 3236. https://doi.org/10.21105/joss.03236.

———. 2021b. “Visualizations with statistical details: The ‘ggstatsplot’ approach.” Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.

Citation

BibTeX citation:

@online{v.2022,
  author = {V., Arvind},
  title = {Inference for {Correlation}},
  date = {2022-11-25},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/150-Correlation/},
  langid = {en},
  abstract = {Statistical Significance Tests for Correlations between
    two Variables}
}

For attribution, please cite this work as:

V., Arvind. 2022. “Inference for Correlation.” November 25, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/150-Correlation/.