Inference for Correlation

Author: Arvind V.

Published: November 25, 2022

Modified: October 9, 2024

Abstract: Statistical Significance Tests for Correlations between two Variables

Keywords: Statistics; Tests; p-value

Setting up R packages

# CRAN Packages
library(tidyverse)
library(mosaic)
library(ggformula)
library(broom)
library(mosaicCore)
library(mosaicData)
library(crosstable) # tabulated summary stats

library(openintro) # datasets and methods
library(resampledata3) # datasets
library(statsExpressions) # datasets and methods
library(ggstatsplot) # special stats plots
library(ggExtra)

# Non-CRAN Packages
# remotes::install_github("easystats/easystats")
library(easystats)

Plot Theme

# https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto

# Chunk options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, # Golden Ratio
  # out.width = "80%",
  fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes

theme_custom <- function() {
  font <- "Roboto Condensed" # assign font family up front

  theme_classic(base_size = 14) %+replace% # replace elements we want to change

    theme(
      panel.grid.minor = element_blank(), # strip minor gridlines
      text = element_text(family = font),
      # text elements
      plot.title = element_text( # title
        family = font, # set font family
        # size = 20,               #set font size
        face = "bold", # bold typeface
        hjust = 0, # left align
        # vjust = 2                #raise slightly
        margin = margin(0, 0, 10, 0)
      ),
      plot.subtitle = element_text( # subtitle
        family = font, # font family
        # size = 14,                #font size
        hjust = 0,
        margin = margin(2, 0, 5, 0)
      ),
      plot.caption = element_text( # caption
        family = font, # font family
        size = 8, # font size
        hjust = 1
      ), # right align

      axis.title = element_text( # axis titles
        family = font, # font family
        size = 10 # font size
      ),
      axis.text = element_text( # axis text
        family = font, # axis family
        size = 8
      ) # font size
    )
}

# Set graph theme
theme_set(new = theme_custom())
#

Introduction

Correlations describe how one variable varies with another. One of the basic questions we would ask of our data is: does some variable have a significant correlation with another? Does y vary with x? A Correlation Test is designed to answer exactly this question. The block diagram below depicts the statistical procedures available to test for the significance of correlation scores between two variables.

Before we begin, let us recap a few basic definitions:

We have already encountered the variance of a variable:

$$var_x = \frac{\sum_{i=1}^{n}(x_i - \mu_x)^2}{(n-1)}, \quad \text{where } \mu_x = mean(x), \ n = \text{sample size}$$

The standard deviation is:

$$\sigma_x = \sqrt{var_x}$$

The covariance of two variables is defined as:

$$
\begin{aligned}
cov(x,y) &= \frac{\sum_{i=1}^{n}(x_i - \mu_x) * (y_i - \mu_y)}{n-1} \\
&= \frac{\sum x_i * y_i}{n-1} - \frac{\sum x_i * \mu_y}{n-1} - \frac{\sum y_i * \mu_x}{n-1} + \frac{\sum \mu_x * \mu_y}{n-1} \\
&= \frac{\sum x_i * y_i}{n-1} - \frac{\sum \mu_x * \mu_y}{n-1}
\end{aligned}
$$

Hence covariance is the expectation of the product minus the product of the expectations of the two variables.

Tip: Covariance uses mean-centred values!

Note that in both cases we are dealing with mean-centred values — the variable minus its mean, $x_i - \mu_x$ — the same centring we used when dealing with the CLT and the Gaussian Distribution.

So, finally, the coefficient of correlation between two variables is defined as:

$$\text{correlation } r = \frac{cov(x,y)}{\sigma_x * \sigma_y} = \frac{cov(x,y)}{\sqrt{var_x * var_y}} \tag{1}$$

Thus the correlation coefficient is the covariance scaled by the geometric mean of the variances.
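These definitions translate directly into code. Here is a minimal sketch in plain Python (the logic is language-agnostic; the function names and toy data are ours, not from any dataset in this module), which also checks the "expectation of the product minus product of expectations" identity derived above:

```python
import math

def covariance(x, y):
    # sample covariance: mean-centred cross-products over (n - 1)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    # covariance scaled by the geometric mean of the two variances
    var_x = covariance(x, x)  # cov(x, x) is just var(x)
    var_y = covariance(y, y)
    return covariance(x, y) / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy data
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n, mx, my = len(x), sum(x) / len(x), sum(y) / len(y)
# identity: cov = (sum of products - n * product of means) / (n - 1)
lhs = covariance(x, y)
rhs = (sum(xi * yi for xi, yi in zip(x, y)) - n * mx * my) / (n - 1)
```

Running this, `lhs` and `rhs` agree to floating-point precision, and `correlation(x, y)` is close to 1 since the toy `y` is nearly a linear function of `x`.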

[Decision flowchart: Inference for Correlation]

  • Check Assumptions — Normality: Shapiro-Wilk Test (shapiro.test); Variances: Fisher F-test (var.test)
  • Both Normality and Variances OK? → Parametric: t.test, or the Linear Model Method
  • Normality OK, but not Variances → Parametric: t.test with Welch Correction
  • Not Normal → Non-Parametric: wilcox.test, or a Linear Model with Ranked Data
  • Not Normal → Non-Parametric: Bootstrap or Permutation, or a Linear Model with Ranked Data and Permutation

Case Study #1: Galton’s famous dataset

How can we start, except by using the famous Galton dataset, now part of the mosaicData package?

Workflow: Read and Inspect the Data

data("Galton", package = "mosaicData")
Galton
  family father mother sex height nkids
  1      78.5   67.0   M   73.2   4
  1      78.5   67.0   F   69.2   4
  1      78.5   67.0   F   69.0   4
  1      78.5   67.0   F   69.0   4
  2      75.5   66.5   M   73.5   4
  2      75.5   66.5   M   72.5   4
  2      75.5   66.5   F   65.5   4
  2      75.5   66.5   F   65.5   4
  3      75.0   64.0   M   71.0   2
  3      75.0   64.0   F   68.0   2
  (1-10 of 898 rows shown)

The variables in this dataset are:

Note: Qualitative Variables
  • sex (fct): sex of the child
  • family (fct): an ID for each family

Note: Quantitative Variables
  • father (dbl): father’s height in inches
  • mother (dbl): mother’s height in inches
  • height (dbl): child’s height in inches
  • nkids (int): number of children in each family
inspect(Galton)

categorical variables:  
    name  class levels   n missing
1 family factor    197 898       0
2    sex factor      2 898       0
                                   distribution
1 185 (1.7%), 166 (1.2%), 66 (1.2%) ...        
2 M (51.8%), F (48.2%)                         

quantitative variables:  
    name   class min Q1 median   Q3  max      mean       sd   n missing
1 father numeric  62 68   69.0 71.0 78.5 69.232851 2.470256 898       0
2 mother numeric  58 63   64.0 65.5 70.5 64.084410 2.307025 898       0
3 height numeric  56 64   66.5 69.7 79.0 66.760690 3.582918 898       0
4  nkids integer   1  4    6.0  8.0 15.0  6.135857 2.685156 898       0

So there are several correlations we can explore here: Children’s height vs that of father or mother, based on sex. In essence we are replicating Francis Galton’s famous study.

Workflow: Research Questions

Note: Question 1
Based on this sample, what can we say about the correlation between a son’s height and a father’s height in the population?

Note: Question 2
Based on this sample, what can we say about the correlation between a daughter’s height and a father’s height in the population?

Of course we can formulate more questions, but these are good for now! And since we are going to infer correlations by sex, let us split the dataset into two parts, one for the sons and one for the daughters, and quickly summarise them too:

Galton_sons <- Galton %>%
  dplyr::filter(sex == "M") %>%
  rename("son" = height)
Galton_daughters <- Galton %>%
  dplyr::filter(sex == "F") %>%
  rename("daughter" = height)
dim(Galton_sons)
[1] 465   6
dim(Galton_daughters)
[1] 433   6
Galton_sons %>%
  summarize(across(
    .cols = c(son, father),
    .fns = list(mean = mean, sd = sd)
  ))
  son_mean   son_sd father_mean father_sd
  69.22882 2.631594    69.16817  2.299929
  (1 row)
Galton_daughters %>%
  summarize(across(
    .cols = c(daughter, father),
    .fns = list(mean = mean, sd = sd)
  ))
  daughter_mean daughter_sd father_mean father_sd
       64.11016     2.37032    69.30231  2.641898
  (1 row)

Workflow: Visualization

Let us first quickly plot a graph that is relevant to each of the two research questions.

# Set graph theme
theme_set(new = theme_custom())
#
Galton_sons %>%
  gf_point(son ~ father) %>%
  gf_lm() %>%
  gf_labs(
    title = "Height of Sons vs fathers",
    subtitle = "Galton dataset"
  )
##
Galton_daughters %>%
  gf_point(daughter ~ father) %>%
  gf_lm() %>%
  gf_labs(
    title = "Height of Daughters vs Fathers",
    subtitle = "Galton dataset"
  )

We might even plot the overall heights together and colour by sex of the child:

# Set graph theme
theme_set(new = theme_custom())
#
Galton %>%
  gf_point(height ~ father,
    group = ~sex, colour = ~sex
  ) %>%
  gf_lm() %>%
  gf_labs(
    title = "Height of sons vs fathers",
    subtitle = "Galton dataset"
  )

So daughters are shorter than sons, generally speaking, and both heights seem related to that of the father.

Note: What did filtering do?

When we filtered the dataset into two, the filtering by sex of the child also effectively filtered the heights of the father (and mother). This is proper and desired; but think!

Workflow: Assumptions

For the classical correlation tests, we need the variables to be normally distributed. As before, we check this with the shapiro.test:

shapiro.test(Galton_sons$father)
shapiro.test(Galton_sons$son)
##
shapiro.test(Galton_daughters$father)
shapiro.test(Galton_daughters$daughter)

    Shapiro-Wilk normality test

data:  Galton_sons$father
W = 0.98529, p-value = 0.0001191

    Shapiro-Wilk normality test

data:  Galton_sons$son
W = 0.99135, p-value = 0.008133

    Shapiro-Wilk normality test

data:  Galton_daughters$father
W = 0.98438, p-value = 0.0001297

    Shapiro-Wilk normality test

data:  Galton_daughters$daughter
W = 0.99113, p-value = 0.01071

Let us also check the densities and quantile (QQ) plots of the heights in the dataset:

# Set graph theme
theme_set(new = theme_custom())
#
Galton %>%
  group_by(sex) %>%
  gf_density(~height,
    group = ~sex,
    fill = ~sex
  ) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(title = "Facetted Density Plots")
##
Galton %>%
  group_by(sex) %>%
  gf_qq(~height,
    group = ~sex,
    colour = ~sex, size = 0.5
  ) %>%
  gf_qqline(colour = "black") %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(
    title = "Facetted QQ Plots",
    x = "Theoretical quartiles",
    y = "Actual Data"
  )

and the father’s heights:

# Set graph theme
theme_set(new = theme_custom())
#
##
Galton %>%
  group_by(sex) %>%
  gf_density(~father,
    group = ~sex, # no this is not weird
    fill = ~sex
  ) %>%
  gf_fitdistr(dist = "dnorm") %>%
  gf_refine(scale_fill_discrete(name = "Sex of Child")) %>%
  gf_facet_grid(vars(sex)) %>%
  gf_labs(
    title = "Fathers: Facetted Density Plots",
    subtitle = "By Sex of Child"
  )

Galton %>%
  group_by(sex) %>%
  gf_qq(~father,
    group = ~sex, # no this is not weird
    colour = ~sex, size = 0.5
  ) %>%
  gf_qqline(colour = "black") %>%
  gf_facet_grid(vars(sex)) %>%
  gf_refine(scale_colour_discrete(name = "Sex of Child")) %>%
  gf_labs(
    title = "Fathers Heights: Facetted QQ Plots",
    subtitle = "By Sex of Child",
    x = "Theoretical quartiles",
    y = "Actual Data"
  )

The shapiro.test informs us that the child-related height variables are not normally distributed; though visually there seems to be nothing much to complain about. Hmmm…

Dads are weird anyway, so we must not expect father heights to be normally distributed.

Workflow: Inference

Let us now see how Correlation Tests can be performed based on this dataset, to infer patterns in the population from which this dataset/sample was drawn.

We will go with classical tests first, and then set up a permutation test that does not need any assumptions.

  • Classical Tests
  • Intuitive
  • Permutation Test
  • The Linear Model

We perform the Pearson correlation test first, even though the data are not normal and so we cannot really rely on it. We should therefore also use a non-parametric correlation test, the Spearman correlation.

# Pearson (built-in test)
cor_son_pearson <- cor.test(son ~ father,
  method = "pearson",
  data = Galton_sons
) %>%
  broom::tidy() %>%
  mutate(term = "Pearson Correlation r")
cor_son_pearson
   estimate statistic      p.value parameter  conf.low conf.high
  0.3913174  9.149788 1.824016e-18       463 0.3114667 0.4656805
  method: Pearson's product-moment correlation
  (1 row; 1-7 of 9 columns shown)
cor_son_spearman <- cor.test(son ~ father, method = "spearman", data = Galton_sons) %>%
  broom::tidy() %>%
  mutate(term = "Spearman Correlation r")
cor_son_spearman
   estimate statistic     p.value
  0.4063241   9948441 6.51485e-20
  method: Spearman's rank correlation rho; alternative: two.sided; term: Spearman Correlation r
  (1 row)

Both tests state that the correlation between son and father is significant.

# Pearson (built-in test)
cor_daughter_pearson <- cor.test(daughter ~ father,
  method = "pearson",
  data = Galton_daughters
) %>%
  broom::tidy() %>%
  mutate(term = "Pearson Correlation r")
cor_daughter_pearson
   estimate statistic      p.value parameter  conf.low conf.high
  0.4587605   10.7186 6.355655e-24       431 0.3809944 0.5300812
  method: Pearson's product-moment correlation
  (1 row; 1-7 of 9 columns shown)
##
cor_daughter_spearman <- cor.test(daughter ~ father, method = "spearman", data = Galton_daughters) %>%
  broom::tidy() %>%
  mutate(term = "Spearman Correlation r")
cor_daughter_spearman
   estimate statistic      p.value
  0.4333776   7667213 2.982817e-21
  method: Spearman's rank correlation rho; alternative: two.sided; term: Spearman Correlation r
  (1 row)

Again both tests state that the correlation between daughter and father is significant.

What is happening under the hood in cor.test?

To be Written Up!

We can of course use a randomization-based test for correlation. How would we mechanize this; what aspect would we randomize?

Correlation is calculated on a vector basis: each individual observation of variable #1 is multiplied by the corresponding observation of variable #2. Look at Equation 1! So we might randomize the order of this multiplication to see how uncommon this particular set of multiplications is. That would give us a p-value to decide if the observed correlation is close to the truth. So, onwards with our friend mosaic:

obs_daughter_corr <- cor(Galton_daughters$father, Galton_daughters$daughter)
obs_daughter_corr
[1] 0.4587605
corr_daughter_null <- do(4999) * cor.test(daughter ~ shuffle(father), data = Galton_daughters) %>%
  broom::tidy()
corr_daughter_null
       estimate     statistic      p.value parameter      conf.low
   4.294244e-02  8.923313e-01 0.3727132854       431 -5.150338e-02
  -2.786444e-02 -5.787056e-01 0.5630901213       431 -1.217821e-01
  -6.658551e-02 -1.385426e+00 0.1666388292       431 -1.598201e-01
  -4.601437e-02 -9.562960e-01 0.3394587623       431 -1.396462e-01
  -1.540347e-02 -3.198222e-01 0.7492580553       431 -1.094819e-01
   2.407539e-02  4.999630e-01 0.6173562907       431 -7.032155e-02
  -8.361593e-03 -1.735973e-01 0.8622634941       431 -1.025182e-01
   4.128640e-02  8.578593e-01 0.3914469287       431 -5.315782e-02
  -5.750686e-02 -1.195852e+00 0.2324116738       431 -1.509263e-01
   4.847244e-02  1.007498e+00 0.3142606932       431 -4.597497e-02
  (1-10 of 4,999 rows; 1-5 of 10 columns shown)
corr_daughter_null %>%
  gf_histogram(~estimate, bins = 50) %>%
  gf_vline(
    xintercept = obs_daughter_corr,
    color = "red", linewidth = 1
  ) %>%
  gf_labs(
    title = "Permutation Null Distribution",
    subtitle = "Daughter Heights vs Father Heights"
  )

##
p_value_null <- 2.0 * mean(corr_daughter_null$estimate >= obs_daughter_corr)
p_value_null
[1] 0

We see that with all permutations of father, we are never able to hit the actual obs_daughter_corr! Hence there is a definite correlation between father height and daughter height.
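The shuffle-and-recompute logic used above can be sketched outside R as well. Here is a minimal Python version (the logic is language-agnostic; function names and toy data are ours, mirroring but not reproducing the mosaic workflow):

```python
import random
import statistics

def pearson_r(x, y):
    # Pearson correlation: covariance over product of standard deviations
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

def permutation_p_value(x, y, n_shuffles=4999, seed=1):
    # null distribution: correlation after destroying the x-y pairing
    rng = random.Random(seed)
    observed = pearson_r(x, y)
    xs = list(x)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(xs)                 # break the pairing, keep the values
        if pearson_r(xs, y) >= observed:
            exceed += 1
    return 2 * exceed / n_shuffles      # two-sided, as with mosaic above

# strongly paired toy data: shuffling should almost never beat the observed r
heights_f = [64 + 0.3 * i for i in range(20)]                    # hypothetical "father" heights
heights_d = [59 + 0.25 * i + (i % 3) * 0.4 for i in range(20)]   # hypothetical "daughter" heights
p = permutation_p_value(heights_f, heights_d)
```

With such strongly correlated toy data, `p` comes out at (or extremely near) zero, just as the mosaic permutation test did for the Galton daughters.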

The premise here is that many common statistical tests are special cases of the linear model. A linear model estimates the relationship between a dependent or “response” variable, height, and an explanatory variable or “predictor”, father. It is assumed that the relationship is linear: β0 is the intercept and β1 is the slope of the linear fit, which predicts the value of height based on the value of father.

$$height = \beta_0 + \beta_1 \times father$$

The model for Pearson Correlation tests is exactly this Linear Model:

$$height = \beta_0 + \beta_1 \times father$$
$$H_0: \text{Null Hypothesis} \Rightarrow \beta_1 = 0$$
$$H_a: \text{Alternate Hypothesis} \Rightarrow \beta_1 \neq 0$$

Using the linear model method we get:

# Linear Model
lin_son <- lm(son ~ father, data = Galton_sons) %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1")) %>%
  select(term, estimate, p.value)
lin_son
    term    estimate      p.value
  beta_0 38.2589122 2.642076e-26
  beta_1  0.4477479 1.824016e-18
  (2 rows)
##
lin_daughter <- lm(daughter ~ father, data = Galton_daughters) %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1")) %>%
  select(term, estimate, p.value)
lin_daughter
    term    estimate      p.value
  beta_0 35.5852284 2.573249e-34
  beta_1  0.4116015 6.355655e-24
  (2 rows)

Why are the respective r-s and β1-s different, though the p-value-s are suspiciously the same!? Did we miss a factor of $\frac{sd(son\ or\ daughter)}{sd(father)}$ somewhere…??

Let us standardize the variables (subtract the mean and divide by the sd, giving z-scores with mean 0 and sd 1) and re-do the Linear Model with scaled versions of height and father:

# Scaled linear model
lin_scaled_galton_daughters <- lm(scale(daughter) ~ 1 + scale(father), data = Galton_daughters) %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1"))
lin_scaled_galton_daughters
    term       estimate  std.error     statistic      p.value
  beta_0 -1.532454e-14 0.04275097 -3.584606e-13 1.000000e+00
  beta_1  4.587605e-01 0.04280043  1.071860e+01 6.355655e-24
  (2 rows)

Now you’re talking!! The estimate is the same in both the classical test and the linear model! So we conclude:

  1. When both target and predictor have the same standard deviation, the slope from the linear model and the Pearson correlation are the same.

  2. There is this relationship between the slope in the linear model and the Pearson correlation:

$$\text{Slope } \beta_1 = \frac{sd_y}{sd_x} * r$$

The slope is usually much more interpretable and informative than the correlation coefficient.

  3. Hence a linear model using scale() for both variables will show slope = r.

Slope_Scaled: 0.4587605 = Correlation: 0.4587605

  4. Finally, the p-value for the Pearson Correlation and that for the slope in the linear model is the same (6.355655e-24). This means we can reject the NULL hypothesis of “no relationship” between daughter-s and father-s heights.
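These slope/correlation identities are easy to verify numerically. A small Python sketch (language-agnostic logic; function names and the hypothetical heights are ours, not Galton's):

```python
import statistics

def ols_slope(x, y):
    # least-squares slope for y ~ x
    mx, my = statistics.mean(x), statistics.mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

def scale(v):
    # z-scores: mean 0, sd 1 -- mirrors R's scale()
    m, s = statistics.mean(v), statistics.stdev(v)
    return [(a - m) / s for a in v]

father = [65.0, 66.5, 68.0, 69.0, 70.5, 72.0, 74.0]   # hypothetical heights
child = [63.5, 66.0, 66.5, 68.0, 69.5, 70.0, 72.5]

r = pearson_r(father, child)
b1 = ols_slope(father, child)
sx, sy = statistics.stdev(father), statistics.stdev(child)
# identity 1: slope = (sd_y / sd_x) * r
# identity 2: on standardized data, the slope IS the correlation
b1_scaled = ols_slope(scale(father), scale(child))
```

Both identities hold exactly (up to floating-point error), whatever the data: the first is pure algebra, and the second is the first with both sd-s equal to 1.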

Can you complete this for the sons?

Case Study #2: Study and Grades

In some cases the LINE assumptions may not hold.

Nonlinear relationships, non-normally distributed data (with large outliers), and ordinal rather than continuous data: these situations necessitate the use of Spearman’s ranked correlation scores. (Ranked, not signed-rank.)

See the example below: We choose to look at the gpa_study_hours dataset. It has two numeric columns gpa and study_hours:

glimpse(gpa_study_hours)
Rows: 193
Columns: 2
$ gpa         <dbl> 4.000, 3.800, 3.930, 3.400, 3.200, 3.520, 3.680, 3.400, 3.…
$ study_hours <dbl> 10, 25, 45, 10, 4, 10, 24, 40, 10, 10, 30, 7, 15, 60, 10, …

We can plot this:

# Set graph theme
theme_set(new = theme_custom())
#
ggplot(gpa_study_hours, aes(x = study_hours, y = gpa)) +
  geom_point() +
  geom_smooth()

Hmm… not normally distributed, and there is a sort of increasing relationship; but is it linear? There is also some evidence of heteroscedasticity, so the LINE assumptions are clearly violated. Pearson correlation would not be the best idea here.

Let us quickly try it anyway, using a Linear Model for the scaled gpa and study_hours variables, from where we get:

# Pearson Correlation as Linear Model
model_gpa <-
  lm(scale(gpa) ~ 1 + scale(study_hours), data = gpa_study_hours)
##
model_gpa %>%
  broom::tidy() %>%
  mutate(term = c("beta_0", "beta_1")) %>%
  cbind(confint(model_gpa) %>% as_tibble()) %>%
  select(term, estimate, p.value, `2.5 %`, `97.5 %`)
    term       estimate    p.value        2.5 %    97.5 %
  beta_0 -2.882036e-15 1.00000000 -0.141087199 0.1410872
  beta_1  1.330138e-01 0.06517072 -0.008440359 0.2744679
  (2 rows)

The correlation estimate is 0.133; the p-value is 0.065 (and the confidence interval includes 0).

Hence we fail to reject the NULL hypothesis that study_hours and gpa have no relationship. But can this be right?

Should we use another test, that does not need the LINE assumptions?

“Signed Rank” Values

Most statistical tests use the actual values of the data variables. However, some non-parametric statistical tests use the data in rank-transformed order. (In some cases the signed rank of the data values is used instead of the data itself.)

Signed Rank is calculated as follows:

  1. Take the absolute value of each observation in a sample.

  2. Rank these absolute values. The smallest number gets rank = 1, and so on. This gives us ranked data.

  3. Give each of the ranks the sign of the original observation (+ or −). This gives us signed-rank data.

signed_rank <- function(x) {
  sign(x) * rank(abs(x))
}

Plotting Original and Signed Rank Data

Let us see how this might work by comparing data and its signed-rank version…A quick set of plots:

So the means of the ranks of the three separate variables seem to be in the same order as the means of the data variables themselves.

How about associations between variables? Do ranks reflect the patterns in the data well?

The slopes are almost identical, 0.25 with both the original data and the ranked data for $y1 \sim x$. So ranked and even signed-rank data could work, and if it works even when the LINE assumptions are not satisfied, that would be nice!

How does Sign-Rank data work?

TBD: need to add some explanation here.

Spearman correlation = Pearson correlation applied to the ranks of the data observations. Let’s check how this holds for our x and y1 data:
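As a quick numerical check of this claim, here is a Python sketch (language-agnostic logic; function names and the tie-free toy data are ours) showing that Pearson-on-ranks agrees with the classical Spearman formula:

```python
import statistics

def ranks(v):
    # smallest value gets rank 1; assumes no ties, as in this sketch
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rk, i in enumerate(order, start=1):
        r[i] = rk
    return r

def pearson_r(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

x = [10, 25, 45, 4, 60, 7, 15, 30]            # hypothetical "study hours"
y = [3.4, 3.8, 3.9, 3.2, 4.0, 3.3, 3.6, 3.7]  # hypothetical "gpa"

rho = pearson_r(ranks(x), ranks(y))
# classical Spearman formula for tie-free data: 1 - 6 * sum(d^2) / (n(n^2 - 1))
d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
n = len(x)
rho_formula = 1 - 6 * d2 / (n * (n * n - 1))
```

For tie-free data the two computations coincide exactly; with ties, the Pearson-on-ranks route (with average ranks) is the one that generalizes.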

So the Linear Model for the Ranked Data would be:

$$y = \beta_0 + \beta_1 \times rank(x)$$
$$H_0: \text{Null Hypothesis} \Rightarrow \beta_1 = 0$$
$$H_a: \text{Alternate Hypothesis} \Rightarrow \beta_1 \neq 0$$


Notes:

  1. When ranks are used, the slope of the linear model (β1) has the same value as the Spearman correlation coefficient ( ρ ).

  2. Note that the slope from the linear model now has an intuitive interpretation: the number of ranks y changes for each change in rank of x. ( Ranks are “independent” of sd )

Example

We examine the cars93 data, where the numeric variables of interest are weight and price.

# Set graph theme
theme_set(new = theme_custom())
#

cars93 %>%
  ggplot(aes(weight, price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, lty = 2) +
  labs(title = "Car Weight and Car Price have a nonlinear relationship") +
  theme_classic()

Let us try a Spearman correlation score for these variables, since the data are not linearly related and the variance of price is not constant over weight.

# Set graph theme
theme_set(new = theme_custom())
#

cor.test(cars93$price, cars93$weight, method = "spearman") %>% broom::tidy()
   estimate statistic      p.value
  0.8828317   3073.91 1.066315e-18
  method: Spearman's rank correlation rho; alternative: two.sided
  (1 row)
# Using linear Model
lm(rank(price) ~ rank(weight), data = cars93) %>% summary()

Call:
lm(formula = rank(price) ~ rank(weight), data = cars93)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.0676  -3.0135   0.7815   3.6926  20.4099 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.22074    2.05894   1.564    0.124    
rank(weight)  0.88288    0.06514  13.554   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.46 on 52 degrees of freedom
Multiple R-squared:  0.7794,    Adjusted R-squared:  0.7751 
F-statistic: 183.7 on 1 and 52 DF,  p-value: < 2.2e-16
# Stats Plot
ggstatsplot::ggscatterstats(
  data = cars93, x = weight,
  y = price,
  type = "nonparametric",
  title = "Cars93: Weight vs Price",
  subtitle = "Spearman Correlation"
)

We see that using ranks of the price variable, we obtain a Spearman’s ρ = 0.882 with a very small p-value. Hence we reject the NULL hypothesis and state that there is a relationship between these two variables. The monotonic relationship is evaluated as a correlation of 0.882.

# Other ways using other packages
mosaic::cor_test(gpa ~ study_hours, data = gpa_study_hours) %>%
  broom::tidy() %>%
  select(estimate, p.value, conf.low, conf.high)
   estimate    p.value     conf.low conf.high
  0.1330138 0.06517072 -0.008383868 0.2691966
  (1 row)
statsExpressions::corr_test(
  data = gpa_study_hours,
  x = study_hours,
  y = gpa
)
   parameter1 parameter2          effectsize  estimate conf.level
  study_hours        gpa Pearson correlation 0.1330138       0.95
  (1 row; 1-5 of 14 columns shown)

Wait, But Why?

Conclusion

Your Turn

  1. Try the datasets in the infer package. Use data(package = "infer") in your Console to list out the datasets. Then simply type the name of the dataset in a Quarto chunk ( e.g. babynames) to read it.

  2. Same with the resampledata and resampledata3 packages.

References

  1. Common statistical tests are linear models (or: how to teach stats) by Jonas Kristoffer Lindeløv
    CheatSheet
  2. Common statistical tests are linear models: a work through by Steve Doogue
  3. Jeffrey Walker “Elements of Statistical Modeling for Experimental Biology”
  4. Diez, David M & Barr, Christopher D & Çetinkaya-Rundel, Mine: OpenIntro Statistics
  5. Modern Statistics with R: From wrangling and exploring data to inference and predictive modelling by Måns Thulin
  6. Jeffrey Walker “A linear-model-can-be-fit-to-data-with-continuous-discrete-or-categorical-x-variables”
Package Version Citation
easystats 0.7.3 Lüdecke et al. (2022)
ggExtra 0.10.1 Attali and Baker (2023)
ggstatsplot 0.12.4 Patil (2021b)
openintro 2.5.0 Çetinkaya-Rundel et al. (2024)
resampledata3 1.0 Chihara and Hesterberg (2022)
statsExpressions 1.6.0 Patil (2021a)
Attali, Dean, and Christopher Baker. 2023. ggExtra: Add Marginal Histograms to “ggplot2,” and More “ggplot2” Enhancements. https://CRAN.R-project.org/package=ggExtra.
Çetinkaya-Rundel, Mine, David Diez, Andrew Bray, Albert Y. Kim, Ben Baumer, Chester Ismay, Nick Paterno, and Christopher Barr. 2024. openintro: Datasets and Supplemental Functions from “OpenIntro” Textbooks and Labs. https://CRAN.R-project.org/package=openintro.
Chihara, Laura, and Tim Hesterberg. 2022. Resampledata3: Data Sets for “Mathematical Statistics with Resampling and R” (3rd Ed). https://CRAN.R-project.org/package=resampledata3.
Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, Brenton M. Wiernik, Etienne Bacher, Rémi Thériault, and Dominique Makowski. 2022. “easystats: Framework for Easy Statistical Modeling, Visualization, and Reporting.” CRAN. https://doi.org/10.32614/CRAN.package.easystats.
Patil, Indrajeet. 2021a. “statsExpressions: R Package for Tidy Dataframes and Expressions with Statistical Details.” Journal of Open Source Software 6 (61): 3236. https://doi.org/10.21105/joss.03236.
———. 2021b. “Visualizations with statistical details: The ‘ggstatsplot’ approach.” Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.

Citation

BibTeX citation:
@online{v.2022,
  author = {V., Arvind},
  title = {Inference for {Correlation}},
  date = {2022-11-25},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/150-Correlation/},
  langid = {en},
  abstract = {Statistical Significance Tests for Correlations between
    two Variables}
}
For attribution, please cite this work as:
V., Arvind. 2022. “Inference for Correlation.” November 25, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/150-Correlation/.

License: CC BY-SA 2.0
