Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors
  1. Tutorial on Correlations in R
  • Teaching
    • Data Analytics for Managers and Creators
      • Tools
        • Introduction to R and RStudio
        • Introduction to Radiant
        • Introduction to Orange
      • Descriptive Analytics
        • Data
        • Summaries
        • Counts
        • Quantities
        • Groups
        • Densities
        • Groups and Densities
        • Change
        • Proportions
        • Parts of a Whole
        • Evolution and Flow
        • Ratings and Rankings
        • Surveys
        • Time
        • Space
        • Networks
        • Experiments
        • Miscellaneous Graphing Tools, and References
      • Statistical Inference
        • 🧭 Basics of Statistical Inference
        • 🎲 Samples, Populations, Statistics and Inference
        • Basics of Randomization Tests
        • 🃏 Inference for a Single Mean
        • 🃏 Inference for Two Independent Means
        • 🃏 Inference for Comparing Two Paired Means
        • Comparing Multiple Means with ANOVA
        • Inference for Correlation
        • 🃏 Testing a Single Proportion
        • 🃏 Inference Test for Two Proportions
      • Inferential Modelling
        • Modelling with Linear Regression
        • Modelling with Logistic Regression
        • 🕔 Modelling and Predicting Time Series
      • Predictive Modelling
        • 🐉 Intro to Orange
        • ML - Regression
        • ML - Classification
        • ML - Clustering
      • Prescriptive Modelling
        • 📐 Intro to Linear Programming
        • 💭 The Simplex Method - Intuitively
        • 📅 The Simplex Method - In Excel
      • Workflow
        • Facing the Abyss
        • I Publish, therefore I Am
      • Case Studies
        • Demo:Product Packaging and Elderly People
        • Ikea Furniture
        • Movie Profits
        • Gender at the Work Place
        • Heptathlon
        • School Scores
        • Children's Games
        • Valentine’s Day Spending
        • Women Live Longer?
        • Hearing Loss in Children
        • California Transit Payments
        • Seaweed Nutrients
        • Coffee Flavours
        • Legionnaire’s Disease in the USA
        • Antarctic Sea ice
        • William Farr's Observations on Cholera in London
    • R for Artists and Managers
      • 🕶 Lab-1: Science, Human Experience, Experiments, and Data
      • Lab-2: Down the R-abbit Hole…
      • Lab-3: Drink Me!
      • Lab-4: I say what I mean and I mean what I say
      • Lab-5: Twas brillig, and the slithy toves…
      • Lab-6: These Roses have been Painted !!
      • Lab-7: The Lobster Quadrille
      • Lab-8: Did you ever see such a thing as a drawing of a muchness?
      • Lab-9: If you please sir…which way to the Secret Garden?
      • Lab-10: An Invitation from the Queen…to play Croquet
      • Lab-11: The Queen of Hearts, She Made some Tarts
      • Lab-12: Time is a Him!!
      • Iteration: Learning to purrr
      • Lab-13: Old Tortoise Taught Us
      • Lab-14: You’re are Nothing but a Pack of Cards!!
    • ML for Artists and Managers
      • 🐉 Intro to Orange
      • ML - Regression
      • ML - Classification
      • ML - Clustering
      • 🕔 Modelling Time Series
    • TRIZ for Problem Solvers
      • I am Water
      • I am What I yam
      • Birds of Different Feathers
      • I Connect therefore I am
      • I Think, Therefore I am
      • The Art of Parallel Thinking
      • A Year of Metaphoric Thinking
      • TRIZ - Problems and Contradictions
      • TRIZ - The Unreasonable Effectiveness of Available Resources
      • TRIZ - The Ideal Final Result
      • TRIZ - A Contradictory Language
      • TRIZ - The Contradiction Matrix Workflow
      • TRIZ - The Laws of Evolution
      • TRIZ - Substance Field Analysis, and ARIZ
    • Math Models for Creative Coders
      • Maths Basics
        • Vectors
        • Matrix Algebra Whirlwind Tour
        • content/courses/MathModelsDesign/Modules/05-Maths/70-MultiDimensionGeometry/index.qmd
      • Tech
        • Tools and Installation
        • Adding Libraries to p5.js
        • Using Constructor Objects in p5.js
      • Geometry
        • Circles
        • Complex Numbers
        • Fractals
        • Affine Transformation Fractals
        • L-Systems
        • Kolams and Lusona
      • Media
        • Fourier Series
        • Additive Sound Synthesis
        • Making Noise Predictably
        • The Karplus-Strong Guitar Algorithm
      • AI
        • Working with Neural Nets
        • The Perceptron
        • The Multilayer Perceptron
        • MLPs and Backpropagation
        • Gradient Descent
      • Projects
        • Projects
    • Data Science with No Code
      • Data
      • Orange
      • Summaries
      • Counts
      • Quantity
      • 🕶 Happy Data are all Alike
      • Groups
      • Change
      • Rhythm
      • Proportions
      • Flow
      • Structure
      • Ranking
      • Space
      • Time
      • Networks
      • Surveys
      • Experiments
    • Tech for Creative Education
      • 🧭 Using Idyll
      • 🧭 Using Apparatus
      • 🧭 Using g9.js
    • Literary Jukebox: In Short, the World
      • Italy - Dino Buzzati
      • France - Guy de Maupassant
      • Japan - Hisaye Yamamoto
      • Peru - Ventura Garcia Calderon
      • Russia - Maxim Gorky
      • Egypt - Alifa Rifaat
      • Brazil - Clarice Lispector
      • England - V S Pritchett
      • Russia - Ivan Bunin
      • Czechia - Milan Kundera
      • Sweden - Lars Gustaffsson
      • Canada - John Cheever
      • Ireland - William Trevor
      • USA - Raymond Carver
      • Italy - Primo Levi
      • India - Ruth Prawer Jhabvala
      • USA - Carson McCullers
      • Zimbabwe - Petina Gappah
      • India - Bharati Mukherjee
      • USA - Lucia Berlin
      • USA - Grace Paley
      • England - Angela Carter
      • USA - Kurt Vonnegut
      • Spain-Merce Rodoreda
      • Israel - Ruth Calderon
      • Israel - Etgar Keret
  • Posts
  • Blogs and Talks

On this page

  • Introduction
  • Setting up R Packages
  • Case Study 1: Galton Dataset from mosaicData
    • Inspecting the Data
    • Correlations and Plots
    • Visualizing Uncertainty in Correlation Estimates
  • Case Study#2: Dataset from NHANES
    • Inspecting the Data
    • Correlations and Plots
    • Visualizing Uncertainty in Correlation Estimates
  • Conclusion
  • References

Tutorial on Correlations in R

Author

Arvind Venkatadri

Published

November 22, 2022

Modified

May 21, 2024

Abstract
Tests, Tables, and Graphs for Correlations in R

Introduction

We will create Tables for Correlations, and graphs for Correlations in R. As always, we will consistently use the Project Mosaic ecosystem of packages in R (mosaic, mosaicData and ggformula).

Setting up R Packages

library(tidyverse)
library(mosaic) # package for stats, simulations, and basic plots
library(ggformula) # package for professional looking plots, that use the formula interface from mosaic
library(skimr)
library(GGally)
library(corrplot) # For Correlogram plots
library(broom) # to properly format stat test results

library(mosaicData) # package containing datasets
library(NHANES) # survey data collected by the US National Center for Health Statistics (NCHS)

All R functions seen in the code are clickable links that take you to online documentation about the function. Try!
TipThe Formula interface

Note the standard method for all commands from the mosaic package:

goal( y ~ x | z, data = mydata, …)

With ggformula, one can create any graph/chart using:

gf_geometry(y ~ x | z, data = mydata)

OR

mydata %>% gf_geometry( y ~ x | z )

The second method may be preferable, especially if you have done some data manipulation first! More about this later!

Case Study 1: Galton Dataset from mosaicData

Let us inspect what datasets are available in the package mosaicData. Run this command in your Console:

# Run in Console
data(package = "mosaicData")

The popup tab shows a lot of datasets we could use. Let us continue to use the famous Galton dataset and inspect it:

data("Galton")

Inspecting the Data

The inspect command already gives us a series of statistical measures of different variables of interest. As discussed previously, we can retain the output of inspect and use it in our reports: (there are ways of dressing up these tables too)

galton_describe <- inspect(Galton)

galton_describe$categorical
ABCDEFGHIJ0123456789
name
<chr>
class
<chr>
levels
<int>
n
<int>
missing
<int>
distribution
<chr>
familyfactor1978980185 (1.7%), 166 (1.2%), 66 (1.2%) ...
sexfactor28980M (51.8%), F (48.2%)
2 rows
galton_describe$quantitative
ABCDEFGHIJ0123456789
 
 
name
<chr>
class
<chr>
min
<dbl>
Q1
<dbl>
median
<dbl>
Q3
<dbl>
max
<dbl>
mean
<dbl>
sd
<dbl>
1fathernumeric626869.071.078.569.2328512.470256
2mothernumeric586364.065.570.564.0844102.307025
3heightnumeric566466.569.779.066.7606903.582918
4nkidsinteger146.08.015.06.1358572.685156
4 rows | 1-10 of 12 columns

Try help("Galton") in your Console. The dataset is described as:

A data frame with 898 observations on the following variables.
- family a factor with levels for each family
- father the father’s height (in inches)
- mother the mother’s height (in inches)
- sex the child’s sex: F or M
- height the child’s height as an adult (in inches)
- nkids the number of adult children in the family, or, at least, the number whose heights Galton recorded.

There is a lot of Description generated by the mosaic::inspect() command ! Let us also look at the output of skim:

skimr::skim(Galton)
Data summary
Name Galton
Number of rows 898
Number of columns 6
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
family 0 1 FALSE 197 185: 15, 166: 11, 66: 11, 130: 10
sex 0 1 FALSE 2 M: 465, F: 433

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
father 0 1 69.23 2.47 62 68 69.0 71.0 78.5 ▁▅▇▂▁
mother 0 1 64.08 2.31 58 63 64.0 65.5 70.5 ▂▅▇▃▁
height 0 1 66.76 3.58 56 64 66.5 69.7 79.0 ▁▇▇▅▁
nkids 0 1 6.14 2.69 1 4 6.0 8.0 15.0 ▃▇▆▂▁

What can we say about the dataset and its variables? How big is the dataset? How many variables? What types are they, Quant or Qual? If they are Qual, what are the levels? Are they ordered levels? Which variables could have relationships with others? Why? Write down these Questions!

Correlations and Plots

What Questions might we have, that we could answer with a Statistical Measure, or Correlation chart?

NoteQuestions

How does children’s height correlate with that of father and mother? Is this relationship also affected by sex of the child?

With this question, height becomes our target variable, which we should always plot on the dependent y-axis.

# Pulling out the list of Quant variables from NHANES
galton_quant <- galton_describe$quantitative
galton_quant$name
[1] "father" "mother" "height" "nkids" 
GGally::ggpairs(
  Galton,
  
  # Choose the variables we want to plot for
  columns = c("father", "mother", "height", "nkids"),
  
  switch = "both", # axis labels in more traditional locations
  progress = FALSE, # no compute progress messages needed
  
  # Choose the diagonal graphs (always single variable! Think!)
  diag = list(continuous = "barDiag"), # choosing histogram,not density
  
  # Choose lower triangle graphs, two-variable graphs
  lower = list(continuous = wrap("smooth", alpha = 0.1)),
  
  title = "Galton Data Correlations Plot"
) + 
  
  theme_bw()

We note that children’s height is correlated with that of father and mother. The correlations are both positive, and that with father seems to be the larger of the two. ( Look at the slopes of the lines and the values of the correlation scores. )

NoteQuestion

What if we group the Quant variables based on a Qual variable, like sex of the child?

# Pulling out the list of Quant variables from NHANES
galton_quant <- galton_describe$quantitative
galton_quant$name
[1] "father" "mother" "height" "nkids" 
GGally::ggpairs(
  Galton,
  
  mapping = aes(colour = sex), # Colour by `sex`

  # Choose the variables we want to plot for
  columns = c("father", "mother", "height", "nkids"),
  switch = "both", # axis labels in more traditional locations
  progress = FALSE, # no compute progress messages needed
  
  diag = list(continuous = "barDiag"),
  
  # Choose lower triangle graphs, two-variable graphs
  lower = list(continuous = wrap("smooth", alpha = 0.1)),
  
  title = "Galton Data Correlations Plot"
) + 
  
  theme_bw()

The split scatter plots are useful, as is the split histogram for height: Clearly the correlation of children’s height with father and mother is positive for both sex-es. The other plots, and even some of the correlations scores are not all useful! Just shows everything we can compute is not necessarily useful immediately.

In later modules we will see how to plot correlations when the number of variables is larger still.

NoteQuestion

Can we plot a Correlogram for this dataset?

#library(corrplot)

galton_num_var <- Galton %>% select(father, mother, height, nkids)
galton_cor <- cor(galton_num_var)
galton_cor %>%
  corrplot(method = "ellipse",
           type = "lower",
           main = "Correlogram for Galton dataset")

Clearly height is positively correlated to father and mother; interestingly, height is negatively correlated ( slightly) with nkids.

NoteQuestion

Let us confirm with a correlation test:

We will use the mosaic function cor_test to get these results:

mosaic::cor_test(height ~ father, data = Galton) %>% 
  broom::tidy() %>% 
  knitr::kable(digits = 2,
               caption = "Children vs Fathers")
Children vs Fathers
estimate statistic p.value parameter conf.low conf.high method alternative
0.28 8.57 0 896 0.21 0.33 Pearson’s product-moment correlation two.sided
mosaic::cor_test(height ~ mother, data = Galton) %>% 
  broom::tidy() %>% 
    knitr::kable(digits = 2,
               caption = "Children vs Mothers")
Children vs Mothers
estimate statistic p.value parameter conf.low conf.high method alternative
0.2 6.16 0 896 0.14 0.26 Pearson’s product-moment correlation two.sided
ImportantCorrelation Scores and Uncertainty

Note how the mosaic::cor_test() reports a correlation score estimate and the p-value for the same. There is also a confidence interval reported for the correlation score, an interval within which we are 95% sure that the true correlation value is to be found.

Note that GGally::ggpairs() too reports the significance of the correlation scores estimates using *** or **. This indicates the p-value in the scores obtained by GGally; Presumably, there is an internal cor_test that is run for each pair of variables and the p-value and confidence levels are also computed internally.

In both cases, we used the formula height∼other−variable, in keeping with our idea of height being the dependent, target variable..

We also see the p.value for the estimateed correlation is negligible, and the conf.low/conf.high interval does not straddle 0. These attest to the significance of the correlation score.

NoteQuestion

What does this correlation look when split by sex of Child?

# For the sons

mosaic::cor_test(height ~ father,
                 data = Galton %>% filter(sex == "M")) %>% 
  broom::tidy() %>% knitr::kable(digits = 2,
                                 caption = "Sons vs Fathers")
cor_test(height ~ mother, 
         data = Galton %>% filter(sex == "M")) %>% 
  broom::tidy() %>% knitr::kable(digits = 2,
                                 caption = "Sons vs Mothers")

# For the daughters
cor_test(height ~ father, 
         data = Galton %>% filter(sex == "F")) %>% 
  broom::tidy() %>% knitr::kable(digits = 2,
                                 caption = "Daughters vs Fathers")
cor_test(height ~ mother, 
         data = Galton %>% filter(sex == "F")) %>% 
  broom::tidy() %>% knitr::kable(digits = 2,
                                 caption = "Daughters vs Mothers")
Sons vs Fathers
estimate statistic p.value parameter conf.low conf.high method alternative
0.39 9.15 0 463 0.31 0.47 Pearson’s product-moment correlation two.sided
Sons vs Mothers
estimate statistic p.value parameter conf.low conf.high method alternative
0.33 7.63 0 463 0.25 0.41 Pearson’s product-moment correlation two.sided
Daughters vs Fathers
estimate statistic p.value parameter conf.low conf.high method alternative
0.46 10.72 0 431 0.38 0.53 Pearson’s product-moment correlation two.sided
Daughters vs Mothers
estimate statistic p.value parameter conf.low conf.high method alternative
0.31 6.86 0 431 0.23 0.4 Pearson’s product-moment correlation two.sided

The same observation as made above ( p.value and confidence intervals) applies here too and tells us that the estimated correlations are significant.

Visualizing Uncertainty in Correlation Estimates

We can also visualize this uncertainty and the confidence levels in a plot too, using gf_errorbar and a handy set of functions within purrr which is part of the tidyverse. Assuming heights is the target variable we want to correlate every other (quantitative) variable against, we can proceed very quickly as follows: we will first plot correlation uncertainty for one pair of variables to develop the intuition, and then for all variables against the one target variable:

mosaic::cor_test(height ~ mother, data = Galton) %>% 
  broom::tidy() %>% 

# We need a graph not a table 
# So comment out this line from the earlier code
#knitr::kable(digits = 2,caption = "Children vs Mothers")

rowid_to_column(var = "index") %>% # Need an index to plot with
  
  # Uncertainty as error-bars
  gf_errorbar(conf.high + conf.low ~ index, linewidth = 2) %>% 
  
  # Estimate as a point
  gf_point(estimate ~ index, color = "red", size = 6) %>% 
  
  # Labels
  gf_text(estimate ~ index - 0.2, 
             label = "Correlation Score = estimate") %>% 
  gf_text(conf.high*0.98 ~ index - 0.25, 
           label = "Upper Limit = estimate + conf.high") %>%   
  gf_text(conf.low*1.04 ~ index - 0.25, 
           label = "Lower Limit = estimate - conf.low") %>% 
  gf_theme(theme_bw())

We can now do this for all variables against the target variable height, which we identified in our research question. We will use the iteration capabilities offered by the tidyverse package, purrr:

all_corrs <- Galton %>% 
  select(where(is.numeric)) %>% 
  
  # leave off height to get all the remaining ones
  select(- height) %>%  
  
  # perform a cor.test for all variables against height
  purrr::map(.x = .,
             .f = \(x) cor.test(x, Galton$height)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor") 

all_corrs
ABCDEFGHIJ0123456789
predictor
<chr>
estimate
<dbl>
statistic
<dbl>
p.value
<dbl>
parameter
<int>
conf.low
<dbl>
conf.high
<dbl>
father0.27535488.5737044.354588e-178960.21378510.33474554
mother0.20165496.1627931.079105e-098960.13805540.26359819
nkids-0.1269101-3.8298001.371780e-04896-0.1907472-0.06200411
3 rows | 1-7 of 9 columns
all_corrs %>% 
  
  # arrange the predictors in order of their correlation scores
  # with the target variable (`height`)
  # Add errorbars to show uncertainty ranges / confidence intervals
  # Use errorbar width and linewidth fo emphasis
  gf_errorbar(conf.high + conf.low ~ reorder(predictor, estimate),
              color = ~ estimate,
              width = 0.2,
              linewidth = ~ -log10(p.value)) %>% 
  
  # All correlation estimates as points
  gf_point(estimate ~ reorder(predictor, estimate), 
           color = "black") %>% 
  
  # Reference line at zero correlation score
  gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>% 
  
  # Themes,Titles, and Scales
  gf_labs(x = NULL, y = "Correlation with height in Galton", 
          caption = "Significance = - log10(p.value)") %>% 
  
  gf_refine(
    
    # Scale for colour
 scale_colour_distiller("Correlation", type = "div", palette = "RdBu"),
            
    # Scale for dumbbells!!
    scale_linewidth_continuous("significance",
                                       range = c(0.5,4))) %>% 
  gf_refine(guides(linewidth = guide_legend(reverse = TRUE))) %>%
  gf_theme(theme_classic())

We can clearly see the size of the correlations and the confidence intervals marked in this plot. father has somewhat greater correlation with children’s height, as compared to mother. nkids seems to matter very little. This kind of plot will be very useful when we pursue linear regression models.

NoteQuestion

How can we show this correlation in a set of Scatter Plots + Regression Lines? Can we recreate Galton’s famous diagram?

# For the sons
gf_point(height ~ father, 
         data = Galton %>% filter(sex == "M"),
         title = "Soms and Fathers") %>%
  gf_smooth(method = "lm") %>%
  gf_theme(theme_minimal())
gf_point(height ~ mother, 
         data = Galton %>% filter(sex == "M"),
         title = "Sons and Mothers") %>%
  gf_smooth(method = "lm") %>%
  gf_theme(theme_minimal())
# For the daughters
gf_point(height ~ father, 
         data = Galton %>% filter(sex == "F"),
         title = "Daughters and Fathers") %>%
  gf_smooth(method = "lm") %>%
  gf_theme(theme_minimal())
gf_point(height ~ mother, 
         data = Galton %>% filter(sex == "F"),
         title = "Daughters and Mothers") %>%
  gf_smooth(method = "lm") %>% 
  gf_theme(theme_minimal())

An approximation to Galton’s famous plot1 (see Wikipedia):

gf_point(height ~ (father + mother)/2, data = Galton) %>% 
  gf_smooth(method = "lm") %>% 
  gf_density_2d(n = 8) %>% 
  gf_abline(slope = 1) %>% 
  gf_theme(theme_minimal())
gf_point(height ~ (father + mother)/2, data = Galton) %>% 
  gf_smooth(method = "lm") %>% 
  gf_ellipse(level = 0.95, color = "red") %>% 
    gf_ellipse(level = 0.75, color = "blue") %>% 
    gf_ellipse(level = 0.5, color = "green") %>% 
  gf_abline(slope = 1) %>% 
  gf_theme(theme_minimal())

How would you interpret this plot2?

Case Study#2: Dataset from NHANES

Let us look at the NHANES dataset from the package NHANES:

data("NHANES")

Inspecting the Data

NHANES_describe <- inspect(NHANES)

NHANES_describe$categorical
NHANES_describe$quantitative
NHANES
skimr::skim(NHANES)
ABCDEFGHIJ0123456789
name
<chr>
class
<chr>
levels
<int>
SurveyYrfactor2
Genderfactor2
AgeDecadefactor8
Race1factor5
Race3factor6
Educationfactor5
MaritalStatusfactor6
HHIncomefactor12
HomeOwnfactor3
Workfactor3
Next
12
Previous
1-10 of 31 rows | 1-3 of 6 columns
ABCDEFGHIJ0123456789
 
 
name
<chr>
class
<chr>
1IDinteger
2Ageinteger
3AgeMonthsinteger
4HHIncomeMidinteger
5Povertynumeric
6HomeRoomsinteger
7Weightnumeric
8Lengthnumeric
9HeadCircnumeric
10Heightnumeric
Next
12
Previous
1-10 of 45 rows | 1-3 of 12 columns
ABCDEFGHIJ0123456789
ID
<int>
SurveyYr
<fct>
Gender
<fct>
Age
<int>
516242009_10male34
516242009_10male34
516242009_10male34
516252009_10male4
516302009_10female49
516382009_10male9
516462009_10male8
516472009_10female45
516472009_10female45
516472009_10female45
Next
12
Previous
1-10 of 10,000 rows | 1-4 of 76 columns
Data summary
Name NHANES
Number of rows 10000
Number of columns 76
_______________________
Column type frequency:
factor 31
numeric 45
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
SurveyYr 0 1.00 FALSE 2 200: 5000, 201: 5000
Gender 0 1.00 FALSE 2 fem: 5020, mal: 4980
AgeDecade 333 0.97 FALSE 8 40: 1398, 0-: 1391, 10: 1374, 20: 1356
Race1 0 1.00 FALSE 5 Whi: 6372, Bla: 1197, Mex: 1015, Oth: 806
Race3 5000 0.50 FALSE 6 Whi: 3135, Bla: 589, Mex: 480, His: 350
Education 2779 0.72 FALSE 5 Som: 2267, Col: 2098, Hig: 1517, 9 -: 888
MaritalStatus 2769 0.72 FALSE 6 Mar: 3945, Nev: 1380, Div: 707, Liv: 560
HHIncome 811 0.92 FALSE 12 mor: 2220, 750: 1084, 250: 958, 350: 863
HomeOwn 63 0.99 FALSE 3 Own: 6425, Ren: 3287, Oth: 225
Work 2229 0.78 FALSE 3 Wor: 4613, Not: 2847, Loo: 311
BMICatUnder20yrs 8726 0.13 FALSE 4 Nor: 805, Obe: 221, Ove: 193, Und: 55
BMI_WHO 397 0.96 FALSE 4 18.: 2911, 30.: 2751, 25.: 2664, 12.: 1277
Diabetes 142 0.99 FALSE 2 No: 9098, Yes: 760
HealthGen 2461 0.75 FALSE 5 Goo: 2956, Vgo: 2508, Fai: 1010, Exc: 878
LittleInterest 3333 0.67 FALSE 3 Non: 5103, Sev: 1130, Mos: 434
Depressed 3327 0.67 FALSE 3 Non: 5246, Sev: 1009, Mos: 418
SleepTrouble 2228 0.78 FALSE 2 No: 5799, Yes: 1973
PhysActive 1674 0.83 FALSE 2 Yes: 4649, No: 3677
TVHrsDay 5141 0.49 FALSE 7 2_h: 1275, 1_h: 884, 3_h: 836, 0_t: 638
CompHrsDay 5137 0.49 FALSE 7 0_t: 1409, 0_h: 1073, 1_h: 1030, 2_h: 589
Alcohol12PlusYr 3420 0.66 FALSE 2 Yes: 5212, No: 1368
SmokeNow 6789 0.32 FALSE 2 No: 1745, Yes: 1466
Smoke100 2765 0.72 FALSE 2 No: 4024, Yes: 3211
Smoke100n 2765 0.72 FALSE 2 Non: 4024, Smo: 3211
Marijuana 5059 0.49 FALSE 2 Yes: 2892, No: 2049
RegularMarij 5059 0.49 FALSE 2 No: 3575, Yes: 1366
HardDrugs 4235 0.58 FALSE 2 No: 4700, Yes: 1065
SexEver 4233 0.58 FALSE 2 Yes: 5544, No: 223
SameSex 4232 0.58 FALSE 2 No: 5353, Yes: 415
SexOrientation 5158 0.48 FALSE 3 Het: 4638, Bis: 119, Hom: 85
PregnantNow 8304 0.17 FALSE 3 No: 1573, Yes: 72, Unk: 51

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1.00 61944.64 5871.17 51624.00 56904.50 62159.50 67039.00 71915.00 ▇▇▇▇▇
Age 0 1.00 36.74 22.40 0.00 17.00 36.00 54.00 80.00 ▇▇▇▆▅
AgeMonths 5038 0.50 420.12 259.04 0.00 199.00 418.00 624.00 959.00 ▇▇▇▆▃
HHIncomeMid 811 0.92 57206.17 33020.28 2500.00 30000.00 50000.00 87500.00 100000.00 ▃▆▃▁▇
Poverty 726 0.93 2.80 1.68 0.00 1.24 2.70 4.71 5.00 ▅▅▃▃▇
HomeRooms 69 0.99 6.25 2.28 1.00 5.00 6.00 8.00 13.00 ▂▆▇▂▁
Weight 78 0.99 70.98 29.13 2.80 56.10 72.70 88.90 230.70 ▂▇▂▁▁
Length 9457 0.05 85.02 13.71 47.10 75.70 87.00 96.10 112.20 ▁▃▆▇▃
HeadCirc 9912 0.01 41.18 2.31 34.20 39.58 41.45 42.92 45.40 ▁▂▇▇▅
Height 353 0.96 161.88 20.19 83.60 156.80 166.00 174.50 200.40 ▁▁▁▇▂
BMI 366 0.96 26.66 7.38 12.88 21.58 25.98 30.89 81.25 ▇▆▁▁▁
Pulse 1437 0.86 73.56 12.16 40.00 64.00 72.00 82.00 136.00 ▂▇▃▁▁
BPSysAve 1449 0.86 118.15 17.25 76.00 106.00 116.00 127.00 226.00 ▃▇▂▁▁
BPDiaAve 1449 0.86 67.48 14.35 0.00 61.00 69.00 76.00 116.00 ▁▁▇▇▁
BPSys1 1763 0.82 119.09 17.50 72.00 106.00 116.00 128.00 232.00 ▂▇▂▁▁
BPDia1 1763 0.82 68.28 13.78 0.00 62.00 70.00 76.00 118.00 ▁▁▇▆▁
BPSys2 1647 0.84 118.48 17.49 76.00 106.00 116.00 128.00 226.00 ▃▇▂▁▁
BPDia2 1647 0.84 67.66 14.42 0.00 60.00 68.00 76.00 118.00 ▁▁▇▆▁
BPSys3 1635 0.84 117.93 17.18 76.00 106.00 116.00 126.00 226.00 ▃▇▂▁▁
BPDia3 1635 0.84 67.30 14.96 0.00 60.00 68.00 76.00 116.00 ▁▁▇▇▁
Testosterone 5874 0.41 197.90 226.50 0.25 17.70 43.82 362.41 1795.60 ▇▂▁▁▁
DirectChol 1526 0.85 1.36 0.40 0.39 1.09 1.29 1.58 4.03 ▅▇▂▁▁
TotChol 1526 0.85 4.88 1.08 1.53 4.11 4.78 5.53 13.65 ▂▇▁▁▁
UrineVol1 987 0.90 118.52 90.34 0.00 50.00 94.00 164.00 510.00 ▇▅▂▁▁
UrineFlow1 1603 0.84 0.98 0.95 0.00 0.40 0.70 1.22 17.17 ▇▁▁▁▁
UrineVol2 8522 0.15 119.68 90.16 0.00 52.00 95.00 171.75 409.00 ▇▆▃▂▁
UrineFlow2 8524 0.15 1.15 1.07 0.00 0.48 0.76 1.51 13.69 ▇▁▁▁▁
DiabetesAge 9371 0.06 48.42 15.68 1.00 40.00 50.00 58.00 80.00 ▁▂▆▇▂
DaysPhysHlthBad 2468 0.75 3.33 7.40 0.00 0.00 0.00 3.00 30.00 ▇▁▁▁▁
DaysMentHlthBad 2466 0.75 4.13 7.83 0.00 0.00 0.00 4.00 30.00 ▇▁▁▁▁
nPregnancies 7396 0.26 3.03 1.80 1.00 2.00 3.00 4.00 32.00 ▇▁▁▁▁
nBabies 7584 0.24 2.46 1.32 0.00 2.00 2.00 3.00 12.00 ▇▅▁▁▁
Age1stBaby 8116 0.19 22.65 4.77 14.00 19.00 22.00 26.00 39.00 ▆▇▅▂▁
SleepHrsNight 2245 0.78 6.93 1.35 2.00 6.00 7.00 8.00 12.00 ▁▅▇▁▁
PhysActiveDays 5337 0.47 3.74 1.84 1.00 2.00 3.00 5.00 7.00 ▇▇▃▅▅
TVHrsDayChild 9347 0.07 1.94 1.43 0.00 1.00 2.00 3.00 6.00 ▇▆▂▂▂
CompHrsDayChild 9347 0.07 2.20 2.52 0.00 0.00 1.00 6.00 6.00 ▇▁▁▁▃
AlcoholDay 5086 0.49 2.91 3.18 1.00 1.00 2.00 3.00 82.00 ▇▁▁▁▁
AlcoholYear 4078 0.59 75.10 103.03 0.00 3.00 24.00 104.00 364.00 ▇▁▁▁▁
SmokeAge 6920 0.31 17.83 5.33 6.00 15.00 17.00 19.00 72.00 ▇▂▁▁▁
AgeFirstMarij 7109 0.29 17.02 3.90 1.00 15.00 16.00 19.00 48.00 ▁▇▂▁▁
AgeRegMarij 8634 0.14 17.69 4.81 5.00 15.00 17.00 19.00 52.00 ▂▇▁▁▁
SexAge 4460 0.55 17.43 3.72 9.00 15.00 17.00 19.00 50.00 ▇▅▁▁▁
SexNumPartnLife 4275 0.57 15.09 57.85 0.00 2.00 5.00 12.00 2000.00 ▇▁▁▁▁
SexNumPartYear 5072 0.49 1.34 2.78 0.00 1.00 1.00 1.00 69.00 ▇▁▁▁▁

Try help("NHANES") in your Console.

This is survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960’s. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. The health examination is conducted in a mobile examination centre (MEC).

The dataset is described as: A data frame with 100000 observations on 76 variables. Some of these are:
- Race1 and Race2: factors with 5 and 6 levels respectively
- Education a factor with 5 levels
- HHIncomeMid Total annual gross income for the household in US dollars.
- Age
- BMI: Body mass index (weight/height2 in kg/m2)
- Height: Standing height in cm.
- Weight: Weight in kg > > - Testosterone: Testosterone total (ng/dL) - PhysActiveDays: Number of days in a typical week that participant does moderate or vigorous-intensity activity.
- CompHrsDay: Number of hours per day on average participant used a computer or gaming device over the past 30 days.

ImportantMissing Data

Why do so many of the variables have missing entries? What could be your guess about the Experiment/Survey`?

Let us make some counts of the data, since we have so many factors:

NHANES %>% count(Gender)
ABCDEFGHIJ0123456789
Gender
<fct>
n
<int>
female5020
male4980
2 rows
NHANES %>% count(Race1)
ABCDEFGHIJ0123456789
Race1
<fct>
n
<int>
Black1197
Hispanic610
Mexican1015
White6372
Other806
5 rows
NHANES %>% count(Race3)
ABCDEFGHIJ0123456789
Race3
<fct>
n
<int>
Asian288
Black589
Hispanic350
Mexican480
White3135
Other158
NA5000
7 rows
NHANES %>% count(Education)
ABCDEFGHIJ0123456789
Education
<fct>
n
<int>
8th Grade451
9 - 11th Grade888
High School1517
Some College2267
College Grad2098
NA2779
6 rows
NHANES %>% count(MaritalStatus)
ABCDEFGHIJ0123456789
MaritalStatus
<fct>
n
<int>
Divorced707
LivePartner560
Married3945
NeverMarried1380
Separated183
Widowed456
NA2769
7 rows

There is a good mix of factors and counts.

Now we articulate our Research Questions:

NoteResearch Questions
  1. Does Testosterone have a relationship with parameters such as BMI, Weight, Height, PhysActiveDays CompHrsDay and Age?

  2. Does HHIncomeMid have a relationship with these same parameters? And with Gender?

  3. Are there any other pairwise correlations that we should note? (This is especially useful in choosing independent variables for multiple regression)

( Yes we are concerned with men more than with the women, sadly.)

Correlations and Plots

GGally::ggpairs(NHANES, 
                # Choose the variables we want to plot for
                columns = c("HHIncomeMid", "Weight", "Height", 
                            "BMI", "Gender"), 
                
                # LISTs of graphs needed at different locations
                # For different combinations of variables 
                diag = list(continuous = "barDiag"),
                lower = list(continuous = wrap("smooth", alpha = 0.01)),
                upper = list(continuous = "cor"),
                
                switch = "both", # axis labels in more traditional locations
                progress = FALSE ) + # No compute progress bars needed
  theme_bw()

We see that HHIncomeMid is Quantitative, discrete valued variable, since it is based on a set of median incomes for different ranges of income. BMI, Weight, Height are continuous Quant variables.

HHIncomeMid also seems to be relatively unaffected by Weight; And is only mildly correlated with Height and BMI, as seen both by the correlation score magnitudes and the slopes of the trend lines.

There is a difference in the median income by Gender, but we will defer that kind of test for later, when we do Statistical Inference.

Unsurprisingly, BMI and Weight have a strong relationship, as do Height and Weight; the latter is of course non-linear, since the Height levels off at a point.

GGally::ggpairs(NHANES, 
                columns = c("Testosterone", "Weight", "Height", "BMI"), 
                
                diag = list(continuous = "barDiag"),
                lower = list(continuous = wrap("smooth", alpha = 0.01)),
                upper = list(continuous = "cor"),
                
                switch = "both",
                progress = FALSE ) +
  theme_bw()

It is clear that Testosterone has strong relationships with Height and Weight but not so much with BMI.

Visualizing Uncertainty in Correlation Estimates

Since the pairs plot is fairly clear for both target variables, let us head to visualizing the significance and uncertainty in the correlation estimates.

HHIncome_corrs <- NHANES %>% 
  select(where(is.numeric)) %>% 
  
  # leave off height to get all the remaining ones
  select(- HHIncomeMid) %>%  
  
  # perform a cor.test for all variables against height
  purrr::map(.x = .,
             .f = \(x) cor.test(x, NHANES$HHIncomeMid)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor") 

HHIncome_corrs
ABCDEFGHIJ0123456789
predictor
<chr>
estimate
<dbl>
statistic
<dbl>
p.value
<dbl>
parameter
<int>
conf.low
<dbl>
conf.high
<dbl>
ID-0.001803175-0.172832538.627869e-019187-2.224911e-020.018644265
Age0.0203817451.953972485.073477e-029187-6.503467e-050.040811489
AgeMonths0.0592125403.992321216.646547e-0545303.014902e-020.088176019
Poverty0.895687490192.891586620.000000e+0091718.915652e-010.899661395
HomeRooms0.45211808448.570891310.000000e+0091824.356946e-010.468240645
Weight0.0322598943.082043982.061977e-0391181.174371e-020.052748927
Length-0.002611812-0.058577059.533123e-01503-8.984636e-020.084662501
HeadCirc0.0405616590.360816857.191994e-0179-1.793764e-010.256638118
Height0.11678118211.068060552.745490e-2888609.619449e-020.137268008
BMI-0.063549377-5.990480532.173863e-098850-8.427016e-02-0.042773659
Next
12345
Previous
1-10 of 44 rows | 1-7 of 9 columns
HHIncome_corrs %>% 
  
  # Reference line at zero correlation score
  gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>% 
  
  # arrange the predictors in order of their correlation scores
  # with the target variable (`height`)
  # Add errorbars to show uncertainty ranges / confidence intervals
  # Use errorbar width and linewidth fo emphasis
  gf_errorbar(conf.high + conf.low ~ reorder(predictor, estimate),
              color = ~ estimate,
              width = 0.2,
              linewidth = ~ -log10(p.value + 0.001)) %>% 
  
  # All correlation estimates as points
  gf_point(estimate ~ reorder(predictor, estimate), 
           color = "black") %>% 
  
  # Themes,Titles, and Scales
  gf_labs(x = NULL, y = "Correlations with HouseHold Median Income", 
          caption = "Significance = - log10(p.value)") %>% 
  gf_theme(theme_classic()) %>%

  
  # Scale for colour
  gf_refine(guides(linewidth = guide_legend(reverse = TRUE)),
            scale_colour_distiller("Correlation", type = "div", 
                                    palette = "RdBu"),
            
  # Scale for dumbbells!!
  scale_linewidth_continuous("Significance", range = c(0.05,2)),
  
  theme(axis.text.y = element_text(size = 6, hjust = 1)),
  coord_flip()) 

If we select just the variables from our Research Question:

HHIncome_corrs_select <- NHANES %>% 
  select(Height, Weight, BMI) %>% # Only change is here!
  
  # leave off height to get all the remaining ones
  #select(- HHIncomeMid) %>%  
  
  # perform a cor.test for all variables against height
  purrr::map(.x = .,
             .f = \(x) cor.test(x, NHANES$HHIncomeMid)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor") 

HHIncome_corrs_select
ABCDEFGHIJ0123456789
predictor
<chr>
estimate
<dbl>
statistic
<dbl>
p.value
<dbl>
parameter
<int>
conf.low
<dbl>
conf.high
<dbl>
Height0.1167811811.0680612.745490e-2888600.096194490.13726801
Weight0.032259893.0820442.061977e-0391180.011743710.05274893
BMI-0.06354938-5.9904812.173863e-098850-0.08427016-0.04277366
3 rows | 1-7 of 9 columns
HHIncome_corrs_select %>% 
  
  # arrange the predictors in order of their correlation scores
  # with the target variable (`height`)
  # Add errorbars to show uncertainty ranges / confidence intervals
  # Use errorbar width and linewidth fo emphasis
  gf_errorbar(conf.high + conf.low ~ reorder(predictor, estimate),
              color = ~ estimate,
              width = 0.2,
              linewidth = ~ -log10(p.value + 0.000001)) %>% 
  
  # All correlation estimates as points
  gf_point(estimate ~ reorder(predictor, estimate), 
           color = "black") %>% 
  
  # Reference line at zero correlation score
  gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>% 
  
  # Themes,Titles, and Scales
  gf_labs(x = NULL, y = "Correlations with HouseHold Median Income", 
          caption = "Significance = - log10(p.value + 0.000001)") %>% 
  
  gf_theme(theme_classic()) %>%

  
  # Scale for colour
  gf_refine(guides(linewidth = guide_legend(reverse = TRUE)),
            scale_colour_distiller("Correlation", type = "div", 
                                    palette = "RdBu"),
            
  # Scale for dumbbells!!
  scale_linewidth_continuous("Significance", range = c(0.05,2)),
  
  theme(axis.text.y = element_text(size = 8, hjust = 1)),
  coord_flip()) 

So we might say taller people make more money? And fatter people make slightly less money? Well, the magnitude of the correlations (aka effect size) are low so we would not imagine this to be a hypothesis that we can defend.

Let us look at the Testosterone variable: trying all variables shows some paucity of observations ( due to missing data), so we will stick with our chosen variables:

Testosterone_corrs <- NHANES %>%
  select(Height, Weight, BMI) %>%
  
  # leave off height to get all the remaining ones
  #select(- Testosterone) %>%
  
  # perform a cor.test for all variables against height
  purrr::map(.x = .,
             .f = \(x) cor.test(x, NHANES$Testosterone)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor")

Testosterone_corrs
ABCDEFGHIJ0123456789
predictor
<chr>
estimate
<dbl>
statistic
<dbl>
p.value
<dbl>
parameter
<int>
conf.low
<dbl>
conf.high
<dbl>
Height0.56638781644.01607810.000000e+0041020.545240250.5868149
Weight0.26303328317.45280947.478487e-6640980.234308740.2912990
BMI0.0025705770.16449748.693477e-014095-0.028053970.0331903
3 rows | 1-7 of 9 columns
Testosterone_corrs %>%
  
  # Reference line at zero correlation score
  gf_hline(yintercept = 0,
           color = "grey",
           linewidth = 2) %>%
  
  # arrange the predictors in order of their correlation scores
  # with the target variable (`height`)
  # Add errorbars to show uncertainty ranges / confidence intervals
  # Use errorbar width and linewidth fo emphasis
  gf_errorbar(
    conf.high + conf.low ~ reorder(predictor, estimate),
    color = ~ estimate,
    width = 0.2,
    linewidth = ~ -log10(p.value + 0.000001)
  ) %>%
  
  # All correlation estimates as points
  gf_point(estimate ~ reorder(predictor, estimate),
           color = "black") %>%
  
  
  # Themes,Titles, and Scales
  gf_labs(x = NULL, y = "Correlations with Testosterone Levels",
          caption = "Significance = - log10(p.value + 0.000001)") %>%
  
  gf_theme(theme_classic()) %>%
  
  
  # Scale for colour
  gf_refine(
    guides(linewidth = guide_legend(reverse = TRUE)),
    scale_colour_distiller("Correlation", type = "div",
                           palette = "RdBu"),
    
    # Scale for dumbbells!!
    scale_linewidth_continuous("Significance", range = c(0.05, 2)),
    
    theme(axis.text.y = element_text(size = 8, hjust = 1)),
    coord_flip()
  ) 

Conclusion

We have a decent Correlations related workflow in R:

  • Load the dataset
  • inspect/skim/glimpse the dataset, identify Quant and Qual variables
  • Identify a target variable based on your knowledge of the data, how it was gathered, who gathered it and what was their intent
  • Develop Pair-Wise plots + Correlations using GGally::ggpairs()
  • Develop Correlogram corrplot::corrplot
  • Check everything with a cor_test: effect size,significance, confidence intervals
  • Use purrr + cor.test to plot correlations and confidence intervals for multiple Quant predictor variables against the target variable
  • Plot scatter plots using gf_point.
  • Add extra lines using gf_abline() to compare hypotheses that you may have.

References

R Package Citations
Package Version Citation
corrplot 0.92 Wei and Simko (2021)
GGally 2.2.1 Schloerke et al. (2024)
ggformula 0.12.0 Kaplan and Pruim (2023)
mosaic 1.9.1 Pruim, Kaplan, and Horton (2017)
mosaicData 0.20.4 Pruim, Kaplan, and Horton (2023)
NHANES 2.1.0 Pruim (2015)
TeachHist 0.2.1 Lange (2023)
TeachingDemos 2.13 Snow (2024)
Kaplan, Daniel, and Randall Pruim. 2023. ggformula: Formula Interface to the Grammar of Graphics. https://CRAN.R-project.org/package=ggformula.
Lange, Carsten. 2023. TeachHist: A Collection of Amended Histograms Designed for Teaching Statistics. https://CRAN.R-project.org/package=TeachHist.
Pruim, Randall. 2015. NHANES: Data from the US National Health and Nutrition Examination Study. https://CRAN.R-project.org/package=NHANES.
Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to ‘Think with Data’ Using r.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
Pruim, Randall, Daniel Kaplan, and Nicholas Horton. 2023. mosaicData: Project MOSAIC Data Sets. https://CRAN.R-project.org/package=mosaicData.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to “ggplot2”. https://CRAN.R-project.org/package=GGally.
Snow, Greg. 2024. TeachingDemos: Demonstrations for Teaching and Learning. https://CRAN.R-project.org/package=TeachingDemos.
Wei, Taiyun, and Viliam Simko. 2021. R Package “corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Back to top

Footnotes

  1. http://euclid.psych.yorku.ca/SCS/Gallery/images/galton-corr.jpg>↩︎

  2. https://www.researchgate.net/figure/Galtons-smoothed-correlation-diagram-for-the-data-on-heights-of-parents-and-children_fig15_226400313↩︎

Citation

BibTeX citation:
@online{venkatadri2022,
  author = {Venkatadri, Arvind},
  title = {Tutorial on {Correlations} in {R}},
  date = {2022-11-22},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/files/correlations.html},
  langid = {en},
  abstract = {Tests, Tables, and Graphs for Correlations in R}
}
For attribution, please cite this work as:
Venkatadri, Arvind. 2022. “Tutorial on Correlations in R.” November 22, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/files/correlations.html.

License: CC BY-SA 2.0

Website made with ❤️ and Quarto, by Arvind V.

Hosted by Netlify .