library(tidyverse)
library(mosaic) # package for stats, simulations, and basic plots
library(mosaicData) # package containing datasets
library(ggformula) # package for professional looking plots, that use the formula interface from mosaic
library(NHANES) # survey data collected by the US National Center for Health Statistics (NCHS)
library(corrplot) # For Correlogram plots
library(plotly)
library(echarts4r)
EDA: Interactive Correlation Graphs in R
Introduction
We will create tables and graphs for correlations in R. As always, we will consistently use the Project Mosaic ecosystem of packages in R (mosaic, mosaicData and ggformula).
Setting up R Packages
echarts4r
We will also start using echarts4r side by side for interactive graphs.
- Every function in the package starts with e_.
- You start coding a visualization by creating an echarts object with the e_charts() function. That takes your data frame and x-axis column as arguments.
- Next, you add a function for the type of chart (e_line(), e_bar(), etc.) with the y-axis series column name as an argument.
- The rest is mostly customization!
echarts4r takes some effort to get used to, but it is totally worth it!
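The steps listed above can be sketched in a few lines. This is only a minimal illustration: the built-in mtcars data frame and its columns wt and mpg are stand-ins chosen here, not part of the case studies that follow.

```r
library(echarts4r)

# mtcars ships with R; it is used here purely as a stand-in dataset
chart <- mtcars %>%
  e_charts(wt) %>%    # step 1: data frame + x-axis column
  e_scatter(mpg) %>%  # step 2: chart type + y-axis series column
  e_tooltip()         # step 3: customization (hover tooltips)

chart
```

Every call after e_charts() adds a layer or an option to the same chart object, which is why the functions chain naturally with the pipe.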
Case Study #1: Dataset from mosaicData
Let us inspect what datasets are available in the package mosaicData.
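One way to list the datasets is the base R data() function; this is a sketch of a call that would work here, not necessarily the exact command used to produce the listing.

```r
library(mosaicData)

# List the datasets bundled with the mosaicData package
available <- data(package = "mosaicData")

# The listing is stored as a matrix; the "Item" column holds the dataset names
head(available$results[, c("Item", "Title")])
```

In an interactive session, calling data(package = "mosaicData") on its own opens the listing in a separate tab or window.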
The popup tab shows a lot of datasets we could use. Let us continue to use the famous Galton dataset and inspect it. (We will save the inspect output as an R object for use later.)
# Save the inspect() output so that it can be reused later
galton_describe <- inspect(Galton)
galton_describe$quantitative
The inspect command already gives us a series of statistical measures of different variables of interest. As discussed previously, we can retain the output of inspect and use it in our reports. (There are ways of dressing up these tables too.)
The dataset is described as follows (type help("Galton") in your Console to see the full documentation):
A data frame with 898 observations on the following variables.
- family: a factor with levels for each family
- father: the father’s height (in inches)
- mother: the mother’s height (in inches)
- sex: the child’s sex: F or M
- height: the child’s height as an adult (in inches)
- nkids: the number of adult children in the family, or, at least, the number whose heights Galton recorded.
There is a lot of description generated by the mosaic::inspect() command! What can we say about the dataset and its variables? How big is the dataset? How many variables? What types are they, Quant or Qual? If they are Qual, what are the levels? Are they ordered levels? Discuss!
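A few base R calls can help answer these questions directly. The sketch below is one possible way to do it; the object names are illustrative.

```r
library(mosaicData)

# How big is the dataset? (rows, columns)
galton_dim <- dim(Galton)
galton_dim

# What type is each variable? Factors are Qual, numeric/integer are Quant
galton_types <- sapply(Galton, function(col) class(col)[1])
galton_types

# For a Qual variable: what are the levels, and are they ordered?
sex_levels <- levels(Galton$sex)
sex_levels
is.ordered(Galton$sex)
```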
Correlations and Plots
What Questions might we have, that we could answer with a Statistical Measure, or Correlation chart?
Q.1 Which are the variables that have significant pair-wise correlations? What polarity are these correlations?
# Pulling out the list of Quant variables from Galton
galton_quant <- galton_describe$quantitative
galton_quant$name
[1] "father" "mother" "height" "nkids"
GGally::ggpairs(
  Galton,
  columns = c("father", "mother", "height", "nkids"),
  # density plots on the diagonal for the continuous variables
  diag = list(continuous = "densityDiag"),
  title = "Galton Data Correlations Plot"
) %>%
  plotly::ggplotly()
Warning: Can only have one: highlight
Warning: Can only have one: highlight
Warning: Can only have one: highlight
Insight: There are statistically significant, but weak, correlations in the Galton dataset. height is best correlated with father (\(0.275\)). The scatter plots in the figure also visually demonstrate the lack of strong correlations.
We cannot have too many variables in this kind of plot. We will shortly see how to plot correlations when there are a large number of variables.
echarts4r does not have a comprehensive combination plot like what GGally offers. However, we can plot a Correlation Heatmap using echarts4r:
Galton %>%
  select(where(is.numeric)) %>%
  mosaic::cor() %>%
  e_charts(height = 300) %>%
  e_correlations(order = "hclust", visual_map = TRUE) %>%
  e_title("Galton Correlations Heatmap")
Insight: Moving the cursor over the heatmap gives us an indication of the correlation scores between variables. The visual map slider moves automatically to indicate the scores. We can also move the slider ourselves to “filter” the heatmap!
Q.2: Can we plot a Correlogram for this dataset?
# library(corrplot)
galton_num_var <- Galton %>% select(father, mother, height, nkids)
galton_cor <- cor(galton_num_var)
galton_cor %>%
  corrplot(
    method = "ellipse",
    type = "lower",
    main = "Correlogram for Galton dataset"
  )
Insight: Again, height is positively correlated with father and mother, as depicted by the rightward-sloping blue ellipses. And height is negatively correlated (very slightly) with nkids, with leftward-sloping reddish ellipses. (See the color palette + legend below the figure.)
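The numbers behind the ellipses can also be printed as a plain correlation table. This is a minimal sketch using base R's cor(); the rounding to three decimals is a choice made here for readability.

```r
library(mosaicData)

# Pairwise Pearson correlations for the four Quant variables, rounded for display
galton_cor_table <- round(
  cor(Galton[, c("father", "mother", "height", "nkids")]),
  3
)
galton_cor_table
```

Reading along the height row gives the same story as the correlogram: positive entries for father and mother, and a small negative entry for nkids.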
Q.3: What do the correlation tests tell us?
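The test output shown below is consistent with calls of the following form. This is a reconstruction, assuming the same mosaic::cor_test formula interface that is used for the per-sex tests later in this section.

```r
library(mosaic)
library(mosaicData)

# Pearson correlation tests of child's height against each parent's height
test_father <- mosaic::cor_test(height ~ father, data = Galton)
test_mother <- mosaic::cor_test(height ~ mother, data = Galton)

test_father
test_mother
```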
Pearson's product-moment correlation
data: height and father
t = 8.5737, df = 896, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2137851 0.3347455
sample estimates:
cor
0.2753548
Pearson's product-moment correlation
data: height and mother
t = 6.1628, df = 896, p-value = 1.079e-09
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1380554 0.2635982
sample estimates:
cor
0.2016549
Insight: The tests give us the same values seen before, along with the confidence intervals for the correlation estimate. These represent the uncertainty that exists in our estimates.
Q.4: What does this correlation look like when split by the sex of the child?
We will use the mosaic function cor_test to get these results:
# For the sons
mosaic::cor_test(height ~ father,
  data = Galton %>% filter(sex == "M")
)
cor_test(height ~ mother,
  data = Galton %>% filter(sex == "M")
)
Pearson's product-moment correlation
data: height and father
t = 9.1498, df = 463, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3114667 0.4656805
sample estimates:
cor
0.3913174
Pearson's product-moment correlation
data: height and mother
t = 7.628, df = 463, p-value = 1.367e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2508178 0.4125305
sample estimates:
cor
0.3341309
# For the daughters
cor_test(height ~ father,
  data = Galton %>% filter(sex == "F")
)
cor_test(height ~ mother,
  data = Galton %>% filter(sex == "F")
)
Pearson's product-moment correlation
data: height and father
t = 10.719, df = 431, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3809944 0.5300812
sample estimates:
cor
0.4587605
Pearson's product-moment correlation
data: height and mother
t = 6.8588, df = 431, p-value = 2.421e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2261463 0.3962226
sample estimates:
cor
0.3136984
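Insight: Sons’ heights are correlated more with father (0.39) than with mother (0.33), and the gap is even wider for daughters (0.46 vs. 0.31)! Note, however, that the confidence intervals for father and mother overlap in both groups, so these differences are suggestive rather than conclusive. Hmmm… mother’s influence on children evidently lies more in things other than height.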
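The four per-sex correlations can also be collected into one small table, which makes the comparison easier to read. This is a sketch using dplyr's group_by() and summarise(); the column names r_father, r_mother and n are chosen here for illustration.

```r
library(dplyr)
library(mosaicData)

# One row per sex: correlation of height with each parent, plus group size
cor_by_sex <- Galton %>%
  group_by(sex) %>%
  summarise(
    r_father = cor(height, father),
    r_mother = cor(height, mother),
    n = n()
  )
cor_by_sex
```

Both within-group correlations with father are larger than the pooled value of 0.275 seen earlier, since pooling sons and daughters mixes in the height difference between the sexes.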
Correlation Tests and Uncertainty
Note how cor.test reports a correlation score and the p-value for the same. There is also a confidence interval reported for the correlation score: an interval that, at the 95% confidence level, is expected to contain the true correlation value. Note that GGally too reports the significance of the correlation scores using *** or **. These stars indicate the p-value of the scores obtained by GGally; presumably, there is an internal cor.test that is run for each pair of variables, and the p-values and confidence levels are also computed internally.
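To see where these numbers come from, the t statistic and the confidence interval can be recomputed by hand from just the correlation and the sample size. The sketch below uses the values reported above for height and father (r = 0.2753548, n = 898) and the Fisher z-transformation, which is the method documented for the confidence interval in stats::cor.test.

```r
# Values taken from the test output for height and father
r <- 0.2753548
n <- 898

# t statistic for H0: true correlation is 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z-transformation
z <- atanh(r)              # transform r to the z scale
se_z <- 1 / sqrt(n - 3)    # standard error on the z scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)  # back-transform the limits

t_stat
p_value
ci
```

The results should agree closely with the t value and the 95 percent confidence interval printed by the test. Note that the interval is not symmetric around r, because it is symmetric on the z scale instead.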
We can also visualise this uncertainty and the confidence intervals in a plot, using error bars (e.g. gf_errorbar) and a handy set of functions within purrr, which is part of the tidyverse. Assuming height is the target variable against which we want to correlate every other (quantitative) variable, we can proceed very quickly as follows:
all_corrs <- Galton %>%
  select(where(is.numeric)) %>%
  # leave off height to get all the remaining ones
  select(-height) %>%
  # perform a cor.test for all variables against height
  purrr::map(
    .x = .,
    .f = \(x) cor.test(x, Galton$height)
  ) %>%
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor")
all_corrs
all_corrs %>%
  e_charts(predictor) %>%
  e_bar(estimate, colorBy = "data", legend = FALSE) %>%
  e_error_bar(lower = conf.low, upper = conf.high) %>%
  e_y_axis(
    name = "Correlation with `height`",
    nameLocation = "middle", nameGap = 35
  ) %>%
  e_x_axis(
    name = "Parameter", nameLocation = "center",
    nameGap = 35, type = "category"
  ) %>%
  e_tooltip()
all_corrs %>%
  # half-width of the confidence interval, used as the error-bar length
  mutate(ci_half_width = (conf.high - conf.low) / 2) %>%
  plot_ly() %>%
  add_bars(
    y = ~estimate, x = ~predictor,
    error_y = ~ list(array = ci_half_width, color = "black")
  )
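For a static version of the same chart, a ggformula sketch with gf_errorbar might look like the following. It rebuilds all_corrs so that it runs on its own; the axis labels and the bar width are choices made here, not taken from the original plots.

```r
library(dplyr)
library(purrr)
library(broom)
library(ggformula)
library(mosaicData)

# Correlation tests of every other Quant variable against height
all_corrs <- Galton %>%
  select(father, mother, nkids) %>%
  map(\(x) cor.test(x, Galton$height)) %>%
  map_dfr(tidy, .id = "predictor")

# Point estimate with the 95% confidence interval as an error bar
corr_plot <- all_corrs %>%
  gf_point(estimate ~ predictor) %>%
  gf_errorbar(conf.low + conf.high ~ predictor, width = 0.2) %>%
  gf_labs(x = "Predictor", y = "Correlation with height")
corr_plot
```

An error bar that does not cross zero corresponds to a correlation that is significant at the 5% level.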
Insight: We can clearly see the size of the correlations and the confidence intervals marked in this plot. father has a somewhat greater correlation with children’s height, as compared to mother. nkids seems to matter very slightly, in a negative way.
This kind of plot will be very useful when we pursue linear regression models.
Q.5. How can we show this correlation in a set of Scatter Plots + Regression Lines? Can we recreate Galton’s famous diagram?
# For the father
Galton %>%
  group_by(sex) %>%
  e_charts(father, height = 300) %>%
  e_scatter(height, symbol_size = 8) %>%
  e_lm(height ~ father, legend = FALSE) %>%
  e_x_axis(
    name = "father", nameLocation = "middle", nameGap = 35,
    min = 60, max = 80
  ) %>%
  e_y_axis(
    name = "height", nameLocation = "middle", nameGap = 35,
    min = 50, max = 80
  ) %>%
  e_tooltip()
# For the mother
Galton %>%
  group_by(sex) %>%
  e_charts(mother, height = 300) %>%
  e_scatter(height, symbol_size = 8) %>%
  e_lm(height ~ mother, legend = FALSE) %>%
  e_x_axis(
    name = "mother", nameLocation = "middle", nameGap = 35,
    min = 55, max = 75
  ) %>%
  e_y_axis(
    name = "height", nameLocation = "middle", nameGap = 35,
    min = 50, max = 80
  ) %>%
  e_tooltip()
Insight: Visibly, the scatter plots are slightly tilted upward to the right, showing a positive correlation for both sons’ and daughters’ heights with that of the father and mother.
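The tilt of each regression line is tied directly to the correlation: for a simple linear regression, the slope equals the correlation multiplied by the ratio of the standard deviations. A minimal sketch for height against father, fitted separately for sons and daughters with base R's lm() (the object names are illustrative):

```r
library(mosaicData)

sons <- subset(Galton, sex == "M")
daughters <- subset(Galton, sex == "F")

# Slope of the least-squares line height ~ father within each group
slope_sons <- unname(coef(lm(height ~ father, data = sons))["father"])
slope_daughters <- unname(coef(lm(height ~ father, data = daughters))["father"])

# The same slope, recovered from the correlation: r * sd(y) / sd(x)
slope_from_r <- with(sons, cor(height, father) * sd(height) / sd(father))

c(sons = slope_sons, daughters = slope_daughters, sons_from_r = slope_from_r)
```

A positive correlation therefore always gives an upward-sloping line, but a steep line does not by itself imply a strong correlation.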
An approximation to Galton’s famous plot (see Wikipedia):
gf_point(height ~ (father + mother) / 2, data = Galton) %>%
  gf_smooth(method = "lm") %>%
  gf_density_2d(n = 8) %>%
  gf_abline(slope = 1) %>%
  gf_theme(theme_minimal())
Insight: How would you interpret this plot? (See footnote 1.) As yet we are not able to reproduce this with echarts4r.
Case Study #2: Dataset from NHANES
We will “live code” this in class!
Conclusion
We have a decent correlations-related workflow in R:
- Load the dataset
- inspect the dataset; identify Quant and Qual variables
- Develop pair-wise plots + correlations using GGally::ggpairs()
- Develop a correlogram using corrplot::corrplot()
- Check everything with a cor_test
- Use purrr + cor.test to plot correlations and confidence intervals for multiple Quant variables
- Plot scatter plots using gf_point.
- Add extra lines using gf_abline() to compare hypotheses that you may have.
Footnotes
1. https://www.researchgate.net/figure/Galtons-smoothed-correlation-diagram-for-the-data-on-heights-of-parents-and-children_fig15_226400313