Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors

Change

Correlations

Correlations
Scatter Plots
Bubble Plots
Errorbar Plot
Heatmaps
Regression Lines
Author

Arvind V.

Published

November 22, 2022

Modified

June 19, 2025

Abstract
How one variable changes with another
Slides and Tutorials

Tutorial    R (Interactive Graphs)

“The world says: ‘You have needs – satisfy them. You have as much right as the rich and the mighty. Don’t hesitate to satisfy your needs; indeed, expand your needs and demand more.’ This is the worldly doctrine of today. And they believe that this is freedom. The result for the rich is isolation and suicide, for the poor, envy and murder.”

— Fyodor Dostoevsky

Setting up R Packages

library(tidyverse) # Tidy data processing and plotting
library(ggformula) # Formula based plots
library(mosaic) # Our go-to package
library(skimr) # Another Data inspection package
library(kableExtra) # Making good tables with data

library(GGally) # Corr plots
library(corrplot) # More corrplots
library(ggExtra) # Making Combination Plots

# library(devtools)
# devtools::install_github("rpruim/Lock5withR")
library(Lock5withR) # Datasets
library(palmerpenguins) # A famous dataset

library(easystats) # Easy Statistical Analysis and Charts
library(correlation) # Different Types of Correlations
# From the easystats collection of packages

Plot Theme

# https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto

# Chunk options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, # Golden Ratio
  # out.width = "80%",
  fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes

theme_custom <- function() {
  font <- "Roboto Condensed" # assign font family up front

  theme_classic(base_size = 14) %+replace% # replace elements we want to change

    theme(
      panel.grid.minor = element_blank(), # strip minor gridlines

      # text elements
      plot.title = element_text( # title
        family = font, # set font family
        # size = 20,               #set font size
        face = "bold", # bold typeface
        hjust = 0, # left align
        # vjust = 2                #raise slightly
        margin = margin(0, 0, 10, 0)
      ), plot.title.position = "plot",
      plot.subtitle = element_text( # subtitle
        family = font, # font family
        # size = 14,                #font size
        hjust = 0,
        margin = margin(2, 0, 5, 0)
      ),
      plot.caption = element_text( # caption
        family = font, # font family
        size = 8, # font size
        hjust = 1
      ), # right align

      axis.title = element_text( # axis titles
        family = font, # font family
        size = 10 # font size
      ),
      axis.text = element_text( # axis text
        family = font, # axis family
        size = 8
      ) # font size
    )
}

# Set graph theme
theme_set(new = theme_custom())

What graphs will we see today?

Variable #1 | Variable #2 | Chart Names | Chart Shape
Quant | Quant | Scatter Plot |

Some of the very basic and commonly used plots for data are:

  • Scatter Plot for two variables
  • Contour Plot
  • Scatter Plot with Confidence Ellipses
  • Pairwise Correlation Plots for multiple variables
  • Correlogram for multiple variables
  • Heatmap for multiple variables
  • Errorbar chart for multiple variables
  • Combination chart with marginal densities

What kind of Data Variables will we choose?

No | Pronoun | Answer | Variable/Scale | Example | What Operations?
1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation

Inspiration

Figure 1: ScatterPlot Inspiration http://www.calamitiesofnature.com/archive/?c=559

Does belief in Evolution depend upon the GDP of the country? Where is the US in all of this? Does the Bible Belt tip the scales here?

And India?

What is Correlation?

One of the basic Questions we would have of our data is: Does some variable depend upon another in some way? Does y vary with x? A Correlation Test is designed to answer exactly this question.

The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between rainy days and reduced sales at supermarkets. However, in statistical terms we use correlation to denote association between two quantitative variables. We also assume that the association is linear, that one variable increases or decreases a fixed amount for a unit increase or decrease in the other. The other technique that is often used in these circumstances is regression, which involves estimating the best straight line to summarise the association.

Pearson Correlation coefficient

The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes called Pearson’s correlation coefficient after its originator and is a measure of linear association. (If a curved line is needed to express the relationship, other and more complicated measures of the correlation must be used.)

The correlation coefficient is measured on a scale that varies from +1 through 0 to -1. Complete correlation between two variables is expressed by either +1 or -1. When one variable increases as the other increases, the correlation is positive; when one decreases as the other increases, it is negative.

In formal terms, the correlation between two variables x and y is defined as

$$\rho = E\left[\frac{(x - \mu_x)\,(y - \mu_y)}{\sigma_x\,\sigma_y}\right]$$

where E is the expectation operator (i.e., taking the mean). Think of this as the average of the products of two scaled variables.

Tip: Pearson Correlation uses z-scores

We can see that $(x - \mu_x)/\sigma_x$ is a centering and scaling of the variable x. Recall from our discussion on Distributions that this is called the z-score of x.
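
The z-score view can be checked numerically. What follows is a minimal sketch in base R, assuming the built-in mtcars data as a stand-in (the choice of wt and mpg is purely illustrative). Because sd() in R divides by n - 1, the "average" of the z-score products is also taken over n - 1:

```r
# Pearson correlation as the (n - 1)-averaged product of z-scores
x <- mtcars$wt   # illustrative choice of variables, not from the movies data
y <- mtcars$mpg

z_x <- (x - mean(x)) / sd(x)  # centre and scale x
z_y <- (y - mean(y)) / sd(y)  # centre and scale y

r_manual  <- sum(z_x * z_y) / (length(x) - 1)
r_builtin <- cor(x, y)        # Pearson is the default method

# Compare the two values, allowing for floating-point error
all.equal(r_manual, r_builtin)
```

If the two numbers agree, the formula above and R's cor() are computing the same quantity.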

Pearson correlation assumes that the relationship between the two variables is linear. There are of course many other types of correlation measures: some which work when this is not so. Type vignette("types", package = "correlation") in your Console to see the vignette from the correlation package that discusses various types of correlation measures.
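
To see why the linearity assumption matters, here is a small sketch on made-up data (the exponential curve is an assumption chosen only for illustration). It compares Pearson's coefficient with Spearman's rank correlation, one of the alternative measures available through the method argument of base R's cor():

```r
# A relationship that is perfectly monotonic but not linear
x <- 1:20
y <- exp(x / 2)   # strictly increasing, yet far from a straight line

pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # works on ranks instead of raw values

c(pearson = pearson, spearman = spearman)
```

Spearman's coefficient reaches 1 here because the ranks of x and y agree exactly, while Pearson's coefficient falls short of 1 because the points do not lie on a line.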

Case Study-1: HollywoodMovies2011 dataset

Let us look at the HollywoodMovies2011 dataset from the Lock5withR package. The dataset is also available by clicking the icon below (in case you are not able to install Lock5withR):

Inspecting the Data

HollywoodMovies2011 -> movies
glimpse(movies)
Rows: 136
Columns: 14
$ Movie             <fct> "Insidious", "Paranormal Activity 3", "Bad Teacher",…
$ LeadStudio        <fct> Sony, Independent, Independent, Warner Bros, Relativ…
$ RottenTomatoes    <int> 67, 68, 44, 96, 90, 93, 75, 35, 63, 69, 69, 49, 26, …
$ AudienceScore     <int> 65, 58, 38, 92, 77, 84, 91, 58, 74, 73, 72, 57, 68, …
$ Story             <fct> Monster Force, Monster Force, Comedy, Rivalry, Rival…
$ Genre             <fct> Horror, Horror, Comedy, Fantasy, Comedy, Romance, Dr…
$ TheatersOpenWeek  <int> 2408, 3321, 3049, 4375, 2918, 944, 2534, 3615, NA, 2…
$ BOAverageOpenWeek <int> 5511, 15829, 10365, 38672, 8995, 6177, 10278, 23775,…
$ DomesticGross     <dbl> 54.01, 103.66, 100.29, 381.01, 169.11, 56.18, 169.22…
$ ForeignGross      <dbl> 43.00, 98.24, 115.90, 947.10, 119.28, 83.00, 30.10, …
$ WorldGross        <dbl> 97.009, 201.897, 216.196, 1328.111, 288.382, 139.177…
$ Budget            <dbl> 1.5, 5.0, 20.0, 125.0, 32.5, 17.0, 25.0, 80.0, 0.2, …
$ Profitability     <dbl> 64.672667, 40.379400, 10.809800, 10.624888, 8.873292…
$ OpeningWeekend    <dbl> 13.27, 52.57, 31.60, 169.19, 26.25, 5.83, 26.04, 85.…
skimr::skim(movies)
Note: Business Insights from Data Inspection

movies has 136 observations on the following 14 variables.

  • Movie a factor with many levels
  • LeadStudio a factor with many levels
  • RottenTomatoes a numeric vector
  • AudienceScore a numeric vector
  • Story a factor with many levels
  • Genre a factor with levels Action, Adventure, Animation, Comedy, Drama, Fantasy, Horror, Romance, Thriller.
  • TheatersOpenWeek a numeric vector. No. of theatres.
  • BOAverageOpenWeek a numeric vector.
  • DomesticGross a numeric vector. In million USD.
  • ForeignGross a numeric vector. In million USD.
  • WorldGross a numeric vector. In million USD.
  • Budget a numeric vector. In million USD.
  • Profitability a numeric vector. A ratio
  • OpeningWeekend a numeric vector. In million USD.

There are no missing values in the Qual variables; but some entries in the Quant variables are missing. skim throws a warning that we may need to examine later.

Let us look at the Quant variables: are these related in any way? Could the relationship between any two Quant variables also depend upon the level of a Qual variable?
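
Missing values matter for correlations. As a rough sketch, assuming the built-in airquality data as a stand-in for movies (it also has missing entries in some Quant columns), we can count the NAs per column and see how cor() reacts to them:

```r
# airquality ships with base R; it is used here only as a stand-in dataset
na_counts <- colSums(is.na(airquality))  # number of missing values per column
na_counts

# cor() returns NA if any value is missing, unless told how to handle NAs
r_default  <- cor(airquality$Ozone, airquality$Wind)
r_complete <- cor(airquality$Ozone, airquality$Wind, use = "complete.obs")

c(default = r_default, complete = r_complete)
```

This is one reason the code in this module calls drop_na() before plotting or computing correlations.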

Scatter Plots

Which are the numeric variables in movies?

movies_quant <- movies %>%
  drop_na() %>%
  select(where(is.numeric))
movies_quant
RottenTomatoes AudienceScore TheatersOpenWeek BOAverageOpenWeek DomesticGross ForeignGross WorldGross Budget
67 65 2408 5511 54.01 43.00 97.009 1.5
68 58 3321 15829 103.66 98.24 201.897 5.0
44 38 3049 10365 100.29 115.90 216.196 20.0
96 92 4375 38672 381.01 947.10 1328.111 125.0
90 77 2918 8995 169.11 119.28 288.382 32.5
93 84 944 6177 56.18 83.00 139.177 17.0
75 91 2534 10278 169.22 30.10 199.324 25.0
35 58 3615 23775 254.46 327.00 581.464 80.0
69 73 2756 6860 79.25 82.60 161.849 27.0
69 72 3040 9310 117.54 92.10 209.638 35.0
(Showing rows 1-10 of 111 and columns 1-8 of 10.)

Now let us plot their relationships.

Using ggformula
# Set graph theme
theme_set(new = theme_custom())

movies %>%
  drop_na() %>%
  gf_point(DomesticGross ~ WorldGross) %>%
  gf_lm() %>%
  gf_labs(
    title = "Scatter Plot",
    subtitle = "Movie Gross Earnings: Domestics vs World"
  )

# Set graph theme
theme_set(new = theme_custom())
movies %>%
  drop_na() %>%
  gf_point(Profitability ~ OpeningWeekend) %>%
  gf_lm() %>%
  gf_labs(
    title = "Scatter Plot",
    subtitle = "Movies: Does Opening Week Earnings indicate Profitability?"
  )

# Set graph theme
theme_set(new = theme_custom())
##
movies %>%
  drop_na() %>%
  gf_point(RottenTomatoes ~ AudienceScore) %>%
  gf_lm() %>%
  gf_labs(
    title = "Scatter Plot",
    subtitle = "Movie Ratings: Tomatoes vs Audience"
  )

We can split some of the scatter plots using one or other of the Qual variables. For instance, is the relationship between the two ratings the same, regardless of movie genre?

# Set graph theme
theme_set(new = theme_custom())

movies %>%
  drop_na() %>%
  gf_point(RottenTomatoes ~ AudienceScore,
    color = ~Genre
  ) %>%
  gf_lm() %>%
  gf_labs(
    title = "Scatter Plot",
    subtitle = "Movie Ratings: Trends by Genre"
  )

Using ggplot

# Set graph theme
theme_set(new = theme_custom())

movies %>%
  drop_na() %>%
  ggplot(aes(x = WorldGross, y = DomesticGross)) +
  geom_point() +
  geom_lm() +
  labs(
    title = "Scatter Plot",
    subtitle = "Movie Gross Earnings: Domestics vs World"
  )

# Set graph theme
theme_set(new = theme_custom())
##
movies %>%
  drop_na() %>%
  ggplot(aes(OpeningWeekend, Profitability)) +
  geom_point() +
  geom_lm() +
  labs(
    title = "Scatter Plot",
    subtitle = "Movies: Does Opening Week Earnings indicate Profitability?"
  )

# Set graph theme
theme_set(new = theme_custom())
##
movies %>%
  drop_na() %>%
  ggplot(aes(AudienceScore, RottenTomatoes)) +
  geom_point() +
  geom_lm() +
  labs(
    title = "Scatter Plot",
    subtitle = "Movie Ratings: Tomatoes vs Audience"
  )

# Set graph theme
theme_set(new = theme_custom())
##
movies %>%
  drop_na() %>%
  ggplot(aes(AudienceScore, RottenTomatoes, color = Genre)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Scatter Plot",
    subtitle = "Movie Ratings: Trends by Genre"
  )

Note: Business Insight from movies scatter plots

We have fitted a trend line to each of the scatter plots.

  • DomesticGross and WorldGross are related, though there are fewer movies at the high end of DomesticGross.
  • AudienceScore and RottenTomatoes seem clearly related: both increase together.
  • OpeningWeekend and Profitability are also related in a linear way. There are just two movies which have been extremely profitable, but they do not influence the slope of the trend line too much, because of their location midway in the range of OpeningWeekend. Influence is a key concept in Linear Regression.
  • By and large, there are only small variations in slope across Genres.
Important: Independent and Dependent Variables

Note that we have rather arbitrarily taken AudienceScore as the independent variable, to be plotted on the x-axis, and RottenTomatoes on the y-axis. It could easily have been the other way around, based on our Research Question. Datasets are gathered with specific Research Hypotheses in mind, so check the help file and also with the person who gathered the data about what variable they are interested in!
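
The choice of axes affects the trend line but not the correlation score. A minimal sketch, assuming the built-in mtcars data as a stand-in (wt and mpg are illustrative choices only), shows that correlation is symmetric in the two variables while the fitted slope is not:

```r
x <- mtcars$wt   # illustrative variables, not from the movies data
y <- mtcars$mpg

# Correlation is symmetric: swapping the variables changes nothing
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# Regression slopes are not: each depends on which variable is the target
slope_y_on_x <- unname(coef(lm(y ~ x))[2])
slope_x_on_y <- unname(coef(lm(x ~ y))[2])

c(r_xy = r_xy, r_yx = r_yx,
  slope_y_on_x = slope_y_on_x, slope_x_on_y = slope_x_on_y)
```

The two slopes differ, but their product equals the squared correlation, which is one way to see that the correlation score summarises both directions at once.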

Quantifying Correlation

So we see that there are visible relationships between Quant variables. How do we quantify this relationship, as a correlation score?

There are two ways: using the GGally and corrplot packages, and doing a formal correlation test with the mosaic package.

Using GGally

By default, GGally::ggpairs() provides:

  • two different comparisons of each pair of columns
  • either the density or the count of the respective variable along the diagonal
  • with different parameter settings, the option to replace the diagonal with the axis values and variable labels
# Set graph theme
theme_set(new = theme_custom())

# names(movies_quant)

GGally::ggpairs(
  movies %>% drop_na(),
  # Select Quant variables only for now
  columns = c(
    "RottenTomatoes", "AudienceScore", "DomesticGross", "ForeignGross"
  ),
  switch = "both",
  # axis labels in more traditional locations(left and bottom)

  progress = FALSE,
  # no compute progress messages needed

  # Choose the diagonal graphs (always single variable! Think!)
  diag = list(continuous = "barDiag"),
  # choosing histogram,not density

  # Choose lower triangle graphs, two-variable graphs
  lower = list(continuous = wrap("smooth", alpha = 0.3, se = FALSE)),
  title = "Movies Data Correlations Plot #1"
)

Note: Business Insight from Pairs Plot #1
  • As we saw earlier from the Scatter Plot, AudienceScore and RottenTomatoes are well correlated, with a correlation score of 0.833
  • DomesticGross and ForeignGross are also extremely well correlated, with a score of 0.873.
  • Both these correlation scores are highly significant, with three stars. (We will speak of significance in a while.)
  • None of the other pairs of variables have good correlation scores.
  • Note in passing that both the “Gross” related variables have highly skewed distributions. That is the nature of the movie business!

Let us also try a few other variables, related to budget and profits. For instance, it would be interesting to see the relationship between Budget and Profitability and even either of the “gross” earnings and Profitability.

# Set graph theme
theme_set(new = theme_custom())

GGally::ggpairs(
  movies %>% drop_na(),
  # Select Quant variables only for now
  columns = c(
    "Budget", "Profitability", "DomesticGross", "ForeignGross"
  ),
  switch = "both",
  # axis labels in more traditional locations(left and bottom)

  progress = FALSE,
  # no compute progress messages needed

  # Choose the diagonal graphs (always single variable! Think!)
  diag = list(continuous = "barDiag"),
  # choosing histogram,not density

  # Choose lower triangle graphs, two-variable graphs
  lower = list(continuous = wrap("smooth", alpha = 0.3, se = FALSE)),
  title = "Movies Data Correlations Plot #2"
)

Note: Business Insight from Pairs Plot #2
  • The Budget variable has good correlation scores with DomesticGross and ForeignGross
  • Profitability and Budget seem to have a very slight negative correlation, but this does not appear to be significant.

Using corrplot

In this chart, the correlation between pairs of variables is shown symbolically, as coloured shapes such as circles, squares, and ellipses.

  • The size, colour, and “orientation” of the shapes represent the strength and polarity of the correlation scores.
  • The direction of the semi-major axis and the colour of the ellipse indicate whether the correlation score is positive or negative.
  • The more eccentric the ellipse, the larger the magnitude of the correlation score.
Note

Whereas GGally computes the correlation scores, corrplot “merely” displays them in an evocative way. We need to compute the correlations beforehand.

Note also:

Tip

R package corrplot provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables. corrplot is very easy to use and provides a rich array of plotting options in visualization method, graphic layout, color, legend, text labels, etc. It also provides p-values and confidence intervals to help users determine the statistical significance of the correlations.

# library(corrplot)
mydata_cor <- cor(movies_quant)
mydata_cor %>%
  knitr::kable(caption = "Correlation Scores Matrix")
Correlation Scores Matrix
RottenTomatoes AudienceScore TheatersOpenWeek BOAverageOpenWeek DomesticGross ForeignGross WorldGross Budget Profitability OpeningWeekend
RottenTomatoes 1.0000000 0.8329740 -0.0873543 0.1823480 0.2085935 0.0979132 0.1356232 -0.0147887 0.1502764 0.0986304
AudienceScore 0.8329740 1.0000000 0.0259118 0.1851768 0.3849406 0.2557891 0.3037927 0.1268649 0.1047582 0.2695132
TheatersOpenWeek -0.0873543 0.0259118 1.0000000 0.0117674 0.5981162 0.4850569 0.5344582 0.5924941 0.0547807 0.5977724
BOAverageOpenWeek 0.1823480 0.1851768 0.0117674 1.0000000 0.4713164 0.4522253 0.4710352 0.2880262 0.0964176 0.5043684
DomesticGross 0.2085935 0.3849406 0.5981162 0.4713164 1.0000000 0.8725927 0.9374780 0.6497274 0.1812387 0.9232259
ForeignGross 0.0979132 0.2557891 0.4850569 0.4522253 0.8725927 1.0000000 0.9880383 0.6707613 0.1230330 0.8487202
WorldGross 0.1356232 0.3037927 0.5344582 0.4710352 0.9374780 0.9880383 1.0000000 0.6830783 0.1448857 0.8962294
Budget -0.0147887 0.1268649 0.5924941 0.2880262 0.6497274 0.6707613 0.6830783 1.0000000 -0.1437862 0.6228180
Profitability 0.1502764 0.1047582 0.0547807 0.0964176 0.1812387 0.1230330 0.1448857 -0.1437862 1.0000000 0.1713962
OpeningWeekend 0.0986304 0.2695132 0.5977724 0.5043684 0.9232259 0.8487202 0.8962294 0.6228180 0.1713962 1.0000000
## View the matrix
corrplot::corrplot(mydata_cor,
  method = "number",
  number.cex = 0.6,
  cl.cex = 0.6, tl.cex = 0.6
)

# Default plot with circles
corrplot(mydata_cor,
  method = "circle",
  main = "Correlogram with Circles"
)

# Ellipse plot
corrplot(mydata_cor,
  method = "ellipse",
  main = "Correlogram with Ellipses"
)

# Heatmap
corrplot(mydata_cor,
  method = "color", ## US Spelling only
  main = "Correlogram"
)

# Set graph theme
theme_set(new = theme_custom())

# Heatmap with numbers
corrplot.mixed(mydata_cor,
  lower = "color", number.cex = 0.6,
  cl.cex = 0.6, tl.cex = 0.6,
  upper = "number",
  tl.pos = "l",
  main = "Heatmap?"
)

Note: Business Insights from corrplots
  • Most of the variables here are positively correlated, and many of these correlations are strong.

Doing a Correlation Test

Correlation scores can be obtained by conducting a formal test in R. We will use the mosaic function cor_test to get these results:

mosaic::cor_test(Profitability ~ Budget, data = movies) %>%
  broom::tidy() %>%
  knitr::kable(
    digits = 2,
    caption = "Movie Profitability vs Budget"
  )
Movie Profitability vs Budget
estimate statistic p.value parameter conf.low conf.high method alternative
-0.08 -0.96 0.34 132 -0.25 0.09 Pearson’s product-moment correlation two.sided
mosaic::cor_test(DomesticGross ~ Budget, data = movies) %>%
  broom::tidy() %>%
  knitr::kable(
    digits = 2,
    caption = "Movie Domestic Gross vs Budget"
  )
Movie Domestic Gross vs Budget
estimate statistic p.value parameter conf.low conf.high method alternative
0.7 11.06 0 131 0.6 0.77 Pearson’s product-moment correlation two.sided
mosaic::cor_test(ForeignGross ~ Budget, data = movies) %>%
  broom::tidy() %>%
  knitr::kable(
    digits = 2,
    caption = "Movie Foreign Gross vs Budget"
  )
Movie Foreign Gross vs Budget
estimate statistic p.value parameter conf.low conf.high method alternative
0.69 10.22 0 118 0.58 0.77 Pearson’s product-moment correlation two.sided
Note: Business Insights from Correlation Tests

Budget and Profitability are not well correlated, sadly. We see this from the p.value, which is 0.34, and from the confidence interval for the correlation estimate, which includes 0.

However, both DomesticGross and ForeignGross are well correlated with Budget. Look at the p.value (shown as 0 after rounding to two digits) and at the confidence intervals, which lie entirely on one side of zero.
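
To make the reading of these tables concrete, here is a sketch using base R's stats::cor.test() on the built-in mtcars data (a stand-in; as far as I can tell, mosaic::cor_test offers a formula interface to the same kind of test, but that is an assumption). It recomputes the test statistic by hand and checks whether the confidence interval excludes zero:

```r
# Pearson correlation test between two illustrative variables
ct <- cor.test(~ mpg + wt, data = mtcars)  # Pearson by default

r <- unname(ct$estimate)  # the correlation score
n <- nrow(mtcars)

# The test statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 df
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

ct$p.value    # a small p-value makes a true correlation of zero implausible
ct$conf.int   # 95% confidence interval for the correlation

# Does the interval lie entirely on one side of zero?
excludes_zero <- prod(sign(ct$conf.int)) > 0
excludes_zero
```

The columns statistic, parameter, conf.low and conf.high in the tables above correspond to the t value, the degrees of freedom, and the two ends of this interval.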

The ErrorBar Plot for Correlations

As stated earlier, in our dataset we usually have a specific dependent or target variable, which represents the outcome of our experiment or our business situation. The remaining variables are usually independent or predictor variables. A very useful thing to know, and to view, would be the correlations of all the independent variables with the target variable. Using the correlation package from the easystats family of R packages, this can be very easily achieved. Let us quickly do this for the familiar mtcars dataset: we will glimpse it, identify the target variable, and plot the correlations:

glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
## Target variable: mpg
## Calculate all correlations
cor <- correlation::correlation(mtcars)
cor
Parameter1 Parameter2 r CI CI_low CI_high t df_error p
mpg cyl -0.85216196 0.95 -0.92576936 -0.7163171 -8.9196988 30 3.178597e-08
mpg disp -0.84755138 0.95 -0.92335937 -0.7081376 -8.7471515 30 4.783967e-08
mpg hp -0.77616837 0.95 -0.88526861 -0.5860994 -6.7423885 30 8.045259e-06
mpg drat 0.68117191 0.95 0.43604838 0.8322010 5.0960421 30 5.861592e-04
mpg wt -0.86765938 0.95 -0.93382641 -0.7440872 -9.5590441 30 6.857981e-09
mpg qsec 0.41868403 0.95 0.08195487 0.6696186 2.5252133 30 2.220659e-01
mpg vs 0.66403892 0.95 0.41036301 0.8223262 4.8643850 30 1.093100e-03
mpg am 0.59983243 0.95 0.31755830 0.7844520 4.1061270 30 8.265602e-03
mpg gear 0.48028476 0.95 0.15806177 0.7100628 2.9991906 30 9.721707e-02
mpg carb -0.55092507 0.95 -0.75464796 -0.2503183 -3.6157497 30 2.385782e-02
(Showing rows 1-10 of 55 and columns 1-10 of 12.)

We see correlation between all pairs of variables. We need to choose just those with target variable mpg:

theme_set(new = theme_custom())
##
cor %>%
  # Filter for target variable `mpg` and plot
  filter(Parameter1 == "mpg") %>%
  gf_point(r ~ reorder(Parameter2, r), size = 4) %>%
  gf_errorbar(CI_low + CI_high ~ reorder(Parameter2, r),
    width = 0.5
  ) %>%
  gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>%
  gf_labs(
    title = "Correlation Errorbar Chart",
    subtitle = "Target variable: mpg",
    x = "Predictor Variable",
    y = "Correlation Score with mpg"
  )

Note: Business Insights from ErrorBar Plot
  • Several variables are negatively correlated, and some are positively correlated, with mpg. (The grey line shows “zero correlation”.)
  • None of the error bars straddles zero, so each 95% confidence interval excludes a correlation of zero. Note, though, that the p values in the table above are not all small: those for qsec and gear exceed 0.05.
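
The "does the error bar straddle zero?" check can also be done in a table. This is a sketch in base R only, assuming stats::cor.test() as a stand-in for the correlation package (the helper column straddles_zero is a name introduced here for illustration):

```r
# Correlation of every predictor with the target variable mpg
predictors <- setdiff(names(mtcars), "mpg")

ci_table <- do.call(rbind, lapply(predictors, function(v) {
  ct <- cor.test(mtcars$mpg, mtcars[[v]])  # Pearson test, 95% interval
  data.frame(
    predictor = v,
    r         = unname(ct$estimate),
    ci_low    = ct$conf.int[1],
    ci_high   = ct$conf.int[2]
  )
}))

# An interval straddles zero when its two ends have opposite signs
ci_table$straddles_zero <- ci_table$ci_low < 0 & ci_table$ci_high > 0

ci_table[order(ci_table$r), ]  # same ordering as the error bar chart
```

The p values printed by correlation::correlation() may be adjusted for multiple comparisons (I believe a Holm adjustment is its default, but treat that as an assumption). That would explain why a confidence interval can exclude zero while the reported p value is above 0.05.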

A New Combination Plot…

Sometimes a scatter plot alone, or a density plot alone, or even the two viewed side by side, is not adequate to develop, or convey, our insight. We might just need a combination density + scatter plot. Such a plot can be constructed from the ground up using ggformula or ggplot; however, there is a nice package called ggExtra that allows the creation of a powerful combination plot:

# Set graph theme
theme_set(new = theme_custom())

library(ggExtra)

penguins %>%
  drop_na() %>%
  gf_point(body_mass_g ~ flipper_length_mm, colour = ~species) %>%
  gf_smooth(method = "lm") %>%
  gf_refine(scale_colour_brewer(palette = "Accent")) %>%
  gf_labs(title = "Scatter Plot with Marginal Densities") %>%
  ggExtra::ggMarginal(
    type = "density", groupColour = TRUE,
    groupFill = TRUE, margins = "both"
  )

An Interactive Correlation Game

Head off to this interactive game website where you can play with correlations!

https://openintro.shinyapps.io/correlation_game/

Simpson’s Paradox

See how the overall correlation/regression line slopes upward, whereas that for the individual groups slopes downward!! This is an example of Simpson’s Paradox!
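
The effect can be reproduced with simulated data. The sketch below is an illustration only: the three groups, their centres, the within-group slope of -0.8 and the noise levels are all made-up values, not taken from any dataset in this module. Each group trends downward on its own, but because the group centres rise together, the single line fitted through all points slopes upward.

```r
set.seed(42)  # reproducible simulated data (hypothetical, not from the tutorial)

# Three hypothetical groups whose centres rise together in x and y
centres <- data.frame(group = c("A", "B", "C"), cx = c(2, 5, 8), cy = c(2, 5, 8))
n <- 50

sim <- do.call(rbind, lapply(seq_len(nrow(centres)), function(i) {
  x <- rnorm(n, mean = centres$cx[i], sd = 1)
  # Within each group, y falls as x rises
  y <- centres$cy[i] - 0.8 * (x - centres$cx[i]) + rnorm(n, sd = 0.3)
  data.frame(group = centres$group[i], x = x, y = y)
}))

# Slope of the single regression line through all points
overall_slope <- unname(coef(lm(y ~ x, data = sim))["x"])

# Slope of a separate regression line within each group
group_slopes <- sapply(split(sim, sim$group), function(d) {
  unname(coef(lm(y ~ x, data = d))["x"])
})

print(overall_slope)
print(group_slopes)
```

Plotting `sim` as a scatter plot coloured by `group` makes the reversal visible at a glance.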

Your Turn

  1. Try to play this online Correlation Game.
  2. School Expenditure and Grades.

  3. Gas Prices and Consumption

As described here. Note the log-transformed Quant data… why do you reckon this was done in the data set itself?

  4. Horror Movies (Bah. You awful people...)

  6. Food Delivery Times

Wait, But Why?

  • Scatter plots that show “linear” clouds tell us that there is some relationship between the two Quant variables we have just plotted.
  • If one of them is the target variable you are trying to design for, then the other (the independent, or controllable, variable) is something you might want to design with.
Important

Target variables are usually plotted on the Y-axis, while Predictor variables are on the X-Axis, in a Scatter Plot. Why? Because y=mx+c !
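
To make the link between y = mx + c and the correlation score concrete: for a straight line fitted by least squares, the slope m equals r multiplied by sd(y) / sd(x). The sketch below checks this on the built-in mtcars data with mpg as the target (Y) and wt as the predictor (X). The choice of wt and the object names are illustrative assumptions.

```r
# Target on Y (mpg), predictor on X (wt), from the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)
m  <- unname(coef(fit)["wt"])           # slope of the fitted line
c0 <- unname(coef(fit)["(Intercept)"])  # intercept (the "c" in y = mx + c)

# The same slope recovered from the correlation score
r <- cor(mtcars$mpg, mtcars$wt)
m_from_r <- r * sd(mtcars$mpg) / sd(mtcars$wt)

print(c(slope = m, slope_from_r = m_from_r, intercept = c0))
```

The sign of the slope always matches the sign of r, which is why a strong correlation hints at a useful predictor.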

  • Correlation scores are good indicators of things that are, well, related. While one variable may not necessarily cause another, a good correlation score may indicate how to choose a good predictor.
  • That is something we will see when we examine Linear Regression.
  • Always, always, plot and test your data! Both numerical summaries as tables, and graphical summaries as charts, are necessary! See below!!
Warning: And How about these datasets?
dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
away 54.26610 47.83472 16.76983 26.93974 -0.0641284
bullseye 54.26873 47.83082 16.76924 26.93573 -0.0685864
circle 54.26732 47.83772 16.76001 26.93004 -0.0683434
dino 54.26327 47.83225 16.76514 26.93540 -0.0644719
dots 54.26030 47.83983 16.76774 26.93019 -0.0603414
h_lines 54.26144 47.83025 16.76590 26.93988 -0.0617148
high_lines 54.26881 47.83545 16.76670 26.94000 -0.0685042
slant_down 54.26785 47.83590 16.76676 26.93610 -0.0689797
slant_up 54.26588 47.83150 16.76885 26.93861 -0.0686092
star 54.26734 47.83955 16.76896 26.93027 -0.0629611
v_lines 54.26993 47.83699 16.76996 26.93768 -0.0694456
wide_lines 54.26692 47.83160 16.77000 26.93790 -0.0665752
x_shape 54.26015 47.83972 16.76996 26.93000 -0.0655833

Yes, you did want to plot that cute T-Rex, didn’t you? Here is the data then!!
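
The same lesson can be tried without installing anything. As a stand-in for the datasets in the table above, the sketch below uses Anscombe's quartet, which ships with base R as the data frame `anscombe` (columns x1 to x4 and y1 to y4). This is a swapped-in dataset chosen for self-containment, and the labels `set1` to `set4` are made up here. The summary columns mirror those in the table.

```r
# Anscombe's quartet is built into R; it is used here in place of the
# datasets in the table above so that no extra package is needed
quartet <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    dataset   = paste0("set", i),  # illustrative label
    mean_x    = mean(x),
    mean_y    = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y  = cor(x, y)
  )
}))

print(quartet)
```

The four rows are nearly identical, yet a scatter plot of each pair (for example `plot(y1 ~ x1, data = anscombe)`) shows four very different shapes.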

Warning
  • Can selling more ice-cream make people drown?
  • Use your head about pairs of variables: correlation alone does not establish causation. Do not fall into this trap!
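
A small simulation shows how such a trap can arise. In the sketch below every number is hypothetical: a made-up daily temperature drives both a made-up ice-cream sales figure and a made-up drowning count, and neither of those two affects the other. The raw correlation between them is still strong, and it largely disappears once the part explained by temperature is removed from each.

```r
set.seed(123)  # simulated, hypothetical numbers; not real data
n <- 1000

# Daily temperature drives both quantities
temperature     <- runif(n, min = 5, max = 35)
ice_cream_sales <- 10 + 2.0 * temperature + rnorm(n, sd = 5)
drownings       <- 1 + 0.3 * temperature + rnorm(n, sd = 1.5)

# The raw correlation looks strong
raw_r <- cor(ice_cream_sales, drownings)

# Remove the part of each variable explained by temperature,
# then correlate what is left over
ice_resid   <- resid(lm(ice_cream_sales ~ temperature))
drown_resid <- resid(lm(drownings ~ temperature))
adjusted_r  <- cor(ice_resid, drown_resid)

print(round(c(raw = raw_r, after_temperature = adjusted_r), 3))
```

Here temperature plays the role of a lurking variable; in real data the lurking variable is often not recorded, which is why judgement about the pair of variables matters.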

Conclusions

Scatter plots give us a sense of change, whether it is linear or non-linear. We can get an idea of the correlation between variables with a scatter plot. Our workflow for evaluating correlations between a target variable and several predictor variables uses packages such as GGally, corrplot, correlation, and of course mosaic for correlation tests.

AI Generated Summary and Podcast

This document focusses on correlation between quantitative variables. It examines different ways to visualize correlations, including scatter plots and correlograms. The document provides examples of how to use R packages like GGally and corrplot to create these visualizations and correlation tests to assess the strength and significance of relationships between variables. The tutorial uses the HollywoodMovies2011 and mtcars datasets as examples to demonstrate these concepts.


References

  1. Winston Chang (2024). R Graphics Cookbook. https://r-graphics.org

  2. Minimal R using mosaic. https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf

  3. Antoine Soetewey. Pearson, Spearman and Kendall correlation coefficients by hand. https://www.r-bloggers.com/2023/09/pearson-spearman-and-kendall-correlation-coefficients-by-hand/

  4. Taiyun Wei, Viliam Simko. An Introduction to corrplot Package. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

R Package Citations
Package Version Citation
corrplot 0.95 Wei and Simko (2024)
GGally 2.2.1 Schloerke et al. (2024)
ggExtra 0.10.1 Attali and Baker (2023)
Attali, Dean, and Christopher Baker. 2023. ggExtra: Add Marginal Histograms to “ggplot2,” and More “ggplot2” Enhancements. https://doi.org/10.32614/CRAN.package.ggExtra.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to “ggplot2”. https://doi.org/10.32614/CRAN.package.GGally.
Wei, Taiyun, and Viliam Simko. 2024. R Package “corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.

Citation

BibTeX citation:
@online{v.2022,
  author = {V., Arvind},
  title = {Change},
  date = {2022-11-22},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/},
  langid = {en},
  abstract = {How one variable changes with another}
}
For attribution, please cite this work as:
V., Arvind. 2022. “Change.” November 22, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/.

License: CC BY-SA 2.0

Website made with ❤️ and Quarto, by Arvind V.

Hosted by Netlify.