library(tidyverse) # Tidy data processing and plotting
library(ggformula) # Formula based plots
library(mosaic) # Our go-to package
library(skimr) # Another Data inspection package
library(kableExtra) # Making good tables with data
library(GGally) # Corr plots
library(corrplot) # More corrplots
library(ggExtra) # Making Combination Plots
# library(devtools)
# devtools::install_github("rpruim/Lock5withR")
library(Lock5withR) # Datasets
library(palmerpenguins) # A famous dataset
library(easystats) # Easy Statistical Analysis and Charts
library(correlation) # Different Types of Correlations
# From the easystats collection of packages
Change
Correlations
"The world says: 'You have needs - satisfy them. You have as much right as the rich and the mighty. Don't hesitate to satisfy your needs; indeed, expand your needs and demand more.' This is the worldly doctrine of today. And they believe that this is freedom. The result for the rich is isolation and suicide, for the poor, envy and murder."
- Fyodor Dostoevsky
Setting up R Packages
Plot Theme
# https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto
# Chunk options
knitr::opts_chunk$set(
fig.width = 7,
fig.asp = 0.618, # Golden Ratio
# out.width = "80%",
fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes
theme_custom <- function() {
font <- "Roboto Condensed" # assign font family up front
theme_classic(base_size = 14) %+replace% # replace elements we want to change
theme(
panel.grid.minor = element_blank(), # strip minor gridlines
# text elements
plot.title = element_text( # title
family = font, # set font family
# size = 20, #set font size
face = "bold", # bold typeface
hjust = 0, # left align
# vjust = 2 #raise slightly
margin = margin(0, 0, 10, 0)
),
plot.subtitle = element_text( # subtitle
family = font, # font family
# size = 14, #font size
hjust = 0,
margin = margin(2, 0, 5, 0)
),
plot.caption = element_text( # caption
family = font, # font family
size = 8, # font size
hjust = 1
), # right align
axis.title = element_text( # axis titles
family = font, # font family
size = 10 # font size
),
axis.text = element_text( # axis text
family = font, # axis family
size = 8
) # font size
)
}
# Set graph theme
theme_set(new = theme_custom())
What graphs will we see today?
Variable #1 | Variable #2 | Chart Names | Chart Shape |
---|---|---|---|
Quant | Quant | Scatter Plot | |
Some of the very basic and commonly used plots for data are:
- Scatter Plot for two variables
- Contour Plot
- Scatter Plot with Confidence Ellipses
- Pairwise Correlation Plots for multiple variables
- Correlogram for multiple variables
- Heatmap for multiple variables
- Errorbar chart for multiple variables
- Combination chart with marginal densities
What kind of Data Variables will we choose?
No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
---|---|---|---|---|---|
1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
Inspiration
Does belief in Evolution depend upon the GDP of the country? Where is the US in all of this? Does the Bible Belt tip the scales here?
And India?
What is Correlation?
One of the basic Questions we would have of our data is: Does some variable depend upon another in some way? Does \(y\) vary with \(x\)? A Correlation Test is designed to answer exactly this question.
The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between rainy days and reduced sales at supermarkets. However, in statistical terms we use correlation to denote association between two quantitative variables. We also assume that the association is linear, that one variable increases or decreases a fixed amount for a unit increase or decrease in the other. The other technique that is often used in these circumstances is regression, which involves estimating the best straight line to summarise the association.
Pearson Correlation coefficient
The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes called Pearson's correlation coefficient after its originator and is a measure of linear association. (If a curved line is needed to express the relationship, other and more complicated measures of the correlation must be used.)
The correlation coefficient is measured on a scale that varies from +1 through 0 to -1. Complete correlation between two variables is expressed by either +1 or -1. When one variable increases as the other increases the correlation is positive; when one decreases as the other increases it is negative.
In formal terms, the correlation between two variables \(x\) and \(y\) is defined as
\[ \rho = E\left[\frac{(x - \mu_{x}) * (y - \mu_{y})}{(\sigma_x)*(\sigma_y)}\right] \]
where \(E\) is the expectation operator (i.e., taking the mean). Think of this as the average of the products of two scaled variables.
We can see that \((x-\mu_x)/\sigma_x\) is a centering and scaling of the variable \(x\). Recall from our discussion on Distributions that this is called the z-score of x.
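As a minimal sketch of the formula above, the correlation can be computed by hand as the average product of two z-scores and compared with R's built-in cor(). The choice of the mtcars columns wt and mpg is only for illustration, and because sd() uses the sample (n - 1) denominator, the average here also divides by n - 1.

```r
# Illustrative sketch: Pearson correlation as the "average" product of z-scores.
# Assumption: the built-in mtcars columns wt and mpg are used purely as an example.
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Centre and scale each variable (z-scores); sd() uses the n - 1 denominator
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Average the products, dividing by n - 1 to stay consistent with sd()
r_manual <- sum(z_x * z_y) / (n - 1)

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

c(manual = r_manual, builtin = r_builtin)
```

Both numbers should agree, which is a quick way to convince yourself that correlation really is built from z-scores.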
Pearson correlation assumes that the relationship between the two variables is linear. There are of course many other types of correlation measures, some of which work when the relationship is not linear. Type vignette("types", package = "correlation") in your Console to see the vignette from the correlation package that discusses various types of correlation measures.
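To see why other measures can be useful, here is a small sketch with made-up data in which y always increases with x, but not along a straight line. It uses only base R's cor() with its method argument; Spearman's rank correlation is one of the alternatives, chosen here simply as an example.

```r
# Illustrative sketch: a monotonic but non-linear relationship (made-up data)
x <- 1:20
y <- exp(x / 2) # grows much faster than a straight line

r_pearson <- cor(x, y, method = "pearson")   # measures linear association
r_spearman <- cor(x, y, method = "spearman") # measures rank (monotonic) association

c(pearson = r_pearson, spearman = r_spearman)
```

The rank-based score treats the relationship as perfect because the ordering of x and y is identical, while the Pearson score is pulled below 1 by the curvature.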
Case Study-1: HollywoodMovies2011 dataset
Let us look at the HollywoodMovies2011 dataset from the Lock5withR package. The dataset is also available by clicking the icon below (in case you are not able to install Lock5withR):
Inspecting the Data
HollywoodMovies2011 -> movies
glimpse(movies)
Rows: 136
Columns: 14
$ Movie <fct> "Insidious", "Paranormal Activity 3", "Bad Teacher",…
$ LeadStudio <fct> Sony, Independent, Independent, Warner Bros, Relativ…
$ RottenTomatoes <int> 67, 68, 44, 96, 90, 93, 75, 35, 63, 69, 69, 49, 26, …
$ AudienceScore <int> 65, 58, 38, 92, 77, 84, 91, 58, 74, 73, 72, 57, 68, …
$ Story <fct> Monster Force, Monster Force, Comedy, Rivalry, Rival…
$ Genre <fct> Horror, Horror, Comedy, Fantasy, Comedy, Romance, Dr…
$ TheatersOpenWeek <int> 2408, 3321, 3049, 4375, 2918, 944, 2534, 3615, NA, 2…
$ BOAverageOpenWeek <int> 5511, 15829, 10365, 38672, 8995, 6177, 10278, 23775,…
$ DomesticGross <dbl> 54.01, 103.66, 100.29, 381.01, 169.11, 56.18, 169.22…
$ ForeignGross <dbl> 43.00, 98.24, 115.90, 947.10, 119.28, 83.00, 30.10, …
$ WorldGross <dbl> 97.009, 201.897, 216.196, 1328.111, 288.382, 139.177…
$ Budget <dbl> 1.5, 5.0, 20.0, 125.0, 32.5, 17.0, 25.0, 80.0, 0.2, …
$ Profitability <dbl> 64.672667, 40.379400, 10.809800, 10.624888, 8.873292…
$ OpeningWeekend <dbl> 13.27, 52.57, 31.60, 169.19, 26.25, 5.83, 26.04, 85.…
movies has 136 observations on the following 14 variables.
- Movie: a factor with many levels
- LeadStudio: a factor with many levels
- RottenTomatoes: a numeric vector
- AudienceScore: a numeric vector
- Story: a factor with many levels
- Genre: a factor with levels Action, Adventure, Animation, Comedy, Drama, Fantasy, Horror, Romance, Thriller.
- TheatersOpenWeek: a numeric vector. No. of theatres.
- BOAverageOpenWeek: a numeric vector.
- DomesticGross: a numeric vector. In million USD.
- ForeignGross: a numeric vector. In million USD.
- WorldGross: a numeric vector. In million USD.
- Budget: a numeric vector. In million USD.
- Profitability: a numeric vector. A ratio.
- OpeningWeekend: a numeric vector. In million USD.
There are no missing values in the Qual variables, but some entries in the Quant variables are missing. skim throws a warning that we may need to examine later.
Let us look at the Quant variables: are these related in any way? Could the relationship between any two Quant variables also depend upon the level of a Qual variable?
Scatter Plots
Which are the numeric variables in movies?
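One way to answer this, sketched here in base R, is to test each column with is.numeric(). The helper name numeric_vars is hypothetical, and the built-in iris data frame stands in for movies so that the snippet runs without Lock5withR; calling numeric_vars(movies) should work the same way once movies is available.

```r
# Hypothetical helper: names of the numeric (Quant) columns of a data frame
numeric_vars <- function(df) {
  names(df)[vapply(df, is.numeric, logical(1))]
}

# Stand-in example with a built-in dataset (one factor, four numeric columns)
quant_names <- numeric_vars(iris)
quant_names

# With the course data loaded, the equivalent call would be:
# numeric_vars(movies)
```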
Now let us plot their relationships.
We can split some of the scatter plots using one or other of the Qual variables. For instance, is the relationship between the two ratings the same, regardless of movie genre?
# Set graph theme
theme_set(new = theme_custom())
##
movies %>%
drop_na() %>%
ggplot(aes(RottenTomatoes, AudienceScore, color = Genre)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Scatter Plot",
subtitle = "Movie Ratings: Trends by Genre"
)
movies scatter plots
We have fitted a trend line to each of the scatter plots.
- DomesticGross and WorldGross are related, though there are fewer movies at the high end of DomesticGross.
- AudienceScore and RottenTomatoes seem clearly related: both increase together.
- OpeningWeekend and Profitability are also related in a linear way. There are just two movies which have been extremely profitable, but they do not influence the slope of the trend line too much, because of their location midway in the range of OpeningWeekend. Influence is a key concept in Linear Regression.
- By and large, there are only small variations in slope across Genres.
Note that we have rather arbitrarily taken RottenTomatoes as the independent variable, to be plotted on the x-axis, and AudienceScore on the y-axis. It could easily have been the other way around, based on our Research Question. Datasets are gathered with specific Research Hypotheses in mind, so check the help file and also with the person who gathered the data about what variable they are interested in!
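The choice of axes matters for the trend line but not for the correlation itself. A small sketch, using the built-in mtcars data as a stand-in (an assumption; the same check could be run on any two Quant columns of movies):

```r
# Stand-in data: two quantitative columns from the built-in mtcars dataset
x <- mtcars$wt
y <- mtcars$mpg

# Correlation is symmetric: swapping the variables changes nothing
r_xy <- cor(x, y)
r_yx <- cor(y, x)

# The fitted slope is not symmetric: it depends on which variable is on the y-axis
slope_y_on_x <- unname(coef(lm(y ~ x))[2])
slope_x_on_y <- unname(coef(lm(x ~ y))[2])

c(r_xy = r_xy, r_yx = r_yx, slope_y_on_x = slope_y_on_x, slope_x_on_y = slope_x_on_y)
```

So the correlation score does not care which variable we call independent, while the regression line does.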
Quantifying Correlation
So we see that there are visible relationships between Quant variables. How do we quantify this relationship as a correlation score?
There are two ways: using the GGally and corrplot packages, and doing a formal correlation test with the mosaic package.
By default, GGally::ggpairs() provides:
- two different comparisons of each pair of columns
- displays either the density or count of the respective variable along the diagonal.
- With different parameter settings, the diagonal can be replaced with the axis values and variable labels.
# Set graph theme
theme_set(new = theme_custom())
# names(movies_quant)
GGally::ggpairs(
movies %>% drop_na(),
# Select Quant variables only for now
columns = c(
"RottenTomatoes", "AudienceScore", "DomesticGross", "ForeignGross"
),
switch = "both",
# axis labels in more traditional locations(left and bottom)
progress = FALSE,
# no compute progress messages needed
# Choose the diagonal graphs (always single variable! Think!)
diag = list(continuous = "barDiag"),
# choosing histogram,not density
# Choose lower triangle graphs, two-variable graphs
lower = list(continuous = wrap("smooth", alpha = 0.3, se = FALSE)),
title = "Movies Data Correlations Plot #1"
)
- As we saw earlier from the Scatter Plot, AudienceScore and RottenTomatoes are well correlated, with a correlation score of \(0.833\).
- DomesticGross and ForeignGross are also extremely well correlated, with a score of \(0.873\).
- Both these correlation scores are highly significant, with three stars. (We will speak of significance in a while.)
- None of the other pairs of variables have good correlation scores.
- Note in passing that both the "Gross" related variables have highly skewed distributions. That is the nature of the movie business!
Let us also try a few other variables, related to budget and profits. For instance, it would be interesting to see the relationship between Budget and Profitability, and even between either of the "gross" earnings and Profitability.
# Set graph theme
theme_set(new = theme_custom())
GGally::ggpairs(
movies %>% drop_na(),
# Select Quant variables only for now
columns = c(
"Budget", "Profitability", "DomesticGross", "ForeignGross"
),
switch = "both",
# axis labels in more traditional locations(left and bottom)
progress = FALSE,
# no compute progress messages needed
# Choose the diagonal graphs (always single variable! Think!)
diag = list(continuous = "barDiag"),
# choosing histogram,not density
# Choose lower triangle graphs, two-variable graphs
lower = list(continuous = wrap("smooth", alpha = 0.3, se = FALSE)),
title = "Movies Data Correlations Plot #2"
)
- The Budget variable has good correlation scores with DomesticGross and ForeignGross.
- Profitability and Budget seem to have a very slight negative correlation, but this does not appear to be significant.
In this chart, the correlation between pairs of variables is shown symbolically as coloured shapes: circles, squares, and ellipses, for example.
- The size, colour, and "orientation" of the shapes symbolically represent the strength and polarity of the correlation scores.
- The direction of the semi-major axis and the colour of the ellipse indicate whether the correlation score is positive or negative.
- The more eccentric the ellipse, the larger the magnitude of the correlation score.
Whereas GGally computes the correlation scores, corrplot "merely" displays them in an evocative way. We need to compute the correlations beforehand.
Note also:
R package corrplot provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables. corrplot is very easy to use and provides a rich array of plotting options in visualization method, graphic layout, color, legend, text labels, etc. It also provides p-values and confidence intervals to help users determine the statistical significance of the correlations.
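The corrplot calls that follow expect a correlation matrix named mydata_cor. How it was built is not shown here, so the following is only a sketch under assumptions: a hypothetical helper cor_matrix() keeps the numeric columns and calls base R's cor() with pairwise-complete observations, and the built-in mtcars data stands in for movies.

```r
# Hypothetical helper: correlation matrix of the numeric columns of a data frame
cor_matrix <- function(df) {
  quant <- df[vapply(df, is.numeric, logical(1))]
  # use = "pairwise.complete.obs" ignores missing values pair by pair
  cor(quant, use = "pairwise.complete.obs")
}

# Stand-in example so that the snippet runs without Lock5withR
example_cor <- cor_matrix(mtcars)
round(example_cor[1:3, 1:3], 2)

# With the course data loaded, one plausible definition (an assumption) would be:
# mydata_cor <- cor_matrix(movies)
```

The result is a square, symmetric matrix with ones on the diagonal, which is the shape of input that corrplot() displays.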
## View the matrix
corrplot::corrplot(mydata_cor,
method = "number",
number.cex = 0.6,
cl.cex = 0.6, tl.cex = 0.6
)
# Default plot with circles
corrplot(mydata_cor,
method = "circle",
main = "Correlogram with Circles"
)
# Ellipse plot
corrplot(mydata_cor,
method = "ellipse",
main = "Correlogram with Ellipses"
)
# Heatmap
corrplot(mydata_cor,
method = "color", ## US Spelling only
main = "Correlogram"
)
# Set graph theme
theme_set(new = theme_custom())
# Heatmap with numbers
corrplot.mixed(mydata_cor,
lower = "color", number.cex = 0.6,
cl.cex = 0.6, tl.cex = 0.6,
upper = "number",
tl.pos = "l",
main = "Heatmap?"
)
- Most of the variables here are positively correlated, and many of these correlations are significant.
Doing a Correlation Test
Correlation scores can be obtained by conducting a formal test in R. We will use the mosaic function cor_test to get these results:
mosaic::cor_test(Profitability ~ Budget, data = movies) %>%
broom::tidy() %>%
knitr::kable(
digits = 2,
caption = "Movie Profitability vs Budget"
)
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
-0.08 | -0.96 | 0.34 | 132 | -0.25 | 0.09 | Pearson's product-moment correlation | two.sided |
mosaic::cor_test(DomesticGross ~ Budget, data = movies) %>%
broom::tidy() %>%
knitr::kable(
digits = 2,
caption = "Movie Domestic Gross vs Budget"
)
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
0.7 | 11.06 | 0 | 131 | 0.6 | 0.77 | Pearson's product-moment correlation | two.sided |
mosaic::cor_test(ForeignGross ~ Budget, data = movies) %>%
broom::tidy() %>%
knitr::kable(
digits = 2,
caption = "Movie Foreign Gross vs Budget"
)
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
0.69 | 10.22 | 0 | 118 | 0.58 | 0.77 | Pearson's product-moment correlation | two.sided |
Budget and Profitability are not well correlated, sadly. We see this from the p.value, which is \(0.34\), and from the confidence interval for the correlation estimate, which covers \(0\).
However, both DomesticGross and ForeignGross are well correlated with Budget. Look at the p.value (0 when rounded to two digits) and the confidence intervals, which are unipolar: each lies entirely on one side of zero.
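The same quantities can be read off base R's stats::cor.test(), sketched below on the built-in mtcars data (a stand-in, not the movies results above). The check that both ends of the confidence interval have the same sign is one way to express "unipolar". The parameter column in the tables is the degrees of freedom, i.e. the number of complete pairs minus 2, which would explain why it differs across the three tests when some values are missing.

```r
# Stand-in example: Pearson correlation test on two built-in mtcars columns
test <- cor.test(~ mpg + wt, data = mtcars)

r_est <- unname(test$estimate)      # correlation estimate
p_val <- test$p.value               # p-value for H0: correlation is zero
ci <- as.numeric(test$conf.int)     # 95% confidence interval (the default level)
deg_free <- unname(test$parameter)  # degrees of freedom = complete pairs - 2

# "Unipolar" interval: both ends lie on the same side of zero
ci_excludes_zero <- ci[1] * ci[2] > 0

list(
  estimate = r_est, p.value = p_val, conf.int = ci,
  df = deg_free, ci_excludes_zero = ci_excludes_zero
)
```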
The ErrorBar Plot for Correlations
As stated earlier, in our dataset we have a specific dependent or target variable, which represents the outcome of our experiment or our business situation. The remaining variables are usually independent or predictor variables. A very useful thing to know, and to view, would be the correlations of all independent variables with the target. Using the correlation package from the easystats family of R packages, this can be very easily achieved. Let us quickly do this for the familiar mtcars dataset: we will glimpse it, identify the target variable, and plot the correlations:
glimpse(mtcars)
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
## Target variable: mpg
## Calculate all correlations
cor <- correlation::correlation(mtcars)
cor
theme_set(new = theme_custom())
##
cor %>%
# Filter for target variable `mpg` and plot
filter(Parameter1 == "mpg") %>%
gf_point(r ~ reorder(Parameter2, r), size = 4) %>%
gf_errorbar(CI_low + CI_high ~ reorder(Parameter2, r),
width = 0.5
) %>%
gf_hline(yintercept = 0, color = "grey", linewidth = 2) %>%
gf_labs(
title = "Correlation Errorbar Chart",
subtitle = "Target variable: mpg",
x = "Predictor Variable",
y = "Correlation Score with mpg"
)
- Several variables are negatively correlated and some are positively correlated with mpg. (The grey line shows "zero correlation".)
- Since none of the error bars straddles zero, each of these correlations is significantly different from zero at the confidence level used for the intervals.
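The "straddles zero" check can also be done numerically. The sketch below uses only base R's cor.test() on each predictor in turn; the intervals it produces need not match those reported by the correlation package exactly, and the column names in the result (predictor, r, ci_low, ci_high, straddles_zero) are our own choices.

```r
# Sketch in base R: correlation of each predictor with the target `mpg`,
# together with its 95% confidence interval from cor.test()
predictors <- setdiff(names(mtcars), "mpg")

cor_with_mpg <- do.call(rbind, lapply(predictors, function(v) {
  test <- cor.test(mtcars$mpg, mtcars[[v]])
  data.frame(
    predictor = v,
    r = unname(test$estimate),
    ci_low = test$conf.int[1],
    ci_high = test$conf.int[2]
  )
}))

# An error bar "straddles zero" when its two ends have opposite signs
cor_with_mpg$straddles_zero <- cor_with_mpg$ci_low < 0 & cor_with_mpg$ci_high > 0

# Sort by correlation score, as in the chart
cor_with_mpg[order(cor_with_mpg$r), ]
```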
A New Combination Plot...
Sometimes a simple scatter plot, or a density plot alone, or the two viewed next to one another, is not adequate to develop, or convey, our insight. We might just need a combination density + scatter plot. Such a plot can be constructed from the ground up using ggformula or ggplot; however, there is a nice package called ggExtra that allows the creation of a powerful combination plot:
# Set graph theme
theme_set(new = theme_custom())
library(ggExtra)
penguins %>%
drop_na() %>%
gf_point(body_mass_g ~ flipper_length_mm, colour = ~species) %>%
gf_smooth(method = "lm") %>%
gf_refine(scale_colour_brewer(palette = "Accent")) %>%
gf_labs(title = "Scatter Plot with Marginal Densities") %>%
ggExtra::ggMarginal(
type = "density", groupColour = TRUE,
groupFill = TRUE, margins = "both"
)
An Interactive Correlation Game
Head off to this interactive game website where you can play with correlations!
Simpson's Paradox
See how the overall correlation/regression line slopes upward, whereas that for the individual groups slopes downward!! This is an example of Simpson's Paradox!
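The effect is easy to reproduce with made-up numbers. In this sketch (entirely synthetic data, not the dataset in the chart), each of three groups has a perfectly decreasing trend, yet the group centres rise together, so the overall correlation is strongly positive.

```r
# Made-up data: three groups, each with a perfectly decreasing trend,
# but with group centres that increase together
group <- rep(1:3, each = 5)
k <- rep(0:4, times = 3)
x <- 10 * group + k
y <- 10 * group - k

# Correlation over all points, ignoring the groups
r_overall <- cor(x, y)

# Correlation within each group
r_within <- sapply(split(data.frame(x, y), group), function(d) cor(d$x, d$y))

r_overall
r_within
```

Ignoring the grouping variable reverses the sign of the relationship, which is exactly why it helps to colour scatter plots by a Qual variable.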
Your Turn
- Try to play this online Correlation Game.
As described here. Note the log-transformed Quant data... why do you reckon this was done in the data set itself?
Wait, But Why?
- Scatter Plots, when they show βlinearβ clouds, tell us that there is some relationship between two Quant variables we have just plotted
- If so, then if one is the target variable you are trying to design for, then the other independent, or controllable, variable is something you might want to design with.
Target variables are usually plotted on the Y-axis, while Predictor variables are on the X-Axis, in a Scatter Plot. Why? Because \(y = mx + c\) !
- Correlation scores are good indicators of things that are, well, related. While one variable may not necessarily cause another, a good correlation score may indicate how to choose a good predictor.
- That is something we will see when we examine Linear Regression
- Always, always, plot and test your data! Both numerical summaries as tables, and graphical summaries as charts, are necessary! See below!!
dataset | mean_x | mean_y | std_dev_x | std_dev_y | corr_x_y |
---|---|---|---|---|---|
away | 54.26610 | 47.83472 | 16.76983 | 26.93974 | -0.0641284 |
bullseye | 54.26873 | 47.83082 | 16.76924 | 26.93573 | -0.0685864 |
circle | 54.26732 | 47.83772 | 16.76001 | 26.93004 | -0.0683434 |
dino | 54.26327 | 47.83225 | 16.76514 | 26.93540 | -0.0644719 |
dots | 54.26030 | 47.83983 | 16.76774 | 26.93019 | -0.0603414 |
h_lines | 54.26144 | 47.83025 | 16.76590 | 26.93988 | -0.0617148 |
high_lines | 54.26881 | 47.83545 | 16.76670 | 26.94000 | -0.0685042 |
slant_down | 54.26785 | 47.83590 | 16.76676 | 26.93610 | -0.0689797 |
slant_up | 54.26588 | 47.83150 | 16.76885 | 26.93861 | -0.0686092 |
star | 54.26734 | 47.83955 | 16.76896 | 26.93027 | -0.0629611 |
v_lines | 54.26993 | 47.83699 | 16.76996 | 26.93768 | -0.0694456 |
wide_lines | 54.26692 | 47.83160 | 16.77000 | 26.93790 | -0.0665752 |
x_shape | 54.26015 | 47.83972 | 16.76996 | 26.93000 | -0.0655833 |
Yes, you did want to plot that cute T-Rex, didn't you? Here is the data then!!
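The table above shows thirteen datasets with nearly identical summaries. As a stand-in that ships with base R (a swap made so that the snippet needs no extra packages; it is not the data in the table), Anscombe's quartet in the built-in anscombe data frame makes the same point: four x-y pairs with nearly identical correlation scores that look very different when plotted.

```r
# Swapped-in example: Anscombe's quartet from base R (not the data in the table)
pairs_idx <- 1:4

# Correlation of x1 with y1, x2 with y2, and so on
anscombe_cor <- sapply(pairs_idx, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(anscombe_cor) <- paste0("pair_", pairs_idx)

# Nearly identical correlation scores
round(anscombe_cor, 3)

# To see how different the four pairs look, plot each one, for example:
# plot(anscombe$x1, anscombe$y1)
```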
- Can selling more ice-cream make people drown?
- Use your head about pairs of variables. Do not fall into this trap.
Conclusions
Scatter Plots give us a sense of change, whether it is linear or non-linear. We can get an idea of correlation between variables with a scatter plot. Our workflow for evaluating correlations between a target variable and several other predictor variables uses several packages such as GGally, corrplot, correlation, and of course mosaic for correlation tests.
AI Generated Summary and Podcast
This document focusses on correlation between quantitative variables. It examines different ways to visualize correlations, including scatter plots and correlograms. The document provides examples of how to use R packages like GGally and corrplot to create these visualizations, and correlation tests to assess the strength and significance of relationships between variables. The tutorial uses the HollywoodMovies2011 and mtcars datasets as examples to demonstrate these concepts.
References
Winston Chang (2024). R Graphics Cookbook. https://r-graphics.org
Minimal R using mosaic. https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf
Antoine Soetewey. Pearson, Spearman and Kendall correlation coefficients by hand. https://www.r-bloggers.com/2023/09/pearson-spearman-and-kendall-correlation-coefficients-by-hand/
Taiyun Wei, Viliam Simko. An Introduction to corrplot Package. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
Citation
@online{v.2022,
author = {V., Arvind},
title = {Change},
date = {2022-11-22},
url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/},
langid = {en},
abstract = {How one variable changes with another}
}