Densities: Plotting Distributions
The Hills are Shadows, said Tennyson
Slides and Tutorials
R (Static Viz) | Radiant Tutorial | Datasets |
Setting up R Packages
What graphs will we see today?
Variable #1 | Variable #2 | Chart Names | Chart Shape |
---|---|---|---|
Quant | None | Density plot, Ridge Density Plot | |
What kind of Data Variables will we choose?
No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
---|---|---|---|---|---|
1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
What is a "Density Plot"?
As we saw earlier, Histograms are best to show the distribution of raw Quantitative data, by displaying the number of values that fall within defined ranges, often called buckets or bins.
Sometimes it is useful to consider a chart where the bucket width shrinks to zero!
You might imagine a density chart as a histogram where the buckets are infinitesimally small, i.e. approaching zero width. To keep the heights meaningful as the buckets shrink, each count is divided by the total number of observations and by the bucket width, so that the total area under the chart is always 1. In calculus terms, the density is the derivative of the cumulative distribution: taking the smallest of steps \(\sim 0\), it measures how quickly the cumulative proportion of the data grows at each value. This may seem counter-intuitive, but densities are useful for spotting the ranges in the data where values are more frequent. In this, they serve a similar purpose to histograms, but may offer insights not readily apparent with histograms, especially with default bucket widths. The chunkiness that we see in histograms is smoothed away, leaving a curve that shows in which ranges the data are more frequent.
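The relationship between a histogram and a density can be checked numerically. The following is a minimal sketch in base R, using synthetic data generated only for this illustration (it is not one of the tutorial's datasets). It shows that bar heights on the density scale always enclose an area of 1, whatever the bucket width, and that the smooth curve from `density()` does the same to a close approximation.

```r
# Synthetic data, for illustration only
set.seed(42)
x <- rnorm(500, mean = 50, sd = 10)

# Histogram heights on the density scale: count / (n * bucket width)
h_wide   <- hist(x, breaks = 10, plot = FALSE)
h_narrow <- hist(x, breaks = 60, plot = FALSE)

# Area of the bars = sum(height * width); it is 1 for any bucket width
area_wide   <- sum(h_wide$density * diff(h_wide$breaks))
area_narrow <- sum(h_narrow$density * diff(h_narrow$breaks))

# Kernel density estimate: the smooth curve that density plots draw
d <- density(x)

# Area under the smooth curve, by the trapezoid rule
area_kde <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Narrow-bucket histogram with the smooth density drawn on top
hist(x, breaks = 60, freq = FALSE,
     main = "Histogram vs. density", xlab = "x")
lines(d, lwd = 2)
```

Because the area is fixed at 1, the height of the curve is not a count: it only tells us where values are relatively more or less frequent.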
Case Study-1: penguins dataset
We will first look at a dataset that is directly available in R, the penguins dataset.
Examine the Data
As per our Workflow, we will look at the data using all the three methods we have seen.
glimpse(penguins)
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
skim(penguins)
Name | penguins |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 |
penguins dataset
- This is a smallish dataset (344 rows, 8 columns).
- There are several Qualitative variables: species, island and sex. These have 3, 3, and 2 levels respectively. They are all <fct>, i.e. factors.
- bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g are Quantitative variables.
- There are a few missing values in sex (11 missing entries) and all the Quant variables (2 missing entries each).
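The missing-value counts above can be cross-checked with a few lines of base R. This is a sketch that assumes the penguins data come from the palmerpenguins package (the column names in the output above match that package) and that the package is installed.

```r
# Assumption: palmerpenguins is installed and is the source of `penguins`
penguins <- palmerpenguins::penguins

# Number of missing entries in each column
n_missing <- colSums(is.na(penguins))
n_missing

# Rows without any missing value; these are the rows drop_na() keeps
n_complete <- sum(complete.cases(penguins))
n_dropped  <- nrow(penguins) - n_complete
n_dropped
```

The rows that lack the Quant measurements also lack sex, which is why the number of rows dropped equals the number of missing sex entries rather than the sum over all columns.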
Plotting Densities
## Set graph theme (theme_custom() is defined outside this section)
theme_set(new = theme_custom())
## Remove the rows containing NA (11 rows!)
penguins <- penguins %>% drop_na()
gf_density( ~ body_mass_g, data = penguins) %>%
  gf_labs(title = "Plot A: Penguin Masses", caption = "ggformula")
###
penguins %>%
  gf_density( ~ body_mass_g, fill = ~ species, color = "black") %>%
  gf_refine(scale_color_viridis_d(option = "magma", aesthetics = c("colour", "fill"))) %>%
  gf_labs(title = "Plot B: Penguin Body Mass by Species", caption = "ggformula")
###
penguins %>%
  gf_density(
    ~ body_mass_g,
    fill = ~ species,
    color = "black",
    alpha = 0.3
  ) %>%
  gf_facet_wrap(vars(sex)) %>%
  gf_labs(title = "Plot C: Penguin Body Mass by Species and facetted by Sex", caption = "ggformula")
###
penguins %>%
  gf_density( ~ body_mass_g, fill = ~ species, color = "black") %>%
  gf_facet_wrap(vars(sex), scales = "free_y", nrow = 2) %>%
  gf_labs(title = "Plot D: Penguin Body Mass by Species and facetted by Sex",
          subtitle = "Free y-scale",
          caption = "ggformula") %>%
  gf_theme(theme(axis.text.x = element_text(angle = 45, hjust = 1)))
## Set graph theme (theme_custom() is defined outside this section)
theme_set(new = theme_custom())
## Remove the rows containing NA (11 rows!)
penguins <- penguins %>% drop_na()
ggplot(data = penguins) +
  geom_density(aes(x = body_mass_g)) +
  labs(title = "Plot A: Penguin Masses", caption = "ggplot")
###
penguins %>%
  ggplot() +
  geom_density(aes(x = body_mass_g, fill = species),
               color = "black") +
  scale_color_viridis_d(option = "magma",
                        aesthetics = c("colour", "fill")) +
  labs(title = "Plot B: Penguin Body Mass by Species",
       caption = "ggplot")
###
penguins %>%
  ggplot() +
  geom_density(aes(x = body_mass_g, fill = species),
               color = "black",
               alpha = 0.3) +
  facet_wrap(vars(sex)) +
  labs(title = "Plot C: Penguin Body Mass by Species and facetted by Sex", caption = "ggplot")
###
penguins %>%
  ggplot() +
  geom_density(aes(x = body_mass_g, fill = species),
               color = "black") +
  facet_wrap(vars(sex), scales = "free_y", nrow = 2) +
  labs(title = "Plot D: Penguin Body Mass by Species and facetted by Sex",
       subtitle = "Free y-scale", caption = "ggplot") +
  ## Rotate the x-axis labels; theme() is added with +, not piped with %>%
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
diamond Densities
The conclusions are much the same as with histograms. Although densities may not be used much in business contexts, they are better than histograms when comparing multiple distributions! So you should use them!
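To make the comparison of multiple distributions concrete, here is a small base R sketch that computes one density per species and reports where each curve peaks. It assumes the palmerpenguins package is installed; the helper names (`mass_by_species`, `dens`, `peaks`) are chosen just for this illustration.

```r
# Assumption: palmerpenguins is installed; keep only the complete rows
penguins <- na.omit(palmerpenguins::penguins)

# Body masses split into one vector per species
mass_by_species <- split(penguins$body_mass_g, penguins$species)

# One kernel density estimate per species (default bandwidth for each)
dens <- lapply(mass_by_species, density)

# Body mass at which each species' density curve is highest
peaks <- sapply(dens, function(d) d$x[which.max(d$y)])
round(peaks)

# Overlay the three curves on common axes
y_max <- max(sapply(dens, function(d) max(d$y)))
plot(dens[[1]], xlim = range(penguins$body_mass_g), ylim = c(0, y_max),
     lty = 1, main = "Body mass density by species", xlab = "body_mass_g")
for (i in 2:length(dens)) lines(dens[[i]], lty = i)
legend("topright", legend = names(dens), lty = seq_along(dens))
```

Because every curve has the same total area, species with very different numbers of penguins can be compared directly, which is harder to do with overlapping histograms of raw counts.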
Ridge Plots
Sometimes we may wish to show the distribution/density of a Quant variable against several levels of a Qual variable. For instance, the prices of different items of furniture, based on the furniture "style" variable. Or the sales of a particular line of products, across different shops or cities. We did this with both histograms and densities, by colouring based on a Qual variable, and by facetting using a Qual variable. There is a third way, using what is called a ridge plot. ggformula supports this plot by importing/depending upon the ggridges package; ggplot itself does not have this capability built in, although ggridges supplies geom_density_ridges() for use with ggplot.
## Set graph theme
theme_set(new = theme_custom())
##
gf_density_ridges(drv ~ hwy, fill = ~ drv,
alpha = 0.3,
rel_min_height = 0.005, data = mpg) %>%
gf_refine(scale_y_discrete(expand = c(0.01, 0)),
scale_x_continuous(expand = c(0.01, 0))) %>%
gf_labs(title = "Ridge Plot")
mpg Ridge Plots
This is another way of visualizing multiple distributions of a Quant variable at different levels of a Qual variable. We see that the distribution of hwy mileage varies substantially with drv type.
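For readers who prefer the ggplot syntax, the same ridge plot can be sketched by loading ggridges directly and using its `geom_density_ridges()` layer. This assumes both ggplot2 and ggridges are installed; the arguments mirror the ggformula call above, and the caption text is our own.

```r
library(ggplot2)   # also provides the mpg dataset
library(ggridges)  # provides geom_density_ridges()

# x = the Quant variable, y = the Qual variable that defines the ridges
p <- ggplot(mpg, aes(x = hwy, y = drv, fill = drv)) +
  geom_density_ridges(alpha = 0.3, rel_min_height = 0.005) +
  scale_y_discrete(expand = c(0.01, 0)) +
  scale_x_continuous(expand = c(0.01, 0)) +
  labs(title = "Ridge Plot", caption = "ggplot + ggridges")
p
```

Note that the formula `drv ~ hwy` in ggformula corresponds to `y = drv, x = hwy` here.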
Case Study-2:
Conclusion
- Histograms and Frequency Density plots are both used for Quantitative data variables.
- Histograms "dwell upon" counts, ranges, means and standard deviations,
- whereas Frequency Density plots "dwell upon" probabilities and densities.
- Ridge Plots are density plots used for describing one Quant and one Qual variable (by inherent splitting).
- We can split all these plots on the basis of another Qualitative variable. (Ridge Plots are already split.)
- Long-tailed distributions need care in visualization and in inference making!
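The difference between counts and probabilities can be seen in a short base R sketch: the area under a density curve between two values estimates the probability of landing in that range, and it should roughly agree with the proportion we would get by counting. This assumes the palmerpenguins package is installed; the range 3000 to 4000 g and the helper `area_between()` are arbitrary choices for illustration.

```r
# Assumption: palmerpenguins is installed
mass <- palmerpenguins::penguins$body_mass_g
mass <- as.numeric(mass[!is.na(mass)])

# Kernel density estimate on a fine grid
d <- density(mass, n = 2048)

# Area under the density curve between lo and hi (trapezoid rule)
area_between <- function(d, lo, hi) {
  keep <- d$x >= lo & d$x <= hi
  x <- d$x[keep]
  y <- d$y[keep]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# Density view: probability of a body mass between 3000 g and 4000 g
p_density <- area_between(d, 3000, 4000)

# Histogram view: share of penguins actually counted in that range
p_counts <- mean(mass >= 3000 & mass <= 4000)

c(density = p_density, counts = p_counts)
```

The two numbers will not match exactly, because the density curve smooths the data near the edges of the range.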
Your Turn
- Click on the Dataset Icon above, and unzip that archive. Try to make distribution plots with each of the three tools.
- A dataset from calmcode.io https://calmcode.io/datasets.html
- Old Faithful Data in R (Find it!)
Inspect the dataset in each case and develop a set of Questions that can be answered by appropriate stat measures, or by using a chart to show the distribution.
References
- See the scrolly animation for a histogram at this website: Exploring Histograms, an essay by Aran Lunzer and Amelia McNamara https://tinlizzie.org/histograms/?s=09
- Minimal R using mosaic. https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf
- Sebastian Sauer, Plotting multiple plots using purrr::map and ggplot
R Package Citations
Citation
@online{v.2024,
author = {V., Arvind},
title = {{Densities:} {Plotting} {Distributions}},
date = {2024-06-22},
url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/26-Densities/},
langid = {en},
abstract = {Quant and Qual Variable Graphs and their Siblings}
}