Violins: Plotting Groups and Density
Slides and Tutorials
R (Static Viz) | Radiant Tutorial | Datasets |
Setting up R Packages
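The package list itself did not survive in this section. A minimal sketch of a setup that should cover the code below, assuming only ggplot2 (for geom_violin() and the diamonds dataset), ggformula (for the gf_*() functions) and magrittr (for the %>% pipe) are needed. The theme_custom() defined here is a hypothetical stand-in for the author's own theme, which is not shown in this module.

```r
# Assumed package set for the examples in this module
library(ggplot2)    # geom_violin(), facet_wrap(), and the diamonds dataset
library(ggformula)  # gf_violin() and the other gf_*() formula-interface functions
library(magrittr)   # the %>% pipe used throughout

# Hypothetical stand-in: the author's theme_custom() is not defined in this section.
# Any complete ggplot2 theme will do so that theme_set(new = theme_custom()) runs.
theme_custom <- function() {
  theme_minimal(base_size = 12)
}
```

With these loaded, the plotting code in the sections that follow should run as written, apart from cosmetic differences due to the substituted theme.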
What graphs will we see today?
Variable #1 | Variable #2 | Chart Names |
---|---|---|
Quant | (Qual) | Violin Plot |
What kind of Data Variables will we choose?
No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
---|---|---|---|---|---|
1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
Violin Plots
Often one needs to view multiple densities at the same time. Ridge plots give us one option, where we get densities of a Quant variable split by a Qual variable. Another option is to generate a density plot faceted into small multiples using a Qual variable.
Yet another plot that allows comparison of multiple densities side by side is the violin plot. The violin plot combines aspects of a boxplot (ranking of values, median, quantiles…) with a superimposed density plot. This allows us to look at the medians, quantiles, and densities of a Quant variable with respect to a Qual variable. Let us see what this looks like!
## ggformula version
## Set graph theme (theme_custom() is the author's theme, defined outside this section)
theme_set(new = theme_custom())
## Plot A: one violin for all diamonds; the string on the right-hand side is a dummy group label
gf_violin(price ~ "All Diamonds", data = diamonds,
          draw_quantiles = c(0, .25, .50, .75)) %>%
  gf_labs(title = "Plot A: Violin plot for Diamond Prices")
### Plot B: one violin per level of cut
diamonds %>%
  gf_violin(price ~ cut,
            draw_quantiles = c(0, .25, .50, .75)) %>%
  gf_labs(title = "Plot B: Price by Cut")
### Plot C: as Plot B, with fill and outline colour mapped to cut
diamonds %>%
  gf_violin(price ~ cut,
            fill = ~ cut,
            color = ~ cut,
            alpha = 0.3,
            draw_quantiles = c(0, .25, .50, .75)) %>%
  gf_labs(title = "Plot C: Price by Cut")
### Plot D: as Plot C, faceted into small multiples by clarity
diamonds %>%
  gf_violin(price ~ cut,
            fill = ~ cut,
            color = ~ cut,
            alpha = 0.3,
            draw_quantiles = c(0, .25, .50, .75)) %>%
  gf_facet_wrap(vars(clarity)) %>%
  gf_labs(title = "Plot D: Price by Cut faceted by Clarity") %>%
  gf_theme(theme(axis.text.x = element_text(angle = 45, hjust = 1)))
## ggplot version
## Set graph theme (theme_custom() is the author's theme, defined outside this section)
theme_set(new = theme_custom())
## Plot A: one violin for all diamonds
diamonds %>% ggplot() +
  geom_violin(aes(y = price, x = ""),  # price goes on y; x is a dummy (empty) group label
              draw_quantiles = c(0, .25, .50, .75)) +
  labs(title = "Plot A: Violin plot for Diamond Prices")
### Plot B: one violin per level of cut
diamonds %>% ggplot() +
  geom_violin(aes(x = cut, y = price),
              draw_quantiles = c(0, .25, .50, .75)) +
  labs(title = "Plot B: Price by Cut")
### Plot C: as Plot B, with fill and outline colour mapped to cut
diamonds %>% ggplot() +
  geom_violin(aes(x = cut, y = price,
                  color = cut, fill = cut),
              draw_quantiles = c(0, .25, .50, .75),
              alpha = 0.4) +
  labs(title = "Plot C: Price by Cut")
### Plot D: as Plot C, faceted into small multiples by clarity
diamonds %>% ggplot() +
  geom_violin(aes(x = cut, y = price,
                  color = cut, fill = cut),
              draw_quantiles = c(0, .25, .50, .75),
              alpha = 0.4) +
  facet_wrap(vars(clarity)) +
  labs(title = "Plot D: Price by Cut faceted by Clarity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
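Since a violin plot is described above as a boxplot combined with a density, it can help to draw the two together. The following is a minimal sketch, not part of the original module, that places a narrow boxplot inside each violin; the object name p_overlay and the width value are illustrative choices.

```r
library(ggplot2)

# Violin (density shape) with a narrow boxplot (median and quartiles) drawn inside it
p_overlay <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin(aes(fill = cut), alpha = 0.3) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +  # hide outlier points to reduce clutter
  labs(title = "Price by Cut: violin with embedded boxplot")

p_overlay
```

The boxplot marks the median and quartiles explicitly, while the violin outline shows where the prices are concentrated.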
diamonds Violin Plots
The distribution of price is clearly long-tailed (right-skewed): most diamonds sit at the low end of the price range, with a thin tail stretching towards high prices. The distributions also vary considerably with both cut and clarity. These Qual variables clearly have a large effect on the prices of individual diamonds.
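One common way to handle such a long tail when plotting is a logarithmic price axis. The sketch below is an illustration under that assumption and not the author's own code; it also checks the skew numerically by comparing the mean with the median.

```r
library(ggplot2)

# Same violins as Plot C, but with price on a log10 axis so the long tail is compressed
p_log <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin(aes(fill = cut), alpha = 0.3) +
  scale_y_log10() +
  labs(title = "Price by Cut (log10 price axis)", y = "price (log10 scale)")

p_log

# For a right-skewed variable the mean is pulled above the median
price_mean <- mean(diamonds$price)
price_median <- median(diamonds$price)
price_mean > price_median
```

On the log scale the shapes of the violins become easier to compare across levels of cut, because the few very expensive diamonds no longer dominate the axis.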
Z-scores
Often, when we wish to compare distributions that have different means and standard deviations, we resort to rescaling the variables plotted in the respective distributions.
Although the densities all look the same, they are quite different! The x-axis in each case has two scales: one is the actual value of the x-variable, and the other is the z-score, which is calculated as:
\[ z_x = \frac{x - \mu_{x}}{\sigma_x} \]
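As a small worked example of this formula, the sketch below computes z-scores by hand and with base R's scale(). The vector x is made-up data, and both computations use the sample standard deviation sd() as an estimate of \(\sigma_x\).

```r
# Hypothetical data, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# The same thing using scale(), which centres and scales a numeric vector
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)

# After scaling, the variable has mean 0 and standard deviation 1
mean(z_manual)
sd(z_manual)
```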
With similar distributions (i.e. normal distributions), we see that the variation in density is the same at the same values of z-score for each variable. However, since the \(\mu_i\) and \(\sigma_i\) are different, the actual value of the variable that corresponds to a given z-score is different for each variable. In the first plot (from the top left), \(z = 1\) corresponds to an absolute change of \(5\) units; it is \(15\) units in the plot directly below it.
Our comparisons are done easily when we compare differences in probabilities at identical z-scores, or differences in z-scores at identical probabilities.
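A minimal sketch of this idea with two normal variables follows. The standard deviations 5 and 15 are taken from the text above; the means 50 and 100 are assumptions made only for this illustration.

```r
# Two hypothetical normal variables: sigma values from the text, means assumed
mu_a <- 50;  sigma_a <- 5
mu_b <- 100; sigma_b <- 15

z <- c(-2, -1, 0, 1, 2)

# Actual x-values that correspond to each z-score
x_a <- mu_a + z * sigma_a
x_b <- mu_b + z * sigma_b

# z = 1 is 5 units above the mean for variable a, and 15 units for variable b
x_a[z == 1] - mu_a
x_b[z == 1] - mu_b

# Cumulative probabilities at identical z-scores are identical
p_a <- pnorm(x_a, mean = mu_a, sd = sigma_a)
p_b <- pnorm(x_b, mean = mu_b, sd = sigma_b)
all.equal(p_a, p_b)
all.equal(p_a, pnorm(z))  # both match the standard normal
```

The same check works in the other direction: qnorm() at a fixed probability returns different x-values for the two variables, but the same z-score once each is standardized.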
Conclusion
- Histograms, Frequency Distributions, and Box Plots are used for Quantitative data variables
- Histograms "dwell upon" counts, ranges, means and standard deviations
- Frequency Density plots "dwell upon" probabilities and densities
- Box Plots "dwell upon" medians and Quartiles
- Qualitative data variables can be plotted as counts, using Bar Charts, or using Heat Maps
- Violin Plots help us to visualize multiple distributions at the same time, as when we split a Quant variable by the levels of a Qual variable.
- Ridge Plots are density plots used for describing one Quant and one Qual variable (by inherent splitting)
- We can split all these plots on the basis of another Qualitative variable. (Ridge Plots are already split.)
- Long tailed distributions need care in visualization and in inference making!
Your Turn
- Click on the Dataset Icon above, and unzip that archive. Try to make distribution plots with each of the three tools.
- A dataset from calmcode.io https://calmcode.io/datasets.html
- Old Faithful Data in R (Find it!)
inspect the dataset in each case and develop a set of Questions that can be answered by appropriate stat measures, or by using a chart to show the distribution.
References
- See the scrolly animation for a histogram at this website: Exploring Histograms, an essay by Aran Lunzer and Amelia McNamara https://tinlizzie.org/histograms/?s=09
- Minimal R using mosaic. https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf
- Sebastian Sauer, Plotting multiple plots using purrr::map and ggplot
Citation
@online{v.2022,
author = {V., Arvind},
title = {{Violins:} {Plotting} {Groups} and {Density}},
date = {2022-11-15},
url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/28-Violins/},
langid = {en},
abstract = {Quant and Qual Variable Graphs and their Siblings}
}