๐Ÿ‰ Mosaics: Plotting Categorical Data

Author

Arvind V.

Published

December 27, 2022

Modified

June 26, 2024

Abstract
Types, Categories, and Counts

Setting up R Packages

library(tidyverse)
library(mosaic) # Our trusted friend
library(skimr)
library(vcd) # Michael Friendly's package, Visualizing Categorical Data
library(vcdExtra) # Categorical Data Sets
library(ggmosaic) # Mosaic Plots
library(resampledata) # More datasets

library(GGally) # Correlation Plots
library(sjPlot) # Likert Scale Plots
library(sjlabelled) # Creating Labelled Data for Likert Plots

library(ggpubr) # Colours, Themes and new geometries in ggplot
library(ca) # Correspondence Analysis, for use some day

## Making Tables
library(kableExtra) # html styled tables
library(gt)

Introduction

To recall, a categorical variable is one for which the possible measured or assigned values consist of a discrete set of categories, which may be ordered or unordered. Some typical examples are:

  • Gender, with categories “Male,” “Female.”
  • Marital status, with categories “Never married,” “Married,” “Separated,” “Divorced,” “Widowed.”
  • Fielding position (in cricket), with categories “Slips,” “Cover,” “Mid-off,” “Deep Fine Leg,” “Close-in,” “Deep”…
  • Side effects (in a pharmacological study), with categories “None,” “Skin rash,” “Sleep disorder,” “Anxiety,” …
  • Political attitude, with categories “Left,” “Center,” “Right.”
  • Party preference (in India), with categories “BJP,” “Congress,” “AAP,” “TMC”…
  • Treatment outcome, with categories “no improvement,” “some improvement,” or “marked improvement.”
  • Age, with categories “0–9,” “10–19,” “20–29,” “30–39,” …
  • Number of children, with categories 0, 1, 2, …

As these examples suggest, categorical variables differ in the number of categories: we often distinguish binary variables (or dichotomous variables) such as Gender from those with more than two categories (called polytomous variables).

Categorical Data

From the {vcd package} vignette:

The first thing you need to know is that categorical data can be represented in three different forms in R, and it is sometimes necessary to convert from one form to another, for carrying out statistical tests, fitting models or visualizing the results.

  • Case Data
  • Frequency Data
  • Cross-Tabular Count Data

Let us first see examples of each.
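
As a minimal sketch of the three forms, the built-in HairEyeColor table can be converted from one form to another with base R alone. The object names table_form, freq_form and case_form are illustrative and do not come from the vcd vignette.

```r
# Cross-tabular (table) form: HairEyeColor is a built-in 3-way table of counts
table_form <- HairEyeColor

# Frequency form: one row per combination of categories, plus a count column (Freq)
freq_form <- as.data.frame(table_form)
head(freq_form)

# Case form: one row per individual observation, no count column.
# Each row of freq_form is repeated Freq times.
case_form <- freq_form[rep(seq_len(nrow(freq_form)), freq_form$Freq),
                       c("Hair", "Eye", "Sex")]
rownames(case_form) <- NULL
head(case_form)

# Going back: cross-tabulating the cases recovers the original table
back_to_table <- xtabs(~ Hair + Eye + Sex, data = case_form)
```

Case form is what most tidyverse tools expect, while many vcd functions work on the table form, so these conversions come up often.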

Creating Contingency Tables

Many plots for Categorical Data (as we shall see) require that the data be converted into a Contingency Table; the Statistical tests for Proportions (the \(\chi^2\) test) also need Contingency Tables. The Frequency Table we encountered earlier is very close to being a full-fledged Contingency Table; one only needs to add the margin counts! So what is a Contingency Table?

From Wolfram Alpha:

A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts. More precisely, an \(r \times c\) contingency table shows the observed frequency of two variables the observed frequencies of which are arranged into \(r\) rows and \(c\) columns. The intersection of a row and a column of a contingency table is called a cell.

In this section we will see how to make Contingency Tables from each of the three forms. We will use the vcd, mosaic and tidyverse packages for this purpose. Then we will see how the tables can be visualized.
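
Here is a hedged sketch of the idea, using only base R functions (xtabs, margin.table, addmargins) rather than the vcd, mosaic or tidyverse helpers named above. The object names are illustrative.

```r
# Collapse the 3-way HairEyeColor table over Sex to get a 2-way Hair x Eye table
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))

# Add row and column totals (the "margins") to complete the Contingency Table
hair_eye_margins <- addmargins(hair_eye)
hair_eye_margins

# From frequency-form data: put the count column on the left of the formula
freq_df <- as.data.frame(hair_eye)
from_freq <- xtabs(Freq ~ Hair + Eye, data = freq_df)

# From case-form data: no left-hand side, xtabs() counts the rows itself
case_df <- freq_df[rep(seq_len(nrow(freq_df)), freq_df$Freq), c("Hair", "Eye")]
from_cases <- xtabs(~ Hair + Eye, data = case_df)

# The chi-squared test of independence takes the table WITHOUT the margins
test_result <- chisq.test(hair_eye)
test_result
```

Note that the margins are added for display only: passing hair_eye_margins to chisq.test() would treat the totals as extra cells.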

Plots for Categorical Data

Let us now examine the various kinds of plots we can make with Categorical Data. We will start with simple Bar plots, then move to plotting entire Contingency Tables, and then look at Balloon Plots as an alternative. Finally we will take up the special case of survey data and Likert Plots.

Simple Bar Plots

We have already seen bar plots, which allow us to plot counts of categorical data. These can be used for say 2 or 3 Categorical variables, with not too many levels. However, for more complex data, if there are a large number of categorical variables, or if the categorical variables have many levels, the bar plot is not adequate.

Recollect that the bar plot computes counts to plot with.
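
A small sketch of what "computes counts" means in ggplot2, assuming the built-in HairEyeColor data. geom_bar() counts rows of case-form data, while geom_col() plots counts that are already present in frequency-form data. The data frame and plot names are illustrative.

```r
library(ggplot2)

# Frequency form: 16 Hair x Eye combinations with a Freq column
freq_df <- as.data.frame(margin.table(HairEyeColor, margin = c(1, 2)))

# Case form: one row per student
case_df <- freq_df[rep(seq_len(nrow(freq_df)), freq_df$Freq), c("Hair", "Eye")]

# geom_bar() does the counting internally (its default stat is "count")
p_counted <- ggplot(case_df, aes(x = Hair, fill = Eye)) +
  geom_bar(position = "dodge") +
  labs(title = "Counts computed by geom_bar()", y = "Count")

# geom_col() uses the counts we supply in the y aesthetic
p_precounted <- ggplot(freq_df, aes(x = Hair, y = Freq, fill = Eye)) +
  geom_col(position = "dodge") +
  labs(title = "Pre-computed counts with geom_col()", y = "Count")

p_counted
p_precounted
```

Both plots should show the same bars; the only difference is where the counting happens.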

Mosaic Plots

From Michael Friendly:

The familiar techniques for displaying raw data are often disappointing when applied to categorical data. The simple scatterplot, for example, widely used to show the relation between quantitative response and predictors, when applied to discrete variables, gives a display of the category combinations, with all identical values overplotted, and no representation of their frequency. (AV: Scatter plots do not do counting internally!)

Instead, frequencies of categorical variables are often best represented graphically using areas rather than as position along a scale. Using the visual attribute:

\[\pmb{area \sim frequency}\]

allows creating novel graphical displays of frequency data for special circumstances.

Let us now look at some sample plots that embody this area-frequency principle. A mosaic plot is basically an area-proportional visualization of (typically observed) frequencies (i.e., counts), consisting of tiles (corresponding to the cells) created by recursively splitting a rectangle vertically and horizontally. Thus, the area of each tile is proportional to the corresponding cell entry, given the dimensions of previous splits.
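
The recursive splitting can be sketched numerically for a two-way table: the first split gives each tile a width equal to a marginal proportion, the second gives it a height equal to a conditional proportion, and their product is the cell's share of the total. The sketch below assumes the built-in HairEyeColor data and uses base R's mosaicplot() for the drawing, not the vcd or ggmosaic functions; object names are illustrative.

```r
# Two-way Hair x Eye table, collapsing the built-in HairEyeColor data over Sex
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))

# First split: tile widths are the marginal proportions of Hair
widths <- prop.table(margin.table(hair_eye, margin = 1))

# Second split: within each Hair category, tile heights are the
# conditional proportions of Eye given Hair (each row sums to 1)
heights <- prop.table(hair_eye, margin = 1)

# Area of each tile = width x height = joint proportion of that cell
areas <- heights * as.vector(widths)
all.equal(areas, prop.table(hair_eye))

# Base R draws the corresponding mosaic
mosaicplot(hair_eye, main = "Hair and Eye Colour", color = TRUE)
```

With more than two variables the same step repeats: each new split divides the existing tiles by the conditional proportions of the next variable.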

Balloon Plots

There is another visualization of Categorical Data, called a Balloon Plot. We will use the housetasks dataset from the package ggpubr. This data is already in Contingency Table form (without the margin totals)!

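# Set graph theme
# theme_custom() is not defined in this section; it is assumed to be a
# user-defined ggplot2 theme. Substitute e.g. theme_minimal() if it is missing.
theme_set(new = theme_custom())

# Read the housetasks contingency table that ships with ggpubr
housetasks <- read.delim(
  system.file("demo-data/housetasks.txt", package = "ggpubr"),
  row.names = 1
)
head(housetasks, 4)

# Balloon size (by default) and fill colour both encode the cell counts.
# Optionally pass ggtheme = theme_pubr() to ggballoonplot().
ggpubr::ggballoonplot(housetasks, fill = "value") +
  scale_fill_viridis_c(option = "C") +
  labs(title = "A Balloon Plot for Categorical Data")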

And repeat with the familiar HairEyeColor dataset:

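# Set graph theme
# (theme_custom() is assumed to be a user-defined theme; it is not defined in this section)
theme_set(new = theme_custom())

# Convert the 3-way table to frequency form: columns Hair, Eye, Sex and n
df <- as_tibble(HairEyeColor)
df

# Balloon plot of Hair vs Eye. df still has one row per Sex, so the
# Male and Female balloons are drawn on top of each other here.
ggballoonplot(df, x = "Hair", y = "Eye", size = "n", fill = "n") +
  scale_fill_viridis_c(option = "C") +
  labs(title = "Balloon Plot")

# Balloon Plot with facetting by Sex, which separates the two groups
ggballoonplot(df, x = "Hair", y = "Eye", size = "n",
              fill = "n", facet.by = "Sex") +
  scale_fill_viridis_c(option = "C") +
  labs(title = "Balloon Plot with Facetting",
       subtitle = "Hair and Eye Color")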

Note the somewhat different syntax with ggballoonplot: the variable names are enclosed in quotes.

Balloon Plots work because they lay out the category combinations on a grid (the x and y positions) and use the size and colour aesthetics to represent the counts.
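
The same encoding can be sketched in plain ggplot2, without ggballoonplot: the categories go on the x and y axes, and the count drives both point size and fill. This is an illustrative approximation under those assumptions, not how ggpubr implements the plot; the names counts and p_balloon are made up for the example.

```r
library(ggplot2)

# Frequency form of the Hair x Eye table (collapsed over Sex): 16 rows
counts <- as.data.frame(margin.table(HairEyeColor, margin = c(1, 2)))

# One point per cell; size and fill both encode the count
p_balloon <- ggplot(counts, aes(x = Hair, y = Eye)) +
  geom_point(aes(size = Freq, fill = Freq), shape = 21, colour = "black") +
  scale_size_area(max_size = 15) +   # point AREA proportional to count
  scale_fill_viridis_c(option = "C") +
  labs(title = "A hand-built Balloon Plot", size = "Count", fill = "Count")

p_balloon
```

Unlike ggballoonplot, the variable names here are bare (unquoted) inside aes(), which is the usual ggplot2 convention.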

Conclusion

How are the bar plots for categorical data different from histograms? Why don’t “regular” scatter plots simply work for Categorical data? Discuss!

There are quite a few things we can do with Qualitative/Categorical data:

  1. Make simple bar charts with colours and facetting
  2. Make Contingency Tables for a \(\chi^2\)-test
  3. Make Mosaic Plots to show how the categories stack up
  4. Make Balloon Charts as an alternative
  5. Make Likert Charts for Survey Questionnaire Data
  6. Then, draw your inferences and tell the story!

Your Turn

  1. Take some of the categorical datasets from the vcd and vcdExtra packages and recreate the plots from this module. Go to https://vincentarelbundock.github.io/Rdatasets/articles/data.html and type “vcd” in the search box. You can directly load CSV files from there, using read_csv("url-to-csv").

References

  1. Nice Chi-square interactive story at https://statisticalstories.xyz/chi-square

  2. Mine Cetinkaya-Rundel and Johanna Hardin. An Introduction to Modern Statistics, Chapter 4. https://openintro-ims.netlify.app/explore-categorical.html

  3. Using the strucplot command from vcd, https://cran.r-project.org/web/packages/vcd/vignettes/strucplot.pdf

  4. Creating Frequency Tables with vcd, https://cran.r-project.org/web/packages/vcdExtra/vignettes/A_creating.html

  5. Creating mosaic plots with vcd, https://cran.r-project.org/web/packages/vcdExtra/vignettes/D_mosaics.html

  6. Michael Friendly, Corrgrams: Exploratory displays for correlation matrices. The American Statistician August 19, 2002 (v1.5). https://www.datavis.ca/papers/corrgram.pdf

  7. Visualizing Categorical Data in R

  8. H. Riedwyl & M. Schüpbach (1994), Parquet diagram to plot contingency tables. In F. Faulbaum (ed.), Softstat ’93: Advances in Statistical Software, 293–299. Gustav Fischer, New York.

R Package Citations
Package Version Citation
ggmosaic 0.3.3 Jeppson, Hofmann, and Cook (2021)
ggpubr 0.6.0 Kassambara (2023)
janitor 2.2.0 Firke (2023)
kableExtra 1.4.0 Zhu (2024)
resampledata 0.3.1 Chihara and Hesterberg (2018)
sjlabelled 1.2.0 Lüdecke (2022)
sjPlot 2.8.16 Lüdecke (2024)
vcd 1.4.12 Meyer, Zeileis, and Hornik (2006); Zeileis, Meyer, and Hornik (2007); Meyer et al. (2023)
vcdExtra 0.8.5 Friendly (2023)
Chihara, Laura M., and Tim C. Hesterberg. 2018. Mathematical Statistics with Resampling and R. 2nd ed. Hoboken, NJ: John Wiley & Sons. https://sites.google.com/site/chiharahesterberg/home.
Firke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Friendly, Michael. 2023. vcdExtra: “vcd” Extensions and Additions. https://CRAN.R-project.org/package=vcdExtra.
Jeppson, Haley, Heike Hofmann, and Di Cook. 2021. ggmosaic: Mosaic Plots in the “ggplot2” Framework. https://CRAN.R-project.org/package=ggmosaic.
Kassambara, Alboukadel. 2023. ggpubr: “ggplot2” Based Publication Ready Plots. https://CRAN.R-project.org/package=ggpubr.
Lüdecke, Daniel. 2022. sjlabelled: Labelled Data Utility Functions (Version 1.2.0). https://doi.org/10.5281/zenodo.1249215.
Lüdecke, Daniel. 2024. sjPlot: Data Visualization for Statistics in Social Science. https://CRAN.R-project.org/package=sjPlot.
Meyer, David, Achim Zeileis, and Kurt Hornik. 2006. “The Strucplot Framework: Visualizing Multi-Way Contingency Tables with Vcd.” Journal of Statistical Software 17 (3): 1–48. https://doi.org/10.18637/jss.v017.i03.
Meyer, David, Achim Zeileis, Kurt Hornik, and Michael Friendly. 2023. vcd: Visualizing Categorical Data. https://CRAN.R-project.org/package=vcd.
Zeileis, Achim, David Meyer, and Kurt Hornik. 2007. “Residual-Based Shadings for Visualizing (Conditional) Independence.” Journal of Computational and Graphical Statistics 16 (3): 507–25. https://doi.org/10.1198/106186007X237856.
Zhu, Hao. 2024. kableExtra: Construct Complex Table with “kable” and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

Footnotes

  1. https://tidyr.tidyverse.org/articles/pivot.html

Citation

BibTeX citation:
@online{v.2022,
  author = {V., Arvind},
  title = {{Mosaics:} {Plotting} {Categorical} {Data}},
  date = {2022-12-27},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/40-CatData},
  langid = {en},
  abstract = {Types, Categories, and Counts}
}
For attribution, please cite this work as:
V., Arvind. 2022. “Mosaics: Plotting Categorical Data.” December 27, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/40-CatData.