library(tidyverse)
library(mosaic) # Our trusted friend
library(skimr)
library(vcd) # Michael Friendly's package, Visualizing Categorical Data
library(vcdExtra) # Categorical Data Sets
library(ggmosaic) # Mosaic Plots
library(resampledata) # More datasets
library(GGally) # Correlation Plots
library(sjPlot) # Likert Scale Plots
library(sjlabelled) # Creating Labelled Data for Likert Plots
library(ggpubr) # Colours, Themes and new geometries in ggplot
library(ca) # Correspondence Analysis, for use some day
## Making Tables
library(kableExtra) # html styled tables
library(gt)
๐ Mosaics: Plotting Categorical Data
Setting up R Packages
Introduction
To recall, a categorical variable is one for which the possible measured or assigned values consist of a discrete set of categories, which may be ordered
or unordered
. Some typical examples are:
- Gender, with categories โMale,โ โFemale.โ
- Marital status, with categories โNever married,โ โMarried,โ โSeparated,โ โDivorced,โ โWidowed.โ
- Fielding position (in
baseballcricket), with categories โSlips,โCover โ,โMid-off โDeep Fine Legโ, โClose-inโ, โDeepโโฆ - Side effects (in a pharmacological study), with categories โNone,โ โSkin rash,โ โSleep disorder,โ โAnxiety,โ . . ..
- Political attitude, with categories โLeft,โ โCenter,โ โRight.โ
- Party preference (in India), with categories โBJPโ โCongress,โ โAAP,โ โTMCโโฆ
- Treatment outcome, with categories โno improvement,โ โsome improvement,โ or โmarked improvement.โ
- Age, with categories โ0โ9,โ โ10โ19,โ โ20โ29,โ โ30โ39,โ . . . .
- Number of children, with categories 0, 1, 2, . . . .
As these examples suggest, categorical variables differ in the number of categories: we often distinguish binary variables (or dichotomous variables) such as Gender from those with more than two categories (called polytomous variables).
Categorical Data
From the {vcd package}
vignette:
The first thing you need to know is that categorical data can be represented in three different forms in R, and it is sometimes necessary to convert from one form to another, for carrying out statistical tests, fitting models or visualizing the results.
- Case Data
- Frequency Data
- Cross-Tabular Count Data
Let us first see examples of each.
From Michael Friendly Discrete Data Analysis and Visualization :
In many circumstances, data is recorded on each individual or experimental unit. Data in this form is called case data, or data in case form. Containing individual observations with one or more categorical factors, used as classifying variables. The total number of observations is
nrow(X)
, and the number of variables isncol(X)
.
class(Arthritis)
# Tibble as HTML for presentation
Arthritis %>%
head(10) %>%
kbl(caption = "Case Form: Arthritis Treatments and Effects<br> First 10 Observations", centering = TRUE) %>%
kable_classic_2(html_font = "Cambria", full_width = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
[1] "data.frame"
ID | Treatment | Sex | Age | Improved |
---|---|---|---|---|
57 | Treated | Male | 27 | Some |
46 | Treated | Male | 29 | None |
77 | Treated | Male | 30 | None |
17 | Treated | Male | 32 | Marked |
36 | Treated | Male | 46 | Marked |
23 | Treated | Male | 58 | Marked |
75 | Treated | Male | 59 | None |
39 | Treated | Male | 59 | Marked |
33 | Treated | Male | 63 | None |
55 | Treated | Male | 63 | None |
The Arthritis
data set has three factors and two integer variables. One of the three factors Improved
is an ordered factor.
- ID
- Treatment: a factor; Placebo or Treated
- Sex: a factor, M / F
- Age: integer
- Improved: Ordinal factor; None < Some < Marked
Each row in the Arthritis
dataset is a separate case or observation.
Data in frequency form has already been tabulated and aggregated by counting over the (combinations of) categories of the table variables. When the data are in case form, we can always trace any observation back to its individual identifier or data record, since each row is a unique observation or case; the reverse, with the Frequency Form is rarely possible.
Frequency Data is usually a data frame, with columns of categorical variables and at least one column containing frequency
or count
information.
str(GSS)
'data.frame': 6 obs. of 3 variables:
$ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2
$ party: Factor w/ 3 levels "dem","indep",..: 1 1 2 2 3 3
$ count: num 279 165 73 47 225 191
# Tibble as HTML for presentation
GSS %>%
kbl(caption = "General Social Survey",centering = TRUE) %>%
kable_classic_2(html_font = "Cambria", full_width = F) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
sex | party | count |
---|---|---|
female | dem | 279 |
male | dem | 165 |
female | indep | 73 |
male | indep | 47 |
female | rep | 225 |
male | rep | 191 |
Respondents in the GSS survey were classified by sex
and party
identification. As can be seen, there is a count for every combination of the two categorical variables, sex
and party
.
Table Form Data can be a matrix
, array
or table object
, whose elements are the frequencies in an n-way table. The variable names (factors) and their levels are given by dimnames(X)
.
HairEyeColor
, , Sex = Male
Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
, , Sex = Female
Eye
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
class(HairEyeColor)
[1] "table"
HairEyeColor
is a โtwo-wayโ table, consisting of two tables, one for Sex = Female
and the other for Sex = Male
. The total number of observations is sum(X)
. The number of dimensions of the table is length(dimnames(X))
, and the table sizes are given by sapply(dimnames(X), length)
. The data looks like a n-dimensional cube and needs n-way tables to represent.
sum(HairEyeColor)
[1] 592
dimnames(HairEyeColor)
$Hair
[1] "Black" "Brown" "Red" "Blond"
$Eye
[1] "Brown" "Blue" "Hazel" "Green"
$Sex
[1] "Male" "Female"
Hair Eye Sex
4 4 2
A good way to think of tabular data is to think of a Rubikโs Cube.
Each of the edges is an Ordinal Variable, each segment represents a level in the variable. So each face of the Cube represents two ordinal variables. Any segment is at the intersection of two (independent) levels of two variables, and the colour may be visualized as a count. This array of counts on a face is a 2D or 2-Way Table. ( More on this later )
Since we can only print 2D tables, we hold one face in front and the image we see is a 2-Way Table. Turning the Cube by 90 degrees gives us another face with 2 variables, with one variable in common with the previous face. If we consider two faces together, we get two 2-way tables, effectively allowing us to contemplate 3 categorical variables.
Multiple 2-Way tables can be flattened into a single long table that contains all counts for all combinations of categorical variables. This can be visualized as โopening upโ and laying flat the Rubikโs cube, as with a cardboard model of it.
ftable(HairEyeColor)
Sex Male Female
Hair Eye
Black Brown 32 36
Blue 11 9
Hazel 10 5
Green 3 2
Brown Brown 53 66
Blue 50 34
Hazel 25 29
Green 15 14
Red Brown 10 16
Blue 10 7
Hazel 7 7
Green 7 7
Blond Brown 3 4
Blue 30 64
Hazel 5 5
Green 8 8
Finally, we may need to convert the (multiple) tables into a data frame or tibble:
## Convert the two tables into a data frame
HairEyeColor %>%
as_tibble()
# Tibble as HTML for presentation
HairEyeColor %>%
as_tibble() %>% # Convert
kbl(caption = "Hair Eye and Color") %>%
kable_classic_2(html_font = "Cambria", full_width = F) %>%
kable_styling(bootstrap_options = c("hover", "striped", "responsive"))
Hair | Eye | Sex | n |
---|---|---|---|
Black | Brown | Male | 32 |
Brown | Brown | Male | 53 |
Red | Brown | Male | 10 |
Blond | Brown | Male | 3 |
Black | Blue | Male | 11 |
Brown | Blue | Male | 50 |
Red | Blue | Male | 10 |
Blond | Blue | Male | 30 |
Black | Hazel | Male | 10 |
Brown | Hazel | Male | 25 |
Red | Hazel | Male | 7 |
Blond | Hazel | Male | 5 |
Black | Green | Male | 3 |
Brown | Green | Male | 15 |
Red | Green | Male | 7 |
Blond | Green | Male | 8 |
Black | Brown | Female | 36 |
Brown | Brown | Female | 66 |
Red | Brown | Female | 16 |
Blond | Brown | Female | 4 |
Black | Blue | Female | 9 |
Brown | Blue | Female | 34 |
Red | Blue | Female | 7 |
Blond | Blue | Female | 64 |
Black | Hazel | Female | 5 |
Brown | Hazel | Female | 29 |
Red | Hazel | Female | 7 |
Blond | Hazel | Female | 5 |
Black | Green | Female | 2 |
Brown | Green | Female | 14 |
Red | Green | Female | 7 |
Blond | Green | Female | 8 |
Creating Contingency Tables
Many plots for Categorical Data ( as we shall see ) require that the data be converted into a Contingency Table ; the Statistical tests for Proportions ( the \(\chi^2\) test ) also needs Contingency Tables. The Frequency Table we encountered earlier is very close to being a full-fledged Contingency Table; one only needs to add the margin counts! So what is a Contingency Table?
From Wolfram Alpha:
A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts. More precisely, an \(r \times c\) contingency table shows the observed frequency of two variables the observed frequencies of which are arranged into \(r\) rows and \(c\) columns. The intersection of a row and a column of a contingency table is called a cell.
In this section we understand how to make Contingency Tables from each of the three forms. We will use vcd
, mosaic
and the tidyverse
packages for our purposes. Then we will see how they can be visualized.
# One Way Table ( one variable )
table(Arthritis$Treatment) # Contingency Table
Placebo Treated
43 41
# 1-way Contingency Table
table(Arthritis$Treatment) %>% addmargins() # Contingency Table with margins
Placebo Treated Sum
43 41 84
# 2-Way Contingency Tables
# Choosing Treatment and Improved
table(Arthritis$Treatment, Arthritis$Improved) %>% addmargins()
None Some Marked Sum
Placebo 29 7 7 43
Treated 13 7 21 41
Sum 42 14 28 84
# Choosing Treatment and Sex
table(Arthritis$Sex, Arthritis$Improved) %>% addmargins()
None Some Marked Sum
Female 25 12 22 59
Male 17 2 6 25
Sum 42 14 28 84
We can use table()
( and also xtabs()
) to generate multi-dimensional tables too (More than 2-way) These will be printed out as a series of 2D tables, one for each value/level of the โthirdโ parameter. We can then flatten this set of tables using ftable() and add margins to convert into a Contingency Table:
my_arth_table <- table(Arthritis$Treatment, Arthritis$Sex, Arthritis$Improved)
my_arth_table
, , = None
Female Male
Placebo 19 10
Treated 6 7
, , = Some
Female Male
Placebo 7 0
Treated 5 2
, , = Marked
Female Male
Placebo 6 1
Treated 16 5
# Now flatten
ftable(my_arth_table)
None Some Marked
Placebo Female 19 7 6
Male 10 0 1
Treated Female 6 5 16
Male 7 2 5
ftable(my_arth_table) %>% addmargins()
Sum
19 7 6 32
10 0 1 11
6 5 16 27
7 2 5 14
Sum 42 14 28 84
A bit strange that the column labels disappear in the ftable
when margins are addedโฆmaybe need to investigate the FUN
argument to add_margins()
.
The vcd
( Visualize Categorical Data ) package by Michael Friendly has a convenient function to create Contingency Tables: structable()
; this function produces a โflatโ representation of a high-dimensional contingency table constructed by recursive splits (similar to the construction of mosaic charts/graphs). structable
tends to render flat tables, of the kind that can be thought of as a โtext representationโ of the vcd::mosaic
plot:
The arguments of structable
are:
- a formula \(y + p \sim x + z\) which shows which variables are to be included as columns and rows respectively on a table;
- a
data
argument, which can indicate adata frame
from where the variables are drawn.
# Three Way!!
arth_vcd <- vcd::structable(data = Arthritis, Treatment ~ Improved + Sex)
arth_vcd
class(arth_vcd)
Treatment Placebo Treated
Improved Sex
None Female 19 6
Male 10 7
Some Female 7 5
Male 0 2
Marked Female 6 16
Male 1 5
[1] "structable" "ftable"
# With Margins
arth_vcd %>% as.matrix() %>% addmargins()
Treatment
Improved_Sex Placebo Treated Sum
None_Female 19 6 25
None_Male 10 7 17
Some_Female 7 5 12
Some_Male 0 2 2
Marked_Female 6 16 22
Marked_Male 1 5 6
Sum 43 41 84
# HairEyeColor is in multiple table form
HairEyeColor
# structable flattens these into one, as for a mosaic chart
vcd::structable(HairEyeColor)
# As tibble
vcd::structable(HairEyeColor) %>% as_tibble()
, , Sex = Male
Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
, , Sex = Female
Eye
Hair Brown Blue Hazel Green
Black 36 9 5 2
Brown 66 34 29 14
Red 16 7 7 7
Blond 4 64 5 8
Eye Brown Blue Hazel Green
Hair Sex
Black Male 32 11 10 3
Female 36 9 5 2
Brown Male 53 50 25 15
Female 66 34 29 14
Red Male 10 10 7 7
Female 16 7 7 7
Blond Male 3 30 5 8
Female 4 64 5 8
UCBAdmissions
is already in Frequency Form i.e. a Contingency Table. But it is a set of (two-way) Contingency Tables:
UCBAdmissions
###
vcd::structable(UCBAdmissions)
###
structable(UCBAdmissions) %>% as.matrix() %>% addmargins()
, , Dept = A
Gender
Admit Male Female
Admitted 512 89
Rejected 313 19
, , Dept = B
Gender
Admit Male Female
Admitted 353 17
Rejected 207 8
, , Dept = C
Gender
Admit Male Female
Admitted 120 202
Rejected 205 391
, , Dept = D
Gender
Admit Male Female
Admitted 138 131
Rejected 279 244
, , Dept = E
Gender
Admit Male Female
Admitted 53 94
Rejected 138 299
, , Dept = F
Gender
Admit Male Female
Admitted 22 24
Rejected 351 317
Gender Male Female
Admit Dept
Admitted A 512 89
B 353 17
C 120 202
D 138 131
E 53 94
F 22 24
Rejected A 313 19
B 207 8
C 205 391
D 279 244
E 138 299
F 351 317
Gender
Admit_Dept Male Female Sum
Admitted_A 512 89 601
Admitted_B 353 17 370
Admitted_C 120 202 322
Admitted_D 138 131 269
Admitted_E 53 94 147
Admitted_F 22 24 46
Rejected_A 313 19 332
Rejected_B 207 8 215
Rejected_C 205 391 596
Rejected_D 279 244 523
Rejected_E 138 299 437
Rejected_F 351 317 668
Sum 2691 1835 4526
Note that structable
does not permit the adding of margins directly; it needs to be converted to a matrix for addmargins()
to do its work.
I think this is the simplest and most elegant way of obtaining Contingency Tables:
# One Way Table
mosaicCore::tally( ~ substance, data = HELPrct, margins = TRUE)
# Two-Way Tables
# Two ways of producing the same result
tally( sex ~ substance, data = HELPrct, margins = TRUE)
tally(~ sex | substance, data = HELPrct, margins = TRUE)
substance
alcohol cocaine heroin Total
177 152 124 453
substance
sex alcohol cocaine heroin
female 36 41 30
male 141 111 94
Total 177 152 124
substance
sex alcohol cocaine heroin
female 36 41 30
male 141 111 94
Total 177 152 124
So far these packages give Contingency Tables that are easy to see for humans; some of these structures are also capable being passed directly to commands such as stats::chisq.test()
or janitor::chisq.test()
.
Often we need Contingency Tables that are in tibble
form, and we need to perform some data processing using dplyr
to get there. Doing this with the tidyverse
set of packages may seem counter-intuitive and long-winded, but the workflow is easily understandable.
First we develop the counts:
diamonds %>% count(cut)
diamonds %>% count(clarity)
diamonds %>%
group_by(cut, clarity) %>%
dplyr::summarise( count = n())
We need to have the individual levels of cut
as rows and the individual levels of clarity
as columns. This means that we need to pivot this from โlong to wideโ1 to obtain a Contingency Table:
diamonds %>%
group_by(cut, clarity) %>%
dplyr::summarise( count = n()) %>%
pivot_wider(id_cols = cut,
names_from = clarity,
values_from = count) %>%
# Now add the row and column totals using the `janitor` package
janitor::adorn_totals(where = c("row", "col")) %>%
# Recover to tibble since janitor gives a "tabyl" format
# ( which can be useful too !)
as_tibble()
### Another Way
diamonds %>%
group_by(cut, clarity) %>%
dplyr::summarise( count = n()) %>%
pivot_wider(id_cols = cut,
names_from = clarity,
values_from = count) %>%
# Now add the row and column totals using the `dplyr` package
# From: https://stackoverflow.com/a/67885521
mutate("row_totals" = sum(across(where(is.integer)))) %>%
ungroup() %>%
add_row(cut = "col_total", summarize(., across(where(is.integer), sum)))
Plots for Categorical Data
Let us now examine the various kinds of plots we can make with Categorical Data. We will start with simple Bar plots, then move to plotting entire Contingency Tables, and then look Balloon Plots as an alternative. Finally we will look at a special case of survey data and look at Likert Plots.
Simple Bar Plots
We have already seen bar plots, which allow us to plot counts of categorical data. These can be used for say 2 or 3 Categorical variables, with not too many levels. However, for more complex data, if there are a large number of categorical variables, or if the categorical variables have many levels, the bar plot is not adequate.
Recollect that the bar plot computes counts to plot with.
Mosaic Plots
From Michael Friendly:
The familiar techniques for displaying raw data are often disappointing when applied to categorical data. The simple scatterplot, for example, widely used to show the relation between quantitative response and predictors, when applied to discrete variables, gives a display of the category combinations, with all identical values overplotted, and no representation of their frequency. (AV: Scatter plots do not do counting internally!)
Instead, frequencies of categorical variables are often best represented graphically using areas rather than as position along a scale. Using the visual attribute:
\[\pmb{area \sim frequency}\]
allows creating novel graphical displays of frequency data for special circumstances.
Let us not look at some sample plots that embody this area-frequency principle. A mosaic plot is basically an area-proportional visualization of (typically observed) frequencies (i.e counts), consisting of tiles (corresponding to the cells) created by recursively splitting a rectangle vertically and horizontally. Thus, the area of each tile is proportional to the corresponding cell entry given the dimensions of previous splits.
The vcd::mosaic()
function needs the data in contingency table form. We will use our vcd::structable()
function to construct one:
arthritis_table <- vcd::structable(~ Treatment + Improved,
data = Arthritis)
arthritis_table
vcd::mosaic(arthritis_table,
gp = shading_max,
main = "Arthritis Treatment Dataset")
data("GSS2002", package = "resampledata")
gss2002 <- GSS2002 %>%
# select two categorical variables from the dataset
dplyr::select(Education, DeathPenalty) %>%
drop_na(Education, DeathPenalty)
gss2002
# make a tally table
gss_table <- mosaic::tally(DeathPenalty ~ Education, data = gss2002)
gss_table %>% addmargins()
Education
DeathPenalty Left HS HS Jr Col Bachelors Graduate Sum
Favor 117 511 71 135 64 898
Oppose 72 200 16 71 50 409
Sum 189 711 87 206 114 1307
# gss_table is *not* a tibble, but a *table* object.
vcd::mosaic(gss_table, gp = shading_hsv)
The mosaic chart provides more additional information than does the corresponding bar chart. The individual tiles are coloured based on the value of the Pearson Residual (explained below). The Pearson residual defines the (scaled) difference between the actual count in the cell, with what might be expected if there was no correlation between the two Qual variables.
An uncoloured Mosaic Chart indicates that the residual differences are small and that there is no correlation.
The mosaic chart is a visualization of the obtained count on which the tile is constructed.
It is also possible to compute a per-cell expected count, if the categorical variables are independent, that is, not correlated. The null hypothesis is that the variables are independent. The test for independence, as any inferential test, is based on comparing the observed counts with the expected counts, under the null hypothesis. So, what might the expected frequency of a cell be in cross-tabulation table for cell \(i,j\) given no relationship between the variables of interest?
Represent the sum of row \(i\) with \(n_{+i}\), the sum of column \(j\) with \(n_{j+}\), and the grand total of all the observations with \(n\). And independence of variables means that their joint probability is the product of their probabilities. Therefore, the Expected Cell Frequency/Count is given by:
\[ \begin{array}{lcl} ~Expected~Count~ e_{i,j} & = & P(Row~and~Column) * n\\ & = & P(row) \times P(column) * n\\ & = & \frac{rowSum}{n} *\frac{colSum}{n} \times n\\ & = &\frac{rowSum *colSum}{n}\\ \end{array} \]
and therefore:
\[ e_{i,j} = \frac{(n_{+i})(n_{j+})}{n} \]
Now, the Pearson Residual in each cell is equivalent to the โz-scoreโ of that cell.
Recall the z-score idea: we subtract the mean and divide by the std. deviation to get the z-score. In the Contingency Table, we have counts which are usually modeled as an (integer) Poisson distribution, for which mean (i.e Expected value) and variance are identical. Thus we get the Pearson Residual as:
\[ r_{i,j} = \frac{(Actual - Expected)}{\sqrt{\displaystyle Expected}} \] and therefore:
\[ r_{i,j} = \frac{(o_{i,j}- e_{i,j})}{\sqrt{\displaystyle e_{i,j}}} \]
The comparison of what occurred to what is expected is based on their difference, scaled by the square root of the expected, the Pearson Residual.
The sum of all the squared Pearson residuals is the chi-square statistic, ฯ2, upon which the inferential analysis follows. \[ ฯ2 = \sum_{i=1}^R\sum_{j=1}^C{r_{i,j}^2} \] where R and C are number of rows and columns in the Contingency Table, the levels in the two Qual variables.
We will treat the \(X^2\) test in the Module on Inference for Two Proportions.. See also the sub-section here below on Actual and Expected Contingency Tables.
This is perhaps the simplest way, but does use a different package and also does not use the formula notation: there is no gf_mosaic
command yet!
ggmosaic
takes a tibble with Qualitative variables, internally computes the counts/table, and plots the mosaic plot:
#library(ggmosaic)
#
# Set graph theme
theme_set(new = theme_custom())
gss2002
ggplot(data = gss2002) +
ggmosaic::geom_mosaic(aes(x = product(DeathPenalty, Education),
fill = DeathPenalty))
This needs quite some work, to convert the Contingency Table into a mosaic plot; perhaps not the most intuitive of methods either. This code has been developed using this Stackoverflow post.
# Reference
# https://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2
# Set graph theme
theme_set(new = theme_custom())
gss_summary <- gss2002 %>%
dplyr::group_by(Education, DeathPenalty) %>%
dplyr::summarise(count = n()) %>% # This is good for a chisq test
# Data is still grouped by `Education`
# Add two more columns to facilitate mosaic Plot
# These two columns are quite unusual...
mutate(edu_count = sum(count),
edu_prop = count / sum(count)) %>%
ungroup()
gss_summary
# This works but is not very intuitive...
gf_col(edu_prop ~ Education, data = gss_summary,
width = ~ edu_count, # Not typically used in a column chart
fill = ~ DeathPenalty,
stat = "identity",
position = "fill",
color = "black") %>%
gf_text(edu_prop ~ Education,
label = ~ scales::percent(edu_prop),
position = position_stack(vjust = 0.5)) %>%
gf_facet_grid(~ Education,
scales = "free_x",
space = "free_x") %>%
gf_theme(scale_fill_manual(values = c("orangered", "palegreen3")))
We briefly. discussed this idea in the vcd
sub-section above.
Contingency Tables reveal relationships between Qualitative variables, by examining counts of observations for each combination of the levels of the variables. Two Qual variables with \(m\) and \(n\) levels will give us a table of size \(m \times n\).
In the module on statistical inference for Qualitative data we will see how we can create not only actual Contingency Tables ( as above) but that which would have existed if the two Qualitative variables had no relationship whatsoever. This is called the โExpected Contingency Tableโ.
We can calculate and plot both of these with the vcd
package, as shown below:
vcd::mosaic(~ Sex + Eye + Hair, data = HairEyeColor)
vcd::mosaic(~ Sex + Eye + Hair, data = HairEyeColor, type = "expected")
From an inspection of these paired plots, we see the difference between situatlons when Qualitative variables are not related to that when they are related.
Balloon Plots
There is another visualization of Categorical Data, called a Balloon Plot. We will use the housetasks
dataset from the package ggpubr
. This data is already in Contingency Table form (without the margin totals)!
# Set graph theme
theme_set(new = theme_custom())
housetasks <- read.delim(
system.file("demo-data/housetasks.txt",
package = "ggpubr"), row.names = 1)
head(housetasks, 4)
ggpubr::ggballoonplot(housetasks, fill = "value",
#ggtheme = theme_pubr()
) +
scale_fill_viridis_c(option = "C") +
labs(title = "A Balloon Plot for Categorical Data")
And repeat with the familiar HairEyeColor
dataset:
# Set graph theme
theme_set(new = theme_custom())
df <- as_tibble(HairEyeColor)
df
ggballoonplot(df, x = "Hair", y = "Eye", size = "n",
fill = "n",
#ggtheme = theme_pubr()
) +
scale_fill_viridis_c(option = "C") +
labs(title = "Balloon Plot")
# Balloon Plot with facetting
ggballoonplot(df, x = "Hair", y = "Eye", size = "n",
fill = "n", facet.by = "Sex",
#ggtheme = theme_pubr()
) +
scale_fill_viridis_c(option = "C") +
labs(title = "Balloon Plot with Facetting",
subtitle = "Hair and Eye Color")
Note the somewhat different syntax with ggballoonplot
: the variable names are enclosed in quotes.
Balloon Plots work because they use color and size aesthetics to represent categories and counts respectively.
Conclusion
How are the bar plots for categorical data different from histograms? Why donโt โregularโ scatter plots simply work for Categorical data? Discuss!
There are quite a few things we can do with Qualitative/Categorical data:
- Make simple bar charts with colours and facetting
- Make Contingency Tables for a \(X^2\)-test
- Make Mosaic Plots to show how the categories stack up
- Make Balloon Charts as an alternative
- Make Likert Charts for Survey Questionnaire Data
- Then, draw your inferences and tell the story!
Your Turn
- Take some of the categorical datasets from the
vcd
andvcdExtra
packages and recreate the plots from this module. Go to https://vincentarelbundock.github.io/Rdatasets/articles/data.html and type โvcdโ in thesearch
box. You can directly load CSV files from there, usingread_csv("url-to-csv")
.
References
Nice Chi-square interactive story at https://statisticalstories.xyz/chi-square
Mine Cetinkaya-Rundel and Johanna Hardin. An Introduction to Modern Statistics, Chapter 4. https://openintro-ims.netlify.app/explore-categorical.html
Using the
strcplot
command fromvcd
, https://cran.r-project.org/web/packages/vcd/vignettes/strucplot.pdfCreating Frequency Tables with
vcd
, https://cran.r-project.org/web/packages/vcdExtra/vignettes/A_creating.htmlCreating mosaic plots with
vcd
, https://cran.r-project.org/web/packages/vcdExtra/vignettes/D_mosaics.htmlMichael Friendly, Corrgrams: Exploratory displays for correlation matrices. The American Statistician August 19, 2002 (v1.5). https://www.datavis.ca/papers/corrgram.pdf
H. Riedwyl & M. Schรผpbach (1994), Parquet diagram to plot contingency tables. In F. Faulbaum (ed.), Softstat โ93: Advances in Statistical Software, 293โ299. Gustav Fischer, New York.
R Package Citations
Package | Version | Citation |
---|---|---|
ggmosaic | 0.3.3 | Jeppson, Hofmann, and Cook (2021) |
ggpubr | 0.6.0 | Kassambara (2023) |
janitor | 2.2.0 | Firke (2023) |
kableExtra | 1.4.0 | Zhu (2024) |
resampledata | 0.3.1 | Chihara and Hesterberg (2018) |
sjlabelled | 1.2.0 | Lรผdecke (2022) |
sjPlot | 2.8.16 | Lรผdecke (2024) |
vcd | 1.4.12 | Meyer, Zeileis, and Hornik (2006); Zeileis, Meyer, and Hornik (2007); Meyer et al. (2023) |
vcdExtra | 0.8.5 | Friendly (2023) |
Footnotes
Citation
@online{v.2022,
author = {V., Arvind},
title = {๐ {Mosaics:} {Plotting} {Categorical} {Data}},
date = {2022-12-27},
url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/40-CatData},
langid = {en},
abstract = {Types, Categories, and Counts}
}