Proportion

Rescuing Jack

Author

Arvind V

Published

April 26, 2024

Modified

June 26, 2024

Abstract

Associations between Qual variables

What graphs will we see today?

Variable #1	Variable #2	Chart Names	Chart Shape
Qual	Qual	Mosaic Charts

What kind of Data Variables will we choose?

No	Pronoun	Answer	Variable/Scale	Example	What Operations?
3	How, What Kind, What Sort	A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best...)	Qualitative/Ordinal	Socioeconomic status (Low income, Middle income, High income), Education level (High School, BS, MS, PhD), Satisfaction rating (Very Much Dislike, Dislike, Neutral, Like, Very Much Like)	Median, Percentile

Inspiration

Figure 1: Covid Deaths. Source: https://datatopics.worldbank.org/sdgatlas/goal-3-good-health-and-well-being?lang=en

According to the World Bank, six countries (India, Russia, Indonesia, United States, Brazil, and Mexico) accounted for over 60 percent of the total additional deaths in the first two years of the pandemic.

How do these Chart(s) Work?

We saw with Bar Charts that when we deal with a single Qual variable, we perform counts for each level of the variable. What if there are two Quals? Or even more?

The answer is to take them pair-wise and make all combinations of levels for both and calculate counts for these. This is called a Contingency Table.

What is a Contingency Table?

From Wolfram Alpha:

A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts.

More precisely, an \(r \times c\) contingency table shows the observed frequency of two variables the observed frequencies of which are arranged into \(r\) rows and \(c\) columns. The intersection of a row and a column of a contingency table is called a cell.

The Contingency Table is then plotted in a chart called the Mosaic Chart.
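To make the counting idea concrete, here is a minimal sketch that builds a contingency table from raw paired observations using only the Python standard library. The six records are hypothetical toy data invented for illustration, not rows from any survey, and the names `records` and `table` are our own.

```python
from collections import Counter

# Hypothetical toy records (NOT survey data): one (row level, column level)
# pair per respondent.
records = [
    ("Favor", "HS"), ("Oppose", "HS"), ("Favor", "HS"),
    ("Favor", "Graduate"), ("Oppose", "Graduate"), ("Oppose", "Graduate"),
]

# Count how often each combination of levels occurs.
pair_counts = Counter(records)

# The distinct levels of each Qual variable, in alphabetical order.
row_levels = sorted({row for row, _ in records})
col_levels = sorted({col for _, col in records})

# The contingency table: one cell per combination of levels.
# Counter returns 0 for combinations that never occur.
table = {
    row: {col: pair_counts[(row, col)] for col in col_levels}
    for row in row_levels
}

print("       ", col_levels)
for row in row_levels:
    print(f"{row:<7}", [table[row][col] for col in col_levels])
```

Each cell of `table` is simply the number of records that share that pair of levels, which is all a contingency table is.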

Dataset: General Social Survey 2002

Let us first construct a Contingency Table from this dataset, and then plot the mosaic chart for it.

Examine the Data

Data Dictionary

Quantitative Data

ID is the only Quant data variable!

Qualitative Data

“Region” “Gender” “Race”
“Education” “Marital” “Religion” “Happy”
“Income” “PolParty” “Politics” “Marijuana”
“DeathPenalty” “OwnGun” “GunLaw” “SpendMilitary” “SpendEduc” “SpendEnv” “SpendSci” “Pres00”
“Postlife”

are all Qual variables! Let us choose just two Qual variables from this dataset, DeathPenalty and Education.

DeathPenalty: (chr) Whether the respondent favours or opposes the death penalty, 2 levels (Favor, Oppose).
Education: (chr) Education level of the respondent, 5 levels (Left HS, HS, Jr Col, Bachelors, Graduate).

A Contingency table with these two Qual variables looks like Figure 4:

	Left HS	HS	Jr Col	Bachelors	Graduate	Sum
Favor	117	511	71	135	64	898
Oppose	72	200	16	71	50	409
Sum	189	711	87	206	114	1307

Figure 4: Contingency Table for General Social Survey 2002
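The Sum row and Sum column in the table are the margins, and they can be recomputed from the ten cell counts alone. The short sketch below does this as a sanity check. The counts are copied from the table above; the variable names are our own.

```python
# Cell counts copied from the Contingency Table above (margins left out on purpose).
education_levels = ["Left HS", "HS", "Jr Col", "Bachelors", "Graduate"]
counts = {
    "Favor":  [117, 511, 71, 135, 64],
    "Oppose": [72, 200, 16, 71, 50],
}

# Row sums: total respondents for each DeathPenalty opinion.
row_sums = {opinion: sum(cells) for opinion, cells in counts.items()}

# Column sums: total respondents at each Education level.
col_sums = {
    level: sum(counts[opinion][j] for opinion in counts)
    for j, level in enumerate(education_levels)
}

# Grand total: every respondent is counted exactly once.
grand_total = sum(row_sums.values())

print(row_sums)
print(col_sums)
print(grand_total)
```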

Now then, how does one plot a set of data that looks like this, a matrix? No column is a single variable, nor is each row a single observation, which is what we expect from the idea of tidy data.

The answer is provided in the very shape of the data: we plot this as a set of tiles, where \[ \text{area of tile} \propto \text{count} \] In this way we recursively partition a (usually) square area into vertical and horizontal pieces whose areas are proportional to the counts at each combination of levels of the two Qual variables.

Research Questions

Question

Q1. Are Education and DeathPenalty associated?

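Let us plot the mosaic chart in two steps, choosing the Qual variables Education and DeathPenalty, in that order. Here are the two steps in the recursion:

In the first step, the square is sliced vertically into columns, one per Education level, with widths in proportion to the number of people at each level (the Sum row of the Contingency Table).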

In this second step, the columns from step #1 are sliced horizontally into tiles, in proportion to the number of people in each Education category/level who support/do not support DeathPenalty. This is done in proportion to all the entries in each COLUMN.
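The two slicing steps can be sketched numerically. This is an illustration of the geometry on a unit square, not a description of how Orange draws the chart internally. The counts come from the Contingency Table above; the `tiles` dictionary and its layout are our own choices.

```python
education_levels = ["Left HS", "HS", "Jr Col", "Bachelors", "Graduate"]
counts = {
    "Favor":  [117, 511, 71, 135, 64],
    "Oppose": [72, 200, 16, 71, 50],
}
n = sum(sum(cells) for cells in counts.values())

# Step 1: slice the unit square vertically, one column per Education level.
# Each column's width is its share of all respondents.
col_totals = [
    sum(counts[opinion][j] for opinion in counts)
    for j in range(len(education_levels))
]
widths = [total / n for total in col_totals]

# Step 2: slice each column horizontally, one tile per DeathPenalty opinion.
# Each tile's height is its share of that COLUMN's respondents.
tiles = {}
for j, level in enumerate(education_levels):
    for opinion in counts:
        height = counts[opinion][j] / col_totals[j]
        tiles[(level, opinion)] = (widths[j], height, widths[j] * height)

for (level, opinion), (width, height, area) in tiles.items():
    print(f"{level:>9} {opinion:>6}  width={width:.3f}  height={height:.3f}  area={area:.4f}")
```

Because width times height equals (column total / N) times (count / column total), each tile's area reduces to count / N, so area is proportional to count as required.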

Important

Note that the order in which we choose the variables matters, since the mosaic plot is fundamentally asymmetric. More on this in a bit.
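One way to see the asymmetry is to compare the two sets of proportions that the chart would display under each ordering. The sketch below uses the counts from the Contingency Table above; the variable names are our own.

```python
education_levels = ["Left HS", "HS", "Jr Col", "Bachelors", "Graduate"]
counts = {
    "Favor":  [117, 511, 71, 135, 64],
    "Oppose": [72, 200, 16, 71, 50],
}
col_totals = [
    sum(counts[opinion][j] for opinion in counts)
    for j in range(len(education_levels))
]
favor_total = sum(counts["Favor"])

# Education first, then DeathPenalty: tile heights show the share of each
# Education level that favours the death penalty.
favor_given_level = {
    level: counts["Favor"][j] / col_totals[j]
    for j, level in enumerate(education_levels)
}

# DeathPenalty first, then Education: tile sizes show how those who favour
# the death penalty are spread across Education levels.
level_given_favor = {
    level: counts["Favor"][j] / favor_total
    for j, level in enumerate(education_levels)
}

for level in education_levels:
    print(f"{level:>9}  favor|level={favor_given_level[level]:.3f}"
          f"  level|favor={level_given_favor[level]:.3f}")
```

The same cell count is divided by a different total in each case, so the two orderings answer different questions and produce differently shaped tiles.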

Colouring by Pearson Residuals

Mosaic Charts generated by Orange can be coloured based on “Pearson Residuals”. What this means is that the mosaic plot calculates the “expected counts” (see below) for each cell of the Contingency Table, and then the differences (i.e. “residuals”) between the Observed/Actual and Expected values. A Pearson residual scales each difference by the square root of the expected count. If the residual is negative (Obs < Exp), the tile is coloured red; if it is positive (Obs > Exp), the tile is coloured blue.
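As a rough sketch of the arithmetic behind the colouring, the snippet below computes expected counts and Pearson residuals for the Education and DeathPenalty table. It uses the standard definitions, expected = (row total × column total) / N and residual = (Obs − Exp) / √Exp. How Orange maps residual sizes to colour shades is not covered here, so the red/blue label below only reflects the sign.

```python
from math import sqrt

education_levels = ["Left HS", "HS", "Jr Col", "Bachelors", "Graduate"]
counts = {
    "Favor":  [117, 511, 71, 135, 64],
    "Oppose": [72, 200, 16, 71, 50],
}
row_totals = {opinion: sum(cells) for opinion, cells in counts.items()}
col_totals = [
    sum(counts[opinion][j] for opinion in counts)
    for j in range(len(education_levels))
]
n = sum(row_totals.values())

# Expected count for each cell if the two variables were not associated.
expected = {
    opinion: [row_totals[opinion] * col_totals[j] / n
              for j in range(len(education_levels))]
    for opinion in counts
}

# Pearson residual: (Observed - Expected) / sqrt(Expected).
residuals = {
    opinion: {
        level: (counts[opinion][j] - expected[opinion][j]) / sqrt(expected[opinion][j])
        for j, level in enumerate(education_levels)
    }
    for opinion in counts
}

for opinion in counts:
    for level in education_levels:
        r = residuals[opinion][level]
        colour = "red" if r < 0 else "blue"  # sign only; shading is not modelled
        print(f"{opinion:>6} {level:>9}  residual={r:+.2f}  ({colour})")
```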

In Figure 5 (b) we see that there is a small positive and a small negative residual at two locations in the mosaic chart. By and large the chart is white, showing very little association between Education and DeathPenalty. However, we should verify this using a statistical “chi-square” \(X^2\) test.

More on “expected counts” and the “chi-square” \(X^2\) test below.
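For readers who want to try the check right away, here is a minimal sketch of the chi-square computation on the same table, using only the standard library. The statistic is the sum of (Obs − Exp)² / Exp over all cells, with (rows − 1) × (columns − 1) degrees of freedom. The helper `chi2_sf_even_dof` is a hypothetical name of our own; it uses a closed-form tail probability that is valid only for an even number of degrees of freedom, which happens to hold here. In practice one would more likely use a statistics library, for example `scipy.stats.chi2_contingency`.

```python
from math import exp, factorial

education_levels = ["Left HS", "HS", "Jr Col", "Bachelors", "Graduate"]
counts = {
    "Favor":  [117, 511, 71, 135, 64],
    "Oppose": [72, 200, 16, 71, 50],
}
row_totals = {opinion: sum(cells) for opinion, cells in counts.items()}
col_totals = [
    sum(counts[opinion][j] for opinion in counts)
    for j in range(len(education_levels))
]
n = sum(row_totals.values())

# Chi-square statistic: sum over all cells of (Obs - Exp)^2 / Exp.
chi_square = 0.0
for opinion, cells in counts.items():
    for j, observed in enumerate(cells):
        expected = row_totals[opinion] * col_totals[j] / n
        chi_square += (observed - expected) ** 2 / expected

# Degrees of freedom for an r x c table.
dof = (len(counts) - 1) * (len(education_levels) - 1)


def chi2_sf_even_dof(x, k):
    """Upper-tail probability P(X > x) for a chi-square variable with EVEN dof k.

    Hypothetical helper for this sketch; it is not valid for odd k.
    """
    if k <= 0 or k % 2 != 0:
        raise ValueError("closed form used here needs a positive even dof")
    half = x / 2
    return exp(-half) * sum(half ** i / factorial(i) for i in range(k // 2))


p_value = chi2_sf_even_dof(chi_square, dof)

print(f"chi-square = {chi_square:.3f}, dof = {dof}, p-value = {p_value:.6f}")
```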

Plotting Mosaic Charts

The description of the Orange widget for mosaic charts is here.

Let us take a sadly famous dataset (no, not iris again 🙀), namely titanic, and examine it in Orange.

Not a mosaic plot, but a Matrix Plot.

Download this RAWGraphs workflow file, import it into RAWGraphs, and see.

Does not seem to have a mosaic diagram capability.