Proportions

Rescuing Jack and Rose

Published

April 26, 2024

Modified

June 13, 2025

Abstract

Single and Nested Proportions with Qual Variables

What graphs will we see today?

Variable #1	Variable #2	Chart Names	Chart Shape
Qual	Qual	Pies and Mosaic Charts

What kind of Data Variables will we choose?

No	Pronoun	Answer	Variable/Scale	Example	What Operations?
3	How, What Kind, What Sort	A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best...)	Qualitative/Ordinal	Socioeconomic status (Low income, Middle income, High income), Education level (High School, BS, MS, PhD), Satisfaction rating (Very Much Dislike, Dislike, Neutral, Like, Very Much Like)	Median, Percentile

Inspiration

From Figure 1 (a), it is seen that Egypt, Qatar, and the United States are the only countries with a population greater than 1 million on this list. Poor food habits are once again a factor, with some cultural differences. In Egypt, high food inflation has pushed residents to low-cost high-calorie meals. To combat food insecurity, the government subsidizes bread, wheat flour, sugar and cooking oil, many of which are the ingredients linked to weight gain. In Qatar, a country with one of the highest per capita GDPs in the world, a genetic predisposition towards obesity and sedentary lifestyles worsen the impact of rich diets. And in the U.S., bigger portions are one of the many reasons cited for rampant adult and child obesity. For example, Americans ate 20% more calories in the year 2000 than they did in 1983. They consume 195 lbs of meat annually compared to 138 lbs in 1953. And their grain intake has increased 45% since 1970.

It’s worth noting, however, that this dataset is based on BMI values, which do not fully account for body types with larger bone and muscle mass.

From Figure 1 (b), according to World Bank, six countries (India, Russia, Indonesia, United States, Brazil, and Mexico) accounted for over 60 percent of the total additional deaths in the first two years of the pandemic.

How do these Chart(s) Work?

We saw with Bar Charts that when we deal with single Qual variables, we perform counts for each level of the variable. For a single Qual variable, even one with multiple levels (e.g. Education Status: High school, College, Post-Graduate, PhD), we can count the observations as with Bar Charts and plot Pies.
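As a minimal sketch of this counting step, the snippet below tallies a single Qual variable and turns each count into a slice angle. The observations are hypothetical (they are not from the course dataset), and the names `education` and `slices` are only illustrative.

```python
from collections import Counter

# Hypothetical observations of one Qual variable (not from the course dataset)
education = ["High school", "College", "College", "PhD", "High school",
             "Post-Graduate", "College", "High school"]

# Count the observations at each level, exactly as for a Bar Chart
counts = Counter(education)
total = sum(counts.values())

# Each level gets a slice whose angle is its share of the full 360 degrees
slices = {level: 360 * n / total for level, n in counts.items()}

for level, n in counts.items():
    print(f"{level:14s} count={n}  share={n / total:.3f}  angle={slices[level]:.1f}")
```

The same counts could feed a Bar Chart; only the mapping from count to geometry (angle instead of bar length) changes.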

What if there are two Quals? Or even more?

The answer is to take them pair-wise, make all combinations of levels for both variables, and calculate the counts for each combination. The resulting table of counts is called a Contingency Table. Then we plot that table. We’ll see.
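A small sketch of the pair-wise counting idea, assuming six made-up respondents (the records are hypothetical and only borrow level names used later in this lesson):

```python
from collections import Counter

# Hypothetical records: (Education, DeathPenalty) for six respondents
records = [
    ("HS", "Favor"), ("HS", "Oppose"), ("Bachelors", "Favor"),
    ("HS", "Favor"), ("Graduate", "Oppose"), ("Bachelors", "Favor"),
]

# Count every observed combination of levels
pair_counts = Counter(records)

# Levels of each variable, sorted alphabetically
rows = sorted({opinion for _, opinion in records})
cols = sorted({education for education, _ in records})

# Arrange the counts as rows x columns; combinations never seen get 0
table = {r: {c: pair_counts[(c, r)] for c in cols} for r in rows}

print("".ljust(8) + "".join(c.ljust(11) for c in cols))
for r in rows:
    print(r.ljust(8) + "".join(str(table[r][c]).ljust(11) for c in cols))
```

Note that a combination which never occurs still gets a cell, with a count of zero.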

Plotting Pies

Let us deal with single Qual variables first.

Let us use the same dataset as is used in the RAWgraphs tutorial (to follow):

Can you find the Pie Chart Widget in Orange? Let us do this “live” in class and test our new-found Orange skills!

Download this RAWgraphs project workflow and open it in RAWgraphs.

Note

Note the shape of data here: it is wide!

https://academy.datawrapper.de/article/24-how-to-create-a-pie-chart

The problem is that humans are pretty bad at reading angles. This ubiquitous chart is much vilified in the industry, and bar charts, which we have seen earlier, are viewed as better options. On the other hand, pie charts are widely used and very much accepted in design and business circles! Do also read this spirited defense of pie charts here. https://speakingppt.com/why-tufte-is-flat-out-wrong-about-pie-charts/

Plotting Nested Proportions

When we want to visualize proportions based on multiple Qual variables, we are looking at what Claus Wilke calls nested proportions: groups within groups. Making counts with combinations of levels for two Qual variables gives us a data structure called a Contingency Table, which we will use to build our plot for nested proportions.

What is a Contingency Table?

From Wolfram Alpha:

A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts.

More precisely, an $r \times c$ contingency table shows the observed frequency of two variables the observed frequencies of which are arranged into $r$ rows and $c$ columns. The intersection of a row and a column of a contingency table is called a cell.

The Contingency Table is then plotted in a chart called the Mosaic Chart. Let us develop our intuition for a Contingency Table first, and arrive at the mosaic chart.

Dataset: General Social Survey 2002

Let us first construct a Contingency Table from this dataset, and then plot the mosaic chart for it.

Here is the Orange workflow:

Examine the Data

Data Dictionary

Quantitative Data

ID is the only Quant data variable!

Qualitative Data

“ID” “Region” “Gender” “Race”
“Education” “Marital” “Religion” “Happy”
“Income” “PolParty” “Politics” “Marijuana”
“DeathPenalty” “OwnGun” “GunLaw” “SpendMilitary” “SpendEduc” “SpendEnv” “SpendSci” “Pres00”
“Postlife”

Apart from ID, these are all Qual variables! Let us choose just two Qual variables from this dataset, DeathPenalty and Education.

DeathPenalty: (chr) Opinion as to whether they favour or oppose the death penalty
Education: (chr) Education among respondents, 5 levels (Left HS, HS, Jr Col, Bachelors, Graduate).

A Contingency table with these two Qual variables looks like Figure 4:

DeathPenalty	Left HS	HS	Jr Col	Bachelors	Graduate	Sum
Favor	117	511	71	135	64	898
Oppose	72	200	16	71	50	409
Sum	189	711	87	206	114	1307

Figure 4: Contingency Table

How was this computed?

So $117$ is the number of people who Left HS and Favor the death penalty, and $71$ is the count for Bachelors who Oppose the death penalty. And so on.
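The Sum row and Sum column of Figure 4 can be checked with a few lines. This is only a sketch that re-types the counts from Figure 4; the variable names are illustrative.

```python
# Cell counts copied from Figure 4 (columns in the order shown there)
levels = ["Left HS", "HS", "Jr Col", "Bachelors", "Graduate"]
observed = {
    "Favor":  [117, 511, 71, 135, 64],
    "Oppose": [72, 200, 16, 71, 50],
}

# Row totals: one per DeathPenalty opinion
row_sums = {opinion: sum(cells) for opinion, cells in observed.items()}

# Column totals: one per Education level
col_sums = [sum(col) for col in zip(*observed.values())]

# Grand total: the number of respondents in the table
grand_total = sum(row_sums.values())

print(row_sums)
print(dict(zip(levels, col_sums)))
print(grand_total)
```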

Now then, how does one plot a set of data that looks like this, a matrix? No column is a single variable, nor is each row a single observation, which is what the idea of tidy data requires.

The answer is provided in the very shape of the data: we plot this as a set of tiles, where $\text{area of tile} \sim \text{count}$. In this way we recursively partition off a (usually) square area into vertical and horizontal pieces whose area is proportional to the count at a specific combination of levels of the two Qual variables. So we might follow the process as shown below:

Take the bottom row of per-column totals and create vertical rectangles with these widths.
Take the individual counts in the rows and partition each rectangle based on the counts in these rows.
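The two steps above can be sketched as a small geometry calculation on a unit square. This is an illustration under stated assumptions, not how Orange draws the chart: columns are kept in the Figure 4 order, no gaps are left between tiles, and the `tiles` structure is hypothetical.

```python
# Cell counts copied from Figure 4
levels = ["Left HS", "HS", "Jr Col", "Bachelors", "Graduate"]
favor = [117, 511, 71, 135, 64]
oppose = [72, 200, 16, 71, 50]
n = sum(favor) + sum(oppose)

tiles = []
x = 0.0
for level, f, o in zip(levels, favor, oppose):
    col_total = f + o
    # Step 1: vertical rectangle whose width is the column's share of all respondents
    width = col_total / n
    y = 0.0
    for opinion, count in (("Favor", f), ("Oppose", o)):
        # Step 2: slice the rectangle in proportion to the counts within the column
        height = count / col_total
        tiles.append({"education": level, "opinion": opinion, "count": count,
                      "x": x, "y": y, "width": width, "height": height,
                      "area": width * height})
        y += height
    x += width

for t in tiles:
    print(f'{t["education"]:10s} {t["opinion"]:7s} area={t["area"]:.4f}')
```

Because width is the column total over the grand total, and height is the cell count over the column total, each tile's area works out to its cell count divided by the grand total.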

Let us do this step by step.

Research Questions

Question

Q1. Are Education and DeathPenalty associated?

Let us plot the mosaic chart in two steps: we now choose Qual variables Education and DeathPenalty, in that order to plot the mosaic chart. Here are the two steps in the recursion:

The first split shows the various levels of Education and their counts as widths. Order is alphabetical! This splitting corresponds to the bottom ROW of Figure 4. HS is clearly the largest subgroup in Education.

In the second step, the columns from Figure 5 (a) are sliced horizontally into tiles, in proportion to the number of people in each Education category/level who support/do not support DeathPenalty. This is done in proportion to all the entries in each COLUMN.

Important

Note that the order in which we choose the variables matters, since the mosaic plot is fundamentally asymmetric. More on this in a bit.
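One way to see the asymmetry, as a rough sketch using the Figure 4 counts: splitting by Education first shows the share of each Education level that favours the death penalty, whereas splitting by DeathPenalty first shows the share of all "Favor" respondents that fall in each Education level. These are different proportions. The dictionary names below are illustrative.

```python
# Cell counts copied from Figure 4
levels = ["Left HS", "HS", "Jr Col", "Bachelors", "Graduate"]
favor = [117, 511, 71, 135, 64]
oppose = [72, 200, 16, 71, 50]
n_favor = sum(favor)

# Education first: tile heights show the proportion favouring WITHIN each level
favor_given_edu = {lvl: f / (f + o) for lvl, f, o in zip(levels, favor, oppose)}

# DeathPenalty first: tile sizes show how the "Favor" group splits ACROSS levels
edu_given_favor = {lvl: f / n_favor for lvl, f in zip(levels, favor)}

for lvl in levels:
    print(f"{lvl:10s} favour within level={favor_given_edu[lvl]:.3f}  "
          f"level within favour={edu_given_favor[lvl]:.3f}")
```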

Colouring by Pearson Residuals

Mosaic Charts generated by Orange can be coloured based on “Pearson Residuals”. What this means is that the mosaic plot calculates the “expected counts” (see below) for each cell of the Contingency Table and then the differences (i.e. “residuals”) between Observed/Actual and Expected values. A Pearson residual scales each difference by the square root of the expected count: $(Obs - Exp)/\sqrt{Exp}$. If the residual is negative (Obs < Exp), the tile is coloured red; if it is positive (Obs > Exp), it is coloured blue.
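A minimal sketch of this calculation for the Figure 4 counts. It assumes the usual independence formula for expected counts (row total times column total, divided by the grand total) and the standard Pearson residual $(Obs - Exp)/\sqrt{Exp}$. The red/blue labels only mirror the sign rule described above; how strongly Orange shades a tile is not modelled here.

```python
import math

# Cell counts copied from Figure 4
levels = ["Left HS", "HS", "Jr Col", "Bachelors", "Graduate"]
observed = {
    "Favor":  [117, 511, 71, 135, 64],
    "Oppose": [72, 200, 16, 71, 50],
}

row_sums = {opinion: sum(cells) for opinion, cells in observed.items()}
col_sums = [sum(col) for col in zip(*observed.values())]
n = sum(row_sums.values())

residuals = {}
for opinion, cells in observed.items():
    for level, obs, col_total in zip(levels, cells, col_sums):
        # Expected count if Education and DeathPenalty were unrelated
        expected = row_sums[opinion] * col_total / n
        # Pearson residual: difference scaled by sqrt of the expected count
        r = (obs - expected) / math.sqrt(expected)
        residuals[(opinion, level)] = r
        colour = "blue" if r > 0 else "red" if r < 0 else "white"
        print(f"{opinion:7s} {level:10s} obs={obs:4d} exp={expected:7.2f} "
              f"residual={r:+.2f} ({colour})")
```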

In Figure 5 (b) we see that there is a small positive and a small negative residual at two locations in the mosaic chart. By and large the chart is white, showing very little association between Education and DeathPenalty. However, we should verify this using a statistical “chi-square” $X^{2}$ test.
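As a hedged sketch of what such a check might look like, the snippet below computes the Pearson chi-square statistic for the Figure 4 counts by hand. The helper `chi2_sf_even_dof` is hypothetical and uses a closed form that is valid only when the degrees of freedom are even (which happens to hold for a 2 x 5 table); a statistics package would normally be used instead.

```python
import math

# Cell counts copied from Figure 4 (rows: Favor, Oppose)
observed = [
    [117, 511, 71, 135, 64],
    [72, 200, 16, 71, 50],
]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
n = sum(row_sums)

# Chi-square statistic: sum over cells of (Obs - Exp)^2 / Exp
chi2 = 0.0
for row, row_total in zip(observed, row_sums):
    for obs, col_total in zip(row, col_sums):
        expected = row_total * col_total / n
        chi2 += (obs - expected) ** 2 / expected

# Degrees of freedom for an r x c table
dof = (len(observed) - 1) * (len(col_sums) - 1)


def chi2_sf_even_dof(x, dof):
    """Upper-tail chi-square probability; this closed form holds only for even dof."""
    if dof % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))


p_value = chi2_sf_even_dof(chi2, dof)
print(f"chi-square = {chi2:.3f}, dof = {dof}, p-value = {p_value:.6f}")
```

A small p-value would suggest that the counts are hard to reconcile with "no association", even where the mosaic chart looks mostly white.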

More on “expected counts” and the “chi-square” $X^{2}$ test below.

Plotting Mosaic Charts

The description of the Orange widget for mosaic charts is here.

Let us take a sadly famous dataset (no, not iris again 🙀), namely titanic, and examine it in Orange.

We will reuse this workflow:

Not a mosaic plot, but a Matrix Plot.

Download this RAWgraphs workflow file and import it into RAWgraphs to see the result.

Does not seem to have a mosaic diagram capability.