No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
---|---|---|---|---|---|
3 | How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best) | Qualitative/Ordinal | Socioeconomic status (Low income, Middle income, High income), Education level (High School, BS, MS, PhD), Satisfaction rating (Very much Dislike, Dislike, Neutral, Like, Very Much Like) | Median, Percentile |
Proportion
Rescuing Jack
What graphs will we see today?
Variable #1 | Variable #2 | Chart Names | Chart Shape |
---|---|---|---|
Qual | Qual | Mosaic Charts |
What kind of Data Variables will we choose?
Inspiration
According to the World Bank, six countries (India, Russia, Indonesia, United States, Brazil, and Mexico) accounted for over 60 percent of the total additional deaths in the first two years of the pandemic.
How do these Chart(s) Work?
We saw with Bar Charts that when we deal with a single Qual variable, we perform counts for each level of the variable. What if there are two Quals? Or even more?
The answer is to take them pair-wise and make all combinations of levels for both and calculate counts for these. This is called a Contingency Table.
From Wolfram Alpha:
A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts.
More precisely, an \(r \times c\) contingency table shows the observed frequency of two variables the observed frequencies of which are arranged into \(r\) rows and \(c\) columns. The intersection of a row and a column of a contingency table is called a cell.
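The pair-wise counting described above can be sketched in a few lines of Python. This is only an illustration, assuming the `pandas` library is available; the eight rows of data are made up for the example and do not come from any real dataset.

```python
import pandas as pd

# Made-up toy data (NOT from any real dataset): one row per person,
# with two Qual variables.
people = pd.DataFrame({
    "sex":      ["male", "male", "male", "female", "female", "male", "female", "male"],
    "survived": ["no",   "no",   "yes",  "yes",    "yes",    "no",   "no",     "yes"],
})

# Count every combination of levels of the two Quals:
# this is the contingency table, one cell per (row level, column level).
table = pd.crosstab(people["sex"], people["survived"])
print(table)

# The same table with row and column totals added in the margins.
table_with_totals = pd.crosstab(people["sex"], people["survived"], margins=True)
print(table_with_totals)
```

Each cell holds the number of rows that share that pair of levels, which is exactly the counting a mosaic chart tool does internally before it draws the tiles.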
The Contingency Table is then plotted in a chart called the Mosaic Chart.
Plotting Mosaic Charts
The description of the Orange widget for mosaic charts is here.
Let us take a very sadly famous data set (no, not iris again), but titanic, and examine it in Orange.
Not a mosaic plot, but a Matrix Plot.
Download this RAWGraphs workflow file and import there and see.
Does not seem to have a mosaic diagram capability.
Dataset: Titanic
Ok, let us see if we can rescue Jack. Here is the titanic data.
Examine the Data
There were 2201 people on board (passengers and crew), as per this dataset.
Data Dictionary
None.
- survived: (chr) yes or no
- status: (chr) Class of Travel, else "crew"
- age: (chr) Adult, Child
- sex: (chr) Male / Female
Research Questions
Does survived depend upon sex?
Note the huge imbalance in survived with sex: men have clearly perished in larger numbers than women. This is why the colouring by the Pearson Residuals shows large positive residuals for men who died, and large negative residuals for women who died.
So sadly Jack is far more likely to have died than Rose.
Does survived depend upon status?
Crew has seen deaths in large numbers, as seen by the large negative residual for crew-survivals. First Class passengers have had speedy access to the boats and have survived in larger proportions than say second or third class. There is a large positive residual for first-class survivals.
Rose travelled first class and Jack was third class. So again the odds are stacked against him.
What is the Story Here?
In Figure 8, we have plotted sex vs status, and coloured by whether the (subset of) people survived or not. (Red is YES, Blue is NO!) As can be seen, the areas are very dissimilar across both variables. More deaths occurred among the crew than among the passengers; and more first class passengers have survived than third class passengers. And of course, more men died than women.
So we can state that:
- Status and Survived are not un-associated
- Sex and Survived are not un-associated
- Does ticking the Compare with Total box in Orange help to arrive at this inference? How so? Or does it confuse?
It remains to figure out just how serious this association is. For that we need the statistical "chi-square" \(X^2\) test.
The mosaic chart is a visualization of the obtained counts, based on which the tiles are constructed.
It is also possible to compute a per-cell expected count, if the categorical variables are assumed independent, that is, not correlated. This is the NULL Hypothesis. The test for whether they are independent or not, as any inferential test, is based on comparing the observed counts with these expected counts under the null hypothesis. So, what might the expected frequency of a cell be in cross-tabulation table for cell \(i,j\) given no relationship between the variables of interest?
Represent the sum of row \(i\) with \(n_{i+}\), the sum of column \(j\) with \(n_{+j}\), and the grand total of all the observations with \(n\). Independence of the variables means that their joint probability is the product of their individual probabilities: the probability of landing in cell \(i,j\) is \((n_{i+}/n)(n_{+j}/n)\), and multiplying this by \(n\) gives the expected count. Therefore, the Expected Cell Frequency/Count is given by:
\[ \begin{array}{lcl} \text{Expected Count}~e_{i,j} &=& \frac{rowSum \times colSum}{n}\\ &=& \frac{(n_{i+})(n_{+j})}{n} \end{array} \]
The comparison of what occurred to what is expected is based on their difference, scaled by the square root of the expected, the Pearson Residual:
\[ \begin{array}{lcl} r_{i,j} &=& \frac{(Actual - Expected)}{\sqrt{\displaystyle Expected}}\\ &=& \frac{(o_{i,j}- e_{i,j})}{\sqrt{\displaystyle e_{i,j}}} \end{array} \]
The sum of all the squared Pearson residuals is the chi-square statistic, \(\chi^2\), upon which the inferential analysis follows.
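To make the three steps concrete (expected count, Pearson residual, chi-square), here is a minimal sketch in plain Python. The observed counts are an assumption: they are the sex by survived counts from the commonly distributed 2201-person Titanic table, so do check them against the file you loaded into Orange. The variable names are only illustrative.

```python
from math import sqrt

# Observed counts o[i][j]: rows = sex, columns = survived.
# ASSUMPTION: counts from the commonly distributed 2201-person
# Titanic table; verify them against your own file.
row_labels = ["male", "female"]
col_labels = ["no", "yes"]
observed = [
    [1364, 367],   # male:   died, survived
    [126, 344],    # female: died, survived
]

row_sums = [sum(row) for row in observed]          # one total per row
col_sums = [sum(col) for col in zip(*observed)]    # one total per column
n = sum(row_sums)                                  # grand total

# Expected count under independence: rowSum * colSum / n
expected = [[r * c / n for c in col_sums] for r in row_sums]

# Pearson residual: (observed - expected) / sqrt(expected)
residuals = [
    [(o - e) / sqrt(e) for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# Chi-square statistic: the sum of all squared Pearson residuals
chi_square = sum(r ** 2 for row in residuals for r in row)

for label, row in zip(row_labels, residuals):
    print(label, dict(zip(col_labels, (round(r, 2) for r in row))))
print("chi-square:", round(chi_square, 1))
```

With these counts the residual for men who died comes out positive and the residual for women who died comes out negative, which is the same pattern the mosaic chart shows through its colouring.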
For the intrepid and insatiably curious, there is an intuitive explanation, and some hand-calculations and walk-through of the Contingency table and the \(\chi^2\)-test here.
Dataset: Who Does the Housework?
Let us take this dataset on household tasks, and who does them. Download this dataset and import it into your Mosaic Chart workflow.
Examine the Data
52 observations.
Data Dictionary
- Freq: (int) No. of times a task was carried out by specific people
- Who: (chr) Who carried out the task?
- Task: (chr) Task? Which task? Can't you see I'm tired?
Let us plot the mosaic chart:
This data looks fine all right, but this mosaic plot looks bewildering, dumbfounding, and is of course wrong. The reason for this is that the basic HouseTasks.csv data is pre-aggregated: we already have a neat column of counts, Freq. And why is this a problem? Orange expects data to be purely categorical for the Mosaic Chart and does its own counting internally. It is not able to sensibly use this Freq column. Orange simply counts categories here: every Who and Task combination appears exactly once, so the counts are all identical and of no use.
Most, if not all, statistical graphs do some internal computation. For instance, the bar chart performs counts for each level of a Qual variable; a Histogram both bins the Quant variable and counts the entries in each bin. This is a good thing, people, but it does mean that the data needs to be in a specific format before using it for plots.
So now what? We need to (wait for it):
- uncount the data
- How? Take each combination of Quals Who and Task
- Repeat (i.e. copy-paste) that combo line as many times as the value in Freq
- (optionally) Delete the Freq column, or at least do not use it further
All this is (to the best of my ability) not possible in any of these trifling tools that we are using here, and can be done in a jiffy in R or Python. Didn't I tell you coding was far far far far simpler? Peasants.
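For the curious, the uncount procedure might look like this in Python. This is a sketch, assuming the `pandas` library and the column names Who, Task and Freq from the data dictionary; the four rows are made up for illustration, and the output file name in the comments is hypothetical.

```python
import pandas as pd

# Made-up pre-aggregated rows in the same shape as the housework data
# (columns Who, Task, Freq). The values are illustrative only.
aggregated = pd.DataFrame({
    "Who":  ["Wife", "Husband", "Wife", "Husband"],
    "Task": ["Laundry", "Laundry", "Repairs", "Repairs"],
    "Freq": [3, 1, 0, 2],
})

# Repeat each row as many times as its Freq value, then drop Freq.
# A row with Freq 0 is repeated zero times, so it disappears.
uncounted = (
    aggregated.loc[aggregated.index.repeat(aggregated["Freq"])]
    .drop(columns="Freq")
    .reset_index(drop=True)
)
print(uncounted)

# For the real file one would read and write CSVs instead
# (the output file name here is hypothetical):
# aggregated = pd.read_csv("HouseTasks.csv")
# uncounted.to_csv("HouseTasks_uncounted.csv", index=False)
```

The result has one row per occurrence, so a tool that does its own counting recovers the original Freq values when it tabulates Who against Task.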
So following this ashtavakra procedure of jumping to another tool and coming back here, good things can be somehow made to happen, and so here is the "un-aggregated" data for you:
Import this into Orange.
Research Questions
What is the Story Here?
The Mosaic plot in Figure 11 is seriously coloured, showing that there are Pearson Residuals/Errors in both directions (positive and negative). The \(\chi^2\)-value is large (not visible here, check in Orange) and the p-value is reported as zero, that is, vanishingly small. This indicates that it is very, very unlikely that data like this would arise by chance if the two Qual variables were un-related. Hence, we are likely to conclude that our assumption that they are un-related can be rejected. (Note the complex wording here. We don't say they are related.)
Why is this unsurprising? Men don't do housework, it would seem.
In general, if you want to spot association, look for serious amounts of colour in the mosaic chart.
Your Turn
- Clothing and Intelligence Rating of Children!! Are the well-dressed actually smarter? Is it the exact reverse with SMI faculty? Or …?
Wait, But Why?
- We can detect correlation between Quant variables using the scatter plots and regression lines
- And we can detect association between Qual variables using mosaics and sieves (which we did not see here, but is possible in Orange)
- Your project primary research data may be pure Qualitative too, as with a Questionnaire / Survey instrument.
- One such Qual variable therein will be your target variable
- You will need to justify whether the target variable is dependent upon the other Quals, and then to decide what to do about that.
Often, the primary research questionnaire takes the form of Questions whose answers are on a Likert Scale, where several respondents rate a product, or a service, on a scale of Very much like, Somewhat like, Neutral, Dislike and Very much dislike, for example. The data are again categorical; but a Contingency Table / Mosaic Chart would be quite complex to behold and understand. A Likert Plot is what is constructed at such times. Here is a sample Likert Plot for a fictitious app called "QuickEZ":
Yeah, this is possible in R and Python. But not in these barbarian tools that we are using. There are some websites that offer free apps for these plots too.
For more tutorial information, head off to Visualizing Survey Data (in R).
References
Michael Friendly. A Brief History of the Mosaic Display. https://www.datavis.ca/papers/moshist.pdf
David Meyer, Achim Zeileis, Kurt Hornik. Visualizing Contingency Tables. Some very clear and simple pictures at https://statmath.wu.ac.at/projects/vcd/
Nice Chi-square interactive story at https://statisticalstories.xyz/chi-square
A different graph on Housework Inequality, but the same story! https://datatopics.worldbank.org/sdgatlas/goal-5-gender-equality?lang=en#c4