Groups

The further off from England the nearer is to France.

Qual Variables

Quant Variables

Box Plots

t.test

ANOVA

Published

April 22, 2024

Modified

August 16, 2024

What graphs will we see today?

Variable #1	Variable #2	Chart Names	Chart Shape
Quant	None	Box-Whisker Plots and Violin Plots

No	Pronoun	Answer	Variable/Scale	Example	What Operations?
2	How Many / Much / Heavy? Few? Seldom? Often? When?	Quantities with Scale. Differences are meaningful, but not products or ratios	Quantitative/Interval	pH, SAT score (200-800), Credit score (300-850), Year of Starting College	Mean, Standard Deviation
3	How, What Kind, What Sort	A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best)	Qualitative/Ordinal	Socioeconomic status (Low income, Middle income, High income), Education level (High School, BS, MS, PhD), Satisfaction rating (Very Much Dislike, Dislike, Neutral, Like, Very Much Like)	Median, Percentile

Inspiration

Alice said, “I say what I mean and I mean what I say!” Are the rest of us so sure? What do we mean when we use any of the phrases above? How definite are we? There is a range of “sureness” and “unsureness”…and this is where we can use box plots like Figure 1 to show that range of opinion.

Maybe it is time for a box plot on uh, shades¹ of meaning for ~~Jane Austen~~ Gen-Z phrases! Bah.

How do these Chart(s) Work?

A Box Plot is an extremely useful data visualization that gives us an idea of the distribution of a Quant variable for each level of a Qual variable. Internally, the plot is built as follows:

make groups of the Quant variable, one for each level of the Qual
in each group, rank the Quant variable values in increasing order
calculate the median, the quartiles (and hence the IQR), and any outliers
plot these as a vertical or horizontal box structure
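The steps above can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not what Orange does internally: the function names `box_stats` and `grouped_box_stats` are made up for illustration, the toy rows are hypothetical, outliers are flagged with the common 1.5 × IQR rule (the text does not say which rule is used), and quartile conventions differ slightly between tools.

```python
from statistics import quantiles


def box_stats(values, whisker=1.5):
    """Summarise one group the way a box plot draws it."""
    ordered = sorted(values)  # rank the values in increasing order
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1  # length of the box
    low_fence = q1 - whisker * iqr    # assumed 1.5 x IQR outlier rule
    high_fence = q3 + whisker * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


def grouped_box_stats(rows, qual, quant):
    """Group the Quant values by each level of the Qual, then summarise."""
    groups = {}
    for row in rows:
        groups.setdefault(row[qual], []).append(row[quant])
    return {level: box_stats(values) for level, values in groups.items()}


# Hypothetical toy data, not taken from any dataset in this module.
rows = [{"group": "A", "value": v} for v in [5, 1, 50, 3, 7, 2, 8, 4, 6]]
rows += [{"group": "B", "value": v} for v in [14, 10, 18, 12, 16]]

summary = grouped_box_stats(rows, qual="group", quant="value")
for level, stats in summary.items():
    print(level, stats)
```

In group A the value 50 lies beyond the upper fence, so it would be drawn as a separate outlier point while the whisker stops at 8.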

The box can also be asymmetric “half plots” if needed…

Histograms and Box Plots

Note how the histogram dwells upon the mean and standard deviation, whereas the boxplot focuses on the median and quartiles. The former uses the values of the Quant variable, whereas the latter uses their sequence numbers, or ranks.
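A small sketch makes the values-versus-ranks distinction concrete. The numbers are hypothetical salaries in $K; the second list only moves the largest value, so every rank stays the same.

```python
from statistics import mean, median, stdev

# Hypothetical salaries in $K; only the largest value differs between lists.
salaries = [90, 95, 100, 105, 110]
stretched = [90, 95, 100, 105, 240]

for name, data in [("original", salaries), ("stretched", stretched)]:
    print(
        name,
        "mean:", mean(data),
        "sd:", round(stdev(data), 1),
        "median:", median(data),
    )
```

The mean and standard deviation both shift when one value is stretched, while the median does not move, because it depends only on which value sits in the middle of the ranking.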

Box plots are often used, for example, in HR operations to understand salary distributions across grades of employees. Marks of students in competitive exams are also declared using quartiles.

In Figure 2, the boxplot is the one on the top. The box part represents the middle 50% of the data, in order of magnitude, and the median line divides the box into two halves.

The boxplot in the figure is compared with a density plot, which shows a symmetric normal density. Since the latter is symmetric, the median and the mean are identical, as seen by the correspondence with the boxplot in the figure above.

In Figure 3 (a), we see the difference between boxplots that show symmetric and skewed distributions. The “lid” and the “bottom” of the box are not of similar width in distributions with significant skewness.

Compare these with the corresponding Figure 3 (b).
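The same comparison can be made numerically. The sketch below measures the two halves of the box (median minus Q1, and Q3 minus median) for two hypothetical sets of values; the helper name `box_halves` is made up for illustration.

```python
from statistics import quantiles


def box_halves(values):
    """Return (median - Q1, Q3 - median): the two halves of the box."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return med - q1, q3 - med


symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]        # hypothetical values
right_skewed = [1, 2, 3, 4, 5, 8, 13, 21, 34]  # hypothetical values

print("symmetric:", box_halves(symmetric))
print("right-skewed:", box_halves(right_skewed))
```

For the symmetric values the two halves are equal; for the right-skewed values the upper half is four times as long as the lower half.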

Creating Box Plots

Here is the BoxPlot Widget description.

Let us first plot a set of boxplots for the familiar iris dataset and then investigate other datasets using the same Orange workflow.

Figure 4 shows the three horizontal box plots for the chosen Quant variable, one for each level of the iris species variable. The IQR is also shown for each of the groups. One can selectively compare either medians or means across these groups of measurements.

https://youtu.be/Cax0cQ6caI8

There does not seem to be a way of creating Box Plots in DataWrapper.

Dataset: Salaries in Academia

Let us examine this dataset in Orange.

Examine the Data

Figure 5 states that there are 397 teachers, with 6 variables in the dataset.

Data Dictionary

Quantitative Data

salary: (int) (Annual) Salary!
yrs_service: (int) No. of Years they have served as teachers
yrs_since_phd: (int) No. of Years after their PhD. (sigh)

Qualitative Data

discipline: (chr) Nature of Expertise
rank: (chr) Nature of Appointment
sex: (chr) Male / Female. Note the imbalance in the counts!!

Qual and Quant…

Can any of the Quant variables be thought of as Qual variables? When, under what circumstances?

Research Questions

Let’s try a few questions and see if they are answerable with Box Plots and Violin Plots.

Question

Q1. What is the distribution of salary? If we split by sex?

Question

Q2. What is the distribution of salary, when we split by other Qual variables, such as rank?

What is the Story Here?

Salaries have quite a wide distribution, with some very highly paid individuals (~ $240K), while the median is still $107K. So some people are paid more than 2X the median!

When split by sex, we get two box plots that show the differences between group salaries. The means and medians are quite different between the two groups, an important inference that needs to be completely verified by a statistical t-test.

When split by rank, we get three box plots that show the differences between group salaries, again an important inference that needs to be completely verified by a statistical ANOVA test.

Are the Differences Significant?

Hunches and Hypotheses

In data analysis, we always want to know², as in life, how important things are, whether they matter. To do this, we make up hunches, or more precisely, Hypotheses. We make two in fact:

$H_0$: Nothing is happening;
$H_a$: (“a” for Alternate): Something is happening and it is important enough to pay attention to.

We then pretend that $H_0$ is true and ask that our data prove us wrong; if it does, we reject $H_0$ in favour of $H_a$.

This is a very important idea of Hypothesis Testing which helps you justify your hunch. Try to do this for the Package Opening and Closing Times.
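One way to see the "pretend that $H_0$ is true" idea in action is a permutation test: if group labels do not matter, shuffling them should produce differences as large as the observed one fairly often. The sketch below is an illustration of that logic only. It is not the t-test or ANOVA reported by Orange, and the function name and toy data are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=2000, seed=0):
    """Share of label shuffles whose mean difference is at least the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # under H0 the labels are interchangeable
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_shuffles


# Hypothetical toy data: two clearly separated groups, then two identical ones.
p_separated = permutation_p_value([1, 2, 3, 4, 5, 6], [11, 12, 13, 14, 15, 16])
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3])

print("separated groups:", p_separated)
print("identical groups:", p_identical)
```

A small value means the shuffled data rarely reproduce the observed difference, which is the evidence we use to reject $H_0$. For the identical groups the observed difference is zero, so every shuffle matches it and the value is 1.0.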

t-test for two categories

When comparing mean salaries vs sex in Figure 7 (b), note the annotation below the graph. This is the result of the t-test:

\[ \text{Student's } t: ~ 3.198 ~ (p = 0.002, ~ N = 397) \]

This indicates several things:

That the t-statistic is 3.198;
If we assume sex makes no difference to salary, then the probability that a difference this large could arise merely by chance is low ($p = 0.002$);
And of course that there are $397$ data points to vouch for this estimate.

The test states that this difference is statistically significant and could be used to justify further actions based upon it. Look at the references below to get a fascinating history of statistical testing and its origins in …beer.
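To see where a t-statistic comes from, here is a minimal sketch of the classic equal-variance (pooled) form of Student's t. It assumes that this is the form behind the annotation, which the text does not confirm; the function name `student_t` and the two small groups are hypothetical and unrelated to the salary data.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Equal-variance two-sample t-statistic: mean difference / standard error."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical toy data.
group_a = [2, 4, 6]
group_b = [1, 2, 3]

t = student_t(group_a, group_b)
print("t =", round(t, 3))
```

The statistic is simply the difference in means measured in units of its standard error. Converting it to a p-value requires the t distribution with `n_a + n_b - 2` degrees of freedom, which a statistics library would normally supply.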

ANOVA test for more than 2 levels

Now observe the boxplots and annotations in Figure 8, where again we compare mean salaries vs rank. This is the result of the ANOVA-test:

\[ \text{ANOVA}: ~ 128.217 ~ (p = 0.000, ~ N = 397) \]

This indicates several things:

That the ANOVA F-statistic is 128.217;
If we assume rank makes no difference to salary, then the probability that these differences could arise merely by chance is negligible ($p = 0.000$, that is, below $0.0005$ when rounded to three decimals);
And again that there are $397$ data points to vouch for this estimate.

The ANOVA test states that the (multiple) differences are statistically significant and could be used to justify further actions based upon them.

ANOVA for the Cat-egorically Curious

For the intrepid, here is a brief, diagrammed, hand-calculated, and intuitive walk-through of ANOVA. Note that the t-test and ANOVA are closely related tests: the former compares the means of 2 groups, the latter the means of more than 2 groups, and with exactly 2 groups the ANOVA F-statistic equals the square of the equal-variance t-statistic. Again, means, not medians.
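As a companion to the hand calculation, the sketch below computes a one-way ANOVA F-statistic as the ratio of between-group to within-group variation, and checks it against the squared pooled t-statistic for two groups. The function names and the toy groups are hypothetical; this is an illustration of the arithmetic, not the routine Orange runs.

```python
from math import sqrt
from statistics import mean, variance


def one_way_anova_f(*groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    group_means = [mean(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def pooled_t(group_a, group_b):
    """Equal-variance two-sample t-statistic, used here only for comparison."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Hypothetical toy data: three groups, then two groups.
f_three = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
f_two = one_way_anova_f([2, 4, 6], [1, 2, 3])
t_two = pooled_t([2, 4, 6], [1, 2, 3])

print("F (three groups):", round(f_three, 3))
print("F (two groups):", round(f_two, 3), "t squared:", round(t_two ** 2, 3))
```

A large F means the group means are spread far apart relative to the scatter inside each group. With two groups, F and the squared t agree, which is why the two tests lead to the same conclusion in that case.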

Your Turn

Here are a couple of datasets that you might want to analyze with box plots, and even perform t-tests and ANOVA-tests:

Insurance Data

Political Donations

UFO Encounters

The data dictionary for this dataset is here at the TidyTuesday Website. The TidyTuesday Website is a treasure trove of interesting datasets!

GPT-based Language detectors are biased against non-native English writers.

What story can you tell, and what deductions can you make, from Figure 9 below? How would you replicate it? What would you add?

Wait, But Why?

Box plots are a powerful statistical graphic that give us a combined view of data ranges, quartiles, medians, and outliers.
Box plots can compare groups within our Quant variable, based on levels of a Qual variable. This is a very common and important task in research! In your design research, you would have numerical Quant data that is accompanied by categorical Qual data pertaining to your target audience. Analyzing for differences in the Quant across levels of the Qual (e.g. household expenditure across groups of people) is a vital step in justifying time, effort, and money for further actions in your project. Don’t faff this.
They are ideal for visualizing statistical tests for difference in mean values across groups (t-test and ANOVA).

Readings

Bevans, R. (2023, June 22). An Introduction to t Tests | Definitions, Formula and Examples. Scribbr. https://www.scribbr.com/statistics/t-test/
Brown, Angus. (2008). The Strange Origins of the t-test. Physiology News, No. 71, Summer 2008. https://static.physoc.org/app/uploads/2019/03/22194755/71-a.pdf
Ziliak, Stephen T. (2008). Guinnessometrics: The Economic Foundation of “Student’s” t. Journal of Economic Perspectives, Volume 22, Number 4, Fall 2008, Pages 199–216. https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.22.4.199
https://quillette.com/2024/08/03/xy-athletes-in-womens-olympic-boxing-paris-2024-controversy-explained-khelif-yu-ting/
Senefeld JW, Lambelet Coleman D, Johnson PW, Carter RE, Clayburn AJ, Joyner MJ. Divergence in Timing and Magnitude of Testosterone Levels Between Male and Female Youths. JAMA. 2020;324(1):99–101. doi:10.1001/jama.2020.5655. https://jamanetwork.com/journals/jama/fullarticle/2767852
Doriane Lambelet Coleman, Sex in Sport, 80 Law and Contemporary Problems 63-126 (2017). Available at: https://scholarship.law.duke.edu/lcp/vol80/iss4/5

Footnotes

The term throwing a shade can be found in Jane Austen’s novel Mansfield Park (1814). Young Edmund Bertram is displeased with a dinner guest’s disparagement of the uncle who took her in: “With such warm feelings and lively spirits it must be difficult to do justice to her affection for Mrs. Crawford, without throwing a shade on the Admiral.”↩︎
“Ah, Misha, he has a stormy spirit. His mind is in bondage. He is haunted by a great, unsolved doubt. He is one of those who don’t want millions, but an answer to their questions.” ― Fyodor Dostoevsky, The Brothers Karamazov: A Novel in Four Parts With Epilogue↩︎