🐢 Groups

The further off from England the nearer is to France.

Qual Variables and Quant Variables
Box Plots
t.test
ANOVA
Author

Arvind V

Published

April 22, 2024

Modified

June 24, 2024

What graphs will we see today?

| Variable #1 | Variable #2 | Chart Names | Chart Shape |
|---|---|---|---|
| Quant | None | Box-Whisker Plots and Violin Plots | |

| No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
|---|---|---|---|---|---|
| 2 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities with Scale. Differences are meaningful, but not products or ratios | Quantitative/Interval | pH, SAT score (200-800), Credit score (300-850), Year of Starting College | Mean, Standard Deviation |
| 3 | How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best...) | Qualitative/Ordinal | Socioeconomic status (Low income, Middle income, High income), Education level (High School, BS, MS, PhD), Satisfaction rating (Very Much Dislike, Dislike, Neutral, Like, Very Much Like) | Median, Percentile |

Inspiration

Figure 1: Box Plot Inspiration

Alice said, “I say what I mean and I mean what I say!” Are the rest of us so sure? What do we mean when we use any of the phrases above? How definite are we? There is a range of “sureness” and “unsureness”…and this is where we can use box plots like Figure 1 to show that range of opinion.

Maybe it is time for a box plot on, uh, shades¹ of meaning for Jane Austen Gen-Z phrases! Bah.

How do these Chart(s) Work?

Box Plots are an extremely useful data visualization that gives us an idea of the distribution of a Quant variable, for each level of another Qual variable. The internal process of this plot is as follows:

  • make groups of the Quant variable for each level of the Qual
  • in each group, rank the Quant variable values in increasing order
  • Calculate: median, IQR, outliers
  • plot these as a vertical or horizontal box structure
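The steps above can be sketched in a few lines of Python. This is only a rough illustration, not what Orange does internally: the `records` values are made up, `box_stats` is a hypothetical helper name, and the 1.5 × IQR rule for flagging outliers is an assumption (it is the common convention, but the text does not say which rule the plot uses).

```python
from statistics import quantiles

# Made-up (salary in $1000s, rank) pairs; purely illustrative, not the real dataset
records = [
    (80, "AsstProf"), (85, "AsstProf"), (78, "AsstProf"), (90, "AsstProf"), (150, "AsstProf"),
    (120, "Prof"), (130, "Prof"), (110, "Prof"), (125, "Prof"), (135, "Prof"),
]

def box_stats(values):
    """Return the ingredients of one box: Q1, median, Q3, IQR and outliers."""
    ordered = sorted(values)  # rank the values in increasing order
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")  # quartiles from the ranks
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}

# Make groups of the Quant variable for each level of the Qual variable
groups = {}
for value, level in records:
    groups.setdefault(level, []).append(value)

# One set of box statistics per group; a plotting tool would draw these as boxes
summary = {level: box_stats(vals) for level, vals in groups.items()}
for level, stats in summary.items():
    print(level, stats)
```

Each entry of `summary` holds what one box needs: the box edges (`q1`, `q3`), the line inside it (`median`), and the points drawn separately (`outliers`).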

The box can also be asymmetric “half plots” if needed…

Histograms and Box Plots

Note how the histogram dwells upon the mean and standard deviation, whereas the boxplot focuses on the median and quartiles. The former uses the values of the Quant variable, whereas the latter uses their sequence number or ranks.
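A tiny sketch of that difference, using made-up numbers: if we move the largest value far away without changing its rank, the mean and standard deviation shift, while the median stays put.

```python
from statistics import mean, median, stdev

# Made-up salaries in $1000s; the second list only changes the largest value
salaries = [90, 95, 100, 105, 110]
with_star = [90, 95, 100, 105, 240]  # same ranks, but the top earner is far away

for label, data in [("original", salaries), ("one extreme value", with_star)]:
    # Mean and SD use the values themselves; the median only uses their order
    print(label, "| mean:", mean(data), "| sd:", round(stdev(data), 1), "| median:", median(data))
```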

Box plots are often used, for example, in HR operations to understand Salary distributions across grades of employees. Marks of students in competitive exams are also declared using Quartiles.

Figure 2: Box Plot and Density

In Figure 2, the boxplot is the one on the top. The box part represents the middle 50% of the data, in order of magnitude, and the two halves of the box are defined by the median line.

The boxplot in the figure is compared with a density plot, which shows a symmetric normal density. Since the latter is symmetric, the median and the mean are identical, as seen by the correspondence with the boxplot in the figure above.

(a) Box Plot and Skewness
(b) Density and Skewness
Figure 3: Box Plot Discussions

In Figure 3 (a), we see the difference between boxplots that show symmetric and skewed distributions. In distributions with significant skewness, the “lid” and the “bottom” of the box are not of similar width.

Compare these with the corresponding Figure 3 (b).
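To make the “lid” and “bottom” idea concrete, here is a minimal sketch that measures the two halves of the box for a symmetric and a right-skewed set of made-up numbers. The helper name `box_halves` is hypothetical.

```python
from statistics import quantiles

def box_halves(values):
    """Return (bottom, lid): the widths median - Q1 and Q3 - median of the box."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q2 - q1, q3 - q2

# Made-up data, chosen only to show the two shapes
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 2, 2, 3, 5, 8, 13, 21]

print("symmetric   :", box_halves(symmetric))     # the two halves are equal
print("right-skewed:", box_halves(right_skewed))  # the upper half is much wider
```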

Creating Box Plots

Dataset: Salaries in Academia

Let us examine this dataset in Orange.

Examine the Data

Figure 5: Salaries Data Table

Figure 5 states that there are 397 teachers, with 6 variables in the dataset.

Figure 6: Salaries Data Table

Data Dictionary

Quantitative Data
  • salary: (int) (Annual) Salary!
  • yrs_service: (int) No. of Years they have served as teachers
  • yrs_since_phd: (int) No. of Years after their PhD. (sigh)
Qualitative Data
  • discipline: (chr) Nature of Expertise
  • rank: (chr) Nature of Appointment
  • sex: (chr) Male / Female. Note the imbalance in the counts!!
Qual and Quant…

Can any of the Quant variables be thought of as Qual variables? When, and under what circumstances?

Research Questions

Let’s try a few questions and see if they are answerable with Box Plots and Violins.

Question

Q1. What is the distribution of salary? What about when we split by sex?

(a) Salaries Box Plot
(b) Salaries Box Plot by Sex
Figure 7: Salaries Data Box Plots

Question

Q2. What is the distribution of salary, when we split by other Qual variables, such as rank?

Figure 8: Salaries Box Plot by Rank

What is the Story Here?

Salaries have quite a wide distribution, with some very highly paid individuals (~ $240K), while the median is still $107K. So some people are paid more than 2X the median!

When split by sex, we get two box plots that show the differences between group salaries. The means and medians are quite different between the two groups, an important inference that needs to be completely verified by a statistical t-test.

When split by rank, we get three box plots that show the differences between group salaries, again an important inference that needs to be completely verified by a statistical ANOVA test.

Are the Differences Significant?

Hunches and Hypotheses

In data analysis, as in life, we always want to know² how important things are and whether they matter. To do this, we make up hunches, or more precisely, Hypotheses. We make two, in fact:

\(H_0\): Nothing is happening;
\(H_a\): (“a” for Alternate): Something is happening and it is important enough to pay attention to.

We then pretend that \(H_0\) is true and ask that our data prove us wrong; if it does, we reject \(H_0\) in favour of \(H_a\).
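One way to act out “pretend that \(H_0\) is true” is a permutation test: if the group labels really make no difference, shuffling them should often produce a gap as large as the one we observed. The sketch below is an illustration with made-up numbers, and it is a different technique from the Student's t-test reported later in the text; the variable names are hypothetical.

```python
import random
from statistics import mean

# Made-up salaries (in $1000s) for two groups; not the real dataset
group_a = [115, 120, 108, 131, 125, 119, 140, 112]
group_b = [101, 98, 110, 105, 95, 108, 102, 99]

observed = mean(group_a) - mean(group_b)

# Under H0 the labels carry no information, so we shuffle them many times
rng = random.Random(42)  # fixed seed so the run is repeatable
pooled = group_a + group_b
n_a = len(group_a)
n_shuffles = 5000
at_least_as_extreme = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
    if abs(diff) >= abs(observed):
        at_least_as_extreme += 1

# Share of shuffles that matched or beat the observed gap: an estimate of p
p_value = at_least_as_extreme / n_shuffles
print(f"observed difference = {observed:.2f}, permutation p = {p_value:.4f}")
```

If very few shuffles reach the observed difference, the data are “proving us wrong” about \(H_0\), and we lean towards \(H_a\).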

This is a very important idea of Hypothesis Testing which helps you justify your hunch. Try to do this for the Package Opening and Closing Times.

t-test for two categories

When comparing mean salaries vs sex in Figure 7 (b), note the annotation below the graph. This is the result of the t-test:

\[ \text{Student's } t: 3.198 ~ (p = 0.002, ~ N = 397) \]

This indicates several things:

  • That the t-statistic is 3.198;
  • If we assume sex makes no difference to salary, then the probability that a difference this large could arise merely by chance is low (\(p = 0.002\));
  • And of course that there are \(397\) data points to vouch for this estimate.

The test states that this difference is statistically significant and could be used to justify further actions based upon it. Look at the references below to get a fascinating history of statistical testing and its origins in …beer.
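For those curious where a t-statistic comes from, here is a minimal hand calculation on made-up numbers. It assumes the pooled-variance (equal variances) form of Student's t; the text does not say which variant Orange computes, and `student_t` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

def student_t(a, b):
    """Two-sample Student's t-statistic with pooled variance (equal variances assumed)."""
    n_a, n_b = len(a), len(b)
    # Pool the two sample variances, weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    # Standard error of the difference between the two means
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / std_error

# Made-up salaries (in $1000s); not the real dataset
group_a = [115, 120, 108, 131, 125, 119, 140, 112]
group_b = [101, 98, 110, 105, 95, 108, 102, 99]

t_stat = student_t(group_a, group_b)
print(f"t = {t_stat:.3f}")
```

The statistic is simply the difference in means measured in units of its own uncertainty: the larger it is, the harder it is to blame the gap on chance.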

ANOVA test for more than 2 levels

Now observe the boxplots and annotations in Figure 8, where again we compare mean salaries vs rank. This is the result of the ANOVA-test:

\[ \text{ANOVA: } 128.217 ~ (p = 0.000, ~ N = 397) \]

This indicates several things:

  • That the ANOVA F-statistic is 128.217;
  • If we assume rank makes no difference to salary, then the probability that differences this large could arise merely by chance is negligible (reported as \(p = 0.000\), that is, \(p < 0.001\));
  • And again that there are \(397\) data points to vouch for this estimate.

The ANOVA test states that the (multiple) differences are statistically significant and could be used to justify further actions based upon them.
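The F-statistic can also be worked out by hand: it compares how far the group means sit from the overall mean with how much the values scatter inside each group. The sketch below uses made-up numbers for three ranks, and `anova_f` is a hypothetical helper name, not an Orange function.

```python
from statistics import mean

def anova_f(*groups):
    """One-way ANOVA F-statistic: between-group variance over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)
    ss_between = ss_within = 0.0
    for g in groups:
        m = mean(g)
        ss_between += len(g) * (m - grand_mean) ** 2   # spread of group means
        ss_within += sum((v - m) ** 2 for v in g)      # scatter inside the group
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Made-up salaries (in $1000s) for three ranks; not the real dataset
asst = [78, 80, 85, 90, 82]
assoc = [92, 95, 99, 104, 90]
prof = [120, 130, 110, 125, 135]

f_stat = anova_f(asst, assoc, prof)
print(f"F = {f_stat:.3f}")
```

A large F means the groups differ from one another far more than the values differ within any one group.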

ANOVA for the Cat-egorically Curious

For the intrepid, here is a brief, diagrammed, hand-calculated, and intuitive walk-through of ANOVA. Note that the t-test and ANOVA are closely related tests, the former being used for 2-level comparisons of means, and the latter for comparisons of more than 2 means. (With exactly two groups, the ANOVA F-statistic equals the square of the equal-variance t-statistic.) Again, means, not medians.
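One way to see how closely the two tests are related: with exactly two groups, the one-way ANOVA F-statistic equals the square of the pooled-variance t-statistic. The sketch below checks this on made-up numbers; `pooled_t` and `anova_f` are hypothetical helper names.

```python
from math import isclose, sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Student's t-statistic with pooled variance (equal variances assumed)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))

def anova_f(*groups):
    """One-way ANOVA F-statistic for any number of groups."""
    values = [v for g in groups for v in g]
    grand_mean, k = mean(values), len(groups)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (len(values) - k))

# Made-up salaries (in $1000s); not the real dataset
group_a = [115, 120, 108, 131, 125, 119, 140, 112]
group_b = [101, 98, 110, 105, 95, 108, 102, 99]

t_stat = pooled_t(group_a, group_b)
f_stat = anova_f(group_a, group_b)
print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print("same value:", isclose(t_stat ** 2, f_stat))
```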

Your Turn

Here are a couple of datasets that you might want to analyze with box plots, and even perform t-tests and ANOVA-tests:

  1. Insurance Data

  2. Political Donations

  3. UFO Encounters. The data dictionary for this dataset is here at the TidyTuesday Website.

(The TidyTuesday Website is a treasure trove of interesting datasets!)

  4. GPT-based Language detectors are biased against non-native English writers.

What story can you tell, and what can you deduce, from Figure 9? How would you replicate it? What would you add?

Figure 9: AI Detectors

Wait, But Why?

  • Box plots are a powerful statistical graphic that gives us a combined view of data ranges, quartiles, medians, and outliers.
  • Box plots can compare groups within our Quant variable, based on levels of a Qual variable. This is a very common and important task in research! In your design research, you would have numerical Quant data that is accompanied by categorical Qual data pertaining to your target audience. Analyzing for differences in the Quant across levels of the Qual (e.g., household expenditure across groups of people) is a vital step in justifying time, effort, and money for further actions in your project. Don’t faff this.
  • They are ideal for visualizing statistical tests for difference in mean values across groups (t-test and ANOVA).

References

  1. Bevans, R. (2023, June 22). An Introduction to t Tests | Definitions, Formula and Examples. Scribbr. https://www.scribbr.com/statistics/t-test/

  2. Brown, A. (2008). The Strange Origins of the t-test. Physiology News, No. 71, Summer 2008. https://static.physoc.org/app/uploads/2019/03/22194755/71-a.pdf

  3. Ziliak, S. T. (2008). Guinnessometrics: The Economic Foundation of “Student’s” t. Journal of Economic Perspectives, Volume 22, Number 4, Fall 2008, Pages 199-216. https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.22.4.199

Footnotes

  1. The term throwing a shade can be found in Jane Austen’s novel Mansfield Park (1814). Young Edmund Bertram is displeased with a dinner guest’s disparagement of the uncle who took her in: “With such warm feelings and lively spirits it must be difficult to do justice to her affection for Mrs. Crawford, without throwing a shade on the Admiral.”

  2. “Ah, Misha, he has a stormy spirit. His mind is in bondage. He is haunted by a great, unsolved doubt. He is one of those who don’t want millions, but an answer to their questions.” ― Fyodor Dostoevsky, The Brothers Karamazov: A Novel in Four Parts With Epilogue