📊 Boxplots: Plotting Groups

Plotting Distributions over Categories

Qual Variables
Quant Variables
Box Plots
Violin Plots
Author

Arvind V.

Published

June 24, 2024

Modified

June 26, 2024

Abstract
Quant and Qual Variable Graphs and their Siblings

Slides and Tutorials

R (Static Viz)   Radiant Tutorial  Datasets

Setting up R Packages

options(paged.print = TRUE)
library(tidyverse)
library(mosaic)
library(ggformula)
library(palmerpenguins) # Our new favourite dataset

What graphs will we see today?

| Variable #1 | Variable #2 | Chart Names | Chart Shape |
|-------------|-------------|-------------|-------------|
| Quant       | Qual        | Box Plot    |             |

What kind of Data Variables will we choose?

| No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
|----|---------|--------|----------------|---------|------------------|
| 1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios / Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
| 4 | What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative/Nominal | Name | Count no. of cases, Mode |

Inspiration

Figure 1: Box Plot Inspiration

Alice said, “I say what I mean and I mean what I say!” Are the rest of us so sure? What do we mean when we use any of the phrases above? How definite are we? There is a range of “sureness” and “unsureness”… and this is where we can use box plots like Figure 1 to show that range of opinion.

How do these Chart(s) Work?

Box plots are an extremely useful data visualization: they give us an idea of the distribution of a Quant variable for each level of a Qual variable. Internally, the plot is built as follows:

  • make groups of the Quant variable for each level of the Qual
  • in each group, rank the Quant variable values in increasing order
  • calculate the median, the quartiles (and hence the IQR), and any outliers
  • plot these as a vertical or horizontal box structure
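The steps above can be sketched in base R. This is only a minimal illustration on made-up numbers, not the exact code that any plotting package runs. The 1.5 × IQR rule for flagging outliers is the usual Tukey convention and is an assumption here, and `box_summary` is a hypothetical helper name.

```r
# Toy data, invented purely for illustration (not one of the course datasets)
scores <- data.frame(
  group = rep(c("A", "B"), each = 7),
  value = c(2, 4, 5, 6, 7, 9, 30,   # group A has one unusually large value
            1, 3, 4, 5, 6, 8, 10)
)

# Summarise one group the way a box plot does
box_summary <- function(x) {
  x <- sort(x)                                           # rank the values
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)    # Q1, median, Q3
  iqr <- q[3] - q[1]                                     # height of the box
  lower_fence <- q[1] - 1.5 * iqr                        # assumed Tukey rule
  upper_fence <- q[3] + 1.5 * iqr
  list(
    q1 = q[1], median = q[2], q3 = q[3], iqr = iqr,
    outliers = x[x < lower_fence | x > upper_fence]      # points beyond the fences
  )
}

# Step 1: make groups of the Quant variable, one per level of the Qual variable
summaries <- lapply(split(scores$value, scores$group), box_summary)
str(summaries)
```

Each element of `summaries` holds the numbers needed to draw one box: the median line, the two box edges, and the points that would be drawn individually as outliers.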

The box can also be drawn as an asymmetric “half plot” if needed…

Histograms and Box Plots

Note how the histogram dwells upon the mean and standard deviation, whereas the box plot focuses on the median and quartiles. The former uses the values of the Quant variable, whereas the latter uses their sequence numbers, or ranks.
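A small sketch may make the contrast between values and ranks concrete. The numbers below are invented for illustration. Replacing the largest value with an extreme one changes the mean and standard deviation a great deal, but leaves the median and IQR untouched, because the ordering of the middle observations is unchanged.

```r
x     <- c(2, 4, 5, 6, 7, 9, 10)     # made-up values
x_out <- c(2, 4, 5, 6, 7, 9, 1000)   # same values, but the largest is now extreme

compare <- data.frame(
  data   = c("x", "x_out"),
  mean   = c(mean(x), mean(x_out)),       # value-based: moves a lot
  sd     = c(sd(x), sd(x_out)),           # value-based: moves a lot
  median = c(median(x), median(x_out)),   # rank-based: unchanged
  IQR    = c(IQR(x), IQR(x_out))          # rank-based: unchanged
)
print(compare)
```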

Box plots are often used, for example, in HR operations to understand salary distributions across grades of employees. Marks of students in competitive exams are also declared using quartiles.

Figure 2: Box Plot Definitions

Box plots can show skew in distributions. In such cases the “bottom” and the “lid” of the box may not be the same size!

(a) Box Plot and Skewness
(b) Density and Skewness
Figure 3: Box Plot Discussions

In Figure 3 (a), we see the difference between box plots that show symmetric and skewed distributions. The “lid” and the “bottom” of the box are not of similar width in distributions with significant skewness.

Compare these with the corresponding Figure 3 (b).
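The same comparison can be made numerically. The sketch below uses simulated data only, and assumes that “bottom” means the part of the box from the first quartile to the median and “lid” the part from the median to the third quartile; `box_halves` is a hypothetical helper name.

```r
set.seed(2024)                  # simulated data, for illustration only
symmetric <- rnorm(5000)        # bell-shaped, no skew
skewed    <- rexp(5000)         # long right tail

# Width of the two halves of the box
box_halves <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  c(bottom = q[2] - q[1],       # Q1 to median
    lid    = q[3] - q[2])       # median to Q3
}

halves <- rbind(
  symmetric = box_halves(symmetric),
  skewed    = box_halves(skewed)
)
print(round(halves, 2))
```

For the symmetric sample the two halves come out nearly equal, while for the right-skewed sample the upper half is clearly wider.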

Case Study-1: gss_wages dataset

We will first look at wage data from the General Social Survey (1974-2018) conducted in the USA, which is used to illustrate wage discrepancies by gender (while also considering respondent occupation, age, and education). This is available on Vincent Arel-Bundock’s superb repository of datasets. Let us read it into R directly from the website.
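A minimal sketch of the download step follows. The URL is an assumption: it is our best guess at where this file sits in the Rdatasets repository and is not given in the text, so please check it against the dataset documentation page. `read_wages` is a hypothetical helper name.

```r
# ASSUMPTION: guessed location of the file in the Rdatasets repository
wages_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/stevedata/gss_wages.csv"

# Hypothetical helper: reads the CSV into a data frame
read_wages <- function(url = wages_url) {
  utils::read.csv(url, stringsAsFactors = FALSE)
}

# Uncomment to download (needs an internet connection):
# wages <- read_wages()
# dim(wages)
```

Files in that repository often carry an extra row-index column in front of the variables, so the column count may be one more than the data dictionary suggests.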

Examine the Data

As per our Workflow, we will look at the data using all the three methods we have seen.

Data Dictionary for the wages dataset

From the dataset documentation page, we note that:

  • This is a large dataset (61K rows), with 11 variables:
  • year(dbl): the survey year
  • realrinc(dbl): the respondent’s base income (in constant 1986 USD)
  • age(dbl): the respondent’s age in years
  • occ10(dbl): respondent’s occupation code (2010)
  • occrecode(chr): recode of the occupation code into one of 11 main categories
  • prestg10(dbl): respondent’s occupational prestige score (2010)
  • childs(dbl): number of children (0-8)
  • wrkstat(chr): the work status of the respondent (full-time, part-time, temporarily not working, unemployed (laid off), retired, school, housekeeper, other). 8 levels.
  • gender(chr): respondent’s gender (male or female). 2 levels.
  • educcat(chr): respondent’s degree level (Less Than High School, High School, Junior College, Bachelor, or Graduate). 5 levels.
  • maritalcat(chr): respondent’s marital status (Married, Widowed, Divorced, Separated, Never Married). 5 levels.

Business Insights based on wages dataset

  • Fair amount of missing data; however with 61K rows, we can for the present simply neglect the missing data.
  • Good mix of Qual and Quant variables

Hypothesis and Research Questions

  • The target variable for an experiment that resulted in this data might be the realrinc variable, the resultant income of the individual, which is a numerical variable.
  • Research Questions:
    • What is the basic distribution of realrinc?
    • Is realrinc affected by gender?
    • By educcat? By maritalcat?
    • Is realrinc affected by childs?
    • Do combinations of these factors have an effect on the target variable?

These should do for now! But we should come up with more questions once we have seen some plots!

Data Munging

Since there is so much missing data in the target variable realrinc, and there is still enough data left over, let us remove the rows containing missing data in that variable.

Important

NOTE: This is not advised at all as a general procedure!! Data is valuable and there are better ways to manage this problem!

wages_clean <- 
  wages %>% 
  tidyr::drop_na(realrinc) # choose column or leave blank to choose all

Plotting Box Plots

Question-1: What is the basic distribution of realrinc?
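One way to draw a single box plot for the whole variable is sketched below. To keep the example runnable without a download, it uses a simulated stand-in data frame, `toy_wages`, whose numbers are invented; only the column name `realrinc` mirrors the dataset. The constant column `all` is a hypothetical device that gives the formula something to put on the x-axis. With the real data, `wages_clean` would take the place of `toy_wages`.

```r
library(ggformula)

# Simulated stand-in data: right-skewed, like incomes tend to be (invented numbers)
set.seed(1)
toy_wages <- data.frame(
  all      = "All respondents",                  # dummy grouping column
  realrinc = rlnorm(500, meanlog = 10, sdlog = 1)
)

# One box for the whole variable
p1 <- gf_boxplot(realrinc ~ all, data = toy_wages) |>
  gf_labs(
    title = "Distribution of realrinc (simulated stand-in data)",
    x = NULL, y = "realrinc"
  )
p1
```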

Business Insights-1

  • Income is a very skewed distribution, as might be expected.
  • Presence of many higher-side outliers is noted.

Question-2: Is realrinc affected by gender?
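A sketch of the grouped box plot, with and without a log10 scale on the income axis, is given below. As before, `toy_wages` is a simulated stand-in with invented numbers so that the example runs on its own; with the real data, `wages_clean` would be used instead. The object names `p_raw` and `p_log` are illustrative.

```r
library(ggformula)

# Simulated stand-in data (invented numbers; same distribution for both groups)
set.seed(2)
toy_wages <- data.frame(
  gender   = rep(c("Female", "Male"), each = 250),
  realrinc = rlnorm(500, meanlog = 10, sdlog = 1)
)

# One box per level of the Qual variable
p_raw <- gf_boxplot(realrinc ~ gender, data = toy_wages)

# Same plot on a log10 scale, which spreads out the low-income region
p_log <- gf_boxplot(realrinc ~ gender, data = toy_wages) |>
  gf_refine(ggplot2::scale_y_log10()) |>
  gf_labs(x = "gender", y = "realrinc (log10 scale)")

p_raw
p_log
```

On the raw scale the boxes are squashed near zero by the long upper tail; the log10 scale makes the lower part of the distribution readable.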

Business Insights-2

  • Even when split by gender, realrinc presents a skewed set of distributions.
  • The IQR for males is smaller than the IQR for females. There is less variation in the middle ranges of realrinc for men.
  • log10 transformation helps to view and understand the regions of low realrinc.
  • There are outliers on both sides, indicating that in both genders there may be many people who earn very little and many who earn a great deal.

Question-3: Is realrinc affected by educcat?
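When the Qual variable has a natural order, as educcat does, it helps to store it as a factor with explicit levels so that the boxes appear in that order rather than alphabetically. The sketch below shows this on a simulated stand-in data frame (invented numbers); the level names are taken from the data dictionary above, and `wages_clean` would replace `toy_wages` with the real data.

```r
library(ggformula)

# Education levels in their natural order (from the data dictionary)
edu_levels <- c("Less Than High School", "High School",
                "Junior College", "Bachelor", "Graduate")

# Simulated stand-in data (invented numbers)
set.seed(3)
toy_wages <- data.frame(
  educcat  = factor(rep(edu_levels, each = 100), levels = edu_levels),
  realrinc = rlnorm(500, meanlog = 10, sdlog = 1)
)

# Boxes follow the factor level order; flipped so the long labels stay readable
p3 <- gf_boxplot(realrinc ~ educcat, data = toy_wages) |>
  gf_refine(ggplot2::scale_y_log10(), ggplot2::coord_flip()) |>
  gf_labs(x = "educcat", y = "realrinc (log10 scale)")
p3
```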

Business Insights-3

  • realrinc rises with educcat, which is to be expected.
  • However, there are people with very low and very high income in all categories of educcat.
  • Hence educcat alone may not be a good predictor for realrinc.

We can do similar work with the other Qual variables. Let us now see how we can use more than one Qual variable and answer the last hypothesis, Question 4.

Question-4: Is the target variable realrinc affected by combinations of Qual factors gender, educcat, maritalcat and childs?

Important

This is a rather complex question and could take us deep into Modelling. If we were to do this manually, we ideally ought to:
  • take each Qual variable and explain its effect on the target variable
  • remove that effect and model the remainder (i.e. the residual) with the next Qual variable
  • proceed in this way until we have a good model.
There are more modern Modelling Workflows that can do things much faster and without such manual tweaking.

So we will simply plot box plots showing effects on the target variable of combinations of Qual variables taken two at a time. (We will of course use facetted box plots!)

We will also drop NA values all around this time, to avoid seeing boxplots for undocumented categories.
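A sketch of one such facetted plot follows: income against number of children, with one panel per gender. It again runs on a simulated stand-in data frame with invented numbers and a few deliberately missing values, so the effect of dropping NAs is visible; `toy_wages`, `toy_complete` and `p4` are illustrative names, and the real analysis would start from `wages`.

```r
library(ggformula)

# Simulated stand-in data with some missing values (invented numbers)
set.seed(4)
n <- 600
toy_wages <- data.frame(
  gender   = sample(c("Female", "Male", NA), n, replace = TRUE,
                    prob = c(0.48, 0.48, 0.04)),
  childs   = sample(c(0:4, NA), n, replace = TRUE),
  realrinc = rlnorm(n, meanlog = 10, sdlog = 1)
)

# Drop rows with an NA in any column, so no "NA" box or panel appears
toy_complete <- tidyr::drop_na(toy_wages)

# Treat the number of children as a Qual variable for plotting
toy_complete$childs <- factor(toy_complete$childs)

# One panel per gender, one box per number of children
p4 <- gf_boxplot(realrinc ~ childs, data = toy_complete) |>
  gf_facet_wrap(~ gender) |>
  gf_refine(ggplot2::scale_y_log10()) |>
  gf_labs(x = "childs", y = "realrinc (log10 scale)")
p4
```

Swapping `childs` and `gender` for another pair of Qual variables, such as `educcat` and `childs`, gives the other two-at-a-time combinations.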

Question-4: Is realrinc affected by combinations of factors?

Business Insights-4

  • From Figure 8 (a), we see that realrinc increases with educcat, across (almost) all family sizes (childs).
  • However, this trend breaks a little when the family size (childs) is large, say >= 7. Be aware that the observations for such large families may be sparse, so this inference may not necessarily be valid.
  • From Figure 8 (b), we see that the effect of childs on realrinc is different for each gender! For females, the income steadily drops with the number of children, whereas for males it actually increases up to a certain family size before decreasing again.

Conclusion

  • Box Plots “dwell upon” medians and Quartiles
  • Box Plots can show distributions of a Quant variable over levels of a Qual variable
  • This allows a comparison of box plots side by side to visibly detect differences in median and IQR across such levels

Your Turn

Datasets

  1. Click on the Dataset Icon above, and unzip that archive. Try to make distribution plots with each of the three tools.
  2. A dataset from calmcode.io https://calmcode.io/datasets.html
  3. Old Faithful Data in R (Find it!)

Inspect the dataset in each case and develop a set of Questions that can be answered by appropriate stat measures, or by using a chart to show the distribution.

References

  1. See the scrolly animation for a histogram at this website: Exploring Histograms, an essay by Aran Lunzer and Amelia McNamara https://tinlizzie.org/histograms/?s=09
  2. Minimal R using mosaic. https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalRgg.pdf
  3. Sebastian Sauer, Plotting multiple plots using purrr::map and ggplot

R Package Citations

| Package       | Version | Citation        |
|---------------|---------|-----------------|
| ggridges      | 0.5.6   | Wilke (2024)    |
| NHANES        | 2.1.0   | Pruim (2015)    |
| TeachHist     | 0.2.1   | Lange (2023)    |
| TeachingDemos | 2.13    | Snow (2024)     |
| visualize     | 4.5.0   | Balamuta (2023) |

Balamuta, James. 2023. visualize: Graph Probability Distributions with User Supplied Parameters and Statistics. https://CRAN.R-project.org/package=visualize.
Lange, Carsten. 2023. TeachHist: A Collection of Amended Histograms Designed for Teaching Statistics. https://CRAN.R-project.org/package=TeachHist.
Pruim, Randall. 2015. NHANES: Data from the US National Health and Nutrition Examination Study. https://CRAN.R-project.org/package=NHANES.
Snow, Greg. 2024. TeachingDemos: Demonstrations for Teaching and Learning. https://CRAN.R-project.org/package=TeachingDemos.
Wilke, Claus O. 2024. ggridges: Ridgeline Plots in “ggplot2”. https://CRAN.R-project.org/package=ggridges.

Citation

BibTeX citation:
@online{v.2024,
  author = {V., Arvind},
  title = {📊 {Boxplots:} {Plotting} {Groups}},
  date = {2024-06-24},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/24-BoxPlots/},
  langid = {en},
  abstract = {Quant and Qual Variable Graphs and their Siblings}
}
For attribution, please cite this work as:
V., Arvind. 2024. “📊 Boxplots: Plotting Groups.” June 24, 2024. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/24-BoxPlots/.