📊 Bar Plots: Plotting Counts

How much data does a man need?

Qual Variables
Bar Charts
Column Charts
Author

Arvind V.

Published

June 23, 2024

Modified

June 26, 2024

Abstract
Quant and Qual Variable Graphs and their Siblings

Slides and Tutorials

R (Static Viz)   Radiant Tutorial  Datasets

Setting up R Packages
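
The setup code itself is not shown on this page. A minimal sketch, assuming the three packages listed under R Package Citations (tidyverse, mosaic, ggformula); skimr is an extra assumption here, included only because skim() is mentioned later:

```r
# Packages cited at the end of this module
library(tidyverse)  # dplyr, readr, ggplot2, and the %>% pipe
library(mosaic)     # formula-based statistics; provides inspect()
library(ggformula)  # formula interface to ggplot2 (gf_bar, gf_col, ...)

# Assumption: skimr is not in the citation list; it is loaded for skim()
library(skimr)
```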

What graphs will we see today?

| Variable #1 | Variable #2 | Chart Names | Chart Shape |
|---|---|---|---|
| Qual | None | Bar Chart | |

What kind of Data Variables will we choose?

| No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
|---|---|---|---|---|---|
| 3 | How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best) | Qualitative/Ordinal | Socioeconomic status (Low income, Middle income, High income), Education level (High School, BS, MS, PhD), Satisfaction rating (Very Much Dislike, Dislike, Neutral, Like, Very Much Like) | Median, Percentile |

Inspiration

Figure 1: Capital Cities

How much does the (financial) capital of a country contribute to its GDP? Which would be India’s city? What would be the reduction in percentage? And these Germans are crazy. (Toc, toc, toc, toc!)

Note how the axis variable that defines the bar locations is a …Qual variable!

Graphing Packages in R

There are several Data Visualization packages, even systems, within R.

Each system has its benefits and learning complexities. We will look at plots created using the simpler and more intuitive ggformula system, which uses the popular ggplot framework but provides a simplified interface that is easy to recall and apply. While our first option will be to use ggformula, we will, where appropriate, state ggplot code too for comparison.

A quick reminder on how mosaic, ggformula, and ggplot work in a very similar fashion:

mosaic and ggformula command template

Note the standard method for all commands from the mosaic and ggformula packages: goal( y ~ x | z, data = _____)

With mosaic, one can create a statistical correlation test between two variables as: cor_test(y ~ x, data = ______ )

With ggformula, one can create any graph/chart using: gf_***(y ~ x | z, data = _____)

In practice, we often use: dataframe %>% gf_***(y ~ x | z), which has cool benefits such as “autocompletion” of variable names, as we shall see. The “***” indicates what kind of graph you desire: histogram, bar, scatter, density; the “___” is the name of the dataset that you want to plot.

ggplot command template

The ggplot2 template is used to identify the dataframe, identify the x and y axis, and define visualized layers:

ggplot(data = ---, mapping = aes(x = ---, y = ---)) + geom_----()

Note: --- is meant to imply text you supply, e.g. function names, data frame names, variable names.

It is helpful to see the argument mapping, above. In practice, rather than typing the formal arguments, code is typically shorthanded to this:

dataframe %>% ggplot(aes(xvar, yvar)) + geom_----()
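
To make the two templates concrete, here is a small sketch that draws the same bar chart both ways. The `mpg` dataset (shipped with ggplot2) and its Qual variable `class` are choices made here purely for illustration; they are not part of this module's data.

```r
library(ggplot2)
library(ggformula)

# ggformula template: gf_***(y ~ x | z, data = ___).
# A bar chart needs only the x variable, so the formula is one-sided.
p_gf <- gf_bar(~ class, data = ggplot2::mpg)

# ggplot2 template: ggplot(data, mapping = aes(...)) + geom_***()
p_gg <- ggplot(data = ggplot2::mpg, mapping = aes(x = class)) +
  geom_bar()

# Both charts are built from the same internally computed counts
counts_gf <- layer_data(p_gf)$count
counts_gg <- layer_data(p_gg)$count
counts_gf

# Display the two charts
p_gf
p_gg
```

The formula `~ class` and the mapping `aes(x = class)` say the same thing: put the levels of `class` on the X-axis and let the chart do the counting.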

Bar Charts and Histograms

Bar Charts show counts of observations with respect to a Qualitative variable. For instance, a shop inventory with shirt-sizes. Each bar has a height proportional to the count per shirt-size, in this example.

Although Histograms may look similar to Bar Charts, the two are different. Histograms show the distribution of continuous Quant data. By contrast, bar charts show counts of categorical data, such as shirt-sizes, or apples, bananas, carrots, etc. Visually speaking, histograms do not usually show spaces between bars because the underlying values are continuous, while bar charts show spaces between bars to separate each category.

How do Bar Chart(s) Work?

Bars are used to show “counts” and “tallies” with respect to Qual variables: they answer the question How Many? For instance, in a survey, how many people are there of each Gender? In a Target Audience survey on Weekly Consumption, how many people have low, medium, or high expenditure?

Each Qual variable potentially has many levels, as we saw in the Nature of Data. For instance, in the above example on Weekly Expenditure, low, medium and high were levels for the Qual variable Expenditure. Bar charts perform internal counts for each level of the Qual variable under consideration. The Bar Plot is then a set of disjoint bars representing these counts; compare the icon above with that for histograms! The X-axis is the set of levels in the Qual variable, and the Y-axis represents the counts for each level.
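
The internal counting can be sketched with the shirt-size example. The inventory below is hypothetical toy data, made up only to show the mechanics: one row per shirt, a Qual variable `size` with four levels, and the per-level tally that the bar chart draws.

```r
library(dplyr)
library(ggformula)

# Hypothetical shop inventory: one row per shirt in stock
shirts <- tibble(
  size = factor(
    c("S", "M", "M", "L", "L", "L", "XL", "M", "S", "L"),
    levels = c("S", "M", "L", "XL")  # the levels of the Qual variable
  )
)

# The tally that a bar chart computes internally: one count per level
size_counts <- shirts %>% dplyr::count(size)
size_counts

# The bar chart: levels on the X-axis, counts on the Y-axis
shirts %>%
  gf_bar(~ size) %>%
  gf_labs(title = "Shirts in stock, by size", x = "Size", y = "Count")
```

Declaring `size` as a factor with explicit levels fixes the left-to-right order of the bars; otherwise the levels would be sorted alphabetically.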

Case Study-1: Chicago Taxi Rides dataset

We will first look at a dataset that speaks about taxi rides in Chicago in the year 2022. This is available on Vincent Arel-Bundock’s superb repository of datasets. Let us read it into R directly from the website.
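
The reading step is not shown on this page. A minimal sketch follows; the URL is an assumption about where the `taxi` data (originally from the modeldata package) sits in the Rdatasets collection, so adjust it if the path differs.

```r
library(readr)

# Assumed location of the taxi CSV in the Rdatasets repository
taxi_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/taxi.csv"

# Read straight from the web into a tibble
taxi <- read_csv(taxi_url, show_col_types = FALSE)
taxi
```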

Examine the Data

As per our Workflow, we will look at the data using all three methods we have seen.
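
The three methods are presumably glimpse, skim, and inspect, as named in the Your Turn section. A sketch under that assumption, re-reading the data from the assumed Rdatasets URL so that the block runs on its own:

```r
library(dplyr)
library(readr)

# Assumed location of the taxi CSV in the Rdatasets repository
taxi_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/taxi.csv"
taxi <- read_csv(taxi_url, show_col_types = FALSE)

# 1. Column names, types, and the first few values
dplyr::glimpse(taxi)

# 2. Per-variable summaries, including missing-value counts
skimr::skim(taxi)

# 3. Separate summaries for categorical and quantitative variables
mosaic::inspect(taxi)
```

The functions are called with their package prefix so that it is clear which package each method comes from.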

Business Insights on Examining the taxi dataset
  • This is a large dataset, with 10K rows and 8 columns/variables.
  • There are several Qualitative variables: tip(2), company(7), local(2), dow(7), and month(4). The number of levels for each is shown in parentheses.
  • Note that hour, despite being a discrete numerical variable, can be treated as a Categorical variable too.
  • distance is Quantitative.
  • There are no missing values for any variable, all are complete with 10K entries.

Hypothesis and Research Questions
  • The target variable for an experiment that resulted in this data might be the tip variable, which is a binary (Yes/No) Qual variable.
  • Research Questions:
    • Do more people tip than not?
    • Does a tip depend upon whether the trip is local or not?
    • Do some cab companies get more tips than others?
    • Does a tip depend upon the distance, hour of day, dow, and month?

Try and think of more Questions!

Plotting Barcharts

Let’s plot some bar graphs: recall that for bar charts, we need to choose Qual variables to count with! In each case, we will state a Hypothesis/Question and try to answer it with a chart.

Data Munging

We will keep the target variable tip in mind at all times. We will also convert the dow and month variables into ordered factors beforehand.

## Convert `dow` and `month` into ordered factors
## (assumes the tidyverse is loaded and `taxi` already holds the data)
taxi <- taxi %>%
  mutate(
    ## Days of the week, Monday first
    dow = factor(
      dow,
      levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
      ordered = TRUE
    ),
    ## Only these four months appear in the data
    month = factor(
      month,
      levels = c("Jan", "Feb", "Mar", "Apr"),
      ordered = TRUE
    )
  )

Question-1: Do more people tip than not?
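
The chart for this question is not reproduced on this page. A sketch of how it might be drawn, assuming the Rdatasets URL below is the right location for the data:

```r
library(dplyr)
library(readr)
library(ggformula)

# Assumed location of the taxi CSV in the Rdatasets repository
taxi_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/taxi.csv"
taxi <- read_csv(taxi_url, show_col_types = FALSE)

# The counts behind the chart: one row per level of `tip`
tip_counts <- taxi %>% dplyr::count(tip)
tip_counts

# One Qual variable on the X-axis, one bar per level
taxi %>%
  gf_bar(~ tip) %>%
  gf_labs(title = "Counts of Tips", x = "Tip given?", y = "Count")
```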

Business Insights-1

  • Far more people tip than not.
  • (Future) The counts of tip are very imbalanced; if we are to set up a model for it (e.g. logistic regression), we would need to subset the data very carefully for training and testing our model.

Question-2: Does the tip depend upon whether the trip is local or not?
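
The charts for this question are not reproduced on this page. The sketch below tries the three bar positions in turn, with `local` on the X-axis and `tip` as the fill colour; the data URL is an assumption, as are the chart titles.

```r
library(dplyr)
library(readr)
library(ggformula)

# Assumed location of the taxi CSV in the Rdatasets repository
taxi_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/taxi.csv"
taxi <- read_csv(taxi_url, show_col_types = FALSE)

# Grouped counts, bars side by side
taxi %>%
  gf_bar(~ local, fill = ~ tip, position = "dodge") %>%
  gf_labs(title = "Dodged bars: counts of tip by local")

# Grouped counts, bars stacked on top of each other
taxi %>%
  gf_bar(~ local, fill = ~ tip, position = "stack") %>%
  gf_labs(title = "Stacked bars: counts of tip by local")

# Every bar stretched to the same height: per-group proportions
taxi %>%
  gf_bar(~ local, fill = ~ tip, position = "fill") %>%
  gf_labs(title = "Filled bars: proportion of tip by local", y = "Proportion")

# The per-group proportions that the filled chart displays
tip_by_local <- taxi %>%
  dplyr::count(local, tip) %>%
  group_by(local) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
tip_by_local
```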

Business Insights-2

  • Counting the frequency of tip by local gives us grouped counts, but we cannot tell the percentage per group (local or not) of those who tip and those who do not.
  • We need per-group percentages because the numbers of local and non-local trips are not balanced.
  • Hence we tried bar charts with position = stack, but finally it is position = fill that works best.
  • We see that the percentage of tippers is somewhat higher with people who make non-local trips. Not surprising.

Question-3: Do some cab company-ies get more tips than others?
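
The charts for this question are not reproduced on this page. A possible sketch, again assuming the data URL; the axes are flipped only so that the long company names stay readable, which is a presentation choice made here.

```r
library(dplyr)
library(readr)
library(ggformula)

# Assumed location of the taxi CSV in the Rdatasets repository
taxi_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/taxi.csv"
taxi <- read_csv(taxi_url, show_col_types = FALSE)

# Side-by-side counts of tip for each company
taxi %>%
  gf_bar(~ company, fill = ~ tip, position = "dodge") %>%
  gf_refine(coord_flip()) %>%
  gf_labs(title = "Tip counts per company")

# Per-company proportions of tip
taxi %>%
  gf_bar(~ company, fill = ~ tip, position = "fill") %>%
  gf_refine(coord_flip()) %>%
  gf_labs(title = "Tip proportions per company", y = "Proportion")

# The same proportions as a table, for reading off exact values
tip_by_company <- taxi %>%
  dplyr::count(company, tip) %>%
  group_by(company) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
tip_by_company
```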

Business Insights-3

  • Using stack, dodge, and fill in bar plots gives us different ways of looking at the sets of counts.
  • fill gives us a per-group proportion of another Qual variable for a chosen Qual variable. This chart view is useful in Inference for Proportions.
  • Most cab companies have similar usage, if you neglect the other category of company.
  • It does seem that, of all the companies, tips are not so good for the Flash Cab company. A driver issue? Or are the cars too old? Or don’t they offer service everywhere?

Question-4: Does a tip depend upon the distance, hour of day, and dow and month?
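
The charts for this question are not reproduced on this page. The sketch below covers hour, dow, and month (distance is Quantitative and is left out of these bar charts); the data URL is an assumption, and the factor conversion repeats the Data Munging step so that the block runs on its own.

```r
library(dplyr)
library(readr)
library(ggformula)

# Assumed location of the taxi CSV in the Rdatasets repository
taxi_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/taxi.csv"

taxi <- read_csv(taxi_url, show_col_types = FALSE) %>%
  mutate(
    dow = factor(
      dow,
      levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
      ordered = TRUE
    ),
    month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr"), ordered = TRUE)
  )

# Tips vs hour of day: one pair of bars per hour
taxi %>%
  gf_bar(~ hour, fill = ~ tip, position = "dodge") %>%
  gf_labs(title = "Counts of tips by hour of day")

# Tips vs day of week
taxi %>%
  gf_bar(~ dow, fill = ~ tip, position = "dodge") %>%
  gf_labs(title = "Counts of tips by day of week")

# Tips vs month
taxi %>%
  gf_bar(~ month, fill = ~ tip, position = "dodge") %>%
  gf_labs(title = "Counts of tips by month")

# Tips vs day of week, with one facet (panel) per month
taxi %>%
  gf_bar(~ dow | month, fill = ~ tip, position = "dodge") %>%
  gf_labs(title = "Counts of tips by day of week and month")
```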

Business Insights-4

  • Note: We were using fill = ~ tip here! Why is that a good idea?
  • tips vs hour: There are always more people who tip than those who do not. Of course there are fewer trips during the early morning hours and the late night hours, based on the very small bar-pairs we see at those times.
  • tips vs dow: Except for Sunday, the tip count patterns (Yes/No) look similar across all days.
  • tips vs month: We have data for 4 months only. Again, the tip count patterns (Yes/No) look similar across all months. Perhaps slightly fewer trips in Jan, when it is cold in Chicago and people may not go out much.
  • tips vs dow vs month: Very similar counts for tips (Yes/No) across day-of-week and month.

Bar Plot Extras

gf_bar and gf_col

Note also that gf_bar/geom_bar takes only ONE variable (for the x-axis), whereas gf_col/geom_col needs both X and Y variables since it simply plots columns. Both are useful!
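
A small sketch of the difference, using the `mpg` dataset from ggplot2 and its Qual variable `drv`; both are choices made here for illustration only. gf_bar gets the raw rows and counts them itself, while gf_col is handed a table that has already been counted.

```r
library(dplyr)
library(ggformula)

# Raw data, one row per car: gf_bar needs only the x variable
p_bar <- ggplot2::mpg %>% gf_bar(~ drv)

# Pre-counted data, one row per level with the count in column `n`
drv_counts <- ggplot2::mpg %>% dplyr::count(drv)
drv_counts

# gf_col needs both y and x: it plots the numbers it is given
p_col <- drv_counts %>% gf_col(n ~ drv)

# The bar heights of the two charts agree
bar_heights <- ggplot2::layer_data(p_bar)$count
bar_heights

p_bar
p_col
```

gf_col is the natural choice when the data arrives already summarised, for example as a table of counts or totals from a report.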

And we can plot Proportions and Percentages too!

Also check out gf_props and gf_percents! These are also very useful ggformula functions.

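## Assumes ggformula and the tidyverse are already loaded
## Set graph theme. theme_custom() is a user-defined theme that is not
## shown on this page; any ggplot2 theme, e.g. theme_minimal(), works too.
theme_set(new = theme_custom())

## Proportions of each substance, with side-by-side bars for each sex
gf_props(~ substance,
  data = mosaicData::HELPrct, fill = ~ sex,
  position = "dodge") %>%
  gf_labs(title = "Plotting Proportions using gf_props")

## The same chart, with the Y-axis in percentages
gf_percents(~ substance,
  data = mosaicData::HELPrct, fill = ~ sex,
  position = "dodge") %>%
  gf_labs(title = "Plotting Percentages using gf_percents")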

Conclusion

  • Qualitative data variables can be plotted as counts, using Bar Charts.
  • gf_col and gf_bar provide Bar charts; gf_bar performs counts internally, whereas gf_col requires pre-counted data.
  • Using facets allows us to view counts of one Qual variable split over two other Qual variables.

Your Turn

Datasets

  1. Click on the Dataset Icon above, and unzip that archive. Try to make Bar plots with each of them, using one or more Qual variables.
  2. A dataset from calmcode.io: https://calmcode.io/datasets.html

glimpse / skim / inspect the dataset in each case and develop a set of Questions that can be answered by appropriate stat measures, or by using a chart to show the distribution.

References

R Package Citations

| Package | Version | Citation |
|---|---|---|
| ggformula | 0.12.0 | Kaplan and Pruim (2023) |
| mosaic | 1.9.1 | Pruim, Kaplan, and Horton (2017) |
| tidyverse | 2.0.0 | Wickham et al. (2019) |

Kaplan, Daniel, and Randall Pruim. 2023. ggformula: Formula Interface to the Grammar of Graphics. https://CRAN.R-project.org/package=ggformula.
Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to ‘Think with Data’ Using R.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Citation

BibTeX citation:
@online{v.2024,
  author = {V., Arvind},
  title = {📊 {Bar} {Plots:} {Plotting} {Counts}},
  date = {2024-06-23},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/20-BarPlots/},
  langid = {en},
  abstract = {Quant and Qual Variable Graphs and their Siblings}
}
For attribution, please cite this work as:
V., Arvind. 2024. “📊 Bar Plots: Plotting Counts.” June 23, 2024. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/20-BarPlots/.