Summaries

Throwing away data to grasp it

Qual Variables

Quant Variables

Mean

Median

Standard Deviation

Quartiles

Author

Arvind V.

Published

October 15, 2023

Modified

May 20, 2025

Abstract

Bill Gates walked into a bar, and everyone’s salary went up on average.

WebR Status

Installing package 4 out of 6: mosaic

“The most certain sign of wisdom is cheerfulness.”

— Michel de Montaigne, Writer and philosopher

Using web-R

This tutorial uses web-r that allows you to run all code within your browser, on all devices. Most code chunks herein are formatted in a tabbed structure (like in an old-fashioned library) with duplicated code. The tabs in front have regular R code that will work when copy-pasted in your RStudio session. The tab “behind” has the web-R code that can work directly in your browser, and can be modified as well. The R code is also there to make sure you have original code to go back to, when you have made several modifications to the code on the web-r tabs and need to compare your code with the original!

Keyboard Shortcuts

Run selected code using either:
- macOS: ⌘ + ↩︎/Return
- Windows/Linux: Ctrl + ↩︎/Enter
Run the entire code by clicking the “Run code” button or pressing Shift+↩︎.

“The best attitude to adopt is one of compassionate patience, which has to include a tolerance for failure.”

— Gabor Maté

Setting up R Packages

library(tidyverse)
library(mosaic)
library(skimr)

Installing package 4 out of 6: mosaic

How do we Grasp Data?

We spoke of Experiments and Data Gathering in the first module Nature of Data. This helped us to obtain data.

As we discussed in that same Module, for us to grasp the significance of the data, we need to describe it; the actual data is usually too vast for us to comprehend in its entirety. Anything more than a handful of observations in a dataset is enough for us to require other ways of grasping it.

The first thing we need to do, therefore, is to reduce it to a few salient numbers that allow us to summarize the data.

Reduction is Addition

Such a reduction may seem paradoxical but is one of the important tenets of statistics: reduction, while taking away information, ends up adding to insight.

Steven Stigler (2016) is the author of the book “The Seven Pillars of Statistical Wisdom”. One of the Big Ideas in Statistics from that book is: Aggregation

The first pillar I will call Aggregation, although it could just as well be given the nineteenth-century name, “The Combination of Observations,” or even reduced to the simplest example, taking a mean. Those simple names are misleading, in that I refer to an idea that is now old but was truly revolutionary in an earlier day—and it still is so today, whenever it reaches into a new area of application. How is it revolutionary? By stipulating that, given a number of observations, you can actually gain information by throwing information away! In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.

Let us get some inspiration from Brad Pitt, from the movie Moneyball, which is about applying Data Analytics to the game of baseball.

And then, an example from a more sombre story:

US Population: Reading and Numeracy Levels
Year	Below Level #1	Level #1	Level #2	Level #3	Levels #4 and #5
Number in millions (2012/2014)	8.35	26.49	65.10	71.41	26.57
Number in millions (2017)	7.59	29.23	66.07	68.81	26.75
Note:
SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014.

Table 1: US Population: Reading and Numeracy Levels

This ghastly-looking Table 1 examines U.S. adults with low English literacy and numeracy skills—or low-skilled adults—at two points in the 2010s, in the years 2012/20141 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). As can be seen the summary table is quite surprising in absolute terms, for a developed country like the US, and the numbers have increased from 2012/2014 to 2017!

So why do we need to summarise data? Summarization is an act of throwing away data to make more sense, as stated by (Stigler 2016) and also in the movie by Brad Pitt aka Billy Beane. To summarize is to understand. Add to that the fact that our Working Memories can hold maybe 7 items, so it means information retention too.

And if we don’t summarise? Jorge Luis Borges, in a fantasy short story published in 1942, titled “Funes the Memorious,” he described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.” (emphasis mine)

Aggregation can yield great gains above the individual components in data. Funes was big data without Statistics.

What graphs / numbers will we see today?

Variable #1	Variable #2	Chart Names	“Chart Shape”
All	All	Tables and Stat Measures

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

Dataset length: How many rows/observations?
Dataset breadth: How many columns/variables?
How many Quant variables?
How many Qual variables?
Quant variables: min, max, mean, median, sd
Qual variables: levels, counts per level
Both: means, medians for each level of a Qual variable…

What kind of Data Variables will we choose?

No	Pronoun	Answer	Variable/Scale	Example	What Operations?
1	How Many / Much / Heavy? Few? Seldom? Often? When?	Quantities, with Scale and a Zero Value.Differences and Ratios /Products are meaningful.	Quantitative/Ratio	Length,Height,Temperature in Kelvin,Activity,Dose Amount,Reaction Rate,Flow Rate,Concentration,Pulse,Survival Rate	Correlation
2	How Many / Much / Heavy? Few? Seldom? Often? When?	Quantities with Scale. Differences are meaningful, but not products or ratios	Quantitative/Interval	pH,SAT score(200-800),Credit score(300-850),SAT score(200-800),Year of Starting College	Mean,Standard Deviation
3	How, What Kind, What Sort	A Manner / Method, Type or Attribute from a list, with list items in some " order" ( e.g. good, better, improved, best..)	Qualitative/Ordinal	Socioeconomic status (Low income, Middle income, High income),Education level (HighSchool, BS, MS, PhD),Satisfaction rating(Very much Dislike, Dislike, Neutral, Like, Very Much Like)	Median,Percentile
4	What, Who, Where, Whom, Which	Name, Place, Animal, Thing	Qualitative/Nominal	Name	Count no. of cases,Mode

We will obviously choose all variables in the dataset, unless they are unrelated ones such as row number or ID which (we think) may not contribute any information and we can disregard.

How do these Summaries Work?

Quant variables: Inspecting the min, max, mean, median, variance and sd of each of the Quant variables tells us straightaway what the ranges of the variables are, and if there are some outliers, which could be normal, or maybe due to data entry error! Comparing two Quant variables for their ranges also tells us that we may have to $s c a l e / n o r m a l i z e$ them for computational ease, if one variable has large numbers and the other has very small ones.

Qual variables: With Qual variables, we understand the levels within each, and understand the total number of combinations of the levels across these. Counts across levels, and across combinations of levels tells us whether the data has sufficient readings for graphing, inference, and decision-making, of if certain levels/classes of data are under or over represented.

Together?: We can use Quant and Qual together, to develop the above summaries (min, max,mean, median and sd) for Quant variables, again across levels, and across combinations of levels of single or multiple Quals, along with counts if we are interested in that.

For both types of variables, we need to keep an eye open for data entries that are missing! This may point to data gathering errors, which may be fixable. Or we will have to take a decision to let go of that entire observation (i.e. a row). Or even do what is called imputation to fill in values that are based on the other values in the same column, which sounds like we are making up data, but isn’t so really.

And this may also tell us if we are witnessing a Simpson’s Paradox situation. You may have to decide on what to do with this data sparseness, or just check your biases!

Some Quick Definitions

Mean

The sample mean, or average, of a Quantitative data variable can be calculated as the sum of the observed values divided by the number of observations:

$m e a n = \bar{x} = \frac{x_{1} + x_{2} + x_{3} . . . . + x_{n}}{n}$

Variance and Standard Deviation

Observations can be on either side of the mean, naturally. To measure the extent of these differences, we square and sum the differences between individual values and their mean, and take their average to obtain the (sample) variance:

$v a r i a n c e = s^{2} = \frac{(x_{1} - \bar{x})^{2} + (x_{2} - \bar{x})^{2} + (x_{2} - \bar{x})^{2} + . . . (x_{n} - \bar{x})^{2} +}{n - 1}$ The standard deviation $s$ is just the square root of the variance.

(The $n - 1$ is a mathematical nuance to allow for the fact that we have used the data to calculate the mean before we get to $s^{2}$ , and hence have “used up” one degree of randomness in the data. It gets us more robust results.)

Median

When the observations in a Quant variable are placed in order of their maginitude, the observation in tke middle is the median. Half the observations are below, and half are above, the median

Case Study-1

We will first use a dataset mpg that is available in R as part of one of the R packages that we have loaded with the library() command.

Examine the Data

It is usually a good idea to make crisp business-like tables, for the data itself, and the schema as revealed by one of the outputs of the three methods to be presented below. There are many methods to do this; one of the simplest and effective ones is to use the kable set of commands from the knitr and kableExtra packagepackage:

mpg %>%
  head(10) %>%
  kbl(
    # add Human Readable column names
    col.names = c(
      "Manufacturer", "Model", "Engine\nDisplacement",
      "Model\n Year", "Cylinders", "Transmission",
      "Drivetrain", "City\n Mileage", "Highway\n Mileage",
      "Fuel", "Class\nOf\nVehicle"
    ),
    caption = "MPG Dataset"
  ) %>%
  kable_styling(
    bootstrap_options = c(
      "striped", "hover",
      "condensed", "responsive"
    ),
    full_width = F, position = "center"
  )

MPG Dataset
Manufacturer	Model	Engine Displacement	Model Year	Cylinders	Transmission	Drivetrain	City Mileage	Highway Mileage	Fuel	Class Of Vehicle
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact
audi	a4	3.1	2008	6	auto(av)	f	18	27	p	compact
audi	a4 quattro	1.8	1999	4	manual(m5)	4	18	26	p	compact
audi	a4 quattro	1.8	1999	4	auto(l5)	4	16	25	p	compact
audi	a4 quattro	2.0	2008	4	manual(m6)	4	20	28	p	compact

Next we will look at a few favourite statistics or “favstats” that we can derive from data. R is full of packages that can provide very evocative and effective summaries of data. We will first start with the dplyr package from the tidyverse, the skimr package, then the mosaic package. We will look at the summary outputs from these and learn how to interpret them.

The dplyr package offers a convenient command called glimpse:

glimpse(mpg)

Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Note

Very crisp output, giving us the size of the dataset (234 X 11) and the nature of the variable columns, along with their first few entries.
The chr variables are usually Categorical/Qualitative.
The int or dbl (double precision) are usually Numerical/Quantitative.
But be careful! Verify that this is as per your intent, interpret the variables and modify their encoding as needed.

Let us look at mpgusing skimr::skim().

From the output of ?skimr:

The format of the results are a single wide data frame combining the results, with some additional attributes and two metadata columns:

skim_variable: name of the original variable
skim_type: class of the variable

We can use skim(dataset) directly as shown below:

skimr::skim(mpg) # explicitly stating package name

Data summary
Name	mpg
Number of rows	234
Number of columns	11
_______________________
Column type frequency:
character	6
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
manufacturer	1	4	10	15
model	1	2	22	38
trans	1	8	10	10
drv	1	1	1	3
fl	1	1	1	5
class	1	3	10	7

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
displ	1	3.47	1.29	1.6	2.4	3.3	4.6	7	▇▆▆▃▁
year	1	2003.50	4.51	1999.0	1999.0	2003.5	2008.0	2008	▇▁▁▁▇
cyl	1	5.89	1.61	4.0	4.0	6.0	8.0	8	▇▁▇▁▇
cty	1	16.86	4.26	9.0	14.0	17.0	19.0	35	▆▇▃▁▁
hwy	1	23.44	5.95	12.0	18.0	24.0	27.0	44	▅▅▇▁▁

Taken together, we have the following:

Note

A Data Summary: it lists the dimensions of the mpg dataset: 234 rows and 11 columns. 6 columns are character formatted, the remaining 5 are numeric. The dataset is not “grouped” (more on this later).
The second part of the output shows a table with the character variables which are therefore factor variables with levels.
The third part shows a table listing the names and summary stats for the numerical variables. We have mean, sd, all the quantiles (p0, p25, p50(median), p75 and p100 percentiles) and a neat little histogram for each. From the histogram we can see that year is two-valued, cyl is three-valued, and cty and hwy are continuous… Again check that this is as you intend them to be. We may need to modify the encoding if needed.

We get very similar output from mosaic::inspect():

inspect(mpg)


categorical variables:  
          name     class levels   n missing
1 manufacturer character     15 234       0
2        model character     38 234       0
3        trans character     10 234       0
4          drv character      3 234       0
5           fl character      5 234       0
6        class character      7 234       0
                                   distribution
1 dodge (15.8%), toyota (14.5%) ...            
2 caravan 2wd (4.7%) ...                       
3 auto(l4) (35.5%), manual(m5) (24.8%) ...     
4 f (45.3%), 4 (44%), r (10.7%)                
5 r (71.8%), p (22.2%), e (3.4%) ...           
6 suv (26.5%), compact (20.1%) ...             

quantitative variables:  
   name   class    min     Q1 median     Q3  max        mean       sd   n
1 displ numeric    1.6    2.4    3.3    4.6    7    3.471795 1.291959 234
2  year integer 1999.0 1999.0 2003.5 2008.0 2008 2003.500000 4.509646 234
3   cyl integer    4.0    4.0    6.0    8.0    8    5.888889 1.611534 234
4   cty integer    9.0   14.0   17.0   19.0   35   16.858974 4.255946 234
5   hwy integer   12.0   18.0   24.0   27.0   44   23.440171 5.954643 234
  missing
1       0
2       0
3       0
4       0
5       0

Note

We see that the output of mosaic::inspect() is organized as follows:

There are two dataframes/tables in the output, one describing the Qualitative Variables and the other describing the Quantitative Variables.
In the table describing the Qual variables, we have:
- name: Name of the variable in the (parent) dataset. i.e Column Names
- class: format of that column
- levels: All these variables are factors, with levels shown here. Some for example, manufacturer has 15 levels, and there are 234 rows

inspect also conveniently shows how much data is missing and in which variables. This is a very important consideration in the use of the data for analytics purposes.

We can save and see the outputs separately:

mpg_describe <- inspect(mpg)
mpg_describe$categorical
mpg_describe$quantitative

Data Dictionary and Munging

Using skim/inspect/glimpse, we can put together a (brief) data dictionary as follows:

Qualitative Data

model(chr): Car model name
manufacturer(chr): Car maker name
fl(chr): fuel type
drv(chr): type of drive(front, rear, 4W)
class(chr): type of vehicle ( sedan, pickup…)
trans(chr): type of transmission ( auto, manual..)

Quantitative Data

hwy(int): Highway Mileage
cty(int): City Mileage
cyl(int): Number of Cylinders. How do we understand this variable? Should this be Qual?
displ(dbl): Engine piston displacement
year(int): Year of model

We see that there are certain variables that must be converted to factors for analytics purposes, since they are unmistakably Qualitative in nature. Let us do that now, for use later:

mpg_modified <- mpg %>%
  dplyr::mutate(
    cyl = as_factor(cyl),
    fl = as_factor(fl),
    drv = as_factor(drv),
    class = as_factor(class),
    trans = as_factor(trans)
  )
glimpse(mpg_modified)

Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <fct> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <fct> auto(l5), manual(m5), manual(m6), auto(av), auto(l5), man…
$ drv          <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <fct> p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, r, …
$ class        <fct> compact, compact, compact, compact, compact, compact, com…

Case Study-2

Instead of taking a “built-in” dataset , i.e. one that is part of an R package that we can load with library(), let us try the above process with a data set that we obtain from the internet. We will use this superb repository of datasets created by Vincent Arel-Bundock: https://vincentarelbundock.github.io/Rdatasets/articles/data.html

Let us choose a modest-sized dataset, say this dataset on Doctor Visits, which is available online https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv and read it into R.

Reading external data into R

The read_csv() command from R package readr allows us to read both locally saved data on our hard disk, or data available in a shared folder online. Avoid using the read.csv() from base R , though it will show up in your code auto-complete set of options!

# From Vincent Arel-Bundock's dataset website
# https://vincentarelbundock.github.io/Rdatasets
#
# read_csv can read data directly from the net
# Don't use read.csv()
docVisits <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")

Rows: 5190 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
dbl (7): rownames, visits, age, income, illness, reduced, health

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

So, a data frame containing 5,190 observations on 12 variables.

How about a locally stored CSV file?

We can also use a locally downloaded and stored CSV file. Assuming the file is stored in a subfolder called data inside your R project folder, we can proceed as follows:

```{r}
#| eval: false
docVisits <- read_csv("data/DoctorVisits.csv")
```

Let us quickly report the data itself, as in a real report. Note that we can use the features of the kableExtra package to dress up this table too!!

docVisits %>%
  head(10) %>%
  kbl(
    caption = "Doctor Visits Dataset",
    # Add Human Readable Names if desired
    # col.names(..names that you may want..)
  ) %>%
  kable_styling(
    bootstrap_options = c(
      "striped", "hover",
      "condensed", "responsive"
    ),
    full_width = F, position = "center"
  )

Doctor Visits Dataset
rownames	visits	gender	age	income	illness	reduced	health	private	freepoor	freerepat	nchronic	lchronic
1	1	female	0.19	0.55	1	4	1	yes	no	no	no	no
2	1	female	0.19	0.45	1	2	1	yes	no	no	no	no
3	1	male	0.19	0.90	3	0	0	no	no	no	no	no
4	1	male	0.19	0.15	1	0	0	no	no	no	no	no
5	1	male	0.19	0.45	2	5	1	no	no	no	yes	no
6	1	female	0.19	0.35	5	1	9	no	no	no	yes	no
7	1	female	0.19	0.55	4	0	2	no	no	no	no	no
8	1	female	0.19	0.15	3	0	6	no	no	no	no	no
9	1	female	0.19	0.65	2	0	5	yes	no	no	no	no
10	1	male	0.19	0.15	1	0	0	yes	no	no	no	no

Examine the Data

glimpse(docVisits)

Rows: 5,190
Columns: 13
$ rownames  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ gender    <chr> "female", "female", "male", "male", "male", "female", "femal…
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
$ private   <chr> "yes", "yes", "no", "no", "no", "no", "no", "no", "yes", "ye…
$ freepoor  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ freerepat <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ nchronic  <chr> "no", "no", "no", "no", "yes", "yes", "no", "no", "no", "no"…
$ lchronic  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

Descriptive Stat Summary from dplyr::glimpse()

Very crisp output, giving us the size of the dataset (5190 X 13) and the nature of the variable columns, along with their first few entries. There are several Quantitative variables: visits, age, income, illness, reduced and health; the rest seem to be Qualitative variables.

Always document your data with variable descriptions when you share it, a data dictionary!

skim(docVisits) %>% kbl()

skim_type	skim_variable	complete_rate	character.min	character.max	character.empty	character.n_unique	character.whitespace	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
character	gender	1	4	6	0	2	0	NA	NA	NA	NA	NA	NA	NA	NA
character	private	1	2	3	0	2	0	NA	NA	NA	NA	NA	NA	NA	NA
character	freepoor	1	2	3	0	2	0	NA	NA	NA	NA	NA	NA	NA	NA
character	freerepat	1	2	3	0	2	0	NA	NA	NA	NA	NA	NA	NA	NA
character	nchronic	1	2	3	0	2	0	NA	NA	NA	NA	NA	NA	NA	NA
character	lchronic	1	2	3	0	2	0	NA	NA	NA	NA	NA	NA	NA	NA
numeric	rownames	1	NA	NA	NA	NA	NA	2595.5000000	1498.3682792	1.00	1298.25	2595.50	3892.75	5190.00	▇▇▇▇▇
numeric	visits	1	NA	NA	NA	NA	NA	0.3017341	0.7981338	0.00	0.00	0.00	0.00	9.00	▇▁▁▁▁
numeric	age	1	NA	NA	NA	NA	NA	0.4063854	0.2047818	0.19	0.22	0.32	0.62	0.72	▇▂▁▂▅
numeric	income	1	NA	NA	NA	NA	NA	0.5831599	0.3689067	0.00	0.25	0.55	0.90	1.50	▇▆▅▅▂
numeric	illness	1	NA	NA	NA	NA	NA	1.4319846	1.3841524	0.00	0.00	1.00	2.00	5.00	▇▂▂▁▁
numeric	reduced	1	NA	NA	NA	NA	NA	0.8618497	2.8876284	0.00	0.00	0.00	0.00	14.00	▇▁▁▁▁
numeric	health	1	NA	NA	NA	NA	NA	1.2175337	2.1242665	0.00	0.00	0.00	2.00	12.00	▇▁▁▁▁

Descriptive Stat Summary from skimr::skim()

A Data Summary: it lists the dimensions of the docVisits dataset: 5190 rows and 13 columns. 6 columns are character formatted, the remaining 7 are numeric. The dataset is not “grouped” (more on this later).
The second part of the output shows a table with the character variables which are therefore factor variables with levels.
The third part shows a table listing the names and summary stats for the numerical variables. We have mean, sd, all the quantiles (p0, p25, p50(median), p75 and p100 percentiles) and a neat little histogram for each.
Can we consider the health Goldberg score a Qualitative variable, to be understood as “ranks” between a minimum and maximum? It is just possible…

inspect(docVisits)


categorical variables:  
       name     class levels    n missing
1    gender character      2 5190       0
2   private character      2 5190       0
3  freepoor character      2 5190       0
4 freerepat character      2 5190       0
5  nchronic character      2 5190       0
6  lchronic character      2 5190       0
                                   distribution
1 female (52.1%), male (47.9%)                 
2 no (55.7%), yes (44.3%)                      
3 no (95.7%), yes (4.3%)                       
4 no (79%), yes (21%)                          
5 no (59.7%), yes (40.3%)                      
6 no (88.3%), yes (11.7%)                      

quantitative variables:  
      name   class  min      Q1  median      Q3     max         mean
1 rownames numeric 1.00 1298.25 2595.50 3892.75 5190.00 2595.5000000
2   visits numeric 0.00    0.00    0.00    0.00    9.00    0.3017341
3      age numeric 0.19    0.22    0.32    0.62    0.72    0.4063854
4   income numeric 0.00    0.25    0.55    0.90    1.50    0.5831599
5  illness numeric 0.00    0.00    1.00    2.00    5.00    1.4319846
6  reduced numeric 0.00    0.00    0.00    0.00   14.00    0.8618497
7   health numeric 0.00    0.00    0.00    2.00   12.00    1.2175337
            sd    n missing
1 1498.3682792 5190       0
2    0.7981338 5190       0
3    0.2047818 5190       0
4    0.3689067 5190       0
5    1.3841524 5190       0
6    2.8876284 5190       0
7    2.1242665 5190       0

Descriptive Stat Summary from mosaic::inspect()

We see that the output of mosaic::inspect() is organized very similarly to the output from skim. Is there any missing data? Both skim and mosaic report on the data completion for each variable in the dataset.

Data Dictionary

Variable	Description
visits	Number of doctor visits in past 2 weeks.
gender	Factor indicating gender.
age	Age in years divided by 100.
income	Annual income in tens of thousands of dollars.
illness	Number of illnesses in past 2 weeks.
reduced	Number of days of reduced activity in past 2 weeks due to illness or injury.
health	General health questionnaire score using Goldberg’s method.
private	Factor. Does the individual have private health docVisits?
freepoor	Factor. Does the individual have free government health docVisits due to low income?
freerepat	Factor. Does the individual have free government health docVisits due to old age, disability or veteran status?
nchronic	Factor. Is there a chronic condition not limiting activity?
lchronic	Factor. Is there a chronic condition limiting activity?

Here too, we should convert the variables that are obviously Qualitative into factors, ordered or otherwise:

docVisits_modified <- docVisits %>%
  mutate(
    gender = as_factor(gender),
    private = as_factor(private),
    freepoor = as_factor(freepoor),
    freerepat = as_factor(freerepat),
    nchronic = as_factor(nchronic),
    lchronic = as_factor(lchronic)
  )
docVisits_modified

ABCDEFGHIJ0123456789

rownames <dbl>	visits <dbl>	gender <fct>	age <dbl>	income <dbl>	illness <dbl>	reduced <dbl>	health <dbl>	private <fct>	freepoor <fct>
1	1	female	0.19	0.55	1	4	1	yes	no
2	1	female	0.19	0.45	1	2	1	yes	no
3	1	male	0.19	0.90	3	0	0	no	no
4	1	male	0.19	0.15	1	0	0	no	no
5	1	male	0.19	0.45	2	5	1	no	no
6	1	female	0.19	0.35	5	1	9	no	no
7	1	female	0.19	0.55	4	0	2	no	no
8	1	female	0.19	0.15	3	0	6	no	no
9	1	female	0.19	0.65	2	0	5	yes	no
10	1	male	0.19	0.15	1	0	0	yes	no

Groups and Counts of Qualitative Variables

What is the most important dialogue uttered in the movie “Sholay”?

Recall our discussion in Types of Data Variables. We have looked at means, limits, and percentiles of Quantitative variables. Another good idea to examine datasets is to look at counts, proportions, and frequencies with respect to Qualitative variables.

We typically do this with the dplyr package from the tidyverse.

mpg_modified %>% dplyr::count(cyl)

ABCDEFGHIJ0123456789

cyl <fct>	n <int>
4	81
5	4
6	79
8	70

mpg_modified %>% mosaic::count(drv) # does the same thing! Counts!

ABCDEFGHIJ0123456789

drv <fct>	n <int>
f	106
4	103
r	25

mpg_modified %>% count(fl)

ABCDEFGHIJ0123456789

fl <fct>	n <int>
p	52
r	168
e	8
d	5
c	1

### All combinations of cut, color, clarity
### Overwhelming??
mpg_modified %>%
  count(across(where(is.factor)))

ABCDEFGHIJ0123456789

cyl <fct>	trans <fct>	drv <fct>	fl <fct>	class <fct>	n <int>
4	auto(l5)	f	p	compact	1
4	auto(l5)	f	p	midsize	1
4	auto(l5)	f	r	midsize	1
4	auto(l5)	f	r	subcompact	1
4	auto(l5)	f	c	subcompact	1
4	auto(l5)	4	p	compact	1
4	manual(m5)	f	p	compact	1
4	manual(m5)	f	p	midsize	1
4	manual(m5)	f	p	subcompact	1
4	manual(m5)	f	r	compact	7

Business Insights from Groups and Counts (mpg)

We see that the groups for each level of cyl, drv, and fl are not the same size: for instance the group with the “r”-level fl type is largest at 168 observations, and the “r”-level in drv has only 25 observations.
Group counts based on cyl are also not balanced.
Counting all combinations of these three factors shows counts of 1 for several combination and does not lead to any decent amount of aggregation.

These aspects may need to be factored into downstream modelling or machine learning tasks. (Usually by stratification wrt levels of the Qualitative variables)

The levels are not too many, so tables work, and so would bar charts, which we will examine next. If there are too many levels in any factor, tables are a better option. Bar charts can still be plotted, but it may be preferable to lump smaller categories/levels together. (Using the forcats package)

## Counting by the obvious factor variables
docVisits %>% count(gender)

ABCDEFGHIJ0123456789

gender <chr>	n <int>
female	2702
male	2488

docVisits %>% count(private)

ABCDEFGHIJ0123456789

private <chr>	n <int>
no	2892
yes	2298

docVisits %>% count(freepoor)

ABCDEFGHIJ0123456789

freepoor <chr>	n <int>
no	4968
yes	222

docVisits %>% count(freerepat)

ABCDEFGHIJ0123456789

freerepat <chr>	n <int>
no	4099
yes	1091

docVisits %>% count(lchronic)

ABCDEFGHIJ0123456789

lchronic <chr>	n <int>
no	4585
yes	605

docVisits %>% count(nchronic)

ABCDEFGHIJ0123456789

nchronic <chr>	n <int>
no	3098
yes	2092

# Now for all Combinations...
# Maybe too much to digest...
docVisits %>% count(across(where(is.character)))

ABCDEFGHIJ0123456789

gender <chr>	private <chr>	freepoor <chr>	freerepat <chr>	nchronic <chr>
female	no	no	no	no
female	no	no	no	no
female	no	no	no	yes
female	no	no	yes	no
female	no	no	yes	no
female	no	no	yes	yes
female	no	yes	no	no
female	no	yes	no	no
female	no	yes	no	yes
female	yes	no	no	no

# Shall we try counting by some variables that might be factors?
# Even if they are labeled as <dbl>?
#
docVisits %>% count(illness)

ABCDEFGHIJ0123456789

illness <dbl>	n <int>
0	1554
1	1638
2	946
3	542
4	274
5	236

docVisits %>% count(health)

ABCDEFGHIJ0123456789

health <dbl>	n <int>
0	3026
1	823
2	446
3	273
4	187
5	132
6	104
7	61
8	42
9	32

Business Insights from Groups and Counts (docVisits)

Most of the counts are roughly balanced across the levels of the factors; however, freepoor and lchronic show unbalanced counts…
The factors are too numerous for a combination count table to very useful..
Counting by illness and health does show that these two columns have a limited set of integer entries across over 5000 rows!! So these can be thought of as factors if needed in the analysis. So not every integer variable is necessaily a number!!

Groups and Summaries of Quantitative Variables

We saw that we could obtain numerical summary stats such as means, medians, quartiles, maximum/minimum of entire Quantitative variables, i.e the complete column. However, we often need identical numerical summary stats of parts of a Quantitative variable. Why?

Note that we have Qualitative variables as well in a typical dataset. These Qual variables help us to group the entire dataset based on their combinations of levels. We can now think of summarizing Quant variables within each such group.

Let us work through these ideas for both our familiar datasets.

mpg_modified %>%
  group_by(cyl) %>%
  summarize(average_hwy = mean(hwy), count = n())

ABCDEFGHIJ0123456789

cyl <fct>	average_hwy <dbl>	count <int>
4	28.80247	81
5	28.75000	4
6	22.82278	79
8	17.62857	70

mpg_modified %>%
  group_by(cyl, fl) %>%
  summarize(average_hwy = mean(hwy), count = n())

ABCDEFGHIJ0123456789

cyl <fct>	fl <fct>	average_hwy <dbl>	count <int>
4	p	27.81818	22
4	r	28.29091	55
4	d	43.00000	3
4	c	36.00000	1
5	r	28.75000	4
6	p	25.29412	17
6	r	22.23333	60
6	e	17.00000	1
6	d	22.00000	1
8	p	20.76923	13

# Perhaps the best method for us!
mpg_modified %>%
  mosaic::favstats(hwy ~ cyl, data = .) # Don't use fav_stats with formula!!!

ABCDEFGHIJ0123456789

cyl <chr>	min <dbl>	Q1 <dbl>	median <dbl>	Q3 <dbl>	max <dbl>	mean <dbl>	sd <dbl>	n <int>
4	20	26.00	29	31	44	28.80247	4.515030	81
5	28	28.75	29	29	29	28.75000	0.500000	4
6	17	19.00	24	26	29	22.82278	3.685590	79
8	12	16.00	17	19	26	17.62857	3.262307	70

# Be aware of the first column format here!
mpg_modified %>%
  mosaic::favstats(hwy ~ cyl + fl, data = .) # Don't use fav_stats with formula!!!

ABCDEFGHIJ0123456789

cyl.fl <chr>	min <dbl>	Q1 <dbl>	median <dbl>	Q3 <dbl>	max <dbl>	mean <dbl>	sd <dbl>	n <int>
4.p	23	26.25	29	29	31	27.81818	2.061815	22
5.p	NA	NA	NA	NA	NA	NaN	NA	0
6.p	20	25.00	26	26	27	25.29412	1.571810	17
8.p	14	16.00	23	25	26	20.76923	4.603510	13
4.r	20	26.00	28	31	37	28.29091	3.951899	55
5.r	28	28.75	29	29	29	28.75000	0.500000	4
6.r	17	19.00	23	26	29	22.23333	3.819516	60
8.r	15	16.00	17	18	23	17.51020	1.894317	49
4.e	NA	NA	NA	NA	NA	NaN	NA	0
5.e	NA	NA	NA	NA	NA	NaN	NA	0

Business Insights from Grouped Quant Summaries (mpg)

We have quite some variation of mean_hwy mileage over cyl , though the groups/level are quite imbalanced. This is of course a small dataset.
The number of groups are large enough (>> 7!) to warrant a chart, which we will make in our next module on Distributions.
Mean price varies quite some based on cyland fl. Some groups are non-existent, hence we see “NA” and “NaN” in the output of mosaic::favstats.; and also on combinations of cyl:fl. This should point to the existence of some interaction effect when modelling for price.

docVisits_modified %>%
  group_by(gender) %>%
  summarize(average_visits = mean(visits), count = n())

ABCDEFGHIJ0123456789

gender <fct>	average_visits <dbl>	count <int>
female	0.3619541	2702
male	0.2363344	2488

##
docVisits_modified %>%
  group_by(gender) %>%
  summarize(average_visits = mean(visits), count = n())

ABCDEFGHIJ0123456789

gender <fct>	average_visits <dbl>	count <int>
female	0.3619541	2702
male	0.2363344	2488

##
docVisits_modified %>%
  group_by(freepoor, nchronic) %>%
  summarise(
    mean_income = mean(income),
    average_visits = mean(visits),
    count = n()
  )

ABCDEFGHIJ0123456789

freepoor <fct>	nchronic <fct>	mean_income <dbl>	average_visits <dbl>	count <int>
no	no	0.6236097	0.28317980	2931
no	yes	0.5556652	0.34413353	2037
yes	no	0.3108383	0.07784431	167
yes	yes	0.2727273	0.40000000	55

##
docVisits_modified %>%
  mosaic::favstats(income ~ gender, data = .) # Don't use fav_stats with formula!!!

ABCDEFGHIJ0123456789

gender <chr>	min <dbl>	Q1 <dbl>	median <dbl>	Q3 <dbl>
female	0	0.25	0.35	0.65
male	0	0.35	0.65	0.90

##
docVisits_modified %>%
  mosaic::favstats(income ~ freepoor + nchronic, data = .) # Don't use fav_stats with formula!!!

ABCDEFGHIJ0123456789

freepoor.nchronic <chr>	Q1 <dbl>	median <dbl>	Q3 <dbl>
no.no	0.25	0.55	0.90
yes.no	0.15	0.25	0.45
no.yes	0.25	0.45	0.75
yes.yes	0.15	0.25	0.35

Business Insights from Grouped Quant Summaries (docVisits)

Clearly the people who are freepoor ( On Govt Insurance) AND with a chronic condition are those who have lower average income and a higher average number of visits to the doctor…but there are relatively few of them (n = 55) in this dataset.

More on dplyr

The dplyr package is capable of doing much more than just count, group_by and summarize. We will encounter this package many times more as we build our intuition about data visualization. A full tutorial on dplyr is linked to the icon below:

`dplyr` Tutorial

Reporting Tables for Data and the Data Schema

Data and the Data Schema are Different!!

Note that all the three methods (dplyr::glimpse(), skimr::skim(), and mosaic::inspect()) report the schema of the original dataframe. The schema are also formatted as data frames! However they do not “contain” the original data! Do not confuse between the data and it’s reported schema!

As stated earlier, it is usually a good idea to make crisp business-like tables, for the data itself, and of the schema as revealed by one of the outputs of the three methods (glimpse/skim/inspect) presented above. There are many table-making methods in R to do this; one of the simplest and effective ones is to use the kable set of commands from the knitr and kableExtra packages that we have installed already:

mpg %>%
  head(10) %>%
  kbl(
    col.names = c(
      "Manufacturer", "Model", "Engine\nDisplacement",
      "Model\n Year", "Cylinders", "Transmission",
      "Drivetrain", "City\n Mileage", "Highway\n Mileage",
      "Fuel", "Class\nOf\nVehicle"
    ),
    longtable = FALSE, centering = TRUE,
    caption = "MPG Dataset"
  ) %>%
  kable_styling(
    bootstrap_options = c(
      "striped", "hover",
      "condensed", "responsive"
    ),
    full_width = F, position = "center"
  )

MPG Dataset
Manufacturer	Model	Engine Displacement	Model Year	Cylinders	Transmission	Drivetrain	City Mileage	Highway Mileage	Fuel	Class Of Vehicle
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact
audi	a4	3.1	2008	6	auto(av)	f	18	27	p	compact
audi	a4 quattro	1.8	1999	4	manual(m5)	4	18	26	p	compact
audi	a4 quattro	1.8	1999	4	auto(l5)	4	16	25	p	compact
audi	a4 quattro	2.0	2008	4	manual(m6)	4	20	28	p	compact

And for the schema from skim(), with some extra bells and whistles on the table:

skim(mpg) %>%
  kbl(align = "c", caption = "Skim Output for mpg Dataset") %>%
  kable_paper(full_width = F)

Skim Output for mpg Dataset
skim_type	skim_variable	complete_rate	character.min	character.max	character.empty	character.n_unique	character.whitespace	numeric.mean	numeric.sd	numeric.p0	numeric.p25	numeric.p50	numeric.p75	numeric.p100	numeric.hist
character	manufacturer	1	4	10	0	15	0	NA	NA	NA	NA	NA	NA	NA	NA
character	model	1	2	22	0	38	0	NA	NA	NA	NA	NA	NA	NA	NA
character	trans	1	8	10	0	10	0	NA	NA	NA	NA	NA	NA	NA	NA
character	drv	1	1	1	0	3	0	NA	NA	NA	NA	NA	NA	NA	NA
character	fl	1	1	1	0	5	0	NA	NA	NA	NA	NA	NA	NA	NA
character	class	1	3	10	0	7	0	NA	NA	NA	NA	NA	NA	NA	NA
numeric	displ	1	NA	NA	NA	NA	NA	3.471795	1.291959	1.6	2.4	3.3	4.6	7	▇▆▆▃▁
numeric	year	1	NA	NA	NA	NA	NA	2003.500000	4.509646	1999.0	1999.0	2003.5	2008.0	2008	▇▁▁▁▇
numeric	cyl	1	NA	NA	NA	NA	NA	5.888889	1.611535	4.0	4.0	6.0	8.0	8	▇▁▇▁▇
numeric	cty	1	NA	NA	NA	NA	NA	16.858974	4.255946	9.0	14.0	17.0	19.0	35	▆▇▃▁▁
numeric	hwy	1	NA	NA	NA	NA	NA	23.440171	5.954643	12.0	18.0	24.0	27.0	44	▅▅▇▁▁

See https://haozhu233.github.io/kableExtra/ for more options on formatting the table with kableExtra.

A Quick Quiz

Warning

It is always a good idea to look for variables in data that may be incorrectly formatted. For instance, a variable marked as numerical may have the values 1-2-3-4 which represent options, sizes, or say months. in which case it would have to be interpreted as a factor.

Let us take a small test with the mpg dataset:

What is the number of qualitative/categorical variables in the mpg data?
How many manufacturers are named in this dataset?
How many levels does the variable drv have?
How many quantitative/numerical variables shown in the mpg data?
But the variable

is actually a qualitative variable.

Your Turn

Try your hand at these datasets. Look at the data, state the data dictionary, contemplate a few Research Questions and answer them with Summaries and Tables in Quarto!
Try adding more summary functions to the summary table? Which might you choose? Why?

Star Trek Books

Which would be the Group By variables here? And what would you summarize? With which function?

Math Anxiety! Hah!

Wait, But Why?

Data Summaries give you the essentials, without getting bogged down in the details(just yet).
Summaries help you “live with your data”; this is an important step in understanding it, and deciding what to do with it.
Summaries help evoke Questions and Hypotheses, which may lead to inquiries, analysis, and insights
Grouped Summaries should tell you if:
- counts of groups in your target audience are lopsided/imbalanced; Go and Get your data again.
- there are visible differences in Quant data across groups, so your target audience could be nicely fractured;
- etc.

Conclusion

The three methods (glimpse/skim/inspect) given here give us a very comprehensive look into the structure of the dataset.
The favstats method allows us to compute a whole lot of metrics for Quant variables for each level of one or more *Qual variables.
Use the kable set of commands to make a smart-looking of the data and the outputs of any of the three methods.

Make these part of your Workflow.

AI Generated Summary and Podcast

This is a tutorial on using the R programming language to perform descriptive statistical analysis on data sets. The tutorial focuses on summarizing data using various R packages like dplyr, skimr, and mosaic. It emphasizes the importance of understanding the data’s structure, identifying different types of variables (qualitative and quantitative), and calculating summary statistics such as means, medians, and frequencies. The tutorial provides examples using real datasets and highlights the significance of data summaries in gaining initial insights, formulating research questions, and identifying potential issues with the data.

References

Lock, Lock, Lock, Lock, and Lock. (2021). Statistics: Unlocking the Power of Data, 3rd Edition). https://media.wiley.com/product_data/excerpt/69/11196821/1119682169-32.pdf

R Package Citations

Package	Version	Citation
mosaic	1.9.1	Pruim, Kaplan, and Horton (2017)
palmerpenguins	0.1.1	Horst, Hill, and Gorman (2020)
skimr	2.1.5	Waring et al. (2022)

Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.

Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to ‘Think with Data’ Using r.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.

Stigler, Stephen M. 2016. “The Seven Pillars of Statistical Wisdom,” March. https://doi.org/10.4159/9780674970199.

Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. skimr: Compact and Flexible Summaries of Data. https://doi.org/10.32614/CRAN.package.skimr.

Citation

BibTeX citation:

@online{v.2023,
  author = {V., Arvind},
  title = {\textless Iconify-Icon Icon=“carbon:summary-Kpi”
    Width=“1.2em”
    Height=“1.2em”\textgreater\textless/Iconify-Icon\textgreater{}
    {Summaries}},
  date = {2023-10-15},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/10-FavStats/},
  langid = {en},
  abstract = {Bill Gates walked into a bar, and everyone’s salary went
    up on average.}
}

For attribution, please cite this work as:

V., Arvind. 2023. “<Iconify-Icon Icon=‘carbon:summary-Kpi’ Width=‘1.2em’ Height=‘1.2em’></Iconify-Icon> Summaries.” October 15, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/10-FavStats/.