📊 Descriptive Statistics

Mirror, mirror on the wall…

Qual Variables
Quant Variables
Mean
Median
Standard Deviation
Quartiles
Author

Arvind V.

Published

October 15, 2023

Modified

July 4, 2024

Abstract
Who is the Fairest of them All?

Using web-R

This tutorial uses web-R, which allows you to run all code within your browser, on all devices. Most code chunks herein are formatted in a tabbed structure (like in an old-fashioned library) with duplicated code. The tabs in front have regular R code that will work when copy-pasted into your RStudio session. The tab “behind” has the web-R code that can run directly in your browser, and can be modified as well. The regular R code also gives you the original to go back to when you have made several modifications on the web-R tabs and need to compare your code with the original!

Keyboard Shortcuts

  • Run selected code using either:
    • macOS: ⌘ + ↩︎/Return
    • Windows/Linux: Ctrl + ↩︎/Enter
  • Run the entire code by clicking the “Run code” button or pressing Shift+↩︎.

Setting up R Packages
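
The setup chunk itself is not shown on this page. A minimal sketch, assuming the packages named later in this tutorial (tidyverse, mosaic, skimr, knitr, kableExtra) are already installed, might look like this:

```r
# Assumed setup: these are the packages this tutorial refers to by name
library(tidyverse)   # dplyr, readr, forcats, and ggplot2 (which provides mpg)
library(mosaic)      # inspect() and favstats()
library(skimr)       # skim()
library(knitr)       # base table-making machinery
library(kableExtra)  # kbl(), kable_styling(), kable_paper()
```

If any of these are missing, `install.packages()` with the package name will fetch them from CRAN.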

How do we Grasp Data?

We spoke of Experiments and Data Gathering in the first module Nature of Data. This helped us to obtain data.

As we discussed in that same Module, for us to grasp the significance of the data, we need to describe it; the actual data is usually too vast for us to comprehend in its entirety. Anything more than a handful of observations in a dataset is enough for us to require other ways of grasping it.

The first thing we need to do, therefore, is to reduce it to a few salient numbers that allow us to summarize the data.

Reduction is Addition

Such a reduction may seem paradoxical but is one of the important tenets of statistics: reduction, while taking away information, ends up adding to insight.

Stephen Stigler (2016) is the author of the book “The Seven Pillars of Statistical Wisdom”. One of the Big Ideas in Statistics from that book is Aggregation:

The first pillar I will call Aggregation, although it could just as well be given the nineteenth-century name, “The Combination of Observations,” or even reduced to the simplest example, taking a mean. Those simple names are misleading, in that I refer to an idea that is now old but was truly revolutionary in an earlier day – and it still is so today, whenever it reaches into a new area of application. How is it revolutionary? By stipulating that, given a number of observations, you can actually gain information by throwing information away! In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.

Let us get some inspiration from Brad Pitt, from the movie Moneyball, which is about applying Data Analytics to the game of baseball.

And then, an example from a more sombre story:

| Year | Below Level #1 | Level #1 | Level #2 | Level #3 | Levels #4 and #5 |
|---|---|---|---|---|---|
| Number in millions (2012/2014) | 8.35 | 26.49 | 65.10 | 71.41 | 26.57 |
| Number in millions (2017) | 7.59 | 29.23 | 66.07 | 68.81 | 26.75 |

Note: SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014.

Table 1: US Population: Reading and Numeracy Levels

This ghastly-looking Table 1 examines U.S. adults with low English literacy and numeracy skills (low-skilled adults) at two points in the 2010s, 2012/2014 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). As can be seen, the numbers are quite surprising in absolute terms for a developed country like the US, and the number of adults at or below Level #1 has increased from 34.84 million in 2012/2014 to 36.82 million in 2017!

So why do we need to summarise data? Summarization is an act of throwing away data to make more sense, as stated by Stigler (2016) and also by Brad Pitt, playing Billy Beane, in the movie. To summarize is to understand. Add to that the fact that our working memory can hold maybe 7 items, so summaries also help us retain information.

And if we don’t summarise? In a fantasy short story published in 1942, titled “Funes the Memorious,” Jorge Luis Borges described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.” (emphasis mine)

Aggregation can yield great gains above the individual components in data. Funes was big data without Statistics.

What graphs / numbers will we see today?

| Variable #1 | Variable #2 | Chart Names | “Chart Shape” |
|---|---|---|---|
| All | All | Tables and Stat Measures | |

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

  • Dataset length: How many rows/observations?
  • Dataset breadth: How many columns/variables?
  • How many Quant variables?
  • How many Qual variables?
  • Quant variables: min, max, mean, median, sd
  • Qual variables: levels, counts per level
  • Both: means, medians for each level of a Qual variable…
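
As a rough illustration of the first few numbers in this list, here is a minimal sketch using the mpg dataset from ggplot2 (the dataset used in Case Study-1 below). Note the assumption baked in: columns are classified purely by storage type, so a variable such as cyl is counted as Quant here even though we may later decide it is really Qual.

```r
library(ggplot2)  # provides the mpg dataset

n_rows  <- nrow(mpg)   # dataset length: number of observations
n_cols  <- ncol(mpg)   # dataset breadth: number of variables

# Classify columns by storage type only (an assumption; see note above)
n_quant <- sum(sapply(mpg, is.numeric))
n_qual  <- sum(sapply(mpg, function(x) is.character(x) || is.factor(x)))

c(rows = n_rows, cols = n_cols, quant = n_quant, qual = n_qual)
```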

What kind of Data Variables will we choose?

| No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
|---|---|---|---|---|---|
| 1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
| 2 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities with Scale. Differences are meaningful, but not products or ratios. | Quantitative/Interval | pH, SAT score (200-800), Credit score (300-850), Year of Starting College | Mean, Standard Deviation |
| 3 | How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best...) | Qualitative/Ordinal | Socioeconomic status (Low income, Middle income, High income), Education level (HighSchool, BS, MS, PhD), Satisfaction rating (Very much Dislike, Dislike, Neutral, Like, Very Much Like) | Median, Percentile |
| 4 | What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative/Nominal | Name | Count no. of cases, Mode |

We will obviously choose all variables in the dataset, unless they are unrelated ones such as row number or ID which (we think) may not contribute any information and we can disregard.

How do these Summaries Work?

Inspecting the min, max, mean, median and sd of each of the Quant variables tells us straightaway what the ranges of the variables are, and whether there are some outliers, which could be genuine or maybe due to data entry error! Comparing the ranges of two Quant variables also tells us whether we may have to scale/normalize them for computational ease, if one variable has large numbers and the other has very small ones.
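
A minimal sketch of this idea, again assuming the mpg dataset from ggplot2: we look at the ranges of two Quant variables and then standardize them with base R's `scale()`, which subtracts the mean and divides by the standard deviation. Standardization is only one of several possible scaling choices.

```r
library(ggplot2)  # provides the mpg dataset

# Range, quartiles, mean and median in one call per variable
summary(mpg$displ)
summary(mpg$hwy)
c(sd_displ = sd(mpg$displ), sd_hwy = sd(mpg$hwy))

# scale() returns a one-column matrix; as.numeric() flattens it to a vector
hwy_z   <- as.numeric(scale(mpg$hwy))
displ_z <- as.numeric(scale(mpg$displ))

# After standardizing, both variables have mean 0 and sd 1 (up to rounding)
round(c(mean_hwy_z = mean(hwy_z), sd_hwy_z = sd(hwy_z)), 3)
```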

With Qual variables, we understand the levels within each, and understand the total number of combinations of the levels across these. Counts across levels, and across combinations of levels, tell us whether the data has sufficient readings for graphing, inference, and decision-making, or if certain levels/classes of data are under- or over-represented. For both types of variables, we need to keep an eye open for data entries that are missing! This may point to data gathering errors, which may be fixable. Or we will have to take a decision to let go of that entire observation (i.e. a row). Or even do what is called imputation to fill in values that are based on the other values in the same column, which sounds like we are making up data, but isn’t so really.
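
The three options for missing entries can be sketched on a small made-up data frame. The `toy` data below is hypothetical, invented only for this illustration, and median imputation is just one simple choice among many.

```r
# Hypothetical toy data with one missing Qual entry and one missing Quant entry
toy <- data.frame(
  group = c("a", "a", "b", "b", NA),
  score = c(10, NA, 14, 18, 12)
)

# How many entries are missing in each column?
colSums(is.na(toy))

# Option 1: let go of every row that has any missing value
toy_complete <- na.omit(toy)

# Option 2: impute the missing score using the median of the observed scores
toy_imputed <- toy
toy_imputed$score[is.na(toy_imputed$score)] <- median(toy$score, na.rm = TRUE)
toy_imputed
```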

Counts across combinations of levels may also tell us if we are witnessing a Simpson’s Paradox situation. You may have to decide what to do about sparse data, or just check your biases!

Case Study-1

We will first use the mpg dataset, which is available in R as part of one of the packages (ggplot2, within the tidyverse) that we have loaded with the library() command.

Examine the Data

It is usually a good idea to make crisp business-like tables, for the data itself, and for the schema as revealed by one of the outputs of the three methods to be presented below. There are many methods to do this; one of the simplest and most effective is to use the kable set of commands from the knitr and kableExtra packages:

mpg %>% 
  head(10) %>%
  kbl(
    # add Human Readable column names
    col.names = c("Manufacturer", "Model", "Engine\nDisplacement", 
                    "Model\n Year", "Cylinders", "Transmission",
                    "Drivetrain", "City\n Mileage", "Highway\n Mileage",
                    "Fuel", "Class\nOf\nVehicle"), 
    caption = "MPG Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", 
                                      "condensed", "responsive"),
                full_width = F, position = "center")
MPG Dataset
Manufacturer Model Engine Displacement Model Year Cylinders Transmission Drivetrain City Mileage Highway Mileage Fuel Class Of Vehicle
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
audi a4 3.1 2008 6 auto(av) f 18 27 p compact
audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28 p compact

Next we will look at a few favourite statistics or “favstats” that we can derive from data. R is full of packages that can provide very evocative and effective summaries of data. We will start with the dplyr package from the tidyverse, then the skimr package, and then the mosaic package. We will look at the summary outputs from these and learn how to interpret them.
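
The three summary calls can be sketched as follows, assuming dplyr, skimr, mosaic and ggplot2 are installed. Each call prints a differently organized overview of the same mpg data.

```r
library(dplyr)    # glimpse()
library(skimr)    # skim()
library(mosaic)   # inspect()
library(ggplot2)  # provides the mpg dataset

# dplyr: one line per variable, with its type and the first few values
glimpse(mpg)

# skimr: one row per variable, with missing counts and summary statistics
mpg_skim <- skim(mpg)
mpg_skim

# mosaic: separate summaries for categorical and quantitative variables
inspect(mpg)
```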

Data Dictionary and Munging

Using skim/inspect/glimpse, we can put together a (brief) data dictionary as follows:

Qualitative Data
  • model(chr): Car model name
  • manufacturer(chr): Car maker name
  • fl(chr): fuel type
  • drv(chr): type of drive(front, rear, 4W)
  • class(chr): type of vehicle (compact, pickup, suv…)
  • trans(chr): type of transmission ( auto, manual..)
Quantitative Data
  • hwy(int): Highway Mileage
  • cty(int): City Mileage
  • cyl(int): Number of Cylinders. How do we understand this variable? Should this be Qual?
  • displ(dbl): Engine piston displacement
  • year(int): Year of model

We see that there are certain variables that must be converted to factors for analytics purposes, since they are unmistakably Qualitative in nature. Let us do that now, for use later:

mpg_modified <- mpg %>% 
  dplyr::mutate(cyl = as_factor(cyl),
                fl = as_factor(fl),
                drv = as_factor(drv),
                class = as_factor(class),
                trans = as_factor(trans))
glimpse(mpg_modified)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <fct> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <fct> auto(l5), manual(m5), manual(m6), auto(av), auto(l5), man…
$ drv          <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <fct> p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, r, …
$ class        <fct> compact, compact, compact, compact, compact, compact, com…

Case Study-2

Instead of taking a “built-in” dataset, i.e. one that is part of an R package that we can load with library(), let us try the above process with a data set that we obtain from the internet. We will use this superb repository of datasets created by Vincent Arel-Bundock: https://vincentarelbundock.github.io/Rdatasets/articles/data.html

Let us choose a modest-sized dataset, say this dataset on Doctor Visits, which is available online https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv and read it into R.

Reading external data into R

The read_csv() command from the R package readr allows us to read either locally saved data on our hard disk, or data available in a shared folder online. Avoid using read.csv() from base R, though it will show up in your code auto-complete set of options!

# From Vincent Arel-Bundock's dataset website
# https://vincentarelbundock.github.io/Rdatasets
# 
# read_csv can read data directly from the net
# Don't use read.csv()
docVisits <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
Rows: 5190 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
dbl (7): rownames, visits, age, income, illness, reduced, health

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

So, we have a data frame containing 5,190 observations in 13 columns: 12 variables, plus a rownames column that is just a row index.

How about a locally stored CSV file?

We can also use a locally downloaded and stored CSV file. Assuming the file is stored in a subfolder called data inside your R project folder, we can proceed as follows:

```{r}
#| eval: false
docVisits <- read_csv("data/DoctorVisits.csv")
```

Let us quickly report the data itself, as in a real report. Note that we can use the features of the kableExtra package to dress up this table too!!

docVisits %>%
  head(10) %>%
  kbl(
    # Add human-readable names if desired, e.g.
    # col.names = c(..names that you may want..),
    caption = "Doctor Visits Dataset"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover",
                          "condensed", "responsive"),
    full_width = FALSE, position = "center")
Doctor Visits Dataset
rownames visits gender age income illness reduced health private freepoor freerepat nchronic lchronic
1 1 female 0.19 0.55 1 4 1 yes no no no no
2 1 female 0.19 0.45 1 2 1 yes no no no no
3 1 male 0.19 0.90 3 0 0 no no no no no
4 1 male 0.19 0.15 1 0 0 no no no no no
5 1 male 0.19 0.45 2 5 1 no no no yes no
6 1 female 0.19 0.35 5 1 9 no no no yes no
7 1 female 0.19 0.55 4 0 2 no no no no no
8 1 female 0.19 0.15 3 0 6 no no no no no
9 1 female 0.19 0.65 2 0 5 yes no no no no
10 1 male 0.19 0.15 1 0 0 yes no no no no

Examine the Data

Data Dictionary

Variable Description
visits Number of doctor visits in past 2 weeks.
gender Factor indicating gender.
age Age in years divided by 100.
income Annual income in tens of thousands of dollars.
illness Number of illnesses in past 2 weeks.
reduced Number of days of reduced activity in past 2 weeks due to illness or injury.
health General health questionnaire score using Goldberg’s method.
private Factor. Does the individual have private health insurance?
freepoor Factor. Does the individual have free government health insurance due to low income?
freerepat Factor. Does the individual have free government health insurance due to old age, disability or veteran status?
nchronic Factor. Is there a chronic condition not limiting activity?
lchronic Factor. Is there a chronic condition limiting activity?

Here too, we should convert the variables that are obviously Qualitative into factors, ordered or otherwise:

docVisits_modified <-  docVisits %>% 
  mutate(gender = as_factor(gender),
         private = as_factor(private),
         freepoor = as_factor(freepoor),
         freerepat = as_factor(freerepat),
         nchronic = as_factor(nchronic),
         lchronic = as_factor(lchronic))
docVisits_modified

Groups and Counts of Qualitative Variables

What is the most important dialogue uttered in the movie “Sholay”?

Recall our discussion in Types of Data Variables. We have looked at means, limits, and percentiles of Quantitative variables. Another good way to examine datasets is to look at counts, proportions, and frequencies with respect to Qualitative variables.

We typically do this with the dplyr package from the tidyverse.
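
A minimal sketch of such counts with dplyr, assuming the mpg dataset from ggplot2. The column name `n_cars` is our own choice for this illustration, and `dplyr::count()` is written with its namespace because the mosaic package, if loaded, also exports a function named `count()`.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Counts and proportions for each level of one Qual variable
drv_counts <- mpg %>%
  dplyr::count(drv, name = "n_cars") %>%
  mutate(proportion = n_cars / sum(n_cars))
drv_counts

# Counts for each combination of levels of two Qual-like variables
drv_cyl_counts <- mpg %>%
  dplyr::count(drv, cyl)
drv_cyl_counts
```

Combinations with very small counts in the second table are the sparse cells to watch out for.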

Groups and Summaries of Quantitative Variables

We saw that we could obtain numerical summary stats such as means, medians, quartiles, and maximum/minimum of entire Quantitative variables, i.e. the complete column. However, we often need the same numerical summary stats for parts of a Quantitative variable. Why?

Note that we have Qualitative variables as well in a typical dataset. These Qual variables help us to group the entire dataset based on their combinations of levels. We can now think of summarizing Quant variables within each such group.

Let us work through these ideas for both our familiar datasets.
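
For the mpg dataset, a grouped summary might be sketched as below, assuming dplyr, ggplot2 and mosaic are installed. The summary column names (`mean_hwy` and so on) are our own choices for this illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Summaries of one Quant variable (hwy) within each level of a Qual variable (drv)
hwy_by_drv <- mpg %>%
  group_by(drv) %>%
  summarize(
    n_cars     = n(),
    mean_hwy   = mean(hwy),
    median_hwy = median(hwy),
    sd_hwy     = sd(hwy)
  )
hwy_by_drv

# The same idea using the formula interface of mosaic::favstats()
hwy_favstats <- mosaic::favstats(hwy ~ drv, data = mpg)
hwy_favstats
```

The same pattern applies to the Doctor Visits data, for example by grouping on gender and summarizing visits.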

More on dplyr

The dplyr package is capable of doing much more than just count, group_by and summarize. We will encounter this package many times more as we build our intuition about data visualization. A full tutorial on dplyr is linked to the icon below:

dplyr Tutorial

Reporting Tables for Data and the Data Schema

Data and the Data Schema are Different!!

Note that all three methods (dplyr::glimpse(), skimr::skim(), and mosaic::inspect()) report the schema of the original dataframe. The schema are also formatted as data frames! However they do not “contain” the original data! Do not confuse the data with its reported schema!

As stated earlier, it is usually a good idea to make crisp business-like tables, for the data itself, and for the schema as revealed by one of the outputs of the three methods (glimpse/skim/inspect) presented above. There are many table-making methods in R to do this; one of the simplest and most effective is to use the kable set of commands from the knitr and kableExtra packages that we have installed already:

mpg %>% 
  head(10) %>%
  kbl(col.names = c("Manufacturer", "Model", "Engine\nDisplacement", 
                    "Model\n Year", "Cylinders", "Transmission",
                    "Drivetrain", "City\n Mileage", "Highway\n Mileage",
                    "Fuel", "Class\nOf\nVehicle"), 
      longtable = FALSE, centering = TRUE,
      caption = "MPG Dataset") %>%
    kable_styling(bootstrap_options = c("striped", "hover", 
                                        "condensed", "responsive"),
                  full_width = F, position = "center")
MPG Dataset
Manufacturer Model Engine Displacement Model Year Cylinders Transmission Drivetrain City Mileage Highway Mileage Fuel Class Of Vehicle
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
audi a4 3.1 2008 6 auto(av) f 18 27 p compact
audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28 p compact

And for the schema from skim(), with some extra bells and whistles on the table:

skim(mpg) %>%
  kbl(align = "c", caption = "Skim Output for mpg Dataset") %>%
  kable_paper(full_width = F)
Skim Output for mpg Dataset
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character manufacturer 0 1 4 10 0 15 0 NA NA NA NA NA NA NA NA
character model 0 1 2 22 0 38 0 NA NA NA NA NA NA NA NA
character trans 0 1 8 10 0 10 0 NA NA NA NA NA NA NA NA
character drv 0 1 1 1 0 3 0 NA NA NA NA NA NA NA NA
character fl 0 1 1 1 0 5 0 NA NA NA NA NA NA NA NA
character class 0 1 3 10 0 7 0 NA NA NA NA NA NA NA NA
numeric displ 0 1 NA NA NA NA NA 3.471795 1.291959 1.6 2.4 3.3 4.6 7 ▇▆▆▃▁
numeric year 0 1 NA NA NA NA NA 2003.500000 4.509646 1999.0 1999.0 2003.5 2008.0 2008 ▇▁▁▁▇
numeric cyl 0 1 NA NA NA NA NA 5.888889 1.611535 4.0 4.0 6.0 8.0 8 ▇▁▇▁▇
numeric cty 0 1 NA NA NA NA NA 16.858974 4.255946 9.0 14.0 17.0 19.0 35 ▆▇▃▁▁
numeric hwy 0 1 NA NA NA NA NA 23.440171 5.954643 12.0 18.0 24.0 27.0 44 ▅▅▇▁▁

See https://haozhu233.github.io/kableExtra/ for more options on formatting the table with kableExtra.

A Quick Quiz

Warning

It is always a good idea to look for variables in data that may be incorrectly formatted. For instance, a variable marked as numerical may have the values 1-2-3-4 which represent options, sizes, or say months, in which case it would have to be interpreted as a factor.
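
One rough way to spot such variables is to count the distinct values in each numeric column; a numeric column with only a handful of distinct values is a candidate for conversion to a factor. This is a heuristic sketch on the mpg dataset, not a rule, and the final decision needs domain knowledge.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Number of distinct values in every numeric column
distinct_counts <- mpg %>%
  summarize(across(where(is.numeric), n_distinct))
distinct_counts
```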

Let us take a small test with the mpg dataset:

  • What is the number of qualitative/categorical variables in the mpg data?

  • How many manufacturers are named in this dataset?

  • How many levels does the variable drv have?

  • How many quantitative/numerical variables are shown in the mpg data?

  • But the variable ____ is actually a qualitative variable.

Conclusion

  • The three methods (glimpse/skim/inspect) given here give us a very comprehensive look into the structure of the dataset.
  • The favstats method allows us to compute a whole lot of metrics for Quant variables for each level of one or more Qual variables.
  • Use the kable set of commands to make a smart-looking table of the data and of the outputs of any of the three methods.

Make these part of your Workflow.

References

R Package Citations
Package Version Citation
mosaic 1.9.1 Pruim, Kaplan, and Horton (2017)
palmerpenguins 0.1.1 Horst, Hill, and Gorman (2020)
skimr 2.1.5 Waring et al. (2022)
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to ‘Think with Data’ Using R.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
Stigler, Stephen M. 2016. “The Seven Pillars of Statistical Wisdom,” March. https://doi.org/10.4159/9780674970199.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. skimr: Compact and Flexible Summaries of Data. https://CRAN.R-project.org/package=skimr.

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {📊 {Descriptive} {Statistics}},
  date = {2023-10-15},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/10-FavStats/},
  langid = {en},
  abstract = {Who is the Fairest of them All?}
}
For attribution, please cite this work as:
V., Arvind. 2023. “📊 Descriptive Statistics.” October 15, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/10-FavStats/.