Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors


Summaries

Throwing away data to grasp it

Tags: Qual Variables, Quant Variables, Mean, Median, Standard Deviation, Quartiles
Author: Arvind V.
Published: October 15, 2023
Modified: May 20, 2025

Abstract
Bill Gates walked into a bar, and everyone’s salary went up on average.

“The most certain sign of wisdom is cheerfulness.”

— Michel de Montaigne, Writer and philosopher

Using web-R

This tutorial uses web-r that allows you to run all code within your browser, on all devices. Most code chunks herein are formatted in a tabbed structure (like in an old-fashioned library) with duplicated code. The tabs in front have regular R code that will work when copy-pasted in your RStudio session. The tab “behind” has the web-R code that can work directly in your browser, and can be modified as well. The R code is also there to make sure you have original code to go back to, when you have made several modifications to the code on the web-r tabs and need to compare your code with the original!

Keyboard Shortcuts

  • Run selected code using either:
    • macOS: ⌘ + ↩︎/Return
    • Windows/Linux: Ctrl + ↩︎/Enter
  • Run the entire code by clicking the “Run code” button or pressing Shift+↩︎.

“The best attitude to adopt is one of compassionate patience, which has to include a tolerance for failure.”

— Gabor Maté

Setting up R Packages

library(tidyverse)
library(mosaic)
library(skimr)
library(knitr) # for kable tables, used below
library(kableExtra)


How do we Grasp Data?

We spoke of Experiments and Data Gathering in the first module Nature of Data. This helped us to obtain data.

As we discussed in that same Module, for us to grasp the significance of the data, we need to describe it; the actual data is usually too vast for us to comprehend in its entirety. Anything more than a handful of observations in a dataset is enough for us to require other ways of grasping it.

The first thing we need to do, therefore, is to reduce it to a few salient numbers that allow us to summarize the data.

ImportantReduction is Addition

Such a reduction may seem paradoxical but is one of the important tenets of statistics: reduction, while taking away information, ends up adding to insight.

Steven Stigler (2016) is the author of the book “The Seven Pillars of Statistical Wisdom”. One of the Big Ideas in Statistics from that book is: Aggregation

The first pillar I will call Aggregation, although it could just as well be given the nineteenth-century name, “The Combination of Observations,” or even reduced to the simplest example, taking a mean. Those simple names are misleading, in that I refer to an idea that is now old but was truly revolutionary in an earlier day—and it still is so today, whenever it reaches into a new area of application. How is it revolutionary? By stipulating that, given a number of observations, you can actually gain information by throwing information away! In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.
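Stigler's point can be seen in miniature with a handful of hypothetical numbers: the mean discards each value's individuality, yet reveals something no single value shows.

```r
# Five hypothetical incomes (in thousands); note the one outlier
incomes <- c(12, 15, 11, 14, 300)

mean(incomes)   # 70.4: one number summarizing all five, dragged up by the outlier
median(incomes) # 14: a different reduction, robust to the outlier
```

This is also the joke in the Abstract above: one extreme value can move the mean a long way while leaving the median untouched.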

Let us get some inspiration from Brad Pitt, from the movie Moneyball, which is about applying Data Analytics to the game of baseball.

And then, an example from a more sombre story:
Table 1: US Population: Reading and Numeracy Levels (numbers in millions)

Year        Below Level #1   Level #1   Level #2   Level #3   Levels #4 and #5
2012/2014   8.35             26.49      65.10      71.41      26.57
2017        7.59             29.23      66.07      68.81      26.75

SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014.

This ghastly-looking Table 1 examines U.S. adults with low English literacy and numeracy skills (or "low-skilled adults") at two points in the 2010s, in the years 2012/2014 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). As can be seen, the summary table is quite surprising in absolute terms for a developed country like the US, and the numbers at several levels have increased from 2012/2014 to 2017!

So why do we need to summarise data? Summarization is an act of throwing away data to make more sense, as stated by Stigler (2016) and also in the movie by Brad Pitt aka Billy Beane. To summarize is to understand. Add to that the fact that our Working Memories can hold maybe 7 items: summarizing helps with information retention too.

And if we don’t summarise? Jorge Luis Borges, in a fantasy short story published in 1942, titled “Funes the Memorious,” described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.” (emphasis mine)

Aggregation can yield great gains above the individual components in data. Funes was big data without Statistics.

What graphs / numbers will we see today?

Variable #1   Variable #2   Chart Names                "Chart Shape"
All           All           Tables and Stat Measures

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

  • Dataset length: How many rows/observations?
  • Dataset breadth: How many columns/variables?
  • How many Quant variables?
  • How many Qual variables?
  • Quant variables: min, max, mean, median, sd
  • Qual variables: levels, counts per level
  • Both: means, medians for each level of a Qual variable…
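As a sketch, the first few of these questions can be answered in one line each; the counts shown in the comments are for the mpg dataset examined later in this tutorial.

```r
library(tidyverse)

nrow(mpg) # dataset length: 234 observations
ncol(mpg) # dataset breadth: 11 variables

# How many Quant (numeric) and Qual (character) variables?
sum(sapply(mpg, is.numeric))   # 5
sum(sapply(mpg, is.character)) # 6
```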

What kind of Data Variables will we choose?

1. Quantitative/Ratio
   • Pronouns: How Many / Much / Heavy? Few? Seldom? Often? When?
   • Answer: Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful.
   • Examples: Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate
   • What Operations? Correlation
2. Quantitative/Interval
   • Pronouns: How Many / Much / Heavy? Few? Seldom? Often? When?
   • Answer: Quantities with Scale. Differences are meaningful, but not products or ratios.
   • Examples: pH, SAT score (200-800), Credit score (300-850), Year of Starting College
   • What Operations? Mean, Standard Deviation
3. Qualitative/Ordinal
   • Pronouns: How, What Kind, What Sort
   • Answer: A Manner/Method, Type, or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best…)
   • Examples: Socioeconomic status (Low income, Middle income, High income), Education level (HighSchool, BS, MS, PhD), Satisfaction rating (Very much Dislike, Dislike, Neutral, Like, Very Much Like)
   • What Operations? Median, Percentile
4. Qualitative/Nominal
   • Pronouns: What, Who, Where, Whom, Which
   • Answer: Name, Place, Animal, Thing
   • Examples: Name
   • What Operations? Count no. of cases, Mode

We will obviously choose all variables in the dataset, unless they are unrelated ones such as row number or ID which (we think) may not contribute any information and we can disregard.

How do these Summaries Work?

Quant variables: Inspecting the min, max, mean, median, variance and sd of each of the Quant variables tells us straightaway what the ranges of the variables are, and if there are some outliers, which could be normal, or maybe due to data entry error! Comparing two Quant variables for their ranges also tells us that we may have to scale/normalize them for computational ease, if one variable has large numbers and the other has very small ones.

Qual variables: With Qual variables, we understand the levels within each, and the total number of combinations of levels across them. Counts across levels, and across combinations of levels, tell us whether the data has sufficient readings for graphing, inference, and decision-making, or if certain levels/classes of data are under- or over-represented.

Together?: We can use Quant and Qual together, to develop the above summaries (min, max,mean, median and sd) for Quant variables, again across levels, and across combinations of levels of single or multiple Quals, along with counts if we are interested in that.
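A sketch of such a combined summary with dplyr, grouping a Quant variable (hwy) by the levels of a Qual one (cyl), using the mpg dataset examined later:

```r
library(tidyverse)

mpg %>%
  group_by(cyl) %>%
  summarise(
    count  = n(),
    min    = min(hwy),
    max    = max(hwy),
    mean   = mean(hwy),
    median = median(hwy),
    sd     = sd(hwy)
  )
```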

For both types of variables, we need to keep an eye open for data entries that are missing! This may point to data-gathering errors, which may be fixable. Or we may have to let go of that entire observation (i.e. a row). Or even do what is called imputation: filling in values based on the other values in the same column, which sounds like we are making up data, but isn’t really so.
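A sketch of both checks in R; the mpg dataset used later happens to have no missing entries, so the imputation step is shown on a made-up vector:

```r
library(tidyverse)

# Count missing entries per column (all zeros for mpg):
mpg %>% summarise(across(everything(), ~ sum(is.na(.x))))

# Toy mean-imputation on a vector with one missing value:
x <- c(4.1, NA, 3.8, 4.4)
replace(x, is.na(x), mean(x, na.rm = TRUE)) # the NA becomes 4.1, the mean of the rest
```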

And this may also tell us if we are witnessing a Simpson’s Paradox situation. You may have to decide on what to do with this data sparseness, or just check your biases!

Some Quick Definitions

ImportantMean

The sample mean, or average, of a Quantitative data variable can be calculated as the sum of the observed values divided by the number of observations:

mean = x̄ = (x1 + x2 + x3 + … + xn) / n

ImportantVariance and Standard Deviation

Observations can be on either side of the mean, naturally. To measure the extent of these differences, we square and sum the differences between individual values and their mean, and take their average to obtain the (sample) variance:

variance = s² = [(x1 − x̄)² + (x2 − x̄)² + (x3 − x̄)² + … + (xn − x̄)²] / (n − 1)

The standard deviation s is just the square root of the variance.

(The n − 1 is a mathematical nuance to allow for the fact that we have used the data to calculate the mean before we get to s², and hence have “used up” one degree of freedom in the data. It gets us more robust results.)
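A quick check of these formulas against R’s built-ins, on a small made-up sample:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

manual_var <- sum((x - mean(x))^2) / (n - 1) # note the n - 1 denominator
all.equal(manual_var, var(x))  # TRUE: var() uses n - 1 too
all.equal(sqrt(var(x)), sd(x)) # TRUE: sd is the square root of the variance
```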

ImportantMedian

When the observations in a Quant variable are placed in order of their magnitude, the observation in the middle is the median. Half the observations are below the median, and half are above it.
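In R, on made-up samples:

```r
sort(c(9, 1, 5, 3, 7))   # 1 3 5 7 9: the middle value is 5
median(c(9, 1, 5, 3, 7)) # 5

# With an even number of observations, the median is the
# average of the two middle values:
median(c(9, 1, 5, 3, 7, 11)) # (5 + 7) / 2 = 6
```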

Case Study-1

We will first use a dataset mpg that is available in R as part of one of the R packages that we have loaded with the library() command.

Examine the Data

It is usually a good idea to make crisp, business-like tables, both for the data itself and for the schema as revealed by one of the three methods presented below. There are many ways to do this; one of the simplest and most effective is the kable set of commands from the knitr and kableExtra packages:

mpg %>%
  head(10) %>%
  kbl(
    # add Human Readable column names
    col.names = c(
      "Manufacturer", "Model", "Engine\nDisplacement",
      "Model\n Year", "Cylinders", "Transmission",
      "Drivetrain", "City\n Mileage", "Highway\n Mileage",
      "Fuel", "Class\nOf\nVehicle"
    ),
    caption = "MPG Dataset"
  ) %>%
  kable_styling(
    bootstrap_options = c(
      "striped", "hover",
      "condensed", "responsive"
    ),
    full_width = F, position = "center"
  )
MPG Dataset
Manufacturer Model Engine Displacement Model Year Cylinders Transmission Drivetrain City Mileage Highway Mileage Fuel Class Of Vehicle
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
audi a4 3.1 2008 6 auto(av) f 18 27 p compact
audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28 p compact

Next we will look at a few favourite statistics or “favstats” that we can derive from data. R is full of packages that can provide very evocative and effective summaries of data. We will start with the dplyr package from the tidyverse, then the skimr package, and finally the mosaic package. We will look at the summary outputs from these and learn how to interpret them.

  • Using dplyr::glimpse()
  • Using skimr::skim()
  • Using mosaic::inspect()
  • web-r

The dplyr package offers a convenient command called glimpse:

glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
Note
  • Very crisp output, giving us the size of the dataset (234 X 11) and the nature of the variable columns, along with their first few entries.
  • The chr variables are usually Categorical/Qualitative.
  • The int or dbl (double precision) are usually Numerical/Quantitative.
  • But be careful! Verify that this is as per your intent, interpret the variables and modify their encoding as needed.

Let us look at mpg using skimr::skim().

From the output of ?skimr:

The format of the results are a single wide data frame combining the results, with some additional attributes and two metadata columns:

  • skim_variable: name of the original variable
  • skim_type: class of the variable

We can use skim(dataset) directly as shown below:

skimr::skim(mpg) # explicitly stating package name
Data summary
Name mpg
Number of rows 234
Number of columns 11
_______________________
Column type frequency:
character 6
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
manufacturer 0 1 4 10 0 15 0
model 0 1 2 22 0 38 0
trans 0 1 8 10 0 10 0
drv 0 1 1 1 0 3 0
fl 0 1 1 1 0 5 0
class 0 1 3 10 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
displ 0 1 3.47 1.29 1.6 2.4 3.3 4.6 7 ▇▆▆▃▁
year 0 1 2003.50 4.51 1999.0 1999.0 2003.5 2008.0 2008 ▇▁▁▁▇
cyl 0 1 5.89 1.61 4.0 4.0 6.0 8.0 8 ▇▁▇▁▇
cty 0 1 16.86 4.26 9.0 14.0 17.0 19.0 35 ▆▇▃▁▁
hwy 0 1 23.44 5.95 12.0 18.0 24.0 27.0 44 ▅▅▇▁▁

Taken together, we have the following:

Note
  • A Data Summary: it lists the dimensions of the mpg dataset: 234 rows and 11 columns. 6 columns are character formatted, the remaining 5 are numeric. The dataset is not “grouped” (more on this later).

  • The second part of the output shows a table with the character variables which are therefore factor variables with levels.

  • The third part shows a table listing the names and summary stats for the numerical variables. We have mean, sd, all the quantiles (p0, p25, p50 (median), p75 and p100 percentiles) and a neat little histogram for each. From the histograms we can see that year is two-valued, cyl takes only a few discrete values, and cty and hwy are continuous… Again, check that this is as you intend them to be, and modify the encoding if needed.
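We can confirm these impressions with quick counts (a sketch; the exact tallies depend on the mpg version shipped with ggplot2):

```r
library(tidyverse)

mpg %>% count(year) # only two model years: 1999 and 2008
mpg %>% count(cyl)  # a handful of discrete cylinder counts
```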

We get very similar output from mosaic::inspect():

inspect(mpg)

categorical variables:  
          name     class levels   n missing
1 manufacturer character     15 234       0
2        model character     38 234       0
3        trans character     10 234       0
4          drv character      3 234       0
5           fl character      5 234       0
6        class character      7 234       0
                                   distribution
1 dodge (15.8%), toyota (14.5%) ...            
2 caravan 2wd (4.7%) ...                       
3 auto(l4) (35.5%), manual(m5) (24.8%) ...     
4 f (45.3%), 4 (44%), r (10.7%)                
5 r (71.8%), p (22.2%), e (3.4%) ...           
6 suv (26.5%), compact (20.1%) ...             

quantitative variables:  
   name   class    min     Q1 median     Q3  max        mean       sd   n
1 displ numeric    1.6    2.4    3.3    4.6    7    3.471795 1.291959 234
2  year integer 1999.0 1999.0 2003.5 2008.0 2008 2003.500000 4.509646 234
3   cyl integer    4.0    4.0    6.0    8.0    8    5.888889 1.611534 234
4   cty integer    9.0   14.0   17.0   19.0   35   16.858974 4.255946 234
5   hwy integer   12.0   18.0   24.0   27.0   44   23.440171 5.954643 234
  missing
1       0
2       0
3       0
4       0
5       0
Note

We see that the output of mosaic::inspect() is organized as follows:

  • There are two dataframes/tables in the output, one describing the Qualitative Variables and the other describing the Quantitative Variables.
  • In the table describing the Qual variables, we have:
    • name: Name of the variable in the (parent) dataset. i.e Column Names
    • class: format of that column
    • levels: All these variables are factors, with their levels shown here. For example, manufacturer has 15 levels, across 234 rows

inspect also conveniently shows how much data is missing and in which variables. This is a very important consideration in the use of the data for analytics purposes.

We can save and see the outputs separately:

mpg_describe <- inspect(mpg)
mpg_describe$categorical
mpg_describe$quantitative

Data Dictionary and Munging

Using skim/inspect/glimpse, we can put together a (brief) data dictionary as follows:

NoteQualitative Data
  • model(chr): Car model name
  • manufacturer(chr): Car maker name
  • fl(chr): fuel type
  • drv(chr): type of drive(front, rear, 4W)
  • class(chr): type of vehicle ( sedan, pickup…)
  • trans(chr): type of transmission ( auto, manual..)
NoteQuantitative Data
  • hwy(int): Highway Mileage
  • cty(int): City Mileage
  • cyl(int): Number of Cylinders. How do we understand this variable? Should this be Qual?
  • displ(dbl): Engine piston displacement
  • year(int): Year of model

We see that there are certain variables that must be converted to factors for analytics purposes, since they are unmistakably Qualitative in nature. Let us do that now, for use later:

mpg_modified <- mpg %>%
  dplyr::mutate(
    cyl = as_factor(cyl),
    fl = as_factor(fl),
    drv = as_factor(drv),
    class = as_factor(class),
    trans = as_factor(trans)
  )
glimpse(mpg_modified)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <fct> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <fct> auto(l5), manual(m5), manual(m6), auto(av), auto(l5), man…
$ drv          <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <fct> p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, r, …
$ class        <fct> compact, compact, compact, compact, compact, compact, com…
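A quick sanity check on the conversion; this minimal sketch re-creates just the cyl conversion so the chunk stands alone:

```r
library(tidyverse)

mpg_check <- mpg %>% mutate(cyl = as_factor(cyl))

levels(mpg_check$cyl)    # the distinct cylinder counts, now factor levels
mpg_check %>% count(cyl) # observations per level
```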

Case Study-2

Instead of taking a “built-in” dataset , i.e. one that is part of an R package that we can load with library(), let us try the above process with a data set that we obtain from the internet. We will use this superb repository of datasets created by Vincent Arel-Bundock: https://vincentarelbundock.github.io/Rdatasets/articles/data.html

Let us choose a modest-sized dataset, say this dataset on Doctor Visits, which is available online https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv and read it into R.

ImportantReading external data into R

The read_csv() command from the R package readr allows us to read both locally saved data on our hard disk and data available in a shared folder online. Avoid using read.csv() from base R, though it will show up in your code auto-complete set of options!

# From Vincent Arel-Bundock's dataset website
# https://vincentarelbundock.github.io/Rdatasets
#
# read_csv can read data directly from the net
# Don't use read.csv()
docVisits <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
Rows: 5190 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
dbl (7): rownames, visits, age, income, illness, reduced, health

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

So: a data frame containing 5,190 observations on 12 variables (the first column, rownames, is just a row identifier).

NoteHow about a locally stored CSV file?

We can also use a locally downloaded and stored CSV file. Assuming the file is stored in a subfolder called data inside your R project folder, we can proceed as follows:

```{r}
#| eval: false
docVisits <- read_csv("data/DoctorVisits.csv")
```

Let us quickly report the data itself, as in a real report. Note that we can use the features of the kableExtra package to dress up this table too!!

docVisits %>%
  head(10) %>%
  kbl(
    caption = "Doctor Visits Dataset",
    # Add Human Readable Names if desired
    # col.names(..names that you may want..)
  ) %>%
  kable_styling(
    bootstrap_options = c(
      "striped", "hover",
      "condensed", "responsive"
    ),
    full_width = F, position = "center"
  )
Doctor Visits Dataset
rownames visits gender age income illness reduced health private freepoor freerepat nchronic lchronic
1 1 female 0.19 0.55 1 4 1 yes no no no no
2 1 female 0.19 0.45 1 2 1 yes no no no no
3 1 male 0.19 0.90 3 0 0 no no no no no
4 1 male 0.19 0.15 1 0 0 no no no no no
5 1 male 0.19 0.45 2 5 1 no no no yes no
6 1 female 0.19 0.35 5 1 9 no no no yes no
7 1 female 0.19 0.55 4 0 2 no no no no no
8 1 female 0.19 0.15 3 0 6 no no no no no
9 1 female 0.19 0.65 2 0 5 yes no no no no
10 1 male 0.19 0.15 1 0 0 yes no no no no

Examine the Data

  • Using dplyr::glimpse()
  • Using skimr::skim()
  • Using mosaic::inspect()
  • web-r
glimpse(docVisits)
Rows: 5,190
Columns: 13
$ rownames  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ visits    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ gender    <chr> "female", "female", "male", "male", "male", "female", "femal…
$ age       <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income    <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness   <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced   <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health    <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
$ private   <chr> "yes", "yes", "no", "no", "no", "no", "no", "no", "yes", "ye…
$ freepoor  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ freerepat <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ nchronic  <chr> "no", "no", "no", "no", "yes", "yes", "no", "no", "no", "no"…
$ lchronic  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
NoteDescriptive Stat Summary from dplyr::glimpse()

Very crisp output, giving us the size of the dataset (5190 X 13) and the nature of the variable columns, along with their first few entries. There are several Quantitative variables: visits, age, income, illness, reduced and health; the rest seem to be Qualitative variables.

Always document your data with variable descriptions when you share it: a data dictionary!

skim(docVisits) %>% kbl()
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character gender 0 1 4 6 0 2 0 NA NA NA NA NA NA NA NA
character private 0 1 2 3 0 2 0 NA NA NA NA NA NA NA NA
character freepoor 0 1 2 3 0 2 0 NA NA NA NA NA NA NA NA
character freerepat 0 1 2 3 0 2 0 NA NA NA NA NA NA NA NA
character nchronic 0 1 2 3 0 2 0 NA NA NA NA NA NA NA NA
character lchronic 0 1 2 3 0 2 0 NA NA NA NA NA NA NA NA
numeric rownames 0 1 NA NA NA NA NA 2595.5000000 1498.3682792 1.00 1298.25 2595.50 3892.75 5190.00 ▇▇▇▇▇
numeric visits 0 1 NA NA NA NA NA 0.3017341 0.7981338 0.00 0.00 0.00 0.00 9.00 ▇▁▁▁▁
numeric age 0 1 NA NA NA NA NA 0.4063854 0.2047818 0.19 0.22 0.32 0.62 0.72 ▇▂▁▂▅
numeric income 0 1 NA NA NA NA NA 0.5831599 0.3689067 0.00 0.25 0.55 0.90 1.50 ▇▆▅▅▂
numeric illness 0 1 NA NA NA NA NA 1.4319846 1.3841524 0.00 0.00 1.00 2.00 5.00 ▇▂▂▁▁
numeric reduced 0 1 NA NA NA NA NA 0.8618497 2.8876284 0.00 0.00 0.00 0.00 14.00 ▇▁▁▁▁
numeric health 0 1 NA NA NA NA NA 1.2175337 2.1242665 0.00 0.00 0.00 2.00 12.00 ▇▁▁▁▁
Note: Descriptive Stat Summary from skimr::skim()
  • A Data Summary: it lists the dimensions of the docVisits dataset: 5190 rows and 13 columns. 6 columns are character formatted, the remaining 7 are numeric. The dataset is not “grouped” (more on this later).

  • The second part of the output is a table describing the character variables; these are effectively factor variables with levels.

  • The third part shows a table listing the names and summary stats for the numerical variables. We have mean, sd, all the quantiles (p0, p25, p50(median), p75 and p100 percentiles) and a neat little histogram for each.

  • Can we consider the health Goldberg score a Qualitative variable, to be understood as “ranks” between a minimum and maximum? It is just possible…
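If we do want to treat health as "ranks", a minimal sketch, assuming docVisits is loaded and dplyr is attached (health_rank is a hypothetical name):

```r
# Sketch: treat the Goldberg health score as an ordered factor
# ("ranks" between a minimum and maximum), then count each rank.
docVisits %>%
  mutate(health_rank = factor(health, ordered = TRUE)) %>%
  count(health_rank)
```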

inspect(docVisits)

categorical variables:  
       name     class levels    n missing
1    gender character      2 5190       0
2   private character      2 5190       0
3  freepoor character      2 5190       0
4 freerepat character      2 5190       0
5  nchronic character      2 5190       0
6  lchronic character      2 5190       0
                                   distribution
1 female (52.1%), male (47.9%)                 
2 no (55.7%), yes (44.3%)                      
3 no (95.7%), yes (4.3%)                       
4 no (79%), yes (21%)                          
5 no (59.7%), yes (40.3%)                      
6 no (88.3%), yes (11.7%)                      

quantitative variables:  
      name   class  min      Q1  median      Q3     max         mean
1 rownames numeric 1.00 1298.25 2595.50 3892.75 5190.00 2595.5000000
2   visits numeric 0.00    0.00    0.00    0.00    9.00    0.3017341
3      age numeric 0.19    0.22    0.32    0.62    0.72    0.4063854
4   income numeric 0.00    0.25    0.55    0.90    1.50    0.5831599
5  illness numeric 0.00    0.00    1.00    2.00    5.00    1.4319846
6  reduced numeric 0.00    0.00    0.00    0.00   14.00    0.8618497
7   health numeric 0.00    0.00    0.00    2.00   12.00    1.2175337
            sd    n missing
1 1498.3682792 5190       0
2    0.7981338 5190       0
3    0.2047818 5190       0
4    0.3689067 5190       0
5    1.3841524 5190       0
6    2.8876284 5190       0
7    2.1242665 5190       0
Note: Descriptive Stat Summary from mosaic::inspect()

We see that the output of mosaic::inspect() is organized very similarly to the output from skim. Is there any missing data? Both skim and mosaic report on the data completion for each variable in the dataset.
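We can also check completeness directly ourselves. A minimal sketch, assuming docVisits is loaded and dplyr is attached:

```r
# Count the NA entries in every column of docVisits;
# all-zero counts confirm there is no missing data.
docVisits %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
```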

docVisits %>% glimpse()
docVisits %>% inspect()
docVisits %>% skim()

Data Dictionary

Variable Description
visits Number of doctor visits in past 2 weeks.
gender Factor indicating gender.
age Age in years divided by 100.
income Annual income in tens of thousands of dollars.
illness Number of illnesses in past 2 weeks.
reduced Number of days of reduced activity in past 2 weeks due to illness or injury.
health General health questionnaire score using Goldberg’s method.
private Factor. Does the individual have private health insurance?
freepoor Factor. Does the individual have free government health insurance due to low income?
freerepat Factor. Does the individual have free government health insurance due to old age, disability or veteran status?
nchronic Factor. Is there a chronic condition not limiting activity?
lchronic Factor. Is there a chronic condition limiting activity?

Here too, we should convert the variables that are obviously Qualitative into factors, ordered or otherwise:

docVisits_modified <- docVisits %>%
  mutate(
    gender = as_factor(gender),
    private = as_factor(private),
    freepoor = as_factor(freepoor),
    freerepat = as_factor(freerepat),
    nchronic = as_factor(nchronic),
    lchronic = as_factor(lchronic)
  )
docVisits_modified
rownames visits gender  age  income illness reduced health private freepoor
1        1      female  0.19 0.55   1       4       1      yes     no
2        1      female  0.19 0.45   1       2       1      yes     no
3        1      male    0.19 0.90   3       0       0      no      no
4        1      male    0.19 0.15   1       0       0      no      no
5        1      male    0.19 0.45   2       5       1      no      no
6        1      female  0.19 0.35   5       1       9      no      no
7        1      female  0.19 0.55   4       0       2      no      no
8        1      female  0.19 0.15   3       0       6      no      no
9        1      female  0.19 0.65   2       0       5      yes     no
10       1      male    0.19 0.15   1       0       0      yes     no
1-10 of 5,190 rows | 1-10 of 13 columns
docVisits_modified <-  docVisits %>% 
  mutate(gender = as_factor(gender),
         private = as_factor(private),
         freepoor = as_factor(freepoor),
         freerepat = as_factor(freerepat),
         nchronic = as_factor(nchronic),
         lchronic = as_factor(lchronic))
glimpse(docVisits_modified)

Groups and Counts of Qualitative Variables

What is the most important dialogue uttered in the movie “Sholay”?

Recall our discussion in Types of Data Variables. We have looked at means, limits, and percentiles of Quantitative variables. Another good way to examine a dataset is to look at counts, proportions, and frequencies of its Qualitative variables.

We typically do this with the dplyr package from the tidyverse.

  • mpg dataset
  • docVisits dataset
  • web-r
mpg_modified %>% dplyr::count(cyl)
cyl  n
4   81
5    4
6   79
8   70
4 rows
mpg_modified %>% mosaic::count(drv) # does the same thing! Counts!
drv   n
f   106
4   103
r    25
3 rows
mpg_modified %>% count(fl)
fl    n
p    52
r   168
e     8
d     5
c     1
5 rows
### All combinations of the factor variables (cyl, trans, drv, fl, class)
### Overwhelming??
mpg_modified %>%
  count(across(where(is.factor)))
cyl trans      drv fl class       n
4   auto(l5)   f   p  compact     1
4   auto(l5)   f   p  midsize     1
4   auto(l5)   f   r  midsize     1
4   auto(l5)   f   r  subcompact  1
4   auto(l5)   f   c  subcompact  1
4   auto(l5)   4   p  compact     1
4   manual(m5) f   p  compact     1
4   manual(m5) f   p  midsize     1
4   manual(m5) f   p  subcompact  1
4   manual(m5) f   r  compact     7
1-10 of 116 rows
Note: Business Insights from Groups and Counts (mpg)
  • We see that the groups for each level of cyl, drv, and fl are not the same size: for instance the group with the “r”-level fl type is largest at 168 observations, and the “r”-level in drv has only 25 observations.

  • Group counts based on cyl are also not balanced.

  • Counting all combinations of the factor variables shows counts of 1 for several combinations, and does not lead to any decent amount of aggregation.

These aspects may need to be factored into downstream modelling or machine learning tasks. (Usually by stratification wrt levels of the Qualitative variables)

Here the levels are not too many, so tables work, and so would bar charts, which we will examine next. If a factor has too many levels, tables become unwieldy; bar charts can still be plotted, but it may be preferable to lump the smaller categories/levels together (using the forcats package).
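A sketch of such lumping, assuming mpg_modified from earlier and the forcats package (part of the tidyverse); fl_lumped is a hypothetical name:

```r
# Keep the two most frequent fuel types and lump the rest
# into an "Other" level before counting (or charting).
mpg_modified %>%
  mutate(fl_lumped = forcats::fct_lump_n(fl, n = 2)) %>%
  count(fl_lumped)
```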

## Counting by the obvious factor variables
docVisits %>% count(gender)
gender     n
female  2702
male    2488
2 rows
docVisits %>% count(private)
private    n
no      2892
yes     2298
2 rows
docVisits %>% count(freepoor)
freepoor    n
no       4968
yes       222
2 rows
docVisits %>% count(freerepat)
freerepat    n
no        4099
yes       1091
2 rows
docVisits %>% count(lchronic)
lchronic    n
no       4585
yes       605
2 rows
docVisits %>% count(nchronic)
nchronic    n
no       3098
yes      2092
2 rows
# Now for all Combinations...
# Maybe too much to digest...
docVisits %>% count(across(where(is.character)))
gender private freepoor freerepat nchronic
female no      no       no        no
female no      no       no        no
female no      no       no        yes
female no      no       yes       no
female no      no       yes       no
female no      no       yes       yes
female no      yes      no        no
female no      yes      no        no
female no      yes      no        yes
female yes     no       no        no
1-10 of 24 rows | 1-5 of 7 columns (apparent duplicate rows differ in the hidden lchronic and n columns)
# Shall we try counting by some variables that might be factors?
# Even if they are labeled as <dbl>?
#
docVisits %>% count(illness)
illness    n
0       1554
1       1638
2        946
3        542
4        274
5        236
6 rows
docVisits %>% count(health)
health    n
0      3026
1       823
2       446
3       273
4       187
5       132
6       104
7        61
8        42
9        32
1-10 of 13 rows
Note: Business Insights from Groups and Counts (docVisits)
  • Most of the counts are roughly balanced across the levels of the factors; however, freepoor and lchronic show unbalanced counts…

  • The factors are too numerous for a combination count table to be very useful.

  • Counting by illness and health does show that these two columns take only a limited set of integer values across over 5000 rows!! So they can be treated as factors if needed in the analysis. Not every numeric-looking variable is necessarily quantitative!!
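Following that thought, a minimal sketch of recoding illness as a factor, assuming docVisits is loaded and dplyr is attached:

```r
# Since illness takes only the integer values 0-5,
# we can recode it as a factor when the analysis needs one.
docVisits %>%
  mutate(illness = factor(illness)) %>%
  count(illness)
```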

mpg_modified %>% mosaic::count(cyl)
docVisits_modified %>% count(illness)
docVisits_modified %>% count(health)
docVisits %>%

Groups and Summaries of Quantitative Variables

We saw that we could obtain numerical summary stats such as means, medians, quartiles, and maxima/minima of entire Quantitative variables, i.e., the complete column. However, we often need the same numerical summary stats for parts of a Quantitative variable. Why?

Note that we have Qualitative variables as well in a typical dataset. These Qual variables help us to group the entire dataset based on their combinations of levels. We can now think of summarizing Quant variables within each such group.

Let us work through these ideas for both our familiar datasets.

  • mpg dataset
  • docvisits dataset
  • web-r
mpg_modified %>%
  group_by(cyl) %>%
  summarize(average_hwy = mean(hwy), count = n())
cyl average_hwy count
4      28.80247    81
5      28.75000     4
6      22.82278    79
8      17.62857    70
4 rows
mpg_modified %>%
  group_by(cyl, fl) %>%
  summarize(average_hwy = mean(hwy), count = n())
cyl fl average_hwy count
4   p     27.81818    22
4   r     28.29091    55
4   d     43.00000     3
4   c     36.00000     1
5   r     28.75000     4
6   p     25.29412    17
6   r     22.23333    60
6   e     17.00000     1
6   d     22.00000     1
8   p     20.76923    13
1-10 of 13 rows
# Perhaps the best method for us!
mpg_modified %>%
  mosaic::favstats(hwy ~ cyl, data = .) # Don't use fav_stats with formula!!!
cyl min    Q1 median Q3 max     mean       sd  n missing
4    20 26.00     29 31  44 28.80247 4.515030 81       0
5    28 28.75     29 29  29 28.75000 0.500000  4       0
6    17 19.00     24 26  29 22.82278 3.685590 79       0
8    12 16.00     17 19  26 17.62857 3.262307 70       0
4 rows
# Be aware of the first column format here!
mpg_modified %>%
  mosaic::favstats(hwy ~ cyl + fl, data = .) # Don't use fav_stats with formula!!!
cyl.fl min    Q1 median Q3 max     mean       sd  n missing
4.p     23 26.25     29 29  31 27.81818 2.061815 22       0
5.p     NA    NA     NA NA  NA      NaN       NA  0       0
6.p     20 25.00     26 26  27 25.29412 1.571810 17       0
8.p     14 16.00     23 25  26 20.76923 4.603510 13       0
4.r     20 26.00     28 31  37 28.29091 3.951899 55       0
5.r     28 28.75     29 29  29 28.75000 0.500000  4       0
6.r     17 19.00     23 26  29 22.23333 3.819516 60       0
8.r     15 16.00     17 18  23 17.51020 1.894317 49       0
4.e     NA    NA     NA NA  NA      NaN       NA  0       0
5.e     NA    NA     NA NA  NA      NaN       NA  0       0
1-10 of 20 rows
Note: Business Insights from Grouped Quant Summaries (mpg)
  • We have considerable variation of mean hwy mileage over cyl, though the groups/levels are quite imbalanced. This is of course a small dataset.
  • The number of groups is large enough (>> 7!) to warrant a chart, which we will make in our next module on Distributions.
  • Mean hwy mileage also varies considerably over combinations of cyl and fl. Some combinations are non-existent, hence we see “NA” and “NaN” in the output of mosaic::favstats. This points to the possible existence of a cyl:fl interaction effect when modelling mileage.
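To eyeball that possible cyl:fl interaction, a base-R sketch of the cross-table of group means (the empty combinations show up as NA):

```r
# Mean highway mileage for every cyl-by-fl combination;
# cells with no observations are reported as NA.
with(mpg, tapply(hwy, list(cyl = cyl, fl = fl), mean))
```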
docVisits_modified %>%
  group_by(gender) %>%
  summarize(average_visits = mean(visits), count = n())
gender average_visits count
female      0.3619541  2702
male        0.2363344  2488
2 rows
##
docVisits_modified %>%
  group_by(freepoor, nchronic) %>%
  summarise(
    mean_income = mean(income),
    average_visits = mean(visits),
    count = n()
  )
freepoor nchronic mean_income average_visits count
no       no         0.6236097     0.28317980  2931
no       yes        0.5556652     0.34413353  2037
yes      no         0.3108383     0.07784431   167
yes      yes        0.2727273     0.40000000    55
4 rows
##
docVisits_modified %>%
  mosaic::favstats(income ~ gender, data = .) # Don't use fav_stats with formula!!!
gender min   Q1 median   Q3
female   0 0.25   0.35 0.65
male     0 0.35   0.65 0.90
2 rows | 1-5 of 10 columns
##
docVisits_modified %>%
  mosaic::favstats(income ~ freepoor + nchronic, data = .) # Don't use fav_stats with formula!!!
freepoor.nchronic min   Q1 median   Q3
no.no               0 0.25   0.55 0.90
yes.no              0 0.15   0.25 0.45
no.yes              0 0.25   0.45 0.75
yes.yes             0 0.15   0.25 0.35
4 rows | 1-5 of 10 columns
Note: Business Insights from Grouped Quant Summaries (docVisits)

Clearly the people who are freepoor (on Govt insurance) AND have a chronic condition are those with lower average income and a higher average number of visits to the doctor… but there are relatively few of them (n = 55) in this dataset.


More on dplyr

The dplyr package is capable of doing much more than just count, group_by, and summarize. We will encounter this package many more times as we build our intuition about data visualization. A full tutorial on dplyr is linked below:

dplyr Tutorial
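As a small taste of those other verbs, a sketch chaining filter, select, mutate, and arrange on the mpg data (mileage_gap is a hypothetical name):

```r
# Keep only compact cars, retain a few columns, create a new
# variable for the city-to-highway mileage gap, and sort by it.
mpg %>%
  filter(class == "compact") %>%
  select(manufacturer, model, cty, hwy) %>%
  mutate(mileage_gap = hwy - cty) %>%
  arrange(desc(mileage_gap))
```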

Reporting Tables for Data and the Data Schema

Important: Data and the Data Schema are Different!!

Note that all three methods (dplyr::glimpse(), skimr::skim(), and mosaic::inspect()) report the schema of the original dataframe. The schemas are themselves formatted as data frames! However, they do not “contain” the original data! Do not confuse the data with its reported schema!

As stated earlier, it is usually a good idea to make crisp, business-like tables, both of the data itself and of the schema as revealed by the output of any of the three methods (glimpse/skim/inspect) presented above. There are many table-making methods in R; one of the simplest and most effective is to use the kable set of commands from the knitr and kableExtra packages that we have installed already:

mpg %>%
  head(10) %>%
  kbl(
    col.names = c(
      "Manufacturer", "Model", "Engine\nDisplacement",
      "Model\n Year", "Cylinders", "Transmission",
      "Drivetrain", "City\n Mileage", "Highway\n Mileage",
      "Fuel", "Class\nOf\nVehicle"
    ),
    longtable = FALSE, centering = TRUE,
    caption = "MPG Dataset"
  ) %>%
  kable_styling(
    bootstrap_options = c(
      "striped", "hover",
      "condensed", "responsive"
    ),
    full_width = F, position = "center"
  )
MPG Dataset
Manufacturer Model Engine Displacement Model Year Cylinders Transmission Drivetrain City Mileage Highway Mileage Fuel Class Of Vehicle
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
audi a4 3.1 2008 6 auto(av) f 18 27 p compact
audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28 p compact

And for the schema from skim(), with some extra bells and whistles on the table:

skim(mpg) %>%
  kbl(align = "c", caption = "Skim Output for mpg Dataset") %>%
  kable_paper(full_width = F)
Skim Output for mpg Dataset
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
character manufacturer 0 1 4 10 0 15 0 NA NA NA NA NA NA NA NA
character model 0 1 2 22 0 38 0 NA NA NA NA NA NA NA NA
character trans 0 1 8 10 0 10 0 NA NA NA NA NA NA NA NA
character drv 0 1 1 1 0 3 0 NA NA NA NA NA NA NA NA
character fl 0 1 1 1 0 5 0 NA NA NA NA NA NA NA NA
character class 0 1 3 10 0 7 0 NA NA NA NA NA NA NA NA
numeric displ 0 1 NA NA NA NA NA 3.471795 1.291959 1.6 2.4 3.3 4.6 7 ▇▆▆▃▁
numeric year 0 1 NA NA NA NA NA 2003.500000 4.509646 1999.0 1999.0 2003.5 2008.0 2008 ▇▁▁▁▇
numeric cyl 0 1 NA NA NA NA NA 5.888889 1.611535 4.0 4.0 6.0 8.0 8 ▇▁▇▁▇
numeric cty 0 1 NA NA NA NA NA 16.858974 4.255946 9.0 14.0 17.0 19.0 35 ▆▇▃▁▁
numeric hwy 0 1 NA NA NA NA NA 23.440171 5.954643 12.0 18.0 24.0 27.0 44 ▅▅▇▁▁

See https://haozhu233.github.io/kableExtra/ for more options on formatting the table with kableExtra.

A Quick Quiz

Warning

It is always a good idea to look for variables in data that may be incorrectly formatted. For instance, a variable marked as numerical may take only the values 1-2-3-4, which represent options, sizes, or say months, in which case it would have to be interpreted as a factor.
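One quick screen for such disguised factors, as a sketch assuming dplyr is attached: count the distinct values in each numeric column, and give suspiciously small counts a second look.

```r
# Number of distinct values in each numeric column of mpg;
# very low counts hint at a qualitative variable in disguise.
mpg %>%
  summarise(across(where(is.numeric), dplyr::n_distinct))
```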

Let us take a small test with the mpg dataset:

  • What is the number of qualitative/categorical variables in the mpg data?

  • How many manufacturers are named in this dataset?

  • How many levels does the variable drv have?

  • How many quantitative/numerical variables shown in the mpg data?

  • But the variable ____ is actually a qualitative variable.

Your Turn

  1. Try your hand at these datasets. Look at the data, state the data dictionary, contemplate a few Research Questions and answer them with Summaries and Tables in Quarto!

  2. Try adding more summary functions to the summary table. Which might you choose? Why?

Note: Star Trek Books

Which would be the Group By variables here? And what would you summarize? With which function?

Note: Math Anxiety! Hah!

Wait, But Why?

  • Data Summaries give you the essentials, without getting bogged down in the details (just yet).
  • Summaries help you “live with your data”; this is an important step in understanding it, and deciding what to do with it.
  • Summaries help evoke Questions and Hypotheses, which may lead to inquiries, analysis, and insights
  • Grouped Summaries should tell you if:
    • counts of groups in your target audience are lopsided/imbalanced: go and get your data again;
    • there are visible differences in Quant data across groups, so your target audience could be nicely fractured;
    • etc.

Conclusion

  • The three methods (glimpse/skim/inspect) given here give us a very comprehensive look into the structure of the dataset.
  • The favstats method allows us to compute a whole lot of metrics for Quant variables for each level of one or more Qual variables.
  • Use the kable set of commands to make a smart-looking table of the data and of the outputs of any of the three methods.

Make these part of your Workflow.

AI Generated Summary and Podcast

This is a tutorial on using the R programming language to perform descriptive statistical analysis on data sets. The tutorial focuses on summarizing data using various R packages like dplyr, skimr, and mosaic. It emphasizes the importance of understanding the data’s structure, identifying different types of variables (qualitative and quantitative), and calculating summary statistics such as means, medians, and frequencies. The tutorial provides examples using real datasets and highlights the significance of data summaries in gaining initial insights, formulating research questions, and identifying potential issues with the data.


References

  1. Lock, Lock, Lock, Lock, and Lock. (2021). Statistics: Unlocking the Power of Data, 3rd Edition. https://media.wiley.com/product_data/excerpt/69/11196821/1119682169-32.pdf
R Package Citations
Package Version Citation
mosaic 1.9.1 Pruim, Kaplan, and Horton (2017)
palmerpenguins 0.1.1 Horst, Hill, and Gorman (2020)
skimr 2.1.5 Waring et al. (2022)
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.
Pruim, Randall, Daniel T Kaplan, and Nicholas J Horton. 2017. “The Mosaic Package: Helping Students to ‘Think with Data’ Using r.” The R Journal 9 (1): 77–102. https://journal.r-project.org/archive/2017/RJ-2017-024/index.html.
Stigler, Stephen M. 2016. “The Seven Pillars of Statistical Wisdom,” March. https://doi.org/10.4159/9780674970199.
Waring, Elin, Michael Quinn, Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, and Shannon Ellis. 2022. skimr: Compact and Flexible Summaries of Data. https://doi.org/10.32614/CRAN.package.skimr.

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {Summaries},
  date = {2023-10-15},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/10-FavStats/},
  langid = {en},
  abstract = {Bill Gates walked into a bar, and everyone’s salary went
    up on average.}
}
For attribution, please cite this work as:
V., Arvind. 2023. “Summaries.” October 15, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/10-FavStats/.

License: CC BY-SA 2.0
