Summaries
Throwing away data to grasp it
Using web-R
This tutorial uses web-r
that allows you to run all code within your browser, on all devices. Most code chunks herein are formatted in a tabbed structure (like in an old-fashioned library) with duplicated code. The tabs in front have regular R code that will work when copy-pasted in your RStudio session. The tab “behind” has the web-R
code that can work directly in your browser, and can be modified as well. The R code is also there to make sure you have original code to go back to, when you have made several modifications to the code on the web-r
tabs and need to compare your code with the original!
Keyboard Shortcuts
- Run selected code using either:
- macOS: ⌘ + ↩︎/Return
- Windows/Linux: Ctrl + ↩︎/Enter
- Run the entire code by clicking the “Run code” button or pressing Shift+↩︎.
“The best attitude to adopt is one of compassionate patience, which has to include a tolerance for failure.”
— Gabor Maté
Setting up R Packages
How do we Grasp Data?
We spoke of Experiments and Data Gathering in the first module Nature of Data. This helped us to obtain data.
As we discussed in that same Module, for us to grasp the significance of the data, we need to describe it; the actual data is usually too vast for us to comprehend in its entirety. Anything more than a handful of observations in a dataset is enough for us to require other ways of grasping it.
The first thing we need to do, therefore, is to reduce it to a few salient numbers that allow us to summarize the data.
Such a reduction may seem paradoxical but is one of the important tenets of statistics: reduction, while taking away information, ends up adding to insight.
Steven Stigler (2016) is the author of the book “The Seven Pillars of Statistical Wisdom”. One of the Big Ideas in Statistics from that book is: Aggregation
The first pillar I will call Aggregation, although it could just as well be given the nineteenth-century name, “The Combination of Observations,” or even reduced to the simplest example, taking a mean. Those simple names are misleading, in that I refer to an idea that is now old but was truly revolutionary in an earlier day—and it still is so today, whenever it reaches into a new area of application. How is it revolutionary? By stipulating that, given a number of observations, you can actually gain information by throwing information away! In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.
Let us get some inspiration from Brad Pitt, from the movie Moneyball, which is about applying Data Analytics to the game of baseball.
And then, an example from a more sombre story:
Year | Below Level #1 | Level #1 | Level #2 | Level #3 | Levels #4 and #5 |
---|---|---|---|---|---|
Number in millions (2012/2014) | 8.35 | 26.49 | 65.10 | 71.41 | 26.57 |
Number in millions (2017) | 7.59 | 29.23 | 66.07 | 68.81 | 26.75 |
Note: | |||||
SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014. |
This ghastly-looking Table 1 examines U.S. adults with low English literacy and numeracy skills—or low-skilled adults—at two points in the 2010s, in the years 2012/20141 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). As can be seen the summary table is quite surprising in absolute terms, for a developed country like the US, and the numbers have increased from 2012/2014 to 2017!
So why do we need to summarise data? Summarization is an act of throwing away data to make more sense, as stated by (Stigler 2016) and also in the movie by Brad Pitt aka Billy Beane. To summarize is to understand. Add to that the fact that our Working Memories can hold maybe 7 items, so it means information retention too.
And if we don’t summarise? Jorge Luis Borges, in a fantasy short story published in 1942, titled “Funes the Memorious,” he described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.” (emphasis mine)
Aggregation can yield great gains above the individual components in data. Funes was big data without Statistics.
What graphs / numbers will we see today?
Variable #1 | Variable #2 | Chart Names | “Chart Shape” |
---|---|---|---|
All | All | Tables and Stat Measures |
|
Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:
- Dataset length: How many rows/observations?
- Dataset breadth: How many columns/variables?
- How many Quant variables?
- How many Qual variables?
- Quant variables: min, max, mean, median, sd
- Qual variables: levels, counts per level
- Both: means, medians for each level of a Qual variable…
What kind of Data Variables will we choose?
No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
---|---|---|---|---|---|
1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value.Differences and Ratios /Products are meaningful. | Quantitative/Ratio | Length,Height,Temperature in Kelvin,Activity,Dose Amount,Reaction Rate,Flow Rate,Concentration,Pulse,Survival Rate | Correlation |
2 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities with Scale. Differences are meaningful, but not products or ratios | Quantitative/Interval | pH,SAT score(200-800),Credit score(300-850),SAT score(200-800),Year of Starting College | Mean,Standard Deviation |
3 | How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some " order" ( e.g. good, better, improved, best..) | Qualitative/Ordinal | Socioeconomic status (Low income, Middle income, High income),Education level (HighSchool, BS, MS, PhD),Satisfaction rating(Very much Dislike, Dislike, Neutral, Like, Very Much Like) | Median,Percentile |
4 | What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative/Nominal | Name | Count no. of cases,Mode |
We will obviously choose all variables in the dataset, unless they are unrelated ones such as row number
or ID
which (we think) may not contribute any information and we can disregard.
How do these Summaries Work?
Quant variables: Inspecting the min
, max
, mean
, median
, variance
and sd
of each of the Quant variables tells us straightaway what the ranges of the variables are, and if there are some outliers, which could be normal, or maybe due to data entry error! Comparing two Quant variables for their ranges also tells us that we may have to \(scale/normalize\) them for computational ease, if one variable has large numbers and the other has very small ones.
Qual variables: With Qual variables, we understand the levels
within each, and understand the total number of combinations of the levels across these. Counts
across levels, and across combinations of levels tells us whether the data has sufficient readings for graphing, inference, and decision-making, of if certain levels/classes of data are under or over represented.
Together?: We can use Quant and Qual together, to develop the above summaries (min
, max
,mean
, median
and sd
) for Quant variables, again across levels, and across combinations of levels of single or multiple Quals, along with counts
if we are interested in that.
For both types of variables, we need to keep an eye open for data entries that are missing! This may point to data gathering errors, which may be fixable. Or we will have to take a decision to let go of that entire observation (i.e. a row). Or even do what is called imputation to fill in values that are based on the other values in the same column, which sounds like we are making up data, but isn’t so really.
And this may also tell us if we are witnessing a Simpson’s Paradox situation. You may have to decide on what to do with this data sparseness, or just check your biases!
Some Quick Definitions
The sample mean, or average, of a Quantitative data variable can be calculated as the sum of the observed values divided by the number of observations:
\[ mean = \bar{x} = \frac{x_1 + x_2+ x_3....+x_n}{n} \]
Observations can be on either side of the mean, naturally. To measure the extent of these differences, we square and sum the differences between individual values and their mean, and take their average to obtain the (sample) variance
:
\[ variance = s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_2 - \bar{x})^2 +...(x_n - \bar{x})^2 +}{n-1} \] The standard deviation \(s\) is just the square root of the variance.
(The \(n-1\) is a mathematical nuance to allow for the fact that we have used the data to calculate the mean before we get to \(s^2\), and hence have “used up” one degree of randomness in the data. It gets us more robust results.)
When the observations in a Quant variable are placed in order of their maginitude, the observation in tke middle is the median
. Half the observations are below, and half are above, the median
Case Study-1
We will first use a dataset mpg
that is available in R as part of one of the R packages that we have loaded with the library()
command.
Examine the Data
It is usually a good idea to make crisp business-like tables, for the data itself, and the schema as revealed by one of the outputs of the three methods to be presented below. There are many methods to do this; one of the simplest and effective ones is to use the kable
set of commands from the knitr
and kableExtra
packagepackage:
mpg %>%
head(10) %>%
kbl(
# add Human Readable column names
col.names = c(
"Manufacturer", "Model", "Engine\nDisplacement",
"Model\n Year", "Cylinders", "Transmission",
"Drivetrain", "City\n Mileage", "Highway\n Mileage",
"Fuel", "Class\nOf\nVehicle"
),
caption = "MPG Dataset"
) %>%
kable_styling(
bootstrap_options = c(
"striped", "hover",
"condensed", "responsive"
),
full_width = F, position = "center"
)
Manufacturer | Model | Engine Displacement | Model Year | Cylinders | Transmission | Drivetrain | City Mileage | Highway Mileage | Fuel | Class Of Vehicle |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
Next we will look at a few favourite statistics or “favstats” that we can derive from data. R is full of packages that can provide very evocative and effective summaries of data. We will first start with the dplyr
package from the tidyverse, the skimr
package, then the mosaic
package. We will look at the summary outputs from these and learn how to interpret them.
The dplyr
package offers a convenient command called glimpse
:
glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
- Very crisp output, giving us the size of the dataset (234 X 11) and the nature of the variable columns, along with their first few entries.
- The
chr
variables are usually Categorical/Qualitative. - The
int
ordbl
(double precision) are usually Numerical/Quantitative. - But be careful! Verify that this is as per your intent, interpret the variables and modify their encoding as needed.
Let us look at mpg
using skimr::skim()
.
From the output of ?skimr
:
The format of the results are a single wide data frame combining the results, with some additional attributes and two metadata columns:
-
skim_variable
: name of the original variable -
skim_type
: class of the variable
We can use skim(dataset)
directly as shown below:
skimr::skim(mpg) # explicitly stating package name
Name | mpg |
Number of rows | 234 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
manufacturer | 0 | 1 | 4 | 10 | 0 | 15 | 0 |
model | 0 | 1 | 2 | 22 | 0 | 38 | 0 |
trans | 0 | 1 | 8 | 10 | 0 | 10 | 0 |
drv | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
fl | 0 | 1 | 1 | 1 | 0 | 5 | 0 |
class | 0 | 1 | 3 | 10 | 0 | 7 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
displ | 0 | 1 | 3.47 | 1.29 | 1.6 | 2.4 | 3.3 | 4.6 | 7 | ▇▆▆▃▁ |
year | 0 | 1 | 2003.50 | 4.51 | 1999.0 | 1999.0 | 2003.5 | 2008.0 | 2008 | ▇▁▁▁▇ |
cyl | 0 | 1 | 5.89 | 1.61 | 4.0 | 4.0 | 6.0 | 8.0 | 8 | ▇▁▇▁▇ |
cty | 0 | 1 | 16.86 | 4.26 | 9.0 | 14.0 | 17.0 | 19.0 | 35 | ▆▇▃▁▁ |
hwy | 0 | 1 | 23.44 | 5.95 | 12.0 | 18.0 | 24.0 | 27.0 | 44 | ▅▅▇▁▁ |
Taken together, we have the following:
A Data Summary: it lists the dimensions of the
mpg
dataset: 234 rows and 11 columns. 6 columns are character formatted, the remaining 5 are numeric. The dataset is not “grouped” (more on this later).The second part of the output shows a table with the
character
variables which are thereforefactor
variables withlevels
.The third part shows a table listing the names and summary stats for the
numerical
variables. We havemean
,sd
, all the quantiles (p0, p25, p50(median), p75 and p100 percentiles) and a neat little histogram for each. From the histogram we can see thatyear
is two-valued,cyl
is three-valued, andcty
andhwy
are continuous… Again check that this is as you intend them to be. We may need to modify the encoding if needed.
We get very similar output from mosaic::inspect()
:
inspect(mpg)
categorical variables:
name class levels n missing
1 manufacturer character 15 234 0
2 model character 38 234 0
3 trans character 10 234 0
4 drv character 3 234 0
5 fl character 5 234 0
6 class character 7 234 0
distribution
1 dodge (15.8%), toyota (14.5%) ...
2 caravan 2wd (4.7%) ...
3 auto(l4) (35.5%), manual(m5) (24.8%) ...
4 f (45.3%), 4 (44%), r (10.7%)
5 r (71.8%), p (22.2%), e (3.4%) ...
6 suv (26.5%), compact (20.1%) ...
quantitative variables:
name class min Q1 median Q3 max mean sd n
1 displ numeric 1.6 2.4 3.3 4.6 7 3.471795 1.291959 234
2 year integer 1999.0 1999.0 2003.5 2008.0 2008 2003.500000 4.509646 234
3 cyl integer 4.0 4.0 6.0 8.0 8 5.888889 1.611534 234
4 cty integer 9.0 14.0 17.0 19.0 35 16.858974 4.255946 234
5 hwy integer 12.0 18.0 24.0 27.0 44 23.440171 5.954643 234
missing
1 0
2 0
3 0
4 0
5 0
We see that the output of mosaic::inspect()
is organized as follows:
- There are two dataframes/tables in the output, one describing the Qualitative Variables and the other describing the Quantitative Variables.
- In the table describing the Qual variables, we have:
-
name
: Name of the variable in the (parent) dataset. i.e Column Names -
class
: format of that column -
levels
: All these variables are factors, with levels shown here. Some for example,manufacturer
has 15 levels, and there are 234 rows
-
inspect
also conveniently shows how much data is missing and in which variables. This is a very important consideration in the use of the data for analytics purposes.
We can save and see the outputs separately:
mpg_describe <- inspect(mpg)
mpg_describe$categorical
mpg_describe$quantitative
Data Dictionary and Munging
Using skim
/inspect
/glimpse
, we can put together a (brief) data dictionary as follows:
-
model
(chr): Car model name -
manufacturer
(chr): Car maker name -
fl
(chr): fuel type -
drv
(chr): type of drive(front, rear, 4W) -
class
(chr): type of vehicle ( sedan, pickup…) -
trans
(chr): type of transmission ( auto, manual..)
-
hwy
(int): Highway Mileage -
cty
(int): City Mileage -
cyl
(int): Number of Cylinders. How do we understand this variable? Should this be Qual? -
displ
(dbl): Engine piston displacement -
year
(int): Year of model
We see that there are certain variables that must be converted to factors for analytics purposes, since they are unmistakably Qualitative in nature. Let us do that now, for use later:
mpg_modified <- mpg %>%
dplyr::mutate(
cyl = as_factor(cyl),
fl = as_factor(fl),
drv = as_factor(drv),
class = as_factor(class),
trans = as_factor(trans)
)
glimpse(mpg_modified)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl <fct> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans <fct> auto(l5), manual(m5), manual(m6), auto(av), auto(l5), man…
$ drv <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl <fct> p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, p, r, …
$ class <fct> compact, compact, compact, compact, compact, compact, com…
Case Study-2
Instead of taking a “built-in” dataset , i.e. one that is part of an R package that we can load with library()
, let us try the above process with a data set that we obtain from the internet. We will use this superb repository of datasets created by Vincent Arel-Bundock: https://vincentarelbundock.github.io/Rdatasets/articles/data.html
Let us choose a modest-sized dataset, say this dataset on Doctor Visits
, which is available online https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv and read it into R.
The read_csv()
command from R package readr
allows us to read both locally saved data on our hard disk, or data available in a shared folder online. Avoid using the read.csv()
from base R , though it will show up in your code auto-complete set of options!
# From Vincent Arel-Bundock's dataset website
# https://vincentarelbundock.github.io/Rdatasets
#
# read_csv can read data directly from the net
# Don't use read.csv()
docVisits <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/AER/DoctorVisits.csv")
Rows: 5190 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, private, freepoor, freerepat, nchronic, lchronic
dbl (7): rownames, visits, age, income, illness, reduced, health
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
So, a data frame containing 5,190 observations on 12 variables.
We can also use a locally downloaded and stored CSV file. Assuming the file is stored in a subfolder called data
inside your R project
folder, we can proceed as follows:
```{r}
#| eval: false
docVisits <- read_csv("data/DoctorVisits.csv")
```
Let us quickly report the data itself, as in a real report. Note that we can use the features of the kableExtra
package to dress up this table too!!
docVisits %>%
head(10) %>%
kbl(
caption = "Doctor Visits Dataset",
# Add Human Readable Names if desired
# col.names(..names that you may want..)
) %>%
kable_styling(
bootstrap_options = c(
"striped", "hover",
"condensed", "responsive"
),
full_width = F, position = "center"
)
rownames | visits | gender | age | income | illness | reduced | health | private | freepoor | freerepat | nchronic | lchronic |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | female | 0.19 | 0.55 | 1 | 4 | 1 | yes | no | no | no | no |
2 | 1 | female | 0.19 | 0.45 | 1 | 2 | 1 | yes | no | no | no | no |
3 | 1 | male | 0.19 | 0.90 | 3 | 0 | 0 | no | no | no | no | no |
4 | 1 | male | 0.19 | 0.15 | 1 | 0 | 0 | no | no | no | no | no |
5 | 1 | male | 0.19 | 0.45 | 2 | 5 | 1 | no | no | no | yes | no |
6 | 1 | female | 0.19 | 0.35 | 5 | 1 | 9 | no | no | no | yes | no |
7 | 1 | female | 0.19 | 0.55 | 4 | 0 | 2 | no | no | no | no | no |
8 | 1 | female | 0.19 | 0.15 | 3 | 0 | 6 | no | no | no | no | no |
9 | 1 | female | 0.19 | 0.65 | 2 | 0 | 5 | yes | no | no | no | no |
10 | 1 | male | 0.19 | 0.15 | 1 | 0 | 0 | yes | no | no | no | no |
Examine the Data
glimpse(docVisits)
Rows: 5,190
Columns: 13
$ rownames <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ visits <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, …
$ gender <chr> "female", "female", "male", "male", "male", "female", "femal…
$ age <dbl> 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, 0.19, …
$ income <dbl> 0.55, 0.45, 0.90, 0.15, 0.45, 0.35, 0.55, 0.15, 0.65, 0.15, …
$ illness <dbl> 1, 1, 3, 1, 2, 5, 4, 3, 2, 1, 1, 2, 3, 4, 3, 2, 1, 1, 1, 1, …
$ reduced <dbl> 4, 2, 0, 0, 5, 1, 0, 0, 0, 0, 0, 0, 13, 7, 1, 0, 0, 1, 0, 0,…
$ health <dbl> 1, 1, 0, 0, 1, 9, 2, 6, 5, 0, 0, 2, 1, 6, 0, 7, 5, 0, 0, 0, …
$ private <chr> "yes", "yes", "no", "no", "no", "no", "no", "no", "yes", "ye…
$ freepoor <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ freerepat <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ nchronic <chr> "no", "no", "no", "no", "yes", "yes", "no", "no", "no", "no"…
$ lchronic <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
dplyr::glimpse()
Very crisp output, giving us the size of the dataset (5190 X 13) and the nature of the variable columns, along with their first few entries. There are several Quantitative variables: visits
, age
, income
, illness
, reduced
and health
; the rest seem to be Qualitative variables.
Always document your data with variable descriptions when you share it, a data dictionary!
skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | private | 0 | 1 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | freepoor | 0 | 1 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | freerepat | 0 | 1 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | nchronic | 0 | 1 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | lchronic | 0 | 1 | 2 | 3 | 0 | 2 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
numeric | rownames | 0 | 1 | NA | NA | NA | NA | NA | 2595.5000000 | 1498.3682792 | 1.00 | 1298.25 | 2595.50 | 3892.75 | 5190.00 | ▇▇▇▇▇ |
numeric | visits | 0 | 1 | NA | NA | NA | NA | NA | 0.3017341 | 0.7981338 | 0.00 | 0.00 | 0.00 | 0.00 | 9.00 | ▇▁▁▁▁ |
numeric | age | 0 | 1 | NA | NA | NA | NA | NA | 0.4063854 | 0.2047818 | 0.19 | 0.22 | 0.32 | 0.62 | 0.72 | ▇▂▁▂▅ |
numeric | income | 0 | 1 | NA | NA | NA | NA | NA | 0.5831599 | 0.3689067 | 0.00 | 0.25 | 0.55 | 0.90 | 1.50 | ▇▆▅▅▂ |
numeric | illness | 0 | 1 | NA | NA | NA | NA | NA | 1.4319846 | 1.3841524 | 0.00 | 0.00 | 1.00 | 2.00 | 5.00 | ▇▂▂▁▁ |
numeric | reduced | 0 | 1 | NA | NA | NA | NA | NA | 0.8618497 | 2.8876284 | 0.00 | 0.00 | 0.00 | 0.00 | 14.00 | ▇▁▁▁▁ |
numeric | health | 0 | 1 | NA | NA | NA | NA | NA | 1.2175337 | 2.1242665 | 0.00 | 0.00 | 0.00 | 2.00 | 12.00 | ▇▁▁▁▁ |
skimr::skim()
A Data Summary: it lists the dimensions of the
docVisits
dataset: 5190 rows and 13 columns. 6 columns are character formatted, the remaining 7 are numeric. The dataset is not “grouped” (more on this later).The second part of the output shows a table with the
character
variables which are thereforefactor
variables withlevels
.The third part shows a table listing the names and summary stats for the
numerical
variables. We havemean
,sd
, all the quantiles (p0, p25, p50(median), p75 and p100 percentiles) and a neat little histogram for each.Can we consider the
health
Goldberg score a Qualitative variable, to be understood as “ranks” between a minimum and maximum? It is just possible…
inspect(docVisits)
categorical variables:
name class levels n missing
1 gender character 2 5190 0
2 private character 2 5190 0
3 freepoor character 2 5190 0
4 freerepat character 2 5190 0
5 nchronic character 2 5190 0
6 lchronic character 2 5190 0
distribution
1 female (52.1%), male (47.9%)
2 no (55.7%), yes (44.3%)
3 no (95.7%), yes (4.3%)
4 no (79%), yes (21%)
5 no (59.7%), yes (40.3%)
6 no (88.3%), yes (11.7%)
quantitative variables:
name class min Q1 median Q3 max mean
1 rownames numeric 1.00 1298.25 2595.50 3892.75 5190.00 2595.5000000
2 visits numeric 0.00 0.00 0.00 0.00 9.00 0.3017341
3 age numeric 0.19 0.22 0.32 0.62 0.72 0.4063854
4 income numeric 0.00 0.25 0.55 0.90 1.50 0.5831599
5 illness numeric 0.00 0.00 1.00 2.00 5.00 1.4319846
6 reduced numeric 0.00 0.00 0.00 0.00 14.00 0.8618497
7 health numeric 0.00 0.00 0.00 2.00 12.00 1.2175337
sd n missing
1 1498.3682792 5190 0
2 0.7981338 5190 0
3 0.2047818 5190 0
4 0.3689067 5190 0
5 1.3841524 5190 0
6 2.8876284 5190 0
7 2.1242665 5190 0
mosaic::inspect()
We see that the output of mosaic::inspect()
is organized very similarly to the output from skim
. Is there any missing data? Both skim
and mosaic
report on the data completion for each variable in the dataset.
Data Dictionary
Variable | Description |
---|---|
visits | Number of doctor visits in past 2 weeks. |
gender | Factor indicating gender. |
age | Age in years divided by 100. |
income | Annual income in tens of thousands of dollars. |
illness | Number of illnesses in past 2 weeks. |
reduced | Number of days of reduced activity in past 2 weeks due to illness or injury. |
health | General health questionnaire score using Goldberg’s method. |
private | Factor. Does the individual have private health docVisits? |
freepoor | Factor. Does the individual have free government health docVisits due to low income? |
freerepat | Factor. Does the individual have free government health docVisits due to old age, disability or veteran status? |
nchronic | Factor. Is there a chronic condition not limiting activity? |
lchronic | Factor. Is there a chronic condition limiting activity? |
Here too, we should convert the variables that are obviously Qualitative into factors, ordered or otherwise:
Groups and Counts of Qualitative Variables
What is the most important dialogue uttered in the movie “Sholay”?
Recall our discussion in Types of Data Variables. We have looked at means, limits, and percentiles of Quantitative variables. Another good idea to examine datasets is to look at counts, proportions, and frequencies with respect to Qualitative variables.
We typically do this with the dplyr
package from the tidyverse
.
### All combinations of cut, color, clarity
### Overwhelming??
mpg_modified %>%
count(across(where(is.factor)))
mpg
)
We see that the groups for each level of
cyl
,drv
, andfl
are not the same size: for instance the group with the “r”-levelfl
type is largest at 168 observations, and the “r”-level indrv
has only 25 observations.Group counts based on
cyl
are also not balanced.Counting all combinations of these three
factors
shows counts of 1 for several combination and does not lead to any decent amount of aggregation.
These aspects may need to be factored into downstream modelling or machine learning tasks. (Usually by stratification wrt levels of the Qualitative variables)
The levels are not too many, so tables work, and so would bar charts, which we will examine next. If there are too many levels in any factor, tables are a better option. Bar charts can still be plotted, but it may be preferable to lump
smaller categories/levels together. (Using the forcats
package)
# Now for all Combinations...
# Maybe too much to digest...
docVisits %>% count(across(where(is.character)))
# Shall we try counting by some variables that might be factors?
# Even if they are labeled as <dbl>?
#
docVisits %>% count(illness)
docVisits
)
Most of the counts are roughly balanced across the levels of the factors; however,
freepoor
andlchronic
show unbalanced counts…The factors are too numerous for a combination count table to very useful..
Counting by
illness
andhealth
does show that these two columns have a limited set of integer entries across over 5000 rows!! So these can be thought of as factors if needed in the analysis. So not every integer variable is necessaily a number!!
Groups and Summaries of Quantitative Variables
We saw that we could obtain numerical summary stats such as means, medians, quartiles, maximum/minimum
of entire Quantitative variables, i.e the complete column. However, we often need identical numerical summary stats of parts of a Quantitative variable. Why?
Note that we have Qualitative variables as well in a typical dataset. These Qual variables help us to group the entire dataset based on their combinations of levels. We can now think of summarizing Quant variables within each such group.
Let us work through these ideas for both our familiar datasets.
# Perhaps the best method for us!
mpg_modified %>%
mosaic::favstats(hwy ~ cyl, data = .) # Don't use fav_stats with formula!!!
# Be aware of the first column format here!
mpg_modified %>%
mosaic::favstats(hwy ~ cyl + fl, data = .) # Don't use fav_stats with formula!!!
mpg
)
- We have quite some variation of mean_hwy mileage over
cyl
, though the groups/level are quite imbalanced. This is of course a small dataset. - The number of groups are large enough (>> 7!) to warrant a chart, which we will make in our next module on Distributions.
- Mean
price
varies quite some based oncyl
andfl
. Some groups are non-existent, hence we see “NA” and “NaN” in the output ofmosaic::favstats
.; and also on combinations ofcyl:fl
. This should point to the existence of some interaction effect when modelling for price.
##
docVisits_modified %>%
group_by(gender) %>%
summarize(average_visits = mean(visits), count = n())
##
docVisits_modified %>%
group_by(freepoor, nchronic) %>%
summarise(
mean_income = mean(income),
average_visits = mean(visits),
count = n()
)
##
docVisits_modified %>%
mosaic::favstats(income ~ gender, data = .) # Don't use fav_stats with formula!!!
##
docVisits_modified %>%
mosaic::favstats(income ~ freepoor + nchronic, data = .) # Don't use fav_stats with formula!!!
docVisits
)
Clearly the people who are freepoor
( On Govt Insurance) AND with a chronic condition are those who have lower average income and a higher average number of visits to the doctor…but there are relatively few of them (n = 55) in this dataset.
More on dplyr
The dplyr
package is capable of doing much more than just count
, group_by
and summarize
. We will encounter this package many times more as we build our intuition about data visualization. A full tutorial on dplyr
is linked to the icon below:
dplyr Tutorial |
---|
Reporting Tables for Data and the Data Schema
Note that all the three methods (dplyr::glimpse()
, skimr::skim()
, and mosaic::inspect()
) report the schema of the original dataframe. The schema are also formatted as data frames! However they do not “contain” the original data! Do not confuse between the data and it’s reported schema!
As stated earlier, it is usually a good idea to make crisp business-like tables, for the data itself, and of the schema as revealed by one of the outputs of the three methods (glimpse/skim/inspect
) presented above. There are many table-making methods in R to do this; one of the simplest and effective ones is to use the kable
set of commands from the knitr
and kableExtra
packages that we have installed already:
mpg %>%
head(10) %>%
kbl(
col.names = c(
"Manufacturer", "Model", "Engine\nDisplacement",
"Model\n Year", "Cylinders", "Transmission",
"Drivetrain", "City\n Mileage", "Highway\n Mileage",
"Fuel", "Class\nOf\nVehicle"
),
longtable = FALSE, centering = TRUE,
caption = "MPG Dataset"
) %>%
kable_styling(
bootstrap_options = c(
"striped", "hover",
"condensed", "responsive"
),
full_width = F, position = "center"
)
Manufacturer | Model | Engine Displacement | Model Year | Cylinders | Transmission | Drivetrain | City Mileage | Highway Mileage | Fuel | Class Of Vehicle |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
And for the schema from skim()
, with some extra bells and whistles on the table:
skim(mpg) %>%
kbl(align = "c", caption = "Skim Output for mpg Dataset") %>%
kable_paper(full_width = F)
skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | manufacturer | 0 | 1 | 4 | 10 | 0 | 15 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | model | 0 | 1 | 2 | 22 | 0 | 38 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | trans | 0 | 1 | 8 | 10 | 0 | 10 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | drv | 0 | 1 | 1 | 1 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | fl | 0 | 1 | 1 | 1 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
character | class | 0 | 1 | 3 | 10 | 0 | 7 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
numeric | displ | 0 | 1 | NA | NA | NA | NA | NA | 3.471795 | 1.291959 | 1.6 | 2.4 | 3.3 | 4.6 | 7 | ▇▆▆▃▁ |
numeric | year | 0 | 1 | NA | NA | NA | NA | NA | 2003.500000 | 4.509646 | 1999.0 | 1999.0 | 2003.5 | 2008.0 | 2008 | ▇▁▁▁▇ |
numeric | cyl | 0 | 1 | NA | NA | NA | NA | NA | 5.888889 | 1.611535 | 4.0 | 4.0 | 6.0 | 8.0 | 8 | ▇▁▇▁▇ |
numeric | cty | 0 | 1 | NA | NA | NA | NA | NA | 16.858974 | 4.255946 | 9.0 | 14.0 | 17.0 | 19.0 | 35 | ▆▇▃▁▁ |
numeric | hwy | 0 | 1 | NA | NA | NA | NA | NA | 23.440171 | 5.954643 | 12.0 | 18.0 | 24.0 | 27.0 | 44 | ▅▅▇▁▁ |
See https://haozhu233.github.io/kableExtra/ for more options on formatting the table with kableExtra.
A Quick Quiz
It is always a good idea to look for variables in data that may be incorrectly formatted. For instance, a variable marked as numerical may have the values 1-2-3-4 which represent options
, sizes
, or say months
. in which case it would have to be interpreted as a factor
.
Let us take a small test with the mpg
dataset:
- What is the number of qualitative/categorical variables in the
mpg
data?
- How many manufacturers are named in this dataset?
- How many levels does the variable
drv
have?
- How many quantitative/numerical variables shown in the
mpg
data?
- But the variable is actually a qualitative variable.
Your Turn
Try your hand at these datasets. Look at the data, state the data dictionary, contemplate a few Research Questions and answer them with Summaries and Tables in Quarto!
Try adding more summary functions to the summary table? Which might you choose? Why?
Which would be the Group By
variables here? And what would you summarize? With which function?
Wait, But Why?
- Data Summaries give you the essentials, without getting bogged down in the details(just yet).
- Summaries help you “live with your data”; this is an important step in understanding it, and deciding what to do with it.
- Summaries help evoke Questions and Hypotheses, which may lead to inquiries, analysis, and insights
-
Grouped Summaries should tell you if:
- counts of groups in your target audience are lopsided/imbalanced; Go and Get your data again.
- there are visible differences in Quant data across groups, so your target audience could be nicely fractured;
- etc.
Conclusion
- The three methods (glimpse/skim/inspect) given here give us a very comprehensive look into the structure of the dataset.
- The
favstats
method allows us to compute a whole lot of metrics for Quant variables for each level of one or more *Qual variables. - Use the
kable
set of commands to make a smart-looking of the data and the outputs of any of the three methods.
Make these part of your Workflow.
AI Generated Summary and Podcast
This is a tutorial on using the R programming language to perform descriptive statistical analysis on data sets. The tutorial focuses on summarizing data using various R packages like dplyr
, skimr
, and mosaic
. It emphasizes the importance of understanding the data’s structure, identifying different types of variables (qualitative and quantitative), and calculating summary statistics such as means, medians, and frequencies. The tutorial provides examples using real datasets and highlights the significance of data summaries in gaining initial insights, formulating research questions, and identifying potential issues with the data.
References
- Lock, Lock, Lock, Lock, and Lock. (2021). Statistics: Unlocking the Power of Data, 3rd Edition). https://media.wiley.com/product_data/excerpt/69/11196821/1119682169-32.pdf
R Package Citations
Citation
@online{v.2023,
author = {V., Arvind},
title = {\textless Iconify-Icon Icon=“carbon:summary-Kpi”
Width=“1.2em”
Height=“1.2em”\textgreater\textless/Iconify-Icon\textgreater{}
{Summaries}},
date = {2023-10-15},
url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/10-FavStats/},
langid = {en},
abstract = {Bill Gates walked into a bar, and everyone’s salary went
up on average.}
}