🕶 Summaries

Summary Stats
Favourite Stats
Quant Variables
Qual Variables
Author

Arvind V

Published

April 16, 2024

Modified

June 22, 2024

Abstract
Bill Gates walked into a bar and everyone’s salary went up on average.

Inspiration(s)!

First, some baseball:

And then, an example from a more sombre story:

| Year | Below Level #1 | Level #1 | Level #2 | Level #3 | Levels #4 and #5 |
|---|---|---|---|---|---|
| Number in millions (2012/2014) | 8.35 | 26.49 | 65.10 | 71.41 | 26.57 |
| Number in millions (2017) | 7.59 | 29.23 | 66.07 | 68.81 | 26.75 |
Note:
SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014.
Table 1: US Population: Reading and Numeracy Levels

This ghastly-looking Table 1 examines U.S. adults with low English literacy and numeracy skills (low-skilled adults) at two points in the 2010s, 2012/2014 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). In absolute terms the numbers are quite surprising for a developed country like the US, and the combined count at Level #1 or below rose from 34.84 million in 2012/2014 to 36.82 million in 2017!

So why do we need to summarise data? Summarisation is an act of throwing away data to make more sense of what remains, as stated by Stigler (2016) and, in the movie Moneyball, by Brad Pitt playing Billy Beane. To summarise is to understand. Add to that the fact that our working memory can hold only about seven items at a time.

And if we don’t summarise? In a fantasy short story published in 1942, titled “Funes the Memorious,” Jorge Luis Borges described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes there were only details.” (emphasis mine)

Aggregation can yield great gains above the individual components in data. Funes was big data without Statistics.

What graphs / numbers will we see today?

| Variable #1 | Variable #2 | Chart Names | “Chart Shape” |
|---|---|---|---|
| All | All | Tables and Stat Measures | |

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

  • Dataset length: How many rows/observations?
  • Dataset breadth: How many columns/variables?
  • How many Quant variables?
  • How many Qual variables?
  • Quant variables: min, max, mean, median, sd
  • Qual variables: levels, counts per level
  • Both: means, medians for each level of a Qual variable…
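The checklist above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the course itself uses Orange widgets, the toy dataset and its column names (`species`, `weight_kg`, `age_yr`) are invented, and "numeric column means Quant" is a simplification that will not hold for every dataset.

```python
import statistics
from collections import Counter, defaultdict

# Hypothetical toy dataset, invented for illustration only.
rows = [
    {"species": "cat", "weight_kg": 4.0, "age_yr": 3},
    {"species": "dog", "weight_kg": 20.0, "age_yr": 5},
    {"species": "cat", "weight_kg": 5.0, "age_yr": 7},
    {"species": "dog", "weight_kg": 30.0, "age_yr": 2},
    {"species": "dog", "weight_kg": 25.0, "age_yr": 8},
]

# Dataset length and breadth.
columns = list(rows[0].keys())
print("rows:", len(rows), "columns:", len(columns))

# Assumption: numeric columns are Quant, everything else is Qual.
quant = [c for c in columns if all(isinstance(r[c], (int, float)) for r in rows)]
qual = [c for c in columns if c not in quant]
print("Quant:", quant, "Qual:", qual)


def quant_summary(values):
    """min, max, mean, median and (sample) sd of one Quant variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }


quant_stats = {c: quant_summary([r[c] for r in rows]) for c in quant}

# Levels and counts per level for each Qual variable.
level_counts = {c: Counter(r[c] for r in rows) for c in qual}

# Mean of a Quant variable for each level of a Qual variable.
groups = defaultdict(list)
for r in rows:
    groups[r["species"]].append(r["weight_kg"])
mean_weight_by_species = {k: statistics.mean(v) for k, v in groups.items()}

print(quant_stats)
print(level_counts)
print(mean_weight_by_species)
```

The same numbers are what a summary-statistics table in a point-and-click tool reports; writing them out once by hand makes it clearer what each column of such a table means.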

What kind of Data Variables will we choose?

| No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
|---|---|---|---|---|---|
| 1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
| 2 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities with Scale. Differences are meaningful, but not products or ratios. | Quantitative/Interval | pH, SAT score (200-800), Credit score (300-850), Year of Starting College | Mean, Standard Deviation |
| 3 | How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best) | Qualitative/Ordinal | Socioeconomic status (Low income, Middle income, High income), Education level (High School, BS, MS, PhD), Satisfaction rating (Very Much Dislike, Dislike, Neutral, Like, Very Much Like) | Median, Percentile |
| 4 | What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative/Nominal | Name | Count no. of cases, Mode |

We will obviously choose all variables in the dataset, except unrelated ones such as a row number or ID, which (we think) contribute no information and can be disregarded.

How do these Summaries Work?

Inspecting the min, max, mean, median and sd of each of the Quant variables tells us straightaway what the ranges of the variables are, and whether there are outliers, which could be genuine or could be due to data entry errors! Comparing the ranges of two Quant variables also tells us whether we may have to scale/normalize them for computational ease, if one variable has large numbers and the other has very small ones.
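As a rough sketch of what comparing ranges and scaling can look like, the snippet below uses two invented variables (`income` and `rate`, both hypothetical) on very different scales. Min-max scaling and standardisation are two common choices, not necessarily the ones this course has in mind.

```python
import statistics

# Hypothetical values: one variable in the tens of thousands, one below 1.
# The last income is deliberately extreme to mimic an outlier.
income = [32000.0, 45000.0, 51000.0, 38000.0, 250000.0]
rate = [0.02, 0.05, 0.03, 0.04, 0.01]


def value_range(values):
    """Return (min, max), the first thing to inspect for a Quant variable."""
    return min(values), max(values)


def min_max_scale(values):
    """Rescale values linearly so that they lie between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the sample sd (z-scores)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]


print("income range:", value_range(income))
print("rate range:", value_range(rate))

# A mean far from the median hints at a skewed variable or an outlier.
print("income mean:", statistics.mean(income))
print("income median:", statistics.median(income))

income_scaled = min_max_scale(income)
rate_scaled = min_max_scale(rate)
income_z = standardize(income)
```

After scaling, both variables live on comparable ranges, so neither dominates a computation merely because of its units.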

With Qual variables, we understand the levels within each, and the total number of combinations of levels across them. Counts across levels, and across combinations of levels, tell us whether the data has sufficient readings for graphing, inference, and decision-making, or whether certain levels/classes of data are under- or over-represented. This may point to data gathering errors, which may be fixable, or you may have to decide what to do about this data sparseness.
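Counting levels and combinations of levels can be sketched as follows. The records, the two Qual variables (education and region) and the `MIN_COUNT` threshold are all hypothetical; what counts as "too few" readings is a judgement call that depends on the analysis.

```python
from collections import Counter
from itertools import product

# Hypothetical records: (education, region) pairs, invented for illustration.
records = [
    ("BS", "North"),
    ("BS", "South"),
    ("MS", "North"),
    ("BS", "North"),
    ("PhD", "North"),
    ("MS", "North"),
]

# Counts per level of each Qual variable.
edu_counts = Counter(edu for edu, _ in records)
region_counts = Counter(region for _, region in records)

# Counts per combination of levels (a flattened cross-tabulation).
combo_counts = Counter(records)

# Every combination that could occur, including those never observed.
all_combos = list(product(sorted(edu_counts), sorted(region_counts)))

# Assumed threshold: flag combinations with fewer than 2 observations.
MIN_COUNT = 2
sparse = [combo for combo in all_combos if combo_counts[combo] < MIN_COUNT]

print("education:", dict(edu_counts))
print("region:", dict(region_counts))
print("possible combinations:", len(all_combos))
print("sparse combinations:", sparse)
```

Listing all possible combinations, rather than only the observed ones, is what exposes the empty cells: a combination with zero rows never shows up in a plain count.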

For both types of variables, we need to keep an eye open for data entries that are missing! We will have to decide whether to let go of that entire observation (i.e. a row) or to do what is called imputation: filling in the missing entries with values based on the other values in the same column.
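The two options for missing entries can be illustrated with a small sketch. The rows and column names are invented, missing entries are represented by `None`, and filling with the column median is just one simple imputation rule among many.

```python
import statistics

# Hypothetical rows; None marks a missing entry.
rows = [
    {"height_cm": 170.0, "weight_kg": 65.0},
    {"height_cm": None, "weight_kg": 80.0},
    {"height_cm": 160.0, "weight_kg": None},
    {"height_cm": 180.0, "weight_kg": 75.0},
]

# How many entries are missing in each column?
missing_per_column = {c: sum(r[c] is None for r in rows) for c in rows[0]}

# Option 1: let go of every row that has any missing entry.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]


def impute_median(data, column):
    """Option 2: fill missing entries with the median of the observed values."""
    observed = [r[column] for r in data if r[column] is not None]
    fill = statistics.median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in data]


imputed = impute_median(impute_median(rows, "height_cm"), "weight_kg")

print("missing per column:", missing_per_column)
print("rows kept after dropping:", len(complete_rows))
print("imputed rows:", imputed)
```

Dropping rows here halves the dataset, while imputation keeps all four rows at the cost of inventing two values; which trade-off is acceptable depends on how much data is missing and why.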

Obtaining Quant Summaries

Dataset: TBD

Examine the Data

Data Dictionary

Quantitative Data
Qualitative Data

Research Questions

Let’s try a few questions and see if they are answerable with Summary Figures and Tables.

Note

Q1.

(a) IMDB Ratings Histogram
(b) IMDB Rating vs Genre
Figure 1: Netflix Data Histograms
Note

Q2.

(a) Reformatting “Seasons”
(b) IMDB Rating vs Seasons
Figure 2: Plotting with Seasons

What is the Story Here?

Dataset:

Here is a dataset about the eruption durations, and wait times between eruptions of the Old Faithful geyser in Yellowstone National Park, USA.

Download this data to your machine and import it into Orange.

Examine the Data

Figure 3: Old Faithful Data Table

Figure 3 shows that we have 272 data points and three variables. All variables are Quantitative!

Data Dictionary

Quantitative Data
Qualitative Data
  • No Qual variables!!

Research Questions

Note

Q1.

(a) Eruption Durations Histogram
(b) Waiting Times Histogram
Figure 4: Old Faithful Data Histograms

What is the Story Here?

Your Turn

Try your hand at these datasets. Look at the data table, state the data dictionary, contemplate a few Research Questions and answer them with Summaries and Tables in Orange!

  1. Airbnb Price Data on the French Riviera
  2. Wage and Education Data from Canada

Wait, But Why?

References


Stigler, Stephen M. 2016. “The Seven Pillars of Statistical Wisdom,” March. https://doi.org/10.4159/9780674970199.