Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors


Summaries

Throwing away data to grasp it

Summary Stats
Favourite Stats
Quant Variables
Qual Variables
Published

April 16, 2024

Modified

July 28, 2024

Abstract
Bill Gates walked into a bar, and everyone’s salary went up on average.

Inspiration(s)!

First, some baseball:

And then, an example from a more sombre story:

Table 1: US Population: Reading and Numeracy Levels

| Year | Below Level #1 | Level #1 | Level #2 | Level #3 | Levels #4 and #5 |
|---|---|---|---|---|---|
| Number in millions (2012/2014) | 8.35 | 26.49 | 65.10 | 71.41 | 26.57 |
| Number in millions (2017) | 7.59 | 29.23 | 66.07 | 68.81 | 26.75 |

Note: SOURCE: U.S. Department of Education, National Center for Education Statistics, Program for the International Assessment of Adult Competencies (PIAAC), U.S. PIAAC 2017, U.S. PIAAC 2012/2014.

This ghastly-looking Table 1 examines U.S. adults with low English literacy and numeracy skills (low-skilled adults) at two points in the 2010s, 2012/2014 and 2017, using data from the Program for the International Assessment of Adult Competencies (PIAAC). The absolute numbers are quite surprising for a developed country like the US, and the combined count at Level #1 and below rose from 34.84 million in 2012/2014 to 36.82 million in 2017, even though the Below Level #1 group itself shrank!

So why do we need to summarise data? Summarization is an act of throwing away data to make more sense of it, as stated by (Stigler 2016) and as shown by Brad Pitt, playing Billy Beane, in the movie Moneyball. To summarize is to understand. Add to that the fact that our working memory can hold maybe 7 items at a time, so summarising also helps us retain information.

And if we don’t summarise? Jorge Luis Borges, in a fantasy short story published in 1942 titled “Funes the Memorious,” described a man, Ireneo Funes, who found after an accident that he could remember absolutely everything. He could reconstruct every day in the smallest detail, and he could even later reconstruct the reconstruction, but he was incapable of understanding. Borges wrote, “To think is to forget details, generalize, make abstractions. In the teeming world of Funes, there were only details.” (emphasis mine)

Aggregation can yield great gains above the individual components in data. Funes was big data without Statistics.

What graphs / numbers will we see today?

| Variable #1 | Variable #2 | Chart Names | “Chart Shape” |
|---|---|---|---|
| All | All | Tables and Stat Measures | |

Before we plot a single chart, it is wise to take a look at several numbers that summarize the dataset under consideration. What might these be? Some obviously useful numbers are:

  • Dataset length: How many rows/observations?
  • Dataset breadth: How many columns/variables?
  • How many Quant variables?
  • How many Qual variables?
  • Quant variables: min, max, mean, median, sd
  • Qual variables: levels, counts per level
  • Both: means, medians for each level of a Qual variable…
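For the curious, here is roughly what a tool does behind the scenes to produce the numbers in this checklist. This is only a minimal sketch in plain Python (standard library only). The tiny `rows` table, its column names, and the helper `summarise` are all invented for illustration; the values are made up and not taken from any dataset used in this module.

```python
from collections import Counter
from statistics import mean, median, stdev

# Made-up rows for illustration only (hypothetical shop data).
rows = [
    {"shop": "North", "item": "tea", "price": 3.0, "units": 10},
    {"shop": "North", "item": "coffee", "price": 4.0, "units": 6},
    {"shop": "South", "item": "tea", "price": 3.5, "units": 8},
    {"shop": "South", "item": "coffee", "price": 4.5, "units": 4},
    {"shop": "South", "item": "juice", "price": 5.0, "units": 2},
]


def summarise(rows):
    """Return dataset-level, Quant and Qual summaries for a list of dict rows."""
    columns = list(rows[0].keys())
    # Treat a column as Quant if every value in it is a number.
    quant = [c for c in columns
             if all(isinstance(r[c], (int, float)) for r in rows)]
    qual = [c for c in columns if c not in quant]
    summary = {
        "n_rows": len(rows),      # dataset length
        "n_cols": len(columns),   # dataset breadth
        "quant": {},
        "qual": {},
    }
    for c in quant:
        values = [r[c] for r in rows]
        summary["quant"][c] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "median": median(values),
            "sd": stdev(values),  # sample standard deviation
        }
    for c in qual:
        # Levels of the Qual variable, with counts per level.
        summary["qual"][c] = dict(Counter(r[c] for r in rows))
    return summary


result = summarise(rows)
print("rows:", result["n_rows"], "columns:", result["n_cols"])
print("price:", result["quant"]["price"])
print("shop counts:", result["qual"]["shop"])
```

Orange gives you the same kind of numbers without any typing; the sketch is just to show that there is no magic involved.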

What kind of Data Variables will we choose?

| No | Pronoun | Answer | Variable/Scale | Example | What Operations? |
|---|---|---|---|---|---|
| 1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
| 2 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities with Scale. Differences are meaningful, but not products or ratios. | Quantitative/Interval | pH, SAT score (200-800), Credit score (300-850), Year of Starting College | Mean, Standard Deviation |
| 3 | How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best..) | Qualitative/Ordinal | Socioeconomic status (Low income, Middle income, High income), Education level (High School, BS, MS, PhD), Satisfaction rating (Very much Dislike, Dislike, Neutral, Like, Very Much Like) | Median, Percentile |
| 4 | What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative/Nominal | Name | Count no. of cases, Mode |

We will choose all the variables in the dataset, except unrelated ones such as a row number or ID, which (we think) contribute no information and can be disregarded.

How do these Summaries Work?

Quant variables: Inspecting the min, max, mean, median and sd of each of the Quant variables tells us straightaway what the ranges of the variables are, and whether there are some outliers, which could be normal, or maybe due to data entry error! Comparing the ranges of two Quant variables also tells us whether we may have to scale/normalize them for computational ease, if one variable has large numbers and the other has very small ones.
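To make "scale/normalize" a little more concrete, here is a minimal sketch of two common rescalings, min-max scaling and z-scores, in plain Python. The two lists and their names (`weight_g`, `depth_mm`) are made up for illustration, and both helpers assume the values are not all identical (otherwise they would divide by zero).

```python
from statistics import mean, stdev

# Made-up values: two Quant variables on very different scales.
weight_g = [3400.0, 4000.0, 4700.0, 5500.0]
depth_mm = [17.0, 18.5, 14.0, 15.5]


def min_max_scale(values):
    """Rescale linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


for name, values in [("weight_g", weight_g), ("depth_mm", depth_mm)]:
    print(name, "range:", min(values), "to", max(values))
    print("  min-max:", [round(v, 2) for v in min_max_scale(values)])
    print("  z-score:", [round(v, 2) for v in z_scores(values)])
```

After either rescaling, the two variables live on comparable scales, whatever their original units.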

Qual variables: With Qual variables, we understand the levels within each, and understand the total number of combinations of the levels across these. Counts across levels, and across combinations of levels, tell us whether the data has sufficient readings for graphing, inference, and decision-making, or if certain levels/classes of data are under or over represented.

Together?: We can use Quant and Qual together, to develop the above summaries (min, max, mean, median and sd) for Quant variables, again across levels, and across combinations of levels of single or multiple Quals, along with counts if we are interested in that.

For both types of variables, we need to keep an eye open for data entries that are missing! This may point to data gathering errors, which may be fixable. Or we may have to decide to let go of that entire observation (i.e. a row). Or we may even do what is called imputation: filling in values based on the other values in the same column. This sounds like we are making up data, but it isn’t really so.
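The three options (count what is missing, drop the row, or impute) can be sketched as follows. This is a minimal illustration in plain Python, with made-up rows where `None` marks a missing entry. Mean imputation is just one simple choice among many and is assumed here only for the sake of the example; the helper names are hypothetical.

```python
from statistics import mean

# Made-up rows; None marks a missing entry.
rows = [
    {"name": "a", "score": 10.0},
    {"name": "b", "score": None},
    {"name": "c", "score": 14.0},
    {"name": "d", "score": 12.0},
]


def count_missing(rows, column):
    """How many rows have no value in this column?"""
    return sum(1 for r in rows if r[column] is None)


def drop_missing(rows, column):
    """Let go of every observation (row) that lacks a value in this column."""
    return [r for r in rows if r[column] is not None]


def impute_mean(rows, column):
    """Fill missing entries with the mean of the observed values in the column."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    # Build new rows so that the original data stays untouched.
    return [{**r, column: fill if r[column] is None else r[column]}
            for r in rows]


print("missing:", count_missing(rows, "score"))
print("after dropping:", len(drop_missing(rows, "score")), "rows")
print("after imputing:", [r["score"] for r in impute_mean(rows, "score")])
```

Note that imputing with the column mean leaves the mean unchanged but shrinks the spread (sd) of the column, which is one reason to report how many values were filled in.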

Grouped summaries and counts may also tell us if we are witnessing a Simpson’s Paradox situation. You may have to decide what to do about sparse data in some groups, or just check your biases!
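A Simpson’s Paradox situation is one where a comparison points one way inside every group, and the other way when the groups are pooled. A minimal sketch with invented counts (the strata `mild`/`severe` and treatments `A`/`B` are hypothetical, chosen only so that the group sizes are very unequal):

```python
# Invented counts: (successes, trials) for each treatment within each stratum.
counts = {
    "mild":   {"A": (9, 10),   "B": (80, 100)},
    "severe": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    """Success proportion."""
    return successes / trials


def pooled_rate(treatment):
    """Success proportion for a treatment after pooling all strata."""
    successes = sum(counts[s][treatment][0] for s in counts)
    trials = sum(counts[s][treatment][1] for s in counts)
    return rate(successes, trials)


# Within each stratum, A has the higher success rate...
for stratum, by_treatment in counts.items():
    a = rate(*by_treatment["A"])
    b = rate(*by_treatment["B"])
    print(f"{stratum}: A={a:.2f}  B={b:.2f}")

# ...but pooled over strata, B looks better, because A was mostly
# tried on the harder (severe) cases.
print(f"pooled: A={pooled_rate('A'):.2f}  B={pooled_rate('B'):.2f}")
```

This is why counts per combination of levels matter: the unequal group sizes are exactly what drives the reversal.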

Obtaining Quant Summaries

  • Using Orange
  • Using RAWgraphs
  • Using DataWrapper

Let us examine a healthcare-related dataset in Orange, on heart disease. Download the Orange workflow by clicking on the icon below, and open it in Orange.

Figure 1: Grouped Summaries

In Figure 1, we see two sub-windows: on the upper right, the output of “Group By”, where we have selected gender. In the same window, we can also choose which summary statistics we wish to see for each of the other variables in the dataset. To the lower left, we see the output of the Grouped Summaries Data Table, which shows just two rows: one for gender::female, and another for gender::male. All other variables have been summarised as desired.

Play with the summary output settings, and also with choosing which variable to Group By. Can you Group By more than one Qual variable?

Note

Does Group By with a Quant variable make sense?

Note: Grouping By Multiple Variables

At first glance, there does not appear to be a way to Group By multiple variables in Orange, something that is a breeze with R or Python or…stuff which you peasants will not touch. But there is: hold down CMD (on a Mac) or the Windows key (on Windows) while clicking, to select two or more Group By variables. Hmph!

https://www.rawgraphs.io

https://www.datawrapper.de

Dataset: Penguins

Let us use a (now) well-known data set on penguins. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

Download this data into your Orange work folder and then use it in Orange.

Examine the Data

Here is the Data Table for the penguins data:

Figure 2: Penguins Data Table

We see from Figure 2 that there are 344 observations (i.e. individual penguins) and 6 variables. There is some missing data but not too much!

Data Dictionary

Note: Qualitative Data
  • sex: male and female penguins
  • island: they have islands to themselves!!
  • species: Three adorable types!
Figure 3: Penguin Species
Note: Quantitative Data
  • bill_length_mm: The length of the penguins’ bills
  • bill_depth_mm: See the picture!!
  • flipper_length_mm: Flippers! Penguins have “hands”!!
  • body_mass_gm: Grams? Grams??? Why, these penguins are like human babies!!❤️
Figure 4: Penguin Features

Research Questions

Let’s try a few questions and see if they are answerable with Summary Figures and Tables.

Note

Q1. What are the mean weights of the penguins, for each species? In Orange, we do a Group By with the species variable, and select mean as the summary function for the variable body_mass_gm.

Note

For now, disable summaries for everything else to avoid clutter!

(a) Group By Species
(b) Mass and Count by Species
Figure 5: Penguins’ Mass and Counts by Species
Note

Q2. What if we group by species and sex? And also look at flipper_length_mm?

(a) Group by Species and Sex
(b) Mass and Flipper Length by Species and Sex
Figure 6: Summary over two Qual variables
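For those who want to peek under the hood, a Group By over one or several Qual variables boils down to very little code. Here is a minimal sketch in plain Python. The rows below only mimic the column names of the penguins data; the numbers are made up and are NOT real measurements, and the helper `group_by` is a hypothetical name, not something Orange exposes.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows that only borrow the penguins column names;
# the values are NOT real measurements.
penguins = [
    {"species": "Adelie", "sex": "female",
     "body_mass_gm": 3400.0, "flipper_length_mm": 188.0},
    {"species": "Adelie", "sex": "male",
     "body_mass_gm": 4000.0, "flipper_length_mm": 192.0},
    {"species": "Adelie", "sex": "male",
     "body_mass_gm": 4200.0, "flipper_length_mm": 194.0},
    {"species": "Gentoo", "sex": "female",
     "body_mass_gm": 4700.0, "flipper_length_mm": 212.0},
    {"species": "Gentoo", "sex": "male",
     "body_mass_gm": 5500.0, "flipper_length_mm": 222.0},
    {"species": "Chinstrap", "sex": "female",
     "body_mass_gm": 3500.0, "flipper_length_mm": 191.0},
]


def group_by(rows, keys, value_columns):
    """Count rows and average the value columns for each combination of keys."""
    groups = defaultdict(list)
    for r in rows:
        # The group label is the tuple of levels, e.g. ("Adelie", "male").
        groups[tuple(r[k] for k in keys)].append(r)
    table = {}
    for level, members in groups.items():
        stats = {"count": len(members)}
        for col in value_columns:
            stats["mean_" + col] = mean([m[col] for m in members])
        table[level] = stats
    return table


# Q1: mean body mass (and counts) for each species.
by_species = group_by(penguins, ["species"], ["body_mass_gm"])

# Q2: the same idea over two Qual variables, adding flipper length.
by_species_sex = group_by(
    penguins, ["species", "sex"], ["body_mass_gm", "flipper_length_mm"]
)

for level, stats in by_species_sex.items():
    print(level, stats)
```

Grouping by two variables is the same operation as grouping by one; only the group label gets longer. Combinations that never occur in the data (here, male Chinstrap) simply do not appear in the table.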

What is the Story Here?

  • From Figure 5, clearly Gentoo penguins are the big brothers/sisters here, with a mean body mass higher by around 1250 grams!!😆

  • From Figure 6:

    • Hmm... Chinstrap penguins are fewer in number, compared to the other two species.
    • flipper_length_mm is in pretty much the same ballpark across all species and sex combinations; Gentoo still dominates the body_mass_gm across both sexes.
    • female Gentoo are heavier (on average) than even the males of the other species!!💪 (Not necessarily on an individual basis!!).

Your Turn

  1. Try adding more summary functions to the summary table? Which might you choose? Why?

  2. Try your hand at these datasets. Look at the data table, state the data dictionary, contemplate a few Research Questions and answer them with Summaries and Tables in Orange!

Note: Star Trek Books

Which would be the Group By variables here? And what would you summarize? With which function?

Note: Math Anxiety! Hah!

Wait, But Why?

  • Data Summaries give you the essentials, without getting bogged down in the details (just yet).
  • Summaries help you “live with your data”; this is an important step in understanding it, and deciding what to do with it.
  • Summaries help evoke Questions and Hypotheses, which may lead to inquiries, analysis, and insights.

Readings


References

Stigler, Stephen M. 2016. The Seven Pillars of Statistical Wisdom. Harvard University Press. https://doi.org/10.4159/9780674970199.

License: CC BY-SA 2.0

Website made with ❤️ and Quarto, by Arvind V.

Hosted by Netlify .