🎲 Samples, Populations, Statistics and Inference

Tags: Sampling, Central Limit Theorem, Standard Error, Confidence Intervals

Published: November 25, 2022
Modified: March 1, 2025

Abstract: How much ~~Land~~ Data does a Man need?

Slides and Tutorials

R Tutorial   Datasets

“When Alexander the Great visited Diogenes and asked whether he could do anything for the famed teacher, Diogenes replied: ‘Only stand out of my light.’ Perhaps some day we shall know how to heighten creativity. Until then, one of the best things we can do for creative men and women is to stand out of their light.”

— John W. Gardner, author and educator (8 Oct 1912-2002)

Setting up R Packages

library(tidyverse) # Data Processing in R
library(mosaic) # Our workhorse for stats, sampling
library(skimr) # Good to Examine data
library(ggformula) # Formula interface for graphs

# load the NHANES data library
library(NHANES)
library(infer)

Plot Theme

# https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto

# Chunk options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, # Golden Ratio
  # out.width = "80%",
  fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes

theme_custom <- function() {
  font <- "Roboto Condensed" # assign font family up front

  theme_classic(base_size = 14) %+replace% # replace elements we want to change

    theme(
      panel.grid.minor = element_blank(), # strip minor gridlines
      text = element_text(family = font),
      # text elements
      plot.title = element_text( # title
        family = font, # set font family
        # size = 20,               #set font size
        face = "bold", # bold typeface
        hjust = 0, # left align
        # vjust = 2                #raise slightly
        margin = margin(0, 0, 10, 0)
      ),
      plot.subtitle = element_text( # subtitle
        family = font, # font family
        # size = 14,                #font size
        hjust = 0,
        margin = margin(2, 0, 5, 0)
      ),
      plot.caption = element_text( # caption
        family = font, # font family
        size = 8, # font size
        hjust = 1
      ), # right align

      axis.title = element_text( # axis titles
        family = font, # font family
        size = 10 # font size
      ),
      axis.text = element_text( # axis text
        family = font, # axis family
        size = 8
      ) # font size
    )
}

# Set graph theme
theme_set(new = theme_custom())
#

What is a Population?

A population is a collection of individuals or observations we are interested in. This is also commonly denoted as a study population in science research, or a target audience in design. We mathematically denote the population’s size using upper-case N.

A population parameter is some numerical summary about the population that is unknown but you wish you knew. For example, when this quantity is a mean like the mean height of all Bangaloreans, the population parameter of interest is the population mean μ.

A census is an exhaustive enumeration or counting of all N individuals in the population. We do this in order to compute the population parameter’s value exactly. Of note is that as the number N of individuals in our population increases, conducting a census gets more expensive (in terms of time, energy, and money).

Important Parameters

Population parameters are usually denoted by Greek letters.

What is a Sample?

Sampling is the act of collecting a small subset from the population, which we generally do when we can’t perform a census. We mathematically denote the sample size using lower case n, as opposed to upper case N which denotes the population’s size. Typically the sample size n is much smaller than the population size N. Thus sampling is a much cheaper alternative than performing a census.

A sample statistic, also known as a point estimate, is a summary statistic like a mean or standard deviation that is computed from a sample.
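To make this distinction concrete, here is a minimal sketch in base R, contrasting a population parameter (computed by census) with a sample statistic. The numbers are made up purely for illustration:

# A made-up toy population of N = 1000 heights (for illustration only)
set.seed(42)
toy_population <- rnorm(n = 1000, mean = 165, sd = 10)

# Census: compute the population parameter mu exactly, over all N values
mu <- mean(toy_population)

# Sampling: one sample of n = 25, and its point estimate x-bar
toy_sample <- sample(toy_population, size = 25)
x_bar <- mean(toy_sample)

c(mu = mu, x_bar = x_bar) # close, but not identical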

Note: Why do we sample?

Because we often cannot conduct a census (and sometimes we do not even know how big the population is), we take samples. We still want to do useful work for, and with, the population, which means estimating its parameters: an act of generalizing from sample to population. So the question is: can we estimate useful parameters of the population using just samples? Can point estimates serve as useful guides to population parameters?

This act of generalizing from sample to population is at the heart of statistical inference.

Important: An Alliterative Mnemonic

There is an alliterative mnemonic here: Samples have Statistics; Populations have Parameters.

Population Parameters and Sample Statistics

Parameters and Statistics

Quantity             Population Parameter   Sample Statistic
Mean                 μ                      x̄
Standard Deviation   σ                      s
Proportion           p                      p̂
Correlation          ρ                      r
Slope (Regression)   β₁                     b₁
Note: Question

Q.1. What is the mean commute time for workers in a particular city?

A.1. The parameter is the mean commute time μ for the population containing all workers who work in the city. We estimate it using x̄, the mean of a random sample of people who work in the city.

Note: Question

Q.2. What is the correlation between the size of dinner bills and the size of tips at a restaurant?

A.2. The parameter is ρ, the correlation between bill amount and tip size for the population of all dinner bills at that restaurant. We estimate it using r, the correlation from a random sample of dinner bills.

Note: Question

Q.3. How much difference is there in the proportion of 30 to 39-year-old residents who have only a cell phone (no landline phone) compared to 50 to 59-year-olds in the country?

A.3. The population is all citizens of the country, and the parameter is p₁ − p₂, the difference between the proportion of 30 to 39-year-old residents who have only a cell phone (p₁) and the proportion with the same property among 50 to 59-year-olds (p₂). We estimate it using (p̂₁ − p̂₂), the difference in sample proportions computed from random samples taken from each group.

Sample statistics vary from sample to sample. In what follows, we will estimate this uncertainty and decide how reliable such statistics might be as estimates of population parameters.

Case Study #1: Sampling the NHANES dataset

We will first draw some samples from a known dataset. We load up the NHANES dataset and glimpse it.

data("NHANES")
glimpse(NHANES)
Rows: 10,000
Columns: 76
$ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 5164…
$ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10,…
$ Gender           <fct> male, male, male, male, female, male, male, female, f…
$ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10, …
$ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-9,  0-9,  40…
$ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795,…
$ Race1            <fct> White, White, White, Other, White, White, White, Whit…
$ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Education        <fct> High School, High School, High School, NA, Some Colle…
$ MaritalStatus    <fct> Married, Married, Married, NA, LivePartner, NA, NA, M…
$ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, 3…
$ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 8750…
$ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00,…
$ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3,…
$ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, O…
$ Work             <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, N…
$ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7,…
$ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Height           <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166.…
$ BMI              <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.2…
$ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plus…
$ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, 8…
$ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111, …
$ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, 6…
$ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124, …
$ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, 6…
$ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108, …
$ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, 6…
$ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114, …
$ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, 7…
$ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, 2…
$ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, 5…
$ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113, …
$ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1.…
$ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
$ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vgo…
$ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0,…
$ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA,…
$ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, None, None, No…
$ Depressed        <fct> Several, Several, Several, NA, Several, NA, NA, None,…
$ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, N…
$ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, N…
$ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, Y…
$ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes, …
$ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2, …
$ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, N…
$ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, N…
$ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Y…
$ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA, …
$ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, N…
$ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA, …
$ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No, …
$ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smoke…
$ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, N…
$ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Ye…
$ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, N…
$ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes,…
$ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, N…
$ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes, …
$ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Y…
$ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, N…
$ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA, …
$ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1,…
$ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, N…
$ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual, NA, Heteros…
$ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Let us create a NHANES (sub-)dataset that removes duplicated IDs, retains only adults, and selects the Height variable:

NHANES_adult <-
  NHANES %>%
  distinct(ID, .keep_all = TRUE) %>%
  filter(Age >= 18) %>%
  select(Height) %>%
  drop_na(Height)
NHANES_adult
Height
164.7
168.4
166.7
169.5
181.9
169.4
148.1
177.8
181.3
169.9
(1-10 of 4,790 rows shown)

An “Assumed” Population

Important: An “Assumed” Population

We very rarely have access to an entire population; all we can usually do is sample it. However, for now, and in order to build up our intuition, we will treat this single-variable dataset as our Population. So this is our population, with appropriate population parameters such as pop_mean and pop_sd.

Let us calculate the population parameters for the Height data from our “assumed” population:

# NHANES_adult is assumed population
pop_mean <- mosaic::mean(~Height, data = NHANES_adult)

pop_sd <- mosaic::sd(~Height, data = NHANES_adult)

pop_mean
[1] 168.3497
pop_sd
[1] 10.15705

Sampling

Now we will sample ONCE from the NHANES Height variable, taking a sample of size 50. We will then compare sample statistics with population parameters on the basis of this ONE sample of 50:

# Set graph theme
theme_set(new = theme_custom())
#

sample_50 <- mosaic::sample(NHANES_adult, size = 50) %>%
  select(Height)
sample_50
sample_mean_50 <- mean(~Height, data = sample_50)
sample_mean_50
# Plotting the histogram of this sample
sample_50 %>%
  gf_histogram(~Height, bins = 10) %>%
  gf_vline(
    xintercept = ~sample_mean_50,
    color = "purple"
  ) %>%
  gf_vline(
    xintercept = ~pop_mean,
    colour = "black"
  ) %>%
  gf_label(7 ~ (pop_mean + 8),
    label = "Population Mean",
    color = "black"
  ) %>%
  gf_label(7 ~ (sample_mean_50 - 8),
    label = "Sample Mean", color = "purple"
  ) %>%
  gf_labs(
    title = "Distribution and Mean of a Single Sample",
    subtitle = "Sample Size = 50"
  )
Height
156.2
162.1
187.1
177.3
161.8
178.9
170.9
165.2
168.9
175.6
(1-10 of 50 rows shown)
[1] 170.654

Repeated Samples and Sample Means

OK, so the sample_mean_50 is not too far from the pop_mean. Is this always true?

Let us check: we will create 500 repeated samples, each of size 50. And calculate their mean as the sample statistic, giving us a data frame containing 500 sample means. We will then see if these 500 means lie close to the pop_mean:

sample_50_500 <- do(500) * {
  sample(NHANES_adult, size = 50) %>%
    select(Height) %>% # drop sampling related column "orig.id"
    summarise(
      sample_mean = mean(Height),
      sample_sd = sd(Height),
      sample_min = min(Height),
      sample_max = max(Height)
    )
}
sample_50_500
dim(sample_50_500)
sample_mean  sample_sd  sample_min  sample_max  .row  .index
    168.800  10.120841       149.7       192.6     1       1
    169.068  10.001539       147.5       184.9     1       2
    167.950  10.811507       146.4       186.8     1       3
    167.184   9.243870       148.0       188.2     1       4
    168.080  10.819502       144.4       186.8     1       5
    169.926   8.830855       149.6       189.7     1       6
    167.754  10.105590       143.9       186.3     1       7
    167.970  10.807126       147.9       191.2     1       8
    168.468   8.620229       152.7       186.3     1       9
    168.550  11.802045       147.0       191.9     1      10
(1-10 of 500 rows shown)
[1] 500   6
# Set graph theme
theme_set(new = theme_custom())
#

sample_50_500 %>%
  gf_point(.index ~ sample_mean,
    color = "purple",
    title = "Sample Means are close to the Population Mean",
    subtitle = "Sample Means are Random!",
    caption = "Grey lines represent our 500 samples"
  ) %>%
  gf_segment(
    .index + .index ~ sample_min + sample_max,
    color = "grey",
    linewidth = 0.3,
    alpha = 0.3,
    ylab = "Sample Index (1-500)",
    xlab = "Sample Means"
  ) %>%
  gf_vline(
    xintercept = ~pop_mean,
    color = "black"
  ) %>%
  gf_label(-25 ~ pop_mean,
    label = "Population Mean",
    color = "black"
  )
##

sample_50_500 %>%
  gf_point(.index ~ sample_sd,
    color = "purple",
    title = "Sample SDs are close to the Population Sd",
    subtitle = "Sample SDs are Random!",
  ) %>%
  gf_vline(
    xintercept = ~pop_sd,
    color = "black"
  ) %>%
  gf_label(-25 ~ pop_sd,
    label = "Population SD",
    color = "black"
  ) %>%
  gf_refine(lims(x = c(4, 16)))

Note: Sample Means and Sample SDs are Random Variables!1

The sample_means (purple dots in the two graphs) are themselves random, because the samples are random, of course. They appear to lie generally in the vicinity of the pop_mean (vertical black line). And hence they have a mean and an sd of their own 😱. Do not get confused ;-D

And the sample_sds are also random and have their own distribution, around the pop_sd.2

Distribution of Sample-Means

Since the sample-means are themselves random variables, let us plot the distribution of these 500 sample-means, called the distribution of sample-means. We will also mark the position of the population mean pop_mean, the mean of the Height variable.

# Set graph theme
theme_set(new = theme_custom())
#

sample_50_500 %>%
  gf_dhistogram(~sample_mean, bins = 30, xlab = "Height") %>%
  gf_vline(
    xintercept = pop_mean,
    color = "blue"
  ) %>%
  gf_label(0.01 ~ pop_mean,
    label = "Population Mean",
    color = "blue"
  ) %>%
  gf_labs(
    title = "Sampling Mean Distribution",
    subtitle = "500 means"
  )


# How does this **distribution of sample-means** compare with the
# overall distribution of the population?
#
sample_50_500 %>%
  gf_dhistogram(~sample_mean, bins = 30, xlab = "Height") %>%
  gf_vline(
    xintercept = pop_mean,
    color = "blue"
  ) %>%
  gf_label(0.01 ~ pop_mean,
    label = "Population Mean",
    color = "blue"
  ) %>%
  ## Add the population histogram
  gf_histogram(~Height,
    data = NHANES_adult,
    alpha = 0.2, fill = "blue",
    bins = 30
  ) %>%
  gf_label(0.025 ~ (pop_mean + 20),
    label = "Population Distribution", color = "blue"
  ) %>%
  gf_labs(title = "Sampling Mean Distribution", subtitle = "Original Population overlay")

(Figure: Sample Means)

(Figure: Sample Means and Population Distributions)

Deriving the Central Limit Theorem (CLT)

We see in the Figure above that

  • the distribution of sample-means is centered around the pop_mean. (Mean of the sample means = pop_mean!! 😱)
  • the “spread” of the distribution of sample-means is less than pop_sd. Curiouser and curiouser! But exactly how much less?
  • and what kind of distribution is it?

One more experiment.

Now let’s repeatedly sample Height, compute the sample-means, and look at the resulting histograms. (We will deal with sample-standard-deviations shortly.) We will use sample sizes of c(8, 16, 32, 64), generate 1000 samples at each size, take their means, and plot these 4 * 1000 means:

# set.seed(12345)

samples_08_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 08))

samples_16_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 16))

samples_32_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 32))

samples_64_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 64))

# samples_128_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 128))

# Quick Check
head(samples_08_1000)
      mean
1 165.3375
2 171.9125
3 169.2250
4 173.0875
5 164.8500
6 174.2250
6 rows

Let us plot their individual histograms to compare them:

# Set graph theme
theme_set(new = theme_custom())
#
# Let us overlay their individual histograms to compare them:
p5 <- gf_dhistogram(~mean,
  data = samples_08_1000,
  color = "grey",
  fill = "dodgerblue", title = "N = 8"
) %>%
  gf_fitdistr(linewidth = 1) %>%
  gf_vline(
    xintercept = pop_mean, inherit = FALSE,
    color = "blue"
  ) %>%
  gf_label(-0.025 ~ pop_mean,
    label = "Population Mean",
    color = "blue"
  ) %>%
  gf_theme(scale_y_continuous(expand = expansion(mult = c(0.08, 0.02))))
##
p6 <- gf_dhistogram(~mean,
  data = samples_16_1000,
  color = "grey",
  fill = "sienna", title = "N = 16"
) %>%
  gf_fitdistr(linewidth = 1) %>%
  gf_vline(
    xintercept = pop_mean,
    color = "blue"
  ) %>%
  gf_label(-.025 ~ pop_mean,
    label = "Population Mean",
    color = "blue"
  ) %>%
  gf_theme(scale_y_continuous(expand = expansion(mult = c(0.08, 0.02))))
##
p7 <- gf_dhistogram(~mean,
  data = samples_32_1000,
  na.rm = TRUE,
  color = "grey",
  fill = "palegreen", title = "N = 32"
) %>%
  gf_fitdistr(linewidth = 1) %>%
  gf_vline(
    xintercept = pop_mean,
    color = "blue"
  ) %>%
  gf_label(-.025 ~ pop_mean,
    label = "Population Mean", color = "blue"
  ) %>%
  gf_theme(scale_y_continuous(expand = expansion(mult = c(0.08, 0.02))))

p8 <- gf_dhistogram(~mean,
  data = samples_64_1000,
  na.rm = TRUE,
  color = "grey",
  fill = "violetred", title = "N = 64"
) %>%
  gf_fitdistr(linewidth = 1) %>%
  gf_vline(
    xintercept = pop_mean,
    color = "blue"
  ) %>%
  gf_label(-.025 ~ pop_mean,
    label = "Population Mean", color = "blue"
  ) %>%
  gf_theme(scale_y_continuous(expand = expansion(mult = c(0.08, 0.02))))

# patchwork::wrap_plots(p5,p6,p7,p8)
p5
p6
p7
p8

And if we overlay these histograms on top of one another:

From the histograms we learn that the sample-means are normally distributed around the population mean. This feels intuitively right because when we sample from the population, many values will be close to the population mean, and values far away from the mean will be increasingly scarce.

Let us calculate the means of the sample-distributions:

mean(~mean, data = samples_08_1000) # Mean of means!!!;-0
mean(~mean, data = samples_16_1000)
mean(~mean, data = samples_32_1000)
mean(~mean, data = samples_64_1000)
pop_mean
[1] 168.1511
[1] 168.353
[1] 168.3914
[1] 168.3819
[1] 168.3497

All are pretty close to the population mean!!!

Consider the standard deviations of the sampling distributions:

$$ sd_i = \sqrt{\frac{\sum [x - \bar{x}_i]^2}{n_i}} $$

We take the sum of all squared differences from the mean in a sample. If we divide this sum-of-squared-differences by the sample length n, we get a sort of average squared difference, or per-capita squared error from the mean, which is called the variance.

Taking the square root gives us an average error, which we would call, of course, the standard deviation.
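We can check this recipe step by step on our earlier single sample. This is a base-R sketch; note that R’s built-in sd() divides by (n − 1) rather than n, so it differs slightly from the “per-capita” version described above:

x <- sample_50$Height                  # the single sample from earlier
sq_diffs <- (x - mean(x))^2            # squared differences from the mean
variance <- sum(sq_diffs) / length(x)  # "per-capita" squared error
sqrt(variance)                         # the standard deviation, per the text
sd(x)                                  # R's sd() uses (n - 1); slightly larger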

pop_sd
sd(~mean, data = samples_08_1000)
sd(~mean, data = samples_16_1000)
sd(~mean, data = samples_32_1000)
sd(~mean, data = samples_64_1000)
[1] 10.15705
[1] 3.682819
[1] 2.5176
[1] 1.775868
[1] 1.25669

These are also decreasing steadily with sample size. How are they related to the pop_sd?

pop_sd
pop_sd / sqrt(8)
pop_sd / sqrt(16)
pop_sd / sqrt(32)
pop_sd / sqrt(64)
[1] 10.15705
[1] 3.591058
[1] 2.539262
[1] 1.795529
[1] 1.269631

As we can see, the standard deviation of the sampling-mean distribution is inversely proportional to the square root of the sample size.

Note: Central Limit Theorem

Now we have enough to state the Central Limit Theorem (CLT):

  • The sample-means are normally distributed around the population mean. So the mean of any single sample is a good, unbiased estimate for the pop_mean.
  • The distribution of sample-means narrows with sample size, i.e. its sd decreases with increasing sample size:
  • $sd \sim \frac{1}{\sqrt{n}}$
  • This holds regardless of the distribution of the population itself.3

This theorem underlies all our procedures and techniques for statistical inference, as we shall see.
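Height happens to be roughly normal at the population level (see the footnote above), so here is a hedged sanity check of the “regardless of the population distribution” claim, using a made-up, deliberately skewed population and the same mosaic-style resampling as before:

# A deliberately right-skewed toy population (exponential; not from NHANES)
set.seed(2022)
skewed_pop <- rexp(n = 10000, rate = 0.1) # population mean is about 10

# 1000 sample-means, each computed from a resample of size 32
skewed_means <- do(1000) * mean(resample(skewed_pop, size = 32))

# The means pile up symmetrically around 10, despite the skewed population
gf_dhistogram(~mean, data = skewed_means, bins = 30) %>%
  gf_labs(
    title = "Sample-Means from a Skewed Population",
    subtitle = "Still (approximately) Gaussian!"
  )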

The Standard Error

Consider once again the standard deviations in each of the sample-distributions that we have generated. As we saw, these decrease with sample size:

$$ sd = \frac{pop\_sd}{\sqrt{sample\_size}} $$

where sample-size4 here is one of c(8, 16, 32, 64).

We reserve the term Standard Deviation for the population, and name this computed standard deviation of the sample-mean distribution the Standard Error. This statistic, derived from the sample-mean distribution, will help us infer population parameters with a precise estimate of the uncertainty involved.

Standard Error: $$ se = \frac{pop.sd}{\sqrt{n}} $$

However, we don’t normally know the pop_sd!! So now what? Is this a chicken-and-egg situation?

Let us take SINGLE SAMPLES of each size, as we normally would:

sample_08 <- mosaic::sample(NHANES_adult, size = 8) %>%
  select(Height)
sample_16 <- mosaic::sample(NHANES_adult, size = 16) %>%
  select(Height)
sample_32 <- mosaic::sample(NHANES_adult, size = 32) %>%
  select(Height)
sample_64 <- mosaic::sample(NHANES_adult, size = 64) %>%
  select(Height)
##
sd(~Height, data = sample_08)
[1] 7.427555
sd(~Height, data = sample_16)
[1] 7.710034
sd(~Height, data = sample_32)
[1] 8.803592
sd(~Height, data = sample_64)
[1] 10.10828
Warning

As we saw at the start of this module, the sample-means are distributed around the pop_mean, AND it appears that the sample-sds are also distributed around the pop_sd!

The sample-sd distribution, however, is not Gaussian, and the sample-sd is not an unbiased estimate of the pop_sd. But when the sample size n is large (n >= 30), the sample-sd distribution approximates the Gaussian, and the bias is small.

And so, in the same way, we approximate the pop_sd with the sd of a single sample of the same length. Hence the Standard Error can be computed from a single sample as:

Standard Error: $$ se = \frac{sample.sd}{\sqrt{n}} $$

With these, the Standard Errors (SE) evaluate to:

pop_sd <- sd(~Height, data = NHANES_adult)
pop_sd
sd(~Height, data = sample_08) / sqrt(8)
sd(~Height, data = sample_16) / sqrt(16)
sd(~Height, data = sample_32) / sqrt(32)
sd(~Height, data = sample_64) / sqrt(64)
[1] 10.15705
[1] 2.626037
[1] 1.927509
[1] 1.55627
[1] 1.263536
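Since this sample-based SE computation recurs constantly, it can be convenient to wrap it in a small helper function. This is our own sketch, not a function from mosaic or any other package:

# A small convenience helper (ours, not from any package):
# the standard error of a numeric vector
std_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}
std_error(sample_16$Height) # same as sd(~Height, data = sample_16) / sqrt(16)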

Confidence intervals

When we work with samples, we want to be able to speak with a certain degree of confidence about the population mean, based on the evaluation of one sample mean, not a large number of them. Given that sample-means are normally distributed around the pop_mean, we can say that 68% of all possible sample-means lie within ±SE of the population mean, and further that 95% of all possible sample-means lie within ±2*SE of the population mean. These two constants, ±1 and ±2, are the z-scores from the z-distribution we saw earlier:
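These coverage numbers can be checked directly from the Normal distribution in base R:

# Probability mass within +/- 1 and +/- 2 z-scores of the mean
pnorm(1) - pnorm(-1) # about 0.68, i.e. 68%
pnorm(2) - pnorm(-2) # about 0.95, i.e. 95%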

These two intervals, sample.mean ± SE and sample.mean ± 2*SE, are called the confidence intervals for the population mean, at confidence levels of 68% and 95% respectively.

How do these vary with sample size n?

# Set graph theme
theme_set(new = theme_custom())
#
tbl_1 <- get_ci(samples_08_1000, level = 0.95)
tbl_2 <- get_ci(samples_16_1000, level = 0.95)
tbl_3 <- get_ci(samples_32_1000, level = 0.95)
tbl_4 <- get_ci(samples_64_1000, level = 0.95)
rbind(tbl_1, tbl_2, tbl_3, tbl_4) %>%
  rownames_to_column("index") %>%
  cbind("sample_size" = c(8, 16, 32, 64)) %>%
  gf_segment(index + index ~ lower_ci + upper_ci) %>%
  gf_vline(xintercept = pop_mean) %>%
  gf_labs(
    title = "95% Confidence Intervals for the Mean",
    subtitle = "Varying samples sizes 8-16-32-64",
    y = "Sample Size", x = "Mean Ranges"
  ) %>%
  gf_refine(scale_y_discrete(labels = c(8, 16, 32, 64))) %>%
  gf_refine(annotate(geom = "label", x = pop_mean + 1.75, y = 1.5, label = "Population Mean"))

Important: Confidence Intervals Shrink!

Yes, with sample size! Why is that a good thing?

So finally, we can pretend that our sample is also distributed normally (i.e. Gaussian), use its sample_sd as a substitute for pop_sd, plot the standard errors on the sample histogram, and see where our believed mean lies.

# Set graph theme
theme_set(new = theme_custom())
#
sample_mean <- mean(~Height, data = sample_16)
se <- sd(~Height, data = sample_16) / sqrt(16)
#
xqnorm(
  p = c(0.025, 0.975),
  mean = sample_mean,
  sd = sd(~Height, data = sample_16),
  return = c("plot"), verbose = F
) %>%
  gf_vline(xintercept = ~pop_mean, colour = "black") %>%
  gf_vline(xintercept = mean(~Height, data = sample_16), colour = "purple") %>%
  gf_labs(
    title = "Confidence Intervals and the Bell Curve. N=16",
    subtitle = "Sample is plotted as theoretical Gaussian Bell Curve"
  ) %>%
  gf_refine(
    annotate(geom = "label", x = pop_mean + 15, y = 0.05, label = "Population Mean"),
    annotate(geom = "label", x = sample_mean - 15, y = 0.05, label = "Sample Mean", colour = "purple")
  )

pop_mean
se <- sd(~Height, data = sample_16) / sqrt(16)
mean(~Height, data = sample_16) - 2.0 * se
mean(~Height, data = sample_16) + 2.0 * se
[1] 168.3497
[1] 166.2512
[1] 173.9613

So if our believed mean is within the Confidence Intervals, then OK: we can say our belief may be true. If it is way outside the confidence intervals, we have to think that our belief may be flawed, and accept an alternative interpretation.

CLT Workflow

Thus, if we want to estimate a population mean (see the end-to-end sketch after this list):

  • we take one random sample of size n from the population;
  • we calculate the sample-mean from that sample;
  • we calculate the sample-sd;
  • we calculate the Standard Error as $\frac{sample.sd}{\sqrt{n}}$;
  • we calculate the 95% confidence interval for the population parameter as $CI_{95\%} = sample.mean \pm 2*SE$;
  • since the Standard Error decreases with sample size, we make our sample of adequate size (n = 30 seems appropriate in most cases. Why?);
  • and we do not have to worry about the distribution of the population. It need not be normal/Gaussian/Bell-shaped!!
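Here is that workflow end-to-end in one compact sketch, using our assumed NHANES_adult population from above so that we can check the interval against the (known) pop_mean:

# CLT workflow, end to end
n <- 30
one_sample <- mosaic::sample(NHANES_adult, size = n)

sample_mean <- mean(~Height, data = one_sample)
sample_sd <- sd(~Height, data = one_sample)
se <- sample_sd / sqrt(n)

# 95% confidence interval for the population mean
c(lower = sample_mean - 2 * se, upper = sample_mean + 2 * se)

# In a real study we could not do this check; here pop_mean is known,
# and it should land inside the interval for about 95% of such samples
pop_mean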

CLT Assumptions

  • The sample observations are random and independent;
  • The sample is of “decent length” (n >= 30); the distribution of sample-means is then approximately Gaussian, whatever the population’s own distribution.

An interactive Sampling app

Below is an interactive sampling app. Play with the different settings, especially the distribution of the population, to get a firm grasp on sampling and the CLT.

https://gallery.shinyapps.io/CLT_mean/

References

  1. Diez, David M., Barr, Christopher D., & Çetinkaya-Rundel, Mine. OpenIntro Statistics. https://www.openintro.org/book/os/

  2. Stats Test Wizard. https://www.socscistatistics.com/tests/what_stats_test_wizard.aspx

  3. Måns Thulin. Modern Statistics with R: From wrangling and exploring data to inference and predictive modelling. http://www.modernstatisticswithr.com/

  4. Jonas Kristoffer Lindeløv. Common statistical tests are linear models (or: how to teach stats). https://lindeloev.github.io/tests-as-linear/

  5. CheatSheet: https://lindeloev.github.io/tests-as-linear/linear_tests_cheat_sheet.pdf

  6. Steve Doogue. Common statistical tests are linear models: a work through. https://steverxd.github.io/Stat_tests/

  7. Jeffrey Walker. Elements of Statistical Modeling for Experimental Biology. https://www.middleprofessor.com/files/applied-biostatistics_bookdown/_book/

  8. Adam Loy, Lendie Follett & Heike Hofmann (2016). Variations of Q-Q Plots: The Power of Our Eyes! The American Statistician, 70:2, 202-214. DOI: 10.1080/00031305.2015.1077728

R Package Citations
Package Version Citation
NHANES 2.1.0 Pruim (2015)
regressinator 0.2.0 Reinhart (2024)
smovie 1.1.6 Northrop (2023)
TeachHist 0.2.1 Lange (2023)
TeachingDemos 2.13 Snow (2024)
visualize 4.5.0 Balamuta (2023)
Balamuta, James. 2023. visualize: Graph Probability Distributions with User Supplied Parameters and Statistics. https://CRAN.R-project.org/package=visualize.
Lange, Carsten. 2023. TeachHist: A Collection of Amended Histograms Designed for Teaching Statistics. https://CRAN.R-project.org/package=TeachHist.
Northrop, Paul J. 2023. smovie: Some Movies to Illustrate Concepts in Statistics. https://CRAN.R-project.org/package=smovie.
Pruim, Randall. 2015. NHANES: Data from the US National Health and Nutrition Examination Study. https://CRAN.R-project.org/package=NHANES.
Reinhart, Alex. 2024. regressinator: Simulate and Diagnose (Generalized) Linear Models. https://CRAN.R-project.org/package=regressinator.
Snow, Greg. 2024. TeachingDemos: Demonstrations for Teaching and Learning. https://CRAN.R-project.org/package=TeachingDemos.

Footnotes

  1. https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Introductory_Statistics_(Shafer_and_Zhang)/06%3A_Sampling_Distributions/6.01%3A_The_Mean_and_Standard_Deviation_of_the_Sample_Mean↩︎

  2. https://mathworld.wolfram.com/StandardDeviationDistribution.html↩︎

  3. The `Height` variable seems to be normally distributed at population level. We will try other non-normal population variables as an exercise in the tutorials.↩︎

  4. Once sample size = population, we have complete access to the population and there is no question of estimation error! So sample_sd = pop_sd!↩︎

Citation

BibTeX citation:
@online{2022,
  author = {},
  title = {🎲 {Samples,} {Populations,} {Statistics} and {Inference}},
  date = {2022-11-25},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/20-SampProb/},
  langid = {en},
  abstract = {How much ~~Land~~ Data does a Man need?}
}
For attribution, please cite this work as:
“🎲 Samples, Populations, Statistics and Inference.” 2022. November 25, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/20-SampProb/.

License: CC BY-SA 2.0
