
    🎲 Samples, Populations, Statistics and Inference

Tags: Sampling · Central Limit Theorem · Standard Error · Confidence Intervals

Published: November 25, 2022
Modified: March 1, 2025

Abstract
How much ~~Land~~ Data does a Man need?

    Slides and Tutorials

    R Tutorial   Datasets

"When Alexander the Great visited Diogenes and asked whether he could do anything for the famed teacher, Diogenes replied: 'Only stand out of my light.' Perhaps some day we shall know how to heighten creativity. Until then, one of the best things we can do for creative men and women is to stand out of their light."

— John W. Gardner, author and educator (8 Oct 1912-2002)

    Setting up R Packages

    library(tidyverse) # Data Processing in R
    library(mosaic) # Our workhorse for stats, sampling
    library(skimr) # Good to Examine data
    library(ggformula) # Formula interface for graphs
    
    # load the NHANES data library
    library(NHANES)
    library(infer)

    Plot Theme

    # https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto
    
    # Chunk options
    knitr::opts_chunk$set(
      fig.width = 7,
      fig.asp = 0.618, # Golden Ratio
      # out.width = "80%",
      fig.align = "center"
    )
    ### Ggplot Theme
    ### https://rpubs.com/mclaire19/ggplot2-custom-themes
    
    theme_custom <- function() {
      font <- "Roboto Condensed" # assign font family up front
    
      theme_classic(base_size = 14) %+replace% # replace elements we want to change
    
        theme(
          panel.grid.minor = element_blank(), # strip minor gridlines
          text = element_text(family = font),
          # text elements
          plot.title = element_text( # title
            family = font, # set font family
            # size = 20,               #set font size
            face = "bold", # bold typeface
            hjust = 0, # left align
            # vjust = 2                #raise slightly
            margin = margin(0, 0, 10, 0)
          ),
          plot.subtitle = element_text( # subtitle
            family = font, # font family
            # size = 14,                #font size
            hjust = 0,
            margin = margin(2, 0, 5, 0)
          ),
          plot.caption = element_text( # caption
            family = font, # font family
            size = 8, # font size
            hjust = 1
          ), # right align
    
          axis.title = element_text( # axis titles
            family = font, # font family
            size = 10 # font size
          ),
          axis.text = element_text( # axis text
            family = font, # axis family
            size = 8
          ) # font size
        )
    }
    
    # Set graph theme
    theme_set(new = theme_custom())
    #

    What is a Population?

    A population is a collection of individuals or observations we are interested in. This is also commonly denoted as a study population in science research, or a target audience in design. We mathematically denote the population’s size using upper-case N.

    A population parameter is some numerical summary about the population that is unknown but you wish you knew. For example, when this quantity is a mean like the mean height of all Bangaloreans, the population parameter of interest is the population mean μ.

A census is an exhaustive enumeration or counting of all N individuals in the population. We do this in order to compute the population parameter's value exactly. Note that as the number N of individuals in our population increases, conducting a census becomes more expensive in time, energy, and money.

    Important Parameters

Population parameters are usually denoted by Greek letters.

    What is a Sample?

    Sampling is the act of collecting a small subset from the population, which we generally do when we can’t perform a census. We mathematically denote the sample size using lower case n, as opposed to upper case N which denotes the population’s size. Typically the sample size n is much smaller than the population size N. Thus sampling is a much cheaper alternative than performing a census.

    A sample statistic, also known as a point estimate, is a summary statistic like a mean or standard deviation that is computed from a sample.

Note: Why do we sample?

Because we cannot always conduct a census (and sometimes we will not even know how big the population is), we take samples. We still want to do useful work for, and with, the population after estimating its parameters, an act of generalizing from sample to population. So the question is: can we estimate useful parameters of the population using just samples? Can point estimates serve as useful guides to population parameters?

    This act of generalizing from sample to population is at the heart of statistical inference.

Important: An Alliterative Mnemonic

There is an alliterative mnemonic here: Samples have Statistics; Populations have Parameters.

    Population Parameters and Sample Statistics

Parameters and Statistics

Quantity              Population Parameter   Sample Statistic
Mean                  μ                      x̄
Standard Deviation    σ                      s
Proportion            p                      p̂
Correlation           ρ                      r
Slope (Regression)    β1                     b1
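
To make the table concrete, here is a minimal sketch in R that computes each sample statistic on a tiny made-up sample. The data frame toy_df and its variables x and y are hypothetical, invented purely for illustration:

# Hypothetical toy sample, purely for illustration
toy_df <- tibble(
  x = c(1.2, 2.3, 3.1, 4.8, 5.0),
  y = c(2.0, 4.1, 6.3, 9.5, 10.1)
)
mean(toy_df$x) # x̄, estimates the population mean μ
sd(toy_df$x) # s, estimates the population sd σ
mean(toy_df$x > 3) # p̂, estimates a population proportion p
cor(toy_df$x, toy_df$y) # r, estimates the population correlation ρ
coef(lm(y ~ x, data = toy_df))[2] # b1, estimates the regression slope β1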
Note: Question

    Q.1. What is the mean commute time for workers in a particular city?

A.1. The parameter is the mean commute time μ for the population of all workers who work in the city. We estimate it using x̄, the mean of a random sample of people who work in the city.

Note: Question

    Q.2. What is the correlation between the size of dinner bills and the size of tips at a restaurant?

A.2. The parameter is ρ, the correlation between bill amount and tip size for the population of all dinner bills at that restaurant. We estimate it using r, the correlation from a random sample of dinner bills.

Note: Question

    Q.3. How much difference is there in the proportion of 30 to 39-year-old residents who have only a cell phone (no land line phone) compared to 50 to 59-year-olds in the country?

A.3. The population is all citizens of the country, and the parameter is p1 − p2, the difference between the proportion of 30 to 39-year-old residents who have only a cell phone (p1) and the proportion with the same property among 50 to 59-year-olds (p2). We estimate it using (p̂1 − p̂2), the difference in sample proportions computed from random samples taken from each group.

Sample statistics vary from sample to sample. In what follows, we will estimate this uncertainty and decide how reliable sample statistics are as estimates of population parameters.

    Case Study #1: Sampling the NHANES dataset

We will first draw some samples from a known dataset. We load up the NHANES dataset and glimpse it.

    data("NHANES")
    glimpse(NHANES)
    Rows: 10,000
    Columns: 76
    $ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 5164…
    $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10,…
    $ Gender           <fct> male, male, male, male, female, male, male, female, f…
    $ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10, …
    $ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-9,  0-9,  40…
    $ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795,…
    $ Race1            <fct> White, White, White, Other, White, White, White, Whit…
    $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ Education        <fct> High School, High School, High School, NA, Some Colle…
    $ MaritalStatus    <fct> Married, Married, Married, NA, LivePartner, NA, NA, M…
    $ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, 3…
    $ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 8750…
    $ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00,…
    $ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3,…
    $ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, O…
    $ Work             <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, N…
    $ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7,…
    $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ Height           <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166.…
    $ BMI              <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.2…
    $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plus…
    $ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, 8…
    $ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111, …
    $ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, 6…
    $ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124, …
    $ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, 6…
    $ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108, …
    $ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, 6…
    $ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114, …
    $ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, 7…
    $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, 2…
    $ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, 5…
    $ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113, …
    $ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1.…
    $ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N…
    $ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vgo…
    $ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0,…
    $ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA,…
    $ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, None, None, No…
    $ Depressed        <fct> Several, Several, Several, NA, Several, NA, NA, None,…
    $ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, N…
    $ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA…
    $ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, N…
    $ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, Y…
    $ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes, …
    $ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2, …
    $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
    $ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, N…
    $ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, N…
    $ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Y…
    $ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA, …
    $ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, N…
    $ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA, …
    $ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No, …
    $ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smoke…
    $ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, N…
    $ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Ye…
    $ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, N…
    $ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes,…
    $ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, N…
    $ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes, …
    $ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Y…
    $ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, N…
    $ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA, …
    $ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1,…
    $ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, N…
    $ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual, NA, Heteros…
    $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Let us create an NHANES sub-dataset that drops duplicated IDs, keeps only adults, and selects the Height variable:

    NHANES_adult <-
      NHANES %>%
      distinct(ID, .keep_all = TRUE) %>%
      filter(Age >= 18) %>%
      select(Height) %>%
      drop_na(Height)
    NHANES_adult
Height
<dbl>
164.7
168.4
166.7
169.5
181.9
169.4
148.1
177.8
181.3
169.9
(showing 10 of 4,790 rows)

    An “Assumed” Population

Important: An "Assumed" Population

In practice, we rarely have access to an entire population; all we can do is sample it. However, for now, and in order to build up our intuition, we will treat this single-variable dataset as our population. So this is our population, with appropriate population parameters such as pop_mean and pop_sd.

    Let us calculate the population parameters for the Height data from our “assumed” population:

    # NHANES_adult is assumed population
    pop_mean <- mosaic::mean(~Height, data = NHANES_adult)
    
    pop_sd <- mosaic::sd(~Height, data = NHANES_adult)
    
    pop_mean
    [1] 168.3497
    pop_sd
    [1] 10.15705

    Sampling

Now we will sample ONCE from the NHANES Height variable, taking a sample of size 50, and compare sample statistics with population parameters on the basis of this ONE sample:

    # Set graph theme
    theme_set(new = theme_custom())
    #
    
    sample_50 <- mosaic::sample(NHANES_adult, size = 50) %>%
      select(Height)
    sample_50
    sample_mean_50 <- mean(~Height, data = sample_50)
    sample_mean_50
    # Plotting the histogram of this sample
    sample_50 %>%
      gf_histogram(~Height, bins = 10) %>%
      gf_vline(
        xintercept = ~sample_mean_50,
        color = "purple"
      ) %>%
      gf_vline(
        xintercept = ~pop_mean,
        colour = "black"
      ) %>%
      gf_label(7 ~ (pop_mean + 8),
        label = "Population Mean",
        color = "black"
      ) %>%
      gf_label(7 ~ (sample_mean_50 - 8),
        label = "Sample Mean", color = "purple"
      ) %>%
      gf_labs(
        title = "Distribution and Mean of a Single Sample",
        subtitle = "Sample Size = 50"
      )
Height
<dbl>
156.2
162.1
187.1
177.3
161.8
178.9
170.9
165.2
168.9
175.6
(showing 10 of 50 rows)
[1] 170.654

    Repeated Samples and Sample Means

    OK, so the sample_mean_50 is not too far from the pop_mean. Is this always true?

Let us check: we will create 500 repeated samples, each of size 50, and calculate each sample's mean as the sample statistic, giving us a data frame containing 500 sample-means. We will then see if these 500 means lie close to the pop_mean:

    sample_50_500 <- do(500) * {
      sample(NHANES_adult, size = 50) %>%
        select(Height) %>% # drop sampling related column "orig.id"
        summarise(
          sample_mean = mean(Height),
          sample_sd = sd(Height),
          sample_min = min(Height),
          sample_max = max(Height)
        )
    }
    sample_50_500
    dim(sample_50_500)
sample_mean  sample_sd  sample_min  sample_max  .row  .index
    168.800  10.120841       149.7       192.6     1       1
    169.068  10.001539       147.5       184.9     1       2
    167.950  10.811507       146.4       186.8     1       3
    167.184   9.243870       148.0       188.2     1       4
    168.080  10.819502       144.4       186.8     1       5
    169.926   8.830855       149.6       189.7     1       6
    167.754  10.105590       143.9       186.3     1       7
    167.970  10.807126       147.9       191.2     1       8
    168.468   8.620229       152.7       186.3     1       9
    168.550  11.802045       147.0       191.9     1      10
(showing 10 of 500 rows)
[1] 500   6
    # Set graph theme
    theme_set(new = theme_custom())
    #
    
    sample_50_500 %>%
      gf_point(.index ~ sample_mean,
        color = "purple",
        title = "Sample Means are close to the Population Mean",
        subtitle = "Sample Means are Random!",
        caption = "Grey lines represent our 500 samples"
      ) %>%
      gf_segment(
        .index + .index ~ sample_min + sample_max,
        color = "grey",
        linewidth = 0.3,
        alpha = 0.3,
        ylab = "Sample Index (1-500)",
        xlab = "Sample Means"
      ) %>%
      gf_vline(
        xintercept = ~pop_mean,
        color = "black"
      ) %>%
      gf_label(-25 ~ pop_mean,
        label = "Population Mean",
        color = "black"
      )
    ##
    
    sample_50_500 %>%
      gf_point(.index ~ sample_sd,
        color = "purple",
        title = "Sample SDs are close to the Population Sd",
        subtitle = "Sample SDs are Random!",
      ) %>%
      gf_vline(
        xintercept = ~pop_sd,
        color = "black"
      ) %>%
      gf_label(-25 ~ pop_sd,
        label = "Population SD",
        color = "black"
      ) %>%
      gf_refine(lims(x = c(4, 16)))

Note: Sample Means and Sample SDs are Random Variables!¹

The sample means (purple dots in the two graphs) are themselves random, because the samples are random, of course. It appears that they are generally in the vicinity of the pop_mean (vertical black line). And hence they will have a mean and an sd too 😱. Do not get confused ;-D

And the sample SDs are also random, and have their own distribution, around the pop_sd.²

    Distribution of Sample-Means

Since the sample-means are themselves random variables, let us plot the distribution of these 500 sample-means, called the distribution of sample-means. We will also plot the position of the population mean pop_mean, the mean of the Height variable.

    # Set graph theme
    theme_set(new = theme_custom())
    #
    
    sample_50_500 %>%
      gf_dhistogram(~sample_mean, bins = 30, xlab = "Height") %>%
      gf_vline(
        xintercept = pop_mean,
        color = "blue"
      ) %>%
      gf_label(0.01 ~ pop_mean,
        label = "Population Mean",
        color = "blue"
      ) %>%
      gf_labs(
        title = "Sampling Mean Distribution",
        subtitle = "500 means"
      )
    
    
    # How does this **distribution of sample-means** compare with the
    # overall distribution of the population?
    #
    sample_50_500 %>%
      gf_dhistogram(~sample_mean, bins = 30, xlab = "Height") %>%
      gf_vline(
        xintercept = pop_mean,
        color = "blue"
      ) %>%
      gf_label(0.01 ~ pop_mean,
        label = "Population Mean",
        color = "blue"
      ) %>%
      ## Add the population histogram
      gf_histogram(~Height,
        data = NHANES_adult,
        alpha = 0.2, fill = "blue",
        bins = 30
      ) %>%
      gf_label(0.025 ~ (pop_mean + 20),
        label = "Population Distribution", color = "blue"
      ) %>%
      gf_labs(title = "Sampling Mean Distribution", subtitle = "Original Population overlay")

[Figures: "Sample Means" (the distribution of 500 sample-means), and "Sample Means and Population Distributions" (the same distribution overlaid on the population histogram)]

    Deriving the Central Limit Theorem (CLT)

    We see in the Figure above that

• the distribution of sample-means is centered around the pop_mean. (Mean of the sample-means = pop_mean!! 😱)
• the "spread" of the distribution of sample-means is less than pop_sd. Curiouser and curiouser! But exactly how much is it?
• and what kind of distribution is it?

One more experiment.

Now let us repeatedly sample Height, compute the sample-means, and look at the resulting histograms. (We will deal with sample standard deviations shortly.) We will use sample sizes of c(8, 16, 32, 64), generate 1000 samples of each size, take the means, and plot these 4 * 1000 means:

    # set.seed(12345)
    
    samples_08_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 08))
    
    samples_16_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 16))
    
    samples_32_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 32))
    
    samples_64_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 64))
    
    # samples_128_1000 <- do(1000) * mean(resample(NHANES_adult$Height, size = 128))
    
    # Quick Check
    head(samples_08_1000)
       mean
      <dbl>
1  165.3375
2  171.9125
3  169.2250
4  173.0875
5  164.8500
6  174.2250
(6 rows)

    Let us plot their individual histograms to compare them:

    # Set graph theme
    theme_set(new = theme_custom())
    #
    # Let us overlay their individual histograms to compare them:
    p5 <- gf_dhistogram(~mean,
      data = samples_08_1000,
      color = "grey",
      fill = "dodgerblue", title = "N = 8"
    ) %>%
      gf_fitdistr(linewidth = 1) %>%
      gf_vline(
        xintercept = pop_mean, inherit = FALSE,
        color = "blue"
      ) %>%
      gf_label(-0.025 ~ pop_mean,
        label = "Population Mean",
        color = "blue"
      ) %>%
      gf_theme(scale_y_continuous(expand = expansion(mult = c(0.08, 0.02))))
    ##
    p6 <- gf_dhistogram(~mean,
      data = samples_16_1000,
      color = "grey",
      fill = "sienna", title = "N = 16"
    ) %>%
      gf_fitdistr(linewidth = 1) %>%
      gf_vline(
        xintercept = pop_mean,
        color = "blue"
      ) %>%
      gf_label(-.025 ~ pop_mean,
        label = "Population Mean",
        color = "blue"
      ) %>%
      gf_theme(scale_y_continuous(expand = expansion(mult = c(0.08, 0.02))))
    ##
    p7 <- gf_dhistogram(~mean,
      data = samples_32_1000,
      na.rm = TRUE,
      color = "grey",
      fill = "palegreen", title = "N = 32"
    ) %>%
      gf_fitdistr(linewidth = 1) %>%
      gf_vline(
        xintercept = pop_mean,
        color = "blue"
      ) %>%
      gf_label(-.025 ~ pop_mean,
        label = "Population Mean", color = "blue"
      ) %>%
      gf_theme(scale_y_continuous(expand = expansion(mult = c(0.08, 0.02))))
    
    p8 <- gf_dhistogram(~mean,
      data = samples_64_1000,
      na.rm = TRUE,
      color = "grey",
      fill = "violetred", title = "N = 64"
    ) %>%
      gf_fitdistr(linewidth = 1) %>%
      gf_vline(
        xintercept = pop_mean,
        color = "blue"
      ) %>%
      gf_label(-.025 ~ pop_mean,
        label = "Population Mean", color = "blue"
      ) %>%
      gf_theme(scale_y_continuous(expand = expansion(mult = c(0.08, 0.02))))
    
    # patchwork::wrap_plots(p5,p6,p7,p8)
    p5
    p6
    p7
    p8

    And if we overlay these histograms on top of one another:

    From the histograms we learn that the sample-means are normally distributed around the population mean. This feels intuitively right because when we sample from the population, many values will be close to the population mean, and values far away from the mean will be increasingly scarce.

    Let us calculate the means of the sample-distributions:

    mean(~mean, data = samples_08_1000) # Mean of means!!!;-0
    mean(~mean, data = samples_16_1000)
    mean(~mean, data = samples_32_1000)
    mean(~mean, data = samples_64_1000)
    pop_mean
    [1] 168.1511
    [1] 168.353
    [1] 168.3914
    [1] 168.3819
    [1] 168.3497

All are pretty close to the population mean!!!

    Consider the standard deviations of the sampling distributions:

$$sd_i = \sqrt{\frac{\sum \left[ x_i - \bar{x_i} \right]^2}{n_i}}$$

    We take the sum of all squared differences from the mean in a sample. If we divide this sum-of-squared-differences by the sample length n, we get a sort of average squared difference, or per-capita squared error from the mean, which is called the variance.

    Taking the square root gives us an average error, which we would call, of course, the standard deviation.
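
As a quick sanity check, here is a minimal sketch of this computation on the sample_50 data from earlier. (Note, as an aside, that R's built-in sd() divides by n − 1 rather than n, so the two values differ slightly for small samples.)

# Variance and sd "by hand", following the formula above
x <- sample_50$Height
n <- length(x)
variance_by_hand <- sum((x - mean(x))^2) / n # per-capita squared error
sqrt(variance_by_hand) # average error: the standard deviation
sd(x) # base R divides by (n - 1), so slightly larger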

    pop_sd
    sd(~mean, data = samples_08_1000)
    sd(~mean, data = samples_16_1000)
    sd(~mean, data = samples_32_1000)
    sd(~mean, data = samples_64_1000)
    [1] 10.15705
    [1] 3.682819
    [1] 2.5176
    [1] 1.775868
    [1] 1.25669

    These are also decreasing steadily with sample size. How are they related to the pop_sd?

    pop_sd
    pop_sd / sqrt(8)
    pop_sd / sqrt(16)
    pop_sd / sqrt(32)
    pop_sd / sqrt(64)
    [1] 10.15705
    [1] 3.591058
    [1] 2.539262
    [1] 1.795529
    [1] 1.269631

As we can see, the standard deviations of the sampling-mean distributions are inversely proportional to the square root of the sample size.

Note: Central Limit Theorem

Now we have enough to state the Central Limit Theorem (CLT):

• the sample-means are normally distributed around the population mean. So the mean of any single sample is a good, unbiased estimate for the pop_mean;
• the sample-mean distributions narrow with sample size, i.e. their sd decreases with increasing sample size;
• $sd \sim \frac{1}{\sqrt{n}}$;
• and this holds regardless of the distribution of the population itself.³

    This theorem underlies all our procedures and techniques for statistical inference, as we shall see.
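
As a quick check of that last bullet, here is a minimal sketch using a deliberately skewed, non-normal "population" simulated with rexp(). This simulated population is an assumption made purely for illustration; it is not part of the NHANES data:

# Set graph theme
theme_set(new = theme_custom())
#
skewed_pop <- rexp(10000, rate = 0.5) # strongly right-skewed population
skewed_means <- do(1000) * mean(resample(skewed_pop, size = 32))
gf_dhistogram(~mean, data = skewed_means, bins = 30) %>%
  gf_labs(
    title = "Sample Means from a Skewed Population",
    subtitle = "Still approximately Gaussian, as the CLT promises"
  )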

    The Standard Error

Consider once again the standard deviations in each of the sample-mean distributions that we have generated. As we saw, these decrease with sample size:

sd = pop_sd / sqrt(sample_size), where the sample size⁴ here is one of c(8, 16, 32, 64).

We reserve the term Standard Deviation for the population, and call this computed standard deviation of the sample-mean distributions the Standard Error. This statistic, derived from the sample-mean distribution, will help us infer our population parameters with a precise estimate of the uncertainty involved.

Standard Error: $se = \frac{pop.sd}{\sqrt{n}}$

However, we don't normally know the pop_sd!! So now what?? Is this a chicken-and-egg situation??

    Let us take SINGLE SAMPLES of each size, as we normally would:

    sample_08 <- mosaic::sample(NHANES_adult, size = 8) %>%
      select(Height)
    sample_16 <- mosaic::sample(NHANES_adult, size = 16) %>%
      select(Height)
    sample_32 <- mosaic::sample(NHANES_adult, size = 32) %>%
      select(Height)
    sample_64 <- mosaic::sample(NHANES_adult, size = 64) %>%
      select(Height)
    ##
    sd(~Height, data = sample_08)
    [1] 7.427555
    sd(~Height, data = sample_16)
    [1] 7.710034
    sd(~Height, data = sample_32)
    [1] 8.803592
    sd(~Height, data = sample_64)
    [1] 10.10828
    Warning

    As we saw at the start of this module, the sample-means are distributed around the pop_mean, AND it appears that the sample-sds are also distributed around the pop_sd!

However, the sample-sd distribution is not Gaussian, and the sample_sd is not an unbiased estimate for the pop_sd. But when the sample size n is large (n >= 30), the sample-sd distribution approximates the Gaussian, and the bias is small.

    And so, in the same way, we approximate the pop_sd with the sd of a single sample of the same length. Hence the Standard Error can be computed from a single sample as:

Standard Error: $se = \frac{sample.sd}{\sqrt{n}}$

With these, the Standard Errors (SE) evaluate to:

    pop_sd <- sd(~Height, data = NHANES_adult)
    pop_sd
    sd(~Height, data = sample_08) / sqrt(8)
    sd(~Height, data = sample_16) / sqrt(16)
    sd(~Height, data = sample_32) / sqrt(32)
    sd(~Height, data = sample_64) / sqrt(64)
    [1] 10.15705
    [1] 2.626037
    [1] 1.927509
    [1] 1.55627
    [1] 1.263536
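
Since we will compute this quantity repeatedly, a small helper function is handy. This is a sketch; std_error is our own hypothetical name, not a function from mosaic:

# Standard Error from a single sample (hypothetical helper)
std_error <- function(x) sd(x) / sqrt(length(x))
std_error(sample_16$Height) # same value as the manual computation above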

    Confidence intervals

When we work with samples, we want to be able to speak with a certain degree of confidence about the population mean, based on the evaluation of one sample mean, not a large number of them. Given that sample-means are normally distributed around the pop_mean, we can say that 68% of all possible sample-means lie within ±SE of the population mean, and further, that 95% of all possible sample-means lie within ±2*SE of the population mean. These two constants, ±1 and ±2, are the z-scores from the z-distribution we saw earlier:

These two intervals, sample.mean ± SE and sample.mean ± 2*SE, are called the confidence intervals for the population mean, at the 68% and 95% levels of probability respectively.
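
Here is a minimal sketch of how both intervals are computed in practice, using the single sample_16 drawn above:

sample_mean_16 <- mean(~Height, data = sample_16)
se_16 <- sd(~Height, data = sample_16) / sqrt(16)
c(sample_mean_16 - se_16, sample_mean_16 + se_16) # 68% CI
c(sample_mean_16 - 2 * se_16, sample_mean_16 + 2 * se_16) # 95% CI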

How do these vary with sample size n?

    # Set graph theme
    theme_set(new = theme_custom())
    #
    tbl_1 <- get_ci(samples_08_1000, level = 0.95)
    tbl_2 <- get_ci(samples_16_1000, level = 0.95)
    tbl_3 <- get_ci(samples_32_1000, level = 0.95)
    tbl_4 <- get_ci(samples_64_1000, level = 0.95)
    rbind(tbl_1, tbl_2, tbl_3, tbl_4) %>%
      rownames_to_column("index") %>%
      cbind("sample_size" = c(8, 16, 32, 64)) %>%
      gf_segment(index + index ~ lower_ci + upper_ci) %>%
      gf_vline(xintercept = pop_mean) %>%
      gf_labs(
        title = "95% Confidence Intervals for the Mean",
        subtitle = "Varying samples sizes 8-16-32-64",
        y = "Sample Size", x = "Mean Ranges"
      ) %>%
      gf_refine(scale_y_discrete(labels = c(8, 16, 32, 64))) %>%
      gf_refine(annotate(geom = "label", x = pop_mean + 1.75, y = 1.5, label = "Population Mean"))

Important: Confidence Intervals Shrink!

    Yes, with sample size! Why is that a good thing?

So finally, we can pretend that our sample is also distributed normally (i.e. Gaussian), use its sample_sd as a substitute for pop_sd, plot the standard errors on the sample histogram, and see where our believed mean lies.

    # Set graph theme
    theme_set(new = theme_custom())
    #
    sample_mean <- mean(~Height, data = sample_16)
    se <- sd(~Height, data = sample_16) / sqrt(16)
    #
    xqnorm(
      p = c(0.025, 0.975),
      mean = sample_mean,
      sd = sd(~Height, data = sample_16),
      return = c("plot"), verbose = F
    ) %>%
      gf_vline(xintercept = ~pop_mean, colour = "black") %>%
      gf_vline(xintercept = mean(~Height, data = sample_16), colour = "purple") %>%
      gf_labs(
        title = "Confidence Intervals and the Bell Curve. N=16",
        subtitle = "Sample is plotted as theoretical Gaussian Bell Curve"
      ) %>%
      gf_refine(
        annotate(geom = "label", x = pop_mean + 15, y = 0.05, label = "Population Mean"),
        annotate(geom = "label", x = sample_mean - 15, y = 0.05, label = "Sample Mean", colour = "purple")
      )

    pop_mean
    se <- sd(~Height, data = sample_16) / sqrt(16)
    mean(~Height, data = sample_16) - 2.0 * se
    mean(~Height, data = sample_16) + 2.0 * se
    [1] 168.3497
    [1] 166.2512
    [1] 173.9613

So if our believed mean is within the confidence intervals, then OK, we can say our belief may be true. If it is way outside the confidence intervals, we have to concede that our belief may be flawed and accept an alternative interpretation.

    CLT Workflow

Thus, if we want to estimate a population mean (a minimal sketch in R follows this list):

• we take one random sample of size n from the population;
• we calculate the sample-mean from the sample;
• we calculate the sample-sd;
• we calculate the Standard Error as $\frac{sample.sd}{\sqrt{n}}$;
• we calculate the 95% confidence interval for the population parameter with $CI_{95\%} = sample.mean \pm 2 \times SE$;
• since the Standard Error decreases with sample size, we make our sample of adequate size (n = 30 seems appropriate in most cases. Why?);
• and we do not have to worry about the distribution of the population. It need not be normal/Gaussian/bell-shaped!!
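
Here is that workflow as one minimal, end-to-end sketch on the NHANES Height data, using a fresh single sample of n = 30:

n <- 30
my_sample <- mosaic::sample(NHANES_adult, size = n)
my_mean <- mean(~Height, data = my_sample) # sample mean
my_sd <- sd(~Height, data = my_sample) # sample sd
my_se <- my_sd / sqrt(n) # Standard Error
c(my_mean - 2 * my_se, my_mean + 2 * my_se) # 95% CI for pop_mean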

    CLT Assumptions

• The sample is of "decent" size, n >= 30;
• the distribution of sample-means can then be taken to be Gaussian, whatever the shape of the sample histogram itself.

    An interactive Sampling app

    Here below is an interactive sampling app. Play with the different settings, especially the distribution in the population to get a firm grasp on Sampling and the CLT.

    https://gallery.shinyapps.io/CLT_mean/

    References

1. Diez, David M, Barr, Christopher D & Çetinkaya-Rundel, Mine. OpenIntro Statistics. Available online: https://www.openintro.org/book/os/

2. Stats Test Wizard. https://www.socscistatistics.com/tests/what_stats_test_wizard.aspx

3. Måns Thulin. Modern Statistics with R: From wrangling and exploring data to inference and predictive modelling. http://www.modernstatisticswithr.com/

4. Jonas Kristoffer Lindeløv. Common statistical tests are linear models (or: how to teach stats). https://lindeloev.github.io/tests-as-linear/

5. CheatSheet: https://lindeloev.github.io/tests-as-linear/linear_tests_cheat_sheet.pdf

6. Steve Doogue. Common statistical tests are linear models: a work-through. https://steverxd.github.io/Stat_tests/

7. Jeffrey Walker. Elements of Statistical Modeling for Experimental Biology. https://www.middleprofessor.com/files/applied-biostatistics_bookdown/_book/

8. Adam Loy, Lendie Follett & Heike Hofmann (2016). Variations of Q-Q Plots: The Power of Our Eyes! The American Statistician, 70:2, 202-214. DOI: 10.1080/00031305.2015.1077728

    R Package Citations
    Package Version Citation
    NHANES 2.1.0 Pruim (2015)
    regressinator 0.2.0 Reinhart (2024)
    smovie 1.1.6 Northrop (2023)
    TeachHist 0.2.1 Lange (2023)
    TeachingDemos 2.13 Snow (2024)
    visualize 4.5.0 Balamuta (2023)
    Balamuta, James. 2023. visualize: Graph Probability Distributions with User Supplied Parameters and Statistics. https://CRAN.R-project.org/package=visualize.
    Lange, Carsten. 2023. TeachHist: A Collection of Amended Histograms Designed for Teaching Statistics. https://CRAN.R-project.org/package=TeachHist.
    Northrop, Paul J. 2023. smovie: Some Movies to Illustrate Concepts in Statistics. https://CRAN.R-project.org/package=smovie.
    Pruim, Randall. 2015. NHANES: Data from the US National Health and Nutrition Examination Study. https://CRAN.R-project.org/package=NHANES.
    Reinhart, Alex. 2024. regressinator: Simulate and Diagnose (Generalized) Linear Models. https://CRAN.R-project.org/package=regressinator.
    Snow, Greg. 2024. TeachingDemos: Demonstrations for Teaching and Learning. https://CRAN.R-project.org/package=TeachingDemos.

    Footnotes

    1. https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Introductory_Statistics_(Shafer_and_Zhang)/06%3A_Sampling_Distributions/6.01%3A_The_Mean_and_Standard_Deviation_of_the_Sample_Mean↩︎

    2. https://mathworld.wolfram.com/StandardDeviationDistribution.html↩︎

    3. The `Height` variable seems to be normally distributed at population level. We will try other non-normal population variables as an exercise in the tutorials.↩︎

    4. Once sample size = population, we have complete access to the population and there is no question of estimation error! So sample_sd = pop_sd!↩︎

    Citation

    BibTeX citation:
@online{2022,
  author = {},
  title = {🎲 {Samples,} {Populations,} {Statistics} and {Inference}},
  date = {2022-11-25},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/20-SampProb/},
  langid = {en},
  abstract = {How much ~~Land~~ Data does a Man need?}
}
    
    For attribution, please cite this work as:
    “🎲 Samples, Populations, Statistics and Inference.” 2022. November 25, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/20-SampProb/.

    License: CC BY-SA 2.0
