๐Ÿƒ Inferences Test for Two Proportions

Permutation
Monte Carlo Simulation
Random Number Generation
Distributions
Generating Parallel Worlds
Author

Arvind V.

Published

November 10, 2022

Abstract
Inference Test for Two Proportions

Setting up R packages

library(tidyverse)
library(mosaic)
library(ggmosaic) # plotting mosaic plots for Categorical Data

### Dataset from Chihara and Hesterberg's book (Second Edition)
library(resampledata)
library(vcd)

Introduction

We saw from the diagram created by Allen Downey that there is only one test! We will now use this philosophy to develop a technique that allows us to mechanize several Statistical Models in that way, with nearly identical code.

We will use two packages in R, mosaic and the relatively new infer package, to develop our intuition for what are called permutation based statistical tests.

Inference for Proportions Case Study-1: GSS2002 dataset

We can extend this idea to multiple proportions too.

Let us try a dataset with Qualitative / Categorical data. This is the General Social Survey GSS dataset from the resampledata package, and we have people with different levels of Education stating their opinion on the Death Penalty. We want to know if these two Categorical variables have a correlation, i.e. can the opinions in favour of the Death Penalty be explained by the Education level?

Since data is Categorical ( both variables ), we need to take counts in a table, and then implement a chi-square test. In the test, we will permute the Education variable to see if we can see how significant its effect size is.

data(GSS2002, package = "resampledata")
glimpse(GSS2002)
Rows: 2,765
Columns: 21
$ ID            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1โ€ฆ
$ Region        <fct> South Central, South Central, South Central, South Centrโ€ฆ
$ Gender        <fct> Female, Male, Female, Female, Male, Male, Female, Femaleโ€ฆ
$ Race          <fct> White, White, White, White, White, White, White, White, โ€ฆ
$ Education     <fct> HS, Bachelors, HS, Left HS, Left HS, HS, Bachelors, HS, โ€ฆ
$ Marital       <fct> Divorced, Married, Separated, Divorced, Divorced, Divorcโ€ฆ
$ Religion      <fct> Inter-nondenominational, Protestant, Protestant, Protestโ€ฆ
$ Happy         <fct> Pretty happy, Pretty happy, NA, NA, NA, Pretty happy, NAโ€ฆ
$ Income        <fct> 30000-34999, 75000-89999, 35000-39999, 50000-59999, 4000โ€ฆ
$ PolParty      <fct> "Strong Rep", "Not Str Rep", "Strong Rep", "Ind, Near Deโ€ฆ
$ Politics      <fct> Conservative, Conservative, NA, NA, NA, Conservative, NAโ€ฆ
$ Marijuana     <fct> NA, Not legal, NA, NA, NA, NA, NA, NA, Legal, NA, NA, NAโ€ฆ
$ DeathPenalty  <fct> Favor, Favor, NA, NA, NA, Favor, NA, NA, Favor, NA, NA, โ€ฆ
$ OwnGun        <fct> No, Yes, NA, NA, NA, Yes, NA, NA, Yes, NA, NA, NA, NA, Nโ€ฆ
$ GunLaw        <fct> Favor, Oppose, NA, NA, NA, Oppose, NA, NA, Oppose, NA, Nโ€ฆ
$ SpendMilitary <fct> Too little, About right, NA, About right, NA, Too littleโ€ฆ
$ SpendEduc     <fct> Too little, Too little, NA, Too little, NA, Too little, โ€ฆ
$ SpendEnv      <fct> About right, About right, NA, Too little, NA, Too littleโ€ฆ
$ SpendSci      <fct> About right, About right, NA, Too little, NA, Too littleโ€ฆ
$ Pres00        <fct> Bush, Bush, Bush, NA, NA, Bush, Bush, Bush, Bush, NA, NAโ€ฆ
$ Postlife      <fct> Yes, Yes, NA, NA, NA, Yes, NA, NA, Yes, NA, NA, NA, NA, โ€ฆ
inspect(GSS2002)

categorical variables:  
            name  class levels    n missing
1         Region factor      7 2765       0
2         Gender factor      2 2765       0
3           Race factor      3 2765       0
4      Education factor      5 2760       5
5        Marital factor      5 2765       0
6       Religion factor     13 2746      19
7          Happy factor      3 1369    1396
8         Income factor     24 1875     890
9       PolParty factor      8 2729      36
10      Politics factor      7 1331    1434
11     Marijuana factor      2  851    1914
12  DeathPenalty factor      2 1308    1457
13        OwnGun factor      3  924    1841
14        GunLaw factor      2  916    1849
15 SpendMilitary factor      3 1324    1441
16     SpendEduc factor      3 1343    1422
17      SpendEnv factor      3 1322    1443
18      SpendSci factor      3 1266    1499
19        Pres00 factor      5 1749    1016
20      Postlife factor      2 1211    1554
                                    distribution
1  North Central (24.7%) ...                    
2  Female (55.6%), Male (44.4%)                 
3  White (79.1%), Black (14.8%) ...             
4  HS (53.8%), Bachelors (16.1%) ...            
5  Married (45.9%), Never Married (25.6%) ...   
6  Protestant (53.2%), Catholic (24.5%) ...     
7  Pretty happy (57.3%) ...                     
8  40000-49999 (9.1%) ...                       
9  Ind (19.3%), Not Str Dem (18.9%) ...         
10 Moderate (39.2%), Conservative (15.8%) ...   
11 Not legal (64%), Legal (36%)                 
12 Favor (68.7%), Oppose (31.3%)                
13 No (65.5%), Yes (33.5%) ...                  
14 Favor (80.5%), Oppose (19.5%)                
15 About right (46.5%) ...                      
16 Too little (73.9%) ...                       
17 Too little (60%) ...                         
18 About right (49.7%) ...                      
19 Bush (50.6%), Gore (44.7%) ...               
20 Yes (80.5%), No (19.5%)                      

quantitative variables:  
  name   class min  Q1 median   Q3  max mean       sd    n missing
1   ID integer   1 692   1383 2074 2765 1383 798.3311 2765       0
skimr::skim(GSS2002)
Data summary
Name GSS2002
Number of rows 2765
Number of columns 21
_______________________
Column type frequency:
factor 20
numeric 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Region 0 1.00 FALSE 7 Nor: 684, Sou: 486, Sou: 471, Mid: 435
Gender 0 1.00 FALSE 2 Fem: 1537, Mal: 1228
Race 0 1.00 FALSE 3 Whi: 2188, Bla: 410, Oth: 167
Education 5 1.00 FALSE 5 HS: 1485, Bac: 443, Lef: 400, Gra: 230
Marital 0 1.00 FALSE 5 Mar: 1269, Nev: 708, Div: 445, Wid: 247
Religion 19 0.99 FALSE 13 Pro: 1460, Cat: 673, Non: 379, Chr: 65
Happy 1396 0.50 FALSE 3 Pre: 784, Ver: 415, Not: 170
Income 890 0.68 FALSE 24 400: 170, 300: 166, 250: 140, 500: 136
PolParty 36 0.99 FALSE 8 Ind: 528, Not: 515, Not: 449, Str: 408
Politics 1434 0.48 FALSE 7 Mod: 522, Con: 210, Sli: 209, Sli: 159
Marijuana 1914 0.31 FALSE 2 Not: 545, Leg: 306
DeathPenalty 1457 0.47 FALSE 2 Fav: 899, Opp: 409
OwnGun 1841 0.33 FALSE 3 No: 605, Yes: 310, Ref: 9
GunLaw 1849 0.33 FALSE 2 Fav: 737, Opp: 179
SpendMilitary 1441 0.48 FALSE 3 Abo: 615, Too: 414, Too: 295
SpendEduc 1422 0.49 FALSE 3 Too: 992, Abo: 278, Too: 73
SpendEnv 1443 0.48 FALSE 3 Too: 793, Abo: 439, Too: 90
SpendSci 1499 0.46 FALSE 3 Abo: 629, Too: 461, Too: 176
Pres00 1016 0.63 FALSE 5 Bus: 885, Gor: 781, Nad: 57, Oth: 16
Postlife 1554 0.44 FALSE 2 Yes: 975, No: 236

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 1383 798.33 1 692 1383 2074 2765 โ–‡โ–‡โ–‡โ–‡โ–‡

Note how all variables are Categorical !! Education has five levels, and of course DeathPenalty has three:

GSS2002 %>% count(Education)
GSS2002 %>% count(DeathPenalty)

Let us drop NA entries in Education and Death Penalty and set up a Contingency Table.

gss2002 <- GSS2002 %>% 
  dplyr::select(Education, DeathPenalty) %>% 
  tidyr::drop_na(., c(Education, DeathPenalty))
##
gss_table <- mosaic::tally(DeathPenalty ~ Education, data = gss2002)
gss_table %>% 
  addmargins()
            Education
DeathPenalty Left HS   HS Jr Col Bachelors Graduate  Sum
      Favor      117  511     71       135       64  898
      Oppose      72  200     16        71       50  409
      Sum        189  711     87       206      114 1307

Contingency Table Plots

The Contingency Table can be plotted, as we have seen, using a mosaic plot using several packages:

Observed Statistic: the \(X^2\) metric

When there are multiple proportions involved, the \(X^2\) test is what is used.

Hypotheses Definition

What would our Hypotheses be?

\(H_0: \text{Education does not affect votes for Death Penalty}\) \(H_a: \text{Education affects votes for Death Penalty}\)

Permutation Test for Education

We should now repeat the test with permutations on Education:

# Set graph theme
theme_set(new = theme_custom())
#

null_chisq <- do(9999) * 
  chisq.test(tally(DeathPenalty ~ shuffle(Education), 
                   data = gss2002))

head(null_chisq)
gf_histogram( ~ X.squared, data = null_chisq) %>% 
  
  gf_vline(xintercept = observedChi2, 
           color = "red") 

prop1(~ X.squared >= observedChi2, data = null_chisq)
prop_TRUE 
    3e-04 

The p-value is well below our threshold of \(0.05\), so we would conclude that Education has a significant effect on DeathPenalty opinion!

Inference for Proportions Case Study-2: TBD dataset

To be Written Up.

Conclusion

In our basic \(X^2\) test, we calculate the test statistic of \(X^2\) and look up a theoretical null distribution for that statistic, and see how unlikely our observed value is.

Why would a permutation test be a good idea here? With a permutation test, there are no assumptions of the null distribution: this is computed based on real data. We note in passing that, in this case, since the number of cases in each cell of the Contingency Table are fairly high ( >= 5) the resulting NULL distribution is of the \(X^2\) variety.

References

  1. OpenIntro Modern Statistics: Chapter 17

  2. Exploring the underlying theory of the chi-square test through simulation - part 1 https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence/

  3. Exploring the underlying theory of the chi-square test through simulation - part 2 https://www.rdatagen.net/post/a-little-intuition-and-simulation-behind-the-chi-square-test-of-independence-part-2/

  4. An Online \(\Xi^2\)-test calculator. https://www.statology.org/chi-square-test-of-independence-calculator/

R Package Citations
Package Version Citation
ggmosaic 0.3.3 Jeppson, Hofmann, and Cook (2021)
resampledata 0.3.1 Chihara and Hesterberg (2018)
scales 1.3.0 Wickham, Pedersen, and Seidel (2023)
vcd 1.4.12 Meyer, Zeileis, and Hornik (2006); Zeileis, Meyer, and Hornik (2007); Meyer et al. (2023)
Chihara, Laura M., and Tim C. Hesterberg. 2018. Mathematical Statistics with Resampling and r. 2nd ed. Hoboken, NJ: John Wiley & Sons. https://sites.google.com/site/chiharahesterberg/home.
Jeppson, Haley, Heike Hofmann, and Di Cook. 2021. ggmosaic: Mosaic Plots in the โ€œggplot2โ€ Framework. https://CRAN.R-project.org/package=ggmosaic.
Meyer, David, Achim Zeileis, and Kurt Hornik. 2006. โ€œThe Strucplot Framework: Visualizing Multi-Way Contingency Tables with Vcd.โ€ Journal of Statistical Software 17 (3): 1โ€“48. https://doi.org/10.18637/jss.v017.i03.
Meyer, David, Achim Zeileis, Kurt Hornik, and Michael Friendly. 2023. vcd: Visualizing Categorical Data. https://CRAN.R-project.org/package=vcd.
Wickham, Hadley, Thomas Lin Pedersen, and Dana Seidel. 2023. scales: Scale Functions for Visualization. https://CRAN.R-project.org/package=scales.
Zeileis, Achim, David Meyer, and Kurt Hornik. 2007. โ€œResidual-Based Shadings for Visualizing (Conditional) Independence.โ€ Journal of Computational and Graphical Statistics 16 (3): 507โ€“25. https://doi.org/10.1198/106186007X237856.
Back to top

Citation

BibTeX citation:
@online{v.2022,
  author = {V., Arvind},
  title = {๐Ÿƒ {Inferences} {Test} for {Two} {Proportions}},
  date = {2022-11-10},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/190-TwoProp/two-props.html},
  langid = {en},
  abstract = {Inference Test for Two Proportions}
}
For attribution, please cite this work as:
V., Arvind. 2022. โ€œ๐Ÿƒ Inferences Test for Two Proportions.โ€ November 10, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Inference/Modules/190-TwoProp/two-props.html.