Tutorial: Permutation Testing for One Proportion
Introduction
We will use the datasets that are part of the resampledata package.[1]
Case Study-1: Verizon
Do Verizon's repair times differ between ILEC customers (Verizon's own) and CLEC customers (those of competing carriers)?
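categorical variables:
name | class | levels | n | missing | distribution |
---|---|---|---|---|---|
Group | factor | 2 | 1687 | 0 | ILEC (98.6%), CLEC (1.4%) |

quantitative variables:
name | class | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|---|
Time | numeric | 0 | 0.75 | 3.63 | 7.35 | 191.6 | 8.522009 | 14.78848 | 1687 | 0 |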
Describe the Variables!
Hypothesis Specification
Write the Null and Alternate hypotheses here.
Null Distribution Computation
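As a language-neutral illustration, here is a minimal Python sketch of how a null distribution for the difference in mean repair times could be built by permutation. The `times` and `groups` values are hypothetical toy numbers, not the Verizon data, and the helper name `permutation_null`, the number of shuffles, and the one-sided direction are all assumptions made for the example.

```python
import random
from statistics import mean

def permutation_null(values, labels, group_a, group_b, n_perm=5000, seed=42):
    """Return the observed difference in group means (a - b) and the
    differences obtained after repeatedly shuffling the group labels."""
    def diff_in_means(lbls):
        a = [v for v, g in zip(values, lbls) if g == group_a]
        b = [v for v, g in zip(values, lbls) if g == group_b]
        return mean(a) - mean(b)

    rng = random.Random(seed)
    observed = diff_in_means(labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between group and repair time
        null.append(diff_in_means(shuffled))
    return observed, null

# Hypothetical toy data (hours), NOT the Verizon data set
times = [1.2, 0.5, 3.6, 7.4, 2.2, 0.9, 5.1, 4.0, 15.3, 22.8, 9.7, 18.4]
groups = ["ILEC"] * 8 + ["CLEC"] * 4

observed, null = permutation_null(times, groups, "ILEC", "CLEC")

# Assumed one-sided alternative: ILEC repairs are faster than CLEC repairs.
# The +1 terms count the observed arrangement as one of the permutations.
p_value = (sum(d <= observed for d in null) + 1) / (len(null) + 1)

print(f"observed ILEC - CLEC difference in means: {observed:.4f}")
print(f"one-sided permutation p-value: {p_value:.4f}")
```

Shuffling the labels while keeping the group sizes fixed mimics a world in which group membership has no bearing on repair time, which is what the null hypothesis asserts.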
Verizon Conclusion
Case Study-2: Recidivism
Do criminals released after a jail term commit crimes again? Does recidivism depend upon age?
categorical variables:
name | class | levels | n | missing | distribution |
---|---|---|---|---|---|
Gender | factor | 2 | 17019 | 3 | M (87.7%), F (12.3%) |
Age | factor | 5 | 17019 | 3 | 25-34 (36.6%), 35-44 (23.7%) ... |
Age25 | factor | 2 | 17019 | 3 | Over 25 (81.9%), Under 25 (18.1%) |
Race | factor | 10 | 16988 | 34 | White-NonHispanic (67%) ... |
Offense | factor | 2 | 17022 | 0 | Felony (80.6%), Misdemeanor (19.4%) |
Recid | factor | 2 | 17022 | 0 | No (68.4%), Yes (31.6%) |
Type | factor | 3 | 17022 | 0 | No Recidivism (68.4%), New (20.2%) ... |

quantitative variables:
name | class | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|---|
Days | integer | 0 | 241 | 418 | 687 | 1095 | 473.3275 | 283.1393 | 5386 | 11636 |
Describe the variables!
Hypothesis Specification
Let us see whether the incidence of recidivism depends on whether a person is under or over 25 years of age. Write the Null and Alternative hypotheses here.
Recidivism
Gender <fct> | Age <fct> | Age25 <fct> | Race <fct> | Offense <fct> | Recid <fct> | Type <fct> | Days <int> |
---|---|---|---|---|---|---|---|
M | Under 25 | Under 25 | White-NonHispanic | Felony | Yes | Tech | 16 |
M | 55 and Older | Over 25 | White-NonHispanic | Felony | Yes | Tech | 19 |
M | 25-34 | Over 25 | White-NonHispanic | Felony | Yes | Tech | 22 |
M | 55 and Older | Over 25 | White-NonHispanic | Felony | Yes | Tech | 25 |
M | 25-34 | Over 25 | Black-NonHispanic | Felony | Yes | Tech | 26 |
M | Under 25 | Under 25 | White-NonHispanic | Felony | Yes | Tech | 27 |
M | 45-54 | Over 25 | White-NonHispanic | Misdemeanor | Yes | New | 28 |
M | 45-54 | Over 25 | White-NonHispanic | Felony | Yes | Tech | 41 |
M | 45-54 | Over 25 | White-NonHispanic | Misdemeanor | Yes | Tech | 44 |
M | 45-54 | Over 25 | White-NonHispanic | Felony | Yes | Tech | 46 |
Also, the variable Recid is a factor variable coded “Yes” or “No”. We ought to convert it to a numeric variable of 1’s and 0’s. Why?
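One way to explore this question is to try the conversion on a handful of values. The short Python sketch below uses a hypothetical list of Yes/No answers (not the Recidivism data) and shows that once the values are coded as 1 and 0, their mean is simply the proportion of "Yes" cases, so a difference in means between two groups becomes a difference in proportions.

```python
# Hypothetical Yes/No values, NOT taken from the Recidivism data set
recid = ["Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "No"]

# Code "Yes" as 1 and "No" as 0
recid_numeric = [1 if r == "Yes" else 0 for r in recid]

# The mean of a 0/1 variable equals the proportion of 1s
proportion_yes = sum(recid_numeric) / len(recid_numeric)

print(recid_numeric)
print(proportion_yes)
```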
Null Distribution for Recidivism
Recidivism Conclusion
Case Study-3: Flight Delays
LaGuardia Airport (LGA) is one of three major airports that serve the New York City metropolitan area. In 2008, over 23 million passengers and over 375,000 planes flew in or out of LGA. United Airlines and American Airlines are two major airlines that schedule services at LGA. The data set FlightDelays contains information on all 4029 departures of these two airlines from LGA during May and June 2009.
categorical variables:
name | class | levels | n | missing | distribution |
---|---|---|---|---|---|
Carrier | factor | 2 | 4029 | 0 | AA (72.1%), UA (27.9%) |
Destination | factor | 7 | 4029 | 0 | ORD (44.3%), DFW (22.8%), MIA (15.1%) ... |
DepartTime | factor | 5 | 4029 | 0 | 8-Noon (26.1%), Noon-4pm (26%) ... |
Day | factor | 7 | 4029 | 0 | Fri (15.8%), Mon (15.6%), Tue (15.6%) ... |
Month | factor | 2 | 4029 | 0 | June (50.4%), May (49.6%) |
Delayed30 | factor | 2 | 4029 | 0 | No (85.2%), Yes (14.8%) |

quantitative variables:
name | class | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|---|
ID | integer | 1 | 1008 | 2015 | 3022 | 4029 | 2015.0000 | 1163.21645 | 4029 | 0 |
FlightNo | integer | 71 | 371 | 691 | 787 | 2255 | 827.1035 | 551.30939 | 4029 | 0 |
FlightLength | integer | 68 | 155 | 163 | 228 | 295 | 185.3011 | 41.78783 | 4029 | 0 |
Delay | integer | -19 | -6 | -3 | 5 | 693 | 11.7379 | 41.63050 | 4029 | 0 |
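The variables in the FlightDelays dataset are:
Carrier: the airline, AA or UA
Destination: the destination airport (for example ORD, DFW, MIA)
DepartTime: the scheduled departure time, grouped into intervals (for example 8-Noon, Noon-4pm)
Day: the day of the week
Month: May or June
Delayed30: whether the flight was delayed by more than 30 minutes (Yes or No)
ID: a row identifier
FlightNo: the flight number
FlightLength: the length of the flight, in minutes
Delay: the departure delay in minutes (negative values mean the flight left early)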
Hypothesis Specification
Let us compute the proportion of each carrier’s flights that were delayed by more than 20 minutes. We will conduct a two-sided test to see if the difference in these proportions is statistically significant.
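The procedure just described can be sketched in Python as follows. This is only an illustration under stated assumptions: the delay values are hypothetical toy numbers rather than the FlightDelays data, and the helper `prop_diff`, the seed, and the number of shuffles are choices made for the example. The sketch flags delays above 20 minutes, computes the observed difference in proportions, and compares it with the differences obtained after shuffling the carrier labels.

```python
import random

def prop_diff(flags, carriers, a, b):
    """Proportion of flagged flights for carrier a minus that for carrier b."""
    in_a = [f for f, c in zip(flags, carriers) if c == a]
    in_b = [f for f, c in zip(flags, carriers) if c == b]
    return sum(in_a) / len(in_a) - sum(in_b) / len(in_b)

# Hypothetical delays in minutes, NOT the FlightDelays data set
aa_delays = [-5, -3, 0, 4, 25, -6, 2, 31, -2, 7, -4, 1]
ua_delays = [45, 22, -1, 60, 3, 28, 35, -3]

carriers = ["AA"] * len(aa_delays) + ["UA"] * len(ua_delays)
# 1 if the flight was delayed by more than 20 minutes, else 0
flags = [1 if d > 20 else 0 for d in aa_delays + ua_delays]

observed = prop_diff(flags, carriers, "UA", "AA")

rng = random.Random(2009)
shuffled = list(carriers)
null = []
for _ in range(5000):
    rng.shuffle(shuffled)  # under the null, carrier labels are exchangeable
    null.append(prop_diff(flags, shuffled, "UA", "AA"))

# Two-sided test: count shuffles at least as extreme in either direction.
# The small tolerance guards against floating-point ties.
extreme = sum(abs(d) >= abs(observed) - 1e-12 for d in null)
p_value = (extreme + 1) / (len(null) + 1)

print(f"observed UA - AA difference in proportions: {observed:.4f}")
print(f"two-sided permutation p-value: {p_value:.4f}")
```

Taking absolute values is what makes the test two-sided: a shuffle counts as extreme whether it favours one carrier or the other.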
Null Distribution for FlightDelays
The resulting p-value is very small. Hence we reject the null hypothesis that there is no difference between the carriers in the proportion of flights delayed by more than 20 minutes.
Footnotes
[1] https://github.com/rudeboybert/resampledata