Permutation Tests for Linear Regression
Linear Regression using Permutation Tests
We wish to establish the significance of the effect size due to each of the levels in TempFac. From the normality tests conducted earlier we see that, except at one level of TempFac, the times are not normally distributed. Hence we opt for a Permutation Test to check for significance of effect.
As remarked in Ernst[^2], the non-parametric permutation test can be exact and is also intuitively easier for students to grasp. Permutations are easily executed in R, using packages such as mosaic[^3].
We proceed with a Permutation Test for TempFac. We shuffle the levels (13, 18, 25) randomly between the Times, repeat the ANOVA test each time, and calculate the F-statistic. The Null distribution is the distribution of the F-statistic over the many permutations, and the p-value is the proportion of permutations in which the F-statistic equals or exceeds the observed value.
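The shuffling procedure described above can be sketched in base R, without any add-on packages. This is a minimal illustration under stated assumptions, not the tutorial's own code: the data frame `hatch_sim` and its values are simulated stand-ins for the hatching data, and `f_stat()` is a helper name introduced here.

```r
# Simulated stand-in for the hatching data (values are invented for illustration)
set.seed(2024)
hatch_sim <- data.frame(
  TempFac = factor(rep(c(13, 18, 25), each = 20)),
  Time    = c(rnorm(20, mean = 26), rnorm(20, mean = 21), rnorm(20, mean = 17))
)

# Helper (hypothetical name): F-statistic of a one-way ANOVA of response on group
f_stat <- function(response, group) {
  anova(lm(response ~ group))[["F value"]][1]
}

# Observed F-statistic, using the real pairing of levels and times
obs_F <- f_stat(hatch_sim$Time, hatch_sim$TempFac)

# Null distribution: shuffle the level labels, refit, and record F each time
n_perm <- 2000
null_F <- replicate(n_perm, f_stat(hatch_sim$Time, sample(hatch_sim$TempFac)))

# p-value: proportion of permuted F-statistics that equal or exceed the observed one
p_value <- mean(null_F >= obs_F)
```

Because `sample()` only reorders the labels, every permuted data set keeps the same group sizes and the same set of times; only the pairing between them changes.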
Read the Data
categorical variables:
name class levels n missing distribution
1 town factor 92 506 0 Cambridge (5.9%) ...
2 chas factor 2 506 0 0 (93.1%), 1 (6.9%)
quantitative variables:
name class min Q1 median Q3 max
1 tract integer 1.00000 1303.250000 3393.50000 3739.750000 5082.0000
2 lon numeric -71.28950 -71.093225 -71.05290 -71.019625 -70.8100
3 lat numeric 42.03000 42.180775 42.21810 42.252250 42.3810
4 medv numeric 5.00000 17.025000 21.20000 25.000000 50.0000
5 cmedv numeric 5.00000 17.025000 21.20000 25.000000 50.0000
6 crim numeric 0.00632 0.082045 0.25651 3.677083 88.9762
7 zn numeric 0.00000 0.000000 0.00000 12.500000 100.0000
8 indus numeric 0.46000 5.190000 9.69000 18.100000 27.7400
9 nox numeric 0.38500 0.449000 0.53800 0.624000 0.8710
10 rm numeric 3.56100 5.885500 6.20850 6.623500 8.7800
11 age numeric 2.90000 45.025000 77.50000 94.075000 100.0000
12 dis numeric 1.12960 2.100175 3.20745 5.188425 12.1265
13 rad integer 1.00000 4.000000 5.00000 24.000000 24.0000
14 tax integer 187.00000 279.000000 330.00000 666.000000 711.0000
15 ptratio numeric 12.60000 17.400000 19.05000 20.200000 22.0000
16 b numeric 0.32000 375.377500 391.44000 396.225000 396.9000
17 lstat numeric 1.73000 6.950000 11.36000 16.955000 37.9700
mean sd n missing
1 2700.3557312 1.380037e+03 506 0
2 -71.0563887 7.540535e-02 506 0
3 42.2164403 6.177718e-02 506 0
4 22.5328063 9.197104e+00 506 0
5 22.5288538 9.182176e+00 506 0
6 3.6135236 8.601545e+00 506 0
7 11.3636364 2.332245e+01 506 0
8 11.1367787 6.860353e+00 506 0
9 0.5546951 1.158777e-01 506 0
10 6.2846344 7.026171e-01 506 0
11 68.5749012 2.814886e+01 506 0
12 3.7950427 2.105710e+00 506 0
13 9.5494071 8.707259e+00 506 0
14 408.2371542 1.685371e+02 506 0
15 18.4555336 2.164946e+00 506 0
16 356.6740316 9.129486e+01 506 0
17 12.6530632 7.141062e+00 506 0
We will use mosaic and also try with infer.
mosaic offers an easy and intuitive way of doing a repeated permutation test, using the do() command. We will shuffle the TempFac factor to jumble up the Time observations, 10000 times. Each time we shuffle, we compute the F-statistic and record it. We then plot the 10000 F-statistics and compare them with the real-world observed F-statistic.
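A sketch of how this might look with mosaic follows. It assumes the mosaic package is installed and uses a simulated data frame, `hatch_sim`, as a hypothetical stand-in for the hatching data, so its numbers will not match the tutorial's. It also runs 1000 shuffles rather than 10000 to keep the example quick.

```r
library(mosaic)

# Simulated stand-in for the hatching data (values are invented for illustration)
set.seed(2024)
hatch_sim <- data.frame(
  TempFac = factor(rep(c(13, 18, 25), each = 20)),
  Time    = c(rnorm(20, mean = 26), rnorm(20, mean = 21), rnorm(20, mean = 17))
)

# Observed F-statistic from the unshuffled data
obs_F <- summary(lm(Time ~ TempFac, data = hatch_sim))$fstatistic[["value"]]

# do(n) * repeats the model fit n times; shuffle() permutes TempFac on each run.
# The result is a data frame with one row per fit, including a column named F.
null_dist <- do(1000) * lm(Time ~ shuffle(TempFac), data = hatch_sim)

# Plot the permuted F-statistics and mark the observed value
hist(null_dist$F,
     xlim = c(0, max(null_dist$F, obs_F)),
     main = "Permutation null distribution",
     xlab = "F-statistic")
abline(v = obs_F, col = "red")
```

Setting the x-axis limits explicitly keeps the observed value visible even when it lies far beyond every permuted statistic.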
The Null distribution of the F-statistic under permutation never reaches the real-world observed value, which testifies to the strength of the effect of TempFac on hatching Time. And the p-value is:
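In symbols, the permutation p-value can be written as below. The notation is introduced here for illustration: \(B\) is the number of permutations, \(F^{*}_{b}\) is the F-statistic from the \(b\)-th shuffle, and \(F_{\text{obs}}\) is the observed F-statistic.

\[
p = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\left[ F^{*}_{b} \ge F_{\text{obs}} \right]
\]

When no permuted statistic reaches the observed value, this proportion is exactly 0. It is then safer to report \(p < 1/B\), that is, below 0.0001 for 10000 permutations, than a literal zero. Some authors instead add 1 to both the count and \(B\) so that the estimate can never be zero.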
We calculate the observed F-statistic with infer, which also has a very direct, if verbose, syntax for doing permutation tests:
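A minimal sketch of the observed-statistic step with infer might look like this. It assumes the infer package and R 4.1 or later (for the native pipe `|>`). The data frame `hatch_sim` is a simulated, hypothetical stand-in, so the value it produces will differ from the one reported in the text.

```r
library(infer)

# Simulated stand-in for the hatching data (values are invented for illustration)
set.seed(2024)
hatch_sim <- data.frame(
  TempFac = factor(rep(c(13, 18, 25), each = 20)),
  Time    = c(rnorm(20, mean = 26), rnorm(20, mean = 21), rnorm(20, mean = 17))
)

# specify() names the response and explanatory variables,
# hypothesize() declares the null of independence,
# calculate() returns the F-statistic as a one-row tibble with a column named stat
obs_F <- hatch_sim |>
  specify(Time ~ TempFac) |>
  hypothesize(null = "independence") |>
  calculate(stat = "F")
```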
We see that the observed F-statistic is of course \(385.8966\) as before. Now we use infer to generate a Null distribution using permutation of the factor TempFac:
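The permutation step can be sketched with infer's `generate()` verb. As before, this is an illustration under stated assumptions rather than the tutorial's own code: `hatch_sim` is simulated, hypothetical data, the names `hatch_spec`, `null_plot`, and `p_val` are introduced here, and 1000 repetitions are used to keep it quick.

```r
library(infer)

# Simulated stand-in for the hatching data (values are invented for illustration)
set.seed(2024)
hatch_sim <- data.frame(
  TempFac = factor(rep(c(13, 18, 25), each = 20)),
  Time    = c(rnorm(20, mean = 26), rnorm(20, mean = 21), rnorm(20, mean = 17))
)

# Shared specification: response, explanatory factor, and null hypothesis
hatch_spec <- hatch_sim |>
  specify(Time ~ TempFac) |>
  hypothesize(null = "independence")

# Observed F-statistic from the unshuffled data
obs_F <- calculate(hatch_spec, stat = "F")

# Null distribution: permute, then compute F for each permuted replicate
null_dist <- hatch_spec |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "F")

# Histogram of the null distribution with the observed value marked
null_plot <- visualize(null_dist) +
  shade_p_value(obs_stat = obs_F, direction = "greater")

# Proportion of permuted F-statistics at or above the observed one
p_val <- get_p_value(null_dist, obs_stat = obs_F, direction = "greater")
```

With an effect this strong, `get_p_value()` is likely to return 0 and to warn against reporting a p-value of exactly zero. The warning reflects the limited number of permutations and is not an error.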
As seen, the infer-based permutation test also shows that the permutation-generated F-statistics are nowhere near the observed value. The effect of TempFac is very strong.