Permutation Tests for Linear Regression

Linear Regression

Quantitative Predictor

Quantitative Response

Sum of Squares

Residuals

Permutation

Author: Arvind Venkatadri

Published: May 3, 2023

Modified: May 21, 2024

Abstract: Using a Permutation Test to check our Regression Model

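# Show code in the rendered output; suppress warnings and messages
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
library(tidyverse) # data wrangling and plotting
library(ggformula) # formula interface to ggplot2
library(mosaic)    # do(), shuffle(), inspect()
library(infer)     # specify(), hypothesize(), generate(), calculate()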

Linear Regression using Permutation Tests

We wish to establish the significance of the effect size due to each of the levels in TempFac. The normality tests conducted earlier show that, except at one level of TempFac, the times are not normally distributed. Hence we opt for a Permutation Test to check the significance of the effect.

As remarked in Ernst[^2], the non-parametric permutation test can be both exact and intuitively easier for students to grasp. Permutations are easily executed in R, using packages such as mosaic[^3].

We proceed with a Permutation Test for TempFac. We shuffle the levels (13, 18, 25) randomly among the Time observations, repeat the ANOVA test on each shuffled dataset, and record the F-statistic. The null distribution is the distribution of the F-statistic over the many permutations, and the p-value is the proportion of permutations in which the F-statistic equals or exceeds the observed value.
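The procedure just described can be sketched in base R. This is a minimal illustration, not the analysis itself: the data frame `toy` is simulated, its group means are invented, and the helper `f_stat()` and the choice of 2000 permutations are assumptions made only for this example.

```r
# Simulated stand-in data: three temperature levels, 20 observations each.
# The group means and spread are invented purely for illustration.
set.seed(2023)
toy <- data.frame(
  TempFac = factor(rep(c(13, 18, 25), each = 20)),
  Time    = c(rnorm(20, mean = 26, sd = 2),
              rnorm(20, mean = 21, sd = 2),
              rnorm(20, mean = 18, sd = 2))
)

# Hypothetical helper: F-statistic of the one-way ANOVA for a given labelling
f_stat <- function(response, groups) {
  anova(lm(response ~ groups))[["F value"]][1]
}

# F-statistic for the data as observed
observed_F <- f_stat(toy$Time, toy$TempFac)

# Shuffle the labels, recompute F, and repeat to build the null distribution
n_perm <- 2000
null_F <- replicate(n_perm, f_stat(toy$Time, sample(toy$TempFac)))

# p-value: proportion of permuted F-statistics that equal or exceed the observed one
p_value <- mean(null_F >= observed_F)
p_value
```

Because `sample()` only reorders the existing labels, every permuted dataset keeps the same group sizes and the same set of Time values; only the pairing between them changes.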

Read the Data

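# Load the Boston housing data from the mlbench package
data("BostonHousing2", package = "mlbench")
housing <- BostonHousing2
# Summarise every variable, split into categorical and quantitative (mosaic)
inspect(housing)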


categorical variables:  
  name  class levels   n missing                                  distribution
1 town factor     92 506       0 Cambridge (5.9%) ...                         
2 chas factor      2 506       0 0 (93.1%), 1 (6.9%)                          

quantitative variables:  
      name   class       min          Q1     median          Q3       max
1    tract integer   1.00000 1303.250000 3393.50000 3739.750000 5082.0000
2      lon numeric -71.28950  -71.093225  -71.05290  -71.019625  -70.8100
3      lat numeric  42.03000   42.180775   42.21810   42.252250   42.3810
4     medv numeric   5.00000   17.025000   21.20000   25.000000   50.0000
5    cmedv numeric   5.00000   17.025000   21.20000   25.000000   50.0000
6     crim numeric   0.00632    0.082045    0.25651    3.677083   88.9762
7       zn numeric   0.00000    0.000000    0.00000   12.500000  100.0000
8    indus numeric   0.46000    5.190000    9.69000   18.100000   27.7400
9      nox numeric   0.38500    0.449000    0.53800    0.624000    0.8710
10      rm numeric   3.56100    5.885500    6.20850    6.623500    8.7800
11     age numeric   2.90000   45.025000   77.50000   94.075000  100.0000
12     dis numeric   1.12960    2.100175    3.20745    5.188425   12.1265
13     rad integer   1.00000    4.000000    5.00000   24.000000   24.0000
14     tax integer 187.00000  279.000000  330.00000  666.000000  711.0000
15 ptratio numeric  12.60000   17.400000   19.05000   20.200000   22.0000
16       b numeric   0.32000  375.377500  391.44000  396.225000  396.9000
17   lstat numeric   1.73000    6.950000   11.36000   16.955000   37.9700
           mean           sd   n missing
1  2700.3557312 1.380037e+03 506       0
2   -71.0563887 7.540535e-02 506       0
3    42.2164403 6.177718e-02 506       0
4    22.5328063 9.197104e+00 506       0
5    22.5288538 9.182176e+00 506       0
6     3.6135236 8.601545e+00 506       0
7    11.3636364 2.332245e+01 506       0
8    11.1367787 6.860353e+00 506       0
9     0.5546951 1.158777e-01 506       0
10    6.2846344 7.026171e-01 506       0
11   68.5749012 2.814886e+01 506       0
12    3.7950427 2.105710e+00 506       0
13    9.5494071 8.707259e+00 506       0
14  408.2371542 1.685371e+02 506       0
15   18.4555336 2.164946e+00 506       0
16  356.6740316 9.129486e+01 506       0
17   12.6530632 7.141062e+00 506       0
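With two quantitative variables, the same shuffling idea applies to a regression slope. The sketch below is only an illustration under stated assumptions: the text does not choose a response or a predictor, so `cmedv` and `rm` are picked here purely as examples, and the helper `slope()` and the 2000 permutations are likewise hypothetical choices.

```r
# ASSUMPTION: cmedv (response) and rm (predictor) are illustrative choices,
# not variables selected in the text.
data("BostonHousing2", package = "mlbench")
housing <- BostonHousing2

# Hypothetical helper: slope of the simple linear regression of y on x
slope <- function(y, x) unname(coef(lm(y ~ x))[2])

# Slope for the data as observed
observed_slope <- slope(housing$cmedv, housing$rm)

# Under the null of no association, any pairing of x with y is equally likely,
# so shuffle the predictor and refit the model each time
set.seed(2023)
null_slopes <- replicate(2000, slope(housing$cmedv, sample(housing$rm)))

# Two-sided p-value: permuted slopes at least as extreme as the observed slope
p_value <- mean(abs(null_slopes) >= abs(observed_slope))
p_value
```

Shuffling the predictor rather than the response is an arbitrary choice in the single-predictor case; either breaks the pairing between the two variables.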

We will use mosaic and also try with infer.

Using mosaic

mosaic offers an easy and intuitive way of doing a repeated permutation test, using the do() command. We will shuffle the TempFac factor to jumble up the Time observations, 10000 times. Each time we shuffle, we compute the F-statistic and record it. We then plot the 10000 F-statistics and compare them with the observed F-statistic.

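The null distribution of the F-statistic under permutation never reaches the observed value, which testifies to the strength of the effect of TempFac on hatching Time. The p-value, the proportion of permuted F-statistics that equal or exceed the observed value, is accordingly zero to the resolution of the 10000 permutations.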
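A sketch of how this might look with mosaic follows. The data frame `toy` is simulated with invented group means, so its numbers will not match those reported here, and the run uses 1000 permutations rather than 10000 only to keep it quick. The sketch relies on `do()` summarising each `lm` fit into a row that includes a column named `F`.

```r
library(mosaic)

# Simulated stand-in data (hypothetical values, for illustration only)
set.seed(2023)
toy <- data.frame(
  TempFac = factor(rep(c(13, 18, 25), each = 20)),
  Time    = c(rnorm(20, mean = 26, sd = 2),
              rnorm(20, mean = 21, sd = 2),
              rnorm(20, mean = 18, sd = 2))
)

# Observed F-statistic from the one-way model
observed_F <- anova(lm(Time ~ TempFac, data = toy))[["F value"]][1]

# do() repeats the expression; shuffle() permutes TempFac on each repetition.
# Each lm fit is reduced to one row holding, among other things, its F-statistic.
null_dist_mosaic <- do(1000) * lm(Time ~ shuffle(TempFac), data = toy)

# Histogram of the permuted F-statistics, with the observed value marked
gf_histogram(~ F, data = null_dist_mosaic) |>
  gf_vline(xintercept = observed_F)

# p-value: proportion of permuted F-statistics at or above the observed one
p_value <- mean(null_dist_mosaic$F >= observed_F)
p_value
```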

Using infer

We calculate the observed F-statistic with infer, which also has a very direct, if verbose, syntax for permutation tests.

The observed F-statistic is, of course, $385.8966$, as before. We now use infer to generate a null distribution by permuting the factor TempFac.
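The infer pipeline for both steps could look roughly like the sketch below. As before, `toy` is simulated stand-in data with invented group means, so the F-statistic it produces is not the $385.8966$ quoted above, and 1000 repetitions are used only to keep the run short.

```r
library(infer)

# Simulated stand-in data (hypothetical values, for illustration only)
set.seed(2023)
toy <- data.frame(
  TempFac = factor(rep(c(13, 18, 25), each = 20)),
  Time    = c(rnorm(20, mean = 26, sd = 2),
              rnorm(20, mean = 21, sd = 2),
              rnorm(20, mean = 18, sd = 2))
)

# Observed F-statistic: declare the model, then calculate the statistic
observed_infer <- toy |>
  specify(Time ~ TempFac) |>
  hypothesize(null = "independence") |>
  calculate(stat = "F")

# Null distribution: permute, then recompute F on every permuted replicate
null_dist_infer <- toy |>
  specify(Time ~ TempFac) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "F")

# Plot the null distribution and mark the observed statistic
visualize(null_dist_infer) +
  shade_p_value(obs_stat = observed_infer, direction = "greater")

# p-value: share of permuted F-statistics at or above the observed one
p_value <- get_p_value(null_dist_infer,
                       obs_stat = observed_infer,
                       direction = "greater")
p_value
```

The direction is `"greater"` because only large values of F count as evidence against the null hypothesis of independence.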

The infer-based permutation test likewise shows that the permuted F-statistics come nowhere near the observed value. The effect of TempFac is very strong.