Tutorial: Multiple Linear Regression with Backward Selection

Linear Regression
Quantitative Predictor
Quantitative Response
Sum of Squares
Residuals
Author

Arvind V.

Published

May 3, 2023

Modified

May 21, 2024

Abstract
Using Multiple Regression to predict Quantitative Target Variables

Setting up R Packages

options(scipen = 1, digits = 3) # set numbers to print with three significant digits
knitr::opts_chunk$set(echo = TRUE,warning = FALSE,message = FALSE)
library(tidyverse)
library(ggformula)
library(mosaic)
library(infer)
# Let us set a plot theme for Data visualisation

my_theme <- function(){  # Creating a function
  theme_classic() +  # Using pre-defined theme as base
  theme(plot.title = element_text(face = "bold", size = 14),
        axis.text.x = element_text(size = 10, face = "bold"),  
        # Customizing axes text      
        axis.text.y = element_text(size = 10, face = "bold"),
        axis.title = element_text(size = 12, face = "bold"),  
        # Customizing axis title
        panel.grid = element_blank(),  # Taking off the default grid
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
        legend.text = element_text(size = 8, face = "italic"),  
        # Customizing legend text
        legend.title = element_text(size = 10, face = "bold"),  
        # Customizing legend title
        legend.position = "right",  # Customizing legend position
        plot.caption = element_text(size = 8))  # Customizing plot caption
}   

In this tutorial, we will use the Boston housing dataset. Our research question is:

Research Question

How do we predict the price of a house in Boston, based on quantitative parameters such as area, location, number of rooms, and the crime rate in the neighbourhood?

And how do we choose the “best” model, based on a tradeoff between Model Complexity and Model Accuracy?

Workflow: Read the Data

data("BostonHousing2", package = "mlbench")
housing <- BostonHousing2
inspect(housing)

categorical variables:  
  name  class levels   n missing                                  distribution
1 town factor     92 506       0 Cambridge (5.9%) ...                         
2 chas factor      2 506       0 0 (93.1%), 1 (6.9%)                          

quantitative variables:  
      name   class       min       Q1   median       Q3      max     mean
1    tract integer   1.00000 1303.250 3393.500 3739.750 5082.000 2700.356
2      lon numeric -71.28950  -71.093  -71.053  -71.020  -70.810  -71.056
3      lat numeric  42.03000   42.181   42.218   42.252   42.381   42.216
4     medv numeric   5.00000   17.025   21.200   25.000   50.000   22.533
5    cmedv numeric   5.00000   17.025   21.200   25.000   50.000   22.529
6     crim numeric   0.00632    0.082    0.257    3.677   88.976    3.614
7       zn numeric   0.00000    0.000    0.000   12.500  100.000   11.364
8    indus numeric   0.46000    5.190    9.690   18.100   27.740   11.137
9      nox numeric   0.38500    0.449    0.538    0.624    0.871    0.555
10      rm numeric   3.56100    5.886    6.208    6.623    8.780    6.285
11     age numeric   2.90000   45.025   77.500   94.075  100.000   68.575
12     dis numeric   1.12960    2.100    3.207    5.188   12.127    3.795
13     rad integer   1.00000    4.000    5.000   24.000   24.000    9.549
14     tax integer 187.00000  279.000  330.000  666.000  711.000  408.237
15 ptratio numeric  12.60000   17.400   19.050   20.200   22.000   18.456
16       b numeric   0.32000  375.377  391.440  396.225  396.900  356.674
17   lstat numeric   1.73000    6.950   11.360   16.955   37.970   12.653
          sd   n missing
1  1380.0368 506       0
2     0.0754 506       0
3     0.0618 506       0
4     9.1971 506       0
5     9.1822 506       0
6     8.6015 506       0
7    23.3225 506       0
8     6.8604 506       0
9     0.1159 506       0
10    0.7026 506       0
11   28.1489 506       0
12    2.1057 506       0
13    8.7073 506       0
14  168.5371 506       0
15    2.1649 506       0
16   91.2949 506       0
17    7.1411 506       0

The original data are 506 observations on 14 variables, medv being the target variable:

crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq.ft
indus proportion of non-retail business acres per town
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox nitric oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per USD 10,000
ptratio pupil-teacher ratio by town
b \(1000(B - 0.63)^2\) where B is the proportion of Blacks by town
lstat percentage of lower status of the population
medv median value of owner-occupied homes in USD 1000's

The corrected data set has the following additional columns:

cmedv corrected median value of owner-occupied homes in USD 1000's
town name of town
tract census tract
lon longitude of census tract
lat latitude of census tract

Our response variable is cmedv, the corrected median value of owner-occupied homes in USD 1000's. There are many quantitative feature variables that we can use to predict cmedv, and there are two qualitative features, chas and town.
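
Before modelling, it helps to glance at the distribution of the target variable. Here is a minimal sketch (the bin count is an illustrative choice):

# Sketch: distribution of the target variable cmedv
gf_histogram(~ cmedv, data = housing, bins = 30) %>%
  gf_labs(title = "Distribution of Corrected Median House Price",
          x = "cmedv (USD 1000's)", y = "Count") %>%
  gf_theme(my_theme())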

Workflow: Correlations

We can use purrr to evaluate all pair-wise correlations in one shot:

all_corrs <- housing %>% 
  select(where(is.numeric)) %>% 
  
  # leave off cmedv/medv to get all the remaining ones
  select(- cmedv, -medv) %>%  
  
  # perform a cor.test for all variables against cmedv
  purrr::map(.x = .,
             .f = \(x) cor.test(x, housing$cmedv)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor") 

all_corrs
all_corrs %>%
  gf_hline(yintercept = 0,
           color = "grey",
           linewidth = 2) %>%
  gf_errorbar(
    conf.high + conf.low ~ reorder(predictor, estimate),
    colour = ~ estimate,
    width = 0.5,
    linewidth = ~ -log10(p.value),
    caption = "Significance = -log10(p.value)"
  ) %>%
  gf_point(estimate ~ reorder(predictor, estimate)) %>%
  gf_labs(x = "Predictors", y = "Correlation with Median House Price") %>%
  gf_theme(my_theme()) %>%
  gf_theme(theme(axis.text.x = element_text(angle = 45, hjust = 1))) %>%
  gf_refine(
    scale_colour_distiller("Correlation", type = "div", palette = "RdBu"),
    scale_linewidth_continuous("Significance", range = c(0.25, 3)),
    guides(linewidth = guide_legend(reverse = TRUE)))

The variables rm and lstat seem to have the highest correlations with cmedv, and these correlations are also statistically significant.
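
As a quick numerical check (a sketch, reusing the all_corrs data frame computed above), we can sort the predictors by absolute correlation:

# Sketch: predictors sorted by strength of correlation with cmedv
all_corrs %>%
  arrange(desc(abs(estimate))) %>%
  select(predictor, estimate, conf.low, conf.high, p.value) %>%
  head()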

Workflow: Maximal Multiple Regression Model

We will create a regression model for cmedv using all the other numerical predictor features in the dataset.

housing_numeric <- housing %>%
  select(where(is.numeric),
         # remove medv, the uncorrected version of cmedv
         -c(medv))
names(housing_numeric) # 16 variables, one target, 15 predictors
 [1] "tract"   "lon"     "lat"     "cmedv"   "crim"    "zn"      "indus"  
 [8] "nox"     "rm"      "age"     "dis"     "rad"     "tax"     "ptratio"
[15] "b"       "lstat"  
housing_maximal <- lm(cmedv ~ ., data = housing_numeric)
summary(housing_maximal)

Call:
lm(formula = cmedv ~ ., data = housing_numeric)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.934  -2.752  -0.619   1.711  26.120 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.45e+02   3.23e+02   -1.07  0.28734    
tract       -7.52e-04   4.46e-04   -1.69  0.09231 .  
lon         -6.79e+00   3.44e+00   -1.98  0.04870 *  
lat         -2.35e+00   5.36e+00   -0.44  0.66074    
crim        -1.09e-01   3.28e-02   -3.32  0.00097 ***
zn           4.40e-02   1.39e-02    3.17  0.00164 ** 
indus        2.75e-02   6.20e-02    0.44  0.65692    
nox         -1.55e+01   4.03e+00   -3.85  0.00014 ***
rm           3.81e+00   4.20e-01    9.07  < 2e-16 ***
age          5.82e-03   1.34e-02    0.43  0.66416    
dis         -1.38e+00   2.10e-01   -6.59  1.1e-10 ***
rad          2.36e-01   8.47e-02    2.78  0.00558 ** 
tax         -1.48e-02   3.74e-03   -3.96  8.5e-05 ***
ptratio     -9.49e-01   1.41e-01   -6.73  4.7e-11 ***
b            9.55e-03   2.67e-03    3.57  0.00039 ***
lstat       -5.46e-01   5.06e-02  -10.80  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.73 on 490 degrees of freedom
Multiple R-squared:  0.743, Adjusted R-squared:  0.735 
F-statistic: 94.3 on 15 and 490 DF,  p-value: <2e-16

The maximal model has an R.squared of \(0.7426\), which is much better than the one we obtained for a simple model based on rm alone. How much can we simplify this maximal model without losing too much R.squared?
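
For reference, here is a minimal sketch of that simple rm-only model, so the two R.squared values can be compared directly:

# Sketch: a one-predictor model using rm alone, for comparison
housing_simple <- lm(cmedv ~ rm, data = housing_numeric)
summary(housing_simple)$r.squared # substantially lower than the maximal 0.743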

Workflow: Model Reduction

We now proceed naively, removing one predictor after another. We will resort to what may amount to p-hacking: sorting the predictors by their p-value¹ in the maximal model and removing them in decreasing order of p-value.

We will also use some powerful features from the purrr package (also part of the tidyverse), to create all these models all at once. Then we will be able to plot their R.squared values together and decide where we wish to trade off Explainability vs Complexity for our model.

# No of Quant predictor variables in the dataset
n_vars <- housing %>%
  select(where(is.numeric), -c(cmedv, medv)) %>%
  length()

# Maximal Model, now tidied
housing_maximal_tidy <- 
  housing_maximal %>% 
  broom::tidy() %>% 
  
# Obviously remove "Intercept" ;-D
  filter(term != "(Intercept)") %>% 
  
# And horrors! Sort variables by p.value
  arrange(p.value)

housing_maximal_tidy

The last five variables (lon, tract, indus, lat, and age) have the largest p-values: all but lon are statistically insignificant at the 5% level, and lon is only marginally significant (p ≈ 0.049).
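
We can back this up with a nested-model F-test, a sketch that compares the maximal model against one that drops those five weakest predictors:

# Sketch: keep only the ten predictors with the smallest p-values
housing_reduced <- lm(
  cmedv ~ lstat + rm + ptratio + dis + tax + nox + b + crim + zn + rad,
  data = housing_numeric
)
# A large p-value here would say the dropped five add little jointly
anova(housing_reduced, housing_maximal)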

And now we unleash the purrr package to create all the simplified models at once. We will construct a dataset containing three columns:

  • A list of all quantitative predictor variables
  • A sequence of numbers from 1 to N(predictor)
  • A “list” column containing the housing data frame itself

We will use the iteration capability of purrr to sequentially drop one variable at a time from the maximal (15-predictor) model, build a new reduced model each time, and compute the r.squared:

housing_model_set <- tibble(all_vars = 
                            list(housing_maximal_tidy$term), # p-hacked order!!
                            keep_vars = seq(1, n_vars),
                            data = list(housing_numeric))
housing_model_set
# Unleash purrr in a series of mutates
housing_model_set <- housing_model_set %>%
  
# list of predictor variables for each model
  mutate(mod_vars =
           pmap(
             .l = list(all_vars, keep_vars, data),
             .f = \(all_vars, keep_vars, data) all_vars[1:keep_vars]
           )) %>%
  
# build formulae with these for linear regression
  mutate(formula = 
           map(.x = mod_vars,
               .f = \(mod_vars) as.formula(paste(
                         "cmedv ~", paste(mod_vars, collapse = "+")
                       )))) %>%
  
# use the formulae to build multiple linear models
  mutate(models =
           pmap(
             .l = list(data, formula),
             .f = \(data, formula) lm(formula, data = data)
           ))
# Check everything after the operation
housing_model_set
# Tidy up the models using broom to expose their metrics
housing_models_tidy <- housing_model_set %>% 
  mutate(tidy_models =
           map(
             .x = models,
             .f = \(x) broom::glance(x)
           )) %>% 

  # Remove unwanted columns, keep model and predictor count
  select(keep_vars, tidy_models) %>%
  unnest(tidy_models)

housing_models_tidy %>%
  gf_line(
    r.squared ~ keep_vars,
    ylab = "R.Squared",
    xlab = "No. params in the Linear Model",
    title = "Model Explainability vs Complexity",
    subtitle = "Model r.squared vs No. of Predictors",
    data = .
  ) %>%
  
  # Plot r.squared vs predictor count
  gf_point(r.squared ~ keep_vars,
           size = 3.5, color = "grey") %>%
  
  # Show off the selected best model
  gf_point(
    r.squared ~ keep_vars,
    size = 3.5,
    color = "red",
    data = housing_models_tidy %>% filter(keep_vars == 4)
  ) %>%
  
  gf_hline(yintercept = 0.7, linetype = "dashed") %>%
  gf_theme(my_theme())
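
Before committing to a model size, we can read off the numbers behind this plot (a quick sketch using housing_models_tidy):

# Sketch: R.squared for the 4-predictor model vs the maximal model
housing_models_tidy %>%
  filter(keep_vars %in% c(4, n_vars)) %>%
  select(keep_vars, r.squared, adj.r.squared)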

At the loss of some 5% in the r.squared, we can drop our model complexity from 15 predictors to just 4! Our final model will then be:

housing_model_final <- 
  housing_model_set %>% 
  
  # filter for best model, with 4 variables
  filter(keep_vars == 4) %>% 
  
  # tidy up the model
  mutate(tidy_models =
           map(
             .x = models,
             .f = \(x) broom::tidy(x,
                                     conf.int = TRUE,
                                     conf.level = 0.95)
           )) %>% 
  
  # Remove unwanted columns, keep model and predictor count
  select(keep_vars, models, tidy_models) %>%
  unnest(tidy_models)

housing_model_final
housing_model_final %>%  
  # And plot the model
  # Remove the intercept term
  filter(term != "(Intercept)") %>% 
  gf_col(estimate ~ term, fill = ~ term, width = 0.25) %>% 
  gf_hline(yintercept = 0) %>% 
  gf_errorbar(conf.low + conf.high ~ term, 
              width = 0.1, 
              title = "Multiple Regression",
              subtitle = "Model Estimates with Confidence Intervals") %>% 
  gf_theme(my_theme())

Our current best model can be stated as:

\[ \widehat{cmedv} = 24.459 - 0.563 \times dis - 0.673 \times lstat - 0.957 \times ptratio + 4.199 \times rm \]
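
One natural use of this model is prediction on new data. A minimal sketch, with a purely hypothetical house (the predictor values below are invented for illustration):

# Sketch: predict the price of a hypothetical house with the final model
best_model <- housing_model_final %>% purrr::pluck("models", 1)
new_house <- tibble(dis = 4, lstat = 10, ptratio = 18, rm = 6.5)
predict(best_model, newdata = new_house, interval = "prediction")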

Workflow: Diagnostics

Let us use broom::augment to calculate residuals and predictions to arrive at a quick set of diagnostic plots.

housing_model_final_augment <- 
  housing_model_set %>% 
  filter(keep_vars == 4) %>% 
  
# augment the model
  mutate(augment_models =
           map(
             .x = models,
             .f = \(x) broom::augment(x)
           )) %>% 
  unnest(augment_models) %>% 
  select(cmedv:last_col())

housing_model_final_augment
housing_model_final_augment %>% 
  gf_point(.resid ~ .fitted, title = "Residuals vs Fitted") %>%
  gf_smooth() %>% 
  gf_theme(my_theme())
housing_model_final_augment %>% 
  gf_qq(~ .std.resid, title = "Q-Q Residuals") %>% 
  gf_qqline() %>%
  gf_theme(my_theme())
housing_model_final_augment %>% 
  gf_point(sqrt(abs(.std.resid)) ~ .fitted, 
           title = "Scale-Location Plot") %>%
    gf_smooth() %>% 
  gf_theme(my_theme())
housing_model_final_augment %>% 
  gf_point(.std.resid ~ .hat, 
           title = "Residuals vs Leverage") %>%
    gf_smooth() %>% 
  gf_theme(my_theme())

The residuals plot shows a curved trend, and certainly does not resemble the stars at night, so it is possible that we have left out some richness in our model-making, a “systemic inadequacy”.

The Q-Q plot of residuals also shows a J-shape which indicates a non-normal distribution of residuals.
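
As a numerical complement to the Q-Q plot, here is a sketch of a Shapiro-Wilk normality test on the residuals (a small p-value supports the non-normality reading):

# Sketch: Shapiro-Wilk test of residual normality
housing_model_final_augment %>%
  pull(.resid) %>%
  shapiro.test()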

These could indicate that a more complex model (e.g., a linear model with interactions between variables, i.e., product terms) may be necessary.
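
Here is a minimal sketch of one such richer model, adding an rm-lstat interaction (the choice of interaction term is illustrative, not a recommendation from the diagnostics alone):

# Sketch: a model with one interaction (product) term
housing_interact <- lm(cmedv ~ dis + ptratio + rm * lstat,
                       data = housing_numeric)
summary(housing_interact)$r.squared # compare with the 4-predictor main-effects model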

Conclusion

We have used a multiple-regression workflow that takes all predictor variables into account in a linear model, and then systematically simplified that model such that the performance was just adequate.

The models we chose were all linear, of course, but without interaction terms: each predictor was used only for its main effect. When the diagnostic plots were examined, we did see some shortcomings in the model. These could be overcome with a more complex model, which might include selected interactions, transformations of the target (\(cmedv^2\) or \(\sqrt{cmedv}\)), and transformations of some selected predictors.

References

James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning. Chapter 3. Linear Regression. https://hastie.su.domains/ISLR2/Labs/Rmarkdown_Notebooks/Ch3-linreg-lab.html

R Package Citations
Package Version Citation
corrgram 1.14 Wright (2021)
corrplot 0.92 Wei and Simko (2021)
GGally 2.2.1 Schloerke et al. (2024)
gt 0.10.1 Iannone et al. (2024)
infer 1.0.7 Couch et al. (2021)
ISLR 1.4 James et al. (2021)
janitor 2.2.0 Firke (2023)
reghelper 1.1.2 Hughes and Beiner (2023)
Couch, Simon P., Andrew P. Bray, Chester Ismay, Evgeni Chasnovski, Benjamin S. Baumer, and Mine Çetinkaya-Rundel. 2021. “infer: An R Package for Tidyverse-Friendly Statistical Inference.” Journal of Open Source Software 6 (65): 3661. https://doi.org/10.21105/joss.03661.
Firke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Hughes, Jeffrey, and David Beiner. 2023. reghelper: Helper Functions for Regression Analysis. https://CRAN.R-project.org/package=reghelper.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, and JooYoung Seo. 2024. gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in r. https://CRAN.R-project.org/package=ISLR.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to “ggplot2”. https://CRAN.R-project.org/package=GGally.
Wei, Taiyun, and Viliam Simko. 2021. R Package “corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wright, Kevin. 2021. corrgram: Plot a Correlogram. https://CRAN.R-project.org/package=corrgram.

Footnotes

  1. James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning. Chapter 3. Linear Regression. https://hastie.su.domains/ISLR2/Labs/Rmarkdown_Notebooks/Ch3-linreg-lab.html

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {Tutorial: {Multiple} {Linear} {Regression} with {Backward}
    {Selection}},
  date = {2023-05-03},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/files/backward-selection-1.html},
  langid = {en},
  abstract = {Using Multiple Regression to predict Quantitative Target
    Variables}
}
For attribution, please cite this work as:
V., Arvind. 2023. “Tutorial: Multiple Linear Regression with Backward Selection.” May 3, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/files/backward-selection-1.html.