
    Tutorial: Multiple Linear Regression with Backward Selection

    Tags: Linear Regression, Quantitative Predictor, Quantitative Response, Sum of Squares, Residuals

    Author: Arvind V.
    Published: May 3, 2023
    Modified: May 21, 2024

    Abstract: Using Multiple Regression to predict Quantitative Target Variables

    Setting up R Packages

    options(scipen = 1, digits = 3) # show three significant digits, avoid scientific notation
    knitr::opts_chunk$set(echo = TRUE,warning = FALSE,message = FALSE)
    library(tidyverse)
    library(ggformula)
    library(mosaic)
    library(infer)
    # Let us set a plot theme for Data visualisation
    
    my_theme <- function(){  # Creating a function
      theme_classic() +  # Using pre-defined theme as base
      theme(plot.title = element_text(face = "bold", size = 14),
            axis.text.x = element_text(size = 10, face = "bold"),  
            # Customizing axes text      
            axis.text.y = element_text(size = 10, face = "bold"),
            axis.title = element_text(size = 12, face = "bold"),  
            # Customizing axis title
            panel.grid = element_blank(),  # Taking off the default grid
            plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), units = "cm"),
            legend.text = element_text(size = 8, face = "italic"),  
            # Customizing legend text
            legend.title = element_text(size = 10, face = "bold"),  
            # Customizing legend title
            legend.position = "right",  # Customizing legend position
            plot.caption = element_text(size = 8))  # Customizing plot caption
    }   

    In this tutorial, we will use the Boston housing dataset. Our research question is:

    Note: Research Question

    How do we predict the price of a house in Boston, based on other Quantitative parameters such as area, location, rooms, and crime rate in the neighbourhood?

    And how do we choose the “best” model, based on a tradeoff between Model Complexity and Model Accuracy?

    Workflow: Read the Data

    data("BostonHousing2", package = "mlbench")
    housing <- BostonHousing2
    inspect(housing)
    
    categorical variables:  
      name  class levels   n missing                                  distribution
    1 town factor     92 506       0 Cambridge (5.9%) ...                         
    2 chas factor      2 506       0 0 (93.1%), 1 (6.9%)                          
    
    quantitative variables:  
          name   class       min       Q1   median       Q3      max     mean
    1    tract integer   1.00000 1303.250 3393.500 3739.750 5082.000 2700.356
    2      lon numeric -71.28950  -71.093  -71.053  -71.020  -70.810  -71.056
    3      lat numeric  42.03000   42.181   42.218   42.252   42.381   42.216
    4     medv numeric   5.00000   17.025   21.200   25.000   50.000   22.533
    5    cmedv numeric   5.00000   17.025   21.200   25.000   50.000   22.529
    6     crim numeric   0.00632    0.082    0.257    3.677   88.976    3.614
    7       zn numeric   0.00000    0.000    0.000   12.500  100.000   11.364
    8    indus numeric   0.46000    5.190    9.690   18.100   27.740   11.137
    9      nox numeric   0.38500    0.449    0.538    0.624    0.871    0.555
    10      rm numeric   3.56100    5.886    6.208    6.623    8.780    6.285
    11     age numeric   2.90000   45.025   77.500   94.075  100.000   68.575
    12     dis numeric   1.12960    2.100    3.207    5.188   12.127    3.795
    13     rad integer   1.00000    4.000    5.000   24.000   24.000    9.549
    14     tax integer 187.00000  279.000  330.000  666.000  711.000  408.237
    15 ptratio numeric  12.60000   17.400   19.050   20.200   22.000   18.456
    16       b numeric   0.32000  375.377  391.440  396.225  396.900  356.674
    17   lstat numeric   1.73000    6.950   11.360   16.955   37.970   12.653
              sd   n missing
    1  1380.0368 506       0
    2     0.0754 506       0
    3     0.0618 506       0
    4     9.1971 506       0
    5     9.1822 506       0
    6     8.6015 506       0
    7    23.3225 506       0
    8     6.8604 506       0
    9     0.1159 506       0
    10    0.7026 506       0
    11   28.1489 506       0
    12    2.1057 506       0
    13    8.7073 506       0
    14  168.5371 506       0
    15    2.1649 506       0
    16   91.2949 506       0
    17    7.1411 506       0

    The original data are 506 observations on 14 variables, medv being the target variable:

    crim per capita crime rate by town
    zn proportion of residential land zoned for lots over 25,000 sq.ft
    indus proportion of non-retail business acres per town
    chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    nox nitric oxides concentration (parts per 10 million)
    rm average number of rooms per dwelling
    age proportion of owner-occupied units built prior to 1940
    dis weighted distances to five Boston employment centres
    rad index of accessibility to radial highways
    tax full-value property-tax rate per USD 10,000
    ptratio pupil-teacher ratio by town
    b 1000(B − 0.63)² where B is the proportion of Blacks by town
    lstat percentage of lower status of the population
    medv median value of owner-occupied homes in USD 1000’s

    The corrected data set has the following additional columns:

    cmedv corrected median value of owner-occupied homes in USD 1000’s
    town name of town
    tract census tract
    lon longitude of census tract
    lat latitude of census tract

    Our response variable is cmedv, the corrected median value of owner-occupied homes in USD 1000s. There are many Quantitative feature variables that we can use to predict cmedv. And there are two Qualitative features, chas and town.

    Workflow: Correlations

    We can use purrr to evaluate all pair-wise correlations in one shot:

    all_corrs <- housing %>% 
      select(where(is.numeric)) %>% 
      
      # leave off cmedv/medv to get all the remaining ones
      select(- cmedv, -medv) %>%  
      
      # perform a cor.test for all variables against cmedv
      purrr::map(.x = .,
                 .f = \(x) cor.test(x, housing$cmedv)) %>%
      
      # tidy up the cor.test outputs into a tidy data frame
      map_dfr(broom::tidy, .id = "predictor") 
    
    all_corrs
    predictor  estimate  statistic   p.value  parameter  conf.low  conf.high
    tract       0.42825     10.639  5.51e-24        504    0.3543     0.4969
    lon        -0.32295     -7.661  9.55e-14        504   -0.3989    -0.2426
    lat         0.00683      0.153  8.78e-01        504   -0.0804     0.0939
    crim       -0.38958     -9.496  8.71e-20        504   -0.4611    -0.3130
    zn          0.36039      8.673  5.79e-17        504    0.2821     0.4339
    indus      -0.48475    -12.442  3.52e-31        504   -0.5487    -0.4151
    nox        -0.42930    -10.671  4.17e-24        504   -0.4978    -0.3554
    rm          0.69630     21.779  1.31e-74        504    0.6485     0.7386
    age        -0.37800     -9.166  1.24e-18        504   -0.4503    -0.3007
    dis         0.24931      5.780  1.31e-08        504    0.1657     0.3293
    (1-10 of 15 rows shown; method = Pearson's product-moment correlation for all rows)
    all_corrs %>%
      gf_hline(yintercept = 0,
               color = "grey",
               linewidth = 2) %>%
      gf_errorbar(
        conf.high + conf.low ~ reorder(predictor, estimate),
        colour = ~ estimate,
        width = 0.5,
        linewidth = ~ -log10(p.value),
        caption = "Significance = -log10(p.value)"
      ) %>%
      gf_point(estimate ~ reorder(predictor, estimate)) %>%
      gf_labs(x = "Predictors", y = "Correlation with Median House Price") %>%
      gf_theme(my_theme()) %>%
      gf_theme(theme(axis.text.x = element_text(angle = 45, hjust = 1))) %>%
      gf_refine(
        scale_colour_distiller("Correlation", type = "div", palette = "RdBu"),
        scale_linewidth_continuous("Significance", range = c(0.25, 3)),
        guides(linewidth = guide_legend(reverse = TRUE)))

    The variables rm and lstat seem to have high correlations with cmedv, which are also statistically significant.
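    We can also pull the strongest significant correlates straight out of the all_corrs data frame computed above; a small sketch:

    # Rank significant predictors by absolute correlation with cmedv
    all_corrs %>%
      filter(p.value < 0.05) %>%
      arrange(desc(abs(estimate))) %>%
      select(predictor, estimate, p.value) %>%
      head(5)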

    Workflow: Maximal Multiple Regression Model

    We will create a regression model for cmedv using all the other numerical predictor features in the dataset.

    housing_numeric <- housing %>% 
      select(where(is.numeric),
             # remove medv, the uncorrected version of cmedv
             -c(medv))
    names(housing_numeric) # 16 variables: one target, 15 predictors
     [1] "tract"   "lon"     "lat"     "cmedv"   "crim"    "zn"      "indus"  
     [8] "nox"     "rm"      "age"     "dis"     "rad"     "tax"     "ptratio"
    [15] "b"       "lstat"  
    housing_maximal <- lm(cmedv ~ ., data = housing_numeric)
    summary(housing_maximal)
    
    Call:
    lm(formula = cmedv ~ ., data = housing_numeric)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -13.934  -2.752  -0.619   1.711  26.120 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -3.45e+02   3.23e+02   -1.07  0.28734    
    tract       -7.52e-04   4.46e-04   -1.69  0.09231 .  
    lon         -6.79e+00   3.44e+00   -1.98  0.04870 *  
    lat         -2.35e+00   5.36e+00   -0.44  0.66074    
    crim        -1.09e-01   3.28e-02   -3.32  0.00097 ***
    zn           4.40e-02   1.39e-02    3.17  0.00164 ** 
    indus        2.75e-02   6.20e-02    0.44  0.65692    
    nox         -1.55e+01   4.03e+00   -3.85  0.00014 ***
    rm           3.81e+00   4.20e-01    9.07  < 2e-16 ***
    age          5.82e-03   1.34e-02    0.43  0.66416    
    dis         -1.38e+00   2.10e-01   -6.59  1.1e-10 ***
    rad          2.36e-01   8.47e-02    2.78  0.00558 ** 
    tax         -1.48e-02   3.74e-03   -3.96  8.5e-05 ***
    ptratio     -9.49e-01   1.41e-01   -6.73  4.7e-11 ***
    b            9.55e-03   2.67e-03    3.57  0.00039 ***
    lstat       -5.46e-01   5.06e-02  -10.80  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 4.73 on 490 degrees of freedom
    Multiple R-squared:  0.743, Adjusted R-squared:  0.735 
    F-statistic: 94.3 on 15 and 490 DF,  p-value: <2e-16

    The maximal model has an R.squared of 0.7426, which is much better than the one we obtained for a simple model based on rm alone. How much can we simplify this maximal model without losing too much R.squared?
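    As a quick sanity check on that claim, the R.squared of a single-predictor model using rm alone should be close to the square of rm's correlation with cmedv found earlier (0.696² ≈ 0.485):

    # Single-predictor baseline model, for comparison with the maximal model
    housing_simple <- lm(cmedv ~ rm, data = housing_numeric)
    summary(housing_simple)$r.squared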

    Workflow: Model Reduction

    We now proceed naively, removing one predictor after another. We will resort to what may amount to p-hacking: sorting the predictors by their p-value¹ in the maximal model and removing them in decreasing order of p-value.

    We will also use some powerful features from the purrr package (also part of the tidyverse), to create all these models all at once. Then we will be able to plot their R.squared values together and decide where we wish to trade off Explainability vs Complexity for our model.

    # No of Quant predictor variables in the dataset
    n_vars <- housing %>%
      select(where(is.numeric), -c(cmedv, medv)) %>%
      length()
    
    # Maximal Model, now tidied
    housing_maximal_tidy <- 
      housing_maximal %>% 
      broom::tidy() %>% 
      
    # Obviously remove "Intercept" ;-D
      filter(term != "(Intercept)") %>% 
      
    # And horrors! Sort variables by p.value
      arrange(p.value)
    
    housing_maximal_tidy
    term       estimate  std.error  statistic   p.value
    lstat      -0.546440   0.050619    -10.795  1.62e-24
    rm          3.809475   0.419779      9.075  2.77e-18
    ptratio    -0.948941   0.140991     -6.730  4.74e-11
    dis        -1.384370   0.210041     -6.591  1.13e-10
    tax        -0.014807   0.003737     -3.962  8.54e-05
    nox       -15.506379   4.032786     -3.845  1.36e-04
    b           0.009548   0.002675      3.569  3.94e-04
    crim       -0.108861   0.032789     -3.320  9.67e-04
    zn          0.043968   0.013883      3.167  1.64e-03
    rad         0.235674   0.084665      2.784  5.58e-03
    (1-10 of 15 rows shown)

    The variables at the bottom of this ranking (not shown above) are statistically insignificant, or at best marginally significant.

    And now we unleash the purrr package to create all the simplified models at once. We will construct a dataset containing three columns:

    • A list of all quantitative predictor variables
    • A sequence of numbers from 1 to the number of predictors (n_vars)
    • A “list” column containing the housing data frame itself

    We will use the iteration capability of purrr to sequentially drop one variable at a time from the maximal (15-predictor) model, build a new reduced model each time, and compute its r.squared:

    housing_model_set <- tibble(all_vars = 
                                list(housing_maximal_tidy$term), # p-hacked order!!
                                keep_vars = seq(1, n_vars),
                                data = list(housing_numeric))
    housing_model_set
    # Unleash purrr in a series of mutates
    housing_model_set <- housing_model_set %>%
      
    # list of predictor variables for each model
      mutate(mod_vars =
               pmap(
                 .l = list(all_vars, keep_vars, data),
                 .f = \(all_vars, keep_vars, data) all_vars[1:keep_vars]
               )) %>%
      
    # build formulae with these for linear regression
      mutate(formula = 
               map(.x = mod_vars,
                   .f = \(mod_vars) as.formula(paste(
                             "cmedv ~", paste(mod_vars, collapse = "+")
                           )))) %>%
      
    # use the formulae to build multiple linear models
      mutate(models =
               pmap(
                 .l = list(data, formula),
                 .f = \(data, formula) lm(formula, data = data)
               ))
    # Check everything after the operation
    housing_model_set
    # Tidy up the models using broom to expose their metrics
    housing_models_tidy <- housing_model_set %>% 
      mutate(tidy_models =
               map(
                 .x = models,
                 .f = \(x) broom::glance(x) # glance() needs no conf.int arguments
               )) %>% 
    
      # Remove unwanted columns, keep model and predictor count
      select(keep_vars, tidy_models) %>%
      unnest(tidy_models)
    
    housing_models_tidy %>%
      gf_line(
        r.squared ~ keep_vars,
        ylab = "R.Squared",
        xlab = "No. params in the Linear Model",
        title = "Model Explainability vs Complexity",
        subtitle = "Model r.squared vs No. of Predictors",
        data = .
      ) %>%
      
      # Plot r.squared vs predictor count
      gf_point(r.squared ~ keep_vars,
               size = 3.5, color = "grey") %>%
      
      # Show off the selected best model
      gf_point(
        r.squared ~ keep_vars,
        size = 3.5,
        color = "red",
        data = housing_models_tidy %>% filter(keep_vars == 4)
      ) %>%
      
      gf_hline(yintercept = 0.7, linetype = "dashed") %>%
      gf_theme(my_theme())
    (The two tables below are the printed housing_model_set: first as constructed, with three columns, and then after the mutate chain, which adds the list-columns mod_vars, formula, and models.)

    all_vars     keep_vars  data
    <chr [15]>           1  <df [506 × 16]>
    <chr [15]>           2  <df [506 × 16]>
    ...        ...          ...
    <chr [15]>          10  <df [506 × 16]>
    (1-10 of 15 rows)

    all_vars     keep_vars  data              mod_vars
    <chr [15]>           1  <df [506 × 16]>   <chr [1]>
    <chr [15]>           2  <df [506 × 16]>   <chr [2]>
    ...        ...          ...               ...
    <chr [15]>          10  <df [506 × 16]>   <chr [10]>
    (1-10 of 15 rows | 1-4 of 6 columns)

    At the loss of some 5% in r.squared, we can drop our model complexity from 15 predictors to just 4! Our final model is then:

    housing_model_final <- 
      housing_model_set %>% 
      
      # filter for best model, with 4 variables
      filter(keep_vars == 4) %>% 
      
      # tidy up the model
      mutate(tidy_models =
               map(
                 .x = models,
                 .f = \(x) broom::tidy(x,
                                         conf.int = TRUE,
                                        conf.level = 0.95)
               )) %>% 
      
      # Remove unwanted columns, keep model and predictor count
      select(keep_vars, models, tidy_models) %>%
      unnest(tidy_models)
    
    housing_model_final
    keep_vars  models    term         estimate  std.error  statistic   p.value  conf.low  conf.high
    4          <S3: lm>  (Intercept)    24.459     4.0510       6.04  3.04e-09    16.500     32.418
    4          <S3: lm>  lstat          -0.673     0.0464     -14.49  5.36e-40    -0.764     -0.582
    4          <S3: lm>  rm              4.199     0.4210       9.97  1.71e-21     3.372      5.026
    4          <S3: lm>  ptratio        -0.957     0.1153      -8.30  9.41e-16    -1.184     -0.731
    4          <S3: lm>  dis            -0.564     0.1261      -4.47  9.74e-06    -0.811     -0.316
    (5 rows)
    housing_model_final %>%  
      # And plot the model
      # Remove the intercept term
      filter(term != "(Intercept)") %>% 
      gf_col(estimate ~ term, fill = ~ term, width = 0.25) %>% 
      gf_hline(yintercept = 0) %>% 
      gf_errorbar(conf.low + conf.high ~ term, 
                  width = 0.1, 
                  title = "Multiple Regression",
                  subtitle = "Model Estimates with Confidence Intervals") %>% 
      gf_theme(my_theme())

    Our current best model can be stated as:

    $$\widehat{cmedv} \sim 24.459 - 0.563 \times dis - 0.673 \times lstat - 0.957 \times ptratio + 4.199 \times rm$$
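    Since our research question was about predicting house prices, we can also feed new data to this model. A minimal sketch, where new_house is a hypothetical tract whose values are invented purely for illustration:

    # Pull the fitted 4-predictor lm object out of the model set
    best_lm <- housing_model_set %>%
      filter(keep_vars == 4) %>%
      pull(models) %>%
      purrr::pluck(1)

    # A hypothetical house: 10% lower-status population, 6 rooms,
    # pupil-teacher ratio 18, distance index 4 (made-up values)
    new_house <- tibble(lstat = 10, rm = 6, ptratio = 18, dis = 4)
    predict(best_lm, newdata = new_house, interval = "prediction")

    Plugging these values into the equation above gives a predicted median value of roughly 23.4, i.e. about USD 23,400; interval = "prediction" wraps an uncertainty band around that point estimate.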

    Workflow: Diagnostics

    Let us use broom::augment to calculate residuals and predictions to arrive at a quick set of diagnostic plots.

    housing_model_final_augment <- 
      housing_model_set %>% 
      filter(keep_vars == 4) %>% 
      
    # augment the model
      mutate(augment_models =
               map(
                 .x = models,
                 .f = \(x) broom::augment(x)
               )) %>% 
      unnest(augment_models) %>% 
      select(cmedv:last_col())
    
    housing_model_final_augment
    cmedv  lstat    rm  ptratio   dis  .fitted   .resid    .hat  .sigma  .cooksd
     24.0   4.98  6.58     15.3  4.09   31.767  -7.7666 0.00809    5.10 3.81e-03
     21.6   9.14  6.42     17.8  4.97   25.433  -3.8328 0.00276    5.11 3.13e-04
     34.7   4.03  7.18     17.8  4.97   32.080   2.6198 0.00596    5.11 3.18e-04
     33.4   2.94  7.00     18.7  6.06   30.550   2.8501 0.00740    5.11 4.68e-04
     36.2   5.33  7.15     18.7  6.06   29.567   6.6330 0.00747    5.10 2.56e-03
     28.7   5.21  6.43     18.7  6.06   26.637   2.0630 0.00587    5.11 1.94e-04
     22.9  12.43  6.01     15.2  5.56   23.655  -0.7554 0.00947    5.11 4.23e-05
     22.1  19.15  6.17     15.2  5.95   19.585   2.5153 0.01679    5.11 8.44e-04
     16.5  29.93  5.63     15.2  6.08    9.984   6.5165 0.04008    5.10 1.42e-02
     18.9  17.10  6.00     15.2  6.59   19.897  -0.9974 0.01599    5.11 1.26e-04
    (1-10 of 506 rows; .std.resid column not shown)
    housing_model_final_augment %>% 
      gf_point(.resid ~ .fitted, title = "Residuals vs Fitted") %>%
      gf_smooth() %>% 
      gf_theme(my_theme())
    housing_model_final_augment %>% 
      gf_qq(~ .std.resid, title = "Q-Q Residuals") %>% 
      gf_qqline() %>%
      gf_theme(my_theme())
    housing_model_final_augment %>% 
      # Scale-Location uses the square root of |standardized residuals|
      gf_point(sqrt(abs(.std.resid)) ~ .fitted, 
               title = "Scale-Location Plot") %>%
      gf_smooth() %>% 
      gf_theme(my_theme())
    housing_model_final_augment %>% 
      gf_point(.std.resid ~ .hat, 
               title = "Residuals vs Leverage") %>%
      gf_smooth() %>% 
      gf_theme(my_theme())

    The residuals-vs-fitted plot shows a curved trend, and certainly does not resemble the stars at night, so it is possible that we have left out some richness in our model-making: a "systematic inadequacy".

    The Q-Q plot of the residuals also shows a J-shape, which indicates a non-normal distribution of residuals.

    These could indicate that a more complex model (e.g., a linear model with interactions between variables, i.e., product terms) may be necessary.
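    As a sketch of that idea (purely illustrative, not a tuned model; choosing lstat and rm for the interaction is an assumption based on their strong main effects), we can add one product term and compare fit metrics:

    # Hypothetical enrichment: one interaction between lstat and rm
    housing_interact <- lm(cmedv ~ lstat * rm + ptratio + dis,
                           data = housing_numeric)
    # Compare against the 4-predictor main-effects model
    broom::glance(housing_interact) %>%
      select(r.squared, adj.r.squared, AIC)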

    Conclusion

    We have used a multiple-regression workflow that takes all predictor variables into account in a linear model, and then systematically simplified that model such that the performance was just adequate.

    The models we chose were all linear, of course, but without interaction terms: each predictor was used only for its main effect. When we examined the diagnostic plots, we did see some shortcomings in the model. These could be overcome with a more complex model, which might include selected interactions, transformations of the target (e.g., cmedv² or sqrt(cmedv)), and transformations of some selected predictors.
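    As an aside: our p-value-ordered pruning was deliberately manual and pedagogical. Base R's step() automates backward selection, using AIC rather than p-values as the criterion. A minimal sketch, assuming the housing_maximal model from above:

    # Automated backward selection by AIC, starting from the maximal model
    housing_step <- step(housing_maximal, direction = "backward", trace = 0)
    broom::glance(housing_step)$r.squared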

    References

    James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning. Chapter 3. Linear Regression. https://hastie.su.domains/ISLR2/Labs/Rmarkdown_Notebooks/Ch3-linreg-lab.html

    R Package Citations
    Package Version Citation
    corrgram 1.14 Wright (2021)
    corrplot 0.92 Wei and Simko (2021)
    GGally 2.2.1 Schloerke et al. (2024)
    gt 0.10.1 Iannone et al. (2024)
    infer 1.0.7 Couch et al. (2021)
    ISLR 1.4 James et al. (2021)
    janitor 2.2.0 Firke (2023)
    reghelper 1.1.2 Hughes and Beiner (2023)
    Couch, Simon P., Andrew P. Bray, Chester Ismay, Evgeni Chasnovski, Benjamin S. Baumer, and Mine Çetinkaya-Rundel. 2021. “infer: An R Package for Tidyverse-Friendly Statistical Inference.” Journal of Open Source Software 6 (65): 3661. https://doi.org/10.21105/joss.03661.
    Firke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
    Hughes, Jeffrey, and David Beiner. 2023. reghelper: Helper Functions for Regression Analysis. https://CRAN.R-project.org/package=reghelper.
    Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, and JooYoung Seo. 2024. gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.
    James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in r. https://CRAN.R-project.org/package=ISLR.
    Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to “ggplot2”. https://CRAN.R-project.org/package=GGally.
    Wei, Taiyun, and Viliam Simko. 2021. R Package “corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
    Wright, Kevin. 2021. corrgram: Plot a Correlogram. https://CRAN.R-project.org/package=corrgram.

    Footnotes

    1. James, Witten, Hastie, Tibshirani, An Introduction to Statistical Learning. Chapter 3: Linear Regression. https://hastie.su.domains/ISLR2/Labs/Rmarkdown_Notebooks/Ch3-linreg-lab.html

    Citation

    BibTeX citation:
    @online{v.2023,
      author = {V., Arvind},
      title = {Tutorial: {Multiple} {Linear} {Regression} with {Backward}
        {Selection}},
      date = {2023-05-03},
      url = {https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/files/backward-selection-1.html},
      langid = {en},
      abstract = {Using Multiple Regression to predict Quantitative Target
        Variables}
    }
    
    For attribution, please cite this work as:
    V., Arvind. 2023. “Tutorial: Multiple Linear Regression with Backward Selection.” May 3, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/files/backward-selection-1.html.

    License: CC BY-SA 2.0
