

Tutorial: Multiple Linear Regression with Forward Selection

Tags: Linear Regression, Quantitative Predictor, Quantitative Response, Sum of Squares, Residuals

Author: Arvind V.

Published: May 13, 2023

Modified: May 21, 2024

Abstract: Using Multiple Regression to predict Quantitative Target Variables

Setting up R Packages

options(scipen = 1, digits = 3) # set to three decimal places
knitr::opts_chunk$set(echo = TRUE,warning = FALSE,message = FALSE) 
library(tidyverse)
library(ggformula)
library(mosaic)
library(GGally)
library(corrgram)
library(corrplot)
library(broom)

# datasets
library(ISLR)
# Let us set a plot theme for Data visualization

theme_set(theme_light(base_size = 11, base_family = "Roboto Condensed"))

theme_update(
  panel.grid.minor = element_blank(),
  plot.title = element_text(face = "bold"),
  plot.title.position = "plot"
)

In this tutorial, we will use the Hitters dataset from the ISLR package. Our research question is:

Note: Research Question

How do we predict the Salary of baseball players based on other Quantitative parameters such as Hits, HmRun, and AtBat?

And how do we choose the “best” model, based on a trade-off between Model Complexity and Model Accuracy?

Workflow Plan

Our target variable is Salary.

We will start with an examination of correlations between Salary and other Quant predictors.

We will use a null model for our Linear Regression at first, keeping just an intercept term. Based on an examination of the r-squared improvement offered by each predictor individually, we will add the best quantitative predictor to the model. We will repeat this process, stopping when the gains in model accuracy no longer justify the additional model complexity.

Note

This approach is the exact opposite of the earlier tutorial on multiple linear regression, where we started with a maximal model and trimmed it down based on an assessment of r.squared.

Workflow: EDA

The Hitters dataset has the following variables:

data("Hitters")
inspect(Hitters)

categorical variables:  
       name  class levels   n missing
1    League factor      2 322       0
2  Division factor      2 322       0
3 NewLeague factor      2 322       0
                                   distribution
1 A (54.3%), N (45.7%)                         
2 W (51.2%), E (48.8%)                         
3 A (54.7%), N (45.3%)                         

quantitative variables:  
      name   class  min    Q1 median     Q3   max    mean      sd   n missing
1    AtBat integer 16.0 255.2  379.5  512.0   687  380.93  153.40 322       0
2     Hits integer  1.0  64.0   96.0  137.0   238  101.02   46.45 322       0
3    HmRun integer  0.0   4.0    8.0   16.0    40   10.77    8.71 322       0
4     Runs integer  0.0  30.2   48.0   69.0   130   50.91   26.02 322       0
5      RBI integer  0.0  28.0   44.0   64.8   121   48.03   26.17 322       0
6    Walks integer  0.0  22.0   35.0   53.0   105   38.74   21.64 322       0
7    Years integer  1.0   4.0    6.0   11.0    24    7.44    4.93 322       0
8   CAtBat integer 19.0 816.8 1928.0 3924.2 14053 2648.68 2324.21 322       0
9    CHits integer  4.0 209.0  508.0 1059.2  4256  717.57  654.47 322       0
10  CHmRun integer  0.0  14.0   37.5   90.0   548   69.49   86.27 322       0
11   CRuns integer  1.0 100.2  247.0  526.2  2165  358.80  334.11 322       0
12    CRBI integer  0.0  88.8  220.5  426.2  1659  330.12  333.22 322       0
13  CWalks integer  0.0  67.2  170.5  339.2  1566  260.24  267.06 322       0
14 PutOuts integer  0.0 109.2  212.0  325.0  1378  288.94  280.70 322       0
15 Assists integer  0.0   7.0   39.5  166.0   492  106.91  136.85 322       0
16  Errors integer  0.0   3.0    6.0   11.0    32    8.04    6.37 322       0
17  Salary numeric 67.5 190.0  425.0  750.0  2460  535.93  451.12 263      59

Scatter Plots and Correlations

We should examine scatter plots and correlations of Salary against these variables. Let us select a few sets of Quantitative and Qualitative features, along with the target variable Salary, and make pairs-plots with them:

Hitters %>% 
  select(Salary, AtBat, Hits, HmRun) %>% 
  GGally::ggpairs(title = "Plot 1", lower = list(continuous = wrap("smooth", alpha = 0.2)))
Hitters %>% 
  select(Salary, Runs, RBI, Walks,Years) %>% 
  GGally::ggpairs(title = "Plot 2", lower = list(continuous = wrap("smooth", alpha = 0.2)))
Hitters %>% 
  select(Salary, CRBI, CAtBat, CHits, CHmRun, CRuns, CWalks) %>% 
  GGally::ggpairs(title = "Plot 3", lower = list(continuous = wrap("smooth", alpha = 0.2)))
Hitters %>% 
  select(Salary, PutOuts,Assists,Errors) %>% 
  GGally::ggpairs(title = "Plot 4", lower = list(continuous = wrap("smooth", alpha = 0.2)))
Hitters %>% 
  select(Salary, League,Division,NewLeague) %>% 
  GGally::ggpairs(title = "Plot 5", lower = list(continuous = wrap("smooth", alpha = 0.2)))

AtBat and Hits seem relevant predictors for Salary; so are Runs, RBI, Walks, and Years. From Plot 2, both RBI and Walks are also inter-correlated with Runs. All the C* variables are well correlated with Salary, and also among one another (Plot 3). Plot 4 shows no significant correlations at all. Plot 5 shows Salary nearly equally distributed across League, Division, and NewLeague.
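Since the corrplot package is already loaded, we can also get a single overview of all pairwise correlations. Here is a quick sketch (our addition, not part of the original workflow; the styling arguments are just one reasonable choice):

# All pairwise correlations among the quantitative variables,
# clustered so that related predictors sit together
Hitters %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%
  corrplot::corrplot(method = "ellipse", order = "hclust",
                     tl.col = "black", tl.cex = 0.7)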

Correlation Error-Bars

We can also plot all correlations in one graph using cor.test and purrr:

all_corrs <- 
  Hitters %>% 
  select(where(is.numeric)) %>% 
  
# leave off Salary to get all the remaining predictors
  select(- Salary) %>% 
  
  
  # perform a cor.test for all variables against Salary
  purrr::map(.x = .,
             .f = \(x) cor.test(x, Hitters$Salary)) %>%
  
  # tidy up the cor.test outputs into a tidy data frame
  map_dfr(broom::tidy, .id = "predictor") %>% 
  arrange(desc(estimate))

all_corrs
all_corrs %>%
  gf_hline(yintercept = 0,
           linewidth = 2,
           color = "grey") %>%
  gf_errorbar(
    conf.high + conf.low ~ reorder(predictor, estimate),
    color = ~ estimate,
    linewidth =  ~ -log10(p.value),
    width = 0.5,
    caption = "Significance = -log10(p.value)"
  ) %>%
  gf_point(estimate ~ reorder(predictor, estimate)) %>%
  gf_labs(x = NULL, y = "Correlation with Salary") %>%
  #gf_theme(theme = my_theme()) %>%
  gf_refine(
    scale_colour_distiller("Correlation", type = "div",
                           palette = "RdBu"),
    scale_linewidth_continuous("Significance", range = c(0.25, 3)),
    guides(linewidth = guide_legend(reverse = TRUE)),
    theme(axis.text.x = element_text(hjust = 1)),
    coord_flip()
  )
all_corrs (first 10 of 16 rows; first 5 of 9 columns):

   predictor estimate statistic  p.value parameter
   CRBI        0.5670   11.1195 9.07e-24       261
   CRuns       0.5627   10.9962 2.31e-23       261
   CHits       0.5489   10.6090 4.26e-22       261
   CAtBat      0.5261    9.9953 3.98e-20       261
   CHmRun      0.5249    9.9637 5.01e-20       261
   CWalks      0.4898    9.0768 2.82e-17       261
   RBI         0.4495    8.1285 1.76e-14       261
   Walks       0.4439    8.0024 4.01e-14       261
   Hits        0.4387    7.8863 8.53e-14       261
   Runs        0.4199    7.4737 1.18e-12       261

There are a good many predictors which have statistically significant correlations with Salary, such as CRuns and CHmRun. The darker the colour, the higher the correlation; the wider the bar, the more significant the correlation.

We now start by setting up a simple Linear Regression with no predictors, only an intercept. We then fit separate Linear Models using each predictor individually. Based on the improvement in r.squared offered by each predictor, we progressively add the best one to the model, until we are “satisfied” with the quality of the model (using r.squared and other means).

Let us now do this.

Workflow: Minimal Multiple Regression Model

Note the formula structure here: we want just an intercept.

lm_min <- lm(data = Hitters, Salary ~ 1)
summary(lm_min)

Call:
lm(formula = Salary ~ 1, data = Hitters)

Residuals:
   Min     1Q Median     3Q    Max 
  -468   -346   -111    214   1924 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    535.9       27.8    19.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 451 on 262 degrees of freedom
  (59 observations deleted due to missingness)
lm_min %>% broom::tidy()
lm_min %>% broom::glance()
# tidy() output:
  term        estimate std.error statistic  p.value
  (Intercept)      536      27.8      19.3 4.02e-52

# glance() output (first 6 of 12 columns):
  r.squared adj.r.squared sigma statistic p.value    df
          0             0   451        NA      NA    NA

OK, so the intercept is highly significant and the t-statistic is high, but the intercept contributes nothing to the r.squared!! On its own, it is of no use at all!
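This is as expected: the intercept-only model simply predicts the overall mean Salary for every player, and so explains none of the variance. A quick check (a sketch, our addition):

# the sole coefficient of the null model is just the sample mean of Salary
mean(Hitters$Salary, na.rm = TRUE) # ~ 536
coef(lm_min)                       # (Intercept) ~ 536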

Workflow: Predictor Addition (Round#1)

We will now set up individual models for each predictor and look at the p.value and r.squared offered by each separate model:

names <- names(Hitters %>%
  select(where(is.numeric), 
         -c(Salary)))

n_vars <- length(names)

Hitters_model_set <- tibble(all_vars = list(names),
                            keep_vars = seq(1, n_vars),
                            data = list(Hitters))

# Unleash purrr in a series of mutates
Hitters_model_set <- Hitters_model_set %>%
  
# Select Single Predictor for each Simple Model
  mutate(mod_vars =
           pmap(
             .l = list(all_vars, keep_vars, data),
             .f = \(all_vars, keep_vars, data) all_vars[keep_vars]
           )) %>%
  
# build formulae with these for linear regression
  mutate(formula = map(.x = mod_vars,
                       .f = \(mod_vars) as.formula(paste(
                         "Salary ~", paste(mod_vars, collapse = "+")
                       )))) %>%
  
# use the formulae to build multiple linear models
  mutate(models =
           pmap(
             .l = list(data, formula),
             .f = \(data, formula) lm(formula, data = data)
           ))


# Tidy up the models using broom to expose their metrics
Hitters_model_set <- 
  Hitters_model_set %>% 
  mutate(tidy_models =
           map(
             .x = models,
             .f = \(x) broom::glance(x,
                                     conf.int = TRUE,
                                     conf.level = 0.95)
           ),
         predictor_name = names[keep_vars]) %>% 

  # Remove unwanted columns, keep model and predictor count
  select(keep_vars,predictor_name, tidy_models) %>%
  unnest(tidy_models) %>% 
  arrange(desc(r.squared))

# Check everything after the operation
Hitters_model_set
# Plot r.squared vs predictor count
Hitters_model_set %>% 
  gf_point(r.squared ~ reorder(predictor_name, r.squared), 
           size = 3.5, 
           color = "black",
           ylab = "R.Squared",
           xlab = "Params in the Linear Model", data = .) %>%
  #gf_theme(my_theme()) %>% 
  gf_refine(theme(axis.text.x = element_text(angle = 30,
                                             hjust = 1)))
Hitters_model_set (first 10 of 16 rows; first 4 of 14 columns):

  keep_vars predictor_name r.squared adj.r.squared
         12 CRBI           0.3214501       0.31885
         11 CRuns          0.3166062       0.31399
          9 CHits          0.3013017       0.29862
          8 CAtBat         0.2768184       0.27405
         10 CHmRun         0.2755521       0.27278
         13 CWalks         0.2399256       0.23701
          5 RBI            0.2020117       0.19895
          6 Walks          0.1970181       0.19394
          2 Hits           0.1924355       0.18934
          4 Runs           0.1762812       0.17313

# Which is the winning Predictor?
winner <- Hitters_model_set %>% 
  arrange(desc(r.squared)) %>% 
  select(predictor_name) %>% 
  head(1) %>% as.character()
winner
[1] "CRBI"
# Here is the Round 1 Model
# Minimal model updated to include the winning predictor
lm_round1 <- update(lm_min, ~. + CRBI)
lm_round1 %>% broom::tidy()
lm_round1 %>% broom::glance()
# tidy() output:
  term        estimate std.error statistic  p.value
  (Intercept)  274.580   32.8554      8.36 3.85e-15
  CRBI           0.791    0.0711     11.12 9.07e-24

# glance() output (first 6 of 12 columns):
  r.squared adj.r.squared sigma statistic  p.value    df
      0.321         0.319   372       124 9.07e-24     1

So adding CRBI as a predictor gives us an improved r.squared of 0.321, which is the square of the correlation (0.567) between Salary and CRBI.

And the model itself is:

$$
\text{Salary} \sim 274.580 + 0.791 \times \text{CRBI} \tag{1}
$$
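As a quick sanity check (our addition): for a single-predictor model, r.squared is the square of the raw correlation. Here use = "complete.obs" drops the 59 players with missing Salary:

# should be approximately 0.321, matching glance() above
cor(Hitters$Salary, Hitters$CRBI, use = "complete.obs")^2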

Let’s press on to Round 2.

Workflow: Predictor Addition (Round #2)

We will set up a round-2 model using CRBI as the predictor, and then proceed to add each of the other predictors as an update to the model.

# Preliminaries: drop Salary and the Round-1 winner (CRBI)
# from the candidate predictor list
names <- names(Hitters %>%
  select(where(is.numeric), -Salary, -all_of(winner)))

n_vars <- length(names)


# Round 2 Iteration
Hitters_model_set <- tibble(all_vars = list(names),
                            keep_vars = seq(1, n_vars),
                            data = list(Hitters))
# Hitters_model_set 

# Unleash purrr in a series of mutates
Hitters_model_set <- Hitters_model_set %>%
  
# list of predictor variables for each model
  mutate(mod_vars =
           pmap(
             .l = list(all_vars, keep_vars, data),
             .f = \(all_vars, keep_vars, data) all_vars[keep_vars]
           )) %>%
  
# build formulae with these for linear regression
  mutate(formula = map(.x = mod_vars,
                       .f = \(mod_vars) as.formula(paste(
                         "Salary ~ CRBI +", paste(mod_vars, collapse = "+")
                       )))) %>%
  
# use the formulae to build multiple linear models
  mutate(models =
           pmap(
             .l = list(data, formula),
             .f = \(data, formula) lm(formula, data = data)
           ))
# Check everything after the operation
# Hitters_model_set

# Tidy up the models using broom to expose their metrics
Hitters_model_set <- 
  Hitters_model_set %>% 
  mutate(tidy_models =
           map(
             .x = models,
             .f = \(x) broom::glance(x,
                                     conf.int = TRUE,
                                     conf.level = 0.95)
           ),
         predictor_name = names[keep_vars]) %>% 

  # Remove unwanted columns, keep model and predictor count
  select(keep_vars,predictor_name, tidy_models) %>%
  unnest(tidy_models) %>% 
  arrange(desc(r.squared))

Hitters_model_set
# Plot r.squared vs predictor count
Hitters_model_set %>% 
  gf_point(r.squared ~ reorder(predictor_name, r.squared), 
                               size = 3.5,
                               ylab = "R.Squared",
                               xlab = "Param in the Linear Model") %>%
  #gf_theme(my_theme()) %>% 
  gf_refine(theme(axis.text.x = element_text(angle = 30, 
                                             hjust = 1)))
Hitters_model_set (first 10 of 15 rows; first 4 of 14 columns):

  keep_vars predictor_name r.squared adj.r.squared
          2 Hits               0.425         0.421
          4 Runs               0.419         0.414
          6 Walks              0.400         0.396
          1 AtBat              0.398         0.393
          5 RBI                0.383         0.379
         13 PutOuts            0.383         0.378
          7 Years              0.353         0.348
          3 HmRun              0.345         0.340
         11 CRuns              0.328         0.323
         14 Assists            0.328         0.323

# Which is the winning Predictor?
# 
winner <- Hitters_model_set %>% 
  arrange(desc(r.squared)) %>% 
  select(predictor_name) %>% 
  head(1) %>% as.character()
winner
# Here is the Round 2 Model
lm_round2 <- update(lm_round1, ~. + Hits)
lm_round2 %>% broom::tidy()
lm_round2 %>% broom::glance()
[1] "Hits"
ABCDEFGHIJ0123456789
term
<chr>
estimate
<dbl>
std.error
<dbl>
(Intercept)-47.9655.9825
CRBI0.690.0672
Hits3.300.4818
3 rows | 1-3 of 5 columns
ABCDEFGHIJ0123456789
r.squared
<dbl>
adj.r.squared
<dbl>
sigma
<dbl>
0.4250.421343
1 row | 1-3 of 12 columns

And now the model itself is:

$$
\text{Salary} \sim -47.96 + 0.691 \times \text{CRBI} + 3.30 \times \text{Hits} \tag{2}
$$

Note the change in both the intercept and the slope for CRBI when the new predictor Hits is added!!
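Is the jump in r.squared from 0.321 to 0.425 worth taking seriously? One way to check, not part of the original workflow, is a nested-model F-test; a minimal sketch:

# compare Round-1 and Round-2 models; a small p-value indicates that
# Hits adds significant explanatory power over CRBI alone
anova(lm_round1, lm_round2)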

Workflow: Visualization

Let us quickly see how this model might look. We know that with simple regression, we obtain a straight line as our model. Here, with two (or more) predictors, we should obtain a (hyper)plane! Play with the interactive plot below!

broom::augment(lm_round2) (first 10 of 263 rows):

  .rownames          Salary CRBI Hits .fitted  .resid    .hat .sigma  .cooksd .std.resid
  -Alan Ashby         475.0  414   81   505.0  -30.03 0.00576    344 1.49e-05   -0.08772
  -Alvin Davis        480.0  266  130   564.7  -84.67 0.00510    344 1.04e-04   -0.24724
  -Andre Dawson       500.0  838  141   995.6 -495.60 0.01382    343 9.87e-03   -1.45361
  -Andres Galarraga    91.5   46   87   271.0 -179.45 0.00704    344 6.51e-04   -0.52454
  -Alfredo Griffin    750.0  336  169   741.7    8.31 0.01113    344 2.22e-06    0.02433
  -Al Newman           70.0    9   37    80.4  -10.38 0.01490    344 4.68e-06   -0.03047
  -Argenis Salazar    100.0   37   73   218.5 -118.53 0.00826    344 3.34e-04   -0.34668
  -Andres Thomas       75.0   34   81   242.9 -167.87 0.00763    344 6.17e-04   -0.49083
  -Andre Thornton    1100.0  890   92   869.7  230.27 0.01737    344 2.70e-03    0.67660
  -Alan Trammell      517.1  504  159   824.6 -307.44 0.00904    343 2.46e-03   -0.89957

[Interactive plotly figure: "Salaries of Baseball Players", showing the data points and the fitted regression plane over CRBI and Hits.]
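A minimal sketch of how such an interactive plane plot can be built with plotly (our reconstruction, since the original plotting code is not shown; names such as crbi_seq and z_mat are ours):

library(plotly)

aug <- broom::augment(lm_round2)

# a grid of predictor values over which to evaluate the fitted plane
crbi_seq <- seq(min(aug$CRBI), max(aug$CRBI), length.out = 25)
hits_seq <- seq(min(aug$Hits), max(aug$Hits), length.out = 25)

# plotly surfaces index z as z[y, x], so rows follow hits_seq
b <- coef(lm_round2)
z_mat <- outer(hits_seq, crbi_seq,
               \(h, cr) b[1] + b["CRBI"] * cr + b["Hits"] * h)

plot_ly() %>%
  add_markers(data = aug, x = ~CRBI, y = ~Hits, z = ~Salary,
              marker = list(size = 2)) %>%
  add_surface(x = crbi_seq, y = hits_seq, z = z_mat,
              opacity = 0.5, showscale = FALSE) %>%
  layout(title = "Salaries of Baseball Players")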

Discussion

It is interesting that the second variable to be added was Hits, which has a lower correlation with Salary (r = 0.439) than some other Quant predictors such as CHits (r = 0.549). This is because CRBI is hugely correlated with all of these predictors, so CRBI effectively acts as a proxy for them. See Plot 3.
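We can verify this proxy effect directly; a quick sketch (our addition):

# CRBI is strongly correlated with the other career (C*) statistics
Hitters %>%
  select(CRBI, CHits, CRuns, CAtBat, CHmRun, CWalks) %>%
  cor(use = "pairwise.complete.obs") %>%
  round(2)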

We see that adding Hits to the model gives us an improved r.squared of 0.425.

Conclusion

We can proceed in this way through subsequent rounds, and decide to stop when the gain in r.squared no longer seems worth the added model complexity (i.e. the number of predictors).
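For comparison, base R's step() function automates a very similar forward search, though it picks predictors by AIC rather than by r.squared. A hedged sketch (an alternative, not the method used above):

# complete-case, numeric-only data so that every candidate model
# is fitted to the same rows
hitters_cc <- Hitters %>%
  select(where(is.numeric)) %>%
  tidyr::drop_na()

lm_null <- lm(Salary ~ 1, data = hitters_cc)
lm_full <- lm(Salary ~ ., data = hitters_cc)

# forward selection from the null model towards the full model, by AIC
step(lm_null, scope = formula(lm_full), direction = "forward", trace = 0)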

Note: Automatic Iteration Method

We ought to convert the above code into an R function and run it that way for a specific number of rounds to see how things pan out. That is for the next version of this Tutorial! Meanwhile, it appears that there is, what else, an R package called reghelper that allows us to do this! 😇 The reghelper::build_model() function can be used to:

  • Start with only an intercept
  • Sequentially add each of the other predictor variables into the model β€œblocks”
  • Blocks will be added in the order they are passed to the function, and variables from previous blocks will be included with each subsequent block, so they do not need to be repeated.

Type help(reghelper) in your Console.

library(reghelper)

big_model <- build_model(
  dv = Salary, 
  # Start with only an intercept lm(Salary ~ 1, data = .)
  
  # Sequentially add each of the other predictor variables
  # Pass through variable names (or interaction terms) to add for each block. 
  # To add one term to a block, just pass it through directly; 
  # to add multiple terms at a time to a block, pass it through in a vector or list. 
  # Interaction Terms can be specified using the vector/list
  # Blocks will be added in the order they are passed to the function
  # Variables from previous blocks will be included with each subsequent block,  so they do not need to be repeated.
  
  1, AtBat, Hits, HmRun, Runs, RBI, Walks, Years, CAtBat, CHits, 
  CHmRun, CRuns, CRBI, CWalks, PutOuts, Assists, Errors,
  
  data = Hitters, 
  model = 'lm')

This multiple model is a list object with 4 items. Type summary(big_model) in your Console.

We can clean it up a wee bit:

library(gt)
# big_model has 4 parts: formulas, residuals, coefficients, overall

overall_clean <- summary(big_model)$overall %>% as_tibble() %>% janitor::clean_names()

formulas_clean <- summary(big_model)$formulas %>% 
  as.character() %>% as_tibble() %>% 
  rename("model_formula" = value)

all_models <- cbind(formulas_clean, overall_clean) %>% 
  dplyr::select(1,2,8)
all_models %>% 
  gt::gt() %>% 
    tab_style(style = cell_fill(color = "grey"),
            locations = cells_body(rows = seq(1, 18, 2)))
model_formula r_squared delta_r_sq
Salary ~ 1 NA NA
Salary ~ 1 + AtBat 0.156 NA
Salary ~ 1 + AtBat + Hits 0.204 0.047748
Salary ~ 1 + AtBat + Hits + HmRun 0.227 0.023530
Salary ~ 1 + AtBat + Hits + HmRun + Runs 0.227 0.000137
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI 0.244 0.016807
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks 0.307 0.062553
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years 0.409 0.101919
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat 0.455 0.046369
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat + CHits 0.471 0.016234
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat + CHits + CHmRun 0.488 0.016771
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat + CHits + CHmRun + CRuns 0.488 0.000333
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat + CHits + CHmRun + CRuns + CRBI 0.489 0.001157
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks 0.498 0.008575
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks + PutOuts 0.522 0.023739
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks + PutOuts + Assists 0.527 0.005401
Salary ~ 1 + AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks + PutOuts + Assists + Errors 0.528 0.000814

So we have a list of all models with main effects only. We could play with the build_model() function to develop interaction models too! It is slightly weird that the NULL model of Salary ~ 1 does not show an r.squared value with build_model…??
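Going by the block syntax described above, an interaction block might look like this (a sketch only; we have not verified this output):

library(reghelper)

# blocks: intercept; then + CRBI; then + Hits and the CRBI:Hits
# interaction added together as one block (vector syntax)
int_model <- build_model(
  dv = Salary,
  1,
  CRBI,
  c(Hits, CRBI:Hits),
  data = Hitters,
  model = "lm")
summary(int_model)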

References


https://ethanwicker.com/2021-01-11-multiple-linear-regression-002/

R Package Citations
Package Version Citation
corrgram 1.14 Wright (2021)
corrplot 0.92 Wei and Simko (2021)
GGally 2.2.1 Schloerke et al. (2024)
gt 0.10.1 Iannone et al. (2024)
infer 1.0.7 Couch et al. (2021)
ISLR 1.4 James et al. (2021)
janitor 2.2.0 Firke (2023)
reghelper 1.1.2 Hughes and Beiner (2023)
Couch, Simon P., Andrew P. Bray, Chester Ismay, Evgeni Chasnovski, Benjamin S. Baumer, and Mine Γ‡etinkaya-Rundel. 2021. β€œinfer: An R Package for Tidyverse-Friendly Statistical Inference.” Journal of Open Source Software 6 (65): 3661. https://doi.org/10.21105/joss.03661.
Firke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Hughes, Jeffrey, and David Beiner. 2023. reghelper: Helper Functions for Regression Analysis. https://CRAN.R-project.org/package=reghelper.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, and JooYoung Seo. 2024. gt: Easily Create Presentation-Ready Display Tables. https://CRAN.R-project.org/package=gt.
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in r. https://CRAN.R-project.org/package=ISLR.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to β€œggplot2”. https://CRAN.R-project.org/package=GGally.
Wei, Taiyun, and Viliam Simko. 2021. R Package β€œcorrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wright, Kevin. 2021. corrgram: Plot a Correlogram. https://CRAN.R-project.org/package=corrgram.

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {Tutorial: {Multiple} {Linear} {Regression} with {Forward}
    {Selection}},
  date = {2023-05-13},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/files/forward-selection-1.html},
  langid = {en},
  abstract = {Using Multiple Regression to predict Quantitative Target
    Variables}
}
For attribution, please cite this work as:
V., Arvind. 2023. β€œTutorial: Multiple Linear Regression with Forward Selection.” May 13, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/files/forward-selection-1.html.

License: CC BY-SA 2.0
