Modelling with Linear Regression

Linear Regression
Quantitative Predictor
Quantitative Response
Sum of Squares
Residuals
Author

Arvind V.

Published

April 13, 2023

Modified

June 26, 2024

Abstract
Predicting Quantitative Target Variables

Slides and Tutorials

Multiple Regression - Forward Selection   Multiple Regression - Backward Selection   Permutation Test for Regression 

Setting up R Packages

knitr::opts_chunk$set(echo = TRUE, warning = TRUE, message = FALSE)
#options(scipen = 1, digits = 3) #set digits to three decimal places
library(tidyverse)
library(ggformula)
library(mosaic)
library(GGally)
library(corrplot)
library(corrgram)
library(ggstatsplot)
# library(equatiomatic)

Introduction

One of the most common problems in Prediction Analytics is that of predicting a Quantitative response variable, based on one or more Quantitative predictor variables or features. This is called Linear Regression. We will use the intuitions built up during our study of ANOVA to develop our ideas about Linear Regression.

Suppose we have data on salaries in a Company, with years of study and previous experience. Would we be able to predict the prospective salary of a new candidate, based on their years of study and experience? Or based on mileage done, could we predict the resale price of a used car? These are typical problems in Linear Regression.

In this tutorial, we will use the Boston housing dataset. Our research question is:

Research Question

How do we predict the price of a house in Boston, based on other Quantitative parameters such as area, location, rooms, and crime rate in the neighbourhood?

The Linear Regression Model

The premise here is that many common statistical tests are special cases of the linear model.

A linear model estimates the relationship between one continuous or ordinal variable (dependent variable or “response”) and one or more other variables (explanatory variable or “predictors”). It is assumed that the relationship is linear:1

\[ y_i \sim \beta_1*x_i + \beta_0 \]

or

\[ y_i \sim exp(\beta_1)*x_i + \beta_0 \]

but not:

\[ y_i \sim \beta_1*exp(\beta_2*x_i) + \beta_0 \]

or

\[ y_i \sim \beta_1*x_i^{\beta_2} + \beta_0 \]

\[ y = \beta_0 + \beta_1*x \]

\(\beta_0\) is the intercept and \(\beta_1\) is the slope of the linear fit, which predicts the value of y based on the value of x. Each prediction leaves a small “residual” error between the actual and predicted values. \(\beta_0\) and \(\beta_1\) are calculated by minimizing the sum of squares of these residuals, and hence this method is called “ordinary least squares” (OLS) regression.
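As a concrete sketch with simulated data (the variable names here are ours, for illustration only): the OLS slope is the covariance of x and y divided by the variance of x, and the intercept then follows from the sample means.

set.seed(42)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50, sd = 1.5)

beta_1 <- cov(x, y) / var(x)          # slope that minimizes the sum of squared residuals
beta_0 <- mean(y) - beta_1 * mean(x)  # intercept: the fit passes through the point of means

c(beta_0, beta_1)
coef(lm(y ~ x))                       # lm() computes the same OLS estimates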

Least Squares

(Figure: Least Squares — the fitted line with the squared residuals shown as shaded squares.)

The net area of all the shaded squares is minimized in the calculation of \(\beta_0\) and \(\beta_1\). It is also possible that there is more than one explanatory variable: this is multiple regression.

\[ y = \beta_0 + \beta_1*x_1 + \beta_2*x_2 ...+ \beta_n*x_n \]

where each of the \(\beta_i\) are slopes defining the relationship between y and \(x_i\). Together, the RHS of that equation defines an n-dimensional hyperplane. The model is linear in the parameters \(\beta_i\), e.g. these are OK:

\[ \color{blue}{ \begin{cases} & y_i = \pmb\beta_0 + \pmb\beta_1x_1 + \pmb\beta_2x_1^2 + \epsilon_i\\ & y_i = \pmb\beta_0 + \pmb\gamma_1\pmb\delta_1x_1 + exp(\pmb\beta_2)x_2+ \epsilon_i\\ \end{cases} } \]

but not, for example, these:

\[ \color{red}{ \begin{cases} & y_i = \pmb\beta_0 + \pmb\beta_1x_1^{\beta_2} + \epsilon_i\\ & y_i = \pmb\beta_0 + exp(\pmb\beta_1x_1) + \epsilon_i\\ \end{cases} } \]

As per Lindeløv, many statistical tests, from one-sample t-tests to two-way ANOVA, are special cases of this system. Also see Jeffrey Walker’s “A linear model can be fit to data with continuous, discrete, or categorical x variables”.
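In R’s formula syntax, “linear in the parameters” means such models can still be fit with lm(), wrapping transformed predictors in I(). A hedged sketch with simulated data (all names here are ours):

set.seed(1)
df <- tibble(x1 = runif(100), x2 = runif(100)) %>%
  mutate(y = 1 + 2 * x1 + 3 * x1^2 + 0.5 * x2 + rnorm(100, sd = 0.1))

lm(y ~ x1 + I(x1^2) + x2, data = df)  # linear in the betas, so lm() is fine
# y ~ beta_0 + exp(beta_1 * x1) is NOT linear in beta_1; that would need nls() instead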

Linear Models as Hypothesis Tests

Using linear models is based on the idea of Hypothesis Testing. The Hypothesis Testing method typically defines a NULL Hypothesis, a statement that reads as “there is no relationship” between the variables at hand, explanatory and response. The Alternative Hypothesis typically states that there is a relationship between the variables.

Accordingly, in fitting the linear model \(y = \beta_0 + \beta_1 x\), we proceed as follows:

  1. Make the following hypotheses:

\[ NULL\ Hypothesis\ H_0: x\ and\ y\ are\ unrelated\ (\beta_1 = 0) \]

\[ Alternate\ Hypothesis\ H_1: x\ and\ y\ are\ linearly\ related\ (\beta_1 \ne 0) \]

  2. “Assume” that \(H_0\) is true.
  3. Calculate \(\beta_1\).
  4. Find the probability p of obtaining an estimate as large as the calculated \(\beta_1\) when the NULL Hypothesis is assumed TRUE. This is the p-value. If p >= 0.05, we say we “cannot reject” \(H_0\), and there is unlikely to be a significant linear relationship.
  5. However, if p <= 0.05, we can reject the NULL hypothesis and say that there could be a significant linear relationship, because the probability p of obtaining so large a \(\beta_1\) by mere chance under \(H_0\) is very small. (This is illustrated with a small example below.)
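As a quick illustration of reading this off an lm() fit, here is a sketch using the built-in mtcars data (not our housing data); the p.value column reports the test of \(H_0: \beta = 0\) for each coefficient.

illustrative_lm <- lm(mpg ~ wt, data = mtcars)  # illustrative example only
broom::tidy(illustrative_lm)  # the p.value in the `wt` row tests H0: beta_1 = 0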

Assumptions in Linear Models

We can write the assumptions in Linear Regression Models as an acronym, LINE:
1. L: \(\color{blue}{linear}\) relationship
2. I: Errors are independent (across observations)
3. N: y is \(\color{red}{normally}\) distributed at each “level” of x.
4. E: equal variance at all levels of x. No heteroscedasticity.

(Figure: OLS Assumptions)

Hence a very concise way of expressing the Linear Model is:

\[ y_i \sim N(x_i^T * \beta, ~~\sigma^2) \]

General Linear Models

The target variable \(y\) is modelled as a normally distributed variable whose mean depends upon a linear combination of predictor variables \(x\), and whose variance is \(\sigma^2\).
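A small simulation sketch of this statement, with assumed values \(\beta_0 = 2\), \(\beta_1 = 0.5\), and \(\sigma = 1\): the mean of y is a linear function of x, and the spread about that mean is constant.

set.seed(123)
x <- runif(200, 0, 10)
y <- rnorm(200, mean = 2 + 0.5 * x, sd = 1)  # y ~ N(beta_0 + beta_1 * x, sigma^2)
gf_point(y ~ x) %>% gf_lm()                  # scatter with the fitted straight line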

Let us now read in the data and check for these assumptions as part of our Workflow.

Workflow: Read the Data

data("BostonHousing2", package = "mlbench")
housing <- BostonHousing2
inspect(housing)

categorical variables:  
  name  class levels   n missing                                  distribution
1 town factor     92 506       0 Cambridge (5.9%) ...                         
2 chas factor      2 506       0 0 (93.1%), 1 (6.9%)                          

quantitative variables:  
      name   class       min          Q1     median          Q3       max
1    tract integer   1.00000 1303.250000 3393.50000 3739.750000 5082.0000
2      lon numeric -71.28950  -71.093225  -71.05290  -71.019625  -70.8100
3      lat numeric  42.03000   42.180775   42.21810   42.252250   42.3810
4     medv numeric   5.00000   17.025000   21.20000   25.000000   50.0000
5    cmedv numeric   5.00000   17.025000   21.20000   25.000000   50.0000
6     crim numeric   0.00632    0.082045    0.25651    3.677083   88.9762
7       zn numeric   0.00000    0.000000    0.00000   12.500000  100.0000
8    indus numeric   0.46000    5.190000    9.69000   18.100000   27.7400
9      nox numeric   0.38500    0.449000    0.53800    0.624000    0.8710
10      rm numeric   3.56100    5.885500    6.20850    6.623500    8.7800
11     age numeric   2.90000   45.025000   77.50000   94.075000  100.0000
12     dis numeric   1.12960    2.100175    3.20745    5.188425   12.1265
13     rad integer   1.00000    4.000000    5.00000   24.000000   24.0000
14     tax integer 187.00000  279.000000  330.00000  666.000000  711.0000
15 ptratio numeric  12.60000   17.400000   19.05000   20.200000   22.0000
16       b numeric   0.32000  375.377500  391.44000  396.225000  396.9000
17   lstat numeric   1.73000    6.950000   11.36000   16.955000   37.9700
           mean           sd   n missing
1  2700.3557312 1.380037e+03 506       0
2   -71.0563887 7.540535e-02 506       0
3    42.2164403 6.177718e-02 506       0
4    22.5328063 9.197104e+00 506       0
5    22.5288538 9.182176e+00 506       0
6     3.6135236 8.601545e+00 506       0
7    11.3636364 2.332245e+01 506       0
8    11.1367787 6.860353e+00 506       0
9     0.5546951 1.158777e-01 506       0
10    6.2846344 7.026171e-01 506       0
11   68.5749012 2.814886e+01 506       0
12    3.7950427 2.105710e+00 506       0
13    9.5494071 8.707259e+00 506       0
14  408.2371542 1.685371e+02 506       0
15   18.4555336 2.164946e+00 506       0
16  356.6740316 9.129486e+01 506       0
17   12.6530632 7.141062e+00 506       0

The original data are 506 observations on 14 variables, medv being the target variable:

crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq.ft
indus proportion of non-retail business acres per town
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox nitric oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per USD 10,000
ptratio pupil-teacher ratio by town
b \(1000(B - 0.63)^2\) where B is the proportion of Blacks by town
lstat percentage of lower status of the population
medv median value of owner-occupied homes in USD 1000’s

The corrected data set has the following additional columns:

cmedv corrected median value of owner-occupied homes in USD 1000’s
town name of town
tract census tract
lon longitude of census tract
lat latitude of census tract

Our response variable is cmedv, the corrected median value of owner-occupied homes in USD 1000’s. There are many Quantitative feature variables that we can use to predict cmedv. And there are two Qualitative features, chas and town.

Workflow: EDA

In order to fit the linear model, we need to choose predictor variables that have strong correlations with the target variable. We will first do this with GGally, and then with the tidyverse itself. Each gives us a distinctive view into the correlations that exist within this dataset.
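A possible first look is sketched below; the particular subset of predictors shown is our assumption, chosen only for illustration.

# Pairwise scatterplots and correlations with GGally
housing %>%
  select(cmedv, crim, rm, age, dis, lstat) %>%
  GGally::ggpairs(title = "Selected Predictors vs cmedv")

# Correlations of all numeric variables with cmedv, using the tidyverse alone
cor_matrix <- housing %>%
  select(where(is.numeric)) %>%
  cor()
sort(cor_matrix[, "cmedv"], decreasing = TRUE)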

Simple Regression vs Multiple Regression

If there are many predictor variables, we would typically want to use more of them to make our model and predictions. There are three ways2 to include more predictors:

  • Backward Selection: We would typically start with a maximal model3 and progressively simplify the model by knocking off predictors that have the least impact on model accuracy.
  • Forward Selection: Start with no predictors and systematically add them one by one to increase the quality of the model
  • Mixed Selection: Wherein we start with no predictors and add them to gain improvement, or remove them as their significance changes based on other predictors that have been added.

The first two are covered in the tutorials linked above; Mixed Selection we will leave for a more advanced course. For now, we will use just one predictor, rm (Avg. no. of Rooms), to model housing prices.

Workflow: Model Building

We will first fit the model with lm() in code and evaluate the results. Then we will do an intuitive walk-through of the process and finally, hand-calculate the entire analysis for a clear understanding.
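A minimal sketch of this first step; the object name housing_lm is the one used by the diagnostic code further below.

housing_lm <- lm(cmedv ~ rm, data = housing)  # single-predictor model: price vs rooms
summary(housing_lm)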

Workflow: Model Checking and Diagnostics

We will follow much of the treatment on Linear Model diagnostics, given here on the STHDA website.

A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R-squared, which tells us how well the linear regression model fits the data.

For example, the linear regression model makes the assumption that the relationship between the predictors (x) and the outcome variable is linear. This might not be true. The relationship could be polynomial or logarithmic.

Additionally, the data might contain some influential observations, such as outliers (or extreme values), that can affect the result of the regression.

Therefore, the regression model must be closely diagnosed in order to detect potential problems and to check whether the assumptions made by the linear regression model are met or not. To do so, we generally examine the distribution of residual errors, which can tell us more about our data.

Checks for Uncertainty

Let us first look at the uncertainties in the estimates of slope and intercept. These are most easily read off from the broom::tidy-ed model:

housing_lm_tidy <- housing_lm %>% broom::tidy(conf.int = TRUE)
housing_lm_tidy

Plotting this is simple too:

housing_lm_tidy %>%
  gf_col(estimate ~ term, fill = ~ term, width = 0.25) %>% 
  gf_hline(yintercept = 0) %>% 
  gf_errorbar(conf.low + conf.high ~ term, 
              width = 0.1, 
              title = "Model Estimates with Confidence Intervals") 

Checks for Constant Variance/Heteroscedasticity

Linear Modelling makes 4 fundamental assumptions (“LINE”):

  1. Linear relationship between y and x
  2. Observations are independent.
  3. Residuals are normally distributed
  4. Variance of the y variable is equal at all values of x.

We can check these assumptions with diagnostic graphs. Here we plot the residuals against the fitted values and against the target variable, and check whether there is any gross variation in their range.
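The plots that follow use a housing_lm_augment data frame; a minimal sketch of how it can be created with broom::augment(), which adds the fitted values (.fitted) and residuals (.resid) as columns alongside the model variables.

housing_lm_augment <- housing_lm %>% broom::augment()
housing_lm_augment %>% head()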

housing_lm_augment %>% 
  gf_point(.resid ~ .fitted, title = "Residuals vs Fitted") %>%
  gf_smooth(method = "loess")
housing_lm_augment %>% 
  gf_hline(yintercept = 0, colour = "grey", linewidth = 2) %>%
  gf_point(.resid ~ cmedv, title = "Residuals vs Target Variable") 
housing_lm_augment %>% 
  gf_dhistogram(~ .resid, title = "Histogram of Residuals") %>% 
  gf_fitdistr()
housing_lm_augment %>% 
  gf_qq(~ .resid, title = "Q-Q Residuals") %>% 
  gf_qqline() 

The Q-Q plot of residuals also shows significant deviations from the normal quantiles. The residuals are not quite “like the night sky”, i.e. not random enough. These point to the need for a richer model, with more predictors. The “trend line” of residuals vs the predictor shows a U-shaped pattern, indicating significant nonlinearity: there is a curved relationship in the graph. The solution can be a nonlinear transformation of the predictor variables, such as \(\sqrt{X}\), \(log(X)\), or even \(X^2\). For instance, we might try a model for cmedv using \(rm^2\) instead of just rm as we have done. This will still be a linear model!

Tip

Base R has a crisp command to plot these diagnostic graphs. But we will continue to use ggformula.

plot(housing_lm)

Tip

One of the ggplot extension packages named lindia also has a crisp command to plot these diagnostic graphs.

library(lindia)
gg_diagnose(housing_lm,
            mode = "base_r", # plots like those in base R
            theme = theme(axis.title = element_text(size = 6, face = "bold"),
                          title = element_text(size = 8)))

The r-squared for a model lm(cmedv ~ I(rm^2)) shows some improvement:

[1] 0.5501221
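A sketch of how such a model might be fit (the object name is ours); note the I() wrapper, which makes ^2 mean squaring rather than R’s formula interaction syntax.

housing_rm2_lm <- lm(cmedv ~ I(rm^2), data = housing)  # quadratic term, still a linear model
broom::glance(housing_rm2_lm)$r.squared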

Conclusions

We have seen how, starting from a basic EDA of the data, we have been able to choose a single Quantitative predictor variable to model a Quantitative target variable, using Linear Regression. As stated earlier, we may wish to use more than one predictor variable to build more sophisticated models with improved prediction capability. And there is more than one way of selecting these predictor variables, which we will examine in the Tutorials.

Secondly, sometimes it may be necessary to mathematically transform the variables in the dataset to enable the construction of better models, something we only touched upon here with the \(rm^2\) model.

We may also encounter cases where the predictor variables seem to work together; one predictor may influence “how well” another predictor works, something called an interaction effect or a synergy effect. We might then have to modify our formula to include interaction terms that look like \(predictor1 \times predictor2\).
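In R’s formula syntax, an interaction is specified with *, which expands into both main effects plus their product term. A hedged illustration, where the choice of lstat as the second predictor is our assumption:

housing_interact_lm <- lm(cmedv ~ rm * lstat, data = housing)  # rm + lstat + rm:lstat
broom::tidy(housing_interact_lm)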

So our Linear Modelling workflow might look like this: we have not seen all stages yet, but that is for another course module or tutorial!

(Figure: Our Linear Regression Workflow — Data → EDA (inspect, glimpse, skim, ggformula) → Check Relationships (corrplot, corrgram, ggformula + purrr, cor.test) → Build Model → Check Model Diagnostics and R² → if inadequate, Transform Variables or try Multiple Regression and/or interaction effects → Interpret Model → Apply Model.)

References

  1. https://mlu-explain.github.io/linear-regression/
  2. The Boston Housing Dataset, corrected version. StatLib @ CMU, lib.stat.cmu.edu/datasets/boston_corrected.txt
  3. https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
  4. Andrew Gelman, Jennifer Hill, Aki Vehtari. Regression and Other Stories, Cambridge University Press, 2023. Available Online
  5. Michael Crawley, The R Book, second edition, 2013. Chapter 11.
  6. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Introduction to Statistical Learning, Springer, 2021. Chapter 3. https://www.statlearning.com/
  7. David C Howell, Permutation Tests for Factorial ANOVA Designs
  8. Marti Anderson, Permutation tests for univariate or multivariate analysis of variance and regression
  9. http://r-statistics.co/Assumptions-of-Linear-Regression.html
  10. Judd, Charles M., Gary H. McClelland, and Carey S. Ryan. 2017. “Introduction to Data Analysis.” In, 1–9. Routledge. https://doi.org/10.4324/9781315744131-1. Also see http://www.dataanalysisbook.com/index.html
  11. Patil, I. (2021). Visualizations with statistical details: The ‘ggstatsplot’ approach. Journal of Open Source Software, 6(61), 3167, https://doi.org/10.21105/joss.03167
R Package Citations
Package Version Citation
corrgram 1.14 Wright (2021)
corrplot 0.92 Wei and Simko (2021)
GGally 2.2.1 Schloerke et al. (2024)
ggstatsplot 0.12.3 Patil (2021)
ISLR 1.4 James et al. (2021)
janitor 2.2.0 Firke (2023)
lindia 0.10 Lee and Ventura (2023)
reghelper 1.1.2 Hughes and Beiner (2023)
supernova 3.0.0 Blake et al. (2024)
tidyverse 2.0.0 Wickham et al. (2019)
Blake, Adam, Jeff Chrabaszcz, Ji Son, and Jim Stigler. 2024. supernova: Judd, McClelland, & Ryan Formatting for ANOVA Output. https://CRAN.R-project.org/package=supernova.
Firke, Sam. 2023. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.
Hughes, Jeffrey, and David Beiner. 2023. reghelper: Helper Functions for Regression Analysis. https://CRAN.R-project.org/package=reghelper.
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in r. https://CRAN.R-project.org/package=ISLR.
Lee, Yeuk Yu, and Samuel Ventura. 2023. lindia: Automated Linear Regression Diagnostic. https://CRAN.R-project.org/package=lindia.
Patil, Indrajeet. 2021. “Visualizations with statistical details: The ggstatsplot approach.” Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to ggplot2. https://CRAN.R-project.org/package=GGally.
Wei, Taiyun, and Viliam Simko. 2021. R Package corrplot: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wright, Kevin. 2021. corrgram: Plot a Correlogram. https://CRAN.R-project.org/package=corrgram.

Footnotes

  1. The model is linear in the parameters \(\beta_i\); see the examples shown in blue and red above.↩︎

  2. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Introduction to Statistical Learning, Springer, 2021. Chapter 3. Linear Regression. Available Online↩︎

  3. Michael Crawley, The R Book, Third Edition 2023. Chapter 9. Statistical Modelling↩︎

  4. Michael Crawley, The R Book, Third Edition 2023. Chapter 9. Statistical Modelling↩︎

  5. https://indrajeetpatil.github.io/ggstatsplot/reference/ggcoefstats.html↩︎

  6. https://github.com/UCLATALL/supernova↩︎

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {Modelling with {Linear} {Regression}},
  date = {2023-04-13},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/},
  langid = {en},
  abstract = {Predicting Quantitative Target Variables}
}
For attribution, please cite this work as:
V., Arvind. 2023. “Modelling with Linear Regression.” April 13, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/.