
Modelling with Linear Regression

Linear Regression
Quantitative Predictor
Quantitative Response
Sum of Squares
Residuals
Author

Arvind V.

Published

April 13, 2023

Modified

May 20, 2025

Abstract
Predicting Quantitative Target Variables

Slides and Tutorials

Multiple Regression - Forward Selection   Multiple Regression - Backward Selection   Permutation Test for Regression 

Setting up R Packages

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# options(scipen = 1, digits = 3) #set digits to three decimal places
library(tidyverse)
library(ggformula)
library(mosaic)
library(GGally)
library(corrplot)
library(corrgram)
library(ggstatsplot)

Plot Theme

Show the Code
# https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto

# Chunk options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, # Golden Ratio
  # out.width = "80%",
  fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes

theme_custom <- function() {
  font <- "Roboto Condensed" # assign font family up front

  theme_classic(base_size = 14) %+replace% # replace elements we want to change

    theme(
      panel.grid.minor = element_blank(), # strip minor gridlines
      text = element_text(family = font),
      # text elements
      plot.title = element_text( # title
        family = font, # set font family
        # size = 20,               #set font size
        face = "bold", # bold typeface
        hjust = 0, # left align
        # vjust = 2                #raise slightly
        margin = margin(0, 0, 10, 0)
      ),
      plot.subtitle = element_text( # subtitle
        family = font, # font family
        # size = 14,                #font size
        hjust = 0,
        margin = margin(2, 0, 5, 0)
      ),
      plot.caption = element_text( # caption
        family = font, # font family
        size = 8, # font size
        hjust = 1
      ), # right align

      axis.title = element_text( # axis titles
        family = font, # font family
        size = 10 # font size
      ),
      axis.text = element_text( # axis text
        family = font, # axis family
        size = 8
      ) # font size
    )
}

# Set graph theme
theme_set(new = theme_custom())
#

Introduction

One of the most common problems in Prediction Analytics is that of predicting a Quantitative response variable, based on one or more Quantitative predictor variables or features. This is called Linear Regression. We will use the intuitions built up during our study of ANOVA to develop our ideas about Linear Regression.

Suppose we have data on salaries in a Company, with years of study and previous experience. Would we be able to predict the prospective salary of a new candidate, based on their years of study and experience? Or based on the mileage done, could we predict the resale price of a used car? These are typical problems in Linear Regression.

In this tutorial, we will use the Boston housing dataset. Our research question is:

Note: Research Question

How do we predict the price of a house in Boston, based on other Quantitative parameters such as area, location, rooms, and crime-rate in the neighbourhood?

The Linear Regression Model

The premise here is that many common statistical tests are special cases of the linear model.

A linear model estimates the relationship between one continuous or ordinal variable (dependent variable or “response”) and one or more other variables (explanatory variable or “predictors”). It is assumed that the relationship is linear:1

(1)  $y_i \sim \beta_1 * x_i + \beta_0$

or

(2)  $y_i \sim \exp(\beta_1) * x_i + \beta_0$

but not:

$y_i \sim \beta_1 * \exp(\beta_2 * x_i) + \beta_0$

or

$y_i \sim \beta_1 * x_i^{\beta_2} + \beta_0$

In Equation 1, β0 is the intercept and β1 is the slope of the linear fit, which predicts the value of y based on the value of x. Each prediction leaves a small "residual" error between the actual and predicted values. β0 and β1 are calculated by minimizing the sum of squares of these residuals, and hence this method is called "ordinary least squares" (OLS) regression.

Figure 1: Least Squares

The net area of all the shaded squares is minimized in the calculation of β0 and β1. As per Lindeløv, many statistical tests, from one-sample t-tests to two-way ANOVA, are special cases of this system. Also see Jeffrey Walker's "A linear model can be fit to data with continuous, discrete, or categorical x-variables".

Linear Models as Hypothesis Tests

Using linear models is based on the idea of Testing of Hypotheses. The Hypothesis Testing method typically defines a NULL Hypothesis, which states that there is no relationship between the explanatory and response variables. The Alternative Hypothesis typically states that there is a relationship between the variables.

Accordingly, in fitting a linear model, we follow this process:

Note: Modelling Process

With y = β0 + β1 ∗ x:

  1. Make the following hypotheses:
     NULL Hypothesis H0: x and y are unrelated (β1 = 0).
     Alternate Hypothesis H1: x and y are linearly related (β1 ≠ 0).
  2. We "assume" that H0 is true.
  3. We calculate β1.
  4. We then find the probability p that β1 takes the estimated value when the NULL Hypothesis is assumed TRUE. This is the p-value. If p >= 0.05, we say we "cannot reject" H0, and there is unlikely to be a significant linear relationship.
  5. However, if p <= 0.05, we can reject the NULL hypothesis and say that there could be a significant linear relationship, because the probability p that β1 takes the estimated value by mere chance under H0 is very small.
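As a minimal sketch of this decision rule (using the built-in mtcars dataset as a stand-in, since our housing data is read in only later), we can pull out the slope's p-value from a fitted lm with broom::tidy and apply the 0.05 cut-off:

# Hypothetical illustration: fit a simple lm and apply the p-value decision rule
demo_lm <- lm(mpg ~ wt, data = mtcars)
broom::tidy(demo_lm) %>%
  filter(term == "wt") %>% # the slope (beta_1) row
  mutate(decision = ifelse(p.value < 0.05,
    "Reject H0: evidence of a linear relationship",
    "Cannot reject H0"
  ))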

Assumptions in Linear Models

When does a Linear Model work? We can write the assumptions in Linear Regression Models as an acronym, LINE:

  1. L: Linear relationship between variables
  2. I: Errors are Independent (across observations)
  3. N: y is Normally distributed at each "level" of x
  4. E: y has Equal variance at all levels of x (no heteroscedasticity)

Figure 2: OLS Assumptions

Hence a very concise way of expressing the Linear Model is:

$y \sim N(x_i^T \beta, \; \sigma^2)$

Important: General Linear Models

The target variable y is modelled as a normally distributed variable whose mean depends upon a linear combination of predictor variables x, and whose variance is σ2.
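A small simulation may make this concrete; the β values and σ below are arbitrary, purely for illustration:

# Simulate y ~ N(beta0 + beta1 * x, sigma^2) with made-up values beta0 = 2, beta1 = 3, sigma = 2
set.seed(42)
sim <- tibble(
  x = runif(200, 0, 10),
  y = rnorm(200, mean = 2 + 3 * x, sd = 2) # mean of y is a linear function of x
)
gf_point(y ~ x, data = sim, alpha = 0.3, title = "Simulated General Linear Model data") %>%
  gf_lm() # the fitted line should recover slope ~3 and intercept ~2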

Linear Model Workflow

OK, on with the computation!

Workflow: Read the Data

Let us now read in the data and check for these assumptions as part of our Workflow.

data("BostonHousing2", package = "mlbench")
housing <- BostonHousing2
inspect(housing)

categorical variables:  
  name  class levels   n missing                                  distribution
1 town factor     92 506       0 Cambridge (5.9%) ...                         
2 chas factor      2 506       0 0 (93.1%), 1 (6.9%)                          

quantitative variables:  
      name   class       min          Q1     median          Q3       max
1    tract integer   1.00000 1303.250000 3393.50000 3739.750000 5082.0000
2      lon numeric -71.28950  -71.093225  -71.05290  -71.019625  -70.8100
3      lat numeric  42.03000   42.180775   42.21810   42.252250   42.3810
4     medv numeric   5.00000   17.025000   21.20000   25.000000   50.0000
5    cmedv numeric   5.00000   17.025000   21.20000   25.000000   50.0000
6     crim numeric   0.00632    0.082045    0.25651    3.677083   88.9762
7       zn numeric   0.00000    0.000000    0.00000   12.500000  100.0000
8    indus numeric   0.46000    5.190000    9.69000   18.100000   27.7400
9      nox numeric   0.38500    0.449000    0.53800    0.624000    0.8710
10      rm numeric   3.56100    5.885500    6.20850    6.623500    8.7800
11     age numeric   2.90000   45.025000   77.50000   94.075000  100.0000
12     dis numeric   1.12960    2.100175    3.20745    5.188425   12.1265
13     rad integer   1.00000    4.000000    5.00000   24.000000   24.0000
14     tax integer 187.00000  279.000000  330.00000  666.000000  711.0000
15 ptratio numeric  12.60000   17.400000   19.05000   20.200000   22.0000
16       b numeric   0.32000  375.377500  391.44000  396.225000  396.9000
17   lstat numeric   1.73000    6.950000   11.36000   16.955000   37.9700
           mean           sd   n missing
1  2700.3557312 1.380037e+03 506       0
2   -71.0563887 7.540535e-02 506       0
3    42.2164403 6.177718e-02 506       0
4    22.5328063 9.197104e+00 506       0
5    22.5288538 9.182176e+00 506       0
6     3.6135236 8.601545e+00 506       0
7    11.3636364 2.332245e+01 506       0
8    11.1367787 6.860353e+00 506       0
9     0.5546951 1.158777e-01 506       0
10    6.2846344 7.026171e-01 506       0
11   68.5749012 2.814886e+01 506       0
12    3.7950427 2.105710e+00 506       0
13    9.5494071 8.707259e+00 506       0
14  408.2371542 1.685371e+02 506       0
15   18.4555336 2.164946e+00 506       0
16  356.6740316 9.129486e+01 506       0
17   12.6530632 7.141062e+00 506       0

The original data are 506 observations on 14 variables, medv being the target variable:

crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq.ft
indus proportion of non-retail business acres per town
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox nitric oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per USD 10,000
ptratio pupil-teacher ratio by town
b 1000(B − 0.63)^2 where B is the proportion of Blacks by town
lstat percentage of lower status of the population
medv median value of owner-occupied homes in USD 1000’s

The corrected data set has the following additional columns:

cmedv corrected median value of owner-occupied homes in USD 1000’s
town name of town
tract census tract
lon longitude of census tract
lat latitude of census tract

Our response variable is cmedv, the corrected median value of owner-occupied homes in USD 1000's. There are many Quantitative feature variables that we can use to predict cmedv. And there are two Qualitative features, chas and town.

Workflow: EDA

In order to fit the linear model, we need to choose predictor variables that have strong correlations with the target variable. We will first do this with GGally, and then with the tidyverse itself. Each gives us a different view into the correlations that exist within this dataset.

  • Workflow: Correlations with GGally
  • Correlations using cor.test and purrr

Let us select a few sets of Quantitative and Qualitative features, along with the target variable cmedv and do a pairs-plots with them:

# Set graph theme
theme_set(new = theme_custom())
#

housing %>%
  # Target variable cmedv
  # Predictors Rooms / Age / Distance to City Centres / Radial Highway Access
  select(cmedv, rm, age, dis) %>%
  GGally::ggpairs(
    title = "Plot 1",
    progress = FALSE,
    lower = list(continuous = wrap("smooth",
      alpha = 0.2
    ))
  )

##
housing %>%
  # Target variable cmedv
  # Predictors: Access to Radial Highways, / Resid. Land Proportion / proportion of non-retail business acres / full-value property-tax rate per USD 10,000
  select(cmedv, rad, zn, indus, tax) %>%
  GGally::ggpairs(
    title = "Plot 2",
    progress = FALSE,
    lower = list(continuous = wrap("smooth",
      alpha = 0.2
    ))
  )

##
housing %>%
  # Target variable cmedv
  # Predictors Crime Rate / Nitrous Oxide / Black Population / Lower Status Population
  select(cmedv, crim, nox, rad, b, lstat) %>%
  GGally::ggpairs(
    title = "Plot 3",
    progress = FALSE,
    lower = list(continuous = wrap("smooth",
      alpha = 0.2
    ))
  )

See the top row of the pairs plots. Clearly, rm (avg. number of rooms) is a big determining feature for median price cmedv. This we infer from the large correlation of rm with cmedv, 0.696. The variable age (proportion of owner-occupied units built prior to 1940) may also be a significant influence on cmedv, with a correlation of −0.378.

None of the Quant variables rad, zn, indus, tax has an overly strong correlation with cmedv.

The variable lstat (proportion of lower classes in the neighbourhood), as expected, has a strong (negative) correlation with cmedv; rad (index of accessibility to radial highways), nox (nitrous oxide) and crim (crime rate) also have fairly large correlations with cmedv, as seen from the pairs plots.

Important: Correlation Scores and Uncertainty

Recall that cor.test reports a correlation score and the p-value for the same. There is also a confidence interval reported for the correlation score, an interval within which we are 95% sure that the true correlation value is to be found.

Note that GGally too reports the significance of the correlation scores using stars, *** or **. This indicates the p-value in the scores obtained by GGally; Presumably, there is an internal cor.test that is run for each pair of variables and the p-value and confidence levels are also computed internally.
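For instance, running one such test explicitly for rm against cmedv shows the estimate together with its 95% confidence interval and p-value:

# Correlation test for rm vs cmedv: estimate, p-value, and 95% CI
cor.test(housing$rm, housing$cmedv)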

Let us plot (again) scatter plots of Quant Variables that have strong correlation with cmedv:

# Set graph theme
theme_set(new = theme_custom())
#
gf_point(
  data = housing,
  cmedv ~ age,
  title = "Price vs Proportion of houses older than 1940",
  ylab = "Median Price",
  xlab = "Proportion of older-than-1940 buildings"
)
##
gf_point(
  data = housing,
  cmedv ~ lstat,
  title = "Price vs Proportion of lower classes in the neighbourhood",
  ylab = "Median Price",
  xlab = "proportion of lower classes in the neighbourhood"
)
##
gf_point(
  data = housing,
  cmedv ~ rm,
  title = "Price vs Average no. of Rooms",
  ylab = "(cmedv) Median Price",
  xlab = "(rm) Avg. No. of Rooms"
)

So, rm does have a positive effect on cmedv, and age may have a (mild?) negative effect on cmedv; lstat seems to have a pronounced negative effect on cmedv. We have now managed to get a decent idea of which Quant predictor variables might be useful in modelling cmedv: rm and lstat for starters, then perhaps age.

Let us also check the Qualitative predictor variables: Access to the Charles river (chas) does seem to affect the prices somewhat.

# Set graph theme
theme_set(new = theme_custom())
#
housing %>%
  # Target variable cmedv
  # Predictor Access to Charles River
  select(cmedv, chas) %>%
  GGally::ggpairs(
    title = "Plot 4",
    progress = FALSE,
    lower = list(continuous = wrap("smooth",
      alpha = 0.2
    ))
  )

Look at the bar plot above. While not too many properties can be near the Charles River (for obvious reasons), the box plots do seem to show some dependency of cmedv on chas.

Note

Qualitative predictors for a Quantitative target can be included in the model using what are called dummy variables, where each level of the Qualitative variable is given a one-hot kind of encoding. See for example https://www.statology.org/dummy-variables-regression/
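To see what this encoding looks like, we can peek at the design matrix R would construct for a model that includes chas; the factor becomes a 0/1 dummy column (here named chas1):

# The model (design) matrix for a model with one Qualitative predictor:
# the factor `chas` is expanded into a 0/1 dummy column
model.matrix(cmedv ~ rm + chas, data = housing) %>% head()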

This is somewhat advanced material: We will use the purrr package to develop all correlations with respect to our target variable in one shot and also plot these correlation test scores in an error-bar plot. See Tidy Modelling with R. This has the advantage of being able to depict all correlations in one plot. (We will use this approach again here when we trim our linear models down from the maximal one to a workable one of lesser complexity.). Let us do this.

We develop a list object containing all correlation test results with respect to cmedv, tidy these up using broom::tidy, and then plot these:

# Set graph theme
theme_set(new = theme_custom())
#

all_corrs <- housing %>%
  select(where(is.numeric)) %>%
  # leave off target variable cmedv and IDs
  # get all the remaining ones
  select(-cmedv, -medv) %>%
  purrr::map(
    .x = ., # All numeric variables selected in the previous step
    .f = \(.x) cor.test(.x, housing$cmedv)
  ) %>% # Apply the cor.test with `cmedv`

  # Tidy up the cor.test outputs into neat columns
  # Need ".id" column to keep track of predictor variable name
  map_dfr(broom::tidy, .id = "predictor")

all_corrs
(first 10 of 15 rows and 5 of 9 columns shown)

  predictor   estimate       statistic     p.value        parameter
  tract        0.428251535    10.6392091   5.514616e-24   504
  lon         -0.322946685    -7.6606125   9.548359e-14   504
  lat          0.006825792     0.1532422   8.782686e-01   504
  crim        -0.389582441    -9.4963995   8.711542e-20   504
  zn           0.360386177     8.6734797   5.785518e-17   504
  indus       -0.484754379   -12.4423538   3.522132e-31   504
  nox         -0.429300219   -10.6711394   4.167568e-24   504
  rm           0.696303794    21.7792304   1.307493e-74   504
  age         -0.377998896    -9.1661252   1.241939e-18   504
  dis          0.249314834     5.7796099   1.313250e-08   504
all_corrs %>%
  gf_hline(
    yintercept = 0,
    color = "grey",
    linewidth = 2,
    title = "Correlations: Target Variable vs All Predictors",
    subtitle = "Boston Housing Dataset"
  ) %>%
  gf_errorbar(
    conf.high + conf.low ~ reorder(predictor, estimate),
    colour = ~estimate,
    width = 0.5,
    linewidth = ~ -log10(p.value),
    caption = "Significance = -log10(p.value)"
  ) %>%
  # Plot points(smallest geom) last!
  gf_point(estimate ~ reorder(predictor, estimate)) %>%
  gf_labs(x = "Predictors", y = "Correlation with cmedv") %>%
  # gf_theme(theme_minimal()) %>%

  # tilt the x-axis labels for readability
  gf_theme(theme(axis.text.x = element_text(angle = 45, hjust = 1))) %>%
  # Colour and linewidth scales + legends
  gf_refine(
    scale_colour_distiller("Correlation", type = "div", palette = "RdBu"),
    scale_linewidth_continuous("Significance",
      range = c(0.25, 3),

      # guide_legend(reverse = TRUE): Fat Lines mean higher significance
    )
  ) %>%
  gf_refine(guides(linewidth = guide_legend(reverse = TRUE)))

We can clearly see that rm and lstat have strong correlations with cmedv and should make good choices for setting up a minimal linear regression model. (medv is the older, uncorrected version of cmedv.)

Model Building

We will first execute the lm test with code and evaluate the results. Then we will do an intuitive walk-through of the process and finally hand-calculate the entire analysis for clear understanding.

  • Model Code
  • Forecasting with the Linear Model
  • Linear Model Intuitive
  • Linear Models Manually Demonstrated (Apologies to Spinoza)
  • Using Other Packages

R offers a very simple command lm to execute a Linear Model. Note the familiar formula interface for stating the variables: (y ~ x, where y = target and x = predictor).

housing_lm <- lm(cmedv ~ rm, data = housing)
summary(housing_lm)

Call:
lm(formula = cmedv ~ rm, data = housing)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.336  -2.425   0.093   2.918  39.434 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -34.6592     2.6421  -13.12   <2e-16 ***
rm            9.0997     0.4178   21.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.597 on 504 degrees of freedom
Multiple R-squared:  0.4848,    Adjusted R-squared:  0.4838 
F-statistic: 474.3 on 1 and 504 DF,  p-value: < 2.2e-16

The model for $\widehat{cmedv}$, the prediction for cmedv, can be written in the form y = mx + c, as:

(3)  $\widehat{cmedv} \sim -34.65924 + 9.09967 * rm$

Important
  • The effect size of rm on predicting cmedv is a (slope) value of 9.09967, which is significant at a p-value of < 2.2e−16; for every one-room increase in rm, the predicted median price cmedv increases by about USD 9,100 (cmedv is measured in units of USD 1000).
  • The F-statistic for the Linear Model is given by F=474.3, which is very high. (We will use the F-statistic again when we do Multiple Regression.)
  • The R-squared value is R2=0.48 which means that rm is able to explain about half of the trend in cmedv; there is substantial variation in cmedv that is still left to explain, an indication that we should perhaps use a richer model, with more predictors. These aspects are explored in the Tutorials.

We can plot the scatter plot of these two variables with the model also over-plotted.

#| layout-ncol: 3
#| fig-width: 5
#| fig-height: 4

# Set graph theme
theme_set(new = theme_custom())
#
# Tidy Data frame for the model using `broom`
housing_lm_tidy <-
  housing_lm %>%
  broom::tidy(
    conf.int = TRUE,
    conf.level = 0.95
  )
housing_lm_tidy
  term          estimate    std.error   statistic   p.value        conf.low     conf.high
  (Intercept)  -34.65924    2.6421358   -13.11789   4.992332e-34   -39.850200   -29.468287
  rm             9.09967    0.4178141    21.77923   1.307493e-74     8.278798     9.920542
(2 rows)
##
housing_lm_augment <-
  housing_lm %>%
  broom::augment(
    se_fit = TRUE,
    interval = "confidence"
  )
housing_lm_augment
(506 rows; columns include cmedv, rm, .fitted, .lower, .upper, .se.fit, .resid, .hat, .sigma; the first few rows are shown in the rendered page)
##
intercept <-
  housing_lm_tidy %>%
  filter(term == "(Intercept)") %>%
  select(estimate) %>%
  as.numeric()
##
slope <-
  housing_lm_tidy %>%
  filter(term == "rm") %>%
  select(estimate) %>%
  as.numeric()
##
housing %>%
  drop_na() %>%
  gf_point(
    cmedv ~ rm,
    title = "Price vs Average no. of Rooms",
    ylab = "Median Price",
    xlab = "Avg. No. of Rooms",
    alpha = 0.2
  ) %>%
  # Plot the model equation
  gf_abline(
    slope = slope, intercept = intercept,
    colour = "lightcoral",
    linewidth = 2
  ) %>%
  # Plot the model prediction points on the line
  gf_smooth(
    method = "lm", geom = "point",
    color = "grey30",
    size = 0.5
  ) %>%
  gf_refine(
    annotate(
      geom = "segment",
      y = 0, yend = 29, x = 7, xend = 7, # manually calculated
      linetype = "dashed",
      color = "dodgerblue",
      arrow = arrow(
        angle = 30,
        length = unit(0.25, "inches"),
        ends = "last",
        type = "closed"
      )
    ),
    annotate(
      geom = "segment",
      y = 29, yend = 29, x = 2.5, xend = 7, # manually calculated
      linetype = "dashed",
      arrow = arrow(
        angle = 30,
        length = unit(0.25, "inches"),
        ends = "first",
        type = "closed"
      ),
      color = "dodgerblue"
    )
  ) %>%
  gf_refine(
    scale_x_continuous(
      limits = c(2.5, 10),
      expand = c(0, 0)
    ),
    # removes plot panel margins
    scale_y_continuous(
      limits = c(0, 55),
      expand = c(0, 0)
    )
  ) %>%
  gf_theme(theme = theme_custom())

For any new value of rm, we go up to the vertical blue line and read off the predicted median price by following the horizontal blue line. That is how the model is used (by hand).
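For example, at rm = 7 the model in Equation 3 gives cmedv ≈ −34.65924 + 9.09967 × 7 ≈ 29.04, i.e. a median price of about USD 29,000, which is exactly where the dashed blue lines meet in the plot above:

# Reading the model off "by hand" at rm = 7
-34.65924 + 9.09967 * 7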

In practice, we use the broom package functions (tidy, glance and augment) to obtain a clear view of the model parameters and the predictions of cmedv for all existing values of rm. We see estimates for the intercept and slope (rm) of the linear model, along with the standard errors and p-values for these estimated parameters. And we see the fitted values of cmedv for the existing rm; these values naturally lie on the straight line depicting the model. We will examine this augmented data more in the section on Diagnostics.

To predict cmedv with new values of rm, we use predict. Let us now try to make predictions with some new data:

new <- tibble(rm = seq(3, 10)) # must be named "rm"
new %>% mutate(
  predictions =
    stats::predict(
      object = housing_lm,
      newdata = .,
      se.fit = FALSE
    )
)
  rm   predictions
   3     -7.360234
   4      1.739436
   5     10.839105
   6     19.938775
   7     29.038445
   8     38.138114
   9     47.237784
  10     56.337454
(8 rows)

Note that “negative values” for predicted cmedv would have no meaning!

All that is very well, but what is happening under the hood of the lm command? Consider the cmedv (target) variable and the rm feature/predictor variable. What we do is:

  1. Plot a scatter plot gf_point(cmedv ~ rm, housing)
  2. Find a line that, in some way, gives us some prediction of cmedv for any given rm
  3. Calculate the errors in prediction and use those to find the “best” line.
  4. Use that “best” line henceforth as a model for prediction.

How does one fit the "best" line? Consider a choice of "lines" that we can use to fit to the data. Here are 6 lines of varying slopes (and intercepts) that we can try as candidates for the best-fit line:

It should be apparent that while we cannot determine which line may be the best, the worst line seems to be the one in the final plot, which ignores the x-variable rm altogether. This corresponds to the NULL Hypothesis, that there is no relationship between the two variables. Any of the other lines could be a decent candidate, so how do we decide?

In Fig A, the horizontal blue line is the overall mean of cmedv, denoted as μtot. The vertical green lines to the points show the departures of each point from this overall mean, called residuals. The sum of squares of these residuals in Fig A is called the Total Sum of Squares (SST).

(4)  $SST = \Sigma(y - \mu_{tot})^2$

In Fig B, the vertical red lines are the residuals of each point from the potential line of fit. The sum of the squares of these residuals is called the Error Sum of Squares (SSE).

(5)  $SSE = \Sigma[(y - a - b * rm)^2]$

It should be apparent that if there is any linear relationship between cmedv and rm, then SSE < SST.
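As a quick sketch, both of these sums of squares can be computed directly from their definitions (the tutorial's own calculation below uses deviance() instead):

# SST: squared deviations of cmedv from its overall mean (the NULL model)
sum((housing$cmedv - mean(housing$cmedv))^2)
# SSE: squared residuals from the fitted line housing_lm
sum(residuals(housing_lm)^2)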

How do we get the optimum slope + intercept? If we plot the SSE as a function of varying slope, we get:

#| echo: false
# NOTE: `housing_sample` is not defined in the visible code; we assume a random
# sample of the housing data here (the sample size is a guess) so this chunk runs.
housing_sample <- housing %>% slice_sample(n = 100)
sim_model <- tibble(
  b = slope + seq(-5, 5),
  a = intercept,
  dat = list(tibble(
    cmedv = housing_sample$cmedv,
    rm = housing_sample$rm
  ))
) %>%
  # note: despite the name, `r_squared` here holds the SSE for each candidate slope
  mutate(r_squared = pmap_dbl(
    .l = list(a, b, dat),
    .f = \(a, b, dat) sum((dat$cmedv - (b * dat$rm + a))^2)
  ))
min_r_squared <- sim_model %>%
  select(r_squared) %>%
  min()
min_slope <- sim_model %>%
  filter(r_squared == min_r_squared) %>%
  select(b) %>%
  as.numeric()
sim_model %>%
  gf_point(r_squared ~ b, data = ., size = 2) %>%
  gf_line(ylab = "SSE", xlab = "slope", title = "Error vs Slope") %>%
  gf_hline(yintercept = min_r_squared, color = "red") %>%
  gf_segment(min_r_squared + 0 ~ min_slope + min_slope,
    colour = "red",
    arrow = arrow(ends = "last", length = unit(1, "mm"))
  ) %>%
  gf_refine(
    coord_cartesian(expand = FALSE),
    expand_limits(y = c(0, 20000), x = c(3.5, 15))
  )

We see that the SSE is a quadratic function of the slope, with a minimum at the optimum value; at all other slopes the SSE is higher. This is what the function lm uses to find the optimum slope.

Let us hand-calculate the numbers so we know what the test is doing. Here is the SST: we pretend that there is no relationship between cmedv and rm and compute a NULL model:

# Calculate overall sum squares SST

SST <- deviance(lm(cmedv ~ 1, data = housing))
SST
[1] 42577.74

And here is the SSE:

SSE <- deviance(housing_lm)
SSE
[1] 21934.39

Given that the model leaves unexplained variations in cmedv to the extent of SSE, we can compute the SSR, the Regression Sum of Squares, the amount of variation in cmedv that the linear model does explain:

SSR <- SST - SSE
SSR
[1] 20643.35

We have SST=42577.74, SSE=21934.39 and therefore SSR=20643.35.

In order to calculate the F-Statistic, we need to compute the variances, using these sum of squares. We obtain variances by dividing by their Degrees of Freedom:

$F_{stat} = \frac{SSR / df_{SSR}}{SSE / df_{SSE}}$

where $df_{SSR}$ and $df_{SSE}$ are, respectively, the degrees of freedom of SSR and SSE.

Let us calculate these Degrees of Freedom. If we have n= 506 observations of data, then:

  • SST clearly has degree of freedom n−1=505, since it uses all observations but loses one degree to calculate the global mean.
  • SSE was computed using the slope and intercept, so it has (n−2)=504 as degrees of freedom.
  • And therefore SSR being their difference has just 1 degree of freedom.

Now we are ready to compute the F-statistic:

n <- housing %>%
  count() %>%
  as.numeric()
df_SSR <- 1
df_SSE <- n - 2
F_stat <- (SSR / df_SSR) / (SSE / df_SSE)
F_stat
[1] 474.3349

The F_stat is compared with a critical value of the F-statistic, which is computed from the F-distribution in R. As with our other hypothesis tests, we use a significance level of 0.05 (hence the 0.95 quantile), and supply the two relevant degrees of freedom as parameters to qf(), which computes the critical F value as a quantile:

F_crit <- qf(
  p = 0.95, # Significance level is 5%
  df1 = df_SSR, # Numerator degrees of freedom
  df2 = df_SSE
) # Denominator degrees of freedom
F_crit
[1] 3.859975
F_stat
[1] 474.3349

The F_crit value can also be seen in a plot2:

mosaic::pdist(
  dist = "f",
  q = F_crit,
  df1 = df_SSR, df2 = df_SSE
)

[1] 0.95

Any value of F more than the Fcrit occurs with smaller probability than 0.05. Our F_stat is much higher than Fcrit, by orders of magnitude! And so we can say with confidence that rm has a significant effect on cmedv.

The value of R.squared is also calculated from the previously computed sums of squares:

(6)  $R.squared = \frac{SSR}{SST} = \frac{SST - SSE}{SST}$

r_squared <- (SST - SSE) / SST
r_squared
[1] 0.484839
# Also computable by
# mosaic::rsquared(housing_lm)

So R.squared = 0.484839

The values of the Slope and Intercept are computed using a maximum-likelihood derivation and the knowledge that the mean square error is a minimum at the optimum slope. For a linear model y ∼ mx + c:

$slope = \frac{\Sigma[(y - y_{mean}) * (x - x_{mean})]}{\Sigma(x - x_{mean})^2}$

Tip

Note that the slope is equal to the ratio of the covariance of x and y to the variance of x.

and

$Intercept = y_{mean} - slope * x_{mean}$

slope <- mosaic::cov(cmedv ~ rm, data = housing) / mosaic::var(~rm, data = housing)
slope
[1] 9.09967
##
intercept <- mosaic::mean(~cmedv, data = housing) - slope * mosaic::mean(~rm, data = housing)
intercept
[1] -34.65924

So, there we are! All of this is done for us by one simple function, lm()!

There is a very neat package called ggstatsplot3 that allows us to plot very comprehensive statistical graphs. Let us quickly do this:

library(ggstatsplot)
housing_lm %>%
  ggstatsplot::ggcoefstats(
    title = "Linear Model for Boston Housing",
    subtitle = "Using ggstatsplot"
  )

This chart shows the estimates for the intercept and rm along with their error bars, the t-statistic, degrees of freedom, and the p-value.

We can also obtain crisp-looking model tables from the new supernova package 4, which is based on the methods discussed in Judd et al.

library(supernova)
supernova::supernova(housing_lm)
 Analysis of Variance Table (Type III SS)
 Model: cmedv ~ rm

                                SS  df        MS       F   PRE     p
 ----- --------------- | --------- --- --------- ------- ----- -----
 Model (error reduced) | 20643.347   1 20643.347 474.335 .4848 .0000
 Error (from model)    | 21934.392 504    43.521                    
 ----- --------------- | --------- --- --------- ------- ----- -----
 Total (empty model)   | 42577.739 505    84.312                    

This table is very neat in that it gives the Sums of Squares for both the NULL (empty) model and the current model, for comparison. The PRE entry is the Proportional Reduction in Error, a measure that is identical to r.squared, and which shows how much the model reduces the error compared to the NULL model (48%). The PRE idea is nicely discussed in Judd et al., Section 10.

Workflow: Model Checking and Diagnostics

We will follow much of the treatment on Linear Model diagnostics, given here on the STHDA website.

A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R.squared, which tells us how well the linear regression model fits the data.
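These headline numbers are conveniently available in one row via broom::glance() (r.squared, the F-statistic and its p-value, the residual sigma, and so on):

# One-row summary of overall fit statistics for the model
housing_lm %>% broom::glance()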

For example, the linear regression model makes the assumption that the relationship between the predictors (x) and the outcome variable is linear. This might not be true. The relationship could be polynomial or logarithmic.

Additionally, the data might contain some influential observations, such as outliers (or extreme values), that can affect the result of the regression.

Therefore, the regression model must be diagnosed closely in order to detect potential problems and to check whether the assumptions made by the linear regression model are met. To do so, we generally examine the distribution of residual errors, which can tell us more about our data.

Workflow: Checks for Uncertainty

Let us first look at the uncertainties in the estimates of slope and intercept. These are most easily read off from the broom::tidy-ed model:

# housing_lm_tidy <-  housing_lm %>% broom::tidy()
housing_lm_tidy
  term          estimate    std.error   statistic   p.value        conf.low     conf.high
  (Intercept)  -34.65924    2.6421358   -13.11789   4.992332e-34   -39.850200   -29.468287
  rm             9.09967    0.4178141    21.77923   1.307493e-74     8.278798     9.920542
(2 rows)

Plotting this is simple too:

# Set graph theme
theme_set(new = theme_custom())
#
housing_lm_tidy %>%
  gf_col(estimate ~ term, fill = ~term, width = 0.25) %>%
  gf_hline(yintercept = 0) %>%
  gf_errorbar(conf.low + conf.high ~ term,
    width = 0.1,
    title = "Model Bar Plot for Estimates with Confidence Intervals"
  ) %>%
  gf_theme(theme = theme_custom())
##
housing_lm_tidy %>%
  gf_pointrange(estimate + conf.low + conf.high ~ term,
    title = "Model Point-Range Plot for Estimates with Confidence Intervals"
  ) %>%
  gf_hline(yintercept = 0) %>%
  gf_theme(theme = theme_custom())

The point-range plot helps to avoid what has been called “within-the-bar bias”. The estimate is just a value, which we might plot as a bar or as a point, with uncertainty error-bars.

Values within the bar are not more likely!! This is the bias that the point-range plot avoids.

Checks for Constant Variance/Heteroscedasticity

Linear Modelling makes four fundamental assumptions ("LINE"):

  1. Linear relationship between y and x
  2. Observations are independent.
  3. Residuals are normally distributed
  4. Variance of the y variable is equal at all values of x.

We can check these using diagnostic graphs: here we plot the residuals against the fitted values and against the target variable, and check whether there is any gross variation in their spread; we also check whether the residuals are (roughly) normally distributed.

housing_lm_augment %>%
  gf_point(.resid ~ .fitted, title = "Residuals vs Fitted") %>%
  gf_smooth(method = "loess")
housing_lm_augment %>%
  gf_hline(yintercept = 0, colour = "grey", linewidth = 2) %>%
  gf_point(.resid ~ cmedv, title = "Residuals vs Target Variable")
housing_lm_augment %>%
  gf_dhistogram(~.resid, title = "Histogram of Residuals") %>%
  gf_fitdistr()
housing_lm_augment %>%
  gf_qq(~.resid, title = "Q-Q Residuals") %>%
  gf_qqline()

The Q-Q plot of residuals also has significant deviations from the normal quantiles. The residuals are not quite "like the night sky", i.e. not random enough. These point to the need for a richer model, with more predictors. The "trend line" of residuals vs fitted values shows a U-shaped pattern, indicating significant nonlinearity: there is a curved relationship in the graph. The solution can be a nonlinear transformation of the predictor variables, such as √X, log(X), or even X². For instance, we might try a model for cmedv using rm² instead of just rm as we have done. This will still be a linear model!

Tip

Base R has a crisp command to plot these diagnostic graphs. But we will continue to use ggformula.

plot(housing_lm)

Tip

One of the ggplot extension packages named lindia also has a crisp command to plot these diagnostic graphs.

# Set graph theme
theme_set(new = theme_custom())
#
library(lindia)
gg_diagnose(housing_lm,
  mode = "base_r", # plots like those with base-r
  theme = theme(
    axis.title = element_text(size = 6, face = "bold"),
    title = element_text(size = 8)
  )
)

The r-squared for a model that uses rm² as the predictor (in R, lm(cmedv ~ I(rm^2))) shows some improvement:

[1] 0.5501221
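The code behind this number is not shown; presumably it is along the following lines. Note the I() wrapper, needed so that ^ is treated as arithmetic squaring rather than as formula crossing:

# Fit on the squared predictor; I() protects arithmetic inside an R formula
housing_lm_sq <- lm(cmedv ~ I(rm^2), data = housing)
mosaic::rsquared(housing_lm_sq)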

Extras

Note: Multiple Regression

It is also possible that there is more than one explanatory variable: this is multiple regression.

(7)  $y = \beta_0 + \beta_1 * x_1 + \beta_2 * x_2 + ... + \beta_n * x_n$

where each of the βi are slopes defining the relationship between y and xi. Note that this is a vector dot-product, or inner-product, taken with a vector of input variables xi and a vector of weights, βi. Together, the RHS of that equation defines an n-dimensional hyperplane. The model is linear in the parameters βi, e.g. these are OK:

$y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon_i$
$y_i = \beta_0 + \gamma_1 \delta_1 x_1 + \exp(\beta_2) x_2 + \epsilon_i$

but not, for example, these:

$y_i = \beta_0 + \beta_1 x_1^{\beta_2} + \epsilon_i$
$y_i = \beta_0 + \exp(\beta_1 x_1) + \epsilon_i$

There are three ways5 to include more predictors:

  • Backward Selection: We would typically start with a maximal model6 and progressively simplify the model by knocking off predictors that have the least impact on model accuracy.
  • Forward Selection: Start with no predictors and systematically add them one by one to increase the quality of the model
  • Mixed Selection: Wherein we start with no predictors and add them to gain improvement, or remove them as their significance changes when other predictors are added.

The first two are covered in the other tutorials above; Mixed Selection we will leave for a more advanced course. But for now, we use just one predictor, rm (Avg. no. of Rooms), to model housing prices.

Conclusions

We have seen how, starting from a basic EDA of the data, we have been able to choose a single Quantitative predictor variable to model a Quantitative target variable, using Linear Regression. As stated earlier, we may wish to use more than one predictor variable, to build more sophisticated models with improved prediction capability. And there is more than one way of selecting these predictor variables, which we will examine in the Tutorials.
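As a pointer to what such a model looks like (the systematic selection of predictors is left to the Tutorials), a minimal, purely illustrative two-predictor fit using the two strongest correlates from our EDA might be:

# A (hypothetical, minimal) multiple regression with two predictors
housing_lm_multi <- lm(cmedv ~ rm + lstat, data = housing)
broom::glance(housing_lm_multi) %>%
  dplyr::select(r.squared, adj.r.squared, statistic, p.value)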

Secondly, sometimes it may be necessary to mathematically transform the variables in the dataset to enable the construction of better models, something that was not needed here.

We may also encounter cases where the predictor variables seem to work together; one predictor may influence “how well” another predictor works, something called an interaction effect or a synergy effect. We might then have to modify our formula to include interaction terms that look like predictor1×predictor2.
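In R's formula syntax such a term is written with * (main effects plus interaction) or : (interaction only); a purely illustrative sketch:

# Main effects of rm and lstat plus their interaction term rm:lstat
housing_lm_interact <- lm(cmedv ~ rm * lstat, data = housing)
broom::tidy(housing_lm_interact)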

So our Linear Modelling workflow might look like this: we have not seen all stages yet, but that is for another course module or tutorial!

Our Linear Regression Workflow (flowchart): Data → EDA / Check Relationships (inspect, glimpse, skim; corrplot, corrgram, ggformula + purrr, cor.test) → Build Model (simple or complex model decision: is the model possible?) → Check Model Diagnostics and R² → if inadequate (1): Transform Variables → if still inadequate (2): try Multiple Regression and/or interaction effects → all good: Interpret Model → Apply Model.

References

  1. https://mlu-explain.github.io/linear-regression/
  2. The Boston Housing Dataset, corrected version. StatLib @ CMU, lib.stat.cmu.edu/datasets/boston_corrected.txt
  3. https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
  4. Andrew Gelman, Jennifer Hill, Aki Vehtari. Regression and Other Stories, Cambridge University Press, 2023. Available Online
  5. Michael Crawley. (2013). The R Book, second edition. Chapter 11.
  6. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Introduction to Statistical Learning, Springer, 2021. Chapter 3. https://www.statlearning.com/
  7. David C Howell, Permutation Tests for Factorial ANOVA Designs
  8. Marti Anderson, Permutation tests for univariate or multivariate analysis of variance and regression
  9. http://r-statistics.co/Assumptions-of-Linear-Regression.html
  10. Judd, Charles M., Gary H. McClelland, and Carey S. Ryan. 2017. “Introduction to Data Analysis.” In, 1–9. Routledge. https://doi.org/10.4324/9781315744131-1. Also see http://www.dataanalysisbook.com/index.html
  11. Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach. Journal of Open Source Software, 6(61), 3167. https://doi.org/10.21105/joss.03167
R Package Citations
Package Version Citation
broom 1.0.8 Robinson, Hayes, and Couch (2025)
corrgram 1.14 Wright (2021)
corrplot 0.95 Wei and Simko (2024)
geomtextpath 0.1.5 Cameron and van den Brand (2025)
GGally 2.2.1 Schloerke et al. (2024)
ggstatsplot 0.13.1 Patil (2021)
ISLR 1.4 James et al. (2021)
janitor 2.2.1 Firke (2024)
lindia 0.10 Lee and Ventura (2023)
reghelper 1.1.2 Hughes and Beiner (2023)
supernova 3.0.0 Blake et al. (2024)
Blake, Adam, Jeff Chrabaszcz, Ji Son, and Jim Stigler. 2024. supernova: Judd, McClelland, & Ryan Formatting for ANOVA Output. https://doi.org/10.32614/CRAN.package.supernova.
Cameron, Allan, and Teun van den Brand. 2025. geomtextpath: Curved Text in “ggplot2”. https://doi.org/10.32614/CRAN.package.geomtextpath.
Firke, Sam. 2024. janitor: Simple Tools for Examining and Cleaning Dirty Data. https://doi.org/10.32614/CRAN.package.janitor.
Hughes, Jeffrey, and David Beiner. 2023. reghelper: Helper Functions for Regression Analysis. https://doi.org/10.32614/CRAN.package.reghelper.
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in r. https://doi.org/10.32614/CRAN.package.ISLR.
Lee, Yeuk Yu, and Samuel Ventura. 2023. lindia: Automated Linear Regression Diagnostic. https://doi.org/10.32614/CRAN.package.lindia.
Patil, Indrajeet. 2021. “Visualizations with statistical details: The ‘ggstatsplot’ approach.” Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.
Robinson, David, Alex Hayes, and Simon Couch. 2025. broom: Convert Statistical Objects into Tidy Tibbles. https://doi.org/10.32614/CRAN.package.broom.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2024. GGally: Extension to “ggplot2”. https://doi.org/10.32614/CRAN.package.GGally.
Wei, Taiyun, and Viliam Simko. 2024. R Package “corrplot”: Visualization of a Correlation Matrix. https://github.com/taiyun/corrplot.
Wright, Kevin. 2021. corrgram: Plot a Correlogram. https://doi.org/10.32614/CRAN.package.corrgram.

Footnotes

  1. The model is linear in the parameters βi, e.g. We can have this:↩︎

  2. Michael Crawley, The R Book, Third Edition 2023. Chapter 9. Statistical Modelling↩︎

  3. https://indrajeetpatil.github.io/ggstatsplot/reference/ggcoefstats.html↩︎

  4. https://github.com/UCLATALL/supernova↩︎

  5. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Introduction to Statistical Learning, Springer, 2021. Chapter 3. Linear Regression. Available Online↩︎

  6. Michael Crawley, The R Book, Third Edition 2023. Chapter 9. Statistical Modelling↩︎

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {Modelling with {Linear} {Regression}},
  date = {2023-04-13},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/},
  langid = {en},
  abstract = {Predicting Quantitative Target Variables}
}
For attribution, please cite this work as:
V., Arvind. 2023. “Modelling with Linear Regression.” April 13, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LinReg/.

License: CC BY-SA 2.0
