Modelling with Logistic Regression

Logistic Regression
Qualitative Variable
Probability
Odds
Log Transformation
Author

Arvind V

Published

April 13, 2023

Modified

June 26, 2024

Abstract
Predicting Qualitative Target Variables

Setting up R Packages

library(tidyverse)
library(ggformula)
library(mosaic)
library(infer)
library(regressinator) # pedagogic tool for GLMs
library(GLMsData) # Datasets from Dunn and Smyth
library(HSAUR3) # Datasets from Everitt and Hothorn
library(prettyglm) # create beautiful coefficient summaries of generalised linear models.
# remotes::install_github("UCLATALL/JMRData")
# library(JMRData)

Introduction

Sometimes the dependent variable is an either/or categorization. For example, the variable we want to predict might be: won or lost the contest, has an ailment or not, voted or not in the last election, or graduated from college or not. There might even be more than two categories, such as voted for Congress, BJP, or Independent; or never smoker, former smoker, or current smoker.

We saw with the General Linear Model that it models the mean of a target Quantitative variable as a linear weighted sum of the predictor variables:

\[ y \sim N(x_i^T * \beta, ~~\sigma^2) \]

This model is considered to be general because of its dependence on potentially more than one explanatory variable, vs. the simple linear model:1 \(y = \beta_0 + \beta_1*x_1 + \epsilon\). The general linear model gives us model “shapes” that range from a simple straight line to a p-dimensional hyperplane.

Although a very useful framework, there are some situations where general linear models are not appropriate:

  • the range of Y is restricted (e.g. binary, count)
  • the variance of Y depends on the mean (Taylor’s Law)2

How do we use the familiar linear model framework when the target/dependent variable is Categorical?

Linear Models for Categorical Targets?

Recall that we spoke of dummy predictor variables for our linear models, and how we would dummy-code them using numerical values such as 0 and 1, or +1 and -1. Could we try the same for a target categorical variable?

\[ \begin{align} Y_i &= \beta_0 + \beta_1 * X_i + \epsilon_i \nonumber \\ \text{where}~~ Y_i &= 0 ~~\text{if “No”} \nonumber \\ Y_i &= 1 ~~\text{if “Yes”} \nonumber \end{align} \]

Sadly, this approach does not work for a categorical dependent variable using a simple linear model as before. Consider the Credit Card Default data from the package ISLR.

Rows: 10,000
Columns: 4
$ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
$ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
$ income  <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…

We see balance and income are quantitative predictors; student is a qualitative predictor, and default is a qualitative target variable. If we naively use a linear model equation as model = lm(default ~ balance, data = Default) and plot it, then…
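Here is a minimal sketch of this naive fit (default_num is a hypothetical dummy-coded column; assumes the ISLR package is installed):

data("Default", package = "ISLR")
Default <- Default %>%
  mutate(default_num = if_else(default == "Yes", 1, 0)) # dummy-code the target
model <- lm(default_num ~ balance, data = Default)      # the naive linear model
gf_point(default_num ~ balance, data = Default, alpha = 0.1) %>%
  gf_lm(color = "red") # the straight-line fit strays outside [0, 1]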

Figure: Naive Linear Model

…it is clear from the figure above that something is very odd. (No pun intended! See below!) If the only possible values for default are \(No = 0\) and \(Yes = 1\), how could we interpret a predicted value of, say, \(Y_i = 0.25\), \(Y_i = 1.55\), or perhaps \(Y_i = -0.22\)? Anything other than Yes/No is hard to interpret!

Problems…and Solutions

Where do we go from here?

Let us state what we might desire of our model:

  1. Model Equation: Despite this setback, we would still like our model to be as close as possible to the familiar linear model equation.
  2. Predictors and Weights: We have quantitative predictors, so we still want to use a linear-weighted sum for the RHS (i.e. the predictor side) of the model equation.

What can we try to make this work, especially for the LHS (i.e. the target side)?

  1. Making the LHS continuous: What can we try? In dummy-encoding our target variable, we used the values \(0\) and \(1\), the endpoints of the interval \([0,1]\), which is exactly the range of a probability! Could we use the probability of the outcome as our target, even though we are interested in binary outcomes? This would give us a continuous target with range \([0,1]\).
Note

Binomially distributed target variable

If we map our Categorical/Qualitative target variable into a Quantitative probability, we need immediately to look at the LINE assumptions in linear regression.

In linear regression, we assume a normally distributed target variable, i.e. the errors around the predicted value are normally distributed. With a categorical target variable with two levels \(0\) and \(1\), it is impossible for the errors \(e_i = Y_i - \hat{Y_i}\) to have a normal distribution, as assumed for the statistical tests to be valid: the errors are bounded within \([-1, 1]\)! One candidate distribution for such a two-valued target is the binomial distribution, whose mean and variance are \(np\) and \(np(1-p)\) respectively (\(p\) and \(p(1-p)\) for a single binary trial).

Note immediately that the binomial variance moves with the mean!

The LINE assumption of normality is clearly violated. Further, extreme probabilities (near 1 or 0) are more stable (i.e., have less error variance) than middle probabilities, as the sketch below shows. So the model has “built-in” heteroscedasticity, which we need to counter with transformations such as the \(log()\) function. More on this later.
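A quick way to visualize this variance pattern is to plot \(p(1-p)\) over \([0,1]\); a minimal sketch using ggformula:

# Bernoulli variance p(1-p): largest at p = 0.5, shrinking to 0 at the extremes
gf_function(fun = function(p) p * (1 - p), xlim = c(0, 1)) %>%
  gf_labs(x = "Probability p", y = "Variance p(1-p)")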

  2. Odds?: How would one “extend” the range of a target variable from \([0,1]\) to \([-\infty, \infty]\)? One step would be to try the odds of the outcome, instead of trying to predict the outcomes directly (Yes or No), or their probabilities \([0,1]\).
Odds

The odds of an event with probability p of occurrence are defined as \(Odds = p/(1-p)\). As can be seen, the odds are the ratio of two probabilities: that of the event and that of its complement. In the Default dataset just considered, the odds of non-default and the odds of default can be calculated as:

\[ \begin{align} Odds_{NoDefault} &= p(NoDefault)/(1-p(NoDefault)) \nonumber \\ &= 0.9667/(1-0.9667) \nonumber \\ &= 29.0303 \nonumber \end{align} \]

and \(Odds_{Default} = 1/29.0303 = 0.0344\).

Now, odds cover half of the real number line, i.e. \([0, \infty]\)! Clearly, when the probability p of an event is \(0\), the odds are \(0\)…and as it nears \(1\), the odds tend to \(\infty\). So we have transformed a simple probability that lies between \([0,1]\) to odds lying between \([0, \infty]\). That’s one step towards making a linear model possible; we have “removed” one of the limits on our linear model’s prediction range by using Odds as our target variable.
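As a quick check in R (a sketch, assuming the Default data is loaded as above):

p_no_default <- prop(~ default == "No", data = Default) # mosaic::prop()
odds_no_default <- p_no_default / (1 - p_no_default)    # approx. 29.03
odds_default <- 1 / odds_no_default                     # approx. 0.0344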

  3. Transformation using log()?: We need one more leap of faith: how do we convert a \([0, \infty]\) range to \([-\infty, \infty]\)? Can we try a log transformation?

\[ log([0, \infty]) ~ = ~ [-\infty, \infty] \] This extends the range of our Qualitative target to the same as with a Quantitative target!

There is an additional benefit of this log() transformation, concerning the error distributions with Odds as targets. See the plot below. Odds are a necessarily nonlinear function of probability; the slope of Odds ~ probability itself depends upon the probability, as we saw with the probability curve earlier.

Figure 1: Odds Plot. (a) Odds; (b) Log Odds.

To understand this issue intuitively, consider what happens with, say, a 20% change in the odds near 1.0 compared to the same change at more extreme odds. If the odds are \(1.0\), then the probabilities p and 1-p are \(0.5\) and \(0.5\). A 20% increase in the odds, to \(1.20\), corresponds to probabilities of \(0.545\) and \(0.455\). However, if the original probabilities were \(0.9\) and \(0.1\), i.e. odds of \(9\), then a 20% increase to \(10.8\) corresponds to probabilities of \(0.915\) and \(0.085\), a much smaller change in the probabilities. The log transformation provides a more linear relationship, which is what we desire.
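We can verify this arithmetic quickly (odds_to_p is a hypothetical helper):

odds_to_p <- function(odds) odds / (1 + odds) # p = odds / (1 + odds)
odds_to_p(c(1.0, 1.2)) # 0.500 -> 0.545: a sizeable change in p
odds_to_p(c(9, 10.8))  # 0.900 -> 0.915: a much smaller change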

So in our model, instead of modeling odds as the dependent variable, we will use \(log(odds)\), also known as the logit, defined as:

\[ \begin{align} log(odds_i) &= log\left(\frac{p_i}{1-p_i}\right) \nonumber \\ &= logit(p_i) \nonumber \end{align} \]
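Base R already implements the logit and its inverse as qlogis() and plogis():

p <- c(0.01, 0.5, 0.99)
qlogis(p)         # logit(p) = log(p/(1-p)): spans the whole real line
plogis(qlogis(p)) # the inverse logit recovers p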

  4. Estimation of Model Parameters: The last problem to solve is that, because we have made so many transformations to get to the logits that we want to model, the logic of minimizing the sum of squared errors (SSE) is no longer appropriate.
Note

The observed probabilities for default are exactly \(0\) and \(1\)…and the log(odds) will map these respectively to \(-\infty\) and \(\infty\). So if we naively try to take residuals, we will find that they are all infinite!! Hence \(SSE\) cannot be computed and we need another way to assess the quality of our model.

Instead, we will have to use maximum likelihood estimation (MLE) to estimate the model parameters, and we will use the \(\chi^2\) (“chi-squared”) statistic instead of \(t\) and \(F\) to evaluate the model comparisons. The maximum likelihood method maximizes the probability of obtaining the data at hand over every choice of model parameters \(\beta_i\).

This is our Logistic Regression Model, which uses a Quantitative predictor variable to predict a Categorical target variable. We write the model (for the Default dataset) as:

\[ \begin{align} logit(default) &= \beta_0 + \beta_1 * balance \nonumber \\ log\left(\frac{p(default)}{1-p(default)}\right) &= \beta_0 + \beta_1 * balance \nonumber \\ \therefore ~ p(default) &= \frac{exp(\beta_0 + \beta_1 * balance)}{1 + exp(\beta_0 + \beta_1 * balance)} \nonumber \end{align} \]

From the first equation above, it should be clear that a unit increase in balance increases the log-odds of default by \(\beta_1\) units, i.e. it multiplies the odds of default by \(e^{\beta_1}\).

The RHS of the last equation is a sigmoid function of the weighted sum of predictors and is limited to the range \([0,1]\). The parameters \(\beta_i\) need to be estimated using maximum likelihood methods.
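In R, this model is fit with glm(); a minimal sketch (again assuming the Default data from ISLR):

model_logit <- glm(default ~ balance,
                   data = Default, family = binomial(link = "logit"))
summary(model_logit)               # coefficients on the log-odds scale
exp(coef(model_logit))             # multiplicative effect on the odds
anova(model_logit, test = "Chisq") # chi-squared model comparison, not t/F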

Figure 2: Model Plots. (a) Naive linear regression model; (b) logistic regression model; (c) log odds gives linear models.

Logistic Regression Models as Hypothesis Tests

To Be Written Up.

Generalized Linear Model

Important

A generalized linear model is made up of a linear predictor:

\[ \eta_i = \beta_0 + \beta_1x_{1i} + ... + \beta_px_{pi} \]

and two functions:

  • a link function that describes how the mean, \(E(Y_i) = \mu_i\), depends on the linear predictor:

\[ g(\mu_i) = \eta_i \]

  • a variance function that describes how the variance, \(var(Y_i)\) depends on the mean:

\[ var(Y_i) = \Phi*V(\mu_i) \]

where the dispersion parameter \(\Phi\) is a constant.

For example we can obtain our general linear model with the following choice:

\[ \begin{align} & g(\mu_i) = \mu_i \nonumber \\ & V(\mu_i) = 1, ~~ \Phi = \sigma^2 \nonumber \end{align} \]

If now we assume that the target variable \(Y_i\) is binomially distributed, i.e. a count of successes over \(n_i\) two-valued (binary) trials:

\[ \begin{align} & Y_i \sim Binom(n_i, p_i) \nonumber \\ & mean(Y_i) = n_i p_i \nonumber \\ & var(Y_i) = n_i p_i (1-p_i) \nonumber \end{align} \]

Now, we wish to model the proportions \(Y_i/n_i\) as our target. Then we can state that:

\[ \begin{align} mean(Y_i/n_i) = p_i := \mu_i\\ var(Y_i/n_i) = var(Y_i)/n_i^2 = \frac{p_i(1-p_i)}{n_i} := \sigma_i^2\\ \end{align} \]

Inspecting the above, we can write:

\[ \sigma_i^2 = \frac{\mu_i(1-\mu_i)}{n_i} \]

and since the link function needs to map \([0,1]\) to \([-\infty, \infty]\), we use the logit function:

\[ g(\mu_i) = logit(\mu_i) = log(\frac{\mu_i}{1-\mu_i}) \]
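R’s binomial() family object bundles exactly these pieces, as a quick inspection shows:

fam <- binomial(link = "logit")
fam$linkfun(0.5)  # link g(mu) = logit(mu): 0 at mu = 0.5
fam$linkinv(0)    # inverse link maps 0 back to mu = 0.5
fam$variance(0.5) # variance function V(mu) = mu(1-mu) = 0.25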

Workflow: Read the Data

Let us now read in the data and check for these assumptions as part of our Workflow.

Research Question

To Be Written Up.

data("earinf", package = "GLMsData")
inspect(earinf)

categorical variables:  
  name  class levels   n missing                                  distribution
1 Swim factor      2 287       0 Occas (50.2%), Freq (49.8%)                  
2  Loc factor      2 287       0 Beach (51.2%), NonBeach (48.8%)              
3  Age factor      3 287       0 15-19 (48.8%), 20-24 (27.5%) ...             
4  Sex factor      2 287       0 Male (65.5%), Female (34.5%)                 

quantitative variables:  
      name   class min Q1 median Q3 max      mean        sd   n missing
1 NumInfec integer   0  0      0  2  17 1.3867596 2.3385412 287       0
2    Infec integer   0  0      0  1   1 0.4738676 0.5001888 287       0
glimpse(earinf)
Rows: 287
Columns: 6
$ Swim     <fct> Occas, Occas, Occas, Occas, Occas, Occas, Occas, Occas, Occas…
$ Loc      <fct> NonBeach, NonBeach, NonBeach, NonBeach, NonBeach, NonBeach, N…
$ Age      <fct> 15-19, 15-19, 15-19, 15-19, 15-19, 15-19, 15-19, 15-19, 15-19…
$ Sex      <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, M…
$ NumInfec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2…
$ Infec    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
skimr::skim(earinf)
Data summary
Name earinf
Number of rows 287
Number of columns 6
_______________________
Column type frequency:
factor 4
numeric 2
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Swim 0 1 FALSE 2 Occ: 144, Fre: 143
Loc 0 1 FALSE 2 Bea: 147, Non: 140
Age 0 1 FALSE 3 15-: 140, 20-: 79, 25-: 68
Sex 0 1 FALSE 2 Mal: 188, Fem: 99

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
NumInfec 0 1 1.39 2.34 0 0 0 2 17 ▇▁▁▁▁
Infec 0 1 0.47 0.50 0 0 0 1 1 ▇▁▁▁▇

Workflow: EDA

To Be Written Up.

Workflow: Model Building
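To Be Written Up. Meanwhile, a minimal sketch of the model we will eventually build, treating the binary Infec indicator as the target:

ear_model <- glm(Infec ~ Swim + Loc + Age + Sex,
                 data = earinf, family = binomial(link = "logit"))
summary(ear_model) # log-odds coefficients for each predictor
# prettyglm::pretty_coefficients(ear_model) # prettier coefficient summary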

Workflow: Model Checking and Diagnostics

Checks for Uncertainty

To Be Written Up.

Conclusions

So our Linear Modelling workflow might look like this: we have not seen all stages yet, but that is for another course module or tutorial!

flowchart TD
    A[(A: Data)] -->|mosaic  +  ggformula|B[B:EDA] 
    B --> |corrplot +  corrgram  + ggformula + purrr + cor.test| C(C: Check Relationships)
    C --> D[D: Decide on Simple/Complex Model]
    D --> E{E: Is the Model Possible?}
    E --> |Yes| G[G: Build Model]
    E -->|Nope| F[F: Transform Variables]
    E -->|Nope| K[K: Try Multiple Regression <br> and/or Interaction Terms]
    K --> D
    F --> D
    G --> H{H: Check Model Diagnostics}
    H --> |Problems| D
    H --> |All   good| I(Interpret Your Model)
    I --> J(((Apply the Model for Predictions)))
    

References

  1. Judd, Charles M. & McClelland, Gary H. & Ryan, Carey S. Data Analysis: A Model Comparison Approach to Regression, ANOVA, and Beyond. Routledge, Aug 2017. Chapter 14.

  2. https://yury-zablotski.netlify.app/post/how-logistic-regression-works/

  3. https://uc-r.github.io/logistic_regression

  4. https://francisbach.com/self-concordant-analysis-for-logistic-regression/

  5. https://statmath.wu.ac.at/courses/heather_turner/glmCourse_001.pdf

  6. https://jasp-stats.org/2022/06/30/generalized-linear-models-glm-in-jasp/

  7. P. Bingham, N.Q. Verlander, M.J. Cheal (2004). John Snow, William Farr and the 1849 outbreak of cholera that affected London: a reworking of the data highlights the importance of the water supply. Public Health, Volume 118, Issue 6, September 2004, Pages 387-394.

  8. https://peopleanalytics-regression-book.org/bin-log-reg.html

R Package Citations
Package Version Citation
ggtext 0.1.2 Wilke and Wiernik (2022)
GLMsData 1.4 Dunn and Smyth (2022)
HSAUR3 1.0.14 Hothorn and Everitt (2023)
prettyglm 1.0.1 Fowler (2023)
regressinator 0.1.3 Reinhart (2024)
Dunn, Peter K., and Gordon K. Smyth. 2022. GLMsData: Generalized Linear Model Data Sets. https://CRAN.R-project.org/package=GLMsData.
Fowler, Jared. 2023. prettyglm: Pretty Summaries of Generalized Linear Model Coefficients. https://CRAN.R-project.org/package=prettyglm.
Hothorn, Torsten, and Brian S. Everitt. 2023. HSAUR3: A Handbook of Statistical Analyses Using r (3rd Edition). https://CRAN.R-project.org/package=HSAUR3.
Reinhart, Alex. 2024. regressinator: Simulate and Diagnose (Generalized) Linear Models. https://CRAN.R-project.org/package=regressinator.
Wilke, Claus O., and Brenton M. Wiernik. 2022. ggtext: Improved Text Rendering Support for ggplot2. https://CRAN.R-project.org/package=ggtext.

Footnotes

  1. https://statmath.wu.ac.at/courses/heather_turner/glmCourse_001.pdf↩︎

  2. https://en.wikipedia.org/wiki/Taylor%27s_law↩︎

Citation

BibTeX citation:
@online{v2023,
  author = {V, Arvind},
  title = {Modelling with {Logistic} {Regression}},
  date = {2023-04-13},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LogReg/},
  langid = {en},
  abstract = {Predicting Qualitative Target Variables}
}
For attribution, please cite this work as:
V, Arvind. 2023. “Modelling with Logistic Regression.” April 13, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LogReg/.