Modelling with Logistic Regression

Logistic Regression
Qualitative Variable
Probability
Odds
Log Transformation
Author

Arvind V.

Published

April 13, 2023

Modified

May 25, 2025

Abstract
Predicting Qualitative Target Variables

Setting up R Packages

library(tidyverse)
library(ggformula)
library(mosaic)
library(skimr)
library(GGally)
library(infer)

Plot Theme

Show the Code
# https://stackoverflow.com/questions/74491138/ggplot-custom-fonts-not-working-in-quarto

# Chunk options
knitr::opts_chunk$set(
  fig.width = 7,
  fig.asp = 0.618, # Golden Ratio
  # out.width = "80%",
  fig.align = "center"
)
### Ggplot Theme
### https://rpubs.com/mclaire19/ggplot2-custom-themes

theme_custom <- function() {
  font <- "Roboto Condensed" # assign font family up front

  theme_classic(base_size = 14) %+replace% # replace elements we want to change

    theme(
      panel.grid.minor = element_blank(), # strip minor gridlines
      text = element_text(family = font),
      # text elements
      plot.title = element_text( # title
        family = font, # set font family
        # size = 20,               #set font size
        face = "bold", # bold typeface
        hjust = 0, # left align
        # vjust = 2                #raise slightly
        margin = margin(0, 0, 10, 0)
      ),
      plot.subtitle = element_text( # subtitle
        family = font, # font family
        # size = 14,                #font size
        hjust = 0,
        margin = margin(2, 0, 5, 0)
      ),
      plot.caption = element_text( # caption
        family = font, # font family
        size = 8, # font size
        hjust = 1
      ), # right align

      axis.title = element_text( # axis titles
        family = font, # font family
        size = 10 # font size
      ),
      axis.text = element_text( # axis text
        family = font, # axis family
        size = 8
      ) # font size
    )
}

# Set graph theme
theme_set(new = theme_custom())
#

Introduction

Sometimes the dependent variable is Qualitative: an either/or categorization. For example, the variable we want to predict might be whether a contestant won or lost, whether a person has an ailment or not, voted or did not vote in the last election, or graduated from college or not. There might even be more than two categories, such as voted for Congress, BJP, or Independent; or never smoker, former smoker, or current smoker.

The Logistic Regression Model

We saw with the General Linear Model that it models the mean of a target Quantitative variable as a linear weighted sum of the predictor variables:

$$y \sim N(x_i^T \beta, \;\sigma^2) \tag{1}$$

This model is considered to be general because of its dependence on potentially more than one explanatory variable, as opposed to the simple linear model:1 $y = \beta_0 + \beta_1 x_1 + \epsilon$. The general linear model gives us model "shapes" that range from a simple straight line to a p-dimensional hyperplane.

Although a very useful framework, there are some situations where general linear models are not appropriate:

  • the range of Y is restricted (e.g. binary, count)
  • the variance of Y depends on the mean (Taylor’s Law)2

How do we use the familiar linear model framework when the target/dependent variable is Categorical?

Linear Models for Categorical Targets?

Recall that we dummy-encoded Qualitative predictor variables for our linear models using numerical values such as 0 and 1, or +1 and -1. Could we do the same for a categorical target variable?

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

where

$$Y_i = \begin{cases} 0 & \text{if "No"} \\ 1 & \text{if "Yes"} \end{cases}$$

Sadly, this does not work for categorical dependent variables with a simple linear model as before. Consider the Credit Card Default data from the package ISLR.

default  student  balance     income      default_yes
No       No        729.5265   44361.625   0
No       Yes       817.1804   12106.135   0
No       No       1073.5492   31767.139   0
No       No        529.2506   35704.494   0
No       No        785.6559   38463.496   0
No       Yes       919.5885    7491.559   0
No       No        825.5133   24905.227   0
No       Yes       808.6675   17600.451   0
No       No       1161.0579   37468.529   0
No       No          0.0000   29275.268   0
(first 10 of 10,000 rows)

We see that balance and income are quantitative predictors, student is a qualitative predictor, and default is a qualitative target variable. If we naively fit a linear model such as model = lm(default_yes ~ balance, data = Default) and plot it, then…

Figure 1: Naive Linear Model

…it is pretty clear from Figure 1 that something is very odd. (No pun intended! See below!) If the only possible values for default are No = 0 and Yes = 1, how would we interpret a predicted value of, say, $Y_i = 0.25$ or $Y_i = 1.55$, or perhaps $Y_i = -0.22$? Anything other than Yes/No is hard to interpret!
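For reference, here is a minimal sketch of how such a naive straight-line fit could be produced. This is an illustration assuming the ISLR Default data (with the default_yes dummy column shown above), not the original plotting code behind Figure 1.

library(ISLR)
library(tidyverse)
library(ggformula)

Default_coded <- Default %>%
  mutate(default_yes = if_else(default == "Yes", 1, 0))

naive_model <- lm(default_yes ~ balance, data = Default_coded)
broom::tidy(naive_model)

# the fitted straight line happily predicts values outside [0, 1]
gf_point(default_yes ~ balance, data = Default_coded, alpha = 0.2) %>%
  gf_lm(colour = "red")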

Problems…and Solutions

Where do we go from here?

Let us state what we might desire of our model:

  1. Model Equation: Despite this setback, we would still like our model to be as close as possible to the familiar linear model equation.

    $$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

    where

    $$Y_i = \begin{cases} 0 & \text{if "No"} \\ 1 & \text{if "Yes"} \end{cases} \tag{2}$$

  2. Predictors and Weights: We have quantitative predictors, so we still want to use a linear-weighted sum for the RHS (i.e. the predictor side) of the model equation. What can we try to make this work, especially for the LHS (i.e. the target side)?

  3. Making the LHS continuous: What can we try? In dummy encoding our target variable, we found a range of [0,1], which is the same range for a probability value! Could we try to use probability of the outcome as our target, even though we are interested in binary outcomes? This would still leave us with a range of [0,1] for the target variable, as before.

Note: Binomially distributed target variable

If we map our Categorical/Qualitative target variable into a Quantitative probability, we need immediately to look at the LINE assumptions in linear regression.

In linear regression, we assume a normally distributed target variable, i.e. the residuals/errors around the predicted value are normally distributed. With a categorical target variable having two levels 0 and 1, it would be impossible for the errors $e_i = Y_i - \hat{Y_i}$ to have a normal distribution, as assumed for the statistical tests to be valid. The errors are bounded by [0, 1]! One candidate for the error distribution in this case is the binomial distribution, whose mean and variance are $np$ and $np(1-p)$ respectively.

Note immediately that the binomial variance moves with the mean! The LINE assumption of normality is clearly violated. Moreover, extreme probabilities (near 1 or 0) are more stable (i.e., have less error variance) than middle probabilities. So the model has "built-in" heteroscedasticity, which we need to counter with transformations such as the log() function. More on this very shortly!

  4. Odds?: How would one "extend" the range of a target variable from [0, 1] to [-∞, ∞]? One step would be to try the odds of the outcome, instead of trying to predict the outcomes directly (Yes or No), or their probabilities [0, 1].
Note: Odds

The odds of an event with probability p of occurrence are defined as Odds = p/(1-p). As can be seen, the odds are the ratio of two probabilities: that of the event and that of its complement. In the Default dataset just considered, the odds of default and the odds of non-default can be calculated as:

default   n
No        9667
Yes       333
(2 rows)

p(Default) = 333/(333 + 9667) = 0.0333

therefore:

Odds of Default = p(Default)/(1 - p(Default)) = 0.0333/(1 - 0.0333) ≈ 0.034

and Odds of No-Default = 0.9667/(1 - 0.9667) ≈ 29.

Now, odds cover one half of the real number line, i.e. [0, ∞]! Clearly, when the probability p of an event is 0, the odds are 0, and as p nears 1, the odds tend to ∞. So we have transformed a probability that lies in [0, 1] into odds lying in [0, ∞]. That is one step towards making a linear model possible; we have "removed" one of the limits on our linear model's prediction range by using odds as our target variable.
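A quick computational sketch of these numbers (assuming the ISLR Default data is loaded; this snippet is an illustration, not part of the original text):

library(ISLR)
p_default <- mean(Default$default == "Yes")       # 333/10000 ≈ 0.0333
odds_default <- p_default / (1 - p_default)       # ≈ 0.034
odds_no_default <- (1 - p_default) / p_default    # ≈ 29
c(p_default, odds_default, odds_no_default)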

  5. Transformation using log()?: We need one more leap of faith: how do we convert a [0, ∞] range to [-∞, ∞]? Can we try a log transformation?

$$\log([0, \infty]) = [-\infty, \infty]$$

This extends the range of our Qualitative target to the same as with a Quantitative target!

There is an additional benefit of this log() transformation: the error distributions with odds as targets. See the plot below. Odds are necessarily a nonlinear function of probability; the slope of odds vs. probability also depends upon the probability itself, as we saw with the probability curve earlier.

Figure 2: Odds Plot. (a) Odds; (b) Log Odds

To understand this issue intuitively, consider what happens with, say, a 20% change in the odds ratio near 1.0. If the odds ratio is 1.0, then the probabilities p and 1-p are 0.5 and 0.5. A 20% increase in the odds ratio to 1.20 would correspond to probabilities of 0.545 and 0.455. However, if the original probabilities were 0.9 and 0.1, for an odds ratio of 9, then a 20% increase (in the odds ratio) to 10.8 would correspond to probabilities of 0.915 and 0.085, a much smaller change in the probabilities. The basic curve is non-linear, and the log transformation flattens this out to provide a more linear relationship, which is what we desire.
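The arithmetic above can be checked with a couple of lines of R (an illustrative sketch, not from the original):

# converting odds back to probabilities: p = odds / (1 + odds)
odds_to_p <- function(odds) odds / (1 + odds)
odds_to_p(c(1.0, 1.2))   # 0.500 -> 0.545
odds_to_p(c(9.0, 10.8))  # 0.900 -> 0.915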

So in our model, instead of modeling odds as the dependent variable, we will use log(odds), also known as the logit, defined as:

$$\log(\text{odds}_i) = \log\!\left[\frac{p_i}{1 - p_i}\right] = \text{logit}(p_i) \tag{3}$$

This is our Logistic Regression Model, which uses a Quantitative predictor variable to predict a Categorical target variable. We write the model (for the Default dataset) as:

$$\text{logit}(\text{default}) = \beta_0 + \beta_1 \cdot \text{balance} \tag{4}$$

This means that:

$$\log\!\left[\frac{p(\text{default})}{1 - p(\text{default})}\right] = \beta_0 + \beta_1 \cdot \text{balance}$$

and therefore:

$$p(\text{default}) = \frac{\exp(\beta_0 + \beta_1 \cdot \text{balance})}{1 + \exp(\beta_0 + \beta_1 \cdot \text{balance})} = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 \cdot \text{balance})]} \tag{5}$$

From Equation 4 above it should be clear that a unit increase in balance increases the log-odds of default by $\beta_1$ units. The RHS of Equation 5 is a sigmoid function of the weighted sum of predictors and is limited to the range [0, 1].
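As a small aside (an illustration, not part of the original workflow): base R already provides the logit and its inverse, the sigmoid of Equation 5, as qlogis() and plogis().

p <- c(0.1, 0.5, 0.9)
qlogis(p)           # log-odds: log(p / (1 - p))
plogis(qlogis(p))   # back to probabilities, always inside [0, 1]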

Figure 3: Model Plots. (a) naive linear regression model; (b) logistic regression model; (c) log odds gives linear models

If we were to include income also as a predictor variable in the model, we might obtain something like:

$$p(\text{default}) = \frac{\exp(\beta_0 + \beta_1 \cdot \text{balance} + \beta_2 \cdot \text{income})}{1 + \exp(\beta_0 + \beta_1 \cdot \text{balance} + \beta_2 \cdot \text{income})} = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 \cdot \text{balance} + \beta_2 \cdot \text{income})]} \tag{6}$$

This model Equation 6 is plotted a little differently, since it includes three variables. We’ll see this shortly, with code. The thing to note is that the formula inside the exp() is a linear combination of the predictors!

  6. Estimation of Model Parameters: The parameters $\beta_i$ now need to be estimated. How might we do that? The last problem is that, because we have made so many transformations to get to the logits that we want to model, the logic of minimizing the sum of squared errors (SSE) is no longer appropriate.
Note: Infinite SSE!!

The observed probabilities for default are 0 and 1. At these values the log(odds) map respectively to $-\infty$ and $\infty$. So if we naively try to compute residuals, we will find that they are all infinite! Hence the Sum of Squared Errors (SSE) cannot be computed, and we need another way to assess the quality of our model.

Instead, we will have to use maximum likelihood estimation (MLE) to estimate the models. The maximum likelihood method maximizes the probability of obtaining the data at hand over every choice of model parameters $\beta_i$. (Model comparisons are then evaluated with the $\chi^2$ ("chi-squared") test and statistic, instead of the t and F statistics.)

Workflow: Breast Cancer Dataset

Let us proceed with the logistic regression workflow. We will use the well-known Wisconsin breast cancer dataset, readily available from Vincent Arel-Bundock’s website.

Workflow: Read the Data

cancer <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/dslabs/brca.csv") %>%
  janitor::clean_names()
glimpse(cancer)
Rows: 569
Columns: 32
$ rownames            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ x_radius_mean       <dbl> 13.540, 13.080, 9.504, 13.030, 8.196, 12.050, 13.4…
$ x_texture_mean      <dbl> 14.36, 15.71, 12.44, 18.42, 16.84, 14.63, 22.30, 2…
$ x_perimeter_mean    <dbl> 87.46, 85.63, 60.34, 82.61, 51.71, 78.04, 86.91, 7…
$ x_area_mean         <dbl> 566.3, 520.0, 273.9, 523.8, 201.9, 449.3, 561.0, 4…
$ x_smoothness_mean   <dbl> 0.09779, 0.10750, 0.10240, 0.08983, 0.08600, 0.103…
$ x_compactness_mean  <dbl> 0.08129, 0.12700, 0.06492, 0.03766, 0.05943, 0.090…
$ x_concavity_mean    <dbl> 0.066640, 0.045680, 0.029560, 0.025620, 0.015880, …
$ x_concave_pts_mean  <dbl> 0.047810, 0.031100, 0.020760, 0.029230, 0.005917, …
$ x_symmetry_mean     <dbl> 0.1885, 0.1967, 0.1815, 0.1467, 0.1769, 0.1675, 0.…
$ x_fractal_dim_mean  <dbl> 0.05766, 0.06811, 0.06905, 0.05863, 0.06503, 0.060…
$ x_radius_se         <dbl> 0.2699, 0.1852, 0.2773, 0.1839, 0.1563, 0.2636, 0.…
$ x_texture_se        <dbl> 0.7886, 0.7477, 0.9768, 2.3420, 0.9567, 0.7294, 1.…
$ x_perimeter_se      <dbl> 2.058, 1.383, 1.909, 1.170, 1.094, 1.848, 1.735, 2…
$ x_area_se           <dbl> 23.560, 14.670, 15.700, 14.160, 8.205, 19.870, 20.…
$ x_smoothness_se     <dbl> 0.008462, 0.004097, 0.009606, 0.004352, 0.008968, …
$ x_compactness_se    <dbl> 0.014600, 0.018980, 0.014320, 0.004899, 0.016460, …
$ x_concavity_se      <dbl> 0.023870, 0.016980, 0.019850, 0.013430, 0.015880, …
$ x_concave_pts_se    <dbl> 0.013150, 0.006490, 0.014210, 0.011640, 0.005917, …
$ x_symmetry_se       <dbl> 0.01980, 0.01678, 0.02027, 0.02671, 0.02574, 0.014…
$ x_fractal_dim_se    <dbl> 0.002300, 0.002425, 0.002968, 0.001777, 0.002582, …
$ x_radius_worst      <dbl> 15.110, 14.500, 10.230, 13.300, 8.964, 13.760, 15.…
$ x_texture_worst     <dbl> 19.26, 20.49, 15.66, 22.81, 21.96, 20.70, 31.82, 2…
$ x_perimeter_worst   <dbl> 99.70, 96.09, 65.13, 84.46, 57.26, 89.88, 99.00, 8…
$ x_area_worst        <dbl> 711.2, 630.5, 314.9, 545.9, 242.2, 582.6, 698.8, 5…
$ x_smoothness_worst  <dbl> 0.14400, 0.13120, 0.13240, 0.09701, 0.12970, 0.149…
$ x_compactness_worst <dbl> 0.17730, 0.27760, 0.11480, 0.04619, 0.13570, 0.215…
$ x_concavity_worst   <dbl> 0.239000, 0.189000, 0.088670, 0.048330, 0.068800, …
$ x_concave_pts_worst <dbl> 0.12880, 0.07283, 0.06227, 0.05013, 0.02564, 0.065…
$ x_symmetry_worst    <dbl> 0.2977, 0.3184, 0.2450, 0.1987, 0.3105, 0.2747, 0.…
$ x_fractal_dim_worst <dbl> 0.07259, 0.08183, 0.07773, 0.06169, 0.07409, 0.083…
$ y                   <chr> "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", …
skim(cancer)
Data summary
Name cancer
Number of rows 569
Number of columns 32
_______________________
Column type frequency:
character 1
numeric 31
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
y 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rownames 0 1 285.00 164.40 1.00 143.00 285.00 427.00 569.00 ▇▇▇▇▇
x_radius_mean 0 1 14.13 3.52 6.98 11.70 13.37 15.78 28.11 ▂▇▃▁▁
x_texture_mean 0 1 19.29 4.30 9.71 16.17 18.84 21.80 39.28 ▃▇▃▁▁
x_perimeter_mean 0 1 91.97 24.30 43.79 75.17 86.24 104.10 188.50 ▃▇▃▁▁
x_area_mean 0 1 654.89 351.91 143.50 420.30 551.10 782.70 2501.00 ▇▃▂▁▁
x_smoothness_mean 0 1 0.10 0.01 0.05 0.09 0.10 0.11 0.16 ▁▇▇▁▁
x_compactness_mean 0 1 0.10 0.05 0.02 0.06 0.09 0.13 0.35 ▇▇▂▁▁
x_concavity_mean 0 1 0.09 0.08 0.00 0.03 0.06 0.13 0.43 ▇▃▂▁▁
x_concave_pts_mean 0 1 0.05 0.04 0.00 0.02 0.03 0.07 0.20 ▇▃▂▁▁
x_symmetry_mean 0 1 0.18 0.03 0.11 0.16 0.18 0.20 0.30 ▁▇▅▁▁
x_fractal_dim_mean 0 1 0.06 0.01 0.05 0.06 0.06 0.07 0.10 ▆▇▂▁▁
x_radius_se 0 1 0.41 0.28 0.11 0.23 0.32 0.48 2.87 ▇▁▁▁▁
x_texture_se 0 1 1.22 0.55 0.36 0.83 1.11 1.47 4.88 ▇▅▁▁▁
x_perimeter_se 0 1 2.87 2.02 0.76 1.61 2.29 3.36 21.98 ▇▁▁▁▁
x_area_se 0 1 40.34 45.49 6.80 17.85 24.53 45.19 542.20 ▇▁▁▁▁
x_smoothness_se 0 1 0.01 0.00 0.00 0.01 0.01 0.01 0.03 ▇▃▁▁▁
x_compactness_se 0 1 0.03 0.02 0.00 0.01 0.02 0.03 0.14 ▇▃▁▁▁
x_concavity_se 0 1 0.03 0.03 0.00 0.02 0.03 0.04 0.40 ▇▁▁▁▁
x_concave_pts_se 0 1 0.01 0.01 0.00 0.01 0.01 0.01 0.05 ▇▇▁▁▁
x_symmetry_se 0 1 0.02 0.01 0.01 0.02 0.02 0.02 0.08 ▇▃▁▁▁
x_fractal_dim_se 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.03 ▇▁▁▁▁
x_radius_worst 0 1 16.27 4.83 7.93 13.01 14.97 18.79 36.04 ▆▇▃▁▁
x_texture_worst 0 1 25.68 6.15 12.02 21.08 25.41 29.72 49.54 ▃▇▆▁▁
x_perimeter_worst 0 1 107.26 33.60 50.41 84.11 97.66 125.40 251.20 ▇▇▃▁▁
x_area_worst 0 1 880.58 569.36 185.20 515.30 686.50 1084.00 4254.00 ▇▂▁▁▁
x_smoothness_worst 0 1 0.13 0.02 0.07 0.12 0.13 0.15 0.22 ▂▇▇▂▁
x_compactness_worst 0 1 0.25 0.16 0.03 0.15 0.21 0.34 1.06 ▇▅▁▁▁
x_concavity_worst 0 1 0.27 0.21 0.00 0.11 0.23 0.38 1.25 ▇▅▂▁▁
x_concave_pts_worst 0 1 0.11 0.07 0.00 0.06 0.10 0.16 0.29 ▅▇▅▃▁
x_symmetry_worst 0 1 0.29 0.06 0.16 0.25 0.28 0.32 0.66 ▅▇▁▁▁
x_fractal_dim_worst 0 1 0.08 0.02 0.06 0.07 0.08 0.09 0.21 ▇▃▁▁▁

We see that there are 30 Quantitative measurement variables, all named x_*** (plus a rownames index column), and one Qualitative variable, y, which is a two-level target (B = Benign, M = Malignant). The dataset has 569 observations and no missing data.

Workflow: Data Munging

Let us rename y as diagnosis and take two of the Quantitative variables as predictors, suitably renaming them too. We will also create a binary-valued variable called diagnosis_malignant (Malignant = 1, Benign = 0) for use as the target in our logistic regression model.

Show the Code
cancer_modified <- cancer %>%
  rename(
    "diagnosis" = y,
    "radius_mean" = x_radius_mean,
    "concave_points_mean" = x_concave_pts_mean
  ) %>%
  ## Convert diagnosis to factor
  mutate(diagnosis = factor(
    diagnosis,
    levels = c("B", "M"),
    labels = c("B", "M")
  )) %>%
  ## New Variable
  mutate(diagnosis_malignant = if_else(diagnosis == "M", 1, 0)) %>%
  select(radius_mean, concave_points_mean, diagnosis, diagnosis_malignant)

cancer_modified
radius_mean  concave_points_mean  diagnosis  diagnosis_malignant
13.540       0.047810             B          0
13.080       0.031100             B          0
 9.504       0.020760             B          0
13.030       0.029230             B          0
 8.196       0.005917             B          0
12.050       0.027490             B          0
13.490       0.033840             B          0
11.760       0.011150             B          0
13.640       0.017230             B          0
11.940       0.013490             B          0
(first 10 of 569 rows)
Note: Research Question

How can we predict whether a cancerous tumour is Benign or Malignant, based on the variable radius_mean alone, and with both radius_mean and concave_points_mean?

Workflow: EDA

Let us use GGally to plot a set of combo-plots for our modified dataset:

Show the Code
theme_set(new = theme_custom())
#
cancer_modified %>%
  select(diagnosis, radius_mean, concave_points_mean) %>%
  GGally::ggpairs(
    mapping = aes(colour = diagnosis),
    switch = "both",
    # axis labels in more traditional locations(left and bottom)

    progress = FALSE,
    # no compute progress messages needed

    # Choose the diagonal graphs (always single variable! Think!)
    diag = list(continuous = "densityDiag", alpha = 0.3),
    # choosing density

    # Choose lower triangle graphs, two-variable graphs
    lower = list(continuous = wrap("points", alpha = 0.3)),
    title = "Cancer Pairs Plot #1"
  ) +
  scale_color_manual(
    values = c("forestgreen", "red2"),
    aesthetics = c("color", "fill")
  )

Note: Business Insights from GGally::ggpairs
  • The counts for "B" and "M" are not terribly unbalanced, and both radius_mean and concave_points_mean have well-separated box-plot distributions for "B" and "M".
  • Given the visible separation of the box plots for both variables radius_mean and concave_points_mean, we can believe that these will be good choices as predictors.
  • Interestingly, radius_mean and concave_points_mean are also mutually well-correlated, with ρ = 0.823; we may wish (later) to choose (a pair of) predictor variables that are less strongly correlated.

Workflow: Model Building

  • Model Code
  • Workflow: Model Checking and Diagnostics
  • Workflow: Checks for Uncertainty
  • Logistic Regression Models as Hypothesis Tests

Let us code two models, using one and then both the predictor variables:

Show the Code
cancer_fit_1 <- glm(diagnosis_malignant ~ radius_mean,
  data = cancer_modified,
  family = binomial(link = "logit")
)

cancer_fit_1 %>% broom::tidy()
term         estimate    std.error   statistic  p.value
(Intercept)  -15.245871  1.32462600  -11.50957  1.180708e-30
radius_mean    1.033589  0.09310645   11.10115  1.238377e-28
(2 rows)
Table 1: Simple Model
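As a quick check on the interpretation of these coefficients, one could exponentiate them to get the multiplicative effect on the odds; a hedged sketch, assuming cancer_fit_1 from the chunk above:

broom::tidy(cancer_fit_1, exponentiate = TRUE, conf.int = TRUE)
# or simply
exp(coef(cancer_fit_1))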

The equation for the simple model is:

$$\text{diagnosis\_malignant} \sim \text{Bernoulli}\left(\text{prob}_{\text{diagnosis\_malignant}=1} = \hat{P}\right)$$
$$\log\!\left[\frac{\hat{P}}{1 - \hat{P}}\right] = -15.25 + 1.03 \cdot \text{radius\_mean} \tag{7}$$

Increasing radius_mean by one unit changes the log-odds by $\hat{\beta_1} = 1.033$, or equivalently, it multiplies the odds by $\exp(\hat{\beta_1}) \approx 2.81$. We can plot the model as shown below:

Show the Code
# Set graph theme
theme_set(new = theme_custom())
##
qthresh <- c(0.2, 0.5, 0.8)
beta01 <- coef(cancer_fit_1)[1]
beta11 <- coef(cancer_fit_1)[2]
decision_point <- (log(qthresh / (1 - qthresh)) - beta01) / beta11
##
cancer_modified %>%
  gf_point(
    diagnosis_malignant ~ radius_mean,
    colour = ~diagnosis,
    title = "diagnosis ~ radius_mean",
    xlab = "Average radius",
    ylab = "Diagnosis (1=malignant)", size = 3, show.legend = F
  ) %>%
  # gf_fun(exp(1.033 * radius_mean - 15.25) / (1 + exp(1.033 * radius_mean - 15.25)) ~ radius_mean, xlim = c(1, 30), linewidth = 3, colour = "red") %>%
  gf_smooth(
    method = glm,
    method.args = list(family = "binomial"),
    se = FALSE,
    color = "black"
  ) %>%
  gf_vline(xintercept = decision_point, linetype = "dashed") %>%
  gf_refine(
    annotate(
      "text",
      label = paste0("q = ", qthresh),
      x = decision_point + 0.45,
      y = 0.4,
      angle = -90
    ),
    scale_color_manual(values = c("forestgreen", "red2"))
  ) %>%
  gf_hline(yintercept = 0.5) %>%
  gf_theme(theme(plot.title.position = "plot")) %>%
  gf_refine(xlim(5, 30))
Figure 4: Simple Model plot

The dotted lines show how the model can be used to classify the data into two classes ("B" and "M") depending upon the threshold probability q.
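A short sketch of this classification idea in code (assuming cancer_fit_1 from above; the threshold q = 0.5 is just one possible choice):

pred_prob <- predict(cancer_fit_1, type = "response")   # predicted P(malignant)
pred_class <- if_else(pred_prob > 0.5, "M", "B")         # classify at q = 0.5
table(predicted = pred_class, observed = cancer_modified$diagnosis)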

Taking both predictor variables, we obtain the model:

Show the Code
cancer_fit_2 <- glm(diagnosis_malignant ~ radius_mean + concave_points_mean,
  data = cancer_modified,
  family = binomial(link = "logit")
)

cancer_fit_2 %>% broom::tidy()
term                 estimate     std.error  statistic  p.value
(Intercept)          -13.6989209  1.5737724  -8.704512  3.189440e-18
radius_mean            0.6389243  0.1076161   5.937070  2.901607e-09
concave_points_mean   84.2235535  9.9594347   8.456660  2.751446e-17
(3 rows)
Table 2

The equation for the more complex model is:

$$\text{diagnosis\_malignant} \sim \text{Bernoulli}\left(\text{prob}_{\text{diagnosis\_malignant}=1} = \hat{P}\right)$$
$$\log\!\left[\frac{\hat{P}}{1 - \hat{P}}\right] = -13.7 + 0.64 \cdot \text{radius\_mean} + 84.22 \cdot \text{concave\_points\_mean} \tag{8}$$

Increasing radius_mean by one unit changes the log-odds by $\hat{\beta_1} = 0.6389$, or equivalently, it multiplies the odds by $\exp(\hat{\beta_1}) = 1.894$, provided concave_points_mean is held fixed.
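For concreteness, a hedged prediction sketch using cancer_fit_2 (the tumour measurements below are made-up values, purely for illustration):

new_tumour <- tibble(radius_mean = 15, concave_points_mean = 0.05)
predict(cancer_fit_2, newdata = new_tumour, type = "response")   # predicted P(malignant)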

We can plot the model as shown below: we create a scatter plot of the two predictor variables. The superimposed diagonal lines are lines for several constant values of threshold probability q.

Show the Code
# Set graph theme
theme_set(new = theme_custom())
##
beta02 <- coef(cancer_fit_2)[1]
beta12 <- coef(cancer_fit_2)[2]
beta22 <- coef(cancer_fit_2)[3]
##
decision_intercept <- 1 / beta22 * (log(qthresh / (1 - qthresh)) - beta02)
decision_slope <- -beta12 / beta22
##
cancer_modified %>%
  gf_point(concave_points_mean ~ radius_mean,
    color = ~diagnosis, shape = ~diagnosis,
    size = 3, alpha = 0.5
  ) %>%
  gf_labs(
    x = "Average radius",
    y = "Average concave\nportions of the\ncontours",
    color = "Diagnosis",
    shape = "Diagnosis",
    title = "diagnosis ~ radius_mean + concave_points_mean"
  ) %>%
  gf_abline(
    slope = decision_slope, intercept = decision_intercept,
    linetype = "dashed"
  ) %>%
  gf_refine(
    scale_color_manual(values = c("forestgreen", "red2")),
    annotate("text", label = paste0("q = ", qthresh), x = 10, y = c(0.08, 0.1, 0.115), angle = -17.155)
  ) %>%
  gf_theme(theme(plot.title.position = "plot"))
Figure 5: Complex Model plot
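The two nested models can also be compared with a chi-squared test, as hinted at in the MLE note earlier; a minimal sketch (not part of the original workflow):

anova(cancer_fit_1, cancer_fit_2, test = "Chisq")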

Workflow: Model Checking and Diagnostics (to be written up).

Workflow: Checks for Uncertainty (to be written up).

Logistic Regression Models as Hypothesis Tests (to be written up).

Workflow: Logistic Regression Internals

All that is very well, but what is happening under the hood of the glm command? Consider the diagnosis (target) variable and, say, the radius_mean feature/predictor variable. What we do is:

  1. Plot a scatter plot: gf_point(diagnosis ~ radius_mean, data = cancer_modified).
  2. Start with a sigmoid curve with some initial parameters $\hat{\beta_0}$ and $\hat{\beta_1}$ that gives us some prediction of the probability of diagnosis for any given radius_mean.
  3. We know the target labels for each data point (i.e. "B" and "M"). We can calculate the likelihood of $\hat{\beta_0}$ and $\hat{\beta_1}$, given the data.
  4. We then change the values of $\hat{\beta_0}$ and $\hat{\beta_1}$ and calculate the likelihood again.
  5. The set of parameters $\hat{\beta_0}$ and $\hat{\beta_1}$ with the maximum likelihood (ML) gives us our logistic regression model.
  6. Use that model henceforth for prediction.

How does one find the "ML" parameters? There is clearly a two-step procedure:

  • Find the likelihood of the data for given parameters $\beta_i$.
  • Maximize the likelihood by varying them. In practice, the changes to the parameters (step 4 above) are made in accordance with a method such as the Newton-Raphson method, which can rapidly find the ML values for the parameters.

Let us visualize the variations and computations from steps 4 and 5. For the sake of clarity:

  • we take a small sample of the original dataset;
  • we take several different values for $\beta_0$ and $\beta_1$;
  • we use these to get a set of regression curves;
  • which we superimpose on the scatter plot of the sample.
Figure 6: Multiple Models

In Figure 6, we see three models: the "optimum" one in black, and two others in green and red respectively.

We now project the actual points on to the regression curve, to obtain the predicted probability for each point.

The predicted probability $p_i$ for each datum (radius_mean) is the probability of the tumour being Malignant. If the datum corresponds to a tumour that is Benign, we must take $1 - p_i$. Each data point is assumed to be independent, so we can calculate the likelihood of the data, given the model parameters, as a product of probabilities:

$$\text{likelihood} = \prod_{\text{Malignant}} p_i \;\times \prod_{\text{Benign}} (1 - p_i) = \prod_i (p_i)^{y_i} \, (1 - p_i)^{1 - y_i}$$

since the labels $y_i$ are binary (1 or 0). Lastly, since this is a product of many small numbers, it can lead to numerical inaccuracies, so we take the log of the whole thing to turn the product into a sum, obtaining the log-likelihood (LL):

$$\text{log-likelihood} \;\; ll(\beta_i) = \log \prod_i (p_i)^{y_i}(1 - p_i)^{1 - y_i} = \sum_i y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \tag{9}$$

We now need to find the (global) maximum of this quantity and so determine the $\beta_i$. Flipping this problem around, we find the maximum likelihood by finding the root (zero) of the gradient of the LL! And, to find that root, we use the Newton-Raphson method or an equivalent. Phew!
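A minimal sketch of Equation 9 in code (assuming cancer_modified and cancer_fit_1 from above; the "guess" parameters are arbitrary): the glm() estimates should give a larger (less negative) log-likelihood than other choices.

loglik <- function(beta0, beta1, x, y) {
  p <- 1 / (1 + exp(-(beta0 + beta1 * x)))   # sigmoid of the linear predictor
  sum(y * log(p) + (1 - y) * log(1 - p))     # Equation (9)
}
with(cancer_modified, c(
  arbitrary_guess = loglik(-10, 0.5, radius_mean, diagnosis_malignant),
  glm_estimates   = loglik(coef(cancer_fit_1)[1], coef(cancer_fit_1)[2],
                           radius_mean, diagnosis_malignant)
))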

Note: The Newton-Raphson Method

  • The black curve $y = f(x)$ is the function whose root we seek; here it is the gradient of the LL function.
  • We start with an arbitrary value $x = x_1$, $y_1 = f(x_1)$ and calculate the tangent/slope/gradient $f'(x_1)$ at the point $(x_1, y_1) = (x_1, f(x_1))$.
  • The tangent at $x_1$ cuts the x-axis at $x_2$ (grey line).
  • Repeat.
  • Stop when the gradient becomes very small and $x_i$ changes very little in successive iterations.

How do we calculate the next value of x using the tangent?

  • At $(x_1, y_1)$, the tangent line is: $y = y_1 + f'(x_1)(x - x_1)$.
  • The tangent passes through the point $(x_2, 0)$, so $0 = y_1 + f'(x_1)(x_2 - x_1)$.
  • Solving for $x_2$, we get: $x_2 = x_1 - y_1/f'(x_1) = x_1 - f(x_1)/f'(x_1)$.
  • Since $f(x)$ is already the gradient of the LL, we have: $x_2 = x_1 - ll'(x_1)/ll''(x_1)$! (A small code sketch of this iteration follows.)
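To make the iteration concrete, here is a hand-rolled Newton-Raphson sketch for the simple model. This is an illustration under the assumptions above; glm() itself uses the closely related IRLS algorithm, and a production implementation would add step-halving and other safeguards.

X <- cbind(1, cancer_modified$radius_mean)     # design matrix: intercept + radius_mean
y <- cancer_modified$diagnosis_malignant
beta <- c(0, 0)                                # arbitrary starting values
for (i in 1:25) {
  p <- as.vector(1 / (1 + exp(-X %*% beta)))   # current predicted probabilities
  grad <- t(X) %*% (y - p)                     # ll'(beta)
  hess <- -t(X) %*% (X * (p * (1 - p)))        # ll''(beta)
  step <- solve(hess, grad)                    # Newton step
  beta <- as.vector(beta - step)
  if (max(abs(step)) < 1e-8) break             # stop when the step is tiny
}
beta   # should be close to coef(cancer_fit_1): roughly (-15.25, 1.03)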

To be written up:

  • Formula for gradient of LL
  • Convergence of Newton- Raphson method for Maximum Likelihood
  • Hand Calculation of all steps (!!)

Conclusions

  • Logistic Regression is a great ML algorithm for predicting Qualitative target variables.
  • It also works for multi-level/multi-valued Qualitative variables (multinomial logistic regression).
  • The internals of Logistic Regression are quite different from those of Linear Regression.

References

  1. Judd, Charles M. & McClelland, Gary H. & Ryan, Carey S. Data Analysis: A Model Comparison Approach to Regression, ANOVA, and Beyond. Routledge, Aug 2017. Chapter 14.
  2. Emi Tanaka. Logistic Regression. https://emitanaka.org/iml/lectures/lecture-04A.html#/TOC. Course: ETC3250/5250, Monash University, Melbourne, Australia.
  3. Geeks for Geeks. Logistic Regression. https://www.geeksforgeeks.org/understanding-logistic-regression/
  4. Geeks for Geeks. Maximum Likelihood Estimation. https://www.geeksforgeeks.org/probability-density-estimation-maximum-likelihood-estimation/
  5. https://yury-zablotski.netlify.app/post/how-logistic-regression-works/
  6. https://uc-r.github.io/logistic_regression
  7. https://francisbach.com/self-concordant-analysis-for-logistic-regression/
  8. https://statmath.wu.ac.at/courses/heather_turner/glmCourse_001.pdf
  9. https://jasp-stats.org/2022/06/30/generalized-linear-models-glm-in-jasp/
  10. P. Bingham, N.Q. Verlander, M.J. Cheal (2004). John Snow, William Farr and the 1849 outbreak of cholera that affected London: a reworking of the data highlights the importance of the water supply. Public Health Volume 118, Issue 6, September 2004, Pages 387-394. Read the PDF.
  11. https://peopleanalytics-regression-book.org/bin-log-reg.html
  12. McGill University. Epidemiology https://www.medicine.mcgill.ca/epidemiology/joseph/courses/epib-621/logfit.pdf
  13. https://arunaddagatla.medium.com/maximum-likelihood-estimation-in-logistic-regression-f86ff1627b67
R Package Citations
Package Version Citation
equatiomatic 0.3.6 Anderson, Heiss, and Sumners (2025)
ISLR 1.4 James et al. (2021)
Anderson, Daniel, Andrew Heiss, and Jay Sumners. 2025. equatiomatic: Transform Models into β€œLaTeX” Equations. https://doi.org/10.32614/CRAN.package.equatiomatic.
James, Gareth, Daniela Witten, Trevor Hastie, and Rob Tibshirani. 2021. ISLR: Data for an Introduction to Statistical Learning with Applications in r. https://doi.org/10.32614/CRAN.package.ISLR.

Footnotes

  1. https://statmath.wu.ac.at/courses/heather_turner/glmCourse_001.pdf

  2. https://en.wikipedia.org/wiki/Taylor%27s_law

Citation

BibTeX citation:
@online{v.2023,
  author = {V., Arvind},
  title = {Modelling with {Logistic} {Regression}},
  date = {2023-04-13},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LogReg/},
  langid = {en},
  abstract = {Predicting Qualitative Target Variables}
}
For attribution, please cite this work as:
V., Arvind. 2023. β€œModelling with Logistic Regression.” April 13, 2023. https://av-quarto.netlify.app/content/courses/Analytics/Modelling/Modules/LogReg/.

License: CC BY-SA 2.0
