Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors
  1. Teaching
  2. R for Artists and Managers
  3. Iteration: Learning to purrr
  • Teaching
    • Data Analytics for Managers and Creators
      • Tools
        • Introduction to R and RStudio
        • Introduction to Radiant
        • Introduction to Orange
      • Descriptive Analytics
        • Data
        • Summaries
        • Counts
        • Quantities
        • Groups
        • Densities
        • Groups and Densities
        • Change
        • Proportions
        • Parts of a Whole
        • Evolution and Flow
        • Ratings and Rankings
        • Surveys
        • Time
        • Space
        • Networks
        • Experiments
        • Miscellaneous Graphing Tools, and References
      • Statistical Inference
        • 🧭 Basics of Statistical Inference
        • 🎲 Samples, Populations, Statistics and Inference
        • Basics of Randomization Tests
        • 🃏 Inference for a Single Mean
        • 🃏 Inference for Two Independent Means
        • 🃏 Inference for Comparing Two Paired Means
        • Comparing Multiple Means with ANOVA
        • Inference for Correlation
        • 🃏 Testing a Single Proportion
        • 🃏 Inference Test for Two Proportions
      • Inferential Modelling
        • Modelling with Linear Regression
        • Modelling with Logistic Regression
        • 🕔 Modelling and Predicting Time Series
      • Predictive Modelling
        • 🐉 Intro to Orange
        • ML - Regression
        • ML - Classification
        • ML - Clustering
      • Prescriptive Modelling
        • 📐 Intro to Linear Programming
        • 💭 The Simplex Method - Intuitively
        • 📅 The Simplex Method - In Excel
      • Workflow
        • Facing the Abyss
        • I Publish, therefore I Am
      • Case Studies
        • Demo:Product Packaging and Elderly People
        • Ikea Furniture
        • Movie Profits
        • Gender at the Work Place
        • Heptathlon
        • School Scores
        • Children's Games
        • Valentine’s Day Spending
        • Women Live Longer?
        • Hearing Loss in Children
        • California Transit Payments
        • Seaweed Nutrients
        • Coffee Flavours
        • Legionnaire’s Disease in the USA
        • Antarctic Sea ice
        • William Farr's Observations on Cholera in London
    • R for Artists and Managers
      • 🕶 Lab-1: Science, Human Experience, Experiments, and Data
      • Lab-2: Down the R-abbit Hole…
      • Lab-3: Drink Me!
      • Lab-4: I say what I mean and I mean what I say
      • Lab-5: Twas brillig, and the slithy toves…
      • Lab-6: These Roses have been Painted !!
      • Lab-7: The Lobster Quadrille
      • Lab-8: Did you ever see such a thing as a drawing of a muchness?
      • Lab-9: If you please sir…which way to the Secret Garden?
      • Lab-10: An Invitation from the Queen…to play Croquet
      • Lab-11: The Queen of Hearts, She Made some Tarts
      • Lab-12: Time is a Him!!
      • Iteration: Learning to purrr
      • Lab-13: Old Tortoise Taught Us
      • Lab-14: You’re are Nothing but a Pack of Cards!!
    • ML for Artists and Managers
      • 🐉 Intro to Orange
      • ML - Regression
      • ML - Classification
      • ML - Clustering
      • 🕔 Modelling Time Series
    • TRIZ for Problem Solvers
      • I am Water
      • I am What I yam
      • Birds of Different Feathers
      • I Connect therefore I am
      • I Think, Therefore I am
      • The Art of Parallel Thinking
      • A Year of Metaphoric Thinking
      • TRIZ - Problems and Contradictions
      • TRIZ - The Unreasonable Effectiveness of Available Resources
      • TRIZ - The Ideal Final Result
      • TRIZ - A Contradictory Language
      • TRIZ - The Contradiction Matrix Workflow
      • TRIZ - The Laws of Evolution
      • TRIZ - Substance Field Analysis, and ARIZ
    • Math Models for Creative Coders
      • Maths Basics
        • Vectors
        • Matrix Algebra Whirlwind Tour
        • content/courses/MathModelsDesign/Modules/05-Maths/70-MultiDimensionGeometry/index.qmd
      • Tech
        • Tools and Installation
        • Adding Libraries to p5.js
        • Using Constructor Objects in p5.js
      • Geometry
        • Circles
        • Complex Numbers
        • Fractals
        • Affine Transformation Fractals
        • L-Systems
        • Kolams and Lusona
      • Media
        • Fourier Series
        • Additive Sound Synthesis
        • Making Noise Predictably
        • The Karplus-Strong Guitar Algorithm
      • AI
        • Working with Neural Nets
        • The Perceptron
        • The Multilayer Perceptron
        • MLPs and Backpropagation
        • Gradient Descent
      • Projects
        • Projects
    • Data Science with No Code
      • Data
      • Orange
      • Summaries
      • Counts
      • Quantity
      • 🕶 Happy Data are all Alike
      • Groups
      • Change
      • Rhythm
      • Proportions
      • Flow
      • Structure
      • Ranking
      • Space
      • Time
      • Networks
      • Surveys
      • Experiments
    • Tech for Creative Education
      • 🧭 Using Idyll
      • 🧭 Using Apparatus
      • 🧭 Using g9.js
    • Literary Jukebox: In Short, the World
      • Italy - Dino Buzzati
      • France - Guy de Maupassant
      • Japan - Hisaye Yamamoto
      • Peru - Ventura Garcia Calderon
      • Russia - Maxim Gorky
      • Egypt - Alifa Rifaat
      • Brazil - Clarice Lispector
      • England - V S Pritchett
      • Russia - Ivan Bunin
      • Czechia - Milan Kundera
      • Sweden - Lars Gustaffsson
      • Canada - John Cheever
      • Ireland - William Trevor
      • USA - Raymond Carver
      • Italy - Primo Levi
      • India - Ruth Prawer Jhabvala
      • USA - Carson McCullers
      • Zimbabwe - Petina Gappah
      • India - Bharati Mukherjee
      • USA - Lucia Berlin
      • USA - Grace Paley
      • England - Angela Carter
      • USA - Kurt Vonnegut
      • Spain-Merce Rodoreda
      • Israel - Ruth Calderon
      • Israel - Etgar Keret
  • Posts
  • Blogs and Talks

On this page

  • Setting up R Packages
  • Introduction
  • Case Study #1: Multiple Models for Life Expectancy with gapminder
    • {{}} EDA Plots
    • Constructing a Linear Model
    • The purrr package
    • Using map_* functions from purrr
    • Using purrr to create multiple models
    • Model Visualization
  • Recent developments in dplyr
  • Conclusion
  • References
  1. Teaching
  2. R for Artists and Managers
  3. Iteration: Learning to purrr

Iteration: Learning to purrr

Performing Iterations in R

Iteration
purrr package
Author

Arvind Venkatadri

Published

June 14, 2023

Modified

May 21, 2024

Setting up R Packages

knitr::opts_chunk$set(message = FALSE)
library(tidyverse)
library(gapminder)
library(ggformula)

Learning to purrr

Learning to purrr

Introduction

Often we want to perform the same operation on several different sets of data. Rather than repeat the operation for each instance of data, it is faster, more intuitive, and less error-prone if we create a data structure that holds all the data, and use the map-* series functions from the purrr package to perform all the repeated operations in one shot.

This requires getting used to. We need to understand:

  • the data structure
  • the iteration mechanism using map functions
  • the form of the results

Case Study #1: Multiple Models for Life Expectancy with gapminder

We will start with a complete case study and then work backwards to understand the various pieces of code that make it up.

Let us look at the gapminder dataset:

skimr::skim(gapminder)
Data summary
Name gapminder
Number of rows 1704
Number of columns 6
_______________________
Column type frequency:
factor 2
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
country 0 1 FALSE 142 Afg: 12, Alb: 12, Alg: 12, Ang: 12
continent 0 1 FALSE 5 Afr: 624, Asi: 396, Eur: 360, Ame: 300

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 1979.50 17.27 1952.00 1965.75 1979.50 1993.25 2007.0 ▇▅▅▅▇
lifeExp 0 1 59.47 12.92 23.60 48.20 60.71 70.85 82.6 ▁▆▇▇▇
pop 0 1 29601212.32 106157896.74 60011.00 2793664.00 7023595.50 19585221.75 1318683096.0 ▇▁▁▁▁
gdpPercap 0 1 7215.33 9857.45 241.17 1202.06 3531.85 9325.46 113523.1 ▇▁▁▁▁

We have lifeExp, gdpPerCap, and pop as Quant variables over time (year) for each country in the world. Suppose now that we wish to create Linear Regression Models predicting lifeExp using year, for each country. ( We will leave out gdpPercap and pop for now) The straightforward by laborious and naive way would be to use the lm command after filtering the dataset for each country, creating 140+ Linear Models manually! This would be horribly tedious!

There is a better way with purrr, and also more recently, with dplyr itself. Let us see both methods, the established purrr method first, and the new dplyr based method thereafter.

{{}} EDA Plots

We can first plot lifeExp over year, grouped by country:

ggplot(gapminder,aes(x = year, y = lifeExp, colour = country)) + 
  geom_line(show.legend = FALSE) + 
  theme_classic()
ggplot(gapminder,aes(x = year, y = lifeExp, colour = country)) + 
  geom_line(show.legend = FALSE) + 
  facet_wrap(~ continent) + 
  theme_classic()

By and large we see positive slopes, but some countries do show non-linear behaviour.

Constructing a Linear Model

Let us take 1950 as a baseline year for all countries. Then we model lifeExp using year1950 across all countries together:

gapminder <- gapminder %>% 
  mutate(year1950 = year - 1950) # baseline year
model <- lm(lifeExp ~ year1950, data = gapminder)
summary(model)

Call:
lm(formula = lifeExp ~ year1950, data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.949  -9.651   1.697  10.335  22.158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.86028    0.55792   89.37   <2e-16 ***
year1950     0.32590    0.01632   19.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.63 on 1702 degrees of freedom
Multiple R-squared:  0.1898,    Adjusted R-squared:  0.1893 
F-statistic: 398.6 on 1 and 1702 DF,  p-value: < 2.2e-16
model %>% broom::tidy() # Parameters of the Model
model %>% broom::glance() # Statistics of the Model
ABCDEFGHIJ0123456789
term
<chr>
estimate
<dbl>
std.error
<dbl>
statistic
<dbl>
(Intercept)49.86027650.5579183389.36841
year19500.32590380.0163236919.96509
2 rows | 1-4 of 5 columns
ABCDEFGHIJ0123456789
r.squared
<dbl>
adj.r.squared
<dbl>
sigma
<dbl>
statistic
<dbl>
0.18975710.189281111.63055398.6047
1 row | 1-4 of 12 columns

Since the slope 0.3259038 is positive, life expectancy has been increasing over the years, across all countries. (But r.squared 0.1897571 is low, so this model does not explain much).

How do we do this for each country? We need to use the split-apply-combine method to achieve this. The combination of group_by and summarise is a example of the split > apply > combine method. For example, we could (split) the data by country, calculate the linear model each group (apply), and (combine) the results in a data frame.

However, this first-attempt code for a per-country linear model does not work:

```{r}
#| eval: false
gapminder %>% 
  group_by(country) %>% 
  summarise(linmod = lm(lifeExp ~ year1950, data = .))
```

This is because the linmod variable is a list variable and cannot be accommodated in a simple column, which is what summarize will try to create. So we need to be able to create “list” columns in a data frame…how do we do that? Before we contemplate that, let us understand the capabilities of the purrr package in R.

The purrr package

The purrr package contains a new class of functions, that can take vectors/tibbles/lists as input, and perform an identical function over each component of these, and generate vectors/tibbles/lists as output. These are the map_* functions that are part of the purrr package. The * in the map_* function defines what kind of output (vector/tibble/list) the function generates.

Let us look at a few short examples.

Using map_* functions from purrr

The basic structure of the map_* functions is:

```{r}
#| eval: false
map_typeOfResult(.x = what_to_iterate_with, 
                 .f = function_to_apply)

map_typeOfResult(.x = what_to_iterate_with, 
                 .f = \(x) function_to_apply(x, additional_parameters))
```

Two examples:

# Example 1: Input: vector, Output: vector
diamonds %>% 
  select(where(is.numeric)) %>% 
  
  # We need dbl-type numbers in output **vector**
map_dbl(.x = ., 
        .f = mean)
       carat        depth        table        price            x            y 
   0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260 
           z 
   3.5387338 
# Example 2: Input: vector, Output: tibble
diamonds %>% 
  select(where(is.numeric)) %>% 
  
  # We need dbl-type numbers in output **vector**
map_df(.x = ., 
       .f = mean)
ABCDEFGHIJ0123456789
carat
<dbl>
depth
<dbl>
table
<dbl>
price
<dbl>
x
<dbl>
y
<dbl>
z
<dbl>
0.797939761.749457.457183932.85.7311575.7345263.538734
1 row
Note

Note map_dbl outputs a (numeric) vector, and map_df outputs a tibble.

In each of the above examples, each vector in the diamonds dataset was passed to the respective map_* function as the parameter.x.

Sometimes the function .f may need some additional parameters to be specified, and these may not come from the input .x:

# Example 3, with additional parameters to .f
palmerpenguins::penguins %>% 
  select(where(is.numeric)) %>% 
  
  map_dbl(.x = ., 
          .f = \(x) mean(x, na.rm = TRUE))
   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
         43.92193          17.15117         200.91520        4201.75439 
             year 
       2008.02907 
  # penguins has two rows of NA entries which need to be dropped
  # Hence this additional parameter for the `mean` function


# Example 4: if we want a tibble output
palmerpenguins::penguins %>% 
  select(where(is.numeric)) %>% 
  
  map_df(.x = ., 
         .f = \(x) mean(x, na.rm = TRUE))
ABCDEFGHIJ0123456789
bill_length_mm
<dbl>
bill_depth_mm
<dbl>
flipper_length_mm
<dbl>
body_mass_g
<dbl>
year
<dbl>
43.9219317.15117200.91524201.7542008.029
1 row

The .f function can be anything, even a ggformula plot command; in this case the output will not be a vector or a tibble, but a list:

#library(ggformula)
palmerpenguins::penguins %>% 
  select(where(is.numeric)) %>% select(-year) %>% drop_na() %>% 
  
  # `map` gives a list output
  map(.x = .,
      .f = \(x) gf_histogram(~x, bins = 30) %>% 
        gf_theme(theme_classic())
  )
$bill_length_mm


$bill_depth_mm


$flipper_length_mm


$body_mass_g

Note: we need to do just a bit of extra pre-work to get the variable names on the x-axis of the histograms. There is a possibility to store all the plots in a separate column

OK, so we can get vectors/tibbles/lists as output using vectors as inputs. Why would it be desirable to provide tibble/list as an input to a map_* function?

Using purrr to create multiple models

Now that we have some handle on purrr’s map functions, we can see how to develop a linear regression model for every country in the gapminder dataset. It should be clear from the command for a linear model:

```{r}
#| eval: false
model <- lm(target ~ predictor(s), 
            data  = tibble_containing_target_and_predictors_columns)
```

that we need to specify three things: target, predictors, and the data tibble for the development of a linear model. To do this for each country in gapminder, here is the process:

  1. Group the gapminder data by country (and continent)
  2. Create a column containing unique per-country data for each country. This column would hence contain a tibble in each cell. This is a list column!
  3. Use map which would take country and the data columns created above to create an lm object for each country (in another list column)
  4. Use map again with broom::tidy as the function to give us clean columns for the model per country.
  5. Use that multi-model tibble to plot graphs for each country.

Let us do this now!

gapminder_models <- gapminder %>% 
  group_by(continent, country) %>% 
  
  # Create a per-country tibble in a new column called "data_list"
  nest(.key = "data_list") 
gapminder_models
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
data_list
<list>
AfghanistanAsia<tibble[,5]>
AlbaniaEurope<tibble[,5]>
AlgeriaAfrica<tibble[,5]>
AngolaAfrica<tibble[,5]>
ArgentinaAmericas<tibble[,5]>
AustraliaOceania<tibble[,5]>
AustriaEurope<tibble[,5]>
BahrainAsia<tibble[,5]>
BangladeshAsia<tibble[,5]>
BelgiumEurope<tibble[,5]>
Next
123456
...
15
Previous
1-10 of 142 rows
gapminder_models <- gapminder_models %>%
  # We use mutate + map to add a list column containing linear models
  mutate(model = map(.x = data_list, 
                     
          # One column .x to iterate over
          # The .x list column contains data frames
          # So we access individual columns for target and predictors 
          # within these individual data frames
                     .f = \(.x) lm(lifeExp ~ year1950, data = .x)
          )) %>% 
  
  # Use mutate + map again to expose the columns of the models
  # Use broom:: tidy, broom::glance(), and 
  # Use broom::augment for separate columns
  mutate(model_params = map(.x = model, 
                      .f = \(.x) broom::tidy(.x, 
                                             conf.int = TRUE, 
                                             conf.lvel = 0.95)),
         model_metrics = map(.x = model,
                      .f = \(.x) broom::glance(.x)),
         
         model_augment = map(.x  = model, 
                             .f = \(.x) broom::augment(.x))
         ) 
gapminder_models
ABCDEFGHIJ0123456789
country
<fct>
continent
<fct>
data_list
<list>
model
<list>
model_params
<list>
model_metrics
<list>
model_augment
<list>
AfghanistanAsia<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
AlbaniaEurope<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
AlgeriaAfrica<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
AngolaAfrica<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
ArgentinaAmericas<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
AustraliaOceania<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
AustriaEurope<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
BahrainAsia<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
BangladeshAsia<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
BelgiumEurope<tibble[,5]><S3: lm><tibble[,7]><tibble[,12]><tibble[,8]>
Next
123456
...
15
Previous
1-10 of 142 rows

We can now take this tibble with multiple models and use broom to tidy, to glance at, and to augment the models:

params <- gapminder_models %>% 
  select(continent, country,model_params, model_metrics) %>% 
  ungroup() %>% 
  # Now unpack the linear model parameters into columns
  unnest(cols = model_params)
params
ABCDEFGHIJ0123456789
continent
<fct>
country
<fct>
term
<chr>
estimate
<dbl>
std.error
<dbl>
statistic
<dbl>
p.value
<dbl>
AsiaAfghanistan(Intercept)29.356637530.69898127841.99917581.404235e-12
AsiaAfghanistanyear19500.275328670.02045093413.46289019.835213e-08
EuropeAlbania(Intercept)58.559761771.13357581251.65932541.787180e-13
EuropeAlbaniayear19500.334683220.03316638710.09103631.462763e-06
AfricaAlgeria(Intercept)42.236414920.75626904055.84839878.215265e-14
AfricaAlgeriayear19500.569279720.02212707025.72774931.808143e-10
AfricaAngola(Intercept)31.707974130.80428746339.42368322.634887e-12
AfricaAngolayear19500.209339860.0235320038.89596444.593498e-06
AmericasArgentina(Intercept)62.225019110.167091314372.40127884.795627e-22
AmericasArgentinayear19500.231708390.00488879147.39584744.215567e-13
Next
123456
...
29
Previous
1-10 of 284 rows | 1-7 of 10 columns
###
metrics <- gapminder_models %>% 
  select(continent, country, model_metrics) %>% 
  ungroup() %>% 
  # Now unpack the linear model parameters into columns
  unnest(cols = model_metrics)
metrics
ABCDEFGHIJ0123456789
continent
<fct>
country
<fct>
r.squared
<dbl>
adj.r.squared
<dbl>
sigma
<dbl>
statistic
<dbl>
p.value
<dbl>
df
<dbl>
AsiaAfghanistan0.947712260.9424834831.2227880181.24940989.835213e-081
EuropeAlbania0.910577770.9016355451.9830615101.82901381.462763e-061
AfricaAlgeria0.985117210.9836289321.3230064661.91708641.808143e-101
AfricaAngola0.887814630.8765960931.407009179.13818234.593498e-061
AmericasArgentina0.995568100.9951249050.29230722246.36634874.215567e-131
OceaniaAustralia0.979647740.9776125110.6206086481.34586278.667222e-101
EuropeAustria0.992134010.9913474140.40740941261.29629027.435240e-121
AsiaBahrain0.966739810.9634137911.6395865290.65973941.015855e-081
AsiaBangladesh0.989360870.9882969560.9766908929.92636883.369501e-111
EuropeBelgium0.994540560.9939946120.29290251821.68839551.196280e-121
Next
123456
...
15
Previous
1-10 of 142 rows | 1-8 of 14 columns
###
augments <- gapminder_models %>% 
  select(continent, country, model_augment) %>% 
  ungroup() %>% 
  # Now unpack the linear model parameters into columns
  unnest(cols = model_augment)
augments
ABCDEFGHIJ0123456789
continent
<fct>
country
<fct>
lifeExp
<dbl>
year1950
<dbl>
.fitted
<dbl>
.resid
<dbl>
.hat
<dbl>
.sigma
<dbl>
.cooksd
<dbl>
AsiaAfghanistan28.80100229.90729-1.106295e+000.294871791.21181262.427205e-01
AsiaAfghanistan30.33200731.28394-9.519382e-010.224941721.23751181.134714e-01
AsiaAfghanistan31.997001232.66058-6.635816e-010.168997671.26588633.603567e-02
AsiaAfghanistan34.020001734.03722-1.722494e-020.127039631.28891711.653992e-05
AsiaAfghanistan36.088002235.413876.741317e-010.099067601.26700341.854831e-02
AsiaAfghanistan38.438002736.790511.647488e+000.085081591.15400189.225358e-02
AsiaAfghanistan39.854003238.167161.686845e+000.085081591.14707609.671389e-02
AsiaAfghanistan40.822003739.543801.278202e+000.099067601.20824266.668277e-02
AsiaAfghanistan41.674004240.920447.535583e-010.127039631.26058263.165567e-02
AsiaAfghanistan41.763004742.29709-5.340851e-010.168997671.27405082.334344e-02
Next
123456
...
171
Previous
1-10 of 1,704 rows | 1-9 of 10 columns

Model Visualization

We can now plot these models and their uncertainty (i.e Confidence Intervals). We can select a few of the countries and plot:

params_filtered <- params %>% 
  filter(country %in% c("India", "United States", "Brazil", "China"), 
         term == "year1950") %>% 
  select(country, estimate, conf.low, conf.high,p.value) %>%
  arrange(estimate)
params_filtered
###
params_filtered %>%
  gf_errorbar(conf.high + conf.low ~ reorder(country, estimate),
              linewidth = ~ -log10(p.value), width = 0.3,
              ylab = "Effect Size", 
              xlab = "Country",
              title = "Effect of years on Life Expectancy",
              caption = "Significance = - log10(p.value)") %>% 
  
  gf_point(estimate ~ reorder(country,estimate), 
           colour = "black", size = 4) %>%
  
  gf_theme(theme_classic()) %>% 
  
  gf_refine(coord_flip(),
            scale_linewidth_continuous("Significance", 
                                       range = c(0.2, 3))) %>%
  gf_refine(guides(linewidth = guide_legend(reverse = TRUE)),
            theme(axis.text.x = element_text(angle = 30, hjust = 1)))
ABCDEFGHIJ0123456789
country
<fct>
estimate
<dbl>
conf.low
<dbl>
conf.high
<dbl>
United States0.18416920.16866190.1996766
Brazil0.39008950.37793220.4022468
India0.50532100.44104240.5695996
China0.53071490.38698310.6744466
4 rows | 1-4 of 5 columns

But we can do better: visualize all models at once. What we will do is to plot the r.squared on the x-axis and the model term year1950 on the y-axis. We will need to combine params and metrics to do this:

params_combo <- params %>% 
  select(continent, country, term , estimate) %>% 
  filter(term == "year1950") %>% 
  left_join(metrics %>% select(continent, country, r.squared))
params_combo
###
params_combo %>% 
  gf_point(reorder(country, r.squared) ~ r.squared, 
           color = "grey70") %>%
  gf_point(reorder(country, r.squared) ~ r.squared, 
           data = params_combo %>% filter(continent == "Africa"),
           shape = 21, size = 3,
    fill = "salmon",
    ylab = "Country",
    title = "African Countries are Hard to Model") %>% 
  
  gf_label(60 ~ 0.25,
           label = "African Countries",
           fill = "salmon",
           color = "black",
           inherit = FALSE) %>%
    
  gf_theme(theme_classic()) %>% 
  gf_refine(theme(axis.text.y = element_text(size = 3, face = "bold")))
###
params_combo %>%
  gf_point(estimate ~ r.squared, color = "grey70") %>% 
  gf_point(estimate ~ r.squared,
           data = params_combo %>% 
             filter(continent == "Africa"),
           shape = 21,size =3,
    fill = "salmon",
    ylab = "Slope Estimate for Linear Model",
    title = "African Countries are Hard to Model",
    show.legend = FALSE) %>%
  gf_label(0.3 ~ 0.25,
           label = "African Countries",
           fill = "salmon",
           color = "black",
           inherit = FALSE) %>%
  gf_theme(theme_minimal())
ABCDEFGHIJ0123456789
continent
<fct>
country
<fct>
AsiaAfghanistan
EuropeAlbania
AfricaAlgeria
AfricaAngola
AmericasArgentina
OceaniaAustralia
EuropeAustria
AsiaBahrain
AsiaBangladesh
EuropeBelgium
Next
123
...
15
Previous
1-10 of 142 rows | 1-2 of 5 columns

As can be seen, there are many models with low values of r.squared and these are sadly all about countries in Africa. The linear model fares badly for these countries, since there are other factors (not just year) that affects lifeExp in these countries.

We can look at the model metrics and see for which (African) countries the model fares the worst. We will reverse sort on r.squared and choose the 5 worst models:

metrics %>% slice_min(order_by = r.squared, n = 5)
ABCDEFGHIJ0123456789
continent
<fct>
country
<fct>
r.squared
<dbl>
adj.r.squared
<dbl>
sigma
<dbl>
statistic
<dbl>
p.value
<dbl>
df
<dbl>
logLik
<dbl>
AIC
<dbl>
AfricaRwanda0.01715964-0.081124406.5582690.17459230.68489271-38.5020583.00411
AfricaBotswana0.03402340-0.062574266.1121770.35221770.56604141-37.6567381.31346
AfricaZimbabwe0.05623196-0.038144847.2054310.59582400.45802901-39.6313585.26271
AfricaZambia0.05983644-0.034179924.5287130.63644710.44353181-34.0585974.11717
AfricaSwaziland0.06821087-0.024968056.6440910.73204190.41225301-38.6580783.31614
5 rows | 1-10 of 14 columns

There are of course reasons for this: genocide in Rwanda, and hyper-inflation in Zimbabwe, and of course the HIV-AIDS pandemic. These reasons are not captured in the original gapminder data!

One last plot! We can plot the model intercept on the x-axis and the slope year term on the y-axis to see where countries were in the beginning (1950) and at what rate they have improved in lifeExp:

params %>%
  select(continent, country, term , estimate) %>%
  pivot_wider(
    id_cols = c(continent, country),
    names_from = term,
    values_from = estimate
  ) %>%
  left_join(metrics %>% select(continent, country, r.squared)) %>%
  
  gf_point(
    year1950 ~ `(Intercept)`,
    color = ~ continent,
    size = ~ r.squared,
    xlab = "Baseline at 1950",
    ylab = "Rate of Improvement",
    title = "Asian Countries Show Improvement in Life Expectancy",
    subtitle = "African Countries still struggling",
    caption = "Data from Gapminder"
  ) %>%
  gf_refine(scale_size(range = c(0.1, 4)),
            scale_color_manual(
              values =
                c(
                  "Africa" = "salmon",
                  "Asia" = "limegreen",
                  "Americas" = "grey90",
                  "Europe" = "grey90",
                  "Oceania" = "grey90"
                )
            )) %>%
  gf_refine(guides(size = guide_legend(reverse = TRUE))) %>%
  gf_theme(theme_classic())

Many Asian countries were low in lifeExp in 1950 and have shown good rates of improvement; r.squared is also decent. Sadly African countries had low lifeExp in 1950 and have not shown good rates of improvement.

Recent developments in dplyr

In recent times, the familiar dplyr package also has experimental functions that are syntactically easier and offer pretty much purrr-like capability, and without introducing the complexity of the list columns or list output.

Look the code below and decipher how it works:

# Using group_modify
gapminder_model_dplyr <- gapminder %>%
  group_by(continent, country) %>%
  
  # Here is the new function in dplyr!
  # No need to use `mutate`
  dplyr::group_modify(
    .data = .,
    
    # .f MUST generate a tibble here and *not* a list
    # Hence broom::tidy is essential!
    # glance/tidy is part of the group_map's .f variable.
    # Applies to each model

    .f = ~ lm(lifeExp ~ year, data = .) %>%
      broom::glance(conf.int = TRUE,  # try `tidy()` and `augment()`
                    conf.lvel = 0.95)
  ) %>%
  
  # We already have a grouped tibble from `group_modify()`
  # So just ungroup()
  ungroup()

gapminder_model_dplyr
ABCDEFGHIJ0123456789
continent
<fct>
country
<fct>
r.squared
<dbl>
adj.r.squared
<dbl>
sigma
<dbl>
statistic
<dbl>
p.value
<dbl>
df
<dbl>
AfricaAlgeria0.985117210.9836289321.3230064661.91708641.808143e-101
AfricaAngola0.887814630.8765960931.407009179.13818234.593498e-061
AfricaBenin0.966601990.9632621881.1746910289.41903081.037138e-081
AfricaBotswana0.03402340-0.0625742596.11217730.35221775.660414e-011
AfricaBurkina Faso0.918710500.9105815512.0470915113.01711909.047506e-071
AfricaBurundi0.765995970.7425955701.610777832.73430721.925677e-041
AfricaCameroon0.680178390.6481962333.243213621.26743109.627817e-041
AfricaCentral African Republic0.493244480.4425689263.52452909.73338141.087700e-021
AfricaChad0.872375500.8596130541.831439568.35486348.815616e-061
AfricaComoros0.996850760.9965358400.47864683165.37297097.633040e-141
Next
123456
...
15
Previous
1-10 of 142 rows | 1-8 of 14 columns

There is no nesting and un-nesting; the data is the familiar tibble throughout! This seems like a simple and elegant method.

TipUsing dplyr::group_modify

Note: group_modify is new experimental function in dplyr (June 2023), as are group_map, list_cbind and list_rbind. group_modify requires that the operation in .fgenerates a tibble, not a list, and we can retain the grouping variable easily too. We can remove the groups with ungroup.

group_modify() looks very clear and crisp, in my opinion. And very learner-friendly!

Conclusion

We have seen how purrr simplifies the application of functions iteratively to large groups of data, in a faster, replicable, and less error-prone manner. The basic idea (see video below) is:
- Use tidyr::nest to create a grouped data frame with a nested list column
- Use purrr::map_* to create a model for each of these data frames in the list column. The model will also be a column(usually) containing a list
- Use broom::tidy to convert the list model-column into a data frame for visualization

References

  1. Rebecca Barter, Learn to purrr. https://www.rebeccabarter.com/blog/2019-08-19_purrr

  2. Emorie Beck, Introduction to purrr. https://emoriebeck.github.io/R-tutorials/purrr/#

  3. Sander Wuyts, purrr Tutorial. https://sanderwuyts.com/en/blog/purrr-tutorial/

  4. Jared Wilber, Using the tidyverse for Machine Learning. https://www.jwilber.me/nest/

  5. Dan Ovando,Data Wrangling and Model Fitting using purrr

  6. Cormac Nolan, Modelling with Nested Data frames. https://github.com/cormac85/modelling_practice/blob/master/nested_data_frames.Rmd

Back to top
Lab-12: Time is a Him!!
Lab-13: Old Tortoise Taught Us

License: CC BY-SA 2.0

Website made with ❤️ and Quarto, by Arvind V.

Hosted by Netlify .