Iteration: Learning to purrr

Performing Iterations in R

Iteration

purrr package

Author

Arvind Venkatadri

Published

June 14, 2023

Modified

May 21, 2024

Setting up R Packages

knitr::opts_chunk$set(message = FALSE)
library(tidyverse)
library(gapminder)
library(ggformula)

Introduction

Often we want to perform the same operation on several different sets of data. Rather than repeat the operation for each instance of data, it is faster, more intuitive, and less error-prone if we create a data structure that holds all the data, and use the map-* series functions from the purrr package to perform all the repeated operations in one shot.

This requires getting used to. We need to understand:

the data structure
the iteration mechanism using map functions
the form of the results

Case Study #1: Multiple Models for Life Expectancy with `gapminder`

We will start with a complete case study and then work backwards to understand the various pieces of code that make it up.

Let us look at the gapminder dataset:

skimr::skim(gapminder)

Data summary
Name	gapminder
Number of rows	1704
Number of columns	6
_______________________
Column type frequency:
factor	2
numeric	4
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
country	0	1	FALSE	142	Afg: 12, Alb: 12, Alg: 12, Ang: 12
continent	0	1	FALSE	5	Afr: 624, Asi: 396, Eur: 360, Ame: 300

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	1979.50	17.27	1952.00	1965.75	1979.50	1993.25	2007.0	▇▅▅▅▇
lifeExp	1	59.47	12.92	23.60	48.20	60.71	70.85	82.6	▁▆▇▇▇
pop	1	29601212.32	106157896.74	60011.00	2793664.00	7023595.50	19585221.75	1318683096.0	▇▁▁▁▁
gdpPercap	1	7215.33	9857.45	241.17	1202.06	3531.85	9325.46	113523.1	▇▁▁▁▁

We have lifeExp, gdpPerCap, and pop as Quant variables over time (year) for each country in the world. Suppose now that we wish to create Linear Regression Models predicting lifeExp using year, for each country. ( We will leave out gdpPercap and pop for now) The straightforward by laborious and naive way would be to use the lm command after filtering the dataset for each country, creating 140+ Linear Models manually! This would be horribly tedious!

There is a better way with purrr, and also more recently, with dplyr itself. Let us see both methods, the established purrr method first, and the new dplyr based method thereafter.

{{}} EDA Plots

We can first plot lifeExp over year, grouped by country:

ggplot(gapminder,aes(x = year, y = lifeExp, colour = country)) + 
  geom_line(show.legend = FALSE) + 
  theme_classic()
ggplot(gapminder,aes(x = year, y = lifeExp, colour = country)) + 
  geom_line(show.legend = FALSE) + 
  facet_wrap(~ continent) + 
  theme_classic()

By and large we see positive slopes, but some countries do show non-linear behaviour.

Constructing a Linear Model

Let us take \(1950\) as a baseline year for all countries. Then we model lifeExp using year1950 across all countries together:

gapminder <- gapminder %>% 
  mutate(year1950 = year - 1950) # baseline year
model <- lm(lifeExp ~ year1950, data = gapminder)
summary(model)


Call:
lm(formula = lifeExp ~ year1950, data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.949  -9.651   1.697  10.335  22.158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.86028    0.55792   89.37   <2e-16 ***
year1950     0.32590    0.01632   19.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.63 on 1702 degrees of freedom
Multiple R-squared:  0.1898,    Adjusted R-squared:  0.1893 
F-statistic: 398.6 on 1 and 1702 DF,  p-value: < 2.2e-16

model %>% broom::tidy() # Parameters of the Model
model %>% broom::glance() # Statistics of the Model

Since the slope 0.3259038 is positive, life expectancy has been increasing over the years, across all countries. (But r.squared 0.1897571 is low, so this model does not explain much).

How do we do this for each country? We need to use the split-apply-combine method to achieve this. The combination of group_by and summarise is a example of the split > apply > combine method. For example, we could (split) the data by country, calculate the linear model each group (apply), and (combine) the results in a data frame.

However, this first-attempt code for a per-country linear model does not work:

```{r}
#| eval: false
gapminder %>% 
  group_by(country) %>% 
  summarise(linmod = lm(lifeExp ~ year1950, data = .))
```

This is because the linmod variable is a list variable and cannot be accommodated in a simple column, which is what summarize will try to create. So we need to be able to create “list” columns in a data frame…how do we do that? Before we contemplate that, let us understand the capabilities of the purrr package in R.

The `purrr` package

The purrr package contains a new class of functions, that can take vectors/tibbles/lists as input, and perform an identical function over each component of these, and generate vectors/tibbles/lists as output. These are the map_* functions that are part of the purrr package. The * in the map_* function defines what kind of output (vector/tibble/list) the function generates.

Let us look at a few short examples.

Using `map_*` functions from `purrr`

The basic structure of the map_* functions is:

```{r}
#| eval: false
map_typeOfResult(.x = what_to_iterate_with, 
                 .f = function_to_apply)

map_typeOfResult(.x = what_to_iterate_with, 
                 .f = \(x) function_to_apply(x, additional_parameters))
```

Two examples:

# Example 1: Input: vector, Output: vector
diamonds %>% 
  select(where(is.numeric)) %>% 
  
  # We need dbl-type numbers in output **vector**
map_dbl(.x = ., 
        .f = mean)

       carat        depth        table        price            x            y 
   0.7979397   61.7494049   57.4571839 3932.7997219    5.7311572    5.7345260 
           z 
   3.5387338

# Example 2: Input: vector, Output: tibble
diamonds %>% 
  select(where(is.numeric)) %>% 
  
  # We need dbl-type numbers in output **vector**
map_df(.x = ., 
       .f = mean)

Note

Note map_dbl outputs a (numeric) vector, and map_df outputs a tibble.

In each of the above examples, each vector in the diamonds dataset was passed to the respective map_* function as the parameter.x.

Sometimes the function .f may need some additional parameters to be specified, and these may not come from the input .x:

# Example 3, with additional parameters to .f
palmerpenguins::penguins %>% 
  select(where(is.numeric)) %>% 
  
  map_dbl(.x = ., 
          .f = \(x) mean(x, na.rm = TRUE))

   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
         43.92193          17.15117         200.91520        4201.75439 
             year 
       2008.02907

  # penguins has two rows of NA entries which need to be dropped
  # Hence this additional parameter for the `mean` function


# Example 4: if we want a tibble output
palmerpenguins::penguins %>% 
  select(where(is.numeric)) %>% 
  
  map_df(.x = ., 
         .f = \(x) mean(x, na.rm = TRUE))

The .f function can be anything, even a ggformula plot command; in this case the output will not be a vector or a tibble, but a list:

#library(ggformula)
palmerpenguins::penguins %>% 
  select(where(is.numeric)) %>% select(-year) %>% drop_na() %>% 
  
  # `map` gives a list output
  map(.x = .,
      .f = \(x) gf_histogram(~x, bins = 30) %>% 
        gf_theme(theme_classic())
  )

$bill_length_mm


$bill_depth_mm


$flipper_length_mm


$body_mass_g

Note: we need to do just a bit of extra pre-work to get the variable names on the x-axis of the histograms. There is a possibility to store all the plots in a separate column

OK, so we can get vectors/tibbles/lists as output using vectors as inputs. Why would it be desirable to provide tibble/list as an input to a map_* function?

Using `purrr` to create multiple models

Now that we have some handle on purrr’s map functions, we can see how to develop a linear regression model for every country in the gapminder dataset. It should be clear from the command for a linear model:

```{r}
#| eval: false
model <- lm(target ~ predictor(s), 
            data  = tibble_containing_target_and_predictors_columns)
```

that we need to specify three things: target, predictors, and the data tibble for the development of a linear model. To do this for each country in gapminder, here is the process:

Group the gapminder data by country (and continent)
Create a column containing unique per-country data for each country. This column would hence contain a tibble in each cell. This is a list column!
Use map which would take country and the data columns created above to create an lm object for each country (in another list column)
Use map again with broom::tidy as the function to give us clean columns for the model per country.
Use that multi-model tibble to plot graphs for each country.

Let us do this now!

gapminder_models <- gapminder %>% 
  group_by(continent, country) %>% 
  
  # Create a per-country tibble in a new column called "data_list"
  nest(.key = "data_list") 
gapminder_models

gapminder_models <- gapminder_models %>%
  # We use mutate + map to add a list column containing linear models
  mutate(model = map(.x = data_list, 
                     
          # One column .x to iterate over
          # The .x list column contains data frames
          # So we access individual columns for target and predictors 
          # within these individual data frames
                     .f = \(.x) lm(lifeExp ~ year1950, data = .x)
          )) %>% 
  
  # Use mutate + map again to expose the columns of the models
  # Use broom:: tidy, broom::glance(), and 
  # Use broom::augment for separate columns
  mutate(model_params = map(.x = model, 
                      .f = \(.x) broom::tidy(.x, 
                                             conf.int = TRUE, 
                                             conf.lvel = 0.95)),
         model_metrics = map(.x = model,
                      .f = \(.x) broom::glance(.x)),
         
         model_augment = map(.x  = model, 
                             .f = \(.x) broom::augment(.x))
         ) 
gapminder_models

We can now take this tibble with multiple models and use broom to tidy, to glance at, and to augment the models:

params <- gapminder_models %>% 
  select(continent, country,model_params, model_metrics) %>% 
  ungroup() %>% 
  # Now unpack the linear model parameters into columns
  unnest(cols = model_params)
params

###
metrics <- gapminder_models %>% 
  select(continent, country, model_metrics) %>% 
  ungroup() %>% 
  # Now unpack the linear model parameters into columns
  unnest(cols = model_metrics)
metrics

###
augments <- gapminder_models %>% 
  select(continent, country, model_augment) %>% 
  ungroup() %>% 
  # Now unpack the linear model parameters into columns
  unnest(cols = model_augment)
augments

Model Visualization

We can now plot these models and their uncertainty (i.e Confidence Intervals). We can select a few of the countries and plot:

params_filtered <- params %>% 
  filter(country %in% c("India", "United States", "Brazil", "China"), 
         term == "year1950") %>% 
  select(country, estimate, conf.low, conf.high,p.value) %>%
  arrange(estimate)
params_filtered
###
params_filtered %>%
  gf_errorbar(conf.high + conf.low ~ reorder(country, estimate),
              linewidth = ~ -log10(p.value), width = 0.3,
              ylab = "Effect Size", 
              xlab = "Country",
              title = "Effect of years on Life Expectancy",
              caption = "Significance = - log10(p.value)") %>% 
  
  gf_point(estimate ~ reorder(country,estimate), 
           colour = "black", size = 4) %>%
  
  gf_theme(theme_classic()) %>% 
  
  gf_refine(coord_flip(),
            scale_linewidth_continuous("Significance", 
                                       range = c(0.2, 3))) %>%
  gf_refine(guides(linewidth = guide_legend(reverse = TRUE)),
            theme(axis.text.x = element_text(angle = 30, hjust = 1)))

But we can do better: visualize all models at once. What we will do is to plot the r.squared on the x-axis and the model term year1950 on the y-axis. We will need to combine params and metrics to do this:

params_combo <- params %>% 
  select(continent, country, term , estimate) %>% 
  filter(term == "year1950") %>% 
  left_join(metrics %>% select(continent, country, r.squared))
params_combo
###
params_combo %>% 
  gf_point(reorder(country, r.squared) ~ r.squared, 
           color = "grey70") %>%
  gf_point(reorder(country, r.squared) ~ r.squared, 
           data = params_combo %>% filter(continent == "Africa"),
           shape = 21, size = 3,
    fill = "salmon",
    ylab = "Country",
    title = "African Countries are Hard to Model") %>% 
  
  gf_label(60 ~ 0.25,
           label = "African Countries",
           fill = "salmon",
           color = "black",
           inherit = FALSE) %>%
    
  gf_theme(theme_classic()) %>% 
  gf_refine(theme(axis.text.y = element_text(size = 3, face = "bold")))
###
params_combo %>%
  gf_point(estimate ~ r.squared, color = "grey70") %>% 
  gf_point(estimate ~ r.squared,
           data = params_combo %>% 
             filter(continent == "Africa"),
           shape = 21,size =3,
    fill = "salmon",
    ylab = "Slope Estimate for Linear Model",
    title = "African Countries are Hard to Model",
    show.legend = FALSE) %>%
  gf_label(0.3 ~ 0.25,
           label = "African Countries",
           fill = "salmon",
           color = "black",
           inherit = FALSE) %>%
  gf_theme(theme_minimal())

As can be seen, there are many models with low values of r.squared and these are sadly all about countries in Africa. The linear model fares badly for these countries, since there are other factors (not just year) that affects lifeExp in these countries.

We can look at the model metrics and see for which (African) countries the model fares the worst. We will reverse sort on r.squared and choose the 5 worst models:

metrics %>% slice_min(order_by = r.squared, n = 5)

There are of course reasons for this: genocide in Rwanda, and hyper-inflation in Zimbabwe, and of course the HIV-AIDS pandemic. These reasons are not captured in the original gapminder data!

One last plot! We can plot the model intercept on the x-axis and the slope year term on the y-axis to see where countries were in the beginning (1950) and at what rate they have improved in lifeExp:

params %>%
  select(continent, country, term , estimate) %>%
  pivot_wider(
    id_cols = c(continent, country),
    names_from = term,
    values_from = estimate
  ) %>%
  left_join(metrics %>% select(continent, country, r.squared)) %>%
  
  gf_point(
    year1950 ~ `(Intercept)`,
    color = ~ continent,
    size = ~ r.squared,
    xlab = "Baseline at 1950",
    ylab = "Rate of Improvement",
    title = "Asian Countries Show Improvement in Life Expectancy",
    subtitle = "African Countries still struggling",
    caption = "Data from Gapminder"
  ) %>%
  gf_refine(scale_size(range = c(0.1, 4)),
            scale_color_manual(
              values =
                c(
                  "Africa" = "salmon",
                  "Asia" = "limegreen",
                  "Americas" = "grey90",
                  "Europe" = "grey90",
                  "Oceania" = "grey90"
                )
            )) %>%
  gf_refine(guides(size = guide_legend(reverse = TRUE))) %>%
  gf_theme(theme_classic())

Many Asian countries were low in lifeExp in 1950 and have shown good rates of improvement; r.squared is also decent. Sadly African countries had low lifeExp in 1950 and have not shown good rates of improvement.

Recent developments in `dplyr`

In recent times, the familiar dplyr package also has experimental functions that are syntactically easier and offer pretty much purrr-like capability, and without introducing the complexity of the list columns or list output.

Look the code below and decipher how it works:

# Using group_modify
gapminder_model_dplyr <- gapminder %>%
  group_by(continent, country) %>%
  
  # Here is the new function in dplyr!
  # No need to use `mutate`
  dplyr::group_modify(
    .data = .,
    
    # .f MUST generate a tibble here and *not* a list
    # Hence broom::tidy is essential!
    # glance/tidy is part of the group_map's .f variable.
    # Applies to each model

    .f = ~ lm(lifeExp ~ year, data = .) %>%
      broom::glance(conf.int = TRUE,  # try `tidy()` and `augment()`
                    conf.lvel = 0.95)
  ) %>%
  
  # We already have a grouped tibble from `group_modify()`
  # So just ungroup()
  ungroup()

gapminder_model_dplyr

There is no nesting and un-nesting; the data is the familiar tibble throughout! This seems like a simple and elegant method.

Using dplyr::group_modify

Note: group_modify is new experimental function in dplyr (June 2023), as are group_map, list_cbind and list_rbind. group_modify requires that the operation in .fgenerates a tibble, not a list, and we can retain the grouping variable easily too. We can remove the groups with ungroup.

group_modify() looks very clear and crisp, in my opinion. And very learner-friendly!

Conclusion

We have seen how purrr simplifies the application of functions iteratively to large groups of data, in a faster, replicable, and less error-prone manner. The basic idea (see video below) is:
- Use tidyr::nest to create a grouped data frame with a nested list column
- Use purrr::map_* to create a model for each of these data frames in the list column. The model will also be a column(usually) containing a list
- Use broom::tidy to convert the list model-column into a data frame for visualization

References

Rebecca Barter, Learn to purrr. https://www.rebeccabarter.com/blog/2019-08-19_purrr
Emorie Beck, Introduction to purrr. https://emoriebeck.github.io/R-tutorials/purrr/#
Sander Wuyts, purrr Tutorial. https://sanderwuyts.com/en/blog/purrr-tutorial/
Jared Wilber, Using the tidyverse for Machine Learning. https://www.jwilber.me/nest/
Dan Ovando,Data Wrangling and Model Fitting using purrr
Cormac Nolan, Modelling with Nested Data frames. https://github.com/cormac85/modelling_practice/blob/master/nested_data_frames.Rmd

Setting up R Packages

Introduction

Case Study #1: Multiple Models for Life Expectancy with gapminder

{{}} EDA Plots

Constructing a Linear Model

The purrr package

Using map_* functions from purrr

Using purrr to create multiple models

Model Visualization

Recent developments in dplyr

Conclusion

References

Case Study #1: Multiple Models for Life Expectancy with `gapminder`

The `purrr` package

Using `map_*` functions from `purrr`

Using `purrr` to create multiple models

Recent developments in `dplyr`