Iteration: Learning to purrr
Performing Iterations in R
Setting up R Packages
Introduction
Often we want to perform the same operation on several different sets of data. Rather than repeat the operation for each instance of data, it is faster, more intuitive, and less error-prone if we create a data structure that holds all the data, and use the map-*
series functions from the purrr
package to perform all the repeated operations in one shot.
This requires getting used to. We need to understand:
- the data structure
- the iteration mechanism using
map
functions - the form of the results
Case Study #1: Multiple Models for Life Expectancy with gapminder
We will start with a complete case study and then work backwards to understand the various pieces of code that make it up.
Let us look at the gapminder
dataset:
skimr::skim(gapminder)
Name | gapminder |
Number of rows | 1704 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
country | 0 | 1 | FALSE | 142 | Afg: 12, Alb: 12, Alg: 12, Ang: 12 |
continent | 0 | 1 | FALSE | 5 | Afr: 624, Asi: 396, Eur: 360, Ame: 300 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 1979.50 | 17.27 | 1952.00 | 1965.75 | 1979.50 | 1993.25 | 2007.0 | ▇▅▅▅▇ |
lifeExp | 0 | 1 | 59.47 | 12.92 | 23.60 | 48.20 | 60.71 | 70.85 | 82.6 | ▁▆▇▇▇ |
pop | 0 | 1 | 29601212.32 | 106157896.74 | 60011.00 | 2793664.00 | 7023595.50 | 19585221.75 | 1318683096.0 | ▇▁▁▁▁ |
gdpPercap | 0 | 1 | 7215.33 | 9857.45 | 241.17 | 1202.06 | 3531.85 | 9325.46 | 113523.1 | ▇▁▁▁▁ |
We have lifeExp
, gdpPerCap
, and pop
as Quant variables over time (year
) for each country in the world. Suppose now that we wish to create Linear Regression Models predicting lifeExp
using year
, for each country. ( We will leave out gdpPercap
and pop
for now) The straightforward by laborious and naive way would be to use the lm
command after filtering the dataset for each country, creating 140+ Linear Models manually! This would be horribly tedious!
There is a better way with purrr
, and also more recently, with dplyr
itself. Let us see both methods, the established purrr
method first, and the new dplyr
based method thereafter.
{{}} EDA Plots
We can first plot lifeExp
over year
, grouped by country
:
ggplot(gapminder,aes(x = year, y = lifeExp, colour = country)) +
geom_line(show.legend = FALSE) +
theme_classic()
ggplot(gapminder,aes(x = year, y = lifeExp, colour = country)) +
geom_line(show.legend = FALSE) +
facet_wrap(~ continent) +
theme_classic()
By and large we see positive slopes, but some countries do show non-linear behaviour.
Constructing a Linear Model
Let us take \(1950\) as a baseline year for all countries. Then we model lifeExp
using year1950
across all countries together:
gapminder <- gapminder %>%
mutate(year1950 = year - 1950) # baseline year
model <- lm(lifeExp ~ year1950, data = gapminder)
summary(model)
Call:
lm(formula = lifeExp ~ year1950, data = gapminder)
Residuals:
Min 1Q Median 3Q Max
-39.949 -9.651 1.697 10.335 22.158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.86028 0.55792 89.37 <2e-16 ***
year1950 0.32590 0.01632 19.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.63 on 1702 degrees of freedom
Multiple R-squared: 0.1898, Adjusted R-squared: 0.1893
F-statistic: 398.6 on 1 and 1702 DF, p-value: < 2.2e-16
model %>% broom::tidy() # Parameters of the Model
model %>% broom::glance() # Statistics of the Model
Since the slope 0.3259038 is positive, life expectancy has been increasing over the years, across all countries. (But r.squared
0.1897571 is low, so this model does not explain much).
How do we do this for each country? We need to use the split-apply-combine
method to achieve this. The combination of group_by
and summarise
is a example of the split > apply > combine
method. For example, we could (split) the data by country, calculate the linear model each group (apply), and (combine) the results in a data frame.
However, this first-attempt code for a per-country linear model does not work:
```{r}
#| eval: false
gapminder %>%
group_by(country) %>%
summarise(linmod = lm(lifeExp ~ year1950, data = .))
```
This is because the linmod
variable is a list variable and cannot be accommodated in a simple column, which is what summarize
will try to create. So we need to be able to create “list” columns in a data frame…how do we do that? Before we contemplate that, let us understand the capabilities of the purrr
package in R.
The purrr
package
The purrr
package contains a new class of functions, that can take vectors/tibbles/lists as input, and perform an identical function over each component of these, and generate vectors/tibbles/lists as output. These are the map_*
functions that are part of the purrr
package. The *
in the map_*
function defines what kind of output (vector/tibble/list) the function generates.
Let us look at a few short examples.
Using map_*
functions from purrr
The basic structure of the map_*
functions is:
```{r}
#| eval: false
map_typeOfResult(.x = what_to_iterate_with,
.f = function_to_apply)
map_typeOfResult(.x = what_to_iterate_with,
.f = \(x) function_to_apply(x, additional_parameters))
```
Two examples:
# Example 1: Input: vector, Output: vector
diamonds %>%
select(where(is.numeric)) %>%
# We need dbl-type numbers in output **vector**
map_dbl(.x = .,
.f = mean)
carat depth table price x y
0.7979397 61.7494049 57.4571839 3932.7997219 5.7311572 5.7345260
z
3.5387338
# Example 2: Input: vector, Output: tibble
diamonds %>%
select(where(is.numeric)) %>%
# We need dbl-type numbers in output **vector**
map_df(.x = .,
.f = mean)
Note map_dbl
outputs a (numeric) vector, and map_df
outputs a tibble
.
In each of the above examples, each vector in the diamonds
dataset was passed to the respective map_* function as the parameter.x
.
Sometimes the function .f
may need some additional parameters to be specified, and these may not come from the input .x
:
# Example 3, with additional parameters to .f
palmerpenguins::penguins %>%
select(where(is.numeric)) %>%
map_dbl(.x = .,
.f = \(x) mean(x, na.rm = TRUE))
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
43.92193 17.15117 200.91520 4201.75439
year
2008.02907
# penguins has two rows of NA entries which need to be dropped
# Hence this additional parameter for the `mean` function
# Example 4: if we want a tibble output
palmerpenguins::penguins %>%
select(where(is.numeric)) %>%
map_df(.x = .,
.f = \(x) mean(x, na.rm = TRUE))
The .f
function can be anything, even a ggformula
plot command; in this case the output will not be a vector or a tibble, but a list:
#library(ggformula)
palmerpenguins::penguins %>%
select(where(is.numeric)) %>% select(-year) %>% drop_na() %>%
# `map` gives a list output
map(.x = .,
.f = \(x) gf_histogram(~x, bins = 30) %>%
gf_theme(theme_classic())
)
OK, so we can get vectors/tibbles/lists as output using vectors as inputs. Why would it be desirable to provide tibble/list as an input to a map_*
function?
Using purrr
to create multiple models
Now that we have some handle on purrr’s map
functions, we can see how to develop a linear regression model for every country in the gapminder
dataset. It should be clear from the command for a linear model:
```{r}
#| eval: false
model <- lm(target ~ predictor(s),
data = tibble_containing_target_and_predictors_columns)
```
that we need to specify three things: target
, predictors
, and the data
tibble for the development of a linear model. To do this for each country
in gapminder
, here is the process:
- Group the
gapminder
data bycountry
(andcontinent
) - Create a column containing unique per-country
data
for each country. This column would hence contain atibble
in each cell. This is alist
column! - Use
map
which would takecountry
and thedata
columns created above to create anlm
object for each country (in another list column) - Use
map
again withbroom::tidy
as the function to give us clean columns for the model per country. - Use that multi-model tibble to plot graphs for each country.
Let us do this now!
gapminder_models <- gapminder %>%
group_by(continent, country) %>%
# Create a per-country tibble in a new column called "data_list"
nest(.key = "data_list")
gapminder_models
gapminder_models <- gapminder_models %>%
# We use mutate + map to add a list column containing linear models
mutate(model = map(.x = data_list,
# One column .x to iterate over
# The .x list column contains data frames
# So we access individual columns for target and predictors
# within these individual data frames
.f = \(.x) lm(lifeExp ~ year1950, data = .x)
)) %>%
# Use mutate + map again to expose the columns of the models
# Use broom:: tidy, broom::glance(), and
# Use broom::augment for separate columns
mutate(model_params = map(.x = model,
.f = \(.x) broom::tidy(.x,
conf.int = TRUE,
conf.lvel = 0.95)),
model_metrics = map(.x = model,
.f = \(.x) broom::glance(.x)),
model_augment = map(.x = model,
.f = \(.x) broom::augment(.x))
)
gapminder_models
We can now take this tibble with multiple models and use broom
to tidy
, to glance
at, and to augment
the models:
params <- gapminder_models %>%
select(continent, country,model_params, model_metrics) %>%
ungroup() %>%
# Now unpack the linear model parameters into columns
unnest(cols = model_params)
params
###
metrics <- gapminder_models %>%
select(continent, country, model_metrics) %>%
ungroup() %>%
# Now unpack the linear model parameters into columns
unnest(cols = model_metrics)
metrics
###
augments <- gapminder_models %>%
select(continent, country, model_augment) %>%
ungroup() %>%
# Now unpack the linear model parameters into columns
unnest(cols = model_augment)
augments
Model Visualization
We can now plot these models and their uncertainty (i.e Confidence Intervals). We can select a few of the countries and plot:
params_filtered <- params %>%
filter(country %in% c("India", "United States", "Brazil", "China"),
term == "year1950") %>%
select(country, estimate, conf.low, conf.high,p.value) %>%
arrange(estimate)
params_filtered
###
params_filtered %>%
gf_errorbar(conf.high + conf.low ~ reorder(country, estimate),
linewidth = ~ -log10(p.value), width = 0.3,
ylab = "Effect Size",
xlab = "Country",
title = "Effect of years on Life Expectancy",
caption = "Significance = - log10(p.value)") %>%
gf_point(estimate ~ reorder(country,estimate),
colour = "black", size = 4) %>%
gf_theme(theme_classic()) %>%
gf_refine(coord_flip(),
scale_linewidth_continuous("Significance",
range = c(0.2, 3))) %>%
gf_refine(guides(linewidth = guide_legend(reverse = TRUE)),
theme(axis.text.x = element_text(angle = 30, hjust = 1)))
But we can do better: visualize all models at once. What we will do is to plot the r.squared
on the x-axis and the model term year1950
on the y-axis. We will need to combine params
and metrics
to do this:
params_combo <- params %>%
select(continent, country, term , estimate) %>%
filter(term == "year1950") %>%
left_join(metrics %>% select(continent, country, r.squared))
params_combo
###
params_combo %>%
gf_point(reorder(country, r.squared) ~ r.squared,
color = "grey70") %>%
gf_point(reorder(country, r.squared) ~ r.squared,
data = params_combo %>% filter(continent == "Africa"),
shape = 21, size = 3,
fill = "salmon",
ylab = "Country",
title = "African Countries are Hard to Model") %>%
gf_label(60 ~ 0.25,
label = "African Countries",
fill = "salmon",
color = "black",
inherit = FALSE) %>%
gf_theme(theme_classic()) %>%
gf_refine(theme(axis.text.y = element_text(size = 3, face = "bold")))
###
params_combo %>%
gf_point(estimate ~ r.squared, color = "grey70") %>%
gf_point(estimate ~ r.squared,
data = params_combo %>%
filter(continent == "Africa"),
shape = 21,size =3,
fill = "salmon",
ylab = "Slope Estimate for Linear Model",
title = "African Countries are Hard to Model",
show.legend = FALSE) %>%
gf_label(0.3 ~ 0.25,
label = "African Countries",
fill = "salmon",
color = "black",
inherit = FALSE) %>%
gf_theme(theme_minimal())
As can be seen, there are many models with low values of r.squared
and these are sadly all about countries in Africa. The linear model fares badly for these countries, since there are other factors (not just year
) that affects lifeExp
in these countries.
We can look at the model metrics and see for which (African) countries the model fares the worst. We will reverse sort on r.squared
and choose the 5 worst models:
There are of course reasons for this: genocide in Rwanda, and hyper-inflation in Zimbabwe, and of course the HIV-AIDS pandemic. These reasons are not captured in the original gapminder
data!
One last plot! We can plot the model intercept
on the x-axis and the slope year
term on the y-axis to see where countries were in the beginning (1950) and at what rate they have improved in lifeExp
:
params %>%
select(continent, country, term , estimate) %>%
pivot_wider(
id_cols = c(continent, country),
names_from = term,
values_from = estimate
) %>%
left_join(metrics %>% select(continent, country, r.squared)) %>%
gf_point(
year1950 ~ `(Intercept)`,
color = ~ continent,
size = ~ r.squared,
xlab = "Baseline at 1950",
ylab = "Rate of Improvement",
title = "Asian Countries Show Improvement in Life Expectancy",
subtitle = "African Countries still struggling",
caption = "Data from Gapminder"
) %>%
gf_refine(scale_size(range = c(0.1, 4)),
scale_color_manual(
values =
c(
"Africa" = "salmon",
"Asia" = "limegreen",
"Americas" = "grey90",
"Europe" = "grey90",
"Oceania" = "grey90"
)
)) %>%
gf_refine(guides(size = guide_legend(reverse = TRUE))) %>%
gf_theme(theme_classic())
Many Asian countries were low in lifeExp
in 1950 and have shown good rates of improvement; r.squared
is also decent. Sadly African countries had low lifeExp
in 1950 and have not shown good rates of improvement.
Recent developments in dplyr
In recent times, the familiar dplyr
package also has experimental functions that are syntactically easier and offer pretty much purrr
-like capability, and without introducing the complexity of the list columns or list output.
Look the code below and decipher how it works:
# Using group_modify
gapminder_model_dplyr <- gapminder %>%
group_by(continent, country) %>%
# Here is the new function in dplyr!
# No need to use `mutate`
dplyr::group_modify(
.data = .,
# .f MUST generate a tibble here and *not* a list
# Hence broom::tidy is essential!
# glance/tidy is part of the group_map's .f variable.
# Applies to each model
.f = ~ lm(lifeExp ~ year, data = .) %>%
broom::glance(conf.int = TRUE, # try `tidy()` and `augment()`
conf.lvel = 0.95)
) %>%
# We already have a grouped tibble from `group_modify()`
# So just ungroup()
ungroup()
gapminder_model_dplyr
There is no nesting and un-nesting; the data is the familiar tibble throughout! This seems like a simple and elegant method.
dplyr::group_modify
Note: group_modify
is new experimental function in dplyr
(June 2023), as are group_map
, list_cbind
and list_rbind
. group_modify
requires that the operation in .f
generates a tibble, not a list, and we can retain the grouping variable easily too. We can remove the groups with ungroup
.
group_modify()
looks very clear and crisp, in my opinion. And very learner-friendly!
Conclusion
We have seen how purrr
simplifies the application of functions iteratively to large groups of data, in a faster, replicable, and less error-prone manner. The basic idea (see video below) is:
- Use tidyr::nest
to create a grouped data frame with a nested list column
- Use purrr::map_*
to create a model for each of these data frames in the list column. The model will also be a column(usually) containing a list
- Use broom::tidy
to convert the list model-column into a data frame for visualization
References
Rebecca Barter, Learn to purrr. https://www.rebeccabarter.com/blog/2019-08-19_purrr
Emorie Beck, Introduction to purrr. https://emoriebeck.github.io/R-tutorials/purrr/#
Sander Wuyts, purrr Tutorial. https://sanderwuyts.com/en/blog/purrr-tutorial/
Jared Wilber, Using the tidyverse for Machine Learning. https://www.jwilber.me/nest/
Dan Ovando,Data Wrangling and Model Fitting using purrr
Cormac Nolan, Modelling with Nested Data frames. https://github.com/cormac85/modelling_practice/blob/master/nested_data_frames.Rmd