Evolution and Flow

Line and Area Plots

Dumbbell Plots

Parallel Set Plots

Alluvial Plots

Sankey Diagrams

Chord Diagrams

Bump Charts

Author

Arvind V.

Published

November 22, 2022

Modified

September 22, 2024

Abstract

Changes in Information over Space and Time

Slides and Tutorials

R Tutorial

“My stories run up and bite me in the leg – I respond by writing them down – everything that goes on during the bite. When I finish, the idea lets go and runs off.”

— Ray Bradbury, science-fiction writer (22 Aug 1920-2012)

Setting up R Packages

library(tidyverse)
library(ggstream)
library(ggformula)
# remotes::install_github("corybrunson/ggalluvial@main", build_vignettes = TRUE)
library(ggalluvial)
library(ggsankeyfier)
# install.packages("devtools")
# devtools::install_github("davidsjoberg/ggsankey")
library(ggsankey)
library(networkD3)
library(echarts4r) # Interactive graphs

What Time Evolution Charts can we plot?

In these cases, the x-axis is typically time…and we chart the variable of another Quant variable with respect to time, using a line geometry.

Let is take a healthcare budget dataset from Our World in Data: We will plot graphs for 5 countries (India, China, Brazil, Russia, Canada ).

And Introducting echarts4r

We will also build interactive versions of these charts using echarts4r!

Download this data by clicking on the button below:

health <-
  read_csv("data/public-health-expenditure-share-GDP-OWID.csv")

health_filtered <- health %>%
  filter(Entity %in% c(
    "India",
    "China",
    "United States",
    "United Kingdom",
    "Russia",
    "Sweden"
  ))

Using ggformula
Using echarts4r

# Set graph theme
theme_set(new = theme_custom())

gf_point(
  data = health_filtered,
  public_health_expenditure_pc_gdp ~ Year,
  colour = ~Entity,
  ylab = "Healthcare Budget\n as % of GDP",
  title = "Line Charts to show Evolution (over Time )"
) %>%
  gf_line()
###
gf_area(
  data = health_filtered,
  public_health_expenditure_pc_gdp ~ Year,
  fill = ~Entity, alpha = 0.3,
  ylab = "Healthcare Budget\n as % of GDP",
  title = "Area Charts to show Evolution (over Time )"
) %>%
  gf_line(colour = ~Entity)

health_filtered %>%
  group_by(Entity) %>%
  e_charts(Year) %>%
  e_scatter(public_health_expenditure_pc_gdp) %>%
  e_line(public_health_expenditure_pc_gdp) %>%
  e_x_axis(name = "Year", min = 1850, max = 2050) %>%
  e_y_axis(
    name = "Public Health Expenditure",
    nameLocation = "middle", nameGap = 25
  ) %>%
  e_tooltip()
###
health_filtered %>%
  group_by(Entity) %>%
  e_charts(Year) %>%
  e_scatter(public_health_expenditure_pc_gdp) %>%
  e_area(public_health_expenditure_pc_gdp) %>%
  e_x_axis(name = "Year", min = 1850, max = 2050) %>%
  e_y_axis(
    name = "Public Health Expenditure",
    nameLocation = "middle", nameGap = 25
  ) %>%
  e_tooltip()

What Space Evolution Charts can we plot?

Here, the space can be any Qual variable, and we can chart another Quant or Qual variable move across levels of the first chosen Qual variable.

For instance we can contemplate enrollment at a University, and show how students move from course to course in a University. Or how customers drift from one category of products or brands to another….or the movement of cricket players from one IPL Team to another !!

Here is what Thomas Lin Pedersen says:

A parallel sets diagram is a type of visualisation showing the interaction between multiple categorical variables. If the variables have an intrinsic order the representation can be thought of as a Sankey Diagram. If each variable is a point in time it will resemble an Alluvial diagram.

The Qualitative variables being connected are mapped to stages/axes
Each level within a Qual variable is mapped to nodes / strata / lodes;
And the connections between the strata of the axes are called flows / edges / links / alluvia.

Such diagrams are best used when you want to show a many-to-many mapping between two domains or multiple paths through a set of stages E.g Students going through multiple courses during a semester of study.

Here is an example of a Sankey Diagram: This diagram show how energy is converted or transmitted before being consumed or lost: supplies are on the left, and demands are on the right. (Data: Department of Energy & Climate Change via Tom Counsell)¹:

Switching to ggplot here

For the next few charts, there are (as yet) no equivalents in ggformula. Hence we will use ggplot.

Case Study-1: Titanic Dataset

# library(ggalluvial)
data("Titanic")
Titanic <- Titanic %>% as_tibble()
Titanic

Table Form Data

Note that this data is in tidy wide / table form, with separate columns for each Qualitative variable and a separate count column, which we saw when we examined Categorical Data. This is, in my opinion, intuitively the best form of data to plot a Sankey plot with. But there are other forms such as the tidy long form which we have been using practically all this while. You will find examples of on the ggalluvial website using tidy long form data. https://corybrunson.github.io/ggalluvial/

# Set graph theme
theme_set(new = theme_custom())

##
Titanic %>% ggplot(
  data = .,

  # Select the Categorical Variables for the vertical Axes / Stages
  aes(
    axis1 = Class,
    axis2 = Sex,
    axis3 = Age,
    axis4 = Survived,
    y = n
  ), fill = "white"
) +

  # Alluvials between Categorical Axes
  geom_alluvium(aes(fill = Survived),
    colour = "black",
    linewidth = 0.25
  ) +

  # Vertical segments for each Categorical Variable2
  geom_stratum(
    colour = "black",
    linewidth = 1,
    fill = "white"
  ) +

  # Labels for each "level" of the Categorical Axes
  geom_text(
    stat = "stratum",
    aes(label = after_stat(stratum))
  ) +



  # Scales and Colours
  scale_x_discrete(
    limits = c("Class", "Sex", "Age"),
    expand = c(.2, .05)
  ) +
  scale_fill_manual(values = c("red3", "springgreen3")) +
  xlab("Demographic") +
  ggtitle(
    "Passengers on the maiden voyage of the Titanic",
    "Stratified by demographics and survival"
  )

Here is how the package ggalluvial defines the elements of a typical alluvial plot:

An axis is a dimension (variable) along which the data are vertically arranged at a fixed horizontal position. The plot above uses three categorical axes: Class, Sex, and Age.

The groups at each axis are depicted as opaque blocks called strata. For example, the Class axis contains four strata: 1st, 2nd, 3rd, and Crew.

Horizontal (x-) splines called alluvia span the entire width of the plot. In this plot, each alluvium corresponds to a fixed strata value of each axis variable, indicated by its vertical position at the axis, as well as of the Survived variable, indicated by its fill color.

The segments of the alluvia between pairs of adjacent axes are flows.

The alluvia intersect the strata at lodes. The lodes are not visualized in the above plot, but they can be inferred as filled rectangles extending the flows through the strata at each end of the plot or connecting the flows on either side of the center stratum.

The ggsankeyfier also plots alluvial and sankey diagrams. Just for variety, we will convert the Titanic data to long form and then plot it with ggsankeyfier.

Titanic %>%
  as_tibble() %>%
  ggsankeyfier::pivot_stages_longer(
    data = .,
    stages_from = c("Class", "Sex", "Age", "Survived"),
    values_from = "n",
    additional_aes_from = "Survived"
  ) -> Titanic_long
Titanic_long

This data is in long form, with stages defining the axes in the graph, and the node variable giving us levels within each (Qualitative) axis. The edge_id labels both ends (from and to) of each connector or edge/flow/alluvium. Let us plot this now:

Titanic_long %>%
  ggplot(aes(
    x = stage, y = n,
    group = node, connector = connector,
    edge_id = edge_id
  )) +
  geom_sankeynode(v_space = "auto") +
  geom_sankeyedge(aes(fill = Survived), v_space = "auto") +
  labs(x = "")

Let us make an interactive graph for this dataset using echarts4.

ClassSex <-
  Titanic %>%
  group_by(Class, Sex) %>%
  summarise(cs = sum(n)) %>%
  ungroup() %>%
  rename("source" = Class, "target" = Sex, "value" = cs)

SexAge <-
  Titanic %>%
  group_by(Sex, Age) %>%
  summarise(sa = sum(n)) %>%
  ungroup() %>%
  rename("source" = Sex, "target" = Age, "value" = sa)

AgeSurvived <-
  Titanic %>%
  group_by(Age, Survived) %>%
  summarise(as = sum(n)) %>%
  ungroup() %>%
  rename("source" = Age, "target" = Survived, "value" = as)

Combo <- rbind(ClassSex, SexAge, AgeSurvived)
Combo

Combo %>%
  e_charts() %>%
  e_sankey(source, target, value) %>%
  e_title("Titanic: Who lived, and who didn't?") %>%
  e_tooltip()

The process with echarts4r is quite different, since the data structure used by this package is different:

The echarts4r package needs to have source and target columns for axes, along with a value to determine the width of the alluvium.
The names in the source and target can repeat, and can appear in both source and target columns in order to create a multi-axis diagram. Hence the data needs to be inherently in long form.
However, for the values, we need to manually calculate the aggregate totals for alluvia between each consecutive pairs of axes (i.e Qual variables). This is not done automatically in echarts4r, but it is with ggalluvial.
So we create grouped aggregate summaries for each pair of Qualitative variables that we wish to plot consecutively ( i.e as axis1, axis2…)
Stack these pair-wise alluvia totals into one combo data frame using rbind(), after renaming the variables to “source”, “target” and “value”.

Phew! seems like too much work to do…I wonder if good, old-fashioned pivot-longer will get us here…

Chord Diagram

We will explore this diagram when we explore network graphs with the tidygraph and ggraph packages.

Dumbbell Plots

A simple plot that can quickly indicate changes in multiple variables/aspects over either a time or a space variable is a dumbbell plot. This is a combination of scatter plot + a segment plot. Let us take our previously loaded health dataset and plot just the change in expenditure for multiple countries, across a time span of 8 years (2010 - 2018)

Using ggformula
Using ggplot

# Set graph theme
theme_set(new = theme_custom())
##
health_2010_2018 <- health %>%
  # select Years 2010 and 2018
  filter(Year %in% c(2010, 2018)) %>%
  # Make separate columns for each year, easier that way
  # Though not essential
  pivot_wider(
    id_cols = c(Entity, Code),
    names_from = Year,
    names_prefix = "Year",
    values_from = public_health_expenditure_pc_gdp
  )
health_2010_2018

# Set graph theme
theme_set(new = theme_custom())
##
health_2010_2018 %>%
  # remove NA data across the data set
  drop_na() %>%
  # take the top 20 countries based on 2018 allocation
  slice_max(n = 20, order_by = Year2018) %>%
  gf_segment(Entity + Entity ~ Year2010 + Year2018,
    colour = "grey",
    linewidth = 2
  ) %>%
  gf_point(Entity ~ Year2018,
    colour = ~"2018"
  ) %>%
  gf_point(Entity ~ Year2010,
    colour = ~"2010"
  )

## Can we do better? Sort the bars, improve axis ticks, title..
# Set graph theme
theme_set(new = theme_custom())
##
health_2010_2018 %>%
  # remove NA data across the data set
  drop_na() %>%
  # take the top 20 countries based on 2018 allocation
  slice_max(n = 20, order_by = Year2018) %>%
  # plot segments first
  gf_segment(
    reorder(Entity, Year2018) + reorder(Entity, Year2018) ~
      Year2010 + Year2018,
    colour = "grey",
    linewidth = 2
  ) %>%
  # Then plot points
  gf_point(reorder(Entity, Year2018) ~ Year2018,
    colour = ~"2018",
    size = 3
  ) %>%
  gf_point(
    reorder(Entity, Year2018) ~ Year2010,
    colour = ~"2010", size = 3,
    xlab = "Health Expenditure as Percentage of GDP",
    ylab = "Country",
    title = "Healthcare Budgets Changes between 2010 to 2018",
    subtitle = "Bars are Sorted",
    caption = "And the X-Axis is in percentage"
  ) %>%
  gf_refine(
    scale_x_continuous(
      breaks = scales::breaks_width(2),
      labels = scales::label_percent(suffix = "%", scale = 1)
    ),
    scale_colour_manual(name = "Year", values = c("red", "green"))
  )

# Set graph theme
theme_set(new = theme_custom())


health_2010_2018 <- health %>%
  # select Years 2010 and 2018
  filter(Year %in% c(2010, 2018)) %>%
  # Make separate columns for each year, easier that way
  # Though not essential
  pivot_wider(
    id_cols = c(Entity, Code),
    names_from = Year,
    names_prefix = "Year",
    values_from = public_health_expenditure_pc_gdp
  )

health_2010_2018 %>%
  # remove NA data across the data set
  drop_na() %>%
  # take the top 20 countries based on 2018 allocation
  slice_max(n = 20, order_by = Year2018) %>%
  ggplot() +
  geom_segment(
    aes(
      y = Entity, yend = Entity,
      x = Year2010, xend = Year2018
    ),
    colour = "grey",
    linewidth = 2
  ) +
  geom_point(aes(y = Entity, x = Year2018, colour = "2018")) +
  geom_point(aes(y = Entity, x = Year2010, colour = "2010"))
## Can we do better?

health_2010_2018 %>%
  # remove NA data across the data set
  drop_na() %>%
  # take the top 20 countries based on 2018 allocation
  slice_max(n = 20, order_by = Year2018) %>%
  ggplot() +
  # plot segments first
  geom_segment(
    aes(
      y = reorder(Entity, Year2018), yend = reorder(Entity, Year2018),
      x = Year2010, xend = Year2018
    ),
    colour = "grey",
    linewidth = 2
  ) +

  # Then plot points
  geom_point(aes(
    y = reorder(Entity, Year2018), x = Year2018,
    colour = "2018"
  ), size = 3) +
  geom_point(aes(
    y = reorder(Entity, Year2018), x = Year2010,
    colour = "2010"
  ), size = 3) +
  labs(
    x = "Health Expenditure as Percentage of GDP",
    y = "Country", title = "Healthcare Budgets",
    subtitle = "Changes between 2010 to 2018"
  ) +
  scale_x_continuous(
    breaks = breaks_width(2),
    labels = scales::label_percent(suffix = "%", scale = 1)
  ) +
  scale_colour_manual(name = "Year", values = c("red", "green"))

Wait, But Why?

Changes can be over time, or over “space”
In the latter case, we can think of some Quantity changing over (multiple levels of) multiple Qualitative variables. E.g. Sales over Product Type over Showroom Location over Festival Season…
When a single Quant varies over a single multi-level Qual, the Chord Diagram may be simpler than the Sankey/Alluvial. E.g Bird migration across Multiple Locations. This can even show bidirectional changes. ( Sankeys with loops are also possible, however)
When you have a Quant that changes over only one two-level Qual variable, the Dumbbell plot becomes an option.

Conclusion

We see that we can visualize “evolutions” over time and space. The evolutions can represent changes in the quantities of things, or their categorical affiliations or groups.

What business/design data would you depict in this way? Revenue streams? Employment? Expenditures over time and market? Migration? App usage patterns? There are many possibilities!

Note also that the Bump Charts are a special case of Alluvial/Sankey charts where each node connects/flows to only one other node.

Your Turn

Within the ggalluvial package are two datasets, majors and vaccinations. Plot alluvial charts for both of these.
Go to the American Life Panel Website where you will find many public datasets. Try to take one and make charts from it that we have learned in this Module.

References

Global Migration, https://download.gsb.bund.de/BIB/global_flow/ A good example of the use of a Chord Diagram.
ggalluvial cheatsheet,https://cheatography.com/seleven/cheat-sheets/ggalluvial/
John Coene, Sankey plots with echarts4r, https://echarts4r.john-coene.com/articles/chart_types.html#sankey
Other packages: Sankey plot | the R Graph Gallery (r-graph-gallery.com)
Another package: Sankey diagrams in ggplot2 with ggsankey | RCHARTS (r-charts.com)
Sankey Charts using networkD3: http://christophergandrud.github.io/networkD3

R Package Citations

Package	Version	Citation
echarts4r	0.4.5	Coene (2023)
ggalluvial	0.12.5	Brunson (2020); Brunson and Read (2023)
ggsankey	0.0.99999	Sjoberg (2024)
ggsankeyfier	0.1.8	de Vries (2024)
ggstream	0.1.0	Sjoberg (2021)
networkD3	0.4	Allaire et al. (2017)

Allaire, J. J., Christopher Gandrud, Kenton Russell, and CJ Yetman. 2017. networkD3: D3 JavaScript Network Graphs from r. https://CRAN.R-project.org/package=networkD3.

Brunson, Jason Cory. 2020. “ggalluvial: Layered Grammar for Alluvial Plots.” Journal of Open Source Software 5 (49): 2017. https://doi.org/10.21105/joss.02017.

Brunson, Jason Cory, and Quentin D. Read. 2023. “ggalluvial: Alluvial Plots in ‘ggplot2’.” http://corybrunson.github.io/ggalluvial/.

Coene, John. 2023. Echarts4r: Create Interactive Graphs with “Echarts JavaScript” Version 5. https://CRAN.R-project.org/package=echarts4r.

de Vries, Pepijn. 2024. ggsankeyfier: Create Sankey and Alluvial Diagrams Using “ggplot2”. https://CRAN.R-project.org/package=ggsankeyfier.

Sjoberg, David. 2021. ggstream: Create Streamplots in “ggplot2”. https://CRAN.R-project.org/package=ggstream.

———. 2024. ggsankey: Sankey, Alluvial and Sankey Bump Plots. https://github.com/davidsjoberg/ggsankey.

Footnotes

D3 JavaScript Network Graphs from R: christophergandrud.github.io/networkD3/↩︎

Citation

BibTeX citation:

@online{v.2022,
  author = {V., Arvind},
  title = {\textless Iconify-Icon
    Icon=“carbon:sankey-Diagram”\textgreater\textless/Iconify-Icon\textgreater{}
    {Evolution} and {Flow}},
  date = {2022-11-22},
  url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/70-EvolutionFlow/},
  langid = {en},
  abstract = {Changes in Information over Space and Time}
}

For attribution, please cite this work as:

V., Arvind. 2022. “<Iconify-Icon Icon=‘carbon:sankey-Diagram’></Iconify-Icon> Evolution and Flow.” November 22, 2022. https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/70-EvolutionFlow/.