Facing the Abyss

EDA

Workflow

Descriptive

Author

Arvind V

Published

October 21, 2023

Modified

November 17, 2024

Abstract

A complete EDA Workflow

A Data Analytics Process

So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?

The best charts come from understanding your data, asking good questions from it, and displaying the answers to those questions as clearly as possible.

Download this document as a Work Template

Hit the </>Code button at upper right to copy/save this very document as a Quarto Markdown template for your work. Delete the text that you don’t need, but keep most of the Sections as they are!

Setting up R Packages

Install packages using install.packages() in your Console.
Load up your libraries in a so-labelled setup chunk:

library(tidyverse)
library(mosaic)
library(ggformula)
library(ggridges)
library(skimr)
##
library(GGally)
library(corrplot)
library(corrgram)
library(crosstable) # Summary stats tables
library(kableExtra)
##
library(paletteer) # Colour Palettes for Peasants
##
## Add other packages here as needed, e.g.:
## scales/ggprism;
## ggstats/correlation;
## vcd/vcdExtra/ggalluvial/ggpubr;
## sf/tmap/osmplotr/rnaturalearth;
## igraph/tidygraph/ggraph/graphlayouts;

Use Namespace based Code

Warning

Try always to name your code-command with the package from whence it came! So use dplyr::filter() / dplyr::summarize() and not just filter() or summarize(), since these commands could exist across multiple packages, which you may have loaded last.

(One can also use the conflicted package to set this up, but this is simpler for beginners like us. )

Read Data

Use readr::read_csv(). Do not use read.csv().

Examine the Data

Use dplyr::glimpse()
Use mosaic::inspect() or skimr::skim()

Summarize the Data

Use dplyr::summarise() and/or crosstable::crosstable()
Highlight any interesting summary stats, missing data, or data imbalances

Data Dictionary and Experiment Description

A table containing the variable names, their interpretation, and their nature(Qual/Quant/Ord…)
If there are wrongly coded variables in the original data, state them in their correct form, so you can munge the data in the next step
Declare what might be target and predictor variables, based on available information of the experiment, or a description of the data.

Data Munging

Convert variables to factors as needed
Reformat / Rename other variables as needed
Clean badly formatted columns (e.g. text + numbers) using tidyr::separate_**_**()
Save the data as a modified file
Do not mess up the original data file

Form Hypotheses

Question-1

State the Question or Hypothesis.
(Temporarily) Drop variables using dplyr::select()
Create new variables if needed with dplyr::mutate()
Filter the data set using dplyr::filter()
Reformat data to wide/long if needed with tidyr::pivot_longer() or tidyr::pivot_wider()
Answer the Question with a Table, a Chart, a Test, using an appropriate Model for Statistical Inference
For Charts:
- Use title, subtitle, legend and scales appropriately in your chart
- Use a colour palette from the paletteer package that suits your message and taste. See references and commands at the end of this document.
- Prefer ggformula unless you are using a chart that is not yet supported therein (eg. ggbump::geom_bump, vcd::mosaic or ggstats::gglikert)
- Use gf_facet_*** as appropriate to show small multiple graphs for clarity

For Tables:
- Use crosstable::crosstable(...) %>% as_flextable() to create HTML tables of summaries
- Use df_print: paged in your YAML header to make nice paged tables for your data frames
- Use kableExtra::kable() %>% kable_paper(c("hover", "striped", "responsive"), full_width = F) or similar to make HTML tables of intermediate results/data where you think appropriate
For Statistical Tests:
- Use mosaic::.... to run your statistical tests(t, wilcox, prop, chi.square…), since it has a formula interface similar to ggformula
- Use broom::tidy() and/or broom::augment() to check your stat test results, and to convert them into tibbles for presentation and plotting.
- You could convert the output of broom:... into an HTML table using the kableExtra code shown above.
- Use supernova::supernova() to create friendly and clear ANOVA tables if needed:

Call:
   aov(formula = body_mass_g ~ species, data = penguins)

Terms:
                  species Residuals
Sum of Squares  145190219  70069447
Deg. of Freedom         2       330

Residual standard error: 460.7946
Estimated effects may be unbalanced

 Analysis of Variance Table (Type III SS)
 Model: body_mass_g ~ species

                                    SS  df           MS       F   PRE     p
 ----- --------------- | ------------- --- ------------ ------- ----- -----
 Model (error reduced) | 145190219.113   2 72595109.557 341.895 .6745 .0000
 Error (from model)    |  70069446.803 330   212331.657                    
 ----- --------------- | ------------- --- ------------ ------- ----- -----
 Total (empty model)   | 215259665.916 332   648372.488


  group_1   group_2       diff pooled_se      q    df    lower    upper p_adj
  <chr>     <chr>        <dbl>     <dbl>  <dbl> <int>    <dbl>    <dbl> <dbl>
1 Chinstrap Adelie      26.924    47.837  0.563   330 -132.353  186.201 .9164
2 Gentoo    Adelie    1386.273    40.241 34.450   330 1252.290 1520.255 .0000
3 Gentoo    Chinstrap 1359.349    49.532 27.444   330 1194.430 1524.267 .0000

Use mosaic::do(n) * stat-test(...) or use the infer package to run Permutation or Bootstrap Tests

Inference-1

Present the final Inference clearly in text, with clear reference to your chart, and perhaps p.values, confidence intervals from stats tests. . . . .

Question-n

….

Inference-n

….

Conclusion

Describe what you have done, what the graph(s) and test(s) shows and why it all so interesting. What could be done next?

References

https://shancarter.github.io/ucb-dataviz-fall-2013/classes/facing-the-abyss/
Colour Palettes

Over 2500 colour palettes are available in the paletteer package. Can you find tayloRswift? wesanderson? harrypotter? timburton? You could also find/define palettes that are in line with your Company’s logo / colour schemes.

Here are the Qualitative Palettes: (searchable)

And the Quantitative/Continuous palettes: (searchable)

Use the commands:

## For Qual variable-> colour/fill:
scale_colour_paletteer_d(
  name = "Legend Name",
  palette = "package::palette",
  dynamic = TRUE / FALSE
)

## For Quant variable-> colour/fill:
scale_colour_paletteer_c(
  name = "Legend Name",
  palette = "package::palette",
  dynamic = TRUE / FALSE
)

--- title: <iconify-icon icon="guidance:falling-rocks" width="1.2em" height="1.2em"></iconify-icon><iconify-icon icon="game-icons:falling" width="1.2em" height="1.2em"></iconify-icon> Facing the Abyss author: "Arvind V" date: 21/Oct/2023 date-modified: "`r Sys.Date()`" abstract-title: "Abstract" abstract: "A complete EDA Workflow" order: 05 df-print: paged image: preview.jpeg image-alt: Image by rawpixel.com code-tools: true categories: - EDA - Workflow - Descriptive --- ## A Data Analytics Process So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what? The best charts come from understanding your data, asking good questions from it, and displaying the answers to those questions as clearly as possible. ::: callout-note ### Download this document as a Work Template Hit the `</>Code` button at upper right to copy/save this very document as a Quarto Markdown template for your work. Delete the text that you don't need, but keep most of the Sections as they are! ::: ## {{< iconify noto-v1 package >}} Setting up R Packages 1. Install packages using `install.packages()` in your Console. 1. Load up your libraries in a so-labelled `setup` chunk: ```{r} #| label: setup #| echo: true #| include: true #| message: false #| warning: false library(tidyverse) library(mosaic) library(ggformula) library(ggridges) library(skimr) ## library(GGally) library(corrplot) library(corrgram) library(crosstable) # Summary stats tables library(kableExtra) ## library(paletteer) # Colour Palettes for Peasants ## ## Add other packages here as needed, e.g.: ## scales/ggprism; ## ggstats/correlation; ## vcd/vcdExtra/ggalluvial/ggpubr; ## sf/tmap/osmplotr/rnaturalearth; ## igraph/tidygraph/ggraph/graphlayouts; ``` ### Use Namespace based Code ::: callout-warning Try always to **name** your code-command with the package from whence it came! So use `dplyr::filter()` / `dplyr::summarize()` and **not** just `filter()` or `summarize()`, since these commands could exist across multiple packages, which you may have loaded **last**. (One can also use the `conflicted` package to set this up, but this is simpler for beginners like us. ) ::: ## {{< iconify ic baseline-input >}} Read Data - Use `readr::read_csv()`. Do **not** use `read.csv()`. ## {{< iconify file-icons influxdata >}} Examine the Data - Use `dplyr::glimpse()` - Use `mosaic::inspect()` or `skimr::skim()` ## {{< iconify file-icons influxdata >}} Summarize the Data - Use `dplyr::summarise()` and/or `crosstable::crosstable()` - Highlight any interesting summary stats, missing data, or data imbalances ## {{< iconify streamline dictionary-language-book-solid >}} Data Dictionary and Experiment Description - A table containing the variable names, their interpretation, and their nature(Qual/Quant/Ord...) - If there are *wrongly coded* variables in the original data, state them in their correct form, so you can munge the data in the next step - Declare what might be *target* and *predictor* variables, based on available information of the **experiment**, or a description of the data. ## {{< iconify carbon clean >}} Data Munging - Convert variables to factors as needed - Reformat / Rename other variables as needed - Clean badly formatted columns (e.g. text + numbers) using `tidyr::separate_**_**()` - **Save the data as a modified file** - **Do not mess up the original data file** ## {{< iconify material-symbols lab-research >}} Form Hypotheses ### Question-1 - State the Question or Hypothesis. - (Temporarily) Drop variables using `dplyr::select()` - Create new variables if needed with `dplyr::mutate()` - Filter the data set using `dplyr::filter()` - Reformat data to wide/long if needed with `tidyr::pivot_longer()` or `tidyr::pivot_wider()` - Answer the Question with a Table, a Chart, a Test, using an appropriate Model for Statistical Inference - For Charts: - Use `title`, `subtitle`, `legend` and `scales` appropriately in your chart - Use a colour palette from the `paletteer` package that suits your message and taste. See references and commands at the end of this document. - Prefer `ggformula` unless you are using a chart that is not yet supported therein (eg. `ggbump::geom_bump`, `vcd::mosaic` or `ggstats::gglikert`) - Use `gf_facet_***` as appropriate to show small multiple graphs for clarity ```{r} #| label: figure-1 #| fig-showtext: true #| fig-format: png #| echo: false ## Set graph theme ## Idiotic that we have to repeat this every chunk ## Open issue in Quarto theme_set(new = theme_classic()) ### library(palmerpenguins) penguins %>% drop_na() %>% gf_point(body_mass_g ~ flipper_length_mm, colour = ~ species) %>% gf_labs(title = "My First Penguins Plot", subtitle = "Using ggformula", x = "Flipper Length mm", y = "Body Mass gms", caption = "I love penguins, and R") ``` - For Tables: - Use `crosstable::crosstable(...) %>% as_flextable()` to create HTML tables of summaries - Use `df_print: paged` in your YAML header to make nice paged tables for your data frames - Use `kableExtra::kable() %>% kable_paper(c("hover", "striped", "responsive"), full_width = F)` or similar to make HTML tables of intermediate results/data where you think appropriate - For Statistical Tests: - Use `mosaic::....` to run your statistical tests(t, wilcox, prop, chi.square...), since it has a formula interface similar to `ggformula` - Use `broom::tidy()` and/or `broom::augment()` to check your stat test results, and to convert them into tibbles for presentation and plotting. - You could convert the output of `broom:...` into an HTML table using the `kableExtra` code shown above. - Use `supernova::supernova()` to create friendly and clear ANOVA tables if needed: ```{r} #| echo: false #| message: false #| warning: false ## Set graph theme theme_set(theme_classic()) ### library(palmerpenguins) penguins <- penguins %>% drop_na() penguins_anova <- aov(body_mass_g ~ species, data = penguins) penguins_anova supernova::supernova(penguins_anova) supernova::pairwise(penguins_anova, plot = T,alpha = 0.05,var_equal = T) ``` - Use `mosaic::do(n) * stat-test(...)` or use the `infer` package to run Permutation or Bootstrap Tests ### Inference-1 - Present the final Inference clearly in text, with clear reference to your chart, and perhaps `p.values`, `confidence intervals` from stats tests. . . . . ### Question-n .... ### Inference-n .... ## {{< iconify fluent-mdl2 decision-solid >}} {{< iconify ic outline-interests >}}{{< iconify carbon chart-3d >}} Conclusion Describe what you have done, what the graph(s) and test(s) shows and why it all so interesting. What could be done next? ## {{< iconify ooui references-rtl >}} References 1. <https://shancarter.github.io/ucb-dataviz-fall-2013/classes/facing-the-abyss/> 2. Colour Palettes Over 2500 colour palettes are available in the `paletteer` package. Can you find `tayloRswift`? `wesanderson`? `harrypotter`? `timburton`? You could also find/define palettes that are in line with your Company's logo / colour schemes. Here are the Qualitative Palettes: (searchable) ```{r} #| echo: false library(reactable) palettes_d_names %>% reactable::reactable(data = ., filterable = TRUE, minRows = 10) ``` And the Quantitative/Continuous palettes: (searchable) ```{r} #| echo: false palettes_c_names %>% reactable::reactable(data = ., filterable = TRUE, minRows = 10) ``` Use the commands: ```{r} #| eval: false #| echo: true ## For Qual variable-> colour/fill: scale_colour_paletteer_d(name = "Legend Name", palette = "package::palette", dynamic = TRUE/FALSE) ## For Quant variable-> colour/fill: scale_colour_paletteer_c(name = "Legend Name", palette = "package::palette", dynamic = TRUE/FALSE) ```