Facing the Abyss

EDA

Workflow

Descriptive

Author

Arvind V

Published

October 21, 2023

Modified

November 4, 2024

Abstract

A complete EDA Workflow

A Data Analytics Process

So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?

The best charts come from understanding your data, asking good questions from it, and displaying the answers to those questions as clearly as possible.

Setting up R Packages

Install packages using install.packages() in your Console.
Load up your libraries in a setup chunk:

knitr::opts_chunk$set(dev = "ragg_png")

library(tidyverse)
library(mosaic)
library(palmerpenguins)
library(ggformula)
library(ggridges)
library(skimr)
##
library(GGally)
library(corrplot)
library(corrgram)

Go to https://fonts.google.com/ and choose some professional looking, or funky looking, fonts.

library(extrafont)
extrafont::loadfonts(quiet = TRUE)
##
library(showtext)
## Loading Google fonts (https://fonts.google.com/)
font_add_google(name = "Fira Sans Condensed", family = "fira")
font_add_google("Gochi Hand", "gochi")
font_add_google("Schoolbell", "bell")
font_add_google("Montserrat Alternates", "montserrat")
font_add_google("Roboto Condensed", "roboto")
### Automatically use showtext to render text
showtext_auto()

Read Data

Use readr::read_csv()

Examine Data

Use dplyr::glimpse()
Use mosaic::inspect() or skimr::skim()
Use dplyr::summarise() and crosstable::crosstable()
Format your tables with knitr::kable()
Highlight any interesting summary stats or data imbalances

Data Dictionary and Experiment Description

A table containing the variable names, their interpretation, and their nature(Qual/Quant/Ord…)
If there are wrongly coded variables in the original data, state them in their correct form, so you can munge the in the next step
Declare what might be target and predictor variables, based on available information of the experiment, or a description of the data

Data Munging

Convert variables to factors as needed
Reformat / Rename other variables as needed
Clean badly formatted columns (e.g. text + numbers) using tidyr::separate_**_**()
Save the data as a modified file
Do not mess up the original data file

Form Hypotheses

Question-1

State the Question or Hypothesis
(Temporarily) Drop variables using dplyr::select()
Create new variables if needed with dplyr::mutate()
Filter the data set using dplyr::filter()
Reformat data if needed with tidyr::pivot_longer() or tidyr::pivot_wider()
Answer the Question with a Table, a Chart, a Test, using an appropriate Model for Statistical Inference
Use title, subtitle, legend and scales appropriately in your chart
Prefer ggformula unless you are using a chart that is not yet supported therein (eg. ggbump() or plot_likert())

## Set graph theme
## Idiotic that we have to repeat this every chunk
## Open issue in Quarto

penguins %>%
  drop_na() %>%
  gf_point(body_mass_g ~ flipper_length_mm,
    colour = ~species
  ) %>%
  gf_labs(
    title = "My First Penguins Plot",
    subtitle = "Using ggformula with fonts",
    x = "Flipper Length mm", y = "Body Mass gms",
    caption = "I love penguins, and R"
  ) %>%
  gf_theme(theme_classic()) %>%
  gf_theme(theme(
    panel.grid.minor = element_blank(),
    ###
    text = element_text(family = "fira", size = 14),
    ###
    plot.title = element_text(
      family = "roboto",
      face = "bold",
      size = 28, hjust = 0
    ),
    plot.subtitle = element_text(
      family = "montserrat",
      face = "bold",
      size = 18, hjust = 0
    ),
    plot.margin = margin(2, 2, 2, 2, unit = "pt"),
    axis.title = element_text(size = 20),
    plot.caption = element_text(family = "gochi", size = 14),
    legend.title = element_text(
      family = "bell",
      face = "bold",
      size = 20
    ),
    legend.text = element_text(
      family = "fira",
      size = 12
    ),
    legend.background = element_rect(
      fill = "cornsilk",
      colour = "black"
    ),
    legend.margin = margin(
      t = 2,
      r = 2,
      b = 2,
      l = 2,
      unit = "pt"
    )
  ))

Inference-1

. . . .

Question-n

….

Inference-n

….

One Most Interesting Graph

Conclusion

Describe what the graph shows and why it so interesting. What could be done next?

References

https://shancarter.github.io/ucb-dataviz-fall-2013/classes/facing-the-abyss/

--- title: <iconify-icon icon="guidance:falling-rocks" width="1.2em" height="1.2em"></iconify-icon><iconify-icon icon="game-icons:falling" width="1.2em" height="1.2em"></iconify-icon> Facing the Abyss author: "Arvind V" date: 21/Oct/2023 date-modified: "`r Sys.Date()`" abstract-title: "Abstract" abstract: "A complete EDA Workflow" order: 200 image: preview.jpeg image-alt: Image by rawpixel.com code-tools: true categories: - EDA - Workflow - Descriptive --- ## A Data Analytics Process So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what? The best charts come from understanding your data, asking good questions from it, and displaying the answers to those questions as clearly as possible. ## {{< iconify noto-v1 package >}} Setting up R Packages 1. Install packages using `install.packages()` in your Console. 1. Load up your libraries in a `setup` chunk: ```{r} #| label: setup #| include: true #| message: false #| warning: false knitr::opts_chunk$set(dev = "ragg_png") library(tidyverse) library(mosaic) library(palmerpenguins) library(ggformula) library(ggridges) library(skimr) ## library(GGally) library(corrplot) library(corrgram) ``` Go to <https://fonts.google.com/> and choose some professional looking, or funky looking, fonts. ```{r} #| label: themes-and-fonts library(extrafont) extrafont::loadfonts(quiet = TRUE) ## library(showtext) ## Loading Google fonts (https://fonts.google.com/) font_add_google(name = "Fira Sans Condensed", family = "fira") font_add_google("Gochi Hand", "gochi") font_add_google("Schoolbell", "bell") font_add_google("Montserrat Alternates", "montserrat") font_add_google("Roboto Condensed", "roboto") ### Automatically use showtext to render text showtext_auto() ``` ## {{< iconify ic baseline-input >}} Read Data - Use `readr::read_csv()` ## {{< iconify file-icons influxdata >}} Examine Data - Use `dplyr::glimpse()` - Use `mosaic::inspect()` or `skimr::skim()` - Use `dplyr::summarise()` and `crosstable::crosstable()` - Format your tables with `knitr::kable()` - Highlight any interesting summary stats or data imbalances ## {{< iconify streamline dictionary-language-book-solid >}} Data Dictionary and Experiment Description - A table containing the variable names, their interpretation, and their nature(Qual/Quant/Ord...) - If there are *wrongly coded* variables in the original data, state them in their correct form, so you can munge the in the next step - Declare what might be *target* and *predictor* variables, based on available information of the **experiment**, or a description of the data ## {{< iconify carbon clean >}} Data Munging - Convert variables to factors as needed - Reformat / Rename other variables as needed - Clean badly formatted columns (e.g. text + numbers) using `tidyr::separate_**_**()` - **Save the data as a modified file** - **Do not mess up the original data file** ## {{< iconify material-symbols lab-research >}} Form Hypotheses ### Question-1 - State the Question or Hypothesis - (Temporarily) Drop variables using `dplyr::select()` - Create new variables if needed with `dplyr::mutate()` - Filter the data set using `dplyr::filter()` - Reformat data if needed with `tidyr::pivot_longer()` or `tidyr::pivot_wider()` - Answer the Question with a Table, a Chart, a Test, using an appropriate Model for Statistical Inference - Use `title`, `subtitle`, `legend` and `scales` appropriately in your chart - Prefer `ggformula` unless you are using a chart that is not yet supported therein (eg. `ggbump()` or `plot_likert()`) ```{r} #| label: figure-1 #| fig-showtext: true #| fig-format: png ## Set graph theme ## Idiotic that we have to repeat this every chunk ## Open issue in Quarto penguins %>% drop_na() %>% gf_point(body_mass_g ~ flipper_length_mm, colour = ~ species) %>% gf_labs(title = "My First Penguins Plot", subtitle = "Using ggformula with fonts", x = "Flipper Length mm", y = "Body Mass gms", caption = "I love penguins, and R") %>% gf_theme(theme_classic()) %>% gf_theme(theme( panel.grid.minor = element_blank(), ### text = element_text(family = "fira", size = 14), ### plot.title = element_text( family = "roboto", face = "bold", size = 28, hjust = 0 ), plot.subtitle = element_text( family = "montserrat", face = "bold", size = 18, hjust = 0), plot.margin = margin(2,2,2,2, unit = "pt"), axis.title = element_text(size = 20), plot.caption = element_text(family = "gochi", size = 14), legend.title = element_text( family = "bell", face = "bold", size = 20 ), legend.text = element_text(family = "fira", size = 12), legend.background = element_rect(fill = "cornsilk", colour = "black"), legend.margin = margin( t = 2, r = 2, b = 2, l = 2, unit = "pt" ) )) ``` ### Inference-1 . . . . ### Question-n .... ### Inference-n .... ## {{< iconify ic outline-interests >}}{{< iconify carbon chart-3d >}} One Most Interesting Graph ## {{< iconify fluent-mdl2 decision-solid >}} Conclusion Describe what the graph shows and why it so interesting. What could be done next? ## {{< iconify ooui references-rtl >}} References 1. <https://shancarter.github.io/ucb-dataviz-fall-2013/classes/facing-the-abyss/>