Facing the Abyss

Categories: EDA, Workflow, Descriptive
Author: Arvind V
Published: October 21, 2023
Modified: November 17, 2024
Abstract: A complete EDA Workflow

A Data Analytics Process

So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?

The best charts come from understanding your data, asking good questions of it, and displaying the answers to those questions as clearly as possible.

Note: Download this document as a Work Template

Hit the </>Code button at upper right to copy/save this very document as a Quarto Markdown template for your work. Delete the text that you don’t need, but keep most of the Sections as they are!

Setting up R Packages

  1. Install packages using install.packages() in your Console.
  2. Load up your libraries in a so-labelled setup chunk:
library(tidyverse)
library(mosaic)
library(ggformula)
library(ggridges)
library(skimr)
##
library(GGally)
library(corrplot)
library(corrgram)
library(crosstable) # Summary stats tables
library(kableExtra)
##
library(paletteer) # Colour Palettes for Peasants
##
## Add other packages here as needed, e.g.:
## scales/ggprism;
## ggstats/correlation;
## vcd/vcdExtra/ggalluvial/ggpubr;
## sf/tmap/osmplotr/rnaturalearth;
## igraph/tidygraph/ggraph/graphlayouts;

Use Namespace based Code

Warning

Always try to prefix each function call with the package it comes from! So use dplyr::filter() / dplyr::summarize() and not just filter() or summarize(): functions with these names exist in several packages, and a bare call picks whichever package you happened to load last.

(One can also use the conflicted package to set this up, but this is simpler for beginners like us.)
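
To see why the prefix matters, here is a minimal sketch with two real functions that share a name: stats::filter() (a linear filter for numeric series, in base R) and dplyr::filter() (row subsetting). The built-in mtcars data and the variable names are illustrative choices, not part of the original workflow.

```r
# stats::filter() and dplyr::filter() share a name but do different jobs.
x <- c(1, 2, 3, 4, 5)

# Base R: a centred 3-point moving average of a numeric vector
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr: keep only the rows that satisfy a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)

moving_avg
nrow(four_cyl)
```

With the prefixes in place, the code behaves the same way regardless of the order in which packages were loaded.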

Read Data

  • Use readr::read_csv(). Do not use read.csv().
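
A minimal sketch of the reading step follows. The file contents, column names and the temporary path are made up so that the example runs on its own; in real work you would point readr::read_csv() at your own file.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# (The names and values are invented for illustration.)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "name,height_cm,group",
    "Asha,162.5,A",
    "Ravi,175.0,B",
    "Meera,158.2,A"
  ),
  csv_path
)

# readr::read_csv() returns a tibble and reports the column types it guessed
heights <- readr::read_csv(csv_path)
heights
```

The printed column specification is a quick first check that numbers were read as numbers and text as text.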

Examine the Data

  • Use dplyr::glimpse()
  • Use mosaic::inspect() or skimr::skim()
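
As a sketch, the three commands can be tried on the penguins data from the palmerpenguins package (assumed to be installed; any data frame would do).

```r
library(palmerpenguins) # provides the `penguins` data frame

dplyr::glimpse(penguins)  # column names, types and the first few values
skimr::skim(penguins)     # per-column summaries, including missing counts
mosaic::inspect(penguins) # separate summaries for categorical and quantitative columns
```

Each command shows roughly the same information from a different angle, so one or two of them is usually enough.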

Summarize the Data

  • Use dplyr::summarise() and/or crosstable::crosstable()
  • Highlight any interesting summary stats, missing data, or data imbalances
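
A hedged sketch of both approaches, again assuming the palmerpenguins data. The choice of variables and summary statistics is illustrative, and the native pipe |> needs R 4.1 or later.

```r
library(palmerpenguins)

# Grouped numerical summary, including a count of missing values
mass_summary <- penguins |>
  dplyr::group_by(species) |>
  dplyr::summarise(
    n = dplyr::n(),
    n_missing = sum(is.na(body_mass_g)),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass = sd(body_mass_g, na.rm = TRUE)
  )
mass_summary

# A ready-made summary table of two variables, split by species
crosstable::crosstable(penguins, c(body_mass_g, sex), by = species)
```

The n and n_missing columns make group imbalances and missing data visible straight away.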

Data Dictionary and Experiment Description

  • A table containing the variable names, their interpretation, and their nature (Qual/Quant/Ord…)
  • If there are wrongly coded variables in the original data, state them in their correct form, so you can munge the data in the next step
  • Declare what might be target and predictor variables, based on available information of the experiment, or a description of the data.
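
One possible way to start such a table is sketched below, assuming the palmerpenguins data. The `nature` column is only a first guess derived from the stored type, and the interpretation of each variable still has to be written by hand from the data documentation.

```r
library(palmerpenguins)

# One row per variable: its name and the type R has stored it as
data_dictionary <- tibble::tibble(
  variable = names(penguins),
  r_class = unname(vapply(penguins, function(col) class(col)[1], character(1))),
  # First guess only: numeric columns are treated as Quant, the rest as Qual.
  # Revise by hand (e.g. `year` may be better treated as Ord or Qual).
  nature = ifelse(unname(vapply(penguins, is.numeric, logical(1))), "Quant", "Qual")
)
data_dictionary
```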

Data Munging

  • Convert variables to factors as needed
  • Reformat / Rename other variables as needed
  • Clean badly formatted columns (e.g. text + numbers) using the tidyr::separate_*() family, e.g. tidyr::separate_wider_delim()
  • Save the data as a modified file
  • Do not mess up the original data file
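
The munging steps can be sketched on a small made-up data frame. All column names, values and the output file name are hypothetical, and tidyr::separate_wider_delim() needs tidyr 1.3.0 or later.

```r
# A small, invented data frame with typical problems
raw <- tibble::tibble(
  `Subject ID` = c(1, 2, 3),
  dose = c("10 mg", "20 mg", "15 mg"),
  grp = c("ctrl", "trt", "trt")
)

clean <- raw |>
  # Rename awkward or cryptic column names
  dplyr::rename(subject_id = `Subject ID`, group = grp) |>
  # Split "10 mg" into a number part and a unit part
  tidyr::separate_wider_delim(
    dose,
    delim = " ",
    names = c("dose_value", "dose_unit")
  ) |>
  # Fix the types: numbers as numeric, categories as factors
  dplyr::mutate(
    dose_value = as.numeric(dose_value),
    group = factor(group, levels = c("ctrl", "trt"))
  )

# Save under a new name; the original file is never overwritten
out_path <- file.path(tempdir(), "data_modified.csv")
readr::write_csv(clean, out_path)
```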

Form Hypotheses

Question-1

  • State the Question or Hypothesis.
  • (Temporarily) Drop variables using dplyr::select()
  • Create new variables if needed with dplyr::mutate()
  • Filter the data set using dplyr::filter()
  • Reformat data to wide/long if needed with tidyr::pivot_longer() or tidyr::pivot_wider()
  • Answer the Question with a Table, a Chart, a Test, using an appropriate Model for Statistical Inference
  • For Charts:
    • Use title, subtitle, legend and scales appropriately in your chart
    • Use a colour palette from the paletteer package that suits your message and taste. See references and commands at the end of this document.
    • Prefer ggformula unless you are using a chart that is not yet supported therein (e.g. ggbump::geom_bump, vcd::mosaic or ggstats::gglikert)
    • Use gf_facet_wrap() or gf_facet_grid() as appropriate to show small multiple graphs for clarity
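
A sketch of the question-to-chart steps, assuming the palmerpenguins data. The question, the derived variable body_mass_kg and the labels are illustrative; the palette awtools::ppalette is one of the discrete palettes that ship with paletteer.

```r
library(ggformula) # formula-based plotting; also attaches ggplot2
library(palmerpenguins)

# Question (illustrative): do heavier penguins have longer flippers,
# and does this hold for each sex?
plot_data <- penguins |>
  dplyr::select(species, sex, flipper_length_mm, body_mass_g) |>
  tidyr::drop_na() |>
  dplyr::mutate(body_mass_kg = body_mass_g / 1000)

penguin_plot <- plot_data |>
  gf_point(body_mass_kg ~ flipper_length_mm, colour = ~species) |>
  gf_facet_wrap(~sex) |> # small multiples, one panel per sex
  gf_labs(
    title = "Body Mass vs Flipper Length",
    subtitle = "Palmer penguins, by species and sex",
    x = "Flipper length (mm)", y = "Body mass (kg)"
  ) |>
  gf_refine(paletteer::scale_colour_paletteer_d("awtools::ppalette"))
penguin_plot
```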

  • For Tables:
    • Use crosstable::crosstable(...) %>% as_flextable() to create HTML tables of summaries
    • Use df-print: paged in your YAML header to make nice paged tables for your data frames
    • Use kableExtra::kable() %>% kable_paper(c("hover", "striped", "responsive"), full_width = F) or similar to make HTML tables of intermediate results/data where you think appropriate
  • For Statistical Tests:
    • Use mosaic::.... to run your statistical tests (t, wilcox, prop, chi.square…), since it has a formula interface similar to ggformula
    • Use broom::tidy() and/or broom::augment() to check your stat test results, and to convert them into tibbles for presentation and plotting.
    • You could convert the output of broom::... into an HTML table using the kableExtra code shown above.
    • Use supernova::supernova() to create friendly and clear ANOVA tables if needed:
Call:
   aov(formula = body_mass_g ~ species, data = penguins)

Terms:
                  species Residuals
Sum of Squares  145190219  70069447
Deg. of Freedom         2       330

Residual standard error: 460.7946
Estimated effects may be unbalanced
 Analysis of Variance Table (Type III SS)
 Model: body_mass_g ~ species

                                    SS  df           MS       F   PRE     p
 ----- --------------- | ------------- --- ------------ ------- ----- -----
 Model (error reduced) | 145190219.113   2 72595109.557 341.895 .6745 .0000
 Error (from model)    |  70069446.803 330   212331.657                    
 ----- --------------- | ------------- --- ------------ ------- ----- -----
 Total (empty model)   | 215259665.916 332   648372.488                    


  group_1   group_2       diff pooled_se      q    df    lower    upper p_adj
  <chr>     <chr>        <dbl>     <dbl>  <dbl> <int>    <dbl>    <dbl> <dbl>
1 Chinstrap Adelie      26.924    47.837  0.563   330 -132.353  186.201 .9164
2 Gentoo    Adelie    1386.273    40.241 34.450   330 1252.290 1520.255 .0000
3 Gentoo    Chinstrap 1359.349    49.532 27.444   330 1194.430 1524.267 .0000
  • Use mosaic::do(n) * stat-test(...) or use the infer package to run Permutation or Bootstrap Tests
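
The testing steps might look like the sketch below, which assumes the palmerpenguins data with incomplete rows dropped. The choice of test, the two-species subset, the seed and the 999 shuffles are illustrative assumptions; the aov() call is the one shown in the ANOVA output above.

```r
library(mosaic) # formula-interface tests, do(), shuffle(), diffmean()
library(palmerpenguins)

penguins_complete <- tidyr::drop_na(penguins)
two_species <- penguins_complete |>
  dplyr::filter(species != "Gentoo") |>
  dplyr::mutate(species = droplevels(species))

# 1. A t-test via the formula interface, tidied into a one-row tibble
t_result <- mosaic::t_test(body_mass_g ~ species, data = two_species)
broom::tidy(t_result)

# 2. Permutation test: shuffle the species labels many times and
#    recompute the difference in means under the null hypothesis
set.seed(42)
observed_diff <- diffmean(body_mass_g ~ species, data = two_species)
null_dist <- do(999) * diffmean(body_mass_g ~ shuffle(species), data = two_species)
p_value <- mean(abs(null_dist$diffmean) >= abs(observed_diff))
p_value

# 3. ANOVA across all three species, shown as a supernova table
penguins_anova <- aov(body_mass_g ~ species, data = penguins_complete)
supernova::supernova(penguins_anova)
```

The permutation p-value is the share of shuffled differences at least as large as the observed one, so it changes slightly with the seed and the number of shuffles.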

Inference-1

  • Present the final Inference clearly in text, with clear reference to your chart, and perhaps the p-values and confidence intervals from your statistical tests.

Question-n

….

Inference-n

….

Conclusion

Describe what you have done, what the graph(s) and test(s) show, and why it is all so interesting. What could be done next?

References

  1. https://shancarter.github.io/ucb-dataviz-fall-2013/classes/facing-the-abyss/

  2. Colour Palettes

Over 2500 colour palettes are available in the paletteer package. Can you find tayloRswift? wesanderson? harrypotter? timburton? You could also find/define palettes that are in line with your Company’s logo / colour schemes.



Here are the Qualitative Palettes (first 10 of 2415 rows):

package    palette    length  type         novelty
awtools    a_palette  8       sequential   true
awtools    ppalette   8       qualitative  true
awtools    bpalette   16      qualitative  true
awtools    gpalette   4       sequential   true
awtools    mpalette   9       qualitative  true
awtools    spalette   6       qualitative  true
basetheme  brutal     10      qualitative  true
basetheme  clean      10      qualitative  true
basetheme  dark       10      qualitative  true
basetheme  deepblue   10      qualitative  true



And the Quantitative/Continuous palettes (first 10 of 319 rows):

package   palette                type
ggthemes  Blue-Green Sequential  sequential
ggthemes  Blue Light             sequential
ggthemes  Orange Light           sequential
ggthemes  Blue                   sequential
ggthemes  Orange                 sequential
ggthemes  Green                  sequential
ggthemes  Red                    sequential
ggthemes  Purple                 sequential
ggthemes  Brown                  sequential
ggthemes  Gray                   sequential



Use the commands:

## For Qual variable-> colour/fill:
## (use scale_fill_paletteer_d() for fill)
scale_colour_paletteer_d(
  name = "Legend Name",
  palette = "package::palette",
  dynamic = FALSE # set to TRUE only for dynamic palettes
)

## For Quant variable-> colour/fill:
## (use scale_fill_paletteer_c() for fill; there is no `dynamic` argument here)
scale_colour_paletteer_c(
  name = "Legend Name",
  palette = "package::palette"
)
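
To hunt for a palette by package name and look at its colours, something like the following sketch may help. The wesanderson search and the choice of five colours are illustrative; the two palette names are taken from the tables above.

```r
# Search the table of discrete palettes by package name
wes <- dplyr::filter(paletteer::palettes_d_names, package == "wesanderson")
head(wes)

# Pull the actual colours of one discrete palette
ppal <- paletteer::paletteer_d("awtools::ppalette")
ppal

# Sample five colours from a continuous palette
blues <- paletteer::paletteer_c("ggthemes::Blue", n = 5)
blues
```

The same "package::palette" strings are what the scale_colour_paletteer_d() and scale_colour_paletteer_c() commands expect.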

License: CC BY-SA 2.0
