Facing the Abyss
Abstract
A complete EDA Workflow
A Data Analytics Process
So you have your shiny new R skills and you’ve successfully loaded a cool dataframe into R… Now what?
The best charts come from understanding your data, asking good questions of it, and displaying the answers to those questions as clearly as possible.
Setting up R Packages
- Install packages using install.packages() in your Console.
- Load up your libraries in a setup chunk:
Go to https://fonts.google.com/ and choose some professional looking, or funky looking, fonts.
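library(tidyverse) # dplyr, tidyr, ggplot2 and the %>% pipe
library(ggformula) # formula interface to ggplot2, used for the chart below
library(palmerpenguins) # the penguins data used in the example chart
library(extrafont)
extrafont::loadfonts(quiet = TRUE)
##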
library(showtext)
## Loading Google fonts (https://fonts.google.com/)
font_add_google(name = "Fira Sans Condensed", family = "fira")
font_add_google("Gochi Hand", "gochi")
font_add_google("Schoolbell", "bell")
font_add_google("Montserrat Alternates", "montserrat")
font_add_google("Roboto Condensed", "roboto")
### Automatically use showtext to render text
showtext_auto()
Read Data
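A minimal sketch of this step, assuming your data sits in a CSV file. The file name my_data.csv and its columns are hypothetical; the sketch first writes a tiny file to a temporary folder so that it runs anywhere, then reads it back with readr::read_csv().

```r
library(readr)

# Hypothetical file: a small CSV written to a temporary folder,
# standing in for your own data file
csv_path <- file.path(tempdir(), "my_data.csv")
write_csv(
  data.frame(
    id = 1:3,
    group = c("a", "b", "a"),
    score = c(4.2, 5.1, 3.8)
  ),
  csv_path
)

# Read the file into a tibble; show_col_types = FALSE silences the column report
my_data <- read_csv(csv_path, show_col_types = FALSE)
my_data
```

With your own data, replace csv_path with the path to your file and skip the write_csv() step.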
Examine Data
- Use dplyr::glimpse()
- Use mosaic::inspect() or skimr::skim()
- Use dplyr::summarise() and crosstable::crosstable()
- Format your tables with knitr::kable()
- Highlight any interesting summary stats or data imbalances
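The steps above could look something like this on the penguins data from the palmerpenguins package. This is only a sketch: the choice of body_mass_g as the variable to summarise, and the names of the summary columns, are assumptions made for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Quick look at column names, types and the first few values
glimpse(penguins)

# Grouped summary: counts, missing values and basic statistics per species
mass_by_species <- penguins %>%
  group_by(species) %>%
  summarise(
    n = n(),
    n_missing_mass = sum(is.na(body_mass_g)),
    mean_mass_g = mean(body_mass_g, na.rm = TRUE),
    sd_mass_g = sd(body_mass_g, na.rm = TRUE)
  )

# Format the summary as a table for the rendered document
knitr::kable(
  mass_by_species,
  digits = 1,
  caption = "Body mass by species"
)
```

The n column is where a data imbalance would show up: very unequal group sizes are worth a sentence in your write-up.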
Data Dictionary and Experiment Description
- A table containing the variable names, their interpretation, and their nature (Qual/Quant/Ord…)
- If there are wrongly coded variables in the original data, state them in their correct form, so you can munge them in the next step
- Declare what might be target and predictor variables, based on available information of the experiment, or a description of the data
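One possible way to start such a table is to build it from the data itself and then fill in the interpretations by hand. The sketch below assumes the penguins data; the column names of data_dictionary and the Qual/Quant rule are illustrative first guesses, not a fixed convention, and ordinal variables still need to be marked manually.

```r
library(tibble)
library(palmerpenguins)

data_dictionary <- tibble(
  variable = names(penguins),
  # First class of each column as reported by R
  r_class = unname(vapply(penguins, function(x) class(x)[1], character(1))),
  # First guess only: factors and text are Qual, everything else Quant
  nature = ifelse(r_class %in% c("factor", "character"), "Qual", "Quant"),
  # To be written by hand from the description of the experiment
  interpretation = NA_character_
)

data_dictionary
```

A variable such as year is stored as a number but may be better treated as Qual or Ord; that is exactly the kind of correction to note here before munging.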
Data Munging
- Convert variables to factors as needed
- Reformat / Rename other variables as needed
- Clean badly formatted columns (e.g. text + numbers) using the tidyr::separate_*_*() functions, for example tidyr::separate_wider_delim()
- Save the data as a modified file
- Do not mess up the original data file
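A small sketch of these munging steps on a made-up data frame. The raw data, its column names and the output file name data_modified.csv are all hypothetical, chosen only to show one text + number column being cleaned.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical raw data with a badly named column and a text + number column
raw <- tibble(
  id = 1:3,
  Group = c("a", "b", "a"),
  dose = c("10 mg", "20 mg", "15 mg")
)

clean <- raw %>%
  rename(group = Group) %>%          # consistent lower-case names
  mutate(group = as.factor(group)) %>% # categorical variable as a factor
  separate_wider_delim(
    dose,
    delim = " ",
    names = c("dose_value", "dose_unit")
  ) %>%
  mutate(dose_value = as.numeric(dose_value))

# Save under a new name; the original file is never overwritten
out_path <- file.path(tempdir(), "data_modified.csv")
write_csv(clean, out_path)
```

Writing to a new file keeps the raw data intact, so every munging step can be re-run from scratch.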
Form Hypotheses
Question-1
- State the Question or Hypothesis
- (Temporarily) Drop variables using dplyr::select()
- Create new variables if needed with dplyr::mutate()
- Filter the data set using dplyr::filter()
- Reformat data if needed with tidyr::pivot_longer() or tidyr::pivot_wider()
- Answer the Question with a Table, a Chart, a Test, using an appropriate Model for Statistical Inference
- Use title, subtitle, legend and scales appropriately in your chart
- Prefer ggformula unless you are using a chart that is not yet supported therein (e.g. ggbump() or plot_likert())
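As an illustration of the data-preparation verbs in the list above, here is a sketch for a hypothetical question about penguin bill dimensions. The question, the derived variable bill_ratio and the names measurement and value_mm are assumptions made for this example.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Hypothetical question: how do bill measurements differ by species and sex?
bill_long <- penguins %>%
  # Keep only the variables this question needs
  select(species, sex, bill_length_mm, bill_depth_mm) %>%
  # Drop rows where sex was not recorded
  filter(!is.na(sex)) %>%
  # Derived variable: length relative to depth
  mutate(bill_ratio = bill_length_mm / bill_depth_mm) %>%
  # One row per penguin per measurement, convenient for faceted charts
  pivot_longer(
    cols = c(bill_length_mm, bill_depth_mm),
    names_to = "measurement",
    values_to = "value_mm"
  )

bill_long
```

The original penguins data is left untouched; each question gets its own trimmed-down copy.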
## Set graph theme
## Idiotic that we have to repeat this every chunk
## Open issue in Quarto
penguins %>%
  drop_na() %>%
  gf_point(body_mass_g ~ flipper_length_mm,
    colour = ~species
  ) %>%
  gf_labs(
    title = "My First Penguins Plot",
    subtitle = "Using ggformula with fonts",
    x = "Flipper Length (mm)", y = "Body Mass (g)",
    caption = "I love penguins, and R"
  ) %>%
  gf_theme(theme_classic()) %>%
  gf_theme(theme(
    panel.grid.minor = element_blank(),
    ## Default font for all text
    text = element_text(family = "fira", size = 14),
    ## Title, subtitle, caption and legend each get their own font
    plot.title = element_text(
      family = "roboto",
      face = "bold",
      size = 28, hjust = 0
    ),
    plot.subtitle = element_text(
      family = "montserrat",
      face = "bold",
      size = 18, hjust = 0
    ),
    plot.margin = margin(2, 2, 2, 2, unit = "pt"),
    axis.title = element_text(size = 20),
    plot.caption = element_text(family = "gochi", size = 14),
    legend.title = element_text(
      family = "bell",
      face = "bold",
      size = 20
    ),
    legend.text = element_text(
      family = "fira",
      size = 12
    ),
    legend.background = element_rect(
      fill = "cornsilk",
      colour = "black"
    ),
    legend.margin = margin(
      t = 2,
      r = 2,
      b = 2,
      l = 2,
      unit = "pt"
    )
  ))
Inference-1
. . . .
Question-n
….
Inference-n
….
One Most Interesting Graph
Conclusion
Describe what the graph shows and why it is so interesting. What could be done next?