library(tidyverse)
library(mosaic)
library(ggformula)
library(vtable)
# Generate Data
# library(simulate) TO BE FOUND AND INSTALLED!!!!
library(regressinator)
library(holodeck)
library(explore)
library(charlatan)
library(ids) # animals, adjectives, sentences, and proquints
library(rcorpora)
library(simstudy)
Generating Fake Data in R
Introduction
Often we need to generate fake data for teaching and demo purposes. This post uncovers several different packages for this purpose.
Set Up the R Packages
Using simulate
sim_bernoulli(prob = 0.2, params = NULL, data = df)
sim_beta(shape1 = 0.2, shape2 = 0.8, params = NULL)
Using regressinator
https://www.refsmmat.com/regressinator/
The regressinator is a pedagogical tool for conducting simulations of regression analyses and diagnostics. It can:
- Simulate populations with predictor variables from arbitrary distributions
- Simulate response variables that are functions of the predictor variables plus error, or are drawn from a distribution related to the predictors
- Given a model, simulate from the population sampling distribution of that model’s estimates
- Given a model fit to data, generate new simulated data based on the model fit
- Facilitate lineup plots comparing diagnostics on the fitted model to diagnostics where all model assumptions are met.
library(regressinator)
linear_pop <- population(
x1 = predictor("rnorm", mean = 4, sd = 10),
x2 = predictor("runif", min = 0, max = 10),
y = response(
0.7 + 2.2 * x1 - 0.2 * x2, # relationship between X and Y
family = gaussian(), # link function and response distribution
error_scale = 1.5 # sd; errors are scaled by this amount
)
)
In general, population()
defines a population according to the following relationship:
\[ Y ∼ Some ~ Distribution\\ \] \[ ~g(E[Y | X = x]) = \mu(x)\\ \] \[ where ~ μ(x)=any~function~of~x\\ \]
If family
is not specified the default is Gaussian, and the link function g
is identity.
We can create a population with binary outcomes and a logistic link function:
logistic_pop <- population(
x1 = predictor("rnorm", mean = 0, sd = 10),
x2 = predictor("runif", min = 0, max = 10),
y = response(0.7 + 2.2 * x1 - 0.2 * x2,
family = binomial(link = "logit"))
)