--- title: "Simulate by Design" author: "Lisa DeBruine" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Simulate by Design} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", out.width = "100%" ) ggplot2::theme_set(ggplot2::theme_bw()) set.seed(8675309) ``` ```{r, warning=FALSE, message=FALSE} library(faux) ``` The `sim_design()` function creates a dataset with a specific between- and within-subjects design. ## Quick example For example, the following creates a 4w*2b design with 100 observations in each cell. The between-subject factor is `pet` with two levels ("cat" and "dog"). The within-subject factor is `time` with four levels ("morning", "noon", "evening" and "night"). The data is sampled from a population where the mean for the `cat_morning` cell is 10, the mean for the `cat_noon` cell is 12, the mean for the `cat_evening` cell is 14, and the mean for the `cat_night` cell is 16. The mean for the `dog_morning` cell is 10, the mean for the `dog_noon` cell is 15, the mean for the `dog_evening` cell is 20, and the mean for the `dog_night` cell is 25. All cells have a SD of 5 and all within-subject cells are correlated r = 0.5. ```{r plot-sim-design} between <- list(pet = c(cat = "Cat Owners", dog = "Dog Owners")) within <- list(time = c("morning", "noon", "evening", "night")) mu <- data.frame( cat = c(10, 12, 14, 16), dog = c(10, 15, 20, 25), row.names = within$time ) # add factor labels for plotting vardesc <- c(pet = "Type of Pet", time = "Time of Day") df <- sim_design(within, between, n = 100, mu = mu, sd = 5, r = .5, empirical = TRUE, vardesc = vardesc, plot = TRUE) ``` ## Specification of design parameters ### Factor and level names #### Anonymous If you don't feel like naming your factors and levels, you can just put in a vector of levels. So you can make a quick 2w\*3w\*2b with the following code. ```{r anon} df <- sim_design(within = c(2,3), between = c(2), n = 10, mu = 1:12, sd = 1, r = 0.5) ``` #### List of vectors You can specify between-subject and within-subject factors as a list of vectors where the item names are the factor labels and the vectors are the level labels. ```{r vlist} between <- list( pet = c("cat", "dog") ) within <- list( time = c("day", "night") ) df <- sim_design(within, between, mu = 1:4) ``` #### Levels with underscores In wide format, faux uses an underscore to separate level names. Therefore, any underscores in factor level names will generate an error. ```{r, error = TRUE, purl = FALSE} # with default sep = _ within <- list( A = c("A_1", "A_2"), B = c("B_1", "B_2") ) sim_design(within, n = 5, plot = FALSE) ``` You can change this default with `faux_options()`. ```{r} faux_options(sep = ".") sim_design(within, n = 5, plot = FALSE) ``` ```{r} # put the separator back to _ for the rest of this vignette faux_options(sep = "_") ``` #### List of named vectors/lists You can also specify factors as a list of named vectors or lists where the item names are the factor labels, the vector names are the level labels that are used in the data table, and the vector items are the long labels for a codebook or plot. ```{r named-vectors} between <- list( pet = c(cat = "Is a cat person", dog = "Is a dog person") ) within <- list( time = c(day = "Tested during the day", night = "Tested at night") ) df <- sim_design(within, between, mu = 1:4) ``` #### Factor labels Add factor labels with the `vardesc` argument. You can also customise the ID and DV column names (they default to `c(id = "id")` and `c(y = "value")`). ```{r} vardesc <- c(pet = "Type of Pet", time = "Time of Day") df <- sim_design(within, between, mu = 1:4, id = c(pet_id = "Pet ID"), dv = c(score = "Score on the Test"), vardesc = vardesc) ``` If you have any within-subjects factors and your table is in wide format, you will only see the DV name in plots. But it will be used if you convert the table from wide to long. ```{r} wide2long(df) %>% head() ``` ### Specifying N You can specify the Ns for each between-subject cell as a single number, named list, or data frame. #### Single number You usually want to specify `n` as a single number. This is N per cell, not total sample size. ```{r} n <- 20 # n per cell, not total design <- check_design(2, c(2,2), n = n, plot = FALSE) str(design$n) ``` #### Named list You can also specify `n` as a named list of Ns per between-subject cell. ```{r} n <- list( B1a_B2a = 10, B1a_B2b = 20, B1b_B2a = 30, B1b_B2b = 40 ) design <- check_design(2, c(2,2), n = n, plot = FALSE) str(design$n) ``` #### Dataframe Or as a data frame. You just need to get the row or column names right, but they don't have to be in the right order. ```{r} n <- data.frame( B1b_B2b = 40, B1a_B2a = 10, B1a_B2b = 20, B1b_B2a = 30 ) design <- check_design(2, c(2,2), n = n, plot = FALSE) str(design$n) ``` You can specify the cells as row names or column names and `check_design()` will fix them. Since `n` **has** to be the same for each within-subject factor, you can specify n as a single column with any name. ```{r} n <- data.frame(n = c(10, 20, 30, 40), row.names = c("B1a_B2a", "B1a_B2b", "B1b_B2a", "B1b_B2b")) design <- check_design(2, c(2,2), n = n, plot = FALSE) str(design$n) ``` ### Mu and SD The specifications for `mu` and `sd` need both within-subject and between-subject cells. You can specify these as a single numbers, a vector, a named list of named vectors or a data frame. #### Unnamed vector An unnamed vector is a quick way to specify means and SDs, but the order relative to your between- and within-subject cells can be confusing. ```{r} between <- list(pet = c("cat", "dog"), condition = c("A", "B")) within <- list(time = c("day", "night")) mu <- c(10, 20, 30, 40, 50, 60, 70, 80) design <- check_design(within, between, mu = mu, plot = FALSE) str(design$mu) ``` #### Named list of named vectors A named list of named vectors prevents confusion due to order. The levels of the between-subject factors are the list names and the levels of the within-subject factors are the vector names, but their order doesn't matter. ```{r} mu <- list( cat_B = c(night = 40, day = 30), cat_A = c(day = 10, night = 20), dog_A = c(day = 50, night = 60), dog_B = c(day = 70, night = 80) ) design <- check_design(within, between, mu = mu, sd = 1, plot = FALSE) str(design$mu) ``` Alternatively, you can specify them as data frames. ```{r} mu <- data.frame( cat_A = c(10, 20), cat_B = c(30, 40), dog_A = c(50, 60), dog_B = c(70, 80), row.names = c("day", "night") ) design <- check_design(within, between, mu = mu, plot = FALSE) str(design$mu) ``` If you transpose the dataframe, this works out fine unless your within- and between-subject cells have identical names. ```{r} mu <- data.frame( day = c(10, 30, 50, 70), night = c(20, 40, 60, 80), row.names = c("cat_A", "cat_B", "dog_A", "dog_B") ) design <- check_design(within, between, mu = mu, plot = FALSE) str(design$mu) ``` ### Correlations If you have any within-subject factors, you need to set the correlation for each between-cell. Here, we only have two levels of one within-subject factor, so can only set one correlation per between-cell. ```{r} r <- list( cat_A = .5, cat_B = .5, dog_A = .6, dog_B = .4 ) design <- check_design(within, between, r = r, plot = FALSE) design$r ``` #### Upper right triangle If you have more than 2 within-subject cells, you can specify each specific correlation in the upper right triangle of the correlation matrix as a vector. ```{r} r <- list( B1a = c(.10, .20, .30, .40, .50, .60), B1b = c(.15, .25, .35, .45, .55, .65) ) design <- check_design(4, 2, r = r, plot = FALSE) design$r ``` #### Correlation matrix You can also enter the correlation matrix from `cor()`. ```{r} within <- list(cars = c("speed", "dist")) between <- list(half = c("first", "last")) r <- list( first = cor(cars[1:25,]), last = cor(cars[26:50,]) ) design <- check_design(within, between, r = r, plot = FALSE) design$r ``` ### Empirical If you set `empirical = TRUE`, you will get the *exact* means, SDs and correlations you specified. If you set `empirical = FALSE` or omit that argument, your data will be sampled from a population with those parameters, but your dataset will not have exactly those values (just on average). ```{r empirical} between <- list(pet = c("cat", "dog")) within <- list(time = c("day", "night")) mu <- list( cat = c(day = 10, night = 20), dog = c(day = 30, night = 40) ) sd <- list( cat = c(day = 5, night = 10), dog = c(day = 15, night = 20) ) r <- list(cat = .5, dog = .6) df <- sim_design(within, between, n = 100, mu = mu, sd = sd, r = r, empirical = TRUE) ``` `r get_params(df, between = "pet") %>% knitr::kable()` ## More factors Here is a 2w\*3w\*2b\*2b example. When you have multiple within or between factors, you need to specify parameters by cell. Cell names are the level names, in the order they are listed in the `within` or `between` arguments, separated by underscores. Foe example, if you have one within-subject factor of condition with levels `con` and `inc`, and another within-subject factor of version with levels `easy`, `med`, and `hard`, your cell labels will be: `con_easy`, `inc_easy`, `con_med`, `inc_med`, `con_hard`, and `inc_hard`. If you have any characters in your level names except letters and numbers, they will be replaced by a full stop (e.g., `my super-good level_name` will become `my.super.good.level.name`). ```{r} within <- list( condition = c(con = "Mean of congruent trials", inc = "Mean of incongruent trials"), version = c(easy = "Easy", med = "Medium", hard = "Difficult") ) between <- list( experience = c(novice = "Novice", expert = "Expert"), time = c(day = "Before 5pm", night = "After 5pm") ) mu <- data.frame( row.names = c("con_easy", "con_med", "con_hard", "inc_easy", "inc_med", "inc_hard"), novice_day = 10:15, novice_night = 11:16, expert_day = 9:14, expert_night = 10:15 ) ``` You can set the correlation for each between-cell to a single number. ```{r} r <- list( novice_day = 0.3, novice_night = 0.2, expert_day = 0.5, expert_night = 0.4 ) ``` Or you can set the full correlation matrix with a vector or matrix. Since we have 6 within-cells, this is a 6x6 matrix or a vector of the upper right 15 values. ```{r} # upper right triangle correlation specification # inc and con have r = 0.5 within each difficultly level, 0.2 otherwise # ce, ie, cm, im, ch, ih triangle <- c(0.5, 0.2, 0.2, 0.2, 0.2, #con_easy 0.2, 0.2, 0.2, 0.2, #inc_easy 0.5, 0.2, 0.2, #con_med 0.2, 0.2, #inc_med 0.5) #con_hard #inc_hard r <- list( novice_day = triangle, novice_night = triangle, expert_day = triangle, expert_night = triangle ) ``` You can set `long = TRUE` to return the data frame in long format, which is usually easier for plotting. ```{r} df <- sim_design(within, between, n = 100, mu = mu, sd = 2, r = r, dv = c(rt = "Reaction Time"), plot = FALSE, long = TRUE) head(df) ``` ## Multiple datasets You might want to make multiple datasets with the same design for simulations. It can be inefficient to just put `sim_design()` in an apply function or a for loop (especially if you leave plot = TRUE). The patterns below make this faster. ### rep argument You can simulate multiple datasets by setting the `rep` argument of `sim_design()` to a number greater than 1. This will return a nested data frame with one column called `rep` and another column called `data`, that contains each individual data frame. This method is faster than repeatedly calling `sim_design()`, which will check the syntax of your design each time, and returns a nested dataset that makes it easy to apply the same analysis to each replicate. The code below creates 5 data frames with a 2W*2B design. ```{r} df <- sim_design(within = 2, between = 2, n = 50, mu = c(1, 1, 1, 1.5), sd = 1, r = 0.5, plot = FALSE, long = TRUE, rep = 5) df ``` You can access an individual data frame using `df$data[[1]]` or run the same function on each data frame using a pattern like below: ```{r} # using tidyverse functions analyse <- function(data) { stats::aov(y ~ B1 * W1 + Error(id/W1), data = data) %>% broom::tidy() } df %>% dplyr::mutate(analysis = lapply(data, analyse)) %>% dplyr::select(-data) %>% tidyr::unnest(analysis) ``` ### Unnested reps You can also get an unnested data frame by setting `nested = FALSE`. This is a bit faster to return, but less useful for the pattern above. ```{r} df <- sim_design(within = 2, between = 2, n = 2, mu = c(1, 1, 1, 1.5), sd = 1, r = 0.5, plot = FALSE, long = TRUE, rep = 2, nested = FALSE) df ``` ### Simulate by design The `check_design()` function converts any abbreviated design specification to the fully expanded version; it checks your design specification when you run `sim_design()` or you can run it on its own to create a validated design list. ```{r, results='asis'} between <- list(pet = c("cat", "dog")) within <- list(time = c("day", "night")) vardesc <- c(pet = "Type of Pet", time = "Time of Day") design <- check_design(within, between, n = 10, mu = 1:4, sd = 1:4, r = 0.5, vardesc = vardesc, plot = FALSE) design ``` You can then use the design object to create simulated datasets. If you set the design for `sim_design()` with the `design` argument instead of specifying parameters, it will skip the time-consuming checks, which can speed up simulations by 30% or more. ```{r} data <- sim_design(design = design) ```