---
title: Simulating datasets
author: Ken Kellner
date: July 7, 2024
bibliography: unmarked.bib
csl: ecology.csl
output: 
  rmarkdown::html_vignette:
    fig_width: 5
    fig_height: 3.5
    number_sections: true
    toc: true
vignette: >
  %\VignetteIndexEntry{Simulating datasets}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction

Simulating datasets is a powerful and varied tool when conducting `unmarked` analyses.
Writing our own code to simulate a dataset based on a given model is an excellent learning tool, and can help us test if a given model is generating the expected results.
If we simulate a series of datasets based on a fitted model, and calculate a statistic from each of those fits, we can generate a distribution of the statistic - this is what the `parboot` function does.
This can be helpful, for example, when testing goodness of fit.
Finally, simulation can be a useful component of power analysis when a closed-form equation for power is not available.

`unmarked` provides two different ways of generating simulated datasets, depending on the stage we are at in the modeling process.

1. Generating simulated datasets from a fitted model we already have
2. Generating simulated datasets from scratch

For (1), we simply call the `simulate` method on our fitted model object and new dataset(s) are generated.
This is the approach used by `parboot`.
In this vignette we will focus on (2), a more flexible approach to simulation, also using the `simulate` method, that allows us to generate a dataset corresponding to any `unmarked` model from scratch.

# Components of a call to simulate

We will need to provide several pieces of information to `simulate` in order to simulate a dataset from scratch in `unmarked`.

1. An `unmarkedFrame` of the appropriate type defining the desired experimental design
2. The model to use (`model`), if the `unmarkedFrame` is used for multiple model types
3. Other arguments required by the fitting function for which we are simulating. Most importantly this will include formulas for each submodel.
4. A named list of parameter or coefficient values (`coefs`) controlling the simulation, which correspond to the formula(s) specified earlier.

# Simulating an occupancy dataset

The easiest way to demonstrate how to use `simulate` is to look at an example: we'll start with a simple one for occupancy.

## 1. The `unmarkedFrame`

Suppose we want to simulate a single-season occupancy dataset in which site occupancy is affected by elevation.
The first step is to create an `unmarkedFrame` object of the appropriate type, which defines the experimental design and includes any covariates we want to use in the simulation.
Since we want to simulate an occupancy dataset, we'll create an `unmarkedFrameOccu`.

The `unmarkedFrameOccu` function takes three arguments: the observation matrix `y`, the site covariates `siteCovs`, and the observation-level covariates `obsCovs`.
The dimensions of `y` define how many sites and replicate samples the study includes.
We'll create a blank `y` matrix (i.e., filled with `NA`s) of dimension 300 x 8, indicating we want our study to have 300 sites and 8 sampling occasions.
The values you put in this `y` matrix don't matter, you can put anything in there you want as they'll be overwritten with the simulated values later.
It's only used to define the number of sites and occasions.

```{r}
library(unmarked)
set.seed(123)
M <- 300
J <- 8
y <- matrix(NA, M, J)
```

Earlier we said we want to include an elevation covariate, so we'll simulate the covariate now and add it to a data frame.
We could create several covariates here, including factors, etc.

```{r}
site_covs <- data.frame(elev = rnorm(M))
```

We're not using any observation covariates, so we can now make the complete `unmarkedFrameOccu`:

```{r}
umf <- unmarkedFrameOccu(y = y, siteCovs = site_covs)
head(umf)
```

## 2. Specify the model type

Since `unmarkedFrameOccu` is used by both the single-season occupancy model (`occu`) and the Royle-Nichols occupancy model (`occuRN`), we need to tell `unmarked` which one to use.

```{r}
model <- occu
```

Most `unmarkedFrame` types in `unmarked` are used by only one model fitting function, so this step is often unnecessary.

## 3. Specify other arguments to the fitting function

Take a look at the help file for `occu`.
When fitting a single-season occupancy model we need to provide, in addition to the data, the `formula` argument defining the model structure.
We'll need to provide these same argument(s) to `simulate`.
Many fitting functions will have multiple required arguments, such as the mixture distribution to use, key functions, etc.

Here we specify a double right-hand-side formula as required by `occu`, specifying an effect of elevation on occupancy.

```{r}
form <- ~1~elev
```

## 4. Specify the corresponding parameter values

The model structure, as defined by the formula above, implies a certain set of parameter/coefficient values (intercepts, slopes) we need to supply to `simulate`.
These need to be supplied as a named list, where each list element corresponds to one submodel (such as `state` for occupancy and `det` for detection).
Each list element is a numeric vector of the required parameter values.
It can be tricky to figure out the structure of this list, so `simulate` allows you to not include it at first, in which case the function will return a template for you to fill in.

```r
simulate(umf, model = model, formula = form)
```

```
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $state
## (Intercept)        elev
##           0           0
##
## $det
## (Intercept)
##           0
```

```
## Error: Specify coefs argument as shown above
```

We need to supply a list with two elements `state` and `det`.
The `state` element contains two values, the intercept and the slope corresponding to elevation.
The `det` element contains only the intercept since we have no covariates on detection.
Note that all values supplied in this list *must* be on the inverse link scale, which will depend on the specific submodel used.
So for example, a value of 0 for `det` implies a detection probability of 0.5, because we're using the logit link function.

```{r}
plogis(0)
```

Now let's make our own `coefs` list:

```{r}
cf <- list(state = c(0, -0.4), det = 0)
```

Here we're setting a negative effect of elevation on occupancy.

## Run simulate

We now have all the pieces to simulate a dataset.

```{r}
out <- simulate(umf, model = occu, formula = ~1~elev, coefs = cf)
```

The result is always a list of `unmarkedFrame`s.
By default, we just get one, but we can get more with the `nsim` argument.

```{r}
head(out[[1]])
```

The simulated `unmarkedFrame` now contains `y` values and is ready to use.

## Fit a model to the simulated dataset

As a quick check, let's fit a model to our simulated dataset.

```{r}
occu(~1~elev, data = out[[1]])
```

We get out roughly the same parameters that we put in, as expected.

# Simulating a more complex dataset: gdistremoval

The `gdistremoval` function fits the model of @Amundson_2014, which estimates abundance using a combination of distance sampling and removal sampling data.
When simulating a dataset based on this model, we have to provide several additional pieces of information related to the structure of the distance and removal sampling analyses.

## 1. The `unmarkedFrame`

First create the appropriate type of `unmarkedFrame`, which is `unmarkedFrameGDR`.
There's two y-matrices: one for distance sampling and one for removal sampling.
We'll create a dataset with 4 distance bins and 5 removal periods.

```{r}
set.seed(123)
M <- 100
Jdist <- 4
Jrem <- 5

y_dist <- matrix(NA, M, Jdist)
y_rem <- matrix(NA, M, Jrem)
```

We'll create an elevation site covariate and a wind observation covariate.
Observation-level covariates are only used by the removal part of the model, so they should have the same number of values as `y_rem`. 

```{r}
site_covs <- data.frame(elev = rnorm(M))
obs_covs <- data.frame(wind = rnorm(M * Jrem))
```

Finally we can create the `unmarkedFrameGDR`.
We'll also need to specify the distance bins and the units for the distance part of the model here.
See `?unmarkedFrameGDR` for more information.

```{r}
umf <- unmarkedFrameGDR(yRem = y_rem, yDist = y_dist, siteCovs = site_covs, obsCovs = obs_covs,
                        dist.breaks = c(0,25,50,75,100), unitsIn = 'm')
```

```{r}
head(umf)
```

## 2. Arguments sent to `gdistremoval`

Looking at `?gdistremoval`, required arguments include `lambdaformula`, `removalformula`, and `distanceformula`.
We need to set these formula values to control the simulation.
We'll also use the negative binomial distribution for abundance.

```{r}
lambdaformula <- ~elev # elevation effect on abundance
removalformula <- ~wind # wind effect on removal p
distanceformula <- ~1
mixture <- "NB"
```

## 3. Coefficient values

As in the previous section, we'll leave the `coefs` argument blank at first and get the correct output structure.

```r
simulate(umf, lambdaformula=~elev, removalformula=~wind, distanceformula=~1,
         mixture="NB")
```

```
## coefs should be a named list of vectors, with the following structure
## (replace 0s with your values):
##
## $lambda
## (Intercept)        elev
##           0           0
##
## $alpha
## alpha
##     0
##
## $dist
## (Intercept)
##           0
##
## $rem
## (Intercept)        wind
##           0           0
```

```
## Error: Specify coefs argument as shown above
```

We need to set two values for the abundance (`lambda`) model on the log scale, one for `dist` which represents the distance function sigma parameter (log scale), one for the negative binomial dispersion parameter `alpha` (log scale), and two for the removal detection probability model (logit scale).

We'll pick the (relatively arbitrary) values below:

```{r}
cf <- list(lambda = c(log(5), 0.7),
           dist = log(50),
           alpha = 0.1,
           rem = c(-1, -0.3))
```

## 4. Run simulation

Now provide everything to `simulate`.
Note we don't need to provide the `model` argument because `unmarkedFrameGDR` is used for only one fitting function (`gdistremoval`).

We'll simulate 2 datasets.

```{r}
out <- simulate(umf, lambdaformula=~elev, removalformula=~wind, distanceformula=~1,
                coefs=cf, mixture="NB", nsim=2)
```

```{r}
lapply(out, head)
```

## Fit model to simulated dataset

As a check, we'll fit the same model used for simulation to one of the datasets.

```{r}
gdistremoval(lambdaformula=~elev, removalformula=~wind, distanceformula=~1, data=out[[1]],
             mixture="NB")
```

Looks good.

# Conclusion

The `simulate` function provides a flexible tool for simulating data from any model in `unmarked`.
These datasets can be used for a variety of purposes, such as for teaching examples, testing models, or developing new tools that work with `unmarked`.
Additionally, simulating datasets is a key component of the power analysis workflow in `unmarked` - see the power analysis vignette for more examples.

# References