--- title: "Setophaga example" author: "Philip Mostert" date: "`r Sys.Date()`" bibliography: '`r system.file("References.bib", package="PointedSDMs")`' biblio-style: authoryear output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Setophaga example} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE, fig.width=8, fig.height=5 ) ``` # Introduction Predicting the distribution of species across space and time is a fundamental piece for answering a multitude of questions surrounding the field of ecology. Consequentially, numerous statistical tools have been developed to assist ecologists in making these types of predictions; the most common of which is certainly species distribution models (SDMs). These types of models are heavily reliant on the quality and quantity of data available across multiple time periods in order to produce meaningful results. Thankfully, the $21^{st}$ century has been associated with a boom in digital technology and the collapse of data storage costs achieved, allowing us to see data on the environment and species occurrences grow at unprecedented rates. The problem however is that all these data are disparate in nature; they all have their own sampling protocols, assumptions and attributes, making joint analysis of them difficult. On account of this issue, the field of integrated species distribution modelling has progressed considerably over the last few years -- and now the associated methods are well-developed, and the benefits of using such models (over single dataset or simple data-pooling models) are apparent. A major handicap to integrated modeling lies in the fact that the tools required to make inference with these models have not been established -- especially with regards to general software to be used by ecologists. This in turn has stagnated the growth of applying these models to real data. In light of this impediment to the field of integrated modelling, we decided to make *PointedSDMs*, an *R* package to help simplify the modelling process of integrated species distribution models (ISDMs) for ecologists. It does so by using the *INLA* framework [@rue2009approximate] to approximate the models and by constructing wrapper functions around those provided in the *R* package, *inlabru* [@bachl2019inlabru]. This *R markdown* file illustrates an example of modelling an ISDM with *PointedSDMs,* using three disparate datasets containing information regarding *Setophaga,* collected around Pennsylvania state (United States of America). The first part of this file contains a brief introduction to the statistical model used in ISDMs, as well as an introduction to the included functions in the package. The second part of this file uses *PointedSDMs* to run the ISDM for the provided data. We note that, for this particular vignette, no inference was made due to computational intensity of the model. However the *R* script and data are provided below so that the you may carry out inference. ```{r Install PointedSDMs, warning = FALSE, message = FALSE, eval = TRUE} ##Install if need be library(PointedSDMs) ``` ### Statistical model The goal of our statistical model is to infer the "true" distribution of our species' using the observations we have at hand, as well as environmental covariates describing the studied area, $\boldsymbol{X}$, and other parameters describing the species, $\boldsymbol{\theta}$. To do so, we construct a hierarchical state-space model composing of two parts: the underlying process model as well as observation models. The process model is a stochastic process which describes how the points are distributed in space. The most commonly assumed process model is the Log-Gaussian Cox process (LGCP), which is characterized by the intensity function $\lambda(s)$: such that a higher a higher intensity at a location implies the species is more abundant there, as well as a Gaussian random field used to account for all the environmental covariates not included in the model. The observation models give a statistical description of the data collection process, and are chosen for each dataset based on their underlying sampling protocols (see [@isaac2020data] for a detailed review of the types of observation models typically used in integrated models). As a result, we can write the likelihood of the observation model for a given dataset, $Y_i$ as $\mathcal{L} \left( Y_i \mid \lambda(s), \theta_i \right)$. Therefore, given a collection of heterogeneous datasets $\{Y_1, Y_2,...,Y_n\}$, the full-likelihood of the statistical process is given by:$$\mathcal{L} \left( \boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{\theta}, \boldsymbol{\phi} \right) = p\left( \lambda(s), \boldsymbol{X}, \boldsymbol{\phi} \right) \times \prod_{i=1}^n \mathcal{L} \left(Y_i \mid \lambda(s), \theta_i\right),$$where $\boldsymbol\phi$ are parameters assumed for the underlying process. ### Package use *PointedSDMs* is designed to simplify the modelling process of integrated species distribution models by using wrapper functions around those found in *R-INLA* and *inlabru*. The simplification is regarding four key steps in the statistical-modelling process: - data preparation, - model fitting, - cross-validation, - prediction. The first function of interest with regards to these steps is `startISDM`, which is designed to create objects required by *inlabru* to create an ISDM, assign the relevant information to individual objects used in model specification, as well as structure the provided datasets together into a homogenized framework. The output of this function is an *R6* object, and so there are several slot functions included to help specify the model at a finer scale. Use either `?specifyISDM` or `.$help()` to get a comprehensive description of these slot functions. ```{r startISDM} args(startISDM) ``` The next function is `startSpecies` which is used to create a multi-species ISDM. The arguments for this function are similar to those of `startISDM`, but focus on specifying components for the species in the model. Assistance on the slot functions of the object may be obtained by using `?specifySpecies` or by using `.$help()`. ```{r startSpecies} args(startSpecies) ``` After the model is specified using one of the above-mentioned functions, inference may be made using `fitISDM`. This function takes the data object created and runs the integrated model; the output of this function is in essence an *inlabru* model output with additional information to be used by the sequential functions. ```{r fitISDM} args(fitISDM) ``` Spatial blocked cross-validation may be performed using the `blockedCV` function. This function has two types of cross-validation: 1) iteratively calculate and averge the deviance information criteria (DIC) for models run without data from the selected spatial block 2) Estimate predictions for a specified dataset in the left out fold, and use these predictions as an offset in a new model to calculate a cross-validation score. Before this function is used, `.$spatialBlock()` from the object produced by one of the `start*` functions is required in order to specify how the data should be spatially blocked (see below for an example). ```{r args for blockedCV} args(blockedCV) ``` `datasetOut` provides cross-validation for integrated models by running the full model with one less dataset, and calculating a score measuring the relative importance of the left out dataset in the full model. ```{r datasetOut} args(datasetOut) ``` Finally prediction of the full model may be completed using the `predict` function; the arguments here are mostly related to specifying which components are required to be predicted, however the function can act identically to the *inlabru* `predict` function if need be. After predictions are complete, plots may be made using the generic `plot` function. ## *Setophaga* example This example aims to predict the distribution of three species of genus *setophaga* across Pennsylvania state. This example is notable in integrated modelling since it has been used by two seminal papers in the field, namely those by: @isaac2020data and @miller2019recent. This file extends the example by adding two additional datasets containing a further two species. ### Model preparations The first step in our analysis is to load in the packages required. ```{r Load packages, message=FALSE, warning=FALSE} library(INLA) library(inlabru) library(USAboundaries) library(sf) library(blockCV) library(ggplot2) library(sn) library(terra) library(RColorBrewer) library(cowplot) library(knitr) library(dplyr) library(spocc) ``` Finally, additional objects required by *PointedSDMs* and the *R-INLA* [@martins2013bayesian] and *inlabru* [@bachl2019inlabru] packages need to be assembled. An *sf* object of Pennsylvania (obtained using the *USAboundaries* [@USABoundaries] package) is the first of these objects required, and it will be used to construct an *fm_mesh_2d* object as well as help with the plotting of graphics. In order to create one of these polygons objects, we are first required to define the projection reference system. ```{r Map of PA} proj <- "+proj=utm +zone=17 +datum=WGS84 +units=km" PA <- USAboundaries::us_states(states = "Pennsylvania") PA <- st_transform(PA, proj) ``` #### Species occurrence data For our analysis, we used three datasets containing the geocoded locations of our studied species'. The first is a presence-only dataset obtained via *eBird* through the *spocc* package [@spocc], which is a toolkit used to obtain species' observation data from a selection of popular online data repositories (*GBIF*, *Vernet*, *BISON*, *iNaturalist* and *eBird*). These data were obtained using the following script: ```{r, get_eBird} species <- c('caerulescens', 'fusca', 'magnolia') dataSets <- list() for (bird in species) { raw_data <- spocc::occ( query = paste('Setophaga', bird), from = "gbif", date = c("2005-01-01", "2005-12-31"), geometry = st_bbox(st_transform(PA, '+proj=longlat +datum=WGS84 +no_defs')))$gbif rows <- grep("EBIRD", raw_data$data[[paste0('Setophaga_', bird)]]$collectionCode) raw_data <- data.frame(raw_data$data[[1]][rows, ]) raw_data$Species_name <- rep(bird, nrow(raw_data)) data_sp <- st_as_sf( x = raw_data[, names(raw_data) %in% c("longitude", "latitude", 'Species_name')], coords = c('longitude', 'latitude'), crs = '+proj=longlat +datum=WGS84 +no_defs') data_sp <- st_transform(data_sp, proj) dataSets[[paste0('eBird_', bird)]] <- data_sp[unlist(st_intersects(PA, data_sp)),] } ``` The *PointedSDMs* package also contains two additional structured datasets (*BBA, BBS*), which were both assumed to be presence absence datasets in the study conducted by @isaac2020data, containing the variable, *NPres*, denoting the presence (or number of presences in the *BBS* dataset) at each sampling location. However, we changed the *NPres* variable name of the *BBS* dataset to *Counts* in order to consider it a counts dataset for illustrative purposes. No additional changes were made to the *BBA* dataset, and so it was considered a presence absence dataset as originally intended. These datasets may be loaded using the following script: ```{r Load points} data('SetophagaData') dataSets[['BBA']] <- SetophagaData$BBA dataSets[['BBS']] <- SetophagaData$BBS ``` | **Dataset name** | Sampling protocol | **Number of observations** | **Species name** | **Source** | |:--:|:--:|:--:|:--:|:--:| | *BBS* | Counts | 45 | *Caerulescens* | [North American Breeding Bird Survey](https://www.pwrc.usgs.gov/bbs/){style="text-decoration: underline; color: rgb(252, 111, 9) !important;"} | | *BBA* | Detection/nondetection | 5165 | *Caerulescens* | [Pennsylvania Breeding Bird Atlas](https://ebird.org/atlaspa/home){style="text-decoration: underline; color: rgb(252, 111, 9) !important;"} | | *eBird_caerulescens* | Present only | 264 | *Caerulescens* | [eBird](https://ebird.org/home){style="text-decoration: underline; color: rgb(252, 111, 9) !important;"} | | *eBird_magnolia* | Present only | 354 | *Magnolia* | [eBird](https://ebird.org/home){style="text-decoration: underline; color: rgb(252, 111, 9) !important;"} | | *eBird_fusca* | Present only | 217 | *Fusca* | [eBird](https://ebird.org/home){style="text-decoration: underline; color: rgb(252, 111, 9) !important;"} | Table 1: Table illustrating the different datasets used in the analysis. #### Covariate data Species distribution models study the relationship between our in-situ species observations and the underlying environment, and so in line with the study conducted by @isaac2020data, we considered two covariates: *elevation*, describing the height above sea level (meters) and *canopy*, describing the proportion of tree canopy covered in the area. Furthermore, we standardize these covariates to get them onto the same scale. These two covariates were obtained from the @hollister2017elevatr and @FEDdata R packages respectively. ```{r Covariate data, message = FALSE, warning = FALSE} covariates <- terra::rast(system.file('extdata/SetophagaCovariates.tif', package = "PointedSDMs")) values(covariates$PA_lc_NLCD_2011_Tree_Canopy_L48_nlcd)[is.na(values(covariates$PA_lc_NLCD_2011_Tree_Canopy_L48_nlcd))] <- 0 covariates <- scale(covariates) names(covariates) <- c('elevation', 'canopy') plot(covariates) ``` #### Additional objects The *R-INLA* package requires a Delaunay triangulated mesh used to approximate our spatial random fields, which is created by supplying the `max.edge`, `offset` and `cutoff` arguments as well as our *sf* map of Pennsylvania to the `fm_mesh_2d_inla` function. With this mesh, `startISDM` will create integration points required by our model using *fmesher*'s `fm_int` function. ```{r Mesh, warning = FALSE, message = FALSE, fig.width=8, fig.height=5} mesh <- fm_mesh_2d_inla(boundary = inla.sp2segment(PA), cutoff = 0.2 * 5, max.edge = c(0.1, 0.24) * 80, #40 #120 offset = c(0.1, 0.4) * 100, crs = st_crs(proj)) mesh_plot <- ggplot() + gg(mesh) + ggtitle('Plot of mesh') + theme_bw() + theme(plot.title = element_text(hjust = 0.5)) mesh_plot ``` We furthermore define model options which will be used for all runs in this vignette. ```{r modelOptions} modelOptions <- list(control.inla = list(int.strategy = 'ccd', cmin = 0), verbose = TRUE, safe = TRUE) ``` ##### `startISDM` The `startISDM` function is the first step in the modeling of our species data. The aim of the function is to organize and standardize our data to ensure that each process is correctly modeled in the integrated framework. For this standardization to work, we specify the response variable names of the counts data (`responseCounts`) and present absence (`responsePA`) as well as the coordinate reference system used (`Projection`). We may specify the formula for the covariate effects using the argument `Formulas = list(covariateFormula = ~.)`. If were to omit this argument, the model would create a formula for the fixed effects included in `spatialCovariates`. ```{r Model prep, warning = FALSE, message = FALSE} caerulescensData <- dataSets[c(1,4,5)] caerulescensModel <- startISDM(caerulescensData, Boundary = PA, Projection = proj, Mesh = mesh, responsePA = 'NPres', responseCounts = 'Counts', spatialCovariates = covariates, Formulas = list( covariateFormula = ~ elevation*canopy) ) ``` As mentioned above, there are some slot functions now inside the `caerulescensModel` objects, which allow for a finer level of customization of the components for the integrated model. Documentation for these slot functions may be found by using ```{r help, eval = FALSE} caerulescensModel$help() ``` ###### `.$plot` Let's have a look at a graphical representation of the species' distribution, using the `.$plot()` function within `caerulescensModel`. ```{r dataset plot, fig.width=8, fig.height=5} caerulescensModel$plot() + theme_bw() + ggtitle('Plot of the datasets') + theme(plot.title = element_text(hjust = 0.5)) ``` ###### `.$specifySpatial` When conducting Bayesian analysis, the choice of prior distribution is imperative. In our example, we chose Penalizing complexity priors [@simpson2017penalising], which are designed to control the spatial range and marginal standard deviation in the GRF's Matérn covariance function in order to reduce over-fitting in the model. This can be done with ease using the `.$specifySpatial` function, which will also update all the components associated with this field automatically. ```{r specifySpatial} caerulescensModel$specifySpatial(sharedSpatial = TRUE, constr = TRUE, prior.sigma = c(0.1, 0.05), prior.range = c(200, 0.1)) ``` ###### `.$addBias` Presence only datasets (such as our *eBird* data) are often noted to contain numerous biases, which in turn can heavily skew results obtained from our model. A common approach to account for these biases is by using a variable in the model such as sampling effort (some examples include: distance traveled, number of observations per visit, duration spent observing). However in the absence of such variables, running a second spatial field in the model can improve performance of the integrated model [@simmonds2020more]. To add a second spatial field to our model, we use the `.$addBias` function, which is part of the `specifyISDM` object created above. ```{r bias fields} caerulescensModel$addBias(datasetNames = 'eBird_caerulescens') caerulescensModel$specifySpatial(Bias = TRUE, prior.sigma = c(0.1, 0.05), prior.range = c(120, 0.1)) ``` ###### `.$priorsFixed` Suppose we knew *a priori* what the mean and precision values of one of the fixed terms for a given species was: we could add this knowledge to the model using the `.$priorsFixed` function. ```{r priorsFixed} caerulescensModel$priorsFixed(Effect = 'Intercept', mean.linear = 0, prec.linear = 1) ``` ###### `.$changeComponents` If we would like to see what the components required by *inlabru* in our model are, we can use the `.$changeComponents` function with no arguments specified. ```{r changeComponents} caerulescensModel$changeComponents() ``` This model is thus specified with: two environmental covariates and their quadratics, intercept terms for the datasets and a shared spatial field. ###### `.$specifyRandom` To speed up computations, we set the *beta* parameter for the shared spatial effect to `TRUE`. This may be done using the `.$specifyRandom` function. ```{r specifyRandom} caerulescensModel$specifyRandom(copyModel = list(beta = list(fixed = TRUE))) caerulescensModel$changeComponents() ``` ##### `fitISDM` The integrated model is easily run with the `fitISDM` function as seen below by supplying the occurrence and *R-INLA* objects (created above with `startISDM`). ```{r fitISDM run} caerulescensEst <- fitISDM(data = caerulescensModel, options = modelOptions) summary(caerulescensEst) ``` ###### `predict` and `plot` A significant part of SDMs is creating prediction maps to understand the species' spread. Predictions of the integrated SDM's from `fitISDM` are made easy using the `predict` function. By supplying the relevant components, the predict function will create the formula required by *inlabru*'s `predict` function to make predictive maps. In this example, we made predictions for the spatial fields of the species, on the linear scale using 100 samples. *PointedSDMs* also provides a general plotting function to plot the maps using the predicted values obtained in the `predict` function. The plot below shows the predictions of the two spatial fields of our species. ```{r predict and plot} caerulescensPredictions <- predict(caerulescensEst, data = fm_pixels(mesh = mesh, mask = PA), spatial = TRUE, n.samples = 100) plot(caerulescensPredictions, variable = c('mean', 'sd')) caerulescensBias <- predict(caerulescensEst, data = fm_pixels(mesh = mesh, mask = PA), bias = TRUE, n.samples = 100) plot(caerulescensBias, variable = c('mean', 'sd')) ``` ##### `startSpecies` The next function of interest is `startSpecies`, which is used to construct a multi-species ISDM. The argument `speciesName` is required, and it denotes the name of the species variable common across the datasets. Additional arguments include: `speciesSpatial` to control the type of spatial effect used for the species, `speciesIntercept` to control the intercept for the species and `speciesEnvironment` to control the environmental effects for the species (common across species or shared). For this example, we use the default argument choices which include: a model with a spatial effect per species (with shared hyperparameters) and a random intercept term for the species. We remove the dataset specific spatial term for computational purposes by setting `pointsSpatial = NULL`. ```{r startSpeciesStart} speciesModel <- startSpecies(dataSets, Boundary = PA, pointsSpatial = NULL, Projection = proj, Mesh = mesh, responsePA = 'NPres', responseCounts = 'Counts', spatialCovariates = covariates, speciesName = 'Species_name') ``` The output of this function is an *R6* object, and additional documentation from the function be be obtained using `?.$help()`. ```{r species help, eval = FALSE} speciesModel$help() ``` ###### Specifying the model Like the single-species model provided above, we specify the priors for the fixed and random effects using `.$specifySpatial` , `.$priorsFixed` and `.$specifyRandom`. In `.$specifyRandom`, the argument `speciesGroup` is used to change the prior for the precision (inverse variance) of group model for the spatial effects, and `speciesIntercepts` is used to change the prior for the precision for the random intercept term. For both of these parameters, we choose to fix the parameters to variance 1 for simplicity. ```{r specifySpecies} speciesModel$specifySpatial(Species = TRUE, constr = TRUE, prior.sigma = c(0.01, 0.05), prior.range = c(100, 0.1)) speciesModel$priorsFixed(Effect = 'Intercept', mean.linear = 0, prec.linear = 1) speciesModel$specifyRandom(speciesGroup = list(model = "iid", hyper = list( prec = list( initial = log(1), fixed = TRUE ))), speciesIntercepts = list( initial = log(1), fixed = TRUE )) ``` ###### Fitting, predicting and plotting We may then estimate the model using `fitISDM`. ```{r fitSpecies} modelOptionsSpecies <- modelOptions modelOptionsSpecies$control.inla$h <- 5e-4 modelOptionsSpecies$control.inla$tolerance <- 5e-5 speciesEst <- fitISDM(data = speciesModel, options = modelOptionsSpecies) summary(speciesEst) ``` Predictions and plotting are completed as follows: ```{r predictionsSpecies} speciesPredictions <- predict(speciesEst, data = fm_pixels(mesh = mesh, mask = PA), spatial = TRUE, n.samples = 100) plot(speciesPredictions) ``` ##### Model evaluation *PointedSDMs* has two functions to functions to evaluate models. These are `blockedCV` and `datasetOut.` ###### `.$spatialBlock` `.$spatialBlock` is used to set up spatial blocked cross-validation for the model by assigning each point in the datasets a block based on where the point is located spatially. For this example, we chose four blocks (`k=4`) for our model, based around a 50x50 grid (`rows = 50, cols = 50`). See the figure below for an illustration of the spatial block: the amount of data within each block appears reasonably equal. ```{r spatialBlock, warning = FALSE, message = FALSE, fig.width=8, fig.height=5} caerulescensModel$spatialBlock(k = 4, rows_cols = c(50, 50), plot = TRUE, seed = 123) + theme_bw() ``` ##### `blockedCV` The blocked model may then be estimated with `blockedCV`. Note that this will estimate `k` models, so it may take a while to complete. ```{r blockedCV, warning = FALSE} spatialBlocked <- blockedCV(data = caerulescensModel, options = modelOptions) ``` ```{r print spatialBlocked} spatialBlocked ``` More so, we can compare the cross-validation score from this model to one without the shared spatial effect (specified with `pointsSpatial = NULL` in `startISDM`). ```{r No fields model, message = FALSE, warning = FALSE} no_fields <- startISDM(caerulescensData, pointsSpatial = NULL, Boundary = PA, Projection = proj, Mesh = mesh, responsePA = 'NPres', responseCounts = 'Counts', spatialCovariates = covariates, Formulas = list( covariateFormula = ~ elevation*canopy) ) no_fields$spatialBlock(k = 4, rows_cols = c(50, 50), plot = TRUE, seed = 123) + theme_bw() ``` ```{r spatialBlocked_no_fields} spatialBlocked_no_fields <- blockedCV(data = no_fields, options = modelOptions) ``` ```{r print spatialBlocked_no_fields} spatialBlocked_no_fields ``` Based on the DIC scores, we conclude that the model with the shared spatial field provides a better fit of the data. ##### `datasetOut` *PointedSDMs* also includes the function `datasetOut`, which iteratively omits one dataset (and its associated marks) out of the full model. For example, we can calculate the effect on the other variables by leaving out the dataset *BBA*, which contributes the most occurrences used in our joint-likelihood model by a significant amount. Furthermore, by setting `predictions = TRUE` we are able to calculate some cross-validation score by leaving out the selected dataset. This score is calculated by predicting the covariate values of the left-out using the reduced model (i.e the model with the dataset left out), using the predicted values as an offset in a new model, and then finding the difference between the marginal-likelihood of the full model (i.e the model with all the datasets considered) and the marginal-likelihood of the offset model. ```{r Leave one out, message = FALSE, warning = FALSE} dataset_out <- datasetOut(model = caerulescensEst, dataset = "BBA", predictions = TRUE) dataset_out ``` #### References