Predicting the distribution of species across space and time is a fundamental piece for answering a multitude of questions surrounding the field of ecology. Consequentially, numerous statistical tools have been developed to assist ecologists in making these types of predictions; the most common of which is certainly species distribution models (SDMs).
These types of models are heavily reliant on the quality and quantity of data available across multiple time periods in order to produce meaningful results. Thankfully, the \(21^{st}\) century has been associated with a boom in digital technology and the collapse of data storage costs achieved, allowing us to see data on the environment and species occurrences grow at unprecedented rates. The problem however is that all these data are disparate in nature; they all have their own sampling protocols, assumptions and attributes, making joint analysis of them difficult.
On account of this issue, the field of integrated species distribution modelling has progressed considerably over the last few years – and now the associated methods are well-developed, and the benefits of using such models (over single dataset or simple data-pooling models) are apparent.
A major handicap to integrated modeling lies in the fact that the tools required to make inference with these models have not been established – especially with regards to general software to be used by ecologists. This in turn has stagnated the growth of applying these models to real data.
In light of this impediment to the field of integrated modelling, we decided to make PointedSDMs, an R package to help simplify the modelling process of integrated species distribution models (ISDMs) for ecologists. It does so by using the INLA framework (Rue, Martino, and Chopin 2009) to approximate the models and by constructing wrapper functions around those provided in the R package, inlabru (Bachl et al. 2019).
This R markdown file illustrates an example of modelling an ISDM with PointedSDMs, using three disparate datasets containing information regarding Setophaga, collected around Pennsylvania state (United States of America). The first part of this file contains a brief introduction to the statistical model used in ISDMs, as well as an introduction to the included functions in the package. The second part of this file uses PointedSDMs to run the ISDM for the provided data. We note that, for this particular vignette, no inference was made due to computational intensity of the model. However the R script and data are provided below so that the you may carry out inference.
The goal of our statistical model is to infer the “true” distribution of our species’ using the observations we have at hand, as well as environmental covariates describing the studied area, \(\boldsymbol{X}\), and other parameters describing the species, \(\boldsymbol{\theta}\). To do so, we construct a hierarchical state-space model composing of two parts: the underlying process model as well as observation models.
The process model is a stochastic process which describes how the points are distributed in space. The most commonly assumed process model is the Log-Gaussian Cox process (LGCP), which is characterized by the intensity function \(\lambda(s)\): such that a higher a higher intensity at a location implies the species is more abundant there, as well as a Gaussian random field used to account for all the environmental covariates not included in the model.
The observation models give a statistical description of the data collection process, and are chosen for each dataset based on their underlying sampling protocols (see (Isaac et al. 2020) for a detailed review of the types of observation models typically used in integrated models). As a result, we can write the likelihood of the observation model for a given dataset, \(Y_i\) as \(\mathcal{L} \left( Y_i \mid \lambda(s), \theta_i \right)\).
Therefore, given a collection of heterogeneous datasets \(\{Y_1, Y_2,...,Y_n\}\), the full-likelihood of the statistical process is given by:\[\mathcal{L} \left( \boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{\theta}, \boldsymbol{\phi} \right) = p\left( \lambda(s), \boldsymbol{X}, \boldsymbol{\phi} \right) \times \prod_{i=1}^n \mathcal{L} \left(Y_i \mid \lambda(s), \theta_i\right),\]where \(\boldsymbol\phi\) are parameters assumed for the underlying process.
PointedSDMs is designed to simplify the modelling process of integrated species distribution models by using wrapper functions around those found in R-INLA and inlabru. The simplification is regarding four key steps in the statistical-modelling process:
data preparation,
model fitting,
cross-validation,
prediction.
The first function of interest with regards to these steps is
startISDM
, which is designed to create objects required by
inlabru to create an ISDM, assign the relevant information to
individual objects used in model specification, as well as structure the
provided datasets together into a homogenized framework.
The output of this function is an R6 object, and so there
are several slot functions included to help specify the model at a finer
scale. Use either ?specifyISDM
or .$help()
to
get a comprehensive description of these slot functions.
The next function is startSpecies
which is used to
create a multi-species ISDM. The arguments for this function are similar
to those of startISDM
, but focus on specifying components
for the species in the model. Assistance on the slot functions of the
object may be obtained by using ?specifySpecies
or by using
.$help()
.
After the model is specified using one of the above-mentioned
functions, inference may be made using fitISDM
. This
function takes the data object created and runs the integrated model;
the output of this function is in essence an inlabru model
output with additional information to be used by the sequential
functions.
Spatial blocked cross-validation may be performed using the
blockedCV
function. This function has two types of
cross-validation: 1) iteratively calculate and averge the deviance
information criteria (DIC) for models run without data from the selected
spatial block 2) Estimate predictions for a specified dataset in the
left out fold, and use these predictions as an offset in a new model to
calculate a cross-validation score. Before this function is used,
.$spatialBlock()
from the object produced by one of the
start*
functions is required in order to specify how the
data should be spatially blocked (see below for an example).
datasetOut
provides cross-validation for integrated
models by running the full model with one less dataset, and calculating
a score measuring the relative importance of the left out dataset in the
full model.
Finally prediction of the full model may be completed using the
predict
function; the arguments here are mostly related to
specifying which components are required to be predicted, however the
function can act identically to the inlabru
predict
function if need be. After predictions are
complete, plots may be made using the generic plot
function.
This example aims to predict the distribution of three species of genus setophaga across Pennsylvania state. This example is notable in integrated modelling since it has been used by two seminal papers in the field, namely those by: Isaac et al. (2020) and Miller et al. (2019). This file extends the example by adding two additional datasets containing a further two species.
The first step in our analysis is to load in the packages required.
library(INLA)
library(inlabru)
library(USAboundaries)
library(sf)
library(blockCV)
library(ggplot2)
library(sn)
library(terra)
library(RColorBrewer)
library(cowplot)
library(knitr)
library(dplyr)
library(spocc)
Finally, additional objects required by PointedSDMs and the R-INLA (Martins et al. 2013) and inlabru (Bachl et al. 2019) packages need to be assembled.
An sf object of Pennsylvania (obtained using the USAboundaries (rOpenSci 2018) package) is the first of these objects required, and it will be used to construct an fm_mesh_2d object as well as help with the plotting of graphics. In order to create one of these polygons objects, we are first required to define the projection reference system.
proj <- "+proj=utm +zone=17 +datum=WGS84 +units=km"
PA <- USAboundaries::us_states(states = "Pennsylvania")
PA <- st_transform(PA, proj)
For our analysis, we used three datasets containing the geocoded locations of our studied species’. The first is a presence-only dataset obtained via eBird through the spocc package (Chamberlain 2021), which is a toolkit used to obtain species’ observation data from a selection of popular online data repositories (GBIF, Vernet, BISON, iNaturalist and eBird). These data were obtained using the following script:
species <- c('caerulescens', 'fusca', 'magnolia')
dataSets <- list()
for (bird in species) {
raw_data <- spocc::occ(
query = paste('Setophaga', bird),
from = "gbif",
date = c("2005-01-01", "2005-12-31"),
geometry = st_bbox(st_transform(PA,
'+proj=longlat +datum=WGS84 +no_defs')))$gbif
rows <- grep("EBIRD", raw_data$data[[paste0('Setophaga_', bird)]]$collectionCode)
raw_data <- data.frame(raw_data$data[[1]][rows, ])
raw_data$Species_name <- rep(bird, nrow(raw_data))
data_sp <- st_as_sf(
x = raw_data[, names(raw_data) %in% c("longitude", "latitude", 'Species_name')],
coords = c('longitude', 'latitude'),
crs = '+proj=longlat +datum=WGS84 +no_defs')
data_sp <- st_transform(data_sp, proj)
dataSets[[paste0('eBird_', bird)]] <- data_sp[unlist(st_intersects(PA, data_sp)),]
}
The PointedSDMs package also contains two additional structured datasets (BBA, BBS), which were both assumed to be presence absence datasets in the study conducted by Isaac et al. (2020), containing the variable, NPres, denoting the presence (or number of presences in the BBS dataset) at each sampling location. However, we changed the NPres variable name of the BBS dataset to Counts in order to consider it a counts dataset for illustrative purposes. No additional changes were made to the BBA dataset, and so it was considered a presence absence dataset as originally intended. These datasets may be loaded using the following script:
Dataset name | Sampling protocol | Number of observations | Species name | Source |
---|---|---|---|---|
BBS | Counts | 45 | Caerulescens | North American Breeding Bird Survey |
BBA | Detection/nondetection | 5165 | Caerulescens | Pennsylvania Breeding Bird Atlas |
eBird_caerulescens | Present only | 264 | Caerulescens | eBird |
eBird_magnolia | Present only | 354 | Magnolia | eBird |
eBird_fusca | Present only | 217 | Fusca | eBird |
Table 1: Table illustrating the different datasets used in the analysis.
Species distribution models study the relationship between our in-situ species observations and the underlying environment, and so in line with the study conducted by Isaac et al. (2020), we considered two covariates: elevation, describing the height above sea level (meters) and canopy, describing the proportion of tree canopy covered in the area. Furthermore, we standardize these covariates to get them onto the same scale. These two covariates were obtained from the Hollister and Shah (2017) and Bocinsky et al. (2019) R packages respectively.
covariates <- terra::rast(system.file('extdata/SetophagaCovariates.tif',
package = "PointedSDMs"))
values(covariates$PA_lc_NLCD_2011_Tree_Canopy_L48_nlcd)[is.na(values(covariates$PA_lc_NLCD_2011_Tree_Canopy_L48_nlcd))] <- 0
covariates <- scale(covariates)
names(covariates) <- c('elevation', 'canopy')
plot(covariates)
The R-INLA package requires a Delaunay triangulated mesh
used to approximate our spatial random fields, which is created by
supplying the max.edge
, offset
and
cutoff
arguments as well as our sf map of
Pennsylvania to the fm_mesh_2d_inla
function. With this
mesh, startISDM
will create integration points required by
our model using fmesher’s fm_int
function.
mesh <- fm_mesh_2d_inla(boundary = inla.sp2segment(PA),
cutoff = 0.2 * 5,
max.edge = c(0.1, 0.24) * 80, #40 #120
offset = c(0.1, 0.4) * 100,
crs = st_crs(proj))
mesh_plot <- ggplot() +
gg(mesh) +
ggtitle('Plot of mesh') +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
mesh_plot
We furthermore define model options which will be used for all runs in this vignette.
modelOptions <- list(control.inla =
list(int.strategy = 'ccd',
cmin = 0),
verbose = TRUE,
safe = TRUE)
startISDM
The startISDM
function is the first step in the modeling
of our species data. The aim of the function is to organize and
standardize our data to ensure that each process is correctly modeled in
the integrated framework. For this standardization to work, we specify
the response variable names of the counts data
(responseCounts
) and present absence
(responsePA
) as well as the coordinate reference system
used (Projection
).
We may specify the formula for the covariate effects using the
argument Formulas = list(covariateFormula = ~.)
. If were to
omit this argument, the model would create a formula for the fixed
effects included in spatialCovariates
.
caerulescensData <- dataSets[c(1,4,5)]
caerulescensModel <- startISDM(caerulescensData, Boundary = PA,
Projection = proj, Mesh = mesh,
responsePA = 'NPres', responseCounts = 'Counts',
spatialCovariates = covariates,
Formulas =
list(
covariateFormula = ~ elevation*canopy)
)
As mentioned above, there are some slot functions now inside the
caerulescensModel
objects, which allow for a finer level of
customization of the components for the integrated model. Documentation
for these slot functions may be found by using
.$plot
Let’s have a look at a graphical representation of the species’
distribution, using the .$plot()
function within
caerulescensModel
.
.$specifySpatial
When conducting Bayesian analysis, the choice of prior distribution
is imperative. In our example, we chose Penalizing complexity priors
(Simpson et al. 2017), which are designed
to control the spatial range and marginal standard deviation in the
GRF’s Matérn covariance function in order to reduce over-fitting in the
model. This can be done with ease using the
.$specifySpatial
function, which will also update all the
components associated with this field automatically.
.$addBias
Presence only datasets (such as our eBird data) are often
noted to contain numerous biases, which in turn can heavily skew results
obtained from our model. A common approach to account for these biases
is by using a variable in the model such as sampling effort (some
examples include: distance traveled, number of observations per visit,
duration spent observing). However in the absence of such variables,
running a second spatial field in the model can improve performance of
the integrated model (Simmonds et al.
2020). To add a second spatial field to our model, we use the
.$addBias
function, which is part of the
specifyISDM
object created above.
.$priorsFixed
Suppose we knew a priori what the mean and precision values
of one of the fixed terms for a given species was: we could add this
knowledge to the model using the .$priorsFixed
function.
.$changeComponents
If we would like to see what the components required by
inlabru in our model are, we can use the
.$changeComponents
function with no arguments
specified.
This model is thus specified with: two environmental covariates and their quadratics, intercept terms for the datasets and a shared spatial field.
fitISDM
The integrated model is easily run with the fitISDM
function as seen below by supplying the occurrence and R-INLA
objects (created above with startISDM
).
caerulescensEst <- fitISDM(data = caerulescensModel,
options = modelOptions)
summary(caerulescensEst)
predict
and plot
A significant part of SDMs is creating prediction maps to understand
the species’ spread. Predictions of the integrated SDM’s from
fitISDM
are made easy using the predict
function. By supplying the relevant components, the predict function
will create the formula required by inlabru’s
predict
function to make predictive maps. In this example,
we made predictions for the spatial fields of the species, on the linear
scale using 100 samples.
PointedSDMs also provides a general plotting function to
plot the maps using the predicted values obtained in the
predict
function. The plot below shows the predictions of
the two spatial fields of our species.
caerulescensPredictions <- predict(caerulescensEst,
data = fm_pixels(mesh = mesh,
mask = PA),
spatial = TRUE,
n.samples = 100)
plot(caerulescensPredictions, variable = c('mean', 'sd'))
caerulescensBias <- predict(caerulescensEst,
data = fm_pixels(mesh = mesh,
mask = PA),
bias = TRUE,
n.samples = 100)
plot(caerulescensBias, variable = c('mean', 'sd'))
startSpecies
The next function of interest is startSpecies
, which is
used to construct a multi-species ISDM. The argument
speciesName
is required, and it denotes the name of the
species variable common across the datasets. Additional arguments
include: speciesSpatial
to control the type of spatial
effect used for the species, speciesIntercept
to control
the intercept for the species and speciesEnvironment
to
control the environmental effects for the species (common across species
or shared).
For this example, we use the default argument choices which include:
a model with a spatial effect per species (with shared hyperparameters)
and a random intercept term for the species. We remove the dataset
specific spatial term for computational purposes by setting
pointsSpatial = NULL
.
speciesModel <- startSpecies(dataSets, Boundary = PA,
pointsSpatial = NULL,
Projection = proj, Mesh = mesh,
responsePA = 'NPres', responseCounts = 'Counts',
spatialCovariates = covariates,
speciesName = 'Species_name')
The output of this function is an R6 object, and additional
documentation from the function be be obtained using
?.$help()
.
Like the single-species model provided above, we specify the priors
for the fixed and random effects using .$specifySpatial
,
.$priorsFixed
and .$specifyRandom
. In
.$specifyRandom
, the argument speciesGroup
is
used to change the prior for the precision (inverse variance) of group
model for the spatial effects, and speciesIntercepts
is
used to change the prior for the precision for the random intercept
term. For both of these parameters, we choose to fix the parameters to
variance 1 for simplicity.
speciesModel$specifySpatial(Species = TRUE,
constr = TRUE,
prior.sigma = c(0.01, 0.05),
prior.range = c(100, 0.1))
speciesModel$priorsFixed(Effect = 'Intercept',
mean.linear = 0,
prec.linear = 1)
speciesModel$specifyRandom(speciesGroup = list(model = "iid",
hyper = list(
prec = list(
initial = log(1),
fixed = TRUE
))),
speciesIntercepts = list(
initial = log(1), fixed = TRUE
))
We may then estimate the model using fitISDM
.
modelOptionsSpecies <- modelOptions
modelOptionsSpecies$control.inla$h <- 5e-4
modelOptionsSpecies$control.inla$tolerance <- 5e-5
speciesEst <- fitISDM(data = speciesModel,
options = modelOptionsSpecies)
summary(speciesEst)
Predictions and plotting are completed as follows:
PointedSDMs has two functions to functions to evaluate
models. These are blockedCV
and
datasetOut.
.$spatialBlock
.$spatialBlock
is used to set up spatial blocked
cross-validation for the model by assigning each point in the datasets a
block based on where the point is located spatially. For this example,
we chose four blocks (k=4
) for our model, based around a
50x50 grid (rows = 50, cols = 50
). See the figure below for
an illustration of the spatial block: the amount of data within each
block appears reasonably equal.
blockedCV
The blocked model may then be estimated with blockedCV
.
Note that this will estimate k
models, so it may take a
while to complete.
More so, we can compare the cross-validation score from this model to
one without the shared spatial effect (specified with
pointsSpatial = NULL
in startISDM
).
no_fields <- startISDM(caerulescensData,
pointsSpatial = NULL,
Boundary = PA,
Projection = proj, Mesh = mesh,
responsePA = 'NPres', responseCounts = 'Counts',
spatialCovariates = covariates,
Formulas =
list(
covariateFormula = ~ elevation*canopy)
)
no_fields$spatialBlock(k = 4, rows_cols = c(50, 50),
plot = TRUE, seed = 123) + theme_bw()
Based on the DIC scores, we conclude that the model with the shared spatial field provides a better fit of the data.
datasetOut
PointedSDMs also includes the function
datasetOut
, which iteratively omits one dataset (and its
associated marks) out of the full model. For example, we can calculate
the effect on the other variables by leaving out the dataset
BBA, which contributes the most occurrences used in our
joint-likelihood model by a significant amount.
Furthermore, by setting predictions = TRUE
we are able
to calculate some cross-validation score by leaving out the selected
dataset. This score is calculated by predicting the covariate values of
the left-out using the reduced model (i.e the model with the dataset
left out), using the predicted values as an offset in a new model, and
then finding the difference between the marginal-likelihood of the full
model (i.e the model with all the datasets considered) and the
marginal-likelihood of the offset model.