---
title: "ORCHAMP_dataset"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ORCHAMP_dataset}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, include = FALSE}
knitr::opts_chunk$set(fig.path='figures/',
                      collapse = TRUE,
                      comment = "#>",
                      fig.width = 8, fig.height = 7
)
```

# Fitting and evaluating a join trait distribution model

In this vignette we explain how to fit, evaluate and interpret a joint trait distribution model (JTDM). We refer to Poggiato et al., in prep. "Joint models and predictions of community traits" for the full description of JTDMs, and we rather focus here on presenting the R package \code{jtdm} by applying it to the dataset of the case study presented in the publication.

## Installing the R package

```{r,warning=FALSE,results='hide',message=FALSE}
## CRAN
#install.packages('jtdm', repos = "http://cran.us.r-project.org")
## Github
#library(devtools)
#install_github("giopogg/jtdm")
library(jtdm)
library(ggplot2)
```

# Fit a jtdm to data

JTDMs relate a vector of average community traits (e.g. community weighted mean, CWM) to a vector of environmental covariates. The matrix Y contains the CWM traits of specific leaf area (SLA), leaf nitrogen content (LNC) and plant height of 116 plant community sites situated along 21 elevation gradients (www.orchamp.osug.fr) in the French Alps. 
```{r Y}
data(Y)
summary(Y)
```
We consider two environmental covariates: Growing Degree Days (GDD, the sum of temperature of days with positive temperature in the growing season) to represent the average length and intensity of the growing season and intensity of freezing events (Freezing Degree Days, the sum of temperature of days with negative temperature in the growing season, FDD). To account for habitat type, we included a variable that was set to 1 when the site was in a forest, 0 otherwise (hereafter habitat). These data are stored in the matrix X.
```{r X}
data(X)
summary(X)
```

JTDMs infer a linear regression for each CWM trait as a function of the environmental covariates, together with an inter-traits residual covariance matrix. Therefore, the parameters of a JTDM are the regression coefficients $B$ and the residual covariance $\Sigma$. Notice that inferring a generalized linear model is possible, even thus not yet implemented in our package. The inference of JTDM is implemented in the Bayesian framework with the function `jtdm_fit`.
This functions samples from the posterior distribution, which has been analytically determined. Therefore, there is no need for classical MCMC convergence checks. The syntax of the function is similar to the popular function `lm`. This function requires the response matrix Y, the predictor matrix X and a (right-hand only) formula to specify how CWM traits depends on environmental variables. We choose here to include in the model linear and quadratic terms (using orthogonal polynomials) of GDD and FDD in interaction with habitat to enable non linear trait-environment relationships and to allow for different trait-environment relationships in forest and open habitats. The \code{jtdm_fit} is very fast, however, to ensure that the whole vignette runs quickly fast, we only sample a relative low number of samples from the posterior distribution (100 here). To obtain reliable results, the number of samples should be increased (around 1000).

``` {r,results='hide',warning=FALSE, message=FALSE}
# Short MCMC to obtain a fast example: results are unreliable !
m = jtdm_fit(Y = Y, X = X,
             formula = as.formula("~poly(GDD,2)+poly(FDD,2)+poly(GDD,2):forest+poly(FDD,2):forest"),
             sample = 100)

summary(m)
```


## Get parameters

Since we ran the JTDM with very short chains, the MCMC has not converged. You can try to run the model with longer chains and check its convergence.

We can now look at the inferred the regression coefficients. The `getB` function provides the MCMC samples of the regression coefficients matrix $B$, together with its mean, and $95\%$ credible interval. 
```{r}
# Inferred parameters
B = getB(m)
```

`get_sigma` provides instead the MCMC samples of the residual covariance matrix $\Sigma$ together with its mean and $95\%$ credible interval. 
```{r}
Sigma = get_sigma(m)$Smean
```

We can plot these parameters with the function `plot`
```{r}
plot(m)
```

## Predict and evaluate model fit

We can predict CWM traits on a (new) set of sites using `jtdm_predict`. Here, we predict on the training dataset and compute goodness of fit measures of these predictions like $R^2$ and RMSE, by setting `validation=TRUE`. We can choose whether to obtain the full posterior predictive distribution (`FullPost=TRUE` which is more time consuming) or just its posterior mean (`FullPost=FALSE`, which we advice if the aim is to only compute goodness of fit measures).
```{r}
predictions = jtdm_predict(m, Xnew = X, Ynew = Y,validation = T)
predictions$R2
predictions$RMSE
```
We can evaluate the performances of the model using a K-fold cross-validation using the function `jtdmCV`
```{r,results = 'hide',message=FALSE,warning=FALSE}
predictionsCV = jtdmCV(m, K = 5, sample = 100)

predictionsCV$R2
predictionsCV$RMSE
```

## Trait-environment relationships

We can now analyze the inferred trait-environment relationships using the function `partial_response`, that computes and plots the partial response curve of a focal trait along a focal environmental gradient. This function takes as input the model `m`, the focal environmental variable `indexGradient` and the focal trait `indexTrait`, using the names of environmental variables and traits as specified in the column names of X and Y respectively. The function builts a dataset of environmental variables by building the gradient of the focal environmental variable and keeping all other variables fixed to their mean. Then, it predicts the marginal distribution of the focal trait for the so-built environmental dataset. The user can choose whether to obtain the full predictive distribution  (`FullPost=TRUE`), or the predictive distribution of the mean term (`FullPost="mean"`, which we advice here), or just the posterior mean (`FullPost="mean"`). The function outputs are the plot of the inferred trait-environment relationship and the predictions of the model (used to produce the plot). For example, we can plot the partial response curve relationship between SLA and GDD
```{r}
partial_response(m, indexGradient = "GDD",indexTrait = "SLA")$p
```


The user can eventually provide a given gradient for the focal environmental variable `XFocal` (which is otherwise built on regular grid with length given by `grid.length`). The user can also choose to fix the non focal environmental variables to another value. For example, we can obtain the partial response curve of SLA and GDD in forest
```{r}
partial_response(m, indexGradient = "GDD", indexTrait = "SLA",
                 FixX = list(GDD = NULL, FDD = NULL, forest = 1))$p
```

## Multivariate confidence intervals

We can then compute the partial response curves of pairwise CWM trait combinations together with their $95\%$ credible region, what we define in the publication as the most likely CWM combination and envelop of possible CWM combinations. This is done by the function `ellipse_plot`, that takes as input the model `m`, the focal environmental variable `indexGradient` and the two focal traits `indexTrait`, using the names of environmental variables and traits as specified in the column names of X and Y respectively. The function builds a gradient of the focal environmental variable while keeping all other variables fixed to their mean, and then predict and plots the joint distribution of the focal traits. The user can choose whether to obtain the full predictive distribution  (`FullPost=TRUE`), or the predictive distribution of the mean term (`FullPost=F`, which we advice here to obtain smoother curves).

```{r}
# plot the pairwise SLA-LNC partial response curve along the GDD gradient
ellipse_plot(m, indexTrait = c("SLA","LNC"), indexGradient = "GDD")
```


The user can also choose to fix the non-focal environmental variables to another value. For example, we can obtain the partial response curves of the most likely CWM combination and envelop of possible CWM combinations of SLA and GDD in forest.

```{r}
# plot the pairwise SLA-LNC partial response curve along the GDD gradient
ellipse_plot(m, indexTrait = c("SLA","LNC"), indexGradient="GDD",
             FixX = list(GDD = NULL, FDD = NULL, forest = 1))
```

## Joint predictions

The `jtdm` package allows to define a region in the community-trait space and compute their joint probabilities for a given set of environmental conditions. This is done by the function `joint_trait_prob`. To define a given region in the community-trait space, the user has to define the focal traits `indexTrait` (any number of traits is accepted) and trait-specific thresholds through the parameter `bounds`. `bounds` is a list of the length of `indexTrait`, where each element of the list is a vector of length two. The vector represents the inferior and superior bounds of the region for the specified trait. For example, if we consider two traits (e.g. SLA and LNC), `bounds=list(c(20,Inf),c(20,Inf))` corresponds to the region in the community-trait space where both SLA and LNC both take values greater than 20. We can then define the sites (i.e. the set of environmental conditions) in which to compute joint probabilities. For example we can compute joint probabilities of both SLA and LNC to be greater than 20 in a high altitude site. This measures the relative suitability of communities where both SLA and LNC are higher than 20 in a high altitude site.

```{r}
joint_trait_prob(m, indexTrait = c("SLA","LNC"), Xnew = X["VCHA_2940",],
                 bounds = list(c(20,Inf),c(20,Inf)),
                 FullPost = TRUE)$PROBmean

```

Unsurprisingly, the probability is low. Then, we compute how this probability vary along the GDD gradient using the function `joint_trait_prob_gradient`. The function builts a dataframe where the focal variable varies along a gradient and the other (non-focal) variables are fixed to their mean (but see FixX parameter for fixing non-focal variables to user-defined values) and predict the joint probabilities along the gradient.
``` {r}
joint = joint_trait_prob_gradient(m, indexTrait = c("SLA","LNC"), 
                                  indexGradient = "GDD",
                                  bounds = list(c(mean(Y[,"SLA"]),Inf),c(mean(Y[,"SLA"]),Inf)),
                                  FullPost = TRUE)
```

We can then plot such predictions.


```{r,echo=FALSE}
table = data.frame(x=joint$gradient, mean= joint$GradProbsmean,
                   q02 = joint$GradProbsq025,
                   q97 = joint$GradProbsq975)
ggplot(data=table) + geom_ribbon(mapping=aes(x=x, ymin=q02, ymax=q97),position = position_dodge(0.3), size=1,alpha=0.2) + geom_line(mapping=aes(x=x, y=mean), size=1, position=position_dodge(width=0.3),col="#F8766D") + xlab("GDD")  + theme_classic() +
  ggtitle("Joint probability of SLA and LNC to be both greater than 10 as a function of GDD")
```


As climatic conditions become more favorable (i.e. GDD increases), the probability of having high values of both traits increases.

# Author
This package is currently developed by Giovanni Poggiato from Laboratoire d’Ecologie Alpine. It is supported by the ANR GAMBAS. The framework implemented in this package is described in: Joint modeling and predictions of community traits. Poggiato Giovanni, Gaüzere Pierre, Martinez Almoyna, Camille, Deschamps, Gabrielle, Renaud, Julien, Violle, Cyrille, Münkemüller, Tamara, Thuiller, Wilfried. In preparation.