The mosquito occurrence data from Golding et al. 2015 (published in Parasites & Vectors), available from figshare, is useful for exploring how datasets need to be prepared for running Conditional Random Fields (CRF) models. Here, we will download the raw data from figshare (note that an internet connection is needed for this step), change ‘dipping_round’ to a factor variable and remove unneeded columns.
temp <- tempfile()
download.file('https://ndownloader.figshare.com/files/2075362',
              temp)
dataset <- read.csv(temp, as.is = T)
unlink(temp)
We can now change the categorical dipping_round and field_site variables to factors and remove some unneeded variables.
dataset$dipping_round <- as.factor(dataset$dipping_round)
dataset$field_site <- as.factor(dataset$field_site)
dataset[,c(1,2,5,6)] <- NULL
It is important here to examine the level names of factor variables, as the first level (i.e. the reference level) will be dropped from the dataset during conversion to model matrix format (as in standard lme4 analysis of factor covariates).
levels(dataset$dipping_round)[1]
levels(dataset$field_site)[1]
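If the first level is not the one you would like to treat as the baseline, base R's relevel() can move another level to the front before the model matrix is built. The following is a minimal sketch on a toy factor; the object names and values are hypothetical and are not taken from the mosquito dataset.

```r
# Toy factor; values are hypothetical, not taken from the mosquito data
toy_round <- factor(c("2", "3", "5", "6", "3", "2"))

# Levels are sorted by default, so "2" is the first (reference) level
levels(toy_round)[1]

# Move level "5" to the front so that it becomes the reference level
toy_round_releveled <- relevel(toy_round, ref = "5")
levels(toy_round_releveled)
```

Whether releveling is worthwhile depends on which level makes the most sensible baseline for interpreting the remaining indicator columns.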
The next step is to convert any factor variables into model matrix format. As mentioned above, this step will drop the first level of a factor and then create a column for each remaining level (e.g. dipping_round levels "3", "5" and "6" will all be assigned their own unique columns, while dipping_round level "2" will be dropped and treated as the reference level). It is also convenient to change the names of the new covariate columns so they are easier to view and interpret (done here using dplyr::rename_all).
library(dplyr)
analysis.data = dataset %>%
  cbind(.,data.frame(model.matrix(~.[,'field_site'],
                                  .)[,-1])) %>%
  cbind(.,data.frame(model.matrix(~.[,'dipping_round'],
                                  .)[,-1])) %>%
  dplyr::select(-field_site,-dipping_round) %>%
  # a formula lambda replaces the deprecated funs() helper
  dplyr::rename_all(~ gsub("\\.|model.matrix", "", .))
Finally, we need to convert species abundances to binary presence-absence format (as we are only estimating co-occurrences, not co-abundances). It is also highly advisable to scale any continuous variables so they all have mean = 0 and sd = 1.
analysis.data[, 1:16] <- ifelse(analysis.data[, 1:16] > 0, 1, 0)
analysis.data[, 17:20] <- scale(analysis.data[, 17:20], center = T, scale = T)
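As a quick sanity check, the same two transformations can be tried on a toy data frame and the results verified. This is only a sketch: the column names and values below are hypothetical, and the checks assume that species columns come first and the continuous covariate last.

```r
# Toy data; column names and values are hypothetical
toy <- data.frame(
  sp_a = c(0, 4, 1, 0, 7),
  sp_b = c(2, 0, 0, 5, 1),
  temperature = c(21.5, 23.0, 19.8, 25.1, 22.4)
)

# Any count above zero becomes a presence (1), otherwise an absence (0)
toy[, 1:2] <- ifelse(toy[, 1:2] > 0, 1, 0)

# Centre and scale the continuous covariate
toy[, 3] <- as.numeric(scale(toy[, 3], center = TRUE, scale = TRUE))

# Species columns should now contain only 0 and 1
all(as.matrix(toy[, 1:2]) %in% c(0, 1))

# The scaled covariate should have mean 0 (up to rounding) and sd 1
round(mean(toy$temperature), 10)
sd(toy$temperature)
```

Running equivalent checks on analysis.data after the steps above may help catch column index mistakes, for example if the species and covariate columns are not in the assumed positions.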