This vignette provides an introduction to the s2dv_cube, which is the data format used within clim4health. It discusses different types of climate data, why an s2dv_cube object is useful to explore these, and how to construct your own s2dv_cube.
First, let's load the package.
library(clim4health)
When using subseasonal to decadal predictions, we use hindcasts to assess the skill of the predictions.
💡 2-dimensional time? Consider a seasonal forecast issued today (for example, in early July 2026) for the next three months (July, August, and September 2026). To assess the skill of this forecast, we need to check how well the forecasts issued in July of previous years (e.g. 1994 to 2016, a standard reference period) predicted the observed conditions in the following three months (July, August, and September of each year). To do this, we load the hindcasts issued in every July of the reference period with a two-dimensional time structure: the first time dimension corresponds to the forecast initialisation date (or start date, here July of each year), and the second corresponds to the forecast lead time (July, August, and September of the given year). A hindcast might look like:
| Forecast Start Date | Lead Time 1 | Lead Time 2 | Lead Time 3 |
|---|---|---|---|
| July 1994 | July 1994 | August 1994 | September 1994 |
| July 1995 | July 1995 | August 1995 | September 1995 |
| … | … | … | … |
| July 2016 | July 2016 | August 2016 | September 2016 |
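As a rough illustration of this layout, the sketch below builds the table above as a matrix in base R. It is not clim4health code: the object name `valid_months` and the use of plain character labels instead of real dates are assumptions made for the example.

```r
# Illustrative sketch in base R (not part of clim4health).
start_years <- 1994:2016                         # one forecast per July
lead_months <- c("July", "August", "September")  # three lead times

# Rows are start dates (first time dimension),
# columns are lead times (second time dimension).
valid_months <- outer(start_years, lead_months,
                      function(y, m) paste(m, y))
dimnames(valid_months) <- list(
  sdate = paste("July", start_years),
  time  = paste("Lead Time", seq_along(lead_months))
)

# All forecasts at lead time 2, one per start date
head(valid_months[, "Lead Time 2"], 3)

# The full forecast issued in July 1995
valid_months["July 1995", ]
```

Indexing by row picks one forecast, and indexing by column picks one lead time across all forecasts, which is the comparison needed for skill assessment.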
To compare the hindcast with the observed conditions, we also need to load the observed data for the same time period. Often, we use reanalysis data, or we can also use direct observations from weather stations if available, to explore more local climate conditions.
💡 Tip: We can also load our observational data with a two-dimensional time structure to align it with the hindcast data.
In this section, we introduce the datasets that are used within the clim4health package. In clim4health, we load and manipulate data using the s2dv_cube object, a type of data structure able to handle complex multi-dimensional climate data. Briefly, the s2dv_cube is a named list of the data, its dimensions, coordinates, and any additional attributes. How to explore the cube structure and its attributes is described below.
The loading function in clim4health is c4h_load(), and takes the following input parameters:
path: A string to the folder containing the files to be loaded, or a vector of specific file paths.
variable: The variable that you want to load.
year: A vector of years, e.g. 1994:2016. Default is all years found in the specified path.
month: A vector of months, e.g. c(1, 2) for January and February. Default is all months (1:12).
day: A vector of days in the month, e.g. c(1, 2, 3) for the first three days of each month. Default is all days.
time: A vector of hours (0:23). Only needed when datasets are hourly. Default is all hours.
leadtime_month: A vector of months for the forecast lead time, e.g. c(1, 2) for February and March for data initialised in February. Default is "all". If "all" for hindcast and forecast data, all lead time months will be loaded. If "all" for reanalysis or station data, data will be loaded as a time series in the time dimension, and the sdate dimension will be set to 1.
ext: File extension. Options include "nc", ".nc", "csv", and ".csv".
bbox: A vector of coordinates to subset the data spatially. In order: c(lat_max, lon_min, lat_min, lon_max).
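The ordering of bbox (north, west, south, east) is easy to get wrong, so here is a small sketch of what such a box selects. The helper `in_bbox()` is hypothetical and not part of clim4health. It assumes inclusive boundaries and a box that does not cross the antimeridian, neither of which is confirmed for c4h_load().

```r
# Hypothetical helper (not a clim4health function): TRUE for points that
# fall inside a box given as c(lat_max, lon_min, lat_min, lon_max).
# Assumes inclusive edges and no antimeridian crossing.
in_bbox <- function(lat, lon, bbox) {
  lat_max <- bbox[1]
  lon_min <- bbox[2]
  lat_min <- bbox[3]
  lon_max <- bbox[4]
  lat >= lat_min & lat <= lat_max & lon >= lon_min & lon <= lon_max
}

# Two made-up points tested against a made-up box
bbox <- c(5, -78, 3, -75)
inside <- in_bbox(lat = c(3.5, 10), lon = c(-76.5, -76.5), bbox = bbox)
inside
```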
We use the parameters year, month, day, time, and leadtime_month to specify the time period and time dimensions of the data we want to load. To load data in the sdate or time dimension:
sdate: year, month, day, and time correspond to all the start dates that you want to load. To load data for January and February for the years 2010 to 2016, you would select year = 2010:2016 and month = 1:2.
time: leadtime_month corresponds to the forecast lead time and tells clim4health how many months to load in the time dimension. To load hindcast data issued only in January for the forecast months of January, February and March, for the years 1994 to 2016, we would select year = 1994:2016, month = 1, and leadtime_month = 1:3. This specification gives the same 2D time structure as the table above (Section 1), with January rather than July start dates.
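To make the mapping from year, month, and leadtime_month to calendar months concrete, the sketch below computes the months covered by each start date and lead time. The function `expected_valid_months()` is hypothetical and written in base R. It mirrors the description above, where lead time 1 is the start month itself, and is not the internal logic of c4h_load().

```r
# Hypothetical helper (not part of clim4health): returns a character matrix
# of "YYYY-MM" labels with one row per start date and one column per lead time.
expected_valid_months <- function(year, month, leadtime_month) {
  starts <- expand.grid(month = month, year = year)  # month varies fastest
  # zero-based month offsets: start month plus lead time (lead 1 = start month)
  offset <- outer(starts$month - 1, leadtime_month - 1, "+")
  valid_year  <- starts$year + offset %/% 12  # roll over into the next year
  valid_month <- offset %% 12 + 1
  out <- matrix(sprintf("%d-%02d", valid_year, valid_month),
                nrow = nrow(starts), ncol = length(leadtime_month))
  dimnames(out) <- list(
    sdate = sprintf("%d-%02d", starts$year, starts$month),
    time  = paste0("lead_", leadtime_month)
  )
  out
}

# The hindcast example from the text: January starts, three lead months
jan_hindcast <- expected_valid_months(1994:2016, month = 1, leadtime_month = 1:3)
head(jan_hindcast, 2)

# A November start runs into the following year
nov_start <- expected_valid_months(2010, month = 11, leadtime_month = 1:3)
nov_start
```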
Consider our seasonal forecast example. We will load the forecast data for January, February, and March 2025, and the corresponding hindcast data for January, February, and March of previous years (2010 to 2012). We will also load the reanalysis data for the same time period. In a full analysis, the reference period should be much longer (typically 1994 to 2016). All of these datasets will be stored as s2dv_cube objects.
sdate: the forecast initialisation date (or start date - so January 2025).
time: the forecast lead time (so January, February and March of 2025).
ensemble: the ensemble members of the forecast model.
latitude and longitude: the spatial dimensions of the data.
fcst_path <- system.file("extdata/forecast/", package = "clim4health")
fcst_path <- paste0(fcst_path, "/")
# Load the forecast data
forecast <- c4h_load(fcst_path,
                     variable = "t2m",
                     year = 2025,
                     month = 1,
                     leadtime_month = "all",
                     ext = "nc")
forecast <- c4h_convert_units(forecast, to = "celsius")
sdate: the forecast initialisation date (or start date - so January of each year from 2010 to 2012).
time: the forecast lead time (so January, February and March of each year).
ensemble: the ensemble members of the hindcast model.
latitude and longitude: the spatial dimensions of the data.
hindcast_path <- system.file("extdata/hindcast/", package = "clim4health")
hindcast_path <- paste0(hindcast_path, "/")
# Load the hindcast data
hindcast <- c4h_load(hindcast_path,
                     variable = "t2m",
                     year = 2010:2012,
                     month = 1,
                     leadtime_month = 1:3,
                     ext = "nc")
hindcast <- c4h_convert_units(hindcast, to = "celsius")
time: the time dimension (so January, February and March of each year from 2010 to 2012).
latitude and longitude: the spatial dimensions of the data.
Note: We have to load the reanalysis data with the same dimensions as the hindcast data (i.e. sdate and time) so that the two datasets can be compared.
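To show what this reshaping amounts to, the sketch below rearranges a plain monthly time series into sdate and time dimensions using base R. The values are stand-ins (simply 1 to 36) and the object names are assumptions. c4h_load() handles this step itself when leadtime_month is given.

```r
# Illustrative sketch in base R: monthly series, January 2010 to December 2012
months <- seq(as.Date("2010-01-01"), as.Date("2012-12-01"), by = "month")
series <- seq_along(months)  # stand-in values: 1 = Jan 2010, ..., 36 = Dec 2012

start_years <- 2010:2012  # one start date (January) per year
lead_months <- 1:3        # January, February, March

# Calendar month wanted for every (start date, lead time) pair
wanted <- outer(start_years, lead_months,
                function(y, m) sprintf("%d-%02d-01", y, m))

# Locate those months in the series and reshape to sdate x time
idx <- match(as.vector(wanted), format(months))
reshaped <- array(series[idx],
                  dim = c(sdate = length(start_years),
                          time = length(lead_months)))
reshaped
```

Each row now holds the observations matching one hindcast start date, so the array can be compared element by element with hindcast data of the same shape.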
reanalysis_path <- system.file("extdata/reanalysis/", package = "clim4health")
reanalysis_path <- paste0(reanalysis_path, "/")
# Load the reanalysis data
reanalysis <- c4h_load(reanalysis_path,
                       variable = "t2m",
                       year = 2010:2012,
                       month = 1,
                       leadtime_month = 1:3,
                       ext = "nc")
reanalysis <- c4h_convert_units(reanalysis, to = "celsius")
c4h_load() will return an s2dv_cube object. We can explore the structure of this object to understand how the data is stored and how to manipulate it.
An s2dv_cube is a named list of the data, its dimensions, coordinates, and any additional attributes. Given our s2dv_cube object called forecast containing forecast data, the list elements contained within forecast are:
forecast$data: a multi-dimensional array containing one or more variables, such as temperature and pressure.
forecast$dims: a named vector of the dimensions of forecast$data containing the names of the dimensions and their lengths.
forecast$coords: a named list of the coordinates of the array. For example, if a variable is stored on dimensions of latitude, longitude, and time, then forecast$coords will contain the values of each of these coordinates. In some cases, only indices will be stored in the coordinates, such as in the case of the time dimension, where the actual dates are stored in the attributes (see below).
forecast$attrs: a named list of all available metadata. It includes metadata for each variable, and common shared attributes between variables.
message("print forecast class")
class(forecast)
message("print dimensions of the stored data")
dim(forecast$data)
message("print the names of the list elements in forecast")
names(forecast)
message("print a summary of the data stored in forecast")
summary(forecast$data)
message("print extended information about the list elements in forecast")
str(forecast)
clim4health enforces the following dimensions for its s2dv_cube objects:
dataset - the datasets that are loaded.
var - the variables that are loaded, e.g. temperature or precipitation.
sdate - the forecast initialisation or start dates.
time - the forecast lead time.
ensemble - the model ensemble members.
one or two spatial dimensions, either:
  latitude and longitude for gridded data
  area for polygon data, e.g. administrative areas such as states or provinces
  location for point data, e.g. weather station data
💡 Tip: the dimension dataset always has length 1 in clim4health. It exists only because the package dependencies used by the downscaling and skill functions require it.
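A quick way to see whether an array follows this convention is to inspect the names of its dimensions. The helper below, `has_c4h_dims()`, is a hypothetical sketch in base R and not a clim4health function. It only checks dimension names and does not reproduce whatever validation clim4health performs internally.

```r
# Hypothetical check (not part of clim4health): does an array carry the
# dimension names listed above, plus one valid set of spatial dimensions?
has_c4h_dims <- function(x) {
  required <- c("dataset", "var", "sdate", "time", "ensemble")
  spatial  <- list(c("latitude", "longitude"), "area", "location")
  nms <- names(dim(x))
  has_required <- all(required %in% nms)
  has_spatial  <- any(vapply(spatial, function(s) all(s %in% nms), logical(1)))
  has_required && has_spatial
}

# A toy gridded array with named dimensions
gridded <- array(0, dim = c(dataset = 1, var = 1, sdate = 2, time = 3,
                            ensemble = 4, latitude = 5, longitude = 6))
has_c4h_dims(gridded)

# A plain matrix has no named dimensions, so the check fails
has_c4h_dims(matrix(0, nrow = 2, ncol = 2))
```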
To better understand how dates are handled in clim4health, take a look at the attributes of the s2dv_cube. For example, all s2dv_cubes loaded via clim4health will contain a Dates attribute.
print(forecast$attrs$Dates)
print(dim(forecast$attrs$Dates))
print(forecast$attrs$Dates[, 1]) # to obtain the forecast start dates
print(forecast$coords$sdate) # start dates are also stored as a coordinate
We can plot the s2dv_cube objects using the function c4h_plot(). The plotting function searches the available dimensions in the data cube and plots a map for each combination. This is impractical for the forecast object, which has three time points and 51 ensemble members, for a total of 153 maps! There are several ways to handle this. The simplest is to select only the time points and ensemble members we are interested in:
c4h_plot(forecast, time = 1:3, ensemble = 1:3)
We have discussed the dimensions of clim4health s2dv_cube objects, but an s2dv_cube also stores many useful attributes. Let's take a general look at what information is stored in an s2dv_cube.
print(forecast)
forecast$
│
├── data                      # Array with named dimensions containing the data
│
├── dims                      # Named vector of the dimensions of the data array
│
├── coords$                   # Named list of the coordinates of the data array
│   ├── dataset               # Index of the dataset(s) loaded
│   ├── var                   # Name(s) of the variable(s) loaded
│   ├── sdate                 # Index of the forecast initialisation or start
│   │                           dates
│   ├── time                  # Index of the forecast lead time
│   ├── ensemble              # Index of the model ensemble members
│   ├── latitude              # Values of the latitude coordinates
│   └── longitude             # Values of the longitude coordinates
│
└── attrs$                    # Named list of all attributes and metadata
    ├── Dates                 # 2-dimensional array of dates (sdate and time)
    ├── Variable              # Named list of variable information
    │   ├── varName           # Character vector of variable name(s), equal
    │   │                       to coords$var
    │   └── metadata          # Named list of metadata for all variables
    │       ├── variable      # Character vector of variable name(s)
    │       ├── units         # Named vector of variable unit(s)
    │       └── source_files  # Character vector of the source file(s)
    │                           for each variable
    └── source_files          # Character vector of all source files used to
                                load the data
For example, to access the units of a given variable, you can type forecast$attrs$Variable$metadata$units. Equivalently, the function c4h_convert_units() looks in this location to find the units of the variable(s), so you can access the same information by typing c4h_convert_units(forecast).
Suppose you have climate data in csv format, and you want to convert it into an s2dv_cube to be able to use it with the functions in clim4health. Given the diverse nature of climate data, there is no one-size-fits-all approach to converting data into an s2dv_cube, but the following steps can be used as a general guide.
Let's load the example csv file stored in clim4health, this time with read.csv() rather than c4h_load().
csv_path <- system.file("extdata/stations/", package = "clim4health")
data_in <- read.csv(paste0(csv_path, "/temp_vallecauca.csv"))
Take a look at the structure of the data to understand how it is stored and what information is available.
print(head(data_in))
We see that this data contains columns called station_code, lat, lon, date, temp_mean, temp_min, and temp_max.
💡 Tip: At this point, you could also filter your data by requirements such as relevant dates, latitudes, longitudes, and variables.
First, let's extract the coordinates at which our data will be stored.
# these will be used to match values in the data_in object
locations_long <- data_in$station_code
lats_long <- data_in$lat
lons_long <- data_in$lon
# these will be our unique coordinates for the s2dv_cube
locations <- unique(locations_long)
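# keep one latitude/longitude per station, in the same order as locations;
# calling unique() here would misalign stations that share a coordinate value
lats <- lats_long[match(locations, locations_long)]
lons <- lons_long[match(locations, locations_long)]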
# extract dates in a time series format (sdate = 1, time = length of dates)
dates_long <- data_in$date
dates <- unique(dates_long)
# we can choose specific date ranges if desired
dates <- dates[which(lubridate::month(dates) %in% 1:2)]
# set up final coordinates
ensemble <- 1
var <- c("temp_mean", "temp_min", "temp_max")
Now, we need to match the values in the data frame to the coordinates we have just extracted, and reshape the data into a multi-dimensional array with the correct dimensions.
# create an empty array with the correct dimensions
data_array <- array(NA, dim = c(dataset = 1,
                                var = length(var),
                                sdate = 1,
                                time = length(dates),
                                ensemble = length(ensemble),
                                location = length(locations)))
# fill the array with the values from the data frame, matching the coordinates
for (i in seq_along(var)) {
  var_name <- var[i]
  for (j in seq_along(locations)) {
    location_name <- locations[j]
    for (k in seq_along(dates)) {
      date_value <- dates[k]
      value <- data_in[locations_long == location_name &
                         dates_long == date_value, var_name]
      # only fill the array if there is a value in the data frame
      if (length(value) != 0) {
        data_array[1, i, 1, k, 1, j] <- value
      }
    }
  }
}
💡 Tip: Here we loop over the dimensions to show explicitly how the new array is filled. With many locations or dates, a vectorised fill is more efficient. Similarly, we set sdate to have length 1 here, but you could also loop over multiple start dates.
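As one possible alternative to the loops, the sketch below fills an array in a single step using match() and matrix indexing. It uses a small made-up data frame and only two dimensions (time and location) to keep the idea visible. The object names and values are assumptions, not the contents of the example csv file.

```r
# Made-up long-format data: station B has no record for 2020-01-02
toy <- data.frame(
  station_code = c("A", "A", "B"),
  date = c("2020-01-01", "2020-01-02", "2020-01-01"),
  temp_mean = c(20.5, 21.0, 18.2)
)

toy_locations <- unique(toy$station_code)
toy_dates <- unique(toy$date)

# Empty target array; missing combinations stay NA
filled <- array(NA_real_, dim = c(time = length(toy_dates),
                                  location = length(toy_locations)))

# One row of idx per record: (position along time, position along location)
idx <- cbind(match(toy$date, toy_dates),
             match(toy$station_code, toy_locations))

# Matrix indexing assigns every record to its cell at once
filled[idx] <- toy$temp_mean
filled
```

The same approach extends to the six-dimensional array by adding one column to idx per dimension, with constant 1s for dimensions of length 1.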
Now that we have reshaped the data, we need to add the dimensions, coordinates, and attributes to create the full s2dv_cube object.
new_dims <- dim(data_array)
new_coords <- list(dataset = 1,                      # index
                   var = var,                        # variable names
                   sdate = 1,                        # index
                   time = seq_along(dates),          # index
                   ensemble = ensemble,              # index
                   location = seq_along(locations))  # index
new_attrs <- list(
  Dates = array(dates, dim = c(sdate = 1, time = length(dates))),
  Variable = list(varName = var,
                  metadata = list(variable = var,
                                  units = c(temp_mean = "degrees C",
                                            temp_min = "degrees C",
                                            temp_max = "degrees C"),
                                  source_files = "inst/extdata/stations/temp_vallecauca.csv")),
  source_files = "inst/extdata/stations/temp_vallecauca.csv")
# in the case of station data, we also need to add the latitude and
# longitude coordinates to the attributes
new_attrs$location <- list(longitude = lons,
                           latitude = lats)
# we also need to ensure the Dates attribute is a POSIXct object
new_attrs$Dates <- as.POSIXct(new_attrs$Dates)
dim(new_attrs$Dates) <- c(sdate = 1, time = length(dates))
💡 Warning: Converting to POSIXct objects can flatten your dates array, so make sure you double check both the values of the dates array and its dimension!
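The sketch below walks through that check on a tiny made-up dates array in base R. The object names and the use of the UTC time zone are assumptions for the example.

```r
# Made-up dates stored as a character array with sdate and time dimensions
date_strings <- c("2020-01-01", "2020-01-02", "2020-01-03")
dates_2d <- array(date_strings, dim = c(sdate = 1, time = 3))

# Convert to POSIXct; the dimensions may be dropped by the conversion
dates_ct <- as.POSIXct(dates_2d, tz = "UTC")
dim(dates_ct)  # inspect what survived

# Restore the dimensions explicitly, then verify values and shape
dim(dates_ct) <- c(sdate = 1, time = length(date_strings))
dim(dates_ct)
format(dates_ct[1, 2], "%Y-%m-%d", tz = "UTC")
```

Setting the dimensions again is harmless if they were kept, so it is reasonable to do it unconditionally after the conversion.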
Finally, we can create the s2dv_cube object. There are two ways to do this. The first is simply to manually create the list and set its class, and the second is to use the function s2dv_cube() from the package CSTools.
# method 1: manually create the list and set its class
s2dv_cube_manual <- list(data = data_array,
                         dims = new_dims,
                         coords = new_coords,
                         attrs = new_attrs)
class(s2dv_cube_manual) <- "s2dv_cube"
# method 2: use the function s2dv_cube() from CSTools
s2dv_cube_function <- CSTools::s2dv_cube(data_array,
                                         coords = new_coords,
                                         varName = var,
                                         metadata = new_attrs$Variable$metadata,
                                         Dates = new_attrs$Dates,
                                         source_files = new_attrs$source_files)
We have successfully created our own s2dv_cube object from a csv file, and we can now use this object with the functions in clim4health.
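Before passing a hand-built cube to other functions, it can help to confirm that its coordinates agree with its dimensions. The sketch below does this for a small toy cube built in the same way as method 1. The check itself is a suggestion written in base R, not a clim4health or CSTools function, and the toy values are made up.

```r
# Toy cube built manually, following the same pattern as method 1
toy_data <- array(seq_len(6), dim = c(dataset = 1, var = 1, sdate = 1,
                                      time = 3, ensemble = 1, location = 2))
toy_cube <- list(
  data = toy_data,
  dims = dim(toy_data),
  coords = list(dataset = 1, var = "temp_mean", sdate = 1,
                time = 1:3, ensemble = 1, location = 1:2),
  attrs = list()
)
class(toy_cube) <- "s2dv_cube"

# Coordinates should be listed in the same order as the dimensions,
# and each coordinate should have as many entries as its dimension is long
same_names <- identical(names(toy_cube$coords), names(toy_cube$dims))
same_lengths <- all(lengths(toy_cube$coords) == toy_cube$dims)
coords_ok <- same_names && same_lengths
coords_ok
```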