---
title: "Introduction to s2dv_cube objects"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to s2dv_cube objects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center",
  warning = FALSE,
  message = FALSE,
  eval = FALSE
)
```

## Overview

This vignette provides an introduction to the `s2dv_cube`, which is the data format used within **clim4health**. It discusses different types of climate data, why an `s2dv_cube` object is useful to explore these, and how to construct your own `s2dv_cube`.

## 0. Load the package

First, let's load the package.

```{r load_package}
library(clim4health)
```

## 1. Forecasts, hindcasts, and reanalyses

When using subseasonal to decadal predictions, we use *hindcasts* to assess the skill of the predictions.

1.  **Forecasts:** a prediction of weather or climate conditions in the future, initialised now, using current or recent atmospheric/oceanic initial conditions.
2.  **Hindcasts:** a forecast which uses initial conditions from the past. This generates a prediction of past climate. We use these to compare how *skillful* the model is at predicting the actual observed value, by comparing the predictions to the observations.

> 💡 **2-dimensional time?** Consider a seasonal forecast that we issue *today* (for example, in early July 2026), for the next three months (July, August, and September 2026). To assess the skill of this forecast, we need to compare how every forecast issued in July of previous years (e.g. 1994 to 2016 - a standard reference period) performed at predicting the observed conditions in the following three months (July, August, and September of each year). To do this, we need to load the hindcasts issued in every July of the reference period. To this end, we load the data with a **two-dimensional time structure**, where the first time dimension corresponds to the forecast initialisation date (or start date - July of each year), and the second time dimension corresponds to the forecast lead time (July, August, and September of the given year). A hindcast might look like:

```{r input_functions_table, echo = FALSE, eval = TRUE}
knitr::kable(
  data.frame(
    `Forecast Start Date` = c("July 1994", "July 1995", "...", "July 2016"),
    `Lead Time 1` = c("July 1994", "July 1995", "...", "July 2016"),
    `Lead Time 2` = c("August 1994", "August 1995", "...", "August 2016"),
    `Lead Time 3` = c("September 1994", "September 1995", "...", "September 2016"),
    check.names = FALSE
    ),
  format = "markdown",
  caption = "Table 1: Hindcast structure, for forecasts issued in July. Note that Lead Time 1 corresponds to the same month as the forecast initialisation date (the forecast is issued at the beginning of the month, and provides information for that same month)."
)
```

To compare the hindcast with the observed conditions, we also need to load the observed data for the same time period. Often, we use *reanalysis* data, or we can also use direct observations from weather stations if available, to explore more local climate conditions.

3.  **Reanalysis:** a dataset that combines direct observations with a climate model to produce a complete and physically-consistent dataset of past climate conditions. Reanalysis data is often used as a proxy for observed conditions when direct observations are not available, or to provide a more complete picture of past climate conditions.

> 💡 **Tip**: We can also load our observational data with a **two-dimensional time structure** to align it with the hindcast data.

## 2. Climate data in clim4health

In this section, we introduce the datasets that are used within the **clim4health** package. In **clim4health**, we load and manipulate data using the `s2dv_cube` object, a type of data structure able to handle complex multi-dimensional climate data. Briefly, the `s2dv_cube` is a named list of the data, its dimensions, coordinates, and any additional attributes. How to explore the cube structure and its attributes is described below.

### Loading multi-dimensional data with **c4h_load**

The loading function in **clim4health** is `c4h_load()`, and takes the following input parameters:

-   `path` A string to the folder containing the files to be loaded, or a vector of specific file paths.

-   `variable` The variable that you want to load.

-   `year` A vector of years, e.g. `1994:2016`. Default is all years found in the specified path.

-   `month` A vector of months, e.g. `c(1,2)` for January and February. Default is all months (`1:12`).

-   `day` A vector of days in the month, e.g. `c(1, 2, 3)` for the first three days of each month. Default is all days.

-   `time` A vector of hours (`0:23`). Only needed when datasets are hourly. Default is all hours.

-   `leadtime_month` A vector of months for the forecast leadtime. e.g. `c(1, 2)` for February and March for data initialised in February. Default is `"all"`. If `"all"` for hindcast and forecast data, all leadtime months will be loaded. If `"all"` for reanalysis or station data, data will be loaded as a time series in the `time` dimension, and the `sdate` dimension will be set to 1.

-   `ext` File extension. Options include `"nc"`, `".nc"`, `"csv"`, and `".csv"`.

-   `bbox` A vector of coordinates to subset the data spatially. In order: `c(lat_max, lon_min, lat_min, lon_max)`.

We use the parameters `year`, `month`, `day`, `time`, and `leadtime_month` to specify the time period and time dimensions of the data we want to load. To load data in the **sdate** or **time** dimension:

-   **sdate**: `year`, `month`, `day`, and `time` correspond to all the start dates that you want to load. To load data for January and February for the years 2010 to 2016, you would select `year = 2010:2016` and `month = 1:2`.

-   **time**: `leadtime_month` corresponds to the forecast lead time and tells clim4health how many months to load in the **time** dimension. To load our hindcast data issued only in January for the forecast months of January, February and March, for the years 1994 to 2016, we would select `year = 1994:2016`, `month = 1`, and `leadtime_month = 1:3`. Note that this exact specification would result in a 2D time structure as given in the table above (Section 1).

### Loading example data

Consider our seasonal forecast example. We will load the forecast data for January, February, and March 2025, and the corresponding hindcast data for January, February, and March of previous years (2010 to 2012). We will also load the reanalysis data for the same time period. In a full analysis, the reference period should be much longer (typically 1994 to 2016). All of these datasets will be stored as `s2dv_cube` objects.

#### The forecast data is stored with dimensions

-   **sdate**: the forecast initialisation date (or start date - so January 2025).

-   **time**: the forecast lead time (so January, February and March of 2025).

-   **ensemble**: the ensemble members of the forecast model.

-   **latitude** and **longitude**: the spatial dimensions of the data.

```{r Load forecast data}
fcst_path <- system.file("extdata/forecast/", package = "clim4health")
fcst_path <- paste0(fcst_path, "/")

# Load the forecast data
forecast <- c4h_load(fcst_path,
                     variable = "t2m",
                     year = 2025,
                     month = 1,
                     leadtime_month = "all",
                     ext = "nc")

forecast <- c4h_convert_units(forecast, to = "celsius")
```

#### The hindcast data is stored with dimensions

-   **sdate**: the forecast initialisation date (or start date - so January of each year from 1994 to 2016).

-   **time**: the forecast lead time (so January, February and March of each year).

-   **ensemble**: the ensemble members of the hindcast model.

-   **latitude** and **longitude**: the spatial dimensions of the data.

```{r Load hindcast data}
hindcast_path <- system.file("extdata/hindcast/", package = "clim4health")
hindcast_path <- paste0(hindcast_path, "/")

# Load the hindcast data
hindcast <- c4h_load(hindcast_path,
                     variable = "t2m",
                     year = 2010:2012,
                     month = 1,
                     leadtime_month = 1:3,
                     ext = "nc")

hindcast <- c4h_convert_units(hindcast, to = "celsius")
```

#### The reanalysis data is stored with dimensions

-   **time**: the time dimension (so January, February and March of each year from 1994 to 2016).

-   **latitude** and **longitude**: the spatial dimensions of the data.

👉 We have to reshape and load the reanalysis data to have the same dimensions as the hindcast data (i.e. with dimensions of `sdate` and `time`) to be able to compare the two datasets.

```{r Load reanalysis data}
reanalysis_path <- system.file("extdata/reanalysis/", package = "clim4health")
reanalysis_path <- paste0(reanalysis_path, "/")

# Load the reanalysis data
reanalysis <- c4h_load(reanalysis_path,
                       variable = "t2m",
                       year = 2010:2012,
                       month = 1,
                       leadtime_month = 1:3,
                       ext = "nc")

reanalysis <- c4h_convert_units(reanalysis, to = "celsius")
```

### Exploring the loaded data

`c4h_load()` will return an `s2dv_cube` object. We can explore the structure of this object to understand how the data is stored and how to manipulate it.

An `s2dv_cube` is a named list of the data, its dimensions, coordinates, and any additional attributes. Given our `s2dv_cube` object called `forecast` containing forecast data, the list elements contained within `forecast` are:

1.  `forecast$data`: a multi-dimensional array containing one or more variables, such as temperature and pressure.
2.  `forecast$dims`: a named vector of the dimensions of `forecast$data` containing the names of the dimensions and their lengths.
3.  `forecast$coords`: a named list of the coordinates of the array. For example, if a variable is stored on dimensions of latitude, longitude, and time, then `forecast$coords` will contain the values of each of these coordinates. In some cases, only indices will be stored in the coordinates, such as in the case of the *time* dimension, where the actual dates are stored in the attributes (see below).
4.  `forecast$attrs`: a named list of all available metadata. It includes metadata for each variable, and common shared attributes between variables.

```{r}
message("print forecast class")
class(forecast)
message("print dimensions of the stored data")
dim(forecast$data)
message("print the names of the list elements in forecast")
names(forecast)
message("print a summary of the data stored in forecast")
summary(forecast$data)
message("print extended information about the list elements in forecast")
str(forecast)
```

### Data dimensions in clim4health

**clim4health** enforces the following dimensions for its `s2dv_cube` objects:

-   **dataset** - the datasets that are loaded.

-   **var** - the variables that are loaded, e.g. temperature, precipitation, etc.

-   **sdate** - the forecast initialisation or start dates.

-   **time** - the forecast lead time.

-   **ensemble** - the model ensemble members.

-   one or two spatial dimensions, either:

    -   **latitude** and **longitude** for gridded data

    -   **area** for polygon data, e.g. administrative areas such as states or provinces

    -   **location** for point data, e.g. weather station data

> 💡 **Tip**: the dimension **dataset** will always have length 1 in clim4health cases. It only exists because it is required for package dependencies in the downscaling and skill functions.

To better understand how dates are handled in **clim4health**, take a look at the *attributes* of the `s2dv_cube`. For example, all `s2dv_cubes` loaded via **clim4health** will contain a `Dates` attribute.

```{r}
print(forecast$attrs$Dates)

print(dim(forecast$attrs$Dates))

print(forecast$attrs$Dates[, 1]) # to obtain the forecast start dates
print(forecast$coords$sdate) # start dates are also stored as a coordinate
```

We can plot the `s2dv_cube` objects using the function `c4h_plot`. The plotting function will search the available dimensions in the data cube and plot maps for each of them. This is problematic for the forecast object, which has three time points and 51 ensemble members, which would mean a total of 153 maps! We can do several things, first, simply select the time points and ensemble members we are interested in:

```{r}
c4h_plot(forecast, time = 1:3, ensemble = 1:3)
```

## Exploring the s2dv_cube structure

We have discussed the dimensions of clim4health `s2dv_cube` objects, but many different *attributes* are stored within the `s2dv_cube` which can be useful. Let's take a general look at what information is stored in an `s2dv_cube`.

```{r structure}
print(forecast)
```

```{=html}
<pre lang="markdown">
forecast$
│
├── data                    # Array with named dimensions containing the data       
│
├── dims                    # Named vector of the dimensions of the data array
│
├── coords$                 # Named list of the coordinates of the data array 
│   ├── dataset                # Index of the dataset(s) loaded
│   ├── var                    # Name(s) of the variable(s) loaded
│   ├── sdate                  # Index of the forecast initialisation or start
│   │                            dates
│   ├── time                   # Index of the forecast lead time
│   ├── ensemble               # Index of the model ensemble members
│   ├── latitude               # Values of the latitude coordinates
│   └── longitude              # Values of the longitude coordinates
│
└── attrs$                  # Named list of all attributes and metadata
    ├── Dates                  # 2-dimensional array of dates (sdate and time)
    ├── Variable               # Named list of variable information
    │   ├── varName               # Character vector of variable name(s), equal
    │   │                           to coords$var
    │   └── metadata              # Named list of metadata for all variables
    │       ├── variable             # Character vector of variable name(s)
    │       ├── units                # Named vector of variable unit(s)
    │       └── source_files         # Character vector of the source file(s)
    │                                  for each variable
    └── source_files           # Character vector of all source files used to
                                 load the data
</pre>
```

For example, to access the units of a given variable, you can type `forecast$attrs$Variable$metadata$units`. Equivalently, the function `c4h_convert_units()` looks in this location to find the units of the variable(s), so you can access the same information by typing `c4h_convert_units(forecast)`.

## 3. Constructing an s2dv_cube from scratch

Suppose you have climate data in csv format, and you want to convert it into an `s2dv_cube` to be able to use it with the functions in **clim4health**. Given the diverse nature of climate data, there is no one-size-fits-all approach to converting data into an `s2dv_cube`, but the following steps can be used as a general guide.

### 1. Load the data

Let's load the example csv file stored in clim4health using another method.

```{r load_csv}
csv_path <- system.file("extdata/stations/", package = "clim4health")
data_in <- read.csv(paste0(csv_path, "/temp_vallecauca.csv"))
```

Take a look at the structure of the data to understand how it is stored and what information is available.

```{r explore_csv}
print(head(data_in))
```

We see that this data contains columns called `station_code`, `lat`, `lon`, `date`, `temp_mean`, `temp_min`, and `temp_max`.

> 💡 **Tip**: At this point, you could also filter your data by requirements such as relevant dates, latitudes, longitudes, and variables.

### 2. Reshape the data

First, let's extract the coordinates at which our data will be stored.

```{r extract_coords}
# these will be used to match values in the data_in object
locations_long <- data_in$station_code
lats_long <- data_in$lat
lons_long <- data_in$lon

# these will be our unique coordinates for the s2dv_cube
locations <- unique(locations_long)
lats <- unique(lats_long[match(locations, locations_long)])
lons <- unique(lons_long[match(locations, locations_long)])

# extract dates in a time series format (sdate = 1, time = length of dates)
dates_long <- data_in$date
dates <- unique(dates_long)
# we can choose specific date ranges if desired
dates <- dates[which(lubridate::month(dates) %in% 1:2)]

# set up final coordinates
ensemble <- 1
var <- c("temp_mean", "temp_min", "temp_max")
```

Now, we need to match the values in the data frame to the coordinates we have just extracted, and reshape the data into a multi-dimensional array with the correct dimensions.

```{r reshape_data}
# create an empty array with the correct dimensions
data_array <- array(NA, dim = c(dataset = 1,
                                var = length(var),
                                sdate = 1,
                                time = length(dates),
                                ensemble = length(ensemble),
                                location = length(locations)))

# fill the array with the values from the data frame, matching the coordinates
for (i in seq_along(var)) {
  var_name <- var[i]
  for (j in seq_along(locations)) {
    location_name <- locations[j]
    for (k in seq_along(dates)) {
      date_value <- dates[k]
      value <- data_in[locations_long == location_name &
                       dates_long == date_value, var_name]
      # only fill the array if there is a value in the data frame
      if (length(value) != 0) {
        data_array[1, i, 1, k, 1, j] <- value
      }
    }
  }
}
```

> 💡 **Tip**: Here we loop over dimensions to explicitly show how the new array is filled, but it can be more efficiently filled in case there are more locations or dates. Similarly, we set sdate to have length 1 here, but you can also add a loop over this too.

### 3. Add dimensions, coordinates, and attributes

Now that we have reshaped the data, we need to add the dimensions, coordinates, and attributes to create the full `s2dv_cube` object.

```{r add_dims_coords_attrs}
new_dims <- dim(data_array)

new_coords <- list(dataset = 1,                     # index
                   var = var,                       # variable names
                   sdate = 1,                       # index
                   time = seq_along(dates),         # index
                   ensemble = ensemble,             # index
                   location = seq_along(locations)) # index

new_attrs <- list(
  Dates = array(dates, dim = c(sdate = 1, time = length(dates))),
  Variable = list(varName = var,
                  metadata = list(variable = var,
                                  units = c(temp_mean = "degrees C",
                                            temp_min = "degrees C",
                                            temp_max = "degrees C"),
                                  source_files = "inst/extdata/stations/temp_vallecauca.csv")),
  source_files = "inst/extdata/stations/temp_vallecauca.csv")

# in the case of station data, we also need to add the latitude and longitude coordinates to the attributes
new_attrs$location <- list(longitude = lons,
                           latitude = lats)

# we also need to ensure the Dates attribute is a POSIXct object
new_attrs$Dates <- as.POSIXct(new_attrs$Dates)
dim(new_attrs$Dates) <- c(sdate = 1, time = length(dates))
```

> 💡 **Warning**: Converting to POSIXct objects can flatten your dates array, so make sure you double check both the values of the dates array and its dimension!

### 4. Create the s2dv_cube object

Finally, we can create the `s2dv_cube` object. There are two ways to do this. The first is simply to manually create the list and set its class, and the second is to use the function `s2dv_cube()` from the package **CSTools**.

```{r create_s2dv_cube}
# method 1: manually create the list and set its class
s2dv_cube_manual <- list(data = data_array,
                         dims = new_dims,
                         coords = new_coords,
                         attrs = new_attrs)
class(s2dv_cube_manual) <- "s2dv_cube"

# method 2: use the function s2dv_cube() from CSTools
s2dv_cube_function <- CSTools::s2dv_cube(data_array,
                                         coords = new_coords,
                                         varName = var,
                                         metadata = new_attrs$Variable$metadata,
                                         Dates = new_attrs$Dates,
                                         source_files = new_attrs$source_files)
```

We have successfully created our own `s2dv_cube` object from a csv file, and we can now use this object with the functions in **clim4health**.