---
title: "clim4health overview"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{clim4health overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center",
  warning = FALSE,
  message = FALSE,
  eval = FALSE
)
```

# Overview

**clim4health** is an R package designed to load, transform, process, plot, and save climate data. It is specifically designed for use in downscaling and verification of seasonal forecast data.

The **clim4health** package is intended for use with other packages developed within the [HARMONIZE](https://www.harmonize-tools.org/) project, to spatially and temporally harmonize multiple sources of data for use in health impact studies.

This vignette provides an overview of the **clim4health** package methodology, the core functions, and a simple example workflow using real data. It deals with climate data in multidimensional arrays called *s2dv_cubes* (see the `clim4health_s2dv_cubes` vignette for more details). More complex use cases and examples are described in other vignettes (`clim4health_downscaling` regarding downscaling methodologies and `clim4health_verification` regarding the verification methodologies in **clim4health**). Terms are defined in the glossary in the vignette `clim4health_glossary`.

## Installation

You can install **clim4health** directly from CRAN or download the development version:

```R
# Install from CRAN
install.packages("clim4health")

# Get the development version from Gitlab
devtools::install_git('https://earth.bsc.es/gitlab/ghr/clim4health.git')
```

Once the package is installed, you can load it into your R session using `library(clim4health)`.

```{r load_package}
library(clim4health)
```

Vignettes loaded in the package can be accessed in R by typing `vignette("vignettename")` (e.g., `vignette("clim4health_overview")`).

## Data requirements

**clim4health** can load and transform climate data from the Copernicus Climate Data Store ([CDS](https://cds.climate.copernicus.eu/)). Specifically, it can currently load hourly, daily, or monthly *reanalysis* on single levels (ERA5 and ERA5-Land), as well as seasonal forecast monthly statistics on single levels. **clim4health** provides the option to download this data directly from the CDS.

Additionally, **clim4health** is able to load climate data stored in a csv format (for example, weather station data). The data should have a column for the date, as well as a column or columns containing variable values. However, given the variety of formats used to store climate data in csv files, the user may need to either reformat the data before using clim4health's load function, or follow the guide for formatting data into an s2dv_cube object (see the `clim4health_s2dv_cubes` vignette for more information and a worked example).

## clim4health structure

```{r clim4health_structure,fig.width=10, fig.height=10, fig.align='center', out.width='100%', echo= FALSE, eval = TRUE, fig.cap="Figure 1: Structure of the **clim4health** package, outlining its functions and general functionality."}

knitr::include_graphics("https://gitlab.earth.bsc.es/ghr/clim4health/-/blob/main/inst/figures/clim4health_function_order.svg?ref_type=heads")
```

This vignette outlines one full example workflow using the **clim4health** package. The process is organized into three key stages: obtaining input data, transforming and processing the data, and preparing outputs.

### 1. Obtain input data

To use **clim4health** functions, climate data must be loaded into an R session in a consistent format called an *s2dv_cube*. This object can be created by:

-   downloading data from the CDS using `c4h_get()`, and then loading it using `c4h_load()`.

-   loading data from a local file and transforming it into an s2dv_cube either manually or using `CSTools::s2dv_cube()`. The `clim4health_s2dv_cubes` vignette provides a worked example of how to do this for weather station data.

```{r input_functions_table, echo = FALSE}
knitr::kable(
  data.frame(
    Function = c("`c4h_get_help()`", "`c4h_get()`", "`c4h_load()`", "`c4h_convert_units()`"),
    Purpose = c(
      "List the available datasets and variables that can be downloaded from the CDS via clim4health.",
      "Download data from the CDS.",
      "Load climate data into an s2dv_cube object, either data downloaded using `c4h_get()` or data from a local csv file.",
      "Print or convert the units of a variable in an s2dv_cube object."
    )
    ),
  format = "markdown",
  caption = "Table 1: Overview of functions to prepare input data for clim4health."
)
```

To use the rest of the **clim4health** functions, the data must be in an s2dv_cube format with the following dimensions:

-   **dataset** - the datasets that are loaded.

-   **var** - the variables that are loaded, e.g. temperature, precipitation, etc.

-   **sdate** - the forecast initialisation or start dates.

-   **time** - the forecast lead time.

-   **ensemble** - the model ensemble members.

-   one or two spatial dimensions, either:

    -   **latitude** and **longitude** for gridded data

    -   **area** for polygon data, e.g. administrative areas such as states or provinces

    -   **location** for point data, e.g. weather station data

> 💡 **Tip**: the dimension **dataset** will always have size 1 in **clim4health** cases. It only exists because it is required for package dependencies in the downscaling and verification functions.

Once the data has been loaded as an s2dv_cube, the user can move on to the transformation and postprocessing steps, which are described in the next section.

### 2. Transform and process the data

There are several functions available to transform and process the data, which can be applied in any order depending on the required output. These include:

```{r transform_functions_table, echo = FALSE}
knitr::kable(
  data.frame(
    Function = c("`c4h_space()`", "`c4h_time()`", "`c4h_downscale()`", "`c4h_verify()`", "`c4h_index()`", "`c4h_collapse()`"),
    Purpose = c(
      "Spatially aggregate gridded data to administrative areas.",
      "Temporal aggregation to a desired time resolution.",
      "Downscale, calibrate and bias adjust climate data.",
      "Perform a forecast verification and quality assessment of forecast skill.",
      "Mask climate data by applying thresholds to the data.",
      "Collapse the data to a single value across one or more dimensions, e.g. calculating the ensemble mean or maximum value across time."
    )
    ),
  format = "markdown",
  caption = "Table 2: Overview of functions to transform and process data for clim4health."
)
```

For the more complex functions `c4h_downscale()` and `c4h_verify()`, vignettes are available (`clim4health_downscaling` and `clim4health_verification`) which provide more detailed information about the methodologies available and relevant choices of input arguments.

### 3. Prepare outputs

Once the data has been transformed and processed as required, the user can choose to plot the data, or convert it to a different data type, and save it.

```{r output_functions_table, echo = FALSE}
knitr::kable(
  data.frame(
    Function = c("`c4h_plot()`", "`c4h_plotskill()`", "`c4h_convert()`", "`c4h_save()`"),
    Purpose = c(
      "Plot aspects of the data.",
      "Plot forecast skill and its significance.",
      "Convert the data from s2dv_cube to a different data type, including areal, tabular, or raster.",
      "Save the transformed data."
    )
    ),
  format = "markdown",
  caption = "Table 3: Overview of functions to visualise and output data for clim4health."
)
```

# clim4health workflow

In this example, we demonstrate how to use **clim4health** to load and downscale seasonal forecast data, and plot the data at various stages.

In this example, we will perform the following steps:

**Step 1:**

-   Load the forecast, hindcast, and reanalysis data into R as s2dv_cube objects.
-   Convert the units of the datasets.
-   Plot the forecast, hindcast, and reanalysis data to visualize the raw data.

**Step 2:**

-   Spatially aggregate the hindcast and reanalysis data to administrative areas.
-   Alternatively, downscale the forecast and hindcast data to the finer spatial resolution of the reanalysis data.
-   Plot the downscaled forecast and hindcast data to visualize the downscaling results.
-   Perform a forecast verification to assess the skill of the forecast model compared to the reanalysis data.
-   Apply a threshold-based mask to the downscaled forecast data to identify only grid cells that exceed a certain temperature threshold.
-   Aggregate the data to a different temporal resolution, e.g. weekly or yearly.
-   Collapse the data across one or more dimensions, e.g. to calculate the ensemble mean or maximum value across time.

**Step 3:**

-   Convert the transformed data to a different data type, e.g. to a raster, polygon, or data frame.
-   Save the data.

> 📝 **NOTE**: the order of operations performed here is illustrative only, and users should decide based on their requirements.

First, load the required libraries.

```{r load_libraries}
library(clim4health)
```

## 1. Loading, inspecting, and preprocessing data

### Dataset description

The data used in this example are the monthly mean 2m temperature forecast (issued in January 2025 for 3 months lead time) from ECMWF's System 5.1 seasonal forecast, its corresponding hindcast, as well as reanalysis data from ERA5Land covering the same hindcast period (2010-2012).

> 📝 **NOTE**: in reality, a much longer hindcast period is required to downscale and verify seasonal forecasts, but a short period is used here for demonstration purposes.

```{r load_forecast_data}
# Load the forecast data
fcst_path <- system.file("extdata/forecast/", package = "clim4health")
fcst_path <- paste0(fcst_path, "/")

fcst <- c4h_load(fcst_path,
                 variable = "t2m",
                 year = 2025,
                 month = 1,
                 leadtime_month = "all",
                 ext = "nc")

# Print a summary of the values of the forecast data
summary(fcst$data)
```

We can also load the hindcast and reanalysis data, which should be initialized in the same month as the forecast (e.g. January), but for previous years. The hindcast and reanalysis should have the same time period for the downscaling and verifications steps.

```{r load_hindcast_reanalysis_data}
# Load the hindcast data
hindcast_path <- system.file("extdata/hindcast/", package = "clim4health")
hindcast_path <- paste0(hindcast_path, "/")

hcst <- c4h_load(hindcast_path,
                 variable = "t2m",
                 year = 2010:2012,
                 month = 1,
                 leadtime_month = 1:3,
                 ext = "nc")

reanalysis_path <- system.file("extdata/reanalysis/", package = "clim4health")
reanalysis_path <- paste0(reanalysis_path, "/")

# Load the reanalysis data
rean <- c4h_load(reanalysis_path,
                 variable = "t2m",
                 year = 2010:2012,
                 month = 1,
                 leadtime_month = 1:3,
                 ext = "nc")
```

Print the dimensions of the hindcast and reanalysis data, to compare with the forecast.

```{r print_raw_data}
print(fcst$dims)
print(hcst$dims)
print(rean$dims)
```

Follow the `clim4health_s2dv_cubes` vignette for more information on the structure of the clim4health s2dv_cube objects, as well as details about the types of climate data used here.

### Data preprocessing

Now that we have loaded the data, let's check the units.

```{r check_units}
# Display the current units
c4h_convert_units(fcst, var = "t2m")
```

We saw above that the temperature data are stored in Kelvin, which is common for climate model output and reanalyses due to the model physics. We can convert the units to Celsius, which is more useful in health impact studies, using the `c4h_convert_units()` function.

```{r convert_units}
# Convert from Kelvin to Celsius
fcst <- c4h_convert_units(fcst, to = "celsius", var = "t2m", from = "K")
hcst <- c4h_convert_units(hcst, to = "celsius", var = "t2m", from = "K")
rean <- c4h_convert_units(rean, to = "celsius", var = "t2m", from = "K")

# Display the new units
print(c4h_convert_units(fcst, var = "t2m"))
```

> 📝 **NOTE**: If you are using monthly mean data from CDS, take care with units, in particular for variables such as precipitation. For example, the precipitation variable to be downloaded from the ERA5-Land monthly mean dataset is called `total_precipitation`, and the data is stored in units of `m`. However, this does not refer to *monthly* total precipitation. Instead, it is the *monthly mean of the daily* total precipitation. In this case, you would also need to perform an additional transformation to obtain the monthly sum precipitation. Unfortunately, **clim4health** cannot differentiate between the identical units of `m`, and so you would have to make this conversion manually (by typing `output_data$data <- output_data$data * 30.44`).

### Plot the raw data

We can use `c4h_plot()` to plot the forecast, hindcast, and reanalysis data to visualize the raw data.

```{r plot_raw_data}
# Plot the forecast data
c4h_plot(fcst, var = "t2m", ensemble = TRUE)
# Plot the hindcast data
c4h_plot(hcst, var = "t2m", ensemble = TRUE)
# Plot the reanalysis data
c4h_plot(rean, var = "t2m", ensemble = TRUE)
```

## 2. Processing the data

Once the data has been loaded and explored, we can use **clim4health** functions to transform, downscale, and verify the data as required. In this example, we will spatially aggregate the data to municipalities, calibrate the forecast and hindcast data to the reanalysis data, and perform a forecast verification to assess the skill of the forecast model. We will also apply a threshold-based mask to the data to identify only grid cells that exceed a certain temperature threshold.

### Spatial aggregation

Typically, climate data is stored on grids of latitude and longitude, particularly if it comes from a climate model or reanalysis. However, for health impact studies, we often want to aggregate the data to administrative areas such as municipalities, states, or provinces, so that it can be aligned with recorded health data. We can use the `c4h_space()` function to spatially aggregate the gridded data to these administrative areas.

```{r spatial_aggregation}
# Load the sf package to read in spatial data
library(sf)

# Region municipalities
munip_path <- system.file("extdata", "areas", "munip_vallecauca.gpkg",
                          package = "clim4health")
munip <- read_sf(munip_path)

# Spatial aggregation
hcst_aggr <- c4h_space(hcst, munip)
rean_aggr <- c4h_space(rean, munip)
```

The climate datasets `hcst_aggr`, and `rean_aggr` are now aggregated to the same spatial areas, and, instead of containing `latitude` and `longitude` dimensions, they now contain an `area` dimension, which corresponds to the municipalities.

```{r print_dims}
print(hcst_aggr$dims)
```

We can plot the aggregated data in the same way as the gridded data, as `c4h_space()` stores the corresponding polygon geometries in `cube$attrs$area` in the output s2dv_cube `cube`, and `c4h_plot()` can utilize these geometries for visualization.

```{r plot_aggregated_data}
# Plot the aggregated hindcast data
c4h_plot(hcst_aggr, ensemble = TRUE)
# Plot the aggregated reanalysis data
c4h_plot(rean_aggr, ensemble = TRUE)
```

### Downscaling

`c4h_downscale()` allows the user to downscale, calibrate, and bias adjust the forecast and hindcast data to the finer spatial resolution of the reanalysis data. This is an important step for providing forecast data at a finer spatial resolution, and may help correct any systematic biases in the forecast model data.

> 📝 **NOTE**: see more information about the downscaling and calibrating capabilities in the `clim4health_downscaling` vignette.

To use `c4h_downscale()`, the user must specify different parameters depending on the required methodology. The key parameters are:

-   `downscale_function`: the downscaling/calibration method to use. Options are `"Interpolation"`, `"Intbc"`, `"Intlr"`, and `"Analogs"`. See the `clim4health_downscaling` vignette for more information about these methods.

-   `obs`: the observational data.

-   `exp`: the experimental data (generally speaking, the hindcast data. It should have the same time period as the observational data, and is used to calculate the bias adjustment parameters).

-   `exp_cor`: the experimental data to which the bias adjustment parameters will be applied (generally speaking, the forecast data). If it is NULL, the bias adjustment parameters will be applied to `exp` instead. This is useful if the user wants to apply the bias adjustment parameters to a different dataset (e.g. forecast data) than the one used to calculate the bias adjustment parameters (e.g. hindcast data).

First, let's downscale the forecast data.

```{r downscaling}
cal <- c4h_downscale(downscale_function = "Intbc", # downscaling method
                     obs = rean,                   # observational data, used to bias adjust the experiment
                     exp = hcst,                   # the experimental data to calculate the bias adjustment parameters
                     exp_cor = fcst,               # the experimental data to which the bias adjustment parameters will be applied
                     method_bc = "bias",           # for "Intbc", the bias correction method to use
                     method_remap = "bilinear")    # the method to remap the data to the new grid
```

> 📝 **NOTE**: you may see warnings when running `c4h_downscale()`. These are completely normal. For example, if you are using ERA5-Land reanalysis near the ocean, you will have many NA values, which may generate warnings.

`c4h_downscale()` returns a list of two s2dv_cubes:

-   `cal$exp` contains the downscaled/calibrated experimental data (in this case, the forecast data),

-   `cal$obs` contains the observational data.

> 📝 **NOTE**: in many cases, `cal$obs` will be the same as `rean_aggr` (the input `obs`), but there are certain cases where it may differ. See the `clim4health_downscaling` vignette for more information.

We can also downscale the hindcast data, which is used in the verification step to assess the skill of the forecast model. In this case, we leave the `exp_cor` parameter as NULL.

```{r downscale_hindcast}
cal_hcst <- c4h_downscale(downscale_function = "Intbc",
                          obs = rean,
                          exp = hcst,
                          method_bc = "bias",
                          method_remap = "bilinear",
                          loocv = FALSE)
```

Plot the downscaled forecast and hindcast data to visualise the results of the downscaling.

```{r plot_downscaled_data}
# Plot the downscaled forecast data
c4h_plot(cal$exp, ensemble = TRUE)
# Plot the downscaled hindcast data
c4h_plot(cal_hcst$exp, ensemble = TRUE)
```

### Verification and postprocessing

When using subseasonal to decadal predictions, we use hindcasts to assess the *skill* of the predictions. We can use `c4h_verify()` to perform a forecast verification and quality assessment of the forecast model, by comparing the downscaled hindcast data to the reanalysis data. This will give us an indication of how skillful the forecast model is at predicting the observed values.

> 📝 **NOTE**: there are many different ways to perform a skill assessment or verification. See the `clim4health_verification` vignette for more information.

`c4h_verify()` requires the user to specify a selection of metrics using the parameter `metrics`, as well the experimental data (`exp`) and the observational data (`obs`). Further parameters can be specified depending on the selected metrics. The possible metrics to select are explored in detail in the `clim4health_verification` vignette, but for this example, we will calculate the Continuous Ranked Probability Skill Score (CRPSS) of the forecast model compared to a reference forecast of climatology.

```{r verification}
skill <- c4h_verify(exp = cal_hcst$exp,   # the experimental data (downscaled hindcast data)
                    obs = cal_hcst$obs,   # the observational data (downscaled hindcast data)
                    metrics = c("CRPSS")) # the desired metrics to calculate
```

`c4h_verify()` returns a list of lists, where each element of the outer list corresponds to a metric, and each element of the inner list contains both the values of the metric, its significance, and in some cases additional information such as the upper and lower confidence intervals.

For example, `skill` contains only one element, `skill$CRPSS`, which is a list containing the values of CRPSS in `skill$CRPSS$crpss` and the significance of the CRPSS values in `skill$CRPSS$sign`. We can plot the skill of the forecast model using `c4h_plotskill()`. This differs from `c4h_plot()` in that it is specifically designed to plot skill scores, and can add a layer to the plot to indicate the significance of the skill values.

We can also explore the significance of the skill values by adding a layer to the plot using the `sign` parameter.

```{r plot_skill_significance}
c4h_plotskill(skill$CRPSS$crpss, sign = skill$CRPSS$sign)
```

> 📝 **NOTE**: in this case, there are no significant values! Don't panic. This is because we only loaded three years of data to compare the hindcast and reanalysis, which is not enough to produce any significant result.

### Masking data

`c4h_index()` allows the user to apply threshold-based indices to the data. For example, we can apply a mask to the forecast data to identify only grid cells that exceed a certain temperature threshold, which may be relevant for health impact studies (e.g. where certain vectors are only active above a certain temperature threshold). The user can specify:

-   `data`: the data to which the index will be applied.

-   `return_mask`: whether to return a 0-1 mask of where the threshold is exceeded, or to return the original data with values that do not exceed the threshold masked to NA.

-   `lower_threshold`: the lower threshold to apply to the data. Values below this threshold will be masked.

-   `upper_threshold`: the upper threshold to apply to the data. Values above this threshold will be masked.

-   `closed`: whether the interval should be open or closed (i.e. if the threshold limits should be included in the valid interval or not).

For example, we can apply `c4h_index()` to the downscaled forecast data to identify only grid cells where the temperature exceeds 20ºC, and then we can plot this masked dataset.

```{r apply_mask}
mask <- c4h_index(data = cal$exp,
                  return_mask = FALSE,
                  lower_threshold = 20)

c4h_plot(mask, ensemble = TRUE)
```

### Aggregating data temporally

`c4h_time()` allows the user to aggregate their climate data to different temporal resolutions. For example, the station data included within **clim4health** contains daily data, but perhaps the user would like to use weekly data instead, to align with health data recorded by epiweek.

The user must specify:

-   `data`: the data to which the temporal aggregation will be applied.

-   `time_aggregation`: the type of transformation to be applied, e.g. `"weekly"`.

-   `dim_aggregation`: the dimension across which the temporal aggregation will be applied, e.g. `"sdate"` or `"time"`.

-   `fun`: the function to apply to the data for the temporal aggregation, e.g. `"mean"` or `"max"`.

-   `week_start`: if `time_aggregation` is `"weekly"`, the day on which the week starts, e.g. `"Monday"`.

Let's load some daily station data to demonstrate this function.

```{r load_station_data}
station_path <- system.file("extdata/stations/",
                            package = "clim4health")
station_data <- c4h_load(station_path, variable = "temp_mean",
                         ext = "csv")
```

Since we have not specified any desired dates in our `c4h_load()` function, all available data from the csv file has been loaded into the `time` dimension, and the `sdate` dimension has length 1.

We can use `c4h_time()` to aggregate the data to obtain weekly mean daily-mean temperatures, by applying the mean function across the `time` dimension.

```{r temporal_aggregation}
weekly_data <- c4h_time(data = station_data,
                        time_aggregation = "weekly",
                        dim_aggregation = "time",
                        fun = "mean",
                        week_start = "Monday")
```

If we wanted to obtain a climatology, we could load the data with years in the `sdate` dimension, and then apply `c4h_time()` to calculate the mean across the `sdate` dimension.

```{r climatology}
station_data <- c4h_load(station_path, variable = "temp_mean",
                         ext = "csv", year = 2010:2012, month = 1, leadtime_month = 1:12)

print(station_data$dims)
```

`station_data` now contains the `sdate` dimension with length 93, which corresponds to the 93 start dates specified (year = 3 x month = 1 x day = 31), and `time` dimension with length 12, which corresponds to the 12 leadtime months specified. We can apply `c4h_time()` to calculate the mean across the `sdate` dimension to obtain a climatology (one value for each month).

```{r climatology_2}
yearly_stat <- c4h_time(data = station_data,
                        time_aggregation = "yearly",
                        dim_aggregation = "sdate",
                        fun = "mean")
```

### Collapsing data across a dimension

Climate data is generally multi-dimensional, and sometimes it can be useful to collapse the data across one or more dimensions to obtain a single value. For example, we may want to calculate the ensemble median or mean across all ensemble members, or the mean value across years (using the `sdate` dimension). We can use `c4h_collapse()` to do this.

```{r climatology_3}
climatology <- c4h_collapse(data = yearly_stat,
                            dim = "sdate",
                            fun = "mean")

print(climatology$dims)
```

We now have our climatology data, which has one value for each month.

## 3. Outputting the data

Once the data has been transformed as required, we can use **clim4health** functions to convert to different formats, and then save any of these.

### Convert data from an s2dv_cube to a different data type

Once all the desired processing has been done, you can convert the data from an s2dv_cube to a different data type using `c4h_convert()`. It takes an s2dv_cube as input, and a string containing the desired output type - SpatRaster objects (`"terra"`), stars (`"stars"`), data frames (`"data.frame"`), or polygon objects (`"sf"`). You may also choose to use the flags `crs` to identify the coordinate reference system for the output and `drop` to remove any singleton dimensions.

```{r convert_data_type}
csv_out <- c4h_convert(data = climatology, output_format = "data.frame", drop = TRUE)

terra_out <- c4h_convert(data = cal$exp, output_format = "terra")

polygons_out <- c4h_convert(data = cal$exp, output_format = "sf")
```

### Save the data

Finally, you can save the data using `c4h_save()`. This function allows you to save the processed data in a variety of formats, determined by the requested file extension. Valid file extensions include `"nc"`, `"tif"`, `"csv"`, `"gpkg"`, and `"shp"`, depending on whether the data to be saved is gridded, areal, or point data.

```{r }
c4h_save(climatology, path = "path_to_file.csv")
```

> 📝 **NOTE**: We recommend using `c4h_save()` at the end of your analysis, when the data is in its final format. If you think you will want to apply further **clim4health** functions later, we recommend saving an intermediate file using `saveRDS()` and loading it later using `readRDS()`.