This vignette provides an introduction to the c4h_verify() function, used for forecast verification. Forecast verification is an important step in evaluating the performance of climate forecasts, especially relevant in the development of climate-related early warning systems, since a forecast with poor skill may not provide reliable information to users. Verification involves comparing forecasted values (from a hindcast) to observed values (e.g. from reanalysis) to assess the performance of the prediction system at a given location, time, and leadtime. It is often necessary to downscale the hindcast to the same spatial resolution as the observations before performing the verification, which can be done using c4h_downscale().
Within this vignette, the different verification metrics are described, and examples of how to use the function are provided.
First, let’s load the package.
library(clim4health)Let’s load some sample data that we can use to demonstrate the verification methods.
hcst_path <- system.file("extdata/hindcast/", package = "clim4health")
hcst_path <- paste0(hcst_path, "/")
hcst <- c4h_load(hcst_path,
variable = "t2m",
year = 2010:2012,
month = 1,
leadtime_month = "all",
ext = "nc")
rean_path <- system.file("extdata/reanalysis/", package = "clim4health")
rean_path <- paste0(rean_path, "/")
rean <- c4h_load(rean_path,
variable = "t2m",
year = 2010:2012,
month = 1,
leadtime_month = 1:3,
ext = "nc")Look at the dimensions of the observed and hindcast datasets (recall we can use dim(rean$data) or rean$dims). We can see that rean has more latitude and longitude points, despite covering the same area - it has higher spatial resolution.
First, we need to downscale the hindcast to the same spatial resolution as the observations. In the same step, let’s convert to degrees Celsius.
hcst <- c4h_convert_units(hcst, to = "celsius")
rean <- c4h_convert_units(rean, to = "celsius")
dwn <- c4h_downscale("Intbc", exp = hcst, obs = rean,
method_remap = "bilinear", method_bc = "evmos")💡 Note: Instead of downscaling, you could also spatially aggregate the hindcast and reanalysis data using
c4h_space()to obtain the same spatial resolution, before calculating any skill metrics.
Let’s explore the functionality of c4h_verify() and the different verification metrics that can be calculated, before applying them in examples.
c4h_verify() takes many input parameters depending on the skill metrics to be calculated. The following are the key input arguments:
exp: an s2dv_cube object containing the experimental data.obs: an s2dv_cube object containing the observational data.metrics: a character vector identifying the verification metrics to be calculated.Depending on the metrics chosen, additional optional arguments may be selected, which will be discussed below.
There are many widely-used metrics to assess the quality of a forecast product. These typically compare the hindcasts and observations to see when the hindcast makes reasonable predictions of the observed values. The following are optional metrics that may be calculated within c4h_verify() and are selected using the metrics parameter.
The metrics are calculated by comparing forecast ensemble member predictions to observations.
"BSS" - Brier Skill ScoreThis is a skill score for binary or categorical outcomes (for example, does a forecast successfully predict above normal temperatures?). The Brier Score (BS) measures the mean squared error between forecast probabilities and actual outcomes, and the associated skill score (BSS) compares the forecast’s BS to a reference forecast BS (often the climatology).
In the example of predicting above-median temperatures in a given month:
Two commonly used thresholds for BSS are BSS10 and BSS90: these define the categories as the 10th and 90th percentiles respectively. They are useful to assess whether the forecast correctly predicts extremes.
For example, BSS90 assesses whether the forecast can predict when temperatures exceed the 90th percentile threshold (the 90th percentile is the value that 90% of the observed temperatures fall below). In this case:
"RPSS" - Ranked Probability Skill ScoreThis skill score is based on the Ranked Probability Score (RPS), which extends the Brier Score (and associated skill) to multiple ordered categories. For example, you may wish to assess a forecast’s ability to predict below-normal, near-normal and above-normal temperatures, where above-normal corresponds to temperatures over the 66th percentile, and below-normal to those below the 33rd.
"CRPSS" - Continuous Ranked Probability Skill ScoreThe Continuous Ranked Probability Score (CRPS) measures the integrated squared difference between the forecast’s cumulative probability distribution and the observed value, represented as a step function. It differs from the BS and RPS in that it considers the full continuous probability distribution rather than discrete categories. The Continuous Ranked Probability Skill Score (CRPSS) compares the CRPS of a forecast to that of a reference (often climatology).
"AbsBiasSS" - Absolute Bias Skill ScoreAbsolute bias (AbsBias) measures the absolute difference between the mean forecast value (over ensemble members) and the mean observed value (the climatological mean). Its associated skill score (AbsBiasSS) compares the forecast AbsBias with that of a climatological reference.
"MSE" - Mean Squared ErrorThe average of the squared differences between forecast ensemble mean and observed values. The lower the value, the smaller the errors. Large errors contribute more to the MSE because of the squared transformation.
"MSSS" - Mean Squared Error Skill ScoreThe skill score associated with the MSE.
"RMSE" - Root Mean Squared ErrorThe square root of the MSE, returned in the same units as the original forecast and observation values. The lower the value, the smaller the errors. Large errors are still weighted more heavily due to the squaring step, but the square root makes the result more interpretable than the MSE.
"RMSSS" - Root Mean Square Skill ScoreThe skill score associated with the Root Mean Squared Error (RMSE), which is the square root of the MSE.
"ROCSS" - Relative Operating Characteristic Skill ScoreThe Relative Operating Characteristic Skill Score (ROCSS) assesses the discriminatory ability of a forecast. It is based on the ROC curve, which plots the Hit Rate against the False Alarm Rate across multiple probability thresholds:
For a binary event such as “temperature above mean” and a probabilistic forecast, a probability threshold is applied to convert the forecast into a binary prediction (e.g. “predict the event if more than 50% of ensemble members predict it”). By varying this threshold, a Hit Rate and False Alarm Rate are calculated at each value, and the ROC curve is plotted comparing these (example below).
💡 Note: In the context of statistical modelling (for example, of disease cases), the Hit Rate is the same as the sensitivity, and the False Alarm Rate is the same as
1 - specificity.
A “good” forecast has a high hit rate and low false alarm rate, so lies above the diagonal
A random forecast performs no better than chance, producing equal hit and false alarm rates at every threshold, and so lies on the diagonal
A “bad” forecast has a higher false alarm rate than hit rate, so lies below the diagonal.
To calculate the ROCSS, we calculate the area under the curve (AUC), which ranges from 0 to 1 where a higher value indicates better discrimination. The ROCSS = 2*AUC - 1, giving:
ROCSS = 1: perfect discrimination — at some probability threshold, the forecast achieves a hit rate of 100% and a false alarm rate of 0%
ROCSS = 0: no discrimination — the forecast performs no better than chance
ROCSS < 0: worse than random — the forecast has a higher false alarm rate than hit rate, and would perform better if predictions were reversed.
For skill scores where significance testing is applicable (BSS, RPSS, CRPSS, MSSS, RMSSS, ROCSS), a p-value is calculated to assess whether the forecast skill is significantly different from zero (i.e. significantly better or worse than climatology). A significance threshold of α = 0.05 is applied by default, meaning that a skill score is considered statistically significant if there is less than a 5% probability that the observed skill arose by chance.
💡 Note: Significance testing is not typically applied to raw error metrics (MSE, RMSE, AbsBias) since these are not expressed relative to a reference and do not have a natural null hypothesis of zero.
Using the data loaded and downscaled above, let’s perform an example skill assessment. In this example, we have only loaded 3 years of data for each leadtime, which means that we do not have a lot of information to assess whether the hindcast is skillful or not. However, the general process is (for gridded data):
skill <- c4h_verify(exp = dwn$exp,
obs = dwn$obs,
metrics = c("BSS", "CRPSS"),
brier_thresholds = c(0.1, 0.9))The output of c4h_verify is a list of lists. The outer list contains the names of the metrics, and the inner contains the value of the metric and its significance, if applicable. In the above, we calculated the Brier Skill Score (BSS) at the 10th and 90th percentiles and Continuous Ranked Probability Skill Score (CRPSS), so we can find them in the list as:
names(skill)Within each metric, we save the value and the significance (by default alpha = 0.05). Looking at the dimensions of the outputs, note that the sdate dimension will now always have length 1, because the skill is calculated by assessing statistical relationships across the start dates.
names(skill$BSS10)
print(skill$BSS10$bss$dims)💡 Note: Because we have loaded so little data, the significance of all our metrics is FALSE. If we were to load more data, we would see that our hindcast has significant skill in some locations or for some dates.
Each value and significance is an s2dv_cube object. They can be plotted using c4h_plotskill().
c4h_plotskill(skill$BSS10$bss, skill$BSS10$sign, ensemble = TRUE, boundaries = "countries")There are many optional arguments in c4h_verify() that the user can specify to tailor their skill assessment. Currently, we recommend using the default values for most of these, but here we provide some guidance on how to choose them.
💡 Note: Currently,
c4h_verify()applies arguments across all metrics (except for a minority of cases, explained below). This is why it is recommended to use the default values. In future, it will be possible to specify arguments for specific metrics.
refThis argument should include an optional s2dv_cube containing a reference forecast to be used in the skill score calculations. If not provided, the climatological forecast is used as the reference forecast. In this case, the skill scores are then calculated by comparing the forecast provided in exp to the climatology of a forecast provided in ref - you would thus expect the skill scores to be 0 everywhere if using the same forecast in exp and ref.
brier_thresholdsThis argument is only relevant if “BSS” is included in metrics. It is a numeric vector of thresholds to be applied to calculate different Brier Skill Scores. The default is c(0.1, 0.9), which corresponds to the BSS10 and BSS90 respectively. You could specify other thresholds, for example c(0.33, 0.66) to calculate the BSS33 and BSS66, which would correspond to the 33rd and 66th percentiles respectively. In this case, you would then be assessing the forecast’s ability to predict below-normal and above-normal values respectively.
prob_thresholdsprob_thresholds acts similarly to brier_thresholds, but it is used when “RPSS” or “ROCSS” is in metrics. The default is c(1/3, 2/3), which corresponds to assessing the ability of the forecast to predict below-normal, normal, and above-normal values simultaneously. You could specify other thresholds, such as c(0.2, 0.4, 0.6, 0.8) to assess the forecast’s ability to predict values in the 5 categories defined by the quintiles.
sig_method and sig_method.typesig_method indicates the method used to calculate significance. The default is different for different metrics. For “BSS”, “RPSS”, “CRPSS”, and “AbsBiasSS”, the default is "RandomWalk". For “MSSS” and “RMSSS”, the default is “one-sided Fisher”. "Diebold-Mariano" is also possible for “RPSS” and “BSS”.
sig_method.type indicates the type of test to be applied if using a Random Walk sig_method. The default is "two.sided.approx", which assesses whether exp and ref are significantly different in skill with a two-sided test using an approximation. Other options include "two.sided", which uses an exact two-sided test, and "greater" or "less". "greater" assesses whether exp shows significantly better skill than ref with a one-sided test for negatively-oriented scores (scores where lower values are better), while "less" assesses whether exp shows significantly better skill than ref with a one-sided test for positively-oriented scores (scores where higher values are better). Examples of negatively-oriented scores are “RPS” and “RMSE”, and examples of positively-oriented scores are “ROC”. Thus, if you wish to assess skill using a one-sided test, you could set sig_method.type = "greater" for calculating “RPSS” and “RMSSS”, and sig_method.type = "less" for calculating “ROCSS”. More details can be found in the documentation of s2dv::RandomWalkTest().
alphaThis is the significance level to be applied in the significance testing of the skill scores. The default is 0.05. This value is applied across all metrics, but in some specific cases, the value will be incompatible with a chosen significance test, and the default will be applied. In such cases, a warning is issued!
ncoresThis is the number of cores to be used in the parallelisation of the skill score calculations. The default is 1, so no parallelisation is applied. If you have a large dataset and want to speed up the calculations, you can increase this value. An error will be returned if you specify a value higher than the number of cores available on your machine.
N.effThis argument specifies the effective sample size to use in the significance testing for metrics “BSS”, “RPSS”, “CRPSS”, or “RMSSS”. The default is NA, which uses the effective number of independent observations calculated from the data using s2dv::Eno().
indices_for_climThis is a vector of the indices to be taken along the sdate dimension for computing the thresholds between probabilistic categories. If NULL (default), the whole time period is used. If you wanted to specify the probability thresholds based only on the first 10 years of the data, you could set indices_for_clim = 1:10. This argument is only relevant if “BSS”, “RPSS”, “CRPSS”, or “ROCSS” is in metrics.
cross.val and clim.cross.valThese arguments are logical values (TRUE/FALSE) that indicate whether to apply cross-validation (i.e. excluding values) in the calculation of the probability categories. It is recommended to use their default values (FALSE, TRUE, respectively).
weights_exp and weights_refThis is an array to use to weight the forecast/reference ensemble members in the probability category calculations. The default is NULL, which means that no weighting is applied. This is recommended.
comp_dim and limitscomp_dim refers to the dimension along which to only take into account complete data. That is to say, data is removed along comp_dim if there is at least one NA between limits, where the argument limits specifies the range of indices along comp_dim to be considered. It is recommended to use the default values for these arguments, which means that no data is removed.
confIf “MSE” or “RMSE” is in metrics, conf specifies whether or not to return the confidence intervals. By default, they are returned (conf = TRUE).
na.rmBy default, this is FALSE, meaning that NA values are not removed in the calculations of “BSS”, “RPSS”, or “AbsBiasSS”. If any NA value is present, then the function will return NA there.
sign and pvalFor the metrics “MSSS” and “RMSSS”, sign and pval specify whether or not to compute the statistical significance and p-values of the test Ho: (R)MSSS = 0. By default, sign is TRUE and pval is FALSE. The default is that the significance is returned (TRUE/FALSE), but the p-values used to calculate this are not).
rocss_catThis argument should be an integer indicating the category to be returned when calculating “ROCSS”. The default is 1, which corresponds to the first category defined by prob_thresholds. With the default prob_thresholds = c(1/3, 2/3), this corresponds to the first category (i.e. below normal). If you wanted to calculate the ROCSS for the second category (i.e. normal), you would set rocss_cat = 2. It is currently only possible to return one category at a time.
Need a high-speed mirror for your open-source project?
Contact our mirror admin team at info@clientvps.com.
This archive is provided as a free public service to the community.
Proudly supported by infrastructure from VPSPulse , RxServers , BuyNumber , UnitVPS , OffshoreName and secure payment technology by ArionPay.