---
title: "Revisiting Whittaker-Henderson Smoothing"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Revisiting Whittaker-Henderson Smoothing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: biblio.bib
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "",
  fig.width = 10, 
  fig.asp = 0.618,
  fig.align = "center",
  out.width = "100%"
)

library(WH)
```

## What is Whittaker-Henderson smoothing ?

### Origin

Whittaker-Henderson (WH) smoothing is a gradation method aimed at correcting the effect of sampling fluctuations on an observation vector. It is applied to evenly-spaced discrete observations. Initially proposed by @whittaker1922new for constructing mortality tables and further developed by the works of @henderson1924new, it remains one of the most popular methods among actuaries for constructing experience tables in life insurance. Extending to two-dimensional tables, it can be used for studying various risks, including but not limited to: mortality, disability, long-term care, lapse, mortgage default, and unemployment.

### The one-dimensional case

Let $\mathbf{y}$ be a vector of observations and $\mathbf{w}$ a vector of positive weights, both of size $n$. The estimator associated with Whittaker-Henderson smoothing is given by:

$$
\hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}}\{F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) + R_{\lambda,q}(\boldsymbol{\theta})\}
$$

where:

-   $F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = \underset{i = 1}{\overset{n}{\sum}} w_i(y_i - \theta_i)^2$ represents a fidelity criterion to the observations,

-   $R_{\lambda,q}(\boldsymbol{\theta}) = \lambda \underset{i = 1}{\overset{n - q}{\sum}} (\Delta^q\boldsymbol{\theta})_i^2$ represents a smoothness criterion.

In the latter expression, $\Delta^q$ denotes the forward difference operator of order $q$, such that for any $i\in[1,n - q]$:

$$
(\Delta^q\boldsymbol{\theta})_i = \underset{k = 0}{\overset{q}{\sum}} \begin{pmatrix}q \\ k\end{pmatrix}(- 1)^{q - k} \theta_{i + k}.
$$

Let us define $W = \text{Diag}(\mathbf{w})$, the diagonal matrix of weights, and $D_{n,q}$ as the order $q$ difference matrix of dimensions $(n-q) \times n$, such that $(D_{n,q}\boldsymbol{\theta})_i = (\Delta^q\boldsymbol{\theta})_i$ for all $i \in [1, n-q]$. The most commonly used difference matrices of order 1 and 2 have the following forms:

$$
D_{n,1} = \begin{bmatrix}
-1 & 1 & 0 & \ldots & 0 \\
0 & -1 & 1 & \ldots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \ldots & 0 & -1 & 1 \\
\end{bmatrix}
\quad\text{and}\quad
D_{n,2} = \begin{bmatrix}
1 & -2 & 1 & 0 & \ldots & 0 \\
0 & 1 & -2 & 1 & \ldots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \ldots & 0 & 1 & -2 & 1 \\
\end{bmatrix}.
$$

The fidelity and smoothness criteria can be rewritten in matrix form as:

$$
F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \quad \text{and} \quad R_{\lambda,q}(\boldsymbol{\theta}) = \lambda\boldsymbol{\theta}^TD_{n,q}^TD_{n,q}\boldsymbol{\theta}.
$$

The associated estimator for smoothing becomes:

$$
\hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace
$$

where $P_{\lambda} = \lambda D_{n,q}^TD_{n,q}$.

### The two-dimensional case

In the bidimensional case, let us consider a matrix $Y$ of observations and a matrix $\Omega$ of non-negative weights, both of dimensions $n_x \times n_z$. The estimator associated with the Whittaker-Henderson smoothing can be written as:

$$
\widehat{Y} = \underset{\Theta}{\text{argmin}}\{F(Y,\Omega, \Theta) + R_{\lambda,q}(\Theta)\}
$$

where:

- $F(Y,\Omega, \Theta) = \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z} \Omega_{i,j}(Y_{i,j} - \Theta_{i,j})^2$ represents a fidelity criterion to the observations,

- $R_{\lambda,q}(\Theta) = \lambda_x \sum_{j = 1}^{n_z}\sum_{i = 1}^{n_x - q_x} (\Delta^{q_x}\Theta_{\bullet,j})_i^2 + \lambda_z \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z - q_z} (\Delta^{q_z}\Theta_{i,\bullet})_j^2$ is a smoothness criterion.

This latter criterion can be written as the sum of two one-dimensional regularization criteria, with orders $q_x$ and $q_z$, applied respectively to all rows and all columns of $\Theta$. It is also possible to adopt matrix notations by defining $\mathbf{y} = \textbf{vec}(Y)$, $\mathbf{w} = \textbf{vec}(\Omega)$, and $\boldsymbol{\theta} = \textbf{vec}(\Theta)$ as the vectors obtained by stacking the columns of the matrices $Y$, $\Omega$, and $\Theta$, respectively. Additionally, let us denote $W = \text{Diag}(\mathbf{w})$ and $n = n_x \times n_z$. The fidelity and smoothness criteria can then be rewritten as linear combinations of the vectors $\mathbf{y}$, $\mathbf{w}$, and $\boldsymbol{\theta}$:

$$
\begin{aligned}
F(\mathbf{y},\mathbf{w}, \boldsymbol{\theta}) &= (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \\
R_{\lambda,q}(\boldsymbol{\theta}) &= \boldsymbol{\theta}^{T}(\lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}) \boldsymbol{\theta}.
\end{aligned}
$$

and the estimator associated with the smoothing takes the same form as in the one-dimensional case:

$$
\hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace
$$

except in this case $P_{\lambda} = \lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}$.

### Explicit solution

Whittaker-Henderson estimator has an explicit solution:

$$\hat{\mathbf{y}} = (W + P_{\lambda})^{-1}W\mathbf{y}.$$

Indeed, as a minimum, $\hat{\mathbf{y}}$ satisfies:

$$0 = \left.\frac{\partial}{\partial \boldsymbol{\theta}}\right|_{\hat{\mathbf{y}}}\left\lbrace(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}\right\rbrace = - 2 X^TW(y - \hat{\mathbf{y}}) +2X^TP_{\lambda} \hat{\mathbf{y}}.$$

It follows that $(W + P_{\lambda})\hat{\boldsymbol{\theta}} = W\mathbf{y}$, and if $W + P_{\lambda}$ is invertible, then $\hat{\mathbf{y}}$ is indeed a solution for the original equation. When $\lambda \neq 0$, $W + P_{\lambda}$ is invertible as long as $\mathbf{w}$ has $q$ non-zero elements in the one-dimensional case, and $\Omega$ has at least $q_x \times q_z$ non-zero elements distributed over $q_x$ different rows and $q_z$ different columns in the two-dimensional case. These are sufficient conditions, which are always satisfied in practice and will not be demonstrated here.

## How to use the package?

The `WH` package features two main functions `WH_1d` and `WH_2d` corresponding to the one-dimensional and two-dimensional cases respectively. Two arguments are mandatory for those functions:

* The vector (or matrix in the two-dimension case) `d` corresponding to the number of observed events of interest by age (or by age and duration in the two-dimension case). `d` should have named elements (or rows and columns) for the model results to be extrapolated.

* The vector (or matrix in the two-dimension case) `ec` corresponding to the portfolio central exposure by age (or by age and duration in the two-dimension case) whose dimensions should match those of `d`. The contribution of each individual to the portfolio central exposure corresponds to the time the individual was actually observed with corresponding age (and duration in the two-dimension cas). It always ranges from 0 to 1 and is affected by individuals leaving the portfolio, no matter the cause, as well as censoring and truncating phenomena.

Additional arguments may be supplied, whose description is given in the documentation of the functions.

The package also embed two fictive agregated datasets to illustrate how to use it:

* `portfolio_mortality` contains the agregated number of deaths and associated central exposure by age for an annuity portfolio.

* `portfolio_LTC` contains the agregated number of deaths and associated central exposure by age and duration (in years) since the onset of LTC for the annuitant database of a long-term care portfolio.

```{r fit-1d}
# One-dimensional case
d <- portfolio_mort$d
ec <- portfolio_mort$ec

WH_1d_fit <- WH_1d(d, ec)
```

```{r fit-2d}
# Two-dimensional case
keep_age <- which(rowSums(portfolio_LTC$ec) > 5e2)
keep_duration <- which(colSums(portfolio_LTC$ec) > 1e3)

d  <- portfolio_LTC$d[keep_age, keep_duration]
ec <- portfolio_LTC$ec[keep_age, keep_duration]

WH_2d_fit <- WH_2d(d, ec)
```

Functions `WH_1d` and `WH_2d` output objects of class `"WH_1d"` and `"WH_2d"` to which additional functions (including generic S3 methods) may be applied:

* The `print` function provides a glimpse of the fitted results

```{r print}
WH_1d_fit
WH_2d_fit
```

* The `plot` function generates rough plots of the model fit, the associated standard deviation, the model residuals or the associated degrees of freedom. See the `plot.WH_1d` and `plot.WH_2d` functions help for more details.

```{r plot}
plot(WH_1d_fit)
plot(WH_1d_fit, "res")
plot(WH_1d_fit, "edf")

plot(WH_2d_fit)
plot(WH_2d_fit, "std_y_hat")
```

* The `predict` function generates an extrapolation of the model. It requires a `newdata` argument, a named list with one or two elements corresponding to the positions of the new observations. In the two-dimension case constraints are used so that the predicted values matches the fitted values for the initial observations [see @carballo2021prediction to understand why this is required].

```{r predict}
WH_1d_fit |> predict(newdata = 18:99) |> plot()
WH_2d_fit |> predict(newdata = list(age = 50:99,
                                    duration = 0:19)) |> plot()
```


* Finally the `output_to_df` function converts an `"WH_1d"` or `"WH_2d"` object into a `data.frame`. Information about the fit is discarded in the process. This function may be useful to produce better visualizations from the data, for example using the ggplot2 package.

```{r}
WH_1d_df <- WH_1d_fit |> output_to_df()
WH_2d_df <- WH_2d_fit |> output_to_df()
```

## Further WH smoothing theory

### How to obtain credibility intervals?

From the explicit solution of WH smoothing, $\mathbb{E}(\hat{\mathbf{y}}) = (W + P_{\lambda})^{-1}W\mathbb{E}(\mathbf{y}) \ne \mathbb{E}(\mathbf{y})$ when $\lambda \ne 0$. This implies that penalization introduces a smoothing bias, which prevents the construction of a confidence interval centered around $\mathbb{E}(\mathbf{y})$. Therefore, in this section, we turn to a Bayesian approach where smoothing can be interpreted more naturally.

Let us suppose that $\mathbf{y} | \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, W^{-})$ and $\boldsymbol{\theta} \sim \mathcal{N}(0, P_{\lambda}^{-})$ where $P_{\lambda}^{-}$ denotes the pseudo-inverse of the matrix $P_{\lambda}$. The Bayes' formula allows us to express the posterior likelihood $f(\boldsymbol{\theta} | \mathbf{y})$ associated with these choices in the following form:

$$\begin{aligned}
f(\boldsymbol{\theta} | \mathbf{y}) &= \frac{f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta})}{f(y)} \\
&\propto f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta}) \\
&\propto \exp\left(- \frac{1}{2}(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\right)\exp\left(-\frac{1}{2}\boldsymbol{\theta}^TP_{\lambda} \boldsymbol{\theta}\right) \\
&\propto \exp\left(- \frac{1}{2}\left[(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right]\right).
\end{aligned}$$

The mode of the posterior distribution, $\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\text{argmax}} [f(\boldsymbol{\theta} | \mathbf{y})]$, also known as the maximum a posteriori (MAP) estimate, coincides with the explicit solution $\hat{\mathbf{y}}$. 

A second-order Taylor expansion of the log-posterior likelihood around $\hat{\mathbf{y}} = \hat{\boldsymbol{\theta}}$ gives us:

$$\ln f(\boldsymbol{\theta} | \mathbf{y}) = \ln f(\hat{\boldsymbol{\theta}} | \mathbf{y}) + \left.\frac{\partial \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}^{T}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T} \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$$

$$\text{where} \quad \left.\frac{\partial \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = 0 \quad \text{and} \quad \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = - (W + P_{\lambda}).$$

Note that this last derivative no longer depends on $\boldsymbol{\theta}$, and higher-order derivatives beyond 2 are all zero. The Taylor expansion allows for an exact computation of $\ln f(\boldsymbol{\theta} | \mathbf{y})$. By substituting the result back into the Taylor expansion, we obtain:

$$\begin{aligned}
f(\boldsymbol{\theta} | \mathbf{y}) &\propto \exp\left[\ln f(\hat{\boldsymbol{\theta}} | \mathbf{y}) - \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right] \\
&\propto \exp\left[- \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right]
\end{aligned}$$

which can immediately be recognized as the density of the $\mathcal{N}(\hat{\boldsymbol{\theta}},(W + P_{\lambda})^{- 1})$ distribution.

The assumption $\boldsymbol{\theta} \sim \mathcal{N}(0, P_{\lambda}^{-})$ corresponds to a simple Bayesian formalization of the smoothness criterion. It reflects a (improper) prior belief of the modeler about the underlying distribution of the observation vector $\mathbf{y}$.

The use of Whittaker-Henderson smoothing in the Bayesian framework and the construction of credible intervals are conditioned on the validity of the assumption $\mathbf{y} | \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, W^{-})$. The components of the observation vector should be independent and have known variances. The weight vector $\mathbf{w}$ used should contain the inverses of these variances. Under these assumptions, credible intervals at $100(1 - \alpha)$% for smoothing can be constructed and take the form:

$$\mathbb{E}(\mathbf{y}) | \mathbf{y} \in \left[\hat{\mathbf{y}} \pm \Phi\left(1 -\frac{\alpha}{2}\right)\sqrt{\textbf{diag}\left\lbrace(W + P_{\lambda})^{-1}\right\rbrace}\right]$$

with probability $1 -\frac{\alpha}{2}$ where $\hat{\mathbf{y}} = (W + P_{\lambda})^{- 1}W\mathbf{y}$ and $\Phi$ denotes the cumulative distribution function of the standard normal distribution. According to @marra2012, these credible intervals provide satisfactory coverage of the corresponding frequentist confidence intervals and can therefore be used as practical substitutes.

## What should I use as observations and weights?

The previous section highlighted the need to apply the Whittaker-Henderson smoothing to independent observation vectors $\mathbf{y}$ and weight vectors $\mathbf{w}$ corresponding to the inverses of the variances of the components of $\mathbf{y}$ in order to obtain a measure of uncertainty in the results. In this section, we propose, within the framework of duration models used for constructing experience tables, vectors $\mathbf{y}$ and $\mathbf{w}$ that satisfy these conditions.

### Duration models framework: one-dimensional case

Consider the observation of $m$ individuals in a longitudinal study subject to left truncation and right censoring phenomena. Suppose we want to estimate a single distribution that depends on only one continuous explanatory variable, denoted by $x$. For illustration purposes, we can consider it as a mortality distribution with the explanatory variable of interest $x$ representing age. Such a distribution is fully characterized by one of the following quantities:

- the cumulative distribution function $F(x)$ or its complement, the survival function $S(x) = 1 - F(x)$,
- the associated probability density function $f(x) = - \frac{\text{d}}{\text{d}x}S(x)$,
- the instantaneous hazard function $\mu(x) = - \frac{\text{d}}{\text{d}x}\ln S(x)$.

Suppose that the considered distribution depends on a vector of parameters $\boldsymbol{\theta}$ that we want to estimate using maximum likelihood. The likelihood associated with the observation of the individuals can be written as follows:

$$
\mathcal{L}(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\prod}} \left[\frac{f(x_i + t_i,\boldsymbol{\theta})}{S(x_i,\boldsymbol{\theta})}\right]^{\delta_i}\left[\frac{S(x_i + t_i,\boldsymbol{\theta})}{S(x_i,\boldsymbol{\theta})}\right]^{1 - \delta_i}
$$

Where $x_i$ represents the age at the start of observation, $t_i$ represents the observation duration, i.e., the time elapsed between the start and end dates of observation, and $\delta_i$ is the indicator of event observation, which takes the value 1 if the event of interest is observed and 0 if the observation is censored. We will not go into the details of calculating these three quantities, which should take into account the individual-specific information such as the subscription date, redemption date if applicable, as well as the global characteristics of the product, such as the presence of a deductible or medical selection phenomenon, and the choice of a restricted observation period due to data quality issues or delays in the reporting of observations. These factors typically lead to a narrower observation period than the actual presence period of individuals in the portfolio.

The various quantities introduced above are related by the following relationships:

$$
S(x) = \exp\left(\underset{u = 0}{\overset{x}{\int}}\mu(u)\text{d}u\right)\quad \text{and} \quad f(x) = \mu(x)S(x).
$$


The maximization of the likelihood in that equation is equivalent to the maximization of the associated log-likelihood, which can be rewritten using only the instantaneous hazard function (also known as the mortality rate in the case of the death risk):

$$
\ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} \left[\delta_i \ln\mu(x_i + t_i,\boldsymbol{\theta}) - \underset{u = 0}{\overset{t_i}{\int}}\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right]
$$


To discretize the problem, let us assume that the mortality rate is piecewise constant over one-year intervals between two integer ages. Formally, we have $\mu(x + \epsilon) = \mu(x)$ for all $x \in \mathbb{N}$ and $\epsilon \in [0,1[$. Furthermore, if $\mathbf{1}$ denotes the indicator function, then for any $0 \leq a < x_{\max}$, we have $\sum_{x = x_{\min}}^{x_{\max}} \mathbf{1}(x \leq a < x + 1) = 1$, where $x_{\min} = \min(\mathbf{x})$ and $x_{\max} = \max(\mathbf{x})$. The log-likelihood can be rewritten as:

$$
\begin{aligned}
\ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} &\left[\underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \delta_i\mathbf{1}(x \le x_i + t_i < x + 1)\ln\mu(x_i + t_i,\boldsymbol{\theta})\right. \\ 
&- \left.\underset{u = 0}{\overset{t_i}{\int}}\underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \mathbf{1}(x \le x_i + u < x + 1)\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right].
\end{aligned}
$$

The assumption of piecewise constant mortality rate implies that:

$$
\begin{aligned}
\mathbf{1}(x \le x_i + t_i < x + 1) \ln\mu(x_i + t_i,\boldsymbol{\theta}) &= \mathbf{1}(x \le x_i + t_i < x + 1) \ln\mu(x,\boldsymbol{\theta})\quad\text{and}\\ \mathbf{1}(x \le x_i + u < x + 1)\mu(x_i + u,\boldsymbol{\theta}) &= \mathbf{1}(x \le x_i + u < x + 1) \ln\mu(x,\boldsymbol{\theta}).
\end{aligned}
$$


It is then possible to interchange the two summations to obtain the following expressions:

$$
\begin{aligned}
\ell(\boldsymbol{\theta}) &= \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \left[\ln\mu(x,\boldsymbol{\theta}) d(x) - \mu(x,\boldsymbol{\theta}) e_c(x)\right] \quad \text{where} \\
d(x) & = \underset{i = 1}{\overset{m}{\sum}} \delta_i \mathbf{1}(x \le x_i + t_i < x + 1) \quad \text{and} \\
e_c(x) & = \underset{i = 1}{\overset{m}{\sum}}\underset{u = 0}{\overset{t_i}{\int}}\mathbf{1}(x \le x_i + u < x + 1)\text{d}u = \underset{i = 1}{\overset{m}{\sum}} \left[\min(t_i, x - x_i + 1) - \max(0, x - x_i)\right]^+
\end{aligned}
$$

by denoting $a^+ = \max(a, 0)$, where $d(x)$ and $e_c(x)$ correspond to the number of observed deaths between ages $x$ and $x + 1$ and the sum of observation durations of individuals between these ages, respectively (the latter quantity is also known as central exposure to risk).

## Extension to the two-dimensional case

The extension of the proposed approach to the two-dimensional framework requires only minor adjustments to the previous reasoning. Let $z_{\min} = \min(\mathbf{z})$ and $z_{\max} = \max(\mathbf{z})$. The piecewise constant assumption for the mortality rate needs to be extended to the second dimension. Formally, we now assume that $\mu(x + \epsilon, z + \xi) = \mu(x, z)$ for all pairs $x, z \in \mathbb{N}$ and $\epsilon, \xi \in [0,1[$. The sums involving the variable $x$ are then replaced by double sums considering all combinations of $x$ and $z$. The log-likelihood is given by:

$$
\begin{aligned}
\ell(\boldsymbol{\theta}) &= \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \underset{z = z_{\min}}{\overset{z_{\max}}{\sum}}\left[\ln\mu(x,z,\boldsymbol{\theta}) d(x,z) - \mu(x,z,\boldsymbol{\theta}) \mathbf{e_c}(x,z)\right] \quad \text{where} \\
d(x,z) & = \underset{i = 1}{\overset{m}{\sum}} \delta_i \mathbf{1}(x \le x_i + t_i < x + 1) \mathbf{1}(z \le z_i + t_i < z + 1) \quad \text{and}\\
\mathbf{e_c}(x,z) & = \underset{i = 1}{\overset{m}{\sum}}\underset{u = 0}{\overset{t_i}{\int}}\mathbf{1}(x \le x_i + u < x + 1)\mathbf{1}(z \le z_i + u < z + 1)\text{d}u \\
& = \underset{i = 1}{\overset{m}{\sum}} \left[\min(t_i, x + 1 - x_i, z + 1 - z_i) - \max(0, x - x_i, z - z_i)\right]^+
\end{aligned}
$$

## Likelihood equations {#sec-cons}

The choice $\boldsymbol{\mu}(\boldsymbol{\theta}) = \exp(\boldsymbol{\theta})$, which corresponds to considering one parameter per observation, allows us to relate to the Whittaker-Henderson smoothing. Using the exponential function ensures positive values for the estimated mortality rate. The expressions of the likelihood in the unidimensional or two-dimensional case can then be written in a common vectorized form:

$$
\ell(\boldsymbol{\theta}) = \boldsymbol{\theta}^{T}\mathbf{d} - \exp(\boldsymbol{\theta})^{T}\mathbf{e_c}
$$

where $\mathbf{d}$ and $\mathbf{e}_c$ represent the vectors of observed deaths and exposures to risk, respectively. The derivatives of the likelihood function for this model are given by:

$$\frac{\partial \ell}{\partial \boldsymbol{\theta}} = \left[\mathbf{d} -\exp(\boldsymbol{\theta}) \odot \mathbf{e_c}\right] \quad \text{and} \quad \frac{\partial^2 \ell}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^T} = - \text{Diag}(\exp(\boldsymbol{\theta}) \odot \mathbf{e_c}).$$

Note that these likelihood equations are exactly what we would obtain by assuming that the observed numbers of deaths, conditional on the observed exposures to risk $\mathbf{e}_c$, follow Poisson distributions with parameters $\boldsymbol{\mu}(\boldsymbol{\theta}) \odot \mathbf{e}_c$. The model presented here has many similarities with a Poisson GLM [@nelder1972glm]. However, the initial assumptions are not the same for both models.

These likelihood equations have an explicit solution given by $\hat{\boldsymbol{\theta}} = \ln(\mathbf{d} / \mathbf{e}_c)$. This model, which treats each age independently, is known as the crude rates estimator. The properties of the maximum likelihood estimator imply that asymptotically $\hat{\boldsymbol{\theta}} \sim \mathcal{N}(\boldsymbol{\theta}, W_{\hat{\boldsymbol{\theta}}}^{-1})$, where $W_{\hat{\boldsymbol{\theta}}}$ is a diagonal matrix with elements $\exp(\hat{\boldsymbol{\theta}}) \odot \mathbf{e}_c = (\mathbf{d} / \mathbf{e}_c) \odot \mathbf{e}_c = \mathbf{d}$. It should be noted that the asymptotic nature and the validity of this approximation are related to the number of individuals $m$ in the portfolio and not the size $n$ of the vectors $\mathbf{d}$ and $\mathbf{e}_c$.

Thus, we have shown that in the framework of duration models, using the crude rates estimator, asymptotically $\ln(\mathbf{d} / \mathbf{e}_c) \sim \mathcal{N}(\ln\boldsymbol{\mu}, W^{-1})$, where $W = \text{Diag}(\mathbf{d})$. This justifies applying the Whittaker-Henderson smoothing to the observation vector $y = \ln(\mathbf{d} / \mathbf{e}_c)$ and weight vector $\mathbf{w} = \mathbf{d}$. We obtain the following associated credibility intervals:

$$\ln\boldsymbol{\mu} | \mathbf{d}, \mathbf{e_c} \in \left[\hat{\boldsymbol{\theta}} \pm \Phi\left(1 -\frac{\alpha}{2}\right)\sqrt{\textbf{diag}\left\lbrace(\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\right\rbrace}\right]$$

where $\hat{\boldsymbol{\theta}} = (\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\text{Diag}(\mathbf{d})[\ln(\mathbf{d}) - \ln \mathbf{e}_c]$. Results on $\boldsymbol{\mu}$ can be obtained directly by exponentiating the above expression.

## How is the optimal smoothing parameter determined ?

In the definition of WH smoothing, $(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})$ represents a fidelity criterion to the observations, and $\boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}$ represents a smoothness criterion. The relative importance of these criteria is controlled by the parameter (or pair of parameters in the two-dimensional case) $\lambda$. The result of smoothing is highly sensitive to the chosen value of the smoothing parameter. Ideally, the selection of the smoothing parameter is based on the optimization of a statistical criterion, typically belonging to one of two major families. 

On one hand, there are criteria based on the minimization of the model's prediction error, among which the Akaike information criterion [AIC, @akaike1973] and the generalized cross-validation [GCV, @wahba1980gcv] are included. The Bayesian information criterion [BIC, @schwarz1978bic] has a similar form to AIC, although its theoretical justification is very different, and thus it can be considered part of this group.

On the other hand, there are criteria based on the maximization of a likelihood function known as the marginal likelihood. This type of criterion was introduced by @patterson1971reml in the Gaussian case, initially under the name of restricted likelihood (REML), and used by @anderssen1974time for the selection of smoothing parameters. @wahba1985comparison and @kauermann2005note show that criteria minimizing prediction error have the best asymptotic performance, but their convergence to the optimal smoothing parameters is slower. For finite sample sizes, criteria based on the maximization of a likelihood function are considered a more robust choice by many authors, such as @reiss2009smoothing or @wood2011reml. For these reasons, we will favor the marginal likelihood as the selection criterion, especially since this choice naturally fits into the Bayesian framework introduced in the past sections.

Let us start from the notations and assumptions from last sections, namely $\mathbf{y} | \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, W^{-})$ and $\boldsymbol{\theta} | \lambda \sim \mathcal{N}(0, P_{\lambda} ^{-})$. In a purely Bayesian approach, it would be necessary to define a prior distribution on $\lambda$ and then estimate the posterior distribution of each parameter vector using methods such as Markov Chain Monte Carlo. The empirical Bayesian approach we adopt seeks to find the value of $\lambda$ that maximizes the marginal likelihood:

$$\mathcal{L}^m_\text{norm}(\lambda) = f(\mathbf{y} | \lambda) = \int f(\mathbf{y}, \boldsymbol{\theta} | \lambda)\text{d}\boldsymbol{\theta}  = \int  f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta} |\lambda)\text{d}\boldsymbol{\theta}.$$

This corresponds to the maximum likelihood method applied to the smoothing parameter. Let us explicitly rewrite the expressions of $f(\mathbf{y} | \boldsymbol{\theta})$ and $f(\boldsymbol{\theta} |\lambda)$ introduced previously:

$$\begin{aligned}
f(\mathbf{y} | \boldsymbol{\theta}) &= \sqrt{\frac{|W|_{+}}{(2\pi)^{n_*}}}\exp\left(- \frac{1}{2}(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\right) \\
f(\boldsymbol{\theta} |\lambda) &= \sqrt{\frac{|P_{\lambda}|_{+}}{(2\pi)^{p - q}}} \exp\left(- \frac{1}{2}\boldsymbol{\theta}^{T}P_{\lambda} \boldsymbol{\theta}\right)
\end{aligned}$$

where $|A|_{+}$ denotes the product of the non-zero eigenvalues of $A$, $n_*$ is the number of non-zero diagonal elements of $W$, and $q$ is the number of zero eigenvalues of $P_{\lambda}$ ($q = q_x \times q_z$ in the two-dimensional case). Based on the Taylor expansion performed to obtain credibility intervals, let us recall that:

$$
\ln f(\mathbf{y}, \boldsymbol{\theta} | \lambda) = \ln f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda})^{T} (W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda})
$$

which leads to:

$$\begin{aligned}
\mathcal{L}^m_\text{norm}(\lambda) &= \int \exp[\ln f(\mathbf{y}, \boldsymbol{\theta} | \lambda)]\text{d}\boldsymbol{\theta} \\
&= f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \int \exp\left[- \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda}) \right]\text{d}\boldsymbol{\theta} \\
&= f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \sqrt{\frac{(2\pi)^p}{|W + P_{\lambda}|}} \\
&= f_\mathbf{y}(\mathbf{y} | \hat{\boldsymbol{\theta}}_{\lambda}) f_{\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \sqrt{\frac{(2\pi)^p}{|W + P_{\lambda}|}} \\
&= \sqrt{\frac{|W|_{+}| P_{\lambda} |_{+}}{(2\pi)^{n_* - q}|W + P_{\lambda}|}} \exp\left(- \frac{1}{2}\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) + \hat{\boldsymbol{\theta}}_{\lambda}^T P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda}\right]\right).
\end{aligned}$$

The associated log-likelihood can be expressed as follows:

$$\begin{aligned}
\ell^m_\text{reg}(\lambda) = - \frac{1}{2}&\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) + \hat{\boldsymbol{\theta}}_{\lambda}^{T}P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda}\right. \\
&\left.- \ln|W|_{+} - \ln|P_{\lambda}|_{+} + \ln|W + P_{\lambda}| + (n_* - q)\ln(2\pi)\right].
\end{aligned}$$

Once the selection of $\lambda$ has been determined using the estimator $\hat{\lambda} = \underset{\lambda}{\text{argmax}}\: \ell^m_\text{reg}(\lambda)$, the lack of an explicit solution to this equation forces us to resort to numerical methods for its resolution. The Newton algorithm could once again be employed here and is a robust choice. This approach was notably adopted by @wood2011reml. However, explicitly calculating the derivatives of the likelihood $\ell^m_\text{reg}$ is rather difficult from an operational perspective. Instead, we will use general heuristics such as those provided by @brent1973optimize and @nelder1965optim, which are applicable to any sufficiently smooth function. These heuristics do not require derivative calculations and are implemented in the `optimize` and `optim` functions of the statistical programming language $\mathsf{R}$.

## Advanced WH smoothing theory

See the upcoming paper available [here](https://hal.science/hal-04124043)

## References {-}