\input{share/preamble}
%\VignetteIndexEntry{Loss distributions modeling}
%\VignettePackage{actuar}
%\SweaveUTF8

\title{Loss modeling features of \pkg{actuar}}
\author{Christophe Dutang \\ Université Paris Dauphine \\[3ex]
  Vincent Goulet \\ Université Laval \\[3ex]
  Mathieu Pigeon \\ Université du Québec à Montréal}
\date{}

<>=
library(actuar)
options(width = 52, digits = 4)
@

\begin{document}

\maketitle

\section{Introduction}
\label{sec:introduction}

One important task of actuaries is the modeling of claim amount and claim count distributions for ratemaking, loss reserving or other risk evaluation purposes. Package \pkg{actuar} features many support functions for loss distribution modeling:
\begin{enumerate}
\item support for heavy-tailed continuous distributions useful in loss severity modeling;
\item support for phase-type distributions for ruin theory;
\item functions to compute raw moments, limited moments and the moment generating function (when it exists) of continuous distributions;
\item support for zero-truncated and zero-modified extensions of the discrete distributions commonly used in loss frequency modeling;
\item extensive support for grouped data;
\item functions to compute empirical raw and limited moments;
\item support for minimum distance estimation using three different measures;
\item treatment of coverage modifications (deductibles, limits, inflation, coinsurance).
\end{enumerate}
Vignette \code{"distributions"} covers points 1--4 above in great detail. This document concentrates on points 5--8.

\section{Grouped data}
\label{sec:grouped-data}

Grouped data is data represented in an interval-frequency manner. Typically, a grouped data set will report that there were $n_j$ claims in the interval $(c_{j - 1}, c_j]$, $j = 1, \dots, r$ (with the possibility that $c_r = \infty$). This representation is much more compact than an individual data set --- where the value of each claim is known --- but it also carries far less information. Now that storage space in computers has essentially become a non-issue, grouped data has somewhat fallen out of fashion. Still, grouped data remains useful as a means to represent data, if only graphically --- for example, a histogram is nothing but a density approximation for grouped data. Moreover, various parameter estimation techniques rely on grouped data. For these reasons, \pkg{actuar} provides facilities to store, manipulate and summarize grouped data. A standard storage method is needed since there are many ways to represent grouped data in the computer: using a list or a matrix, aligning $n_j$ with $c_{j - 1}$ or with $c_j$, omitting $c_0$ or not, etc. With appropriate extraction, replacement and summary methods, manipulation of grouped data becomes similar to that of individual data.

Function \code{grouped.data} creates a grouped data object similar to --- and inheriting from --- a data frame. The function accepts two types of input:
\begin{enumerate}
\item a vector of group boundaries $c_0, c_1, \dots, c_r$ and one or more vectors of group frequencies $n_1, \dots, n_r$ (note that there should be one more group boundary than group frequencies);
\item individual data $x_1, \dots, x_n$ and either a vector of breakpoints $c_1, \dots, c_r$, a number $r$ of breakpoints or an algorithm to determine the latter.
\end{enumerate}
In the second case, \code{grouped.data} will group the individual data using function \code{hist}. The function always assumes that the intervals are contiguous.
\begin{example}
  \label{ex:grouped.data-1}
  Consider the following already grouped data set:
  \begin{center}
    \begin{tabular}{lcc}
      \toprule
      Group & Frequency (Line 1) & Frequency (Line 2) \\
      \midrule
      $(0, 25]$    & 30 & 26 \\
      $(25, 50]$   & 31 & 33 \\
      $(50, 100]$  & 57 & 31 \\
      $(100, 150]$ & 42 & 19 \\
      $(150, 250]$ & 65 & 16 \\
      $(250, 500]$ & 84 & 11 \\
      \bottomrule
    \end{tabular}
  \end{center}
  We can conveniently and unambiguously store this data set in R as follows:
<>=
x <- grouped.data(Group = c(0, 25, 50, 100, 150, 250, 500),
                  Line.1 = c(30, 31, 57, 42, 65, 84),
                  Line.2 = c(26, 33, 31, 19, 16, 11))
@
  Internally, object \code{x} is a list with class
<>=
class(x)
@
  The package provides a suitable \code{print} method to display grouped data objects in an intuitive manner:
<>=
x
@
  \qed
\end{example}

\begin{example}
  \label{ex:grouped.data-2}
  Consider Data Set~B of \citet[Table~11.2]{LossModels4e}:
  \begin{center}
    \begin{tabular}{*{10}{r}}
      27 & 82 & 115 & 126 & 155 & 161 & 243 & 294 & 340 & 384 \\
      457 & 680 & 855 & 877 & 974 & \np{1193} & \np{1340} & \np{1884} & \np{2558} & \np{15743}
    \end{tabular}
  \end{center}
  We can represent this data set as grouped data using either an automatic or a suggested number of groups (see \code{?hist} for details):
<>=
y <- c( 27,  82, 115, 126, 155,  161,  243,  294,  340,  384,
       457, 680, 855, 877, 974, 1193, 1340, 1884, 2558, 15743)
grouped.data(y)
grouped.data(y, breaks = 5)
@
  The above grouping methods use equi-spaced breaks. This is rarely appropriate for heavily skewed insurance data. For this reason, \code{grouped.data} also supports specified breakpoints (or group boundaries):
<>=
grouped.data(y, breaks = c(0, 100, 200, 350, 750, 1200, 2500,
                           5000, 16000))
@
  \qed
\end{example}

The package supports the most common extraction and replacement methods for \code{"grouped.data"} objects using the usual \code{[} and \code{[<-} operators. In particular, the following extraction operations are supported. (In the following, object \code{x} is the grouped data object of \autoref{ex:grouped.data-1}.)
<>=
x <- grouped.data(Group = c(0, 25, 50, 100, 150, 250, 500),
                  Line.1 = c(30, 31, 57, 42, 65, 84),
                  Line.2 = c(26, 33, 31, 19, 16, 11))
@
\begin{enumerate}[i)]
\item Extraction of the vector of group boundaries (the first column):
<>=
x[, 1]
@
\item Extraction of the vector or matrix of group frequencies (the second and third columns):
<>=
x[, -1]
@
\item Extraction of a subset of the whole object (first three lines):
<>=
x[1:3, ]
@
\end{enumerate}
Notice how the result of an extraction is a simple vector or matrix when either the group boundaries or the group frequencies are dropped.

As for replacement operations, the package implements the following.
\begin{enumerate}[i)]
\item Replacement of one or more group frequencies:
<>=
x[1, 2] <- 22; x
x[1, c(2, 3)] <- c(22, 19); x
@
\item Replacement of the boundaries of one or more groups:
<>=
x[1, 1] <- c(0, 20); x
x[c(3, 4), 1] <- c(55, 110, 160); x
@
\end{enumerate}
It is not possible to replace the boundaries and the frequencies simultaneously.

The mean of grouped data is
\begin{equation}
  \hat{\mu} = \frac{1}{n} \sum_{j = 1}^r a_j n_j,
\end{equation}
where $a_j = (c_{j - 1} + c_j)/2$ is the midpoint of the $j$th interval, and $n = \sum_{j = 1}^r n_j$, whereas the variance is
\begin{equation}
  \frac{1}{n} \sum_{j = 1}^r n_j (a_j - \hat{\mu})^2.
\end{equation}
The standard deviation is the square root of the variance.
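For illustration, these statistics can be computed directly from the definitions above; the following sketch (my own illustration, not the implementation used by the package) does so for the group boundaries and the first frequency column of object \code{x}:
<>=
## illustration only: descriptive statistics computed directly
## from the definitions, using the boundaries and the
## frequencies of Line 1 of object 'x'
cj <- x[, 1]                          # boundaries c_0, ..., c_r
nj <- x[, -1][, 1]                    # frequencies n_1, ..., n_r
aj <- (head(cj, -1) + tail(cj, -1))/2 # interval midpoints a_j
n <- sum(nj)
mu <- sum(aj * nj)/n
c(mean = mu, var = sum(nj * (aj - mu)^2)/n)
@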
The package defines methods to easily compute the above descriptive statistics:
<>=
mean(x)
var(x)
sd(x)
@
Higher empirical moments can be computed with \code{emm}; see \autoref{sec:empirical-moments}.

The R function \code{hist} splits individual data into groups and draws a histogram of the frequency distribution. The package introduces a method for already grouped data. Only the first column of frequencies is considered (see \autoref{fig:histogram} for the resulting graph):
<>=
hist(x[, -3])
@
\begin{figure}[t]
  \centering
<>=
hist(x[, -3])
@
  \caption{Histogram of a grouped data object}
  \label{fig:histogram}
\end{figure}

\begin{rem}
  One will note that for an individual data set like \code{y} of \autoref{ex:grouped.data-2}, the following two expressions yield the same result:
<>=
hist(y)
hist(grouped.data(y))
@
\end{rem}

R has a function \code{ecdf} to compute the empirical cdf $F_n(x)$ of an individual data set:
\begin{equation}
  \label{eq:ecdf}
  F_n(x) = \frac{1}{n} \sum_{j = 1}^n I\{x_j \leq x\},
\end{equation}
where $I\{\mathcal{A}\} = 1$ if $\mathcal{A}$ is true and $I\{\mathcal{A}\} = 0$ otherwise. The function returns a \code{"function"} object to compute the value of $F_n(x)$ at any $x$.

The approximation of the empirical cdf for grouped data is called an ogive \citep{LossModels4e,HoggKlugman}. It is obtained by joining the known values of $F_n(x)$ at group boundaries with straight line segments:
\begin{equation}
  \tilde{F}_n(x) =
  \begin{cases}
    0, & x \leq c_0 \\
    \dfrac{(c_j - x) F_n(c_{j-1}) + (x - c_{j-1}) F_n(c_j)}{%
      c_j - c_{j - 1}}, & c_{j-1} < x \leq c_j \\
    1, & x > c_r.
  \end{cases}
\end{equation}
The package includes a generic function \code{ogive} with methods for individual and for grouped data. The function behaves exactly like \code{ecdf}.

\begin{example}
  \label{ex:ogive}
  Consider first the grouped data set of \autoref{ex:grouped.data-1}. Function \code{ogive} returns a function to compute the ogive $\tilde{F}_n(x)$ at any point:
<>=
(Fnt <- ogive(x))
@
  Methods for functions \code{knots} and \code{plot} allow one, respectively, to obtain the knots $c_0, c_1, \dots, c_r$ of the ogive and to draw a graph (see \autoref{fig:ogive}):
<>=
knots(Fnt)
Fnt(knots(Fnt))
plot(Fnt)
@
\begin{figure}[t]
  \centering
<>=
plot(Fnt)
@
  \caption{Ogive of a grouped data object}
  \label{fig:ogive}
\end{figure}
  To add further symmetry between functions \code{hist} and \code{ogive}, the latter also accepts a vector of individual data as argument. It will call \code{grouped.data} and then compute the ogive. (Below, \code{y} is the individual data set of \autoref{ex:grouped.data-2}.)
<>=
(Fnt <- ogive(y))
knots(Fnt)
@
  \qed
\end{example}

A method of function \code{quantile} for grouped data objects returns linearly smoothed quantiles, that is, the inverse of the ogive evaluated at various points:
<>=
Fnt <- ogive(x)
@
<>=
quantile(x)
Fnt(quantile(x))
@
Finally, a \code{summary} method for grouped data objects returns the quantiles and the mean, as is usual for individual data:
<>=
summary(x)
@

\section{Data sets}
\label{sec:data-sets}

This is certainly not the most spectacular feature of \pkg{actuar}, but it remains useful for illustrations and examples: the package includes the individual dental claims and grouped dental claims data of \cite{LossModels4e}:
<>=
data("dental"); dental
data("gdental"); gdental
@

\section{Calculation of empirical moments}
\label{sec:empirical-moments}

The package provides two functions useful for estimation based on moments.
First, function \code{emm} computes the $k$th empirical moment of a sample, whether in individual or grouped data form. For example, the following expressions compute the first three moments for individual and grouped data sets:
<>=
emm(dental, order = 1:3)
emm(gdental, order = 1:3)
@
Second, in the same spirit as \code{ecdf} and \code{ogive}, function \code{elev} returns a function to compute the empirical limited expected value --- or first limited moment --- of a sample for any limit. Again, there are methods for individual and grouped data (see \autoref{fig:elev} for the graphs):
<>=
lev <- elev(dental)
lev(knots(lev))
plot(lev, type = "o", pch = 19)
lev <- elev(gdental)
lev(knots(lev))
plot(lev, type = "o", pch = 19)
@
\begin{figure}[t]
  \centering
<>=
par(mfrow = c(1, 2))
plot(elev(dental), type = "o", pch = 19)
plot(elev(gdental), type = "o", pch = 19)
@
  \caption{Empirical limited expected value function of an individual data object (left) and a grouped data object (right)}
  \label{fig:elev}
\end{figure}

\section{Minimum distance estimation}
\label{sec:minimum-distance}

Two methods are widely used by actuaries to fit models to data: maximum likelihood and minimum distance. The first technique applied to individual data is well covered by function \code{fitdistr} of the package \pkg{MASS} \citep{MASS}. The second technique minimizes a chosen distance function between theoretical and empirical distributions. Package \pkg{actuar} provides function \code{mde}, very similar in usage and inner workings to \code{fitdistr}, to fit models according to any of the following three distance minimization methods.
\begin{enumerate}
\item The Cramér--von~Mises method (\code{CvM}) minimizes the squared difference between the theoretical cdf and the empirical cdf or ogive at their knots:
  \begin{equation}
    d(\theta) = \sum_{j = 1}^n w_j [F(x_j; \theta) - F_n(x_j)]^2
  \end{equation}
  for individual data and
  \begin{equation}
    d(\theta) = \sum_{j = 1}^r w_j [F(c_j; \theta) - \tilde{F}_n(c_j)]^2
  \end{equation}
  for grouped data. Here, $F(x)$ is the theoretical cdf of a parametric family, $F_n(x)$ is the empirical cdf, $\tilde{F}_n(x)$ is the ogive and $w_1 \geq 0, w_2 \geq 0, \dots$ are arbitrary weights (defaulting to $1$).
\item The modified chi-square method (\code{chi-square}) applies to grouped data only and minimizes the squared difference between the expected and observed frequency within each group:
  \begin{equation}
    d(\theta) = \sum_{j = 1}^r w_j [n (F(c_j; \theta) - F(c_{j - 1}; \theta)) - n_j]^2,
  \end{equation}
  where $n = \sum_{j = 1}^r n_j$. By default, $w_j = n_j^{-1}$.
\item The layer average severity method (\code{LAS}) applies to grouped data only and minimizes the squared difference between the theoretical and empirical limited expected value within each group:
  \begin{equation}
    d(\theta) = \sum_{j = 1}^r w_j [\LAS(c_{j - 1}, c_j; \theta) - \tilde{\LAS}_n(c_{j - 1}, c_j)]^2,
  \end{equation}
  where $\LAS(x, y) = \E{X \wedge y} - \E{X \wedge x}$, %
  $\tilde{\LAS}_n(x, y) = \tilde{E}_n[X \wedge y] - \tilde{E}_n[X \wedge x]$ and $\tilde{E}_n[X \wedge x]$ is the empirical limited expected value for grouped data.
\end{enumerate}

The arguments of \code{mde} are a data set, a function to compute $F(x)$ or $\E{X \wedge x}$, starting values for the optimization procedure and the name of the method to use. The empirical functions are computed with \code{ecdf}, \code{ogive} or \code{elev}.
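To make the connection with the above definitions concrete, here is a minimal sketch (my own illustration, not the internal implementation of \code{mde}) of the Cramér--von~Mises objective function for grouped data with all weights equal to $1$, built from the ogive of the \code{gdental} data set and minimized over the rate parameter of an exponential model:
<>=
## illustration only: Cramér-von Mises distance for grouped data,
## unit weights, exponential model fitted to 'gdental'
Fg <- ogive(gdental)
cj <- knots(Fg)[-1]             # boundaries c_1, ..., c_r
dcvm <- function(rate)
    sum((pexp(cj, rate) - Fg(cj))^2)
optimize(dcvm, interval = c(1e-4, 0.1))$minimum
@
Function \code{mde} provides a common interface to this type of fit, as the following example illustrates.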
\begin{example}
  \label{ex:mde}
  The expressions below fit an exponential distribution to the grouped dental data set, as per example~2.21 of \cite{LossModels}:
<>=
op <- options(warn = -1)        # hide warnings from mde()
@
<>=
mde(gdental, pexp, start = list(rate = 1/200), measure = "CvM")
mde(gdental, pexp, start = list(rate = 1/200), measure = "chi-square")
mde(gdental, levexp, start = list(rate = 1/200), measure = "LAS")
@
<>=
options(op)                     # restore warnings
@
  \qed
\end{example}

It should be noted that optimization is not always as simple to achieve as in \autoref{ex:mde}. For example, consider the problem of fitting a Pareto distribution to the same data set using the Cramér--von~Mises method:
<>=
mde(gdental, ppareto, start = list(shape = 3, scale = 600),
    measure = "CvM")
@
<>=
out <- try(mde(gdental, ppareto, start = list(shape = 3, scale = 600),
               measure = "CvM"), silent = TRUE)
cat(sub(", scale", ",\n scale", out))
@
Working in the log of the parameters often solves the problem since the optimization routine can then flawlessly work with negative parameter values:
<>=
pparetolog <- function(x, logshape, logscale)
    ppareto(x, exp(logshape), exp(logscale))
(p <- mde(gdental, pparetolog,
          start = list(logshape = log(3), logscale = log(600)),
          measure = "CvM"))
@
The actual estimators of the parameters are obtained with
<>=
exp(p$estimate)
@ %$
This procedure may introduce additional bias in the estimators, though.

\section{Coverage modifications}
\label{sec:coverage}

Let $X$ be the random variable of the actual claim amount for an insurance policy, $Y^L$ be the random variable of the amount paid per loss and $Y^P$ be the random variable of the amount paid per payment. The terminology for the last two random variables refers to whether or not the insurer knows that a loss occurred. Now, the random variables $X$, $Y^L$ and $Y^P$ will differ if any of the following coverage modifications are present for the policy: an ordinary or a franchise deductible, a limit, coinsurance or inflation adjustment \cite[see][chapter~8 for precise definitions of these terms]{LossModels4e}. \autoref{tab:coverage} summarizes the definitions of $Y^L$ and $Y^P$.

\begin{table}
  \centering
  \begin{tabular}{lll}
    \toprule
    Coverage modification & Per-loss variable ($Y^L$) & Per-payment variable ($Y^P$)\\
    \midrule
    Ordinary deductible ($d$) &
    $\begin{cases} 0, & X \leq d \\ X - d, & X > d \end{cases}$ &
    $\begin{cases} X - d, & X > d \end{cases}$ \medskip \\
    Franchise deductible ($d$) &
    $\begin{cases} 0, & X \leq d \\ X, & X > d \end{cases}$ &
    $\begin{cases} X, & X > d \end{cases} $ \medskip \\
    Limit ($u$) &
    $\begin{cases} X, & X \leq u \\ u, & X > u \end{cases}$ &
    $\begin{cases} X, & X \leq u \\ u, & X > u \end{cases}$ \bigskip \\
    Coinsurance ($\alpha$) & $\alpha X$ & $\alpha X$ \medskip \\
    Inflation ($r$) & $(1 + r)X$ & $(1 + r)X$ \\
    \bottomrule
  \end{tabular}
  \caption{Coverage modifications for per-loss variable ($Y^L$) and per-payment variable ($Y^P$) as defined in \cite{LossModels4e}.}
  \label{tab:coverage}
\end{table}

Often, one will want to use data $Y^P_1, \dots, Y^P_n$ (or $Y^L_1, \dots, Y^L_n$) from the random variable $Y^P$ ($Y^L$) to fit a model on the unobservable random variable $X$. This requires expressing the pdf or cdf of $Y^P$ ($Y^L$) in terms of the pdf or cdf of $X$. Function \code{coverage} of \pkg{actuar} does just that: given a pdf or cdf and any combination of the coverage modifications mentioned above, \code{coverage} returns a function object to compute the pdf or cdf of the modified random variable.
The function can then be used in modeling like any other \code{dfoo} or \code{pfoo} function.

\begin{example}
  \label{ex:coverage}
  Let $Y^P$ represent the amount paid by an insurer for a policy with an ordinary deductible $d$ and a limit $u - d$ (or maximum covered loss of $u$). Then the definition of $Y^P$ is
  \begin{equation}
    Y^P =
    \begin{cases}
      X - d, & d \leq X \leq u \\
      u - d, & X \geq u
    \end{cases}
  \end{equation}
  and its pdf is
  \begin{equation}
    \label{eq:pdf-YP}
    f_{Y^P}(y) =
    \begin{cases}
      0, & y = 0 \\
      \dfrac{f_X(y + d)}{1 - F_X(d)}, & 0 < y < u - d \\
      \dfrac{1 - F_X(u)}{1 - F_X(d)}, & y = u - d \\
      0, & y > u - d.
    \end{cases}
  \end{equation}
  Assume $X$ has a gamma distribution. Then an R function to compute the pdf \eqref{eq:pdf-YP} at any $y$ for a deductible $d = 1$ and a limit $u = 10$ is obtained with \code{coverage} as follows:
<>=
f <- coverage(pdf = dgamma, cdf = pgamma, deductible = 1, limit = 10)
f
f(0, shape = 5, rate = 1)
f(5, shape = 5, rate = 1)
f(9, shape = 5, rate = 1)
f(12, shape = 5, rate = 1)
@
  \qed
\end{example}

Note how function \code{f} in the previous example is built specifically for the coverage modifications submitted and contains as little useless code as possible.

The function returned by \code{coverage} may be used for various purposes, most notably parameter estimation, as the following example illustrates.

\begin{example}
  Let object \code{y} contain a sample of amounts paid under policies with the deductible and limit of \autoref{ex:coverage}. One can fit a gamma distribution by maximum likelihood to the claim severity distribution as follows:
<>=
x <- rgamma(100, 2, 0.5)
y <- pmin(x[x > 1] - 1, 9)
op <- options(warn = -1)        # hide warnings from fitdistr()
@
<>=
library(MASS)
fitdistr(y, f, start = list(shape = 2, rate = 0.5))
@
<>=
options(op)                     # restore warnings
@
  \qed
\end{example}

Vignette \code{"coverage"} contains more detailed formulas for the pdf and the cdf under various combinations of coverage modifications.

\bibliography{actuar}

\end{document}

%%% Local Variables:
%%% mode: noweb
%%% TeX-master: t
%%% coding: utf-8
%%% End: