%\VignetteIndexEntry{Examples of focused model comparison: linear regression} %\VignetteEngine{knitr::knitr} %\VignetteDepends{fic,ggplot2,GGally,gapminder} \documentclass[nojss,nofooter]{jss} <>= options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE, digits = 3) library("knitr") opts_chunk$set(fig.path="fic-") @ \author{Christopher Jackson\\\email{chris.jackson@mrc-bsu.cam.ac.uk}} \title{Examples of focused model comparison: linear regression} \Plainauthor{Christopher Jackson, MRC Biostatistics Unit} \usepackage{bm} \newcommand{\btheta}{\boldsymbol{\theta}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\bgamma}{\boldsymbol{\gamma}} \newcommand{\bdelta}{\boldsymbol{\delta}} \Abstract{This vignette illustrates focused model comparison with the \pkg{fic} package for linear regression models. Examples are given of covariate selection and polynomial order selection, with focuses defined by the mean, median or other quantiles of the outcome. } \Keywords{models} \begin{document} The linear regression model considered here has the general form \[ y_i \sim N(\mu_i, \sigma^2), \quad \mu_i = \alpha + \sum \beta_{s} x_{is} . \] for observations $i=1,\ldots,n$. The regressors $x_{is}$ might represent different covariates, contrasts between levels of a factor, functions of covariates such as polynomials, or interactions between different covariates. \section{Covariate selection in linear regression} Firstly we present a simple covariate selection problem in the well-known \code{mtcars} dataset from the \pkg{datasets} package distributed with standard \proglang{R} installations. The outcome $y_i$ is the fuel efficiency of car model $i$ measured in MPG. The wide model is taken to be the model suggested in \citet{henderson1981building} which includes the following predictors \begin{itemize} \item \code{am}: transmission type (0=automatic, 1=manual) \item \code{wt}: weight in 1000 lbs \item \code{qsec}: quarter mile time in seconds \item \code{disp}: displacement (cubic inches) \item \code{hp}: gross horsepower \end{itemize} Paired scatterplots of these variables suggest that \code{mpg} is correlated with all of these predictors, but many of the predictors themselves are correlated with each other. <>= mtcars$am <- factor(mtcars$am) library(ggplot2) if (requireNamespace("GGally",quietly=TRUE)){ GGally::ggpairs(mtcars[,c("mpg","am","wt","qsec","disp","hp")], aes(colour=am)) } else pairs(mtcars[,c("mpg","am","wt","qsec","disp","hp")]) @ <<>>= wide.lm <- lm(mpg ~ am + wt + qsec + disp + hp, data=mtcars) @ We compare all submodels of this wide model, with the minimal model including only an intercept. The \code{all_inds} function constructs a matrix of indicators \code{inds} for whether each coefficient (column) is included in each submodel (row). <<>>= library(fic) ncovs_wide <- length(coef(wide.lm)) - 1 inds0 <- c(1, rep(0, ncovs_wide)) inds <- all_inds(wide.lm, inds0) @ The focus is taken as the mean outcome (\code{focus=mean_normal}) for a car with covariate values supplied in \code{X}: automatic transmission \code{am=0} and values of the other four continuous covariates defined by their means in the data. <<>>= cmeans <- colMeans(model.frame(wide.lm)[,c("wt","qsec","disp","hp")]) X <- rbind( "auto" = c(intercept=1, am=0, cmeans), "manual" = c(intercept=1, am=1, cmeans) ) ficres <- fic(wide.lm, inds=inds, focus=mean_normal, X=X) summary(ficres) @ <>= ggplot_fic(ficres) @ \begin{figure}[h] \centering \includegraphics{fic-mtcars-1} \caption{Focused comparison of linear regression models for the \code{mtcars} data. } \label{fig:ficres} \end{figure} There is a cluster of submodels whose focus estimates are judged to have relatively low bias and mean square error. The model with minimal mean square error, for either focus, omits \code{wt} and \code{qsec}. Given the strong correlation of \code{wt} with \code{disp} and \code{qsec} with \code{hp}, these two variables do not improve the precision of the focus estimate. \section{Polynomial order selection} A common model selection problem is to choose an appropriate level of flexibility for a nonlinear relationship of an outcome with a predictor. This is often implemented through polynomial regression. In this example, a linear model with orthogonal polynomials is used to represent the relationship of life expectancy to GDP per capita for 1704 countries (worldwide) and years from 1952 to 2007, using data from \url{http://www.gapminder.org}, packaged by \citet{gapminder:pkg}. The dataset used for analysis excludes Kuwait, whose data follow a distinct pattern. The scatterplot shows a diminishing increase in life expectancy as GDP increases above a certain level. <<>>= library(gapminder) gap2 <- gapminder[gapminder$country !="Kuwait",] pal <- heat.colors(5) p <- ggplot(gap2, aes(x=gdpPercap, y=lifeExp)) + geom_point() + xlab("GDP per capita (US$, inflation-adjusted)") + ylab("Life expectancy (years)") + geom_point(data=gapminder[gapminder$country =="Kuwait",], col="gray") + annotate("text", x=80000, y=70, label="Kuwait", col="gray") @ A wide model is fitted with a polynomial relationship of degree 5. Fitted values from each model are added to the scatterplot. <<>>= wide.lm <- lm(lifeExp ~ poly(gdpPercap,5), data=gap2) yilab <- c(0, 50, 100, 65, 85) for (i in 2:5) { poly.lm <- lm(lifeExp ~ poly(gdpPercap,i), data=gap2) ft <- data.frame(x=gap2$gdpPercap, y=fitted(poly.lm)) ft <- ft[order(ft$x),] p <- p + geom_line(data=ft, aes(x=x,y=y), col=pal[i], lwd=2, alpha=0.8) + annotate("text", x=60000, y=yilab[i], col="black", label=sprintf("Polynomial degree %s", i)) } gdp_focus <- c(10000, 25000, 40000) p <- p + geom_vline(xintercept=gdp_focus, col="gray") + scale_x_continuous(breaks=c(0, gdp_focus, 60000, 80000, 100000)) p @ Submodels of degrees 2, 3 and 4 are compared in terms of how well they estimate three focuses: the average life expectancy at GDP per capita of \$10,000, \$25,000 and \$40,000. Note that the parameters include the intercept, so, for example, the simplest model, the quadratic polynomial model, has three parameters indicated by entries of 1 in the first row of \code{inds}. <<>>= inds <- rbind("quadratic"= c(1,1,1,0,0,0), "cubic" =c(1,1,1,1,0,0), "quartic" =c(1,1,1,1,1,0), "degree 5" =c(1,1,1,1,1,1)) X <- newdata_to_X(list(gdpPercap=gdp_focus), wide.lm, intercept=TRUE) rownames(X) <- gdp_focus (ficres <- fic(wide.lm, inds=inds, focus=mean_normal, X=X)) summary(ficres) @ While the most complex model gives the most precise estimates of mean life expectancy at all focuses, the preference for the complex model is less strong for GDP=10000 --- at this point there are more data, the models give more consistent focus estimates, and the bias incurred by using a simpler model is less. This is a simplified example --- alternative approaches to nonlinear regression might involve, e.g. splines or fractional polynomials. In theory, these can be implemented as linear additive models of the form shown here. Though exact details of implementing focused model comparison have not been investigated for these classes of models --- note that this would require all submodels to be nested within a single wide model. Note also the importance of considering knowledge of the underlying mechanism when building a regression model, for example, we might be sure that the relationship is monotonic. \subsection{Quantiles as the focus} \citet{claeskens:hjort:book} show that for a normal linear regression model, FIC and MSE are the same for a focus defined by the mean outcome as for a focus defined by any quantile of the outcome. We can check this in this example, while demonstrating how to implement quantiles as focus functions in \pkg{fic}. Firstly, the median of a normal distribution is equal to the mean, and is independent of the variance. Therefore we will get identical answers to the results for \code{focus=mean_normal} above by doing: <<>>= median_normal<- function(par,X){ qnorm(0.5, mean = as.numeric(X %*% par)) } ficres <- fic(wide.lm, inds=inds, focus=median_normal, X=X) @ Other quantiles, however, depend on the variance. Therefore a \code{sigma} argument should be defined for the focus function. This allows, e.g. a 10\% quantile focus to be implemented as <<>>= q10_normal <- function(par, X, sigma){ qnorm(0.1, mean = as.numeric(X %*% par), sd=sigma) } ficres <- fic(wide.lm, inds=inds, focus=q10_normal, X=X) @ However, we can define focus functions with arbitrary additional arguments. This allows any quantile to be defined using one common function, with an argument, say, \code{focus_p}, specifying the particular quantile to return. in a <<>>= quantile_normal <- function(par, X, sigma, focus_p=0.5){ qnorm(focus_p, mean = as.numeric(X %*% par), sd=sigma) } @ This argument can be passed to \code{fic}, along with the focus function, to fully specify the focus of interest. If a vector of values is supplied in \code{focus_p}, then multiple focuses are evaluated at once.\footnote{Note that vectors for \texttt{X} are treated differently from vectors for other focus arguments. If a named vector is supplied for \texttt{X} it is assumed to refer to multiple covariate values defining a single focus. If a vector is supplied for any other argument, it is assumed to identify multiple focuses. To completely avoid ambiguity for any argument, a matrix can be supplied, where the rows identify focuses and the columns identify, e.g. covariate values. } <<>>= ficres <- fic(wide.lm, inds=inds, focus=quantile_normal, X=X[1,], focus_p=c(0.1,0.5,0.9)) @ We can check that the results match between the alternative ways of setting up \code{fic} for the same focus. \section{Relation of focused model comparison with AIC} Using the \code{mtcars} example, we illustrate when focused model comparison agrees with model comparison using AIC. The following code performs focused model comparison for 32 distinct focus parameters, defined as the log likelihood contribution from each of the 32 observed covariate combinations in the \code{mtcars} data. Firstly the focus function is defined as the log density for an individual outcome. \citet{fic} show that differences between submodels in the expected mean square error of this focus are asymptotically equivalent to differences in AIC. <<>>= focus_loglik <- function(par,X,sigma,Y){ mu <- as.numeric(X %*% par) dnorm(Y,mu,sigma,log=TRUE) } @ To illustrate this result, we run \code{fic} with $n=32$ variants of this focus defined by the observed outcomes \code{Y} and covariates \code{X} in the \code{mtcars} data. <<>>= wide.lm <- lm(mpg ~ am + wt + qsec + disp + hp, data=mtcars) ncovs_wide <- length(coef(wide.lm)) - 1 inds0 <- c(1, rep(0, ncovs_wide)) inds <- all_inds(wide.lm, inds0) X <- model.matrix(wide.lm) Y <- model.response(model.frame(wide.lm)) ficres <- fic(wide.lm, inds=inds, focus=focus_loglik, X=X, Y=Y) @ We then extract the results averaged over these focuses, automatically computed by \code{fic} with each focus weighted equally, and extract the AICs of the submodels. The preference among models from the averaged FIC result agrees with AIC, up to sampling error. <>= ficres <- ficres[ficres$vals=="ave",] aics <- sapply(attr(ficres,"sub"), AIC) ggplot(data=NULL, aes(x=ficres[["rmse"]], y=aics)) + geom_point() + xlab("Root mean square error of log density estimate") + ylab("AIC") @ \bibliography{fic} \end{document}