---
title: "Extending the cv package"
author: "John Fox and Georges Monette"
date: "`r Sys.Date()`"
package: cv
output:
  rmarkdown::html_vignette:
    fig_caption: yes
bibliography: ["cv.bib"]
csl: apa.csl
vignette: >
  %\VignetteIndexEntry{Extending the cv package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  message = TRUE,
  warning = TRUE,
  fig.align = "center",
  fig.height = 6,
  fig.width = 7,
  fig.path = "fig/",
  dev = "png",
  comment = "#>" #,
  # eval = nzchar(Sys.getenv("REBUILD_VIGNETTES"))
)

# save some typing
knitr::set_alias(w = "fig.width", h = "fig.height", cap = "fig.cap")

# colorize text: use inline as `r colorize(text, color)`
colorize <- function(x, color) {
  if (knitr::is_latex_output()) {
    sprintf("\\textcolor{%s}{%s}", color, x)
  } else if (knitr::is_html_output()) {
    sprintf("<span style='color: %s;'>%s</span>", color, x)
  } else x
}

.opts <- options(digits = 5)
```

The **cv** package is designed to be extensible in several directions. In this vignette, we discuss three kinds of extensions, ordered by increasing complexity: (1) adding a cross-validation cost criterion; (2) adding a model class that's not directly accommodated by the `cv()` default method or by another directly inherited method, with separate consideration of mixed-effects models; and (3) adding a new model-selection procedure suitable for use with `cvSelect()`.

## Adding a cost criterion

A cost criterion suitable for use with `cv()` or `cvSelect()` should take two arguments, `y` (the observed response vector) and `yhat` (a vector of fitted or predicted response values), and return a numeric index of lack of fit. The **cv** package supplies several such criteria: `mse(y, yhat)`, which returns the mean-squared prediction error for a numeric response; `rmse(y, yhat)`, which returns the (square-)root mean-squared error; `medAbsErr(y, yhat)`, which returns the median absolute error; and `BayesRule(y, yhat)` (and its non-error-checking version, `BayesRule2(y, yhat)`), suitable for use with a binary regression model, where `y` is the binary response coded `0` for a "failure" or `1` for a "success"; where `yhat` is the predicted probability of success; and where the proportion of *incorrectly* classified cases is returned.

To illustrate the use of a different prediction cost criterion, we'll base one on the area under the receiver operating characteristic ("ROC") curve for a logistic regression. The ROC curve is a graphical representation of the classification power of a binary regression model, and the area under the ROC curve ("AUC"), which varies from 0 to 1, is a common summary measure based on the ROC [see @Wikipedia-ROC:2023]. The **Metrics** package [@HamnerFrasco:2018] includes a variety of measures useful for model selection, including an `auc()` function. We convert the AUC into a cost measure by taking its complement:

```{r AUCcomp}
AUCcomp <- function(y, yhat) 1 - Metrics::auc(y, yhat)
```

We then apply `AUCcomp()` to the Mroz logistic regression, discussed in the introductory vignette on cross-validating regression models, which we reproduce below.
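
As a quick sanity check, `AUCcomp()` should return 0 when the predicted probabilities rank the observed responses perfectly and 1 when the ranking is completely reversed; for example, with artificial data:

```{r AUCcomp-check}
AUCcomp(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9)) # perfect separation: 0
AUCcomp(c(0, 0, 1, 1), c(0.9, 0.8, 0.2, 0.1)) # completely reversed ranking: 1
```
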
Using the `Mroz` data frame from the **carData** package [@FoxWeisberg:2019]:

```{r Mroz-regression}
data("Mroz", package="carData")
m.mroz <- glm(lfp ~ ., data=Mroz, family=binomial)
summary(m.mroz)

AUCcomp(with(Mroz, as.numeric(lfp == "yes")), fitted(m.mroz))
```

Cross-validating this cost measure is straightforward:

```{r Mroz-CV-ROC}
library("cv")

cv(m.mroz, criterion=AUCcomp, seed=3639)
```

As expected, the cross-validated complement to the AUC is somewhat less optimistic than the criterion computed from the model fit to the whole data set.

As we explain in the vignette "Cross-validating regression models," the `cv()` function differentiates between CV criteria that are averages of casewise components and criteria that are not. Computation of bias corrections and confidence intervals is limited to the former. We show in the technical and computational vignette that the AUC, and hence its complement, cannot be expressed as averages of casewise components.

`cv()` looks for a `"casewise loss"` attribute of the value returned by a CV criterion function. If this attribute exists, then the criterion is treated as the mean of casewise components, and `cv()` uses the unexported function `getLossFn()` to construct a function that returns the casewise components of the criterion. We illustrate with `mse()`:

```{r}
mse

cv:::getLossFn(mse(rnorm(100), rnorm(100)))
```

For this scheme to work, the `"casewise loss"` attribute must be a character string (or vector of character strings), here `"(y - yhat)^2"`, that evaluates to an expression that is a function of `y` and `yhat`, and that computes the vector of casewise components of the CV criterion.

## Adding a model class not covered by the default `cv()` method

### Independently sampled cases

Suppose that we want to cross-validate a multinomial logistic regression model fit by the `multinom()` function in the **nnet** package [@VenablesRipley:2002]. We borrow an example from @Fox:2016 [Sec. 14.2.1], with data from the British Election Panel Study on vote choice in the 2001 British election. Data for the example are in the `BEPS` data frame in the **carData** package:

```{r BEPS-data}
data("BEPS", package="carData")
head(BEPS)
```

The polytomous (multi-category) response variable is `vote`, a factor with levels `"Conservative"`, `"Labour"`, and `"Liberal Democrat"`. The predictors of `vote` are:

* `age`, in years.
* `economic.cond.national` and `economic.cond.household`, the respondent's ratings of the state of the economy, on 1 to 5 scales.
* `Blair`, `Hague`, and `Kennedy`, ratings of the leaders of the Labour, Conservative, and Liberal Democratic parties, on 1 to 5 scales.
* `Europe`, an 11-point scale on attitude towards European integration, with high scores representing "Euro-skepticism."
* `political.knowledge`, knowledge of the parties' positions on European integration, with scores from 0 to 3.
* `gender`, `"female"` or `"male"`.

The model fit to the data includes an interaction between `Europe` and `political.knowledge`; the other predictors enter the model additively:

```{r BEPS-model}
library("nnet")

m.beps <- multinom(
  vote ~ age + gender + economic.cond.national + economic.cond.household +
    Blair + Hague + Kennedy + Europe * political.knowledge,
  data = BEPS
)

car::Anova(m.beps)
```

Most of the predictors, including the `Europe` $\times$ `political.knowledge` interaction, are associated with very small $p$-values; the `Anova()` function is from the **car** package [@FoxWeisberg:2019].
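
To get a concrete sense of what the fitted model produces, here are the predicted probabilities for the first few respondents, obtained with `type="probs"` in the `predict()` method for `"multinom"` objects; the category with the largest probability for each case is the Bayes-rule classification that we employ below:

```{r BEPS-predicted-probs}
# predicted probabilities of voting for each party, first few respondents
head(predict(m.beps, type = "probs"))
```
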
Here's an "effect plot", using the **effects** package [@FoxWeisberg:2019] to visualize the `Europe` $\times$ `political.knowledge` interaction in a "stacked-area" graph:

```{r BEPS-plot, fig.width=9, fig.height=5}
plot(
  effects::Effect(
    c("Europe", "political.knowledge"),
    m.beps,
    xlevels = list(Europe = 1:11, political.knowledge = 0:3),
    fixed.predictors = list(given.values = c(gendermale = 0.5))
  ),
  lines = list(col = c("blue", "red", "orange")),
  axes = list(x = list(rug = FALSE), y = list(style = "stacked"))
)
```

To cross-validate this multinomial-logit model we need an appropriate cost criterion. None of the criteria supplied by the **cv** package---for example, neither `mse()`, which is appropriate for a numeric response, nor `BayesRule()`, which is appropriate for a binary response---will do. One possibility is to adapt Bayes rule to a polytomous response:

```{r BayesRuleMulti}
head(BEPS$vote)

yhat <- predict(m.beps, type = "class")
head(yhat)

BayesRuleMulti <- function(y, yhat) {
  result <- mean(y != yhat)
  attr(result, "casewise loss") <- "y != yhat"
  result
}

BayesRuleMulti(BEPS$vote, yhat)
```

The `predict()` method for `"multinom"` models called with the argument `type="class"` reports the Bayes-rule prediction for each case---that is, the response category with the highest predicted probability. Our `BayesRuleMulti()` function calculates the proportion of misclassified cases. Because this value is the mean of casewise components, we attach a `"casewise loss"` attribute to the result (as explained in the preceding section).

The marginal proportions for the response categories are

```{r BEPS-response-distribution}
xtabs(~ vote, data=BEPS)/nrow(BEPS)
```

and so the marginal Bayes-rule prediction, that everyone will vote Labour, produces an error rate of $1 - 0.47213 = 0.52787$. The multinomial-logit model appears to do substantially better than that, but does its performance hold up to cross-validation?

We check first whether the default `cv()` method works "out-of-the-box" for the `"multinom"` model:

```{r BEPS-test-default, error=TRUE}
cv(m.beps, seed=3465, criterion=BayesRuleMulti)
```

The default method of `GetResponse()` (a function supplied by the **cv** package---see `?GetResponse`) fails for a `"multinom"` object. A straightforward solution is to supply a `GetResponse.multinom()` method that returns the factor response [using the `get_response()` function from the **insight** package, @LudeckeWaggonerMakowski:2019],

```{r GetResponse.multinom}
GetResponse.multinom <- function(model, ...) {
  insight::get_response(model)
}

head(GetResponse(m.beps))
```

and to try again:

```{r BEPS-test-default-2, error=TRUE}
cv(m.beps, seed=3465, criterion=BayesRuleMulti)
```

A `traceback()` (not shown) reveals that the problem is that the default method of `cv()` calls the `"multinom"` method for `predict()` with the argument `type="response"`, whereas the correct argument is `type="class"`. We therefore must write a `"multinom"` method for `cv()`, but that proves to be very simple:

```{r cv.multinom}
cv.multinom <- function (model, data, criterion = BayesRuleMulti, k, reps, seed, ...) {
  model <- update(model, trace = FALSE)
  NextMethod(
    type = "class",
    criterion = criterion,
    criterion.name = deparse(substitute(criterion))
  )
}
```

That is, we simply call the default `cv()` method with the `type` argument properly set. In addition to supplying the correct `type` argument, our method sets the default `criterion` for the `cv.multinom()` method to `BayesRuleMulti`.
Adding the argument `criterion.name=deparse(substitute(criterion))` is inessential, but it ensures that the printed output will include the name of the criterion function that's employed, whether it's the default `BayesRuleMulti` or something else. Prior to invoking `NextMethod()`, we called `update()` with `trace=FALSE` to suppress the iteration history reported by default by `multinom()`---it would be tedious to see the iteration history for each fold. Then:

```{r BEPS-cv}
summary(cv(m.beps, seed=3465))
```

The cross-validated polytomous Bayes-rule criterion confirms that the fitted model does substantially better than the marginal Bayes-rule prediction that everyone votes for Labour.

#### Calling `cvCompute()`

`cv()` methods for independently sampled cases, such as `cv.default()`, `cv.lm()`, and `cv.glm()`, work by setting up calls to the `cvCompute()` function, which is exported from the **cv** package to support development of `cv()` methods for additional classes of regression models. In most cases, however, such as the preceding `cv.multinom()` example, it will suffice, and be much simpler, to set up a suitable call to `cv.default()` via `NextMethod()`. To illustrate how to use `cvCompute()` directly, we write an alternative, and necessarily more complicated, version of `cv.multinom()`.

```{r cv.multinom-alternative}
cv.multinom <- function(model,
                        data = insight::get_data(model),
                        criterion = BayesRuleMulti,
                        k = 10,
                        reps = 1,
                        seed = NULL,
                        details = k <= 10,
                        confint = n >= 400,
                        level = 0.95,
                        ncores = 1,
                        start = FALSE,
                        ...) {
  f <- function(i) {
    # helper function to compute fitted values,
    #  etc., for each fold i
    indices.i <- fold(folds, i)
    model.i <- if (start) {
      update(model, data = data[-indices.i,], start = b, trace = FALSE)
    } else {
      update(model, data = data[-indices.i,], trace = FALSE)
    }
    fit.all.i <- predict(model.i, newdata = data, type = "class")
    fit.i <- fit.all.i[indices.i]
    # returns:
    #  fit.i: fitted values for the i-th fold
    #  crit.all.i: CV criterion for all cases based on model with
    #   i-th fold omitted
    #  coef.i: coefficients for the model with i-th fold omitted
    list(
      fit.i = fit.i,
      crit.all.i = criterion(y, fit.all.i),
      coef.i = coef(model.i)
    )
  }

  fPara <- function(i, multinom, ...) {
    # helper function for parallel computation
    # argument multinom makes multinom() locally available
    # ... is necessary but not used
    indices.i <- fold(folds, i)
    model.i <- if (start) {
      update(model, data = data[-indices.i,], start = b, trace = FALSE)
    } else {
      update(model, data = data[-indices.i,], trace = FALSE)
    }
    fit.all.i <- predict(model.i, newdata = data, type = "class")
    fit.i <- fit.all.i[indices.i]
    list(
      fit.i = fit.i,
      crit.all.i = criterion(y, fit.all.i),
      coef.i = coef(model.i)
    )
  }

  n <- nrow(data)

  # see ?cvCompute for definitions of arguments
  cvCompute(
    model = model,
    data = data,
    criterion = criterion,
    criterion.name = deparse(substitute(criterion)),
    k = k,
    reps = reps,
    seed = seed,
    details = details,
    confint = confint,
    level = level,
    ncores = ncores,
    type = "class",
    start = start,
    f = f,
    fPara = fPara,
    multinom = nnet::multinom
  )
}
```

Notice that separate "helper" functions are defined for non-parallel and parallel computations.[^multinom-parallel] The new version of `cv.multinom()` produces the same results as the version that calls `cv.default()`:[^multinom-scoping]

```{r BEPS-cv-alt-version}
summary(cv(m.beps, seed=3465))
```

[^multinom-parallel]: Try the following, for example, with both versions of `cv.multinom()` (possibly replacing `ncores=2` with a larger number):

    ```
    system.time(print(cv1 <- cv(m.beps, k="loo")))
    system.time(print(cv2 <- cv(m.beps, k="loo", ncores=2)))
    all.equal(cv1, cv2)
    ```

[^multinom-scoping]: A subtle point is that we added a `multinom` argument to the local function `fPara()`, which is passed to the `fPara` argument of `cvCompute()`. There is also a `multinom` argument to `cvCompute()`, which is set to the `multinom` function in the **nnet** package. The `multinom` argument isn't directly defined in `cvCompute()` (examine the definition of this function), but is passed through the `...` argument. `cvCompute()`, in turn, will pass `multinom` to `fPara()` via `...`, allowing `fPara()` to find this function when it calls `update()` to refit the model with each fold `i` omitted. This scoping issue arises, even though the **nnet** package is attached to the search path in the current R session via `library("nnet")`, because `cvCompute()` uses `foreach()` for parallel computations. `cv.default()` is able to handle the scoping issue transparently by automatically locating `multinom()`.

### Mixed-effects models

Adding a `cv()` method for a mixed-model class is somewhat more complicated. We provide the `cvMixed()` function to facilitate this process, and to see how that works, consider the `"lme"` method from the **cv** package:

```{r cv.lme}
cv:::cv.lme
```

Notice that `cv.lme()` sets up a call to `cvMixed()`, which does the computational work. Most of the arguments of `cvMixed()` are familiar:

* `model` is the mixed-model object, here of class `"lme"`.
* `package` is the name of the package in which the mixed-modeling function used to fit the model, here `lme()`, resides---i.e., `"nlme"`; `cvMixed()` uses this argument to retrieve the package namespace.
* `data` is the data set to which the model is fit, by default extracted by the `get_data()` function in the **insight** package.
* `criterion` is the CV criterion, defaulting to the `mse()` function.
* `k` is the number of CV folds, defaulting to `"loo"` for CV by clusters and `10` for CV by cases.
* `reps` is the number of times the CV process is repeated, defaulting to `1`.
* `seed` is the seed for R's random-number generator, defaulting to a randomly selected (and saved) value.
* `ncores` is the number of cores to use for parallel computation; if `1`, the default, then the computation isn't parallelized.
* `clusterVariables` is a character vector of the names of variables defining clusters; if missing, then CV is based on cases rather than clusters.

The remaining arguments are unfamiliar:

* `predict.clusters.args` is a named list of arguments to be passed to the `predict()` function to obtain predictions for the full data set from a model fit to a subset of the data for cluster-based CV. The first two arguments should be `object` and `newdata`. It is typically necessary to tell `cvMixed()` how to base predictions only on fixed effects; in the case of `"lme"` models, this is done by setting `level = 0`.
* Similarly, `predict.cases.args` is a named list of arguments to be passed to `predict()` for case-based CV. Setting `level = 1` includes random effects in the predictions.
* `fixed.effects` is a function, such as `nlme::fixef()`, for extracting fixed-effects coefficients, used to compute detailed fold-based statistics.

Finally, any additional arguments, absorbed by `...`, are passed to `update()` when the model is refit with each fold omitted. `cvMixed()` returns an object of class `"cv"`.

Now imagine that we want to support a new class of mixed-effects models. To be concrete, we illustrate with the `glmmPQL()` function in the **MASS** package [@VenablesRipley:2002], which fits generalized linear mixed-effects models by penalized quasi-likelihood.[^glmmPQL] Not coincidentally, the arguments of `glmmPQL()` are similar to those of `lme()` (with an additional `family` argument), because the former iteratively invokes the latter; so `cv.glmmPQL()` should resemble `cv.lme()`.

[^glmmPQL]: This example is somewhat artificial in that `glmmPQL()` has largely been superseded by computationally superior functions, such as the `glmer()` function in the **lme4** package. There is, however, one situation in which `glmmPQL()` might prove useful: to specify serial dependency in case-level errors within clusters for longitudinal data, which is not currently supported by `glmer()`.

As it turns out, neither the default method for `GetResponse()` nor `insight::get_data()` works for `"glmmPQL"` objects. These objects include a `"data"` element, however, and so we can simply extract this element as the default for the `data` argument of our `cv.glmmPQL()` method. Getting the response variable is more complicated: We refit the fixed part of the model as a GLM with only the regression constant on the right-hand side, and extract the response from that; because all we need is the response variable, we limit the number of GLM iterations to 1 and suppress warning messages about non-convergence:

```{r GetResponse.glmmPQL}
GetResponse.glmmPQL <- function(model, ...) {
  f <- formula(model)
  f[[3]] <- 1 # regression constant only on RHS
  model <- suppressWarnings(glm(
    f,
    data = model$data,
    family = model$family,
    control = list(maxit = 1)
  ))
  cv::GetResponse(model)
}
```

Writing the `cv()` method is then straightforward:

```{r cv.glmmPQL}
cv.glmmPQL <- function(model,
                       data = model$data,
                       criterion = mse,
                       k,
                       reps = 1,
                       seed,
                       ncores = 1,
                       clusterVariables,
                       ...) {
  cvMixed(
    model,
    package = "MASS",
    data = data,
    criterion = criterion,
    k = k,
    reps = reps,
    seed = seed,
    ncores = ncores,
    clusterVariables = clusterVariables,
    predict.clusters.args = list(
      object = model,
      newdata = data,
      level = 0,
      type = "response"
    ),
    predict.cases.args = list(
      object = model,
      newdata = data,
      level = 1,
      type = "response"
    ),
    fixed.effects = nlme::fixef,
    verbose = FALSE,
    ...
  )
}
```

We set the argument `verbose=FALSE` to suppress `glmmPQL()`'s iteration counter when `cvMixed()` calls `update()`.
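
One possibly cryptic detail in `GetResponse.glmmPQL()` deserves comment: the third element of a two-sided formula is its right-hand side, and so assigning `1` to it produces an intercept-only formula. For example:

```{r formula-rhs-example}
f <- y ~ trt + I(week > 2)
f[[3]] <- 1 # replace the right-hand side with the regression constant
f
```
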
Let's apply our newly minted method to a logistic regression with a random intercept, in an example that appears in `?glmmPQL`:

```{r glmmPQL-example}
library("MASS")

m.pql <- glmmPQL(
  y ~ trt + I(week > 2),
  random = ~ 1 | ID,
  family = binomial,
  data = bacteria
)
summary(m.pql)
```

We compare this result to that obtained from `glmer()` in the **lme4** package:

```{r compare-to-lme4}
library("lme4")

m.glmer <- glmer(y ~ trt + I(week > 2) + (1 | ID),
                 family = binomial,
                 data = bacteria)
summary(m.glmer)

# comparison of fixed effects:
car::compareCoefs(m.pql, m.glmer)
```

The two sets of estimates are similar, but not identical.

Finally, we try out our `cv.glmmPQL()` method, cross-validating both by clusters and by cases,

```{r cv-example-glmmPQL, cache=TRUE}
summary(cv(m.pql, clusterVariables="ID", criterion=BayesRule))

summary(cv(m.pql, data=bacteria, criterion=BayesRule, seed=1490))
```

and again compare to `glmer()`:

```{r cv-example-glmer, cache=TRUE}
summary(cv(m.glmer, clusterVariables="ID", criterion=BayesRule))

summary(cv(m.glmer, data=bacteria, criterion=BayesRule, seed=1490))
```

## Adding a model-selection procedure

The `selectStepAIC()` function supplied by the **cv** package, which is based on the `stepAIC()` function from the **MASS** package [@VenablesRipley:2002] for stepwise model selection, is suitable for the `procedure` argument of `cvSelect()`. The use of `selectStepAIC()` is illustrated in the vignette on cross-validating model selection. We'll employ `selectStepAIC()` as a "template" for writing a CV model-selection procedure. To see the code for this function, type `cv::selectStepAIC` at the R command prompt, or examine the sources for the **cv** package at https://github.com/gmonette/cv (the code for `selectStepAIC()` is in https://github.com/gmonette/cv/blob/main/R/cv-select.R).

Another approach to model selection is all-subsets regression. The `regsubsets()` function in the **leaps** package [@LumleyMiller:2020] implements an efficient algorithm for selecting the best-fitting linear least-squares regressions for subsets of predictors of all sizes, from 1 through the maximum number of candidate predictors.[^1] To illustrate the use of `regsubsets()`, we employ the `swiss` data frame from R's **datasets** package:

[^1]: The `regsubsets()` function computes several measures of model predictive performance, including the $R^2$ and $R^2$ adjusted for degrees of freedom, the residual sums of squares, Mallows's $C_p$, and the BIC. Several of these are suitable for comparing models with differing numbers of coefficients---we use the BIC below---but all necessarily agree when comparing models with the *same* number of coefficients.

```{r swiss}
library("leaps")

head(swiss)
nrow(swiss)
```

The data set includes the following variables, for each of 47 French-speaking Swiss provinces circa 1888:

* `Fertility`: A standardized fertility measure.
* `Agriculture`: The percentage of the male population engaged in agriculture.
* `Examination`: The percentage of draftees into the Swiss army receiving the highest grade on an examination.
* `Education`: The percentage of draftees with more than a primary-school education.
* `Catholic`: The percentage of the population who were Catholic.
* `Infant.Mortality`: The infant-mortality rate, expressed as the percentage of live births surviving less than a year.
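
Before fitting a regression model, it's helpful to glance at the marginal distributions of the variables and their pairwise relationships, for example with the `scatterplotMatrix()` function in the **car** package:

```{r swiss-scatterplot-matrix, fig.width=8, fig.height=8}
# pairwise scatterplots for the variables in the swiss data
car::scatterplotMatrix(swiss)
```
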
Following @LumleyMiller:2020, we treat `Fertility` as the response and the other variables as predictors in a linear least-squares regression: ```{r swiss-lm} m.swiss <- lm(Fertility ~ ., data=swiss) summary(m.swiss) summary(cv(m.swiss, seed=8433)) ``` Thus, the MSE for the model fit to the complete data is considerably smaller than the CV estimate of the MSE. Can we do better by selecting a subset of the predictors, taking account of the additional uncertainty induced by model selection? First, let's apply best-subset selection to the complete data set: ```{r subset-selection} #| fig.cap = "Selecting the best model of each size." swiss.sub <- regsubsets(Fertility ~ ., data=swiss) summary(swiss.sub) (bics <- summary(swiss.sub)$bic) which.min(bics) car::subsets(swiss.sub, legend="topright") ``` The graph, produced by the `subsets()` function in the **car** package, shows that the model with the smallest BIC is the "best" model with 4 predictors, including `Agriculture`, `Education`, `Catholic`, and `Infant.Mortality`, but not `Examination`: ```{r best-model} m.best <- update(m.swiss, . ~ . - Examination) summary(m.best) summary(cv(m.best, seed=8433)) # use same folds as before ``` The MSE for the selected model is (of course) slightly higher than for the full model fit previously, but the cross-validated MSE is a bit lower; as we explain in the vignette on cross-validating model selection, however, it isn't kosher to select and cross-validate a model on the same data. Here's a function named `selectSubsets()`, meant to be used with `cvSelect()`, suitable for cross-validating the model-selection process: ```{r selectSubsets} selectSubsets <- function(data = insight::get_data(model), model, indices, criterion = mse, details = TRUE, seed, save.model = FALSE, ...) { if (inherits(model, "lm", which = TRUE) != 1) stop("selectSubsets is appropriate only for 'lm' models") y <- GetResponse(model) formula <- formula(model) X <- model.matrix(model) if (missing(indices)) { if (missing(seed) || is.null(seed)) seed <- sample(1e6, 1L) # select the best model from the full data by BIC sel <- leaps::regsubsets(formula, data = data, ...) bics <- summary(sel)$bic best <- coef(sel, 1:length(bics))[[which.min(bics)]] x.names <- names(best) # fit the best model; intercept is already in X, hence - 1: m.best <- lm(y ~ X[, x.names] - 1) fit.all <- predict(m.best, newdata = data) return(list( criterion = criterion(y, fit.all), model = if (save.model) m.best # return best model else NULL )) } # select the best model omitting the i-th fold (given by indices) sel.i <- leaps::regsubsets(formula, data[-indices,], ...) 
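  # summary()$bic again gives the BIC for the best model of each size;
  # as in the full-data case, the model minimizing the BIC is selected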
  bics.i <- summary(sel.i)$bic
  best.i <- coef(sel.i, 1:length(bics.i))[[which.min(bics.i)]]
  x.names.i <- names(best.i)
  m.best.i <- lm(y[-indices] ~ X[-indices, x.names.i] - 1)
  # predict() doesn't work here:
  fit.all.i <- as.vector(X[, x.names.i] %*% coef(m.best.i))
  fit.i <- fit.all.i[indices]
  # return the fitted values for i-th fold, CV criterion for all cases,
  #  and the regression coefficients
  list(
    fit.i = fit.i, # fitted values for i-th fold
    crit.all.i = criterion(y, fit.all.i), # CV crit for all cases
    coefficients = if (details) {
      # regression coefficients
      coefs <- coef(m.best.i)
      # fix coefficient names
      names(coefs) <- sub("X\\[-indices, x.names.i\\]", "", names(coefs))
      coefs
    } else {
      NULL
    }
  )
}
```

A slightly tricky point is that, because of scoping issues, `predict()` doesn't work with the model fit omitting the $i$th fold, and so the fitted values for all cases are computed directly as $\widehat{\mathbf{y}}_{-i} = \mathbf{X} \mathbf{b}_{-i}$, where $\mathbf{X}$ is the model matrix for all of the cases and $\mathbf{b}_{-i}$ is the vector of least-squares coefficients for the selected model with the $i$th fold omitted. Additionally, the command `lm(y[-indices] ~ X[-indices, x.names.i] - 1)`, which fits the selected model with the $i$th fold deleted, produces awkward coefficient names like `"X[-indices, x.names.i]Infant.Mortality"`. Purely for aesthetic reasons, the command `sub("X\\[-indices, x.names.i\\]", "", names(coefs))` fixes these awkward names, removing the extraneous text `"X[-indices, x.names.i]"`.

Applying `selectSubsets()` to the full data produces the MSE for the model selected from, and fit to, the complete data set (a value that we obtained previously):

```{r test-selectSubsets}
selectSubsets(model=m.swiss)
```

Similarly, applying the function to an imaginary "fold" of 5 cases returns the fitted values for the cases in the fold, computed from the model selected and fit to the cases omitting the fold; the MSE for all of the cases, based on the same model; and the coefficients of the selected model, which includes 4 of the 5 predictors (and the intercept):

```{r test-selectSubsets-fold}
selectSubsets(model=m.swiss, indices=seq(5, 45, by=10))
```

Then, using `selectSubsets()` in cross-validation, invoking the `cv.function()` method for `cv()`, we get:

```{r cvSelect-selectSubsets}
cv.swiss <- cv(
  selectSubsets,
  working.model = m.swiss,
  data = swiss,
  seed = 8433 # use same folds
)
summary(cv.swiss)
```

Cross-validation shows that model selection exacts a penalty in MSE. Examining the models selected for the 10 folds reveals that there is some uncertainty in identifying the predictors in the "best" model, with `Agriculture` sometimes appearing and sometimes not:

```{r best-models-by-folds}
compareFolds(cv.swiss)
```

As well, the fold-wise MSE varies considerably, reflecting the small size of the `swiss` data set (47 cases).

## References