--- title: "Optimization" output: rmarkdown::html_vignette: fig_width: 6 fig_height: 4 bibliography: ../inst/REFERENCES.bib link-citations: yes vignette: > %\VignetteIndexEntry{Optimization} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```r library(galamm) ``` The purpose of this vignette is to describe the optimization procedure used by `galamm`, and what kind of tools one can use in the case of convergence issues. ## High-Level Overview The optimization procedure used by `galamm` is described in Section 3 of @sorensenLongitudinalModelingAgeDependent2023. It consists of two steps: - In the inner loop, the marginal likelihood is evaluated at a given set of parameters. The marginal likelihood is what you obtain by integrating out the random effects, and this integration is done with the Laplace approximation. The Laplace approximation yields a large system of equations that needs to be solved iteratively, except in the case with conditionally Gaussian responses and unit link function, for which a single step is sufficient to solve the system. When written in matrix-vector form, this system of equations will in most cases have an overwhelming majority of zeros, and to avoid wasting memory and time on storing and multiplying zero, we use sparse matrix methods. - In the outer loop, we try to find the parameters that maximize the marginal likelihood. For each new set of parameters, the whole procedure in the inner loop has to be repeated. By default, we use the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints [@byrdLimitedMemoryAlgorithm1995], abbreviated L-BFGS-B. In particular, we use the implementation in R's `optim()` function, which is obtained by setting `method = "L-BFGS-B"`. L-BFGS-B requires first derivatives, and these are obtained by automatic differentiation [@skaugAutomaticDifferentiationFacilitate2002]. In most use cases of `galamm`, we also use constraints on some of the parameters, e.g., to ensure that variances are non-negative. As an alternative, the Nelder-Mead algorithm with box constraints [@batesFittingLinearMixedEffects2015;@nelderSimplexMethodFunction1965] from `lme4` is also available. Since the Nelder-Mead algorithm is derivative free, automatic differentiation is not used in this case, except for computing the Hessian matrix at the final step. At convergence, the Hessian matrix of second derivatives is computed exactly, again using automatic differentiation. The inverse of this matrix is the covariance matrix of the parameter estimates, and is used to compute Wald type confidence intervals. ## Modifying the L-BFGS-B algorithm We will illustrate some ways of modifying the optimization procedure with the covariate measurement model example shown in the [vignette on models with mixed response types](https://lcbc-uio.github.io/galamm/articles/mixed_response.html). Here we start by simply setting up what we need to fit the model. ```r loading_matrix <- matrix(c(1, 1, NA), ncol = 1) families <- c(gaussian, binomial) family_mapping <- ifelse(diet$item == "chd", 2, 1) formula <- y ~ 0 + chd + (age * bus):chd + fiber + (age * bus):fiber + fiber2 + (0 + loading | id) ``` Fitting the model with default arguments yields a warning when we look at the summary object. ```r mod <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix ) summary(mod) #> Warning in vcov.galamm(object, parm = "lambda"): Rank deficient Hessian matrix.Could not compute covariance matrix. #> Warning in vcov.galamm(object, "beta"): Rank deficient Hessian matrix.Could not compute covariance matrix. #> GALAMM fit by maximum marginal likelihood. #> Formula: formula #> Data: diet #> #> AIC BIC logLik deviance df.resid #> 2837.6 2892.9 -1406.8 12529.3 730 #> #> Lambda: #> loading SE #> lambda1 1.000 . #> lambda2 1.000 . #> lambda3 -2.026 . #> #> Random effects: #> Groups Name Variance Std.Dev. #> id loading 0 0 #> Number of obs: 742, groups: id, 333 #> #> Fixed effects: #> Estimate Std. Error z value Pr(>|z|) #> chd -1.78692 NA NA NA #> fiber 17.96184 NA NA NA #> fiber2 -0.64927 NA NA NA #> chd:age 0.06682 NA NA NA #> chd:bus -0.06882 NA NA NA #> fiber:age -0.20480 NA NA NA #> fiber:bus -1.69601 NA NA NA #> chd:age:bus -0.04934 NA NA NA #> fiber:age:bus 0.16097 NA NA NA ``` In this case, we can increase the amount of information provided by `optim`, with the `trace` argument. To avoid getting too much output, we also reduce the number of iterations. We set the `control` argument as follows: ```r control <- galamm_control(optim_control = list(maxit = 5, trace = 3, REPORT = 1)) ``` Here, `maxit = 5` means that we take at most 5 iterations, `trace = 3` means that we want more information from L-BFGS-B, and `REPORT= = 1` means that we want L-BFGS-B to report information at each step it takes. We provide this object to the `control` argument in `galamm`, and rerun the model: ```r mod <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix, control = control ) #> N = 11, M = 20 machine precision = 2.22045e-16 #> At X0, 0 variables are exactly at the bounds #> At iterate 0 f= 2148 |proj g|= 122.68 #> At iterate 1 f = 2132.1 |proj g|= 275.51 #> At iterate 2 f = 2100.1 |proj g|= 193.7 #> At iterate 3 f = 1975.6 |proj g|= 177.11 #> At iterate 4 f = 1923.2 |proj g|= 165.7 #> At iterate 5 f = 1898.8 |proj g|= 83.839 #> At iterate 6 f = 1887.9 |proj g|= 49.147 #> final value 1887.871646 #> stopped after 6 iterations vcov(mod) #> Warning in vcov.galamm(mod): Rank deficient Hessian matrix.Could not compute covariance matrix. #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] #> [1,] NA NA NA NA NA NA NA NA NA #> [2,] NA NA NA NA NA NA NA NA NA #> [3,] NA NA NA NA NA NA NA NA NA #> [4,] NA NA NA NA NA NA NA NA NA #> [5,] NA NA NA NA NA NA NA NA NA #> [6,] NA NA NA NA NA NA NA NA NA #> [7,] NA NA NA NA NA NA NA NA NA #> [8,] NA NA NA NA NA NA NA NA NA #> [9,] NA NA NA NA NA NA NA NA NA ``` Since what we did was simply to turn in more reporting, it is no surprise that the Hessian is still rank deficient, but from the output, it is also clear that there are no obvious errors, like values that diverge to infinity. The latter may also happen from time to time. By default, L-BFGS-B uses the last 5 evaluations of the gradient to approximate the Hessian that is used during optimization (not to be confused with the exact Hessian compute with automatic differentiation after convergence). We try to increase this to 25, and see if that makes a difference. This is done with the `lmm` argument. We also reduce the amount of reporting to be every 10th step, and avoid setting the maximum number of iterations, which means that `optim()`'s default option is used. ```r control <- galamm_control(optim_control = list(trace = 3, REPORT = 10, lmm = 25)) ``` It is clear that neither this solved the issue. ```r mod <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix, control = control ) #> N = 11, M = 25 machine precision = 2.22045e-16 #> At X0, 0 variables are exactly at the bounds #> At iterate 0 f= 2148 |proj g|= 122.68 #> At iterate 10 f = 1770.3 |proj g|= 30.656 #> At iterate 20 f = 1467.2 |proj g|= 11.286 #> At iterate 30 f = 1413 |proj g|= 4.3102 #> #> iterations 38 #> function evaluations 44 #> segments explored during Cauchy searches 39 #> BFGS updates skipped 0 #> active bounds at final generalized Cauchy point 1 #> norm of the final projected gradient 0.001689 #> final function value 1406.8 #> #> F = 1406.8 #> final value 1406.801104 #> converged vcov(mod) #> Warning in vcov.galamm(mod): Rank deficient Hessian matrix.Could not compute covariance matrix. #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] #> [1,] NA NA NA NA NA NA NA NA NA #> [2,] NA NA NA NA NA NA NA NA NA #> [3,] NA NA NA NA NA NA NA NA NA #> [4,] NA NA NA NA NA NA NA NA NA #> [5,] NA NA NA NA NA NA NA NA NA #> [6,] NA NA NA NA NA NA NA NA NA #> [7,] NA NA NA NA NA NA NA NA NA #> [8,] NA NA NA NA NA NA NA NA NA #> [9,] NA NA NA NA NA NA NA NA NA ``` Looking at the model output again, we see that the random effect variance is estimated to be exactly zero. ```r summary(mod) #> Warning in vcov.galamm(object, parm = "lambda"): Rank deficient Hessian matrix.Could not compute covariance matrix. #> Warning in vcov.galamm(object, "beta"): Rank deficient Hessian matrix.Could not compute covariance matrix. #> GALAMM fit by maximum marginal likelihood. #> Formula: formula #> Data: diet #> Control: control #> #> AIC BIC logLik deviance df.resid #> 2837.6 2892.9 -1406.8 12529.3 730 #> #> Lambda: #> loading SE #> lambda1 1.000 . #> lambda2 1.000 . #> lambda3 -1.922 . #> #> Random effects: #> Groups Name Variance Std.Dev. #> id loading 0 0 #> Number of obs: 742, groups: id, 333 #> #> Fixed effects: #> Estimate Std. Error z value Pr(>|z|) #> chd -1.78673 NA NA NA #> fiber 17.96179 NA NA NA #> fiber2 -0.64904 NA NA NA #> chd:age 0.06685 NA NA NA #> chd:bus -0.06894 NA NA NA #> fiber:age -0.20479 NA NA NA #> fiber:bus -1.69628 NA NA NA #> chd:age:bus -0.04937 NA NA NA #> fiber:age:bus 0.16092 NA NA NA ``` These types of obviously wrong zero variance estimates are well-known for users of mixed models [@hodgesRichlyParameterizedLinear2013]. We see if increasing the initial value for the variance parameter solves the issue. This is done with the `start` argument to `galamm`. The start argument requires a named list, with optional arguments `theta`, `beta`, `lambda`, and `weights`, giving initial values for each of these groups of parameters. In this case `theta` is the standard deviation of the random effect, and we increase it to 10 to see what happens. By default, the initial value equals 1. ```r mod <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix, start = list(theta = 10), control = control ) #> N = 11, M = 25 machine precision = 2.22045e-16 #> At X0, 0 variables are exactly at the bounds #> At iterate 0 f= 2827.6 |proj g|= 123.31 #> At iterate 10 f = 1764.5 |proj g|= 103.86 #> At iterate 20 f = 1623.6 |proj g|= 131.81 #> At iterate 30 f = 1447.9 |proj g|= 75.096 #> At iterate 40 f = 1400.5 |proj g|= 35.591 #> At iterate 50 f = 1373.1 |proj g|= 3.3359 #> At iterate 60 f = 1372.2 |proj g|= 0.0016541 #> #> iterations 60 #> function evaluations 72 #> segments explored during Cauchy searches 61 #> BFGS updates skipped 0 #> active bounds at final generalized Cauchy point 0 #> norm of the final projected gradient 0.00165415 #> final function value 1372.16 #> #> F = 1372.16 #> final value 1372.160386 #> converged ``` Now we see that the model converged and that the Hessian is no longer rank deficient. ```r summary(mod) #> GALAMM fit by maximum marginal likelihood. #> Formula: formula #> Data: diet #> Control: control #> #> AIC BIC logLik deviance df.resid #> 2768.3 2823.6 -1372.2 2002.9 730 #> #> Lambda: #> loading SE #> lambda1 1.0000 . #> lambda2 1.0000 . #> lambda3 -0.1339 0.05121 #> #> Random effects: #> Groups Name Variance Std.Dev. #> id loading 23.64 4.862 #> Number of obs: 742, groups: id, 333 #> #> Fixed effects: #> Estimate Std. Error z value Pr(>|z|) #> chd -1.91525 0.27229 -7.03373 2.011e-12 #> fiber 17.94849 0.48686 36.86601 1.620e-297 #> fiber2 0.22402 0.41783 0.53614 5.919e-01 #> chd:age 0.06615 0.05931 1.11531 2.647e-01 #> chd:bus -0.02895 0.34355 -0.08427 9.328e-01 #> fiber:age -0.21204 0.10090 -2.10135 3.561e-02 #> fiber:bus -1.68303 0.63721 -2.64123 8.261e-03 #> chd:age:bus -0.04999 0.06507 -0.76822 4.424e-01 #> fiber:age:bus 0.16818 0.11223 1.49854 1.340e-01 ``` ## Optimization with the Nelder-Mead algorithm The Nelder-Mead algorithm is turned on by setting `method = "Nelder-Mead"` when calling `galamm_control()`. We also turn on reporting every 20th function evaluation by setting `verbose = 1`: ```r control <- galamm_control( optim_control = list(verbose = 1), method = "Nelder-Mead" ) ``` We provide the estimates obtained with the L-BFGS-B algorithm as initial values. For this we can use the convenience function `extract_optim_parameters`: ```r start <- extract_optim_parameters(mod) ``` We now fit the model, providing the initial values to the `start` argument. ```r mod_nm <- galamm( formula = formula, data = diet, family = families, family_mapping = family_mapping, factor = "loading", load.var = "item", lambda = loading_matrix, control = control, start = start ) #> (NM) 20: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 40: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 60: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 80: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 100: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 120: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 140: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 160: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 180: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 200: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 220: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 240: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 260: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 280: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 300: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 320: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 340: f = 1372.16 at 1.84246 -1.91525 17.9485 0.224017 0.066146 -0.0289499 -0.212035 -1.68303 -0.0499864 0.168178 -0.133909 #> (NM) 360: f = 1372.16 at 1.84247 -1.91525 17.9485 0.223968 0.0661412 -0.028921 -0.21203 -1.68308 -0.0499804 0.168172 -0.133908 #> (NM) 380: f = 1372.16 at 1.84247 -1.91525 17.9485 0.22398 0.0661421 -0.0289289 -0.212034 -1.68304 -0.0499816 0.168174 -0.133909 #> (NM) 400: f = 1372.16 at 1.84247 -1.91525 17.9485 0.223986 0.0661415 -0.0289274 -0.212031 -1.68305 -0.0499811 0.168171 -0.133909 #> (NM) 420: f = 1372.16 at 1.84247 -1.91525 17.9485 0.223985 0.0661434 -0.0289358 -0.212032 -1.68304 -0.0499824 0.168172 -0.133908 ``` The summary output shows that Nelder-Mead found exactly the same optimum in this particular case, which is not surprising given the intial values that we provided. ```r summary(mod_nm) #> GALAMM fit by maximum marginal likelihood. #> Formula: formula #> Data: diet #> Control: control #> #> AIC BIC logLik deviance df.resid #> 2768.3 2823.6 -1372.2 2002.9 730 #> #> Lambda: #> loading SE #> lambda1 1.0000 . #> lambda2 1.0000 . #> lambda3 -0.1339 0.05121 #> #> Random effects: #> Groups Name Variance Std.Dev. #> id loading 23.64 4.862 #> Number of obs: 742, groups: id, 333 #> #> Fixed effects: #> Estimate Std. Error z value Pr(>|z|) #> chd -1.91525 0.27229 -7.0337 2.011e-12 #> fiber 17.94851 0.48686 36.8660 1.618e-297 #> fiber2 0.22398 0.41783 0.5360 5.919e-01 #> chd:age 0.06614 0.05931 1.1153 2.647e-01 #> chd:bus -0.02893 0.34355 -0.0842 9.329e-01 #> fiber:age -0.21203 0.10090 -2.1013 3.561e-02 #> fiber:bus -1.68305 0.63721 -2.6413 8.260e-03 #> chd:age:bus -0.04998 0.06507 -0.7682 4.424e-01 #> fiber:age:bus 0.16817 0.11223 1.4985 1.340e-01 ``` ## Implementation Details At a given set of parameters, the marginal likelihood is evaluated completely in C++. For solving the penalized iteratively reweighted least squares problem arising due to the Laplace approximation, we use sparse matrix methods from the Eigen C++ template library through the `RcppEigen` package [@batesFastElegantNumerical2013]. In order to keep track of the derivatives throughout this iterative process, we use the [autodiff library](https://autodiff.github.io/) [@lealAutodiffModernFast2018]. However, since `autodiff` natively only supports dense matrix operations with `Eigen`, we have extended this library so that it also supports sparse matrix operations. This modified version of the `autodiff` library can be found at `inst/include/autodiff/`. In order to maximize the marginal likelihood, we currently rely on the `optim()` function in R. To make use of the fact that both the marginal likelihood value itself and first derivatives are returned from the C++ function, we use memoisation, provided by the `memoise` package [@wickhamMemoiseMemoisationFunctions2021]. However, the optimization process still involves copying all model data between R and C++ for each new set of parameters. This is potentially an efficiency bottleneck with large datasets, although with the limited profiling that has been done so far, it seems like the vast majority of the computation time is spent actually solving the penalized iteratively reweighted least squares problem in C++. ## Future Improvements We aim to perform also the outer optimization loop in C++, to avoid copying data back and forth between R and C++ during optimization. This requires finding an off-the-shelf optimization routine which is as good as the L-BFGS-B implementation provided by `optim()`, and which plays well with `autodiff`. In addition, the current implementation uses only forward mode automatic differentiation. In the future, we aim to add backward mode as an option, as this might turn out to be more efficient for problems with a large number of variables. # References