--- title: "An introduction to bootstrap p-values for regression models using the boot.pval package" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{boot_summary} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(boot.pval) ``` An introduction to bootstrap p-values for regression models using the `boot.pval` package ======================================= ## Background p-values can be computed by inverting the corresponding confidence intervals, as described in Section 14.2 of [Thulin (2024)](https://www.modernstatisticswithr.com/mathschap.html#confintequal) and Section 3.12 in [Hall (1992)](https://link.springer.com/book/10.1007/978-1-4612-4384-7). This package contains functions for computing bootstrap p-values in this way. The approach relies on the fact that: - The p-value of the two-sided test for the parameter theta is the smallest alpha such that theta is not contained in the corresponding 1-alpha confidence interval, - For a test of the parameter theta with significance level alpha, the set of values of theta that aren't rejected by the two-sided test (when used as the null hypothesis) is a 1-alpha confidence interval for theta. ## Summaries for regression models Summary tables with confidence intervals and p-values for the coefficients of regression models can be obtained using the `boot_summary` (most models) and `censboot_summary` (models with censored response variables) functions. Currently, the following models are supported: - Linear models fitted using `lm`, - Generalised linear models fitted using `glm` or `glm.nb`, - Nonlinear models fitted using `nls`, - Robust linear models fitted using `MASS::rlm`, - Ordered logistic or probit regression models fitted (without weights) using `MASS:polr`, - Linear mixed models fitted using `lme4::lmer` or `lmerTest::lmer`, - Generalised linear mixed models fitted using `lme4::glmer`. - Cox PH regression models fitted using `survival::coxph` (using `censboot_summary`). - Accelerated failure time models fitted using `survival::survreg` or `rms::psm` (using `censboot_summary`). - Any regression model such that: `residuals(object, type="pearson")` returns Pearson residuals; `fitted(object)` returns fitted values; `hatvalues(object)` returns the leverages, or perhaps the value 1 which will effectively ignore setting the hatvalues. In addition, the `data` argument should contain no missing values among the columns actually used in fitting the model. A number of examples are available in Chapters 8 and 9 of [Modern Statistics with R](https://www.modernstatisticswithr.com/). Here are some simple examples with a linear regression model for the `mtcars` data: ```{r message=FALSE} # Bootstrap summary of a linear model for mtcars: model <- lm(mpg ~ hp + vs, data = mtcars) boot_summary(model) # Use 9999 bootstrap replicates and adjust p-values for # multiplicity using Holm's method: boot_summary(model, R = 9999, adjust.method = "holm") # Export results to a gt table: boot_summary(model, R = 9999) |> summary_to_gt() ``` ```{r eval = FALSE} # Export results to a Word document: library(flextable) boot_summary(model, R = 9999) |> summary_to_flextable() |> save_as_docx(path = "my_table.docx") ``` And a toy example for a generalised linear mixed model (using a small number of bootstrap repetitions): ```{r eval = FALSE} library(lme4) model <- glmer(TICKS ~ YEAR + (1|LOCATION), data = grouseticks, family = poisson) boot_summary(model, R = 99) ``` ## Speeding up computations For complex models, speed can be greatly improved by using parallelisation. This is set using the `parallel` (available options are `"multicore"` and `"snow"`). The number of CPUs to use is set using `ncpus`. ```{r eval = FALSE} model <- glmer(TICKS ~ YEAR + (1|LOCATION), data = grouseticks, family = poisson) boot_summary(model, R = 999, parallel = "multicore", ncpus = 10) ``` ## Survival models Survival regression models should be fitted using the argument `model = TRUE`. A summary table can then be obtained using `censboot_summary`. By default, the table contains exponentiated coefficients (i.e. hazard ratios, in the case of a Cox PH model). ```{r message = FALSE} library(survival) # Weibull AFT model: model <- survreg(formula = Surv(time, status) ~ age + sex, data = lung, dist = "weibull", model = TRUE) # Table with exponentiated coefficients: censboot_summary(model) # Cox PH model: model <- coxph(formula = Surv(time, status) ~ age + sex, data = lung, model = TRUE) # Table with hazard ratios: censboot_summary(model) # Table with original coefficients: censboot_summary(model, coef = "raw") ``` ## Other hypothesis tests Bootstrap p-values for hypothesis tests based on `boot` objects can be obtained using the `boot.pval` function. The following examples are extensions of those given in the documentation for `boot::boot`: ```{r message = FALSE} # Hypothesis test for the city data # H0: ratio = 1.4 library(boot) ratio <- function(d, w) sum(d$x * w)/sum(d$u * w) city.boot <- boot(city, ratio, R = 999, stype = "w", sim = "ordinary") boot.pval(city.boot, theta_null = 1.4) # Studentized test for the two sample difference of means problem # using the final two series of the gravity data. diff.means <- function(d, f) { n <- nrow(d) gp1 <- 1:table(as.numeric(d$series))[1] m1 <- sum(d[gp1,1] * f[gp1])/sum(f[gp1]) m2 <- sum(d[-gp1,1] * f[-gp1])/sum(f[-gp1]) ss1 <- sum(d[gp1,1]^2 * f[gp1]) - (m1 * m1 * sum(f[gp1])) ss2 <- sum(d[-gp1,1]^2 * f[-gp1]) - (m2 * m2 * sum(f[-gp1])) c(m1 - m2, (ss1 + ss2)/(sum(f) - 2)) } grav1 <- gravity[as.numeric(gravity[,2]) >= 7, ] grav1.boot <- boot(grav1, diff.means, R = 999, stype = "f", strata = grav1[ ,2]) boot.pval(grav1.boot, type = "stud", theta_null = 0) ```