---
title: "Semiparametric Covariate Effects in the brea Package"
author: "Adam King"
date: "2025-08-30"
bibliography: brea.bib
link-citations: TRUE
output: rmarkdown::html_vignette
#output: rmarkdown::pdf_document
vignette: >
  %\VignetteIndexEntry{Semiparametric Covariate Effects in the brea Package}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Background

The `brea` package offers a number of advanced discrete event time modeling features, one of which is generalized additive model (GAM) style incorporation of arbitrary smooth nonlinear covariate effects. For example, in classical discrete or continuous time Cox proportional hazards models, the effect of time $t$ is modeled via an arbitrary smooth function called a *baseline hazard*, and this function is incorporated additively on the linear predictor scale. However, we may also wish to model the nonlinear effects of other quantitative covariates, such as a patient's age in a biomedical study.

This tutorial illustrates how to include such nonlinear effects in discrete time-to-event models using the `brea` package. Because including such nonlinear functions can substantially slow the MCMC algorithms used to obtain inferences, we also illustrate the use of an optional Metropolis-Hastings algorithm that can dramatically increase the efficiency of the inference algorithms.

Throughout this tutorial, we assume the reader is familiar with the basics of discrete time survival analysis, elementary Bayesian analysis (including inference via Markov chain Monte Carlo), and basic use of the `brea` package. All of these topics are covered in the *Introduction to `brea`* vignette, which we strongly suggest the reader work through first.

## Modeling Nonlinear Covariate Effects in the Discrete Cox Model

Here we briefly review the discrete time version of the Cox proportional hazards model introduced in [@cox1972]. Then we show how to extend the model by including nonlinear effects of the form $f_m(X_m)$ inside the linear predictor, and in turn explain our Bayesian formulation for the functions $f_m$. Finally, we briefly discuss inference algorithms for the parameters representing the functions $f_m$.

### The Discrete Time Cox Proportional Hazards Model

Let $T$ denote the discrete time of event occurrence; by convention we assume the possible timepoints $t$ of occurrence are the positive integers ($t=1,2,3,\ldots$). The discrete time Cox proportional hazards model relates the discrete time hazard rate $\lambda(t)=P(T=t \mid T\geq t)$ to a linear predictor $\eta(t)$ incorporating covariate effects using the logit link function:

$$
\text{log}\left(\frac{\lambda(t)}{1-\lambda(t)}\right) = \eta(t) = f_0(t)+\beta_1X_1 + \cdots + \beta_MX_M
$$

The function $f_0(t)$ is the *baseline hazard* that models the effect of discrete time $t$ on the linear predictor scale. Classically, we do not presume any specific functional form for $f_0(t)$ and instead just assume it is an arbitrary smooth function. In contrast, the other covariate effects $\beta_m X_m$ are modeled linearly, as in standard multiple linear regression.

### Additive Modeling of Nonlinear Covariate Effects

We would like to extend the above Cox model by allowing arbitrary nonlinear effects for quantitative covariates other than just time $t$. We will also explicitly represent the potentially time-varying nature of any of our covariate values by writing the $m^\text{th}$ covariate as $X_m(t)$. With this change, there is no longer any reason to single out time $t$ as a distinct covariate needing separate notation from the other $X_m(t)$, since we could, for example, let the first covariate be $t$ by setting $X_1(t)=t$.
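
To make the idea of treating time as just another covariate concrete, here is a minimal base R sketch. It is not taken from the `brea` package: the covariate values and the effect functions `f1` and `f2` are hypothetical and chosen only for illustration. It builds the linear predictor for one subject's person-time observations and converts it to a discrete hazard through the inverse logit.

```r
# Hypothetical illustration: all names and numeric values here are made up.

# Person-time observations for a single subject at t = 1, ..., 5
t <- 1:5

# Time enters as the first covariate, X_1(t) = t
X1 <- t
# A second, time-varying covariate X_2(t) (age in years, advancing monthly)
X2 <- 60 + (t - 1) / 12

# Assumed effect functions on the linear predictor (logit) scale
f1 <- function(x) -3 + 0.1 * x      # plays the role of the baseline hazard
f2 <- function(x) 0.02 * (x - 60)   # effect of age

# Linear predictor eta(t) as a sum of the per-covariate effects
eta <- f1(X1) + f2(X2)

# The inverse logit maps eta(t) to the discrete hazard lambda(t)
lambda <- plogis(eta)

# Survival probability P(T > t) is the running product of (1 - hazard)
surv <- cumprod(1 - lambda)
```

In this sketch, time enters through `X1` exactly as any other covariate would, and the former baseline hazard is simply one more additive term.
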
Hence, we can write our model as:

$$
\text{log}\left(\frac{\lambda(t)}{1-\lambda(t)}\right) = \eta(t) = f_1(X_1(t))+f_2(X_2(t))+\cdots +f_M(X_M(t))
$$

For categorical covariates, we may still use a representation $f_m(X_m(t))$ for the corresponding effect by declaring that the function $f_m$ takes a distinct parameter value for each possible category of $X_m$. For example, if the possible covariate categories are coded using positive integers $k=1,\ldots,K$ (as the `brea` package assumes), then we may let $f_m(k)=\beta_k$ so that $f_m(X_m(t))$ becomes simply $\beta_{X_m(t)}$.

### Modeling Nonlinear Effects Using Step Functions

There are many possible choices for how to model the smooth functions $f_m$ when $X_m$ is a quantitative variable. For example, we could use a parametric function such as a polynomial, or a crude step function; both of these possibilities are illustrated for the baseline hazard $f_0(t)$ in the *Introduction to `brea`* vignette. However, it is often not possible to know in advance the appropriate functional form for a function that relates a quantitative covariate like time $t$ to the hazard of event occurrence. In addition, the true functional form may not be accurately captured by a simple polynomial or by a step function with a small number of steps.

Thus, we propose modeling the functions $f_m$ using a highly flexible formulation: a step function with a large number of steps (usually 10--100 steps), along with a prior distribution on the step heights that ensures the resulting functional form is not too irregular (i.e., the function is approximately smooth).

Specifically, suppose we have a quantitative covariate $X$, and let $c_0,c_1,\ldots,c_K$ be a sequence of step boundaries such that all values of $X$ satisfy $c_0