\documentclass{article}

%\VignetteIndexEntry{lm.br}
\SweaveOpts{prefix.string=lmbr}

\setlength{\parskip}{0.5ex plus0.1ex minus0.1ex}
\setlength{\parindent}{0em}

\title{\texttt{lm.br}: An \textsf{R} Package for Broken Line Regression}
\author{Marc Adams}
\date{}

\usepackage[authoryear,round,longnamesfirst]{natbib}

\begin{document}

\bibliographystyle{abbrvnat}

\maketitle

\begin{abstract}
The \textsf{R} package \texttt{lm.br} delivers exact tests and exact confidence regions for a changepoint in linear or multiple linear regression. The package implements the likelihood theory of conditional inference. Examples demonstrate its use and show some properties of broken-line models.
\end{abstract}

\section{Theory}

A broken-line model consists of two straight lines joined at a changepoint. Three variants are
\begin{equation} \label{model1}
y_i = \alpha + \beta(x_i-\theta)_{-} + \beta'(x_i-\theta)_{+} + e_i
\end{equation}
\begin{equation} \label{model2}
y_i = \alpha + \beta(x_i-\theta)_{-} + e_i
\end{equation}
\begin{equation} \label{model3}
y_i = \beta(x_i-\theta)_{-} + e_i
\end{equation}
denoting $a_{-}=\min(a,0)$ and $a_{+}=\max(a,0)$, where $e \sim N(0,\sigma^{2}\Sigma)$. The parameters $\theta$, $\alpha$, $\beta$, $\beta'$, $\sigma$ are unknown, but $\Sigma$ is known. Model (\ref{model2}) is a threshold model, while model (\ref{model3}) applies when the threshold level is known. Inference about a parameter uses the assumed model and the resulting distribution of a test statistic.\\
A test statistic $D$ assigns a numeric value to a postulated parameter value $p_{0}$, depending on the model and the observations. $D(p_{0})$ is itself a random variable because it is a function of the random observations. A significance level is the probability, under the model, that $D$ would exceed its observed value, $SL(p_{0}) = \Pr[\, D(p_{0}) > D(p_{0})_{obs} \,]$. The set of postulated values such that $SL > \alpha$ is a $100(1-\alpha)\%$ confidence region for the true parameter value.\\
Conditional inference incorporates sufficient statistics that account for the other, unknown parameters. This refinement determines the exact distribution of a test statistic, even for small data sets. Student's $t$, for example, is the distribution of a sample mean conditional on a sufficient statistic for the variance. See \citet[ch.~15]{kalbfleisch:1985} and \citet[ex.~5.1, 5.5]{cox+hinkley:1974}.\\
The likelihood ratio is an optimal test statistic. \citet{knowles+siegmund:1989} examined an exact significance test, based on the likelihood ratio, for the null hypothesis of a single line versus a broken line. \citet{knowles+siegmund+zhang:1991} derived conditional likelihood-ratio (CLR) significance tests for the non-linear parameter in semilinear regression. \citet{siegmund+zhang:1994} applied these tests to obtain exact confidence intervals for the changepoint $\theta$ in models (\ref{model1}) and (\ref{model2}), and exact confidence regions for the two-parameter changepoint $(\theta, \alpha)$ in model (\ref{model2}). \citet{knowles+siegmund+zhang:1991} also developed a formula that evaluates these tests rapidly, which \texttt{lm.br} implements.\\
\texttt{lm.br} extends this theory. The same method derives the conditional likelihood-ratio test for $(\theta, \alpha)$ in model (\ref{model1}). The theory adapts to the case of known $\sigma$, which is useful for the Normal approximation of a binary random variable \citep[eq.~2.28]{cox+snell:1989}. These tests also simplify for a postulated changepoint value outside the range of the $x$-values.\\
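In code, the three variants correspond to different calls to \texttt{lm.br}. The following sketch is illustrative only and is not evaluated here: the data are made up, and it assumes the package's \texttt{type} argument with values \texttt{"LL"} (the default, model \ref{model1}) and \texttt{"LT"} (model \ref{model2}) as described on the package's help page, and that removing the intercept in the formula drops $\alpha$, giving model (\ref{model3}).
<<eval=FALSE>>=
library(lm.br)
##  made-up data, for illustration only
x <- c( 1.0, 1.1, 1.3, 1.7, 2.4, 3.9, 5.7, 7.6, 8.4, 8.6 )
y <- c( 2.1, 1.9, 1.8, 1.4, 0.7, 0.5, 1.4, 2.3, 2.8, 2.9 )

lm.br( y ~ x )                 #  model (1): two joined lines, type "LL"
lm.br( y ~ x, type = "LT" )    #  model (2): line-threshold
lm.br( y ~ x - 1, "LT" )       #  model (3): line-threshold, known level of zero
@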
For comparison, Approximate-F (AF) is another inference method, common in nonlinear regression. The AF method estimates the distribution of a likelihood-ratio statistic by its asymptotic $\chi^{2}$ distribution, with partial conditioning on a sufficient statistic for the variance. See \citet[sec.~24.6]{draper+smith:1998}. The simulations and examples below cover all of the above theory.\\
\section{Simulation Tests}

\begin{center}
\begin{tabular}{rccccc}
\multicolumn{6}{c}{\textsf{Coverage frequencies of the 95\% confidence interval on 100 arbitrary models}} \rule[-2ex]{0pt}{0ex}\\
\hline
\rule[-2ex]{0pt}{5.5ex} & & & \textbf{CLR} & & \textbf{AF} \\
\textbf{10 observations}, & $x_{1}-1 < \theta < x_{10}+1$ & & \textbf{95.0 \textendash\space 95.2} & & \textbf{90.0 \textendash\space 97.5} \\
\rule[-2ex]{0pt}{5.5ex} \textbf{30 observations}, & $x_{10} < \theta < x_{20}$ & & \textbf{95.0 \textendash\space 95.2} & & \textbf{90.8 \textendash\space 95.0} \\
\rule[-2ex]{0pt}{0ex} \textbf{100 observations}, & $x_{10} < \theta < x_{20}$ & & \textbf{95.0 \textendash\space 95.2} & & \textbf{91.3 \textendash\space 95.0} \\
\hline
\end{tabular}
\end{center}

\medskip

To give one specific example, the coverage frequency is 95.2\% by CLR but 90.7\% by AF for a model with first-line slope $-1$, second-line slope $+0.5$, changepoint $\theta = 3$, $\sigma = 1$, and 10 observations at $x$ = (1.0, 1.1, 1.3, 1.7, 2.4, 3.9, 5.7, 7.6, 8.4, 8.6).\\
A program created the arbitrary models using $U \sim \mathrm{Uniform}(0,1)$ with\\
\begin{tabular}{cccc}
\rule[-2ex]{0pt}{5.5ex} $n = 10$ & $x_{1} = 1$ & $x_{i} = x_{i-1} + 2U$ for $i>1$ & $\theta = x_{1} - 1 + (x_{n}-x_{1}+2)U$ \\
\rule[-2ex]{0pt}{0ex} $\alpha = 0$ & $\beta = -1$ & $\beta' = 2 - 2.5U$ & $\sigma = 0.1 + 2U$ \\
\end{tabular} \\
or with $n = 30$ or $n = 100$ and $\theta = x_{10} + (x_{20}-x_{10})U$. For each model, the program generated one million sets of random $y_i = \alpha+\beta(x_i-\theta)_{-}+\beta'(x_i-\theta)_{+}+\sigma N(0,1)$ and counted how often $SL(\theta) > 0.05$. These coverage frequencies should be accurate to $\pm 0.05\%$.
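In outline, one coverage run has the following shape. This sketch is not evaluated here and is scaled down to $10^{4}$ replicates; it also assumes that the fitted object's \texttt{sl} method returns the significance level as a value when called with \texttt{out='v'}, by analogy with the \texttt{cr} call used in the next section. Check the package help for the exact interface.
<<eval=FALSE>>=
library(lm.br)
set.seed(1)
##  generate one arbitrary model with n = 10, as in the table above
U <- runif(3)
x <- cumsum( c( 1, 2*runif(9) ) )                # x1 = 1,  x_i = x_{i-1} + 2U
theta <- x[1] - 1 + ( x[10] - x[1] + 2 )*U[1]
alpha <- 0;  beta <- -1
betap <- 2 - 2.5*U[2]                            # second-line slope
sigma <- 0.1 + 2*U[3]
##  simulate response vectors and count how often  SL(theta) > 0.05
cover <- 0
for( i in 1:10000 ) {
  y <- alpha + beta*pmin(x-theta,0) + betap*pmax(x-theta,0) + sigma*rnorm(10)
  fit <- lm.br( y ~ x )
  if( fit$sl( theta, out='v' ) > 0.05 )  cover <- cover + 1
}
cover/10000                                      # coverage of the 95% interval
@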
\section{Examples}

\subsection{Broken Line Regression}

A broken-line model could fit drinking-and-driving survey results. Yearly surveys, taken in different months over the years, were adjusted by a seasonal index based on monthly surveys of a similar question \citep{tirf:1998-2007,camh:2003}. The annual surveys asked respondents whether in the past 30 days they had driven within two hours of having one drink, while the monthly surveys asked whether in the past 30 days they had driven within one hour of having two drinks. Figure \ref{fig:cr} shows the survey results without and with seasonal adjustment, and the exact 90\% confidence region for a changepoint if the adjustment were valid.

\begin{figure}[htbp]
\begin{center}
<< fig=TRUE, height=5, echo=FALSE, results=hide >>=
library(lm.br)
log_odds <- c( -1.194, -2.023, -2.285, -1.815, -1.673, -1.444, -1.237, -1.228 )
year <- c( 1998.92, 2001.25, 2002.29, 2003.37, 2004.37, 2005.71, 2006.71, 2007.71 )
VarCov <- matrix( c(
  0.0361, 0,      0,      0,      0,      0,      0,      0,
  0,      0.0218, 0.0129, 0,      0,      0,      0,      0,
  0,      0.0129, 0.0319, 0,      0,      0,      0,      0,
  0,      0,      0,      0.0451, 0.0389, 0,      0,      0,
  0,      0,      0,      0.0389, 0.0445, 0,      0,      0,
  0,      0,      0,      0,      0,      0.0672, 0.0607, 0.0607,
  0,      0,      0,      0,      0,      0.0607, 0.0664, 0.0607,
  0,      0,      0,      0,      0,      0.0607, 0.0607, 0.0662 ),
  nrow = 8, ncol = 8 )
dd <- lm.br( log_odds ~ year, w = VarCov, inv = TRUE, var.known = TRUE )
bounds <- dd$cr( CL = 0.90, out = 'v' )
n <- length( dd$x1 )
nbd <- nrow( bounds )
title <- "Exact 90% confidence region for changepoint"
x <- y <- matrix( NA, max(n,nbd), 4 )
x[1:n,1] <- dd$x1
y[1:n,1] <- dd$y
x[1:nbd,2:3] <- bounds[,1]
y[1:nbd,2:3] <- bounds[,2:3]
x[1:n,4] <- dd$x1
y[1:n,4] <- c( -1.4571, -1.60646, -1.65565, -1.67545,
               -1.53292, -1.76094, -1.55349, -1.54487 )
matplot( x, y, type=c('p','l','l','p'), pch=c(15,0), lty='solid', col='black',
  lwd=2, cex=1.5, main=title, xlab=dd$x1nm, ylab=dd$ynm )
@
\caption{\label{fig:cr}Drinking-and-driving survey log-odds (solid squares) and log-odds with seasonal adjustment (open squares) versus year, with the exact 90\% confidence region for the changepoint $(\theta,\alpha)$ by CLR.}
\end{center}
\end{figure}

Data input, and commands to get confidence intervals for the changepoint, are
<<>>=
log_odds <- c( -1.194, -2.023, -2.285, -1.815, -1.673, -1.444, -1.237, -1.228 )
year <- c( 1998.92, 2001.25, 2002.29, 2003.37, 2004.37, 2005.71, 2006.71, 2007.71 )
VarCov <- matrix( c(
  0.0361, 0,      0,      0,      0,      0,      0,      0,
  0,      0.0218, 0.0129, 0,      0,      0,      0,      0,
  0,      0.0129, 0.0319, 0,      0,      0,      0,      0,
  0,      0,      0,      0.0451, 0.0389, 0,      0,      0,
  0,      0,      0,      0.0389, 0.0445, 0,      0,      0,
  0,      0,      0,      0,      0,      0.0672, 0.0607, 0.0607,
  0,      0,      0,      0,      0,      0.0607, 0.0664, 0.0607,
  0,      0,      0,      0,      0,      0.0607, 0.0607, 0.0662 ),
  nrow = 8, ncol = 8 )
dd <- lm.br( log_odds ~ year, w = VarCov, inv = TRUE, var.known = TRUE )
dd$ci()
dd$ci( method = "AF" )
@
The wide difference between the CLR and AF confidence intervals above is due to plateaus in the significance level over the end-intervals. In a model (\ref{model1}) with $x_1 < x_2 < \cdots < x_n$, both the CLR and AF methods give a constant significance level for all postulated values $\theta_{0}$ in $[ x_{1}, x_{2} ]$, in $[ x_{n-1}, x_{n} ]$, or outside of $[ x_{1}, x_{n} ]$. (Coverage probability on these intervals is still exactly 95\% by CLR, as the simulation tests show.) The inference assumes that any line slope is possible, extending in this example to an instantaneous drop near December 1998.
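The plateaus can be checked directly by evaluating the significance level at several postulated changepoint values. A sketch, not evaluated here, under the same assumption about the \texttt{sl} method as in the previous section:
<<eval=FALSE>>=
##  postulated values inside the first end-interval [x_1, x_2]
dd$sl( 1999.5 )
dd$sl( 2000.5 )    #  same significance level as above
##  postulated values below the range of the data
dd$sl( 1990 )
dd$sl( 1995 )      #  also a common, constant level
@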
\subsection{Multiple Regression}

\texttt{lm.br} can test for a changepoint in multiple linear regression. It tests for a change in a single coefficient of the regression model, assuming continuity; it does not test for an arbitrary structural change that might involve two or more coefficients or a discontinuity.\\
\citet{liu+wu+zidek:1997} suggested a changepoint for the coefficient of car weight in a linear fit of miles-per-gallon versus weight and horsepower, for 38 cars of 1978--79 models. One of \textsf{R}'s included datasets (\texttt{mtcars}) gives such ratings for 32 cars of 1973--74 models. Analysis of this 1973--74 dataset by the exact conditional likelihood-ratio inference also shows evidence of a changepoint:
<<>>=
lm.br( mpg ~ wt + hp, data = mtcars )
@
For multiple regression, \texttt{lm.br} applies an orthogonal transformation to reduce the model to canonical form \citep{siegmund+zhang:1994}. One way to see how this method works is algebraic. The composite likelihood-ratio statistic substitutes optimal values for the unknown parameters. In a canonical model, the optimal coefficients of the linear terms always reduce their corresponding error components to zero. They therefore have no effect on inference, and the algebra can omit them. This elimination reduces a multiple-predictor model to a single-predictor model. See \citet[ch.~6]{hoffman+kunze:1971} and \citet[sec.~7.1]{lehmann:2005}.

\section{Summary}

If a broken line with Normal errors represents the relationship between a factor and responses, then \texttt{lm.br} solves the inference step for the changepoint. The package uses the technique of conditional inference to allow for the other, unknown terms in the model. Fitting a broken line can reveal the plausible interval for a changepoint, although practical cause-and-effect relationships usually have a smooth transition. Any statistical analysis should examine the fit of the model and the error distribution with graphs and significance tests, interpret the results, and consider adjustments to the model or alternative models.

\bibliography{lm.br}

\end{document}