--- title: "Confidence Interval Estimation via Simulations" author: "Peter E. Freeman" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Confidence Interval Estimation via Simulations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction Recall that to determine confidence interval bounds, we solve the equation \begin{align*} F_Y(y_{\rm obs} \vert \theta) - q = 0 \end{align*} for $\theta$. Here, $Y$ is our chosen statistic (e.g., $\bar{X}$), $F_Y(\cdot)$ is the cumulative distribution function (or cdf) for $Y$'s sampling distribution, $y_{\rm obs}$ is the observed statistic value, and $q$ is a quantile that we determine given the confidence coefficient $1-\alpha$ and the type of interval we are constructing (one-sided lower bound, one-sided upper bound, or two-sided). In this vignette, we examine the situation in which the sampling distribution cdf is *not* analytically derivable, but where we *can* simulate individual data using code in an existing `R` package (e.g., `rnorm()`). In this situation, we would create a function that, given a simulated dataset, returns the statistic value, and we would pass that function into `cdfinv.sim()`. ## Example Let's suppose we sample $n = 6$ data from a $\text{Beta}(a,2.8)$ distribution and we wish to compute a 95% lower bound for $a$, given an observed statistic value. ## What is the Statistic? Given that the beta distribution is a member of the exponential family, it makes sense to determine and apply a sufficient statistic. For a $\text{Beta}(a,2.8)$ distribution, a sufficient statistic, found via likelihood factorization, is \begin{align*} Y = \prod_{i=1}^n X_i \,. \end{align*} Using products is generally a sub-optimal choice, given the possiblity of floating-point underflows. Since one-to-one functions of sufficient statistics are also sufficient, we adopt $Y = -\sum_{i=1}^n \log X_i$ instead. The minus sign is included simply to make the statistic values positive. In `R`, we code this statistic as follows: ```{r} # give the function a name that will not conflict with others in the namespace beta.stat <- function(x) { -sum(log(x)) } ``` ## Using cdfinv.sim() Below we show how we can pass the function `beta.stat`, now defined in `R`'s global environment, to `cdfinv.sim()`. The first three arguments are standard: the distribution "name" is `beta` (`cdfinv()` will tack a `p` on the front), the parameter of interest is `shape1`, and the (assumed) observed statistic value is 10.5. The next argument specifies that we wish to compute a (95% by default) lower bound on $a$ (aka `shape1`), instead of the default two-sided interval. As we know that $a > 0$, we specify a lower parameter bound (`lpb`) of 0. The sample size is `nsamp` (here, 6), and `stat.func` is the function we just defined above, `beta.stat`. The last argument is an extra one specifically required for `rbeta()`. ```{r} require(cdfinv) cdfinv.sim("beta","shape1",10.5,bound="lower",lpb=0,nsamp=6,stat.func=beta.stat,shape2=2.8) ``` The 95% lower bound on $a$ is $\hat{\theta}_L = 0.502$. Note that because we work with empirical sampling distributions, this value is an estimate that under default conditions takes $\sim$ 1-10 CPU seconds to compute on a typical computer. For greater precision, one should consider increasing `numsim` from its default value of $10^5$ to $10^6$ or greater, while being mindful of the increased time needed to complete the computation.