---
title: "Confidence Interval Estimation via Simulations"
author: "Peter E. Freeman"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Confidence Interval Estimation via Simulations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

Recall that to determine confidence interval bounds, we solve the equation
\begin{align*}
F_Y(y_{\rm obs} \vert \theta) - q = 0
\end{align*}
for $\theta$. Here, $Y$ is our chosen statistic (e.g., $\bar{X}$),
$F_Y(\cdot)$ is the cumulative distribution function (or cdf) for
$Y$'s sampling distribution, $y_{\rm obs}$ is the observed statistic value,
and $q$ is a quantile that we determine given the confidence coefficient 
$1-\alpha$ and the type of interval we are constructing (one-sided lower
bound, one-sided upper bound, or two-sided).

In this vignette, we examine the situation in which the sampling distribution
cdf is *not* analytically derivable, but where we *can* simulate individual
data using code in an existing `R` package (e.g., `rnorm()`).
In this situation, we would create a function that, given a simulated dataset,
returns the statistic value, and we would pass that function into 
`cdfinv.sim()`.

## Example

Let's suppose we sample $n = 6$ data from a $\text{Beta}(a,2.8)$
distribution and we wish to compute a 95% lower bound for $a$, given an
observed statistic value.

## What is the Statistic?

Given that the beta distribution is a member of the exponential family,
it makes sense to determine and apply a sufficient statistic. For
a $\text{Beta}(a,2.8)$ distribution, a sufficient statistic, found via
likelihood factorization, is
\begin{align*}
Y = \prod_{i=1}^n X_i \,.
\end{align*}
Using products is generally a sub-optimal choice, given the possiblity of
floating-point underflows. Since one-to-one functions of sufficient statistics
are also sufficient, we adopt $Y = -\sum_{i=1}^n \log X_i$ instead. The
minus sign is included simply to make the statistic values positive.

In `R`, we code this statistic as follows:
```{r}
# give the function a name that will not conflict with others in the namespace
beta.stat <- function(x)
{
  -sum(log(x))
}
```

## Using cdfinv.sim()

Below we show how we can pass the function `beta.stat`, 
now defined in `R`'s global environment,
to `cdfinv.sim()`.
The first three arguments are standard: the distribution "name" is
`beta` (`cdfinv()` will tack a `p` on the front), the parameter of interest is
`shape1`, and the (assumed) observed statistic value is 10.5.
The next argument specifies that
we wish to compute a (95% by default) lower bound on $a$ (aka `shape1`), 
instead of the default two-sided interval.
As we know that $a > 0$, we specify a lower parameter bound (`lpb`) of 0.
The sample size is `nsamp` (here, 6), and `stat.func` is the function we
just defined above, `beta.stat`.
The last argument is an extra one specifically
required for `rbeta()`.

```{r}
require(cdfinv)
cdfinv.sim("beta","shape1",10.5,bound="lower",lpb=0,nsamp=6,stat.func=beta.stat,shape2=2.8)
```

The 95% lower bound on $a$ is $\hat{\theta}_L = 0.502$. Note that because we
work with empirical sampling distributions, this value is an estimate that under default
conditions takes $\sim$ 1-10 CPU seconds to compute on a typical computer. For greater
precision, one should consider increasing `numsim` from its default value of $10^5$ to
$10^6$ or greater, while being mindful of the increased time needed to complete
the computation.