Confidence Interval Estimation via Simulations

Peter E. Freeman

Introduction

Recall that to determine confidence interval bounds, we solve the equation \[\begin{align*} F_Y(y_{\rm obs} \vert \theta) - q = 0 \end{align*}\] for \(\theta\). Here, \(Y\) is our chosen statistic (e.g., \(\bar{X}\)), \(F_Y(\cdot)\) is the cumulative distribution function (or cdf) for \(Y\)’s sampling distribution, \(y_{\rm obs}\) is the observed statistic value, and \(q\) is a quantile that we determine given the confidence coefficient \(1-\alpha\) and the type of interval we are constructing (one-sided lower bound, one-sided upper bound, or two-sided).

In this vignette, we examine the situation in which the sampling distribution cdf is not analytically derivable, but where we can simulate individual data using code in an existing R package (e.g., rnorm()). In this situation, we would create a function that, given a simulated dataset, returns the statistic value, and we would pass that function into cdfinv.sim().

Example

Let’s suppose we sample \(n = 6\) data from a \(\text{Beta}(a,2.8)\) distribution and we wish to compute a 95% lower bound for \(a\), given an observed statistic value.

What is the Statistic?

Given that the beta distribution is a member of the exponential family, it makes sense to determine and apply a sufficient statistic. For a \(\text{Beta}(a,2.8)\) distribution, a sufficient statistic, found via likelihood factorization, is \[\begin{align*} Y = \prod_{i=1}^n X_i \,. \end{align*}\] Using products is generally a sub-optimal choice, given the possiblity of floating-point underflows. Since one-to-one functions of sufficient statistics are also sufficient, we adopt \(Y = -\sum_{i=1}^n \log X_i\) instead. The minus sign is included simply to make the statistic values positive.

In R, we code this statistic as follows:

# give the function a name that will not conflict with others in the namespace
beta.stat <- function(x)
{
  -sum(log(x))
}

Using cdfinv.sim()

Below we show how we can pass the function beta.stat, now defined in R’s global environment, to cdfinv.sim(). The first three arguments are standard: the distribution “name” is beta (cdfinv() will tack a p on the front), the parameter of interest is shape1, and the (assumed) observed statistic value is 10.5. The next argument specifies that we wish to compute a (95% by default) lower bound on \(a\) (aka shape1), instead of the default two-sided interval. As we know that \(a > 0\), we specify a lower parameter bound (lpb) of 0. The sample size is nsamp (here, 6), and stat.func is the function we just defined above, beta.stat. The last argument is an extra one specifically required for rbeta().

require(cdfinv)
cdfinv.sim("beta","shape1",10.5,bound="lower",lpb=0,nsamp=6,stat.func=beta.stat,shape2=2.8)
## $DISTR
## [1] "beta"
## 
## $PARAM
## [1] "shape1"
## 
## $STAT
## [1] 10.5
## 
## $bound
## [1] 0.5018137
## 
## $q
## [1] 0.05

The 95% lower bound on \(a\) is \(\hat{\theta}_L = 0.502\). Note that because we work with empirical sampling distributions, this value is an estimate that under default conditions takes \(\sim\) 1-10 CPU seconds to compute on a typical computer. For greater precision, one should consider increasing numsim from its default value of \(10^5\) to \(10^6\) or greater, while being mindful of the increased time needed to complete the computation.