Recall that to determine confidence interval bounds, we solve the equation \[\begin{align*} F_Y(y_{\rm obs} \vert \theta) - q = 0 \end{align*}\] for \(\theta\). Here, \(Y\) is our chosen statistic (e.g., \(\bar{X}\)), \(F_Y(\cdot)\) is the cumulative distribution function (or cdf) for \(Y\)’s sampling distribution, \(y_{\rm obs}\) is the observed statistic value, and \(q\) is a quantile that we determine given the confidence coefficient \(1-\alpha\) and the type of interval we are constructing (one-sided lower bound, one-sided upper bound, or two-sided).
In this vignette, we examine the situation in which the sampling
distribution cdf is not analytically derivable, but where we
can simulate individual data using code in an existing
R
package (e.g., rnorm()
). In this situation,
we would create a function that, given a simulated dataset, returns the
statistic value, and we would pass that function into
cdfinv.sim()
.
Let’s suppose we sample \(n = 6\) data from a \(\text{Beta}(a,2.8)\) distribution and we wish to compute a 95% lower bound for \(a\), given an observed statistic value.
Given that the beta distribution is a member of the exponential family, it makes sense to determine and apply a sufficient statistic. For a \(\text{Beta}(a,2.8)\) distribution, a sufficient statistic, found via likelihood factorization, is \[\begin{align*} Y = \prod_{i=1}^n X_i \,. \end{align*}\] Using products is generally a sub-optimal choice, given the possiblity of floating-point underflows. Since one-to-one functions of sufficient statistics are also sufficient, we adopt \(Y = -\sum_{i=1}^n \log X_i\) instead. The minus sign is included simply to make the statistic values positive.
In R
, we code this statistic as follows:
Below we show how we can pass the function beta.stat
,
now defined in R
’s global environment, to
cdfinv.sim()
. The first three arguments are standard: the
distribution “name” is beta
(cdfinv()
will
tack a p
on the front), the parameter of interest is
shape1
, and the (assumed) observed statistic value is 10.5.
The next argument specifies that we wish to compute a (95% by default)
lower bound on \(a\) (aka
shape1
), instead of the default two-sided interval. As we
know that \(a > 0\), we specify a
lower parameter bound (lpb
) of 0. The sample size is
nsamp
(here, 6), and stat.func
is the function
we just defined above, beta.stat
. The last argument is an
extra one specifically required for rbeta()
.
require(cdfinv)
cdfinv.sim("beta","shape1",10.5,bound="lower",lpb=0,nsamp=6,stat.func=beta.stat,shape2=2.8)
## $DISTR
## [1] "beta"
##
## $PARAM
## [1] "shape1"
##
## $STAT
## [1] 10.5
##
## $bound
## [1] 0.5018137
##
## $q
## [1] 0.05
The 95% lower bound on \(a\) is
\(\hat{\theta}_L = 0.502\). Note that
because we work with empirical sampling distributions, this value is an
estimate that under default conditions takes \(\sim\) 1-10 CPU seconds to compute on a
typical computer. For greater precision, one should consider increasing
numsim
from its default value of \(10^5\) to \(10^6\) or greater, while being mindful of
the increased time needed to complete the computation.