Research Article

Binomial Confidence Intervals for Rare Events: Importance of Defining Margin of Error Relative to Magnitude of Proportion

Received 07 Jun 2023, Accepted 21 Apr 2024, Accepted author version posted online: 02 May 2024

ABSTRACT

Confidence interval performance is typically assessed in terms of two criteria: coverage probability and interval width (or margin of error). In this paper, we assess the performance of four common proportion interval estimators: the Wald, Clopper-Pearson (exact), Wilson and Agresti-Coull, in the context of rare-event probabilities. We define the interval precision in terms of a relative margin of error which ensures consistency with the magnitude of the proportion. Thus, confidence interval estimators are assessed in terms of achieving a desired coverage probability whilst simultaneously satisfying the specified relative margin of error. We illustrate the importance of considering both coverage probability and relative margin of error when estimating rare-event proportions, and show that within this framework, all four interval estimators perform somewhat similarly for a given sample size and confidence level. We identify relative margin of error values that result in satisfactory coverage whilst being conservative in terms of sample size requirements, and hence suggest a range of values that can be adopted in practice. The proposed relative margin of error scheme is evaluated analytically, by simulation, and by application to a number of recent studies from the literature.

Disclaimer

As a service to authors and researchers we are providing this version of an accepted manuscript (AM). Copyediting, typesetting, and review of the resulting proofs will be undertaken on this manuscript before final publication of the Version of Record (VoR). During production and pre-press, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal relate to these versions also.

1 Introduction

A fundamental problem in applied statistics is the construction of a confidence interval (CI) for a binomial proportion, p. In many applications, one deals with a large population within which an event of interest is rare. For example, in clinical statistics, p could represent the proportion of patients exhibiting treatment side effects; such a scenario arose in the context of COVID-19 vaccination (Polack et al., 2020). In manufacturing, the number of defective components is often very small relative to the large number of components produced. Indeed, many manufacturers now achieve a defect rate of 3.4 in one million (Evans and Lindsay, 2015; Woodall and Montgomery, 2014). In the aviation industry, strict regulations ensure that safety incidents are rare; for example, Boeing (2022) shows that very few incidents occur within a large sample of flights. (Note: we revisit the COVID-19 and aviation examples as case studies in Section 8, along with an ADHD medication example.)

Ascertaining the order of magnitude of p is important in “large populations” such as the aforementioned. Indeed, with population sizes in the millions (or billions), there is a big practical difference between $p=10^{-4}$ and $p=10^{-6}$ (but such differences are much less important/detectable in smaller populations). For example, in high-throughput manufacturing, a difference in the order of magnitude in the failure rate has significant implications for the number of defects and/or product returns. From a purely pragmatic perspective, note that ten thousand observations are needed to obtain, on average, one event when $p=10^{-4}$. However, while it is expected that relatively large samples will be required to adequately estimate the order of magnitude of a small proportion, p, practitioners will need more specific guidance on the sample size requirements; this is not well covered by the existing literature.

The problem of constructing a CI for p has a wide literature, including several comparative studies, for example, Gonçalves et al. (2012), Leemis and Trivedi (1996), Newcombe (1998) and Pires and Amado (2008). These works assess various proportion estimators; for example, Pires and Amado (2008) compare twenty different methods. However, these works, and the literature in general, focus primarily on situations where p is moderately large. As such, there is much less guidance in the existing literature regarding the scenario where p is small. Furthermore, there is little discussion of relative margin of error, which is needed in this small p setting. Whereas relative margin of error is not a prominent feature of CI assessment for moderately large proportions, it is essential that the margin of error scales with the magnitude of p for rare events. Therefore, we consider a valid CI estimator as one that achieves a desired coverage probability whilst also maintaining a specified relative margin of error, and, in contrast to much of the existing literature, we focus on the small p regime of $p \in [10^{-6}, 10^{-1}]$, where relative margin of error is especially important.

For our analysis, we consider the most widely used binomial confidence interval, the Wald interval, along with three other common intervals: Clopper-Pearson (exact), Wilson (score) and Agresti-Coull (adjusted Wald) (Agresti and Coull, 1998; Clopper and Pearson, 1934; Wilson, 1927). Despite its widespread use, the Wald interval is known to produce inadequate coverage when p is near 0 or 1, and/or the sample size, n, is small. It has also been well documented that this interval suffers from erratic coverage, even when p is moderate (Agresti and Coull, 1998; Blyth and Still, 1983; Böhning, 1994; Vollset, 1993). Brown et al. (2001) show that this coverage fluctuation occurs for large n and recommend against using the Wald interval in practice. Newcombe (1998) also discourages the use of the Wald interval and suggests that its use be restricted to sample size planning. In recent work, Andersson (2023) discusses the deficiencies of the Wald interval and examines its coverage and noncoverage performance relative to the Wilson interval. Whilst the criticisms of the Wald interval can be justified, particularly when n is small, it is worth noting that the issue of erratic coverage is not unique to the Wald interval; this behavior is related to the binomial distribution and we illustrate (in Section 5) that it occurs for all four interval estimators.

A common approach in determining sample sizes is to set the (Wald) CI margin of error equal to a specified value, $\epsilon$, and then solve for n. In order to maintain consistency between $\epsilon$ and p, we consider the relative margin of error, $\epsilon_R=\epsilon/p$, and obtain sample sizes by setting $\epsilon_R$ to a specified value and solving each interval equation for n. Lwanga and Lemeshow (1991) provide (Wald) sample size calculations for fixed and relative margins of error in the range [0.01, 0.5] for $p \in [0.05, 0.95]$. However, in our work, we focus on the small-p regime of $p \in [10^{-6}, 10^{-1}]$ and provide computed coverage probabilities relating to $\epsilon_R \in [0.05, 0.75]$. In this regime, it is important to consider relative precision over fixed precision (fixed $\epsilon$ value). For example, $\epsilon=0.1$ might be considered as reasonable precision for $p=0.4$, but could equally be considered reasonable for $p=0.2$. However, where $\epsilon=0.05$ could be considered as a valid margin of error for $p=10^{-1}$, it is far too large for a success probability of the order $p=10^{-3}$. Ultimately, we find that $\epsilon_R \in [0.1, 0.5]$ yields a good compromise between estimation precision, coverage performance, and sample size requirements.

In this work, we illustrate the importance of using relative margin of error in a small p regime, recommend practical tolerances for both relative margin of error and coverage probability, and provide comparisons in terms of sample size requirements. We show that when CI performance is assessed in terms of both coverage probability and relative margin of error, the four CI estimators perform similarly in many cases. Although the differences between the estimators are less pronounced when considering the relative margin of error, we show that the Wilson (score) interval provides the best overall performance. In addition, we provide practical guidance on the sample sizes required to attain reasonable CI performance for various (small) values of p. We anticipate that this guidance will be useful to researchers working in real-world small-p applications.

The remainder of this article is organized as follows. In Section 2, we review some of the proportion estimators proposed in the literature, focusing, in particular, on the Wald, Clopper-Pearson, Wilson and Agresti-Coull intervals. Section 3 provides details of the CI evaluation criteria used in the work. Section 4 briefly discusses the initial estimation of p for sample size planning, then moves on to using relative margin of error in such planning, and CI performance evaluation. Sections 5 and 6 illustrate the importance of employing a relative margin of error scheme in CI assessment. Section 7 presents the challenge of estimating a rare-event proportion using a small sample size. In Section 8, we present a number of case studies to demonstrate the relative margin of error schemes in assessing the validity of estimated intervals. Finally, the article concludes in Section 9 with a discussion.

2 Binomial Proportion Interval Estimators

Several methods have been devised to estimate a binomial proportion, p, including the Wald, Clopper-Pearson, Wilson, Agresti-Coull, Jeffreys, arcsine transformation and likelihood ratio intervals. A range of ensemble approaches have also been considered, for example, Kabaila et al. (2016), Park and Leemis (2019) and Turek and Fletcher (2012). In this work we assess the performance of the Wald, Clopper-Pearson, Wilson and Agresti-Coull intervals in standard form, i.e., without modification or application of continuity correction.

2.1 Details of the Intervals Considered in our Work

The Wald interval is included in this study as it is the most widely known and used estimator. We assess the Clopper and Pearson (1934) interval as it is an exact method, i.e., unlike the other estimators used in this work, it is based directly on the cumulative probabilities of the binomial distribution. For this reason it is often regarded as the “gold standard” in binomial proportion estimation (Agresti and Coull, 1998; Gonçalves et al., 2012; Newcombe, 1998). Where the Wald interval is known to produce inadequate coverage for small n or p, the Clopper-Pearson interval is generally regarded as being overly conservative, unless n is quite large (Agresti and Coull, 1998; Brown et al., 2001; Newcombe, 1998; Thulin, 2014). The intervals proposed by Wilson (1927) and Agresti and Coull (1998) both offer a compromise between the (liberal) Wald interval, and the (conservative) Clopper-Pearson interval.

We assess the performance of the Wilson and Agresti-Coull intervals in this work given their popularity within the literature. For example, Brown et al. (2001) recommend the Wilson or Jeffreys interval for small n. For larger n, they recommend the Wilson, Jeffreys or Agresti-Coull intervals, preferring the Agresti-Coull method for its simpler presentation. Agresti and Coull (1998) recommend the Wilson interval, and add that the 95% Wilson interval has similar performance to their method. Newcombe (1998) remarks on the mid-p method (an exact method closely related to the Clopper-Pearson method), and the Wilson method, noting the Wilson’s advantage of having a simple closed form. Vollset (1993) too recommends the mid-p and Wilson (uncorrected and continuity corrected) intervals, along with the Clopper-Pearson interval, stating that those four intervals can be safely used at all times. Pires and Amado (2008) recommend the continuity corrected arcsine transformation or the Agresti-Coull method. Krishnamoorthy and Peng (2007) show that when controlling for Type I and Type II error rates in one-sided hypotheses, the Wilson (score) and exact (Clopper-Pearson) tests require the same sample size. However, for two-sided hypothesis applications, and for constructing confidence intervals, they recommend the Wilson method.

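Listed below are the Wald (W), Clopper-Pearson (CP), Wilson (WS) and Agresti-Coull (AC) interval formulas, where $z_{\alpha/2}$ denotes the $1-\alpha/2$ quantile of the standard normal distribution and $x$ is the number of successes in a sample of size $n$.

W: $$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$ where $\hat{p}=x/n$.

CP: $$\text{Lower: } \mathrm{Beta}\left(\alpha/2;\ n\hat{p},\ n(1-\hat{p})+1\right), \qquad \text{Upper: } \mathrm{Beta}\left(1-\alpha/2;\ n\hat{p}+1,\ n(1-\hat{p})\right),$$ where $\mathrm{Beta}(\cdot)$ is the quantile function of the Beta distribution.

WS: $$\frac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1 + \dfrac{z_{\alpha/2}^2}{n}}.$$

AC: $$\tilde{p} \pm z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}},$$ where $\tilde{p} = \dfrac{n\hat{p} + z_{\alpha/2}^2/2}{n + z_{\alpha/2}^2}$ and $\tilde{n} = n + z_{\alpha/2}^2$.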
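As an illustration only, the four formulas above can be sketched in Python using the standard library. The function names are ours (hypothetical), the intervals are left untruncated (bounds may fall outside [0, 1]), and, as a stated substitution, the Clopper-Pearson bounds are found by bisection on the binomial tail probabilities rather than by calling a Beta quantile function; the two formulations are equivalent.

```python
from math import comb, sqrt
from statistics import NormalDist


def z_quantile(alpha):
    """The 1 - alpha/2 quantile of the standard normal, z_{alpha/2}."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald(x, n, alpha=0.05):
    z, p_hat = z_quantile(alpha), x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson(x, n, alpha=0.05):
    z, p_hat = z_quantile(alpha), x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def agresti_coull(x, n, alpha=0.05):
    z = z_quantile(alpha)
    n_tilde = n + z**2
    p_tilde = (x + z**2 / 2) / n_tilde
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ Binomial(n, p); adequate for moderate n only."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_decreasing(f, iters=100):
    """Bisection on [0, 1] for a function that decreases through zero."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    # Lower bound solves Pr(X >= x | p) = alpha/2; upper solves Pr(X <= x | p) = alpha/2.
    lower = 0.0 if x == 0 else _root_of_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(x - 1, n, p)))
    upper = 1.0 if x == n else _root_of_decreasing(
        lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper


if __name__ == "__main__":
    for name, fn in [("W", wald), ("CP", clopper_pearson),
                     ("WS", wilson), ("AC", agresti_coull)]:
        lo, hi = fn(0, 100)
        print(f"{name}: [{lo:.4f}, {hi:.4f}]")
```

Calling each function with x = 0 and n = 100 gives a quick sanity check of the implementation against the zero-event case discussed in Section 2.2.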

2.2 Estimating Proportions when No Events are Observed

To provide an initial flavor of the challenging nature of estimating rare-event probabilities, we first consider the situation where no events are observed in the sample. Letting x denote the number of observed events, for very small p, it will be quite likely in small samples that $x=0$, and, hence, $\hat{p}=0/n=0$. For example, consider an event that occurs with probability $p=10^{-3}$, and consider the case where the sample size is $n \le 100$ such that $\Pr(X=0)>0.9$, i.e., it is very likely that no events will be observed in this sample.

In this scenario where no events are observed, the resulting intervals are very conservative for small n. For example, with a sample of size $n=100$, the following intervals are obtained: W: [0, 0], CP: [0, 0.0362], WS: [0, 0.0370], AC: [−0.0074, 0.0444]. Clearly the Wald interval is degenerate, and the remaining intervals are far too wide to be useful in small p settings where a good estimate of magnitude is required; the issue is more acute for smaller proportions, e.g., the probability of observing no events is 0.99 if $p=10^{-4}$ and 0.999 if $p=10^{-5}$. As it is clear that a sample size of $n=100$ will not suffice, further guidance is required in these scenarios, which we provide in the sequel. Of course, the required precision is analysis specific: the above intervals would be perfectly adequate if one were only interested in assessing if $p<0.1$, for example. However, for situations where it is important to determine the order of magnitude, such as assessing the failure rate in high-volume manufacturing, sufficiently large samples will be required to gain a reasonable estimate of the order of magnitude of p.
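The zero-event probabilities quoted above follow directly from $\Pr(X=0)=(1-p)^n$. A minimal sketch (the function name is ours) that reproduces them for $n=100$:

```python
def prob_no_events(n, p):
    """Pr(X = 0) for X ~ Binomial(n, p)."""
    return (1 - p) ** n


if __name__ == "__main__":
    for p in (1e-3, 1e-4, 1e-5):
        print(f"p = {p:g}: Pr(X = 0) = {prob_no_events(100, p):.4f}")
```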

3 Evaluation Criteria

The most commonly used CI evaluation criteria are coverage probability and expected width (Gonçalves et al., 2012), and these are the criteria that we consider in this work. However, other performance metrics have been proposed. For example, Vos and Hudson (2005) interpret an interval as the non-rejected parameter values in a hypothesis test and discuss the p-confidence and p-bias criteria; Newcombe (1998) presents a criterion using noncoverage as an indicator of location; and Park and Leemis (2019) adopt an ensemble approach and use root mean squared error and mean absolute deviation to measure CI performance.

3.1 Coverage Probability

The coverage probability can be interpreted as the computed interval’s long-run percentage inclusion of the unknown parameter. Denoting $L_x$ and $U_x$ as the lower and upper CI bounds formed with x successes (suppressing the dependence on n and the significance level, $\alpha$), the expected coverage probability, which we denote CPr, for a fixed parameter p, is given by

$$\mathrm{CPr}(n,p)=\sum_{x=0}^{n}\binom{n}{x}p^{x}(1-p)^{n-x}\,\mathbb{1}(L_x \le p \le U_x), \tag{1}$$

where 1(·) is an indicator function that takes the value 1 when its argument is true, and 0 otherwise.
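To make equation (1) concrete, the following sketch evaluates the sum exactly for the Wald and Wilson intervals. All names are ours; the binomial probabilities are computed in log space via `lgamma` (an implementation choice, not something prescribed by the paper) so that large n does not overflow, which requires 0 < p < 1.

```python
from math import exp, lgamma, log, log1p, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """Binomial probability mass, computed in log space (needs 0 < p < 1)."""
    log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log1p(-p))
    return exp(log_pmf)


def wald(x, n, z):
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson(x, n, z):
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def coverage(interval, n, p, alpha=0.05):
    """Equation (1): total probability of the x values whose interval contains p."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    total = 0.0
    for x in range(n + 1):
        lower, upper = interval(x, n, z)
        if lower <= p <= upper:
            total += binom_pmf(x, n, p)
    return total


if __name__ == "__main__":
    for name, fn in [("Wald", wald), ("Wilson", wilson)]:
        print(f"{name}: CPr(20, 0.1) = {coverage(fn, 20, 0.1):.3f}")
```

At n = 20 and p = 0.1 the Wald coverage falls well short of the nominal 95% level while the Wilson coverage sits close to it, which is the kind of small-sample contrast discussed throughout the paper.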

3.2 Expected Width

The expected width, which we denote EW, is given by

$$\mathrm{EW}(n,p)=\sum_{x=0}^{n}\binom{n}{x}p^{x}(1-p)^{n-x}\,(U_x-L_x), \tag{2}$$

and the expected margin of error, EMoE, is then given as EMoE(n,p)=EW(n,p)/2.
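Equation (2) and the expected margin of error can be evaluated in the same exact fashion. The sketch below does so for the Wilson interval and also divides EMoE by p, giving the expected relative margin of error that is used later in the paper. Function names are ours, and the log-space probability calculation is an implementation choice.

```python
from math import exp, lgamma, log, log1p, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """Binomial probability mass, computed in log space (needs 0 < p < 1)."""
    log_pmf = (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log1p(-p))
    return exp(log_pmf)


def wilson(x, n, z):
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def expected_width(interval, n, p, alpha=0.05):
    """Equation (2): probability-weighted average of U_x - L_x."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    total = 0.0
    for x in range(n + 1):
        lower, upper = interval(x, n, z)
        total += binom_pmf(x, n, p) * (upper - lower)
    return total


def expected_relative_margin(interval, n, p, alpha=0.05):
    """EMoE / p, where EMoE = EW / 2."""
    return expected_width(interval, n, p, alpha) / 2 / p


if __name__ == "__main__":
    rel = expected_relative_margin(wilson, 20, 0.1)
    print(f"Wilson, n = 20, p = 0.1: EMoE / p = {rel:.2f}")
```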

4 Calculating Sample Size

The first problem in CI estimation is determining the sample size required to achieve a desired estimation precision. A range of sample size determination methods has been discussed in the literature, e.g., Gonçalves et al. (2012), Korn (1986) and Liu and Bailey (2002), but here we adopt the common approach of deriving the sample size from the CI formula with fixed $\epsilon_R=\epsilon/p$. Used in conjunction with the Wald margin of error, one obtains

$$n=\left\lceil \frac{z_{\alpha/2}^{2}\,(1-p^*)}{\epsilon_R^{2}\,p^*} \right\rceil, \tag{3}$$

where $\lceil\cdot\rceil$ denotes the ceiling function, and $p^*$ denotes an anticipated value of p, i.e., an initial estimate. (Sample size formulas for the Clopper-Pearson, Wilson and Agresti-Coull intervals are given in Appendix A.)
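A short sketch of equation (3), with a hypothetical function name, shows how quickly the Wald sample size grows as the anticipated proportion shrinks for a fixed relative margin of error:

```python
from math import ceil
from statistics import NormalDist


def wald_sample_size(p_star, eps_rel, alpha=0.05):
    """Equation (3): Wald sample size for a relative margin of error eps_rel."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z**2 * (1 - p_star) / (eps_rel**2 * p_star))


if __name__ == "__main__":
    for p_star in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
        n = wald_sample_size(p_star, eps_rel=0.4)
        print(f"p* = {p_star:g}: n = {n:,}")
```

Because $(1-p^*)\approx 1$ for small $p^*$, the required n is roughly inversely proportional to $p^*$: each tenfold reduction in the anticipated proportion demands about ten times as many observations.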

4.1 Initial Estimate of p

Selecting a value for p* is required to make equation (3) operational, and this is an inherent practical challenge in any such sample size calculation. In some situations, it might be possible to overcome this problem by utilizing subject matter knowledge or results from a previous study. If no previous information is available, a common approach is to consider the value of p*=0.5, but we do not adopt this approach here given that the focus of this work is on small/rare-event probabilities.

In some situations one might be able to gain a reasonable estimate of p. Consider a manufacturing environment where, for example, past experience or consultation with process experts could provide an analyst with information on the order of magnitude of p. One may be able to deduce that the true proportion is more likely to be of the order 1 in 10,000 rather than 1 in 1,000. Such insight could be sufficient in setting a reasonable value for p* and subsequently determining an appropriate sample size. Such initial estimation of an unknown parameter is a topic worthy of further discussion but is beyond the scope of this work. Here the focus is on the performance of the intervals after an initial estimate has been obtained.

4.2 Margin of Error Relative to p

The required precision is largely analysis dependent; what could be considered reasonable accuracy in one setting might be completely inappropriate in another. Although a relative margin of error scheme cannot be rigidly prescribed, we suggest a general scheme to avoid intervals that are too wide to be practically useful, or indeed intervals that are too narrow, in the sense that a reasonable estimate could have been obtained using fewer resources.

To ensure that the margin of error is not larger than the order of magnitude of $p^*$, we impose $\epsilon_R \le 1$. Thus, one could consider $0 < \epsilon_R \le 1$ as a plausible margin of error scheme. However, considering $\epsilon_R$ values too close to the bound of 1 results in very wide intervals, whilst considering $\epsilon_R$ values too close to the bound of 0 results in very narrow intervals. As $\epsilon_R$ approaches zero, an increasingly large sample size is required, and such high precision is not likely to be required for many studies. Therefore, we suggest $\epsilon_R \in [0.1, 0.5]$ as a reasonable scheme. This scheme ensures that the interval is not impractically wide, nor excessively narrow (in terms of demanding very large sample sizes), and we show in Section 6 that acceptable coverage is achieved for this range of $\epsilon_R$ values.

A comparison of calculated sample sizes corresponding to $\epsilon_R=0.4$ is provided in Table 1. (Sample size values in this and subsequent tables are rounded to 2 significant digits.) We can see from Table 1 that the Wald, Wilson and Agresti-Coull sample sizes are similar across the $p^*$ range, but the Clopper-Pearson sample sizes are approximately 40% larger, which is indicative of this method’s conservatism. Note that Table 1 is only intended to provide an initial sense of the sample size requirements; final results can be found in Tables 10 and 11, which account for the fact that empirical coverage for binomial proportion interval estimators is non-monotonic in the sample size (see Sections 5 and 6).

Given that $\epsilon_R$ is a function of $p^*$, the above calculated sample sizes are only applicable to each specific $p^*$. Even if the true proportion p equals $p^*$ (i.e., the initial estimate is perfect), $\hat{p}$ will of course vary from sample to sample and may not equal p; hence, the realized relative margin of error will typically differ from $\epsilon_R$.

4.3 $\epsilon$-$p^*$ Compatibility

Next we illustrate the importance of defining the margin of error in relation to the magnitude of the proportion. Consider the following fixed margin of error schemes:

  • Scheme 1: $\epsilon=4\cdot10^{-2}$, $p^*=p$

  • Scheme 2: $\epsilon=4\cdot10^{-2}$, $p^*=0.5$

  • Scheme 3: $\epsilon=4\cdot10^{-4}$, $p^*=p$

  • Scheme 4: $\epsilon=4\cdot10^{-4}$, $p^*=0.5$

Table 2 displays the calculated Wald sample sizes and coverage probabilities corresponding to the above margin of error schemes. (A comparison of Wald, Clopper-Pearson, Wilson and Agresti-Coull coverage probabilities for the above $\epsilon$ schemes is given in Appendix B.) Referring to Table 2, fixing $\epsilon=4\cdot10^{-2}$ and considering $p^*=p$ (Scheme 1) creates sample sizes that reduce dramatically as a function of p. This results in coverage probabilities that are completely inadequate for $p \le 10^{-2}$. In Scheme 2, both $\epsilon$ and $p^*$ are fixed and this creates a constant sample size of $n=6\cdot10^{2}$. This sample size is reasonable for $p=10^{-1}$, but is insufficient for the remaining p values, which is reflected in the poor coverage performance.

Scheme 3 is similar to Scheme 1, but here, $\epsilon$ is reduced to $4\cdot10^{-4}$. This $\epsilon$-$p$ combination produces sufficient coverage for $p \ge 10^{-2}$, but deteriorates for the smaller p values. In Scheme 4, $p^*$ is fixed at 0.5 and $\epsilon=4\cdot10^{-4}$; this results in a constant sample size of $n=6\cdot10^{6}$, which produces good coverage throughout the p range, particularly for $p \ge 10^{-5}$. Whilst the coverage is satisfactory in this scheme, the magnitude of $\epsilon$ is not compatible with all p values, particularly $p=10^{-1}$ and $p \le 10^{-5}$. For $p=10^{-1}$ the resulting interval is [0.0996, 0.1004], which is too narrow in the sense that a reasonable interval could be obtained with a significantly reduced sample size. For $p=10^{-6}$ the interval is truncated at [0, 0.000401]. Here, even though the coverage is reasonable, the interval is too wide to be practically useful since its width is two orders of magnitude larger than p.
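The sample sizes behind the four fixed-margin schemes can be sketched by solving the Wald margin of error, $\epsilon = z_{\alpha/2}\sqrt{p^*(1-p^*)/n}$, for n. The function name and the grid of p values below are ours, and a 95% confidence level is assumed.

```python
from math import ceil
from statistics import NormalDist


def wald_n_fixed_margin(p_star, eps, alpha=0.05):
    """Wald sample size for a fixed (absolute) margin of error eps."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z**2 * p_star * (1 - p_star) / eps**2)


if __name__ == "__main__":
    p_grid = [10.0 ** -k for k in range(1, 7)]
    for eps in (4e-2, 4e-4):
        print(f"eps = {eps:g}, p* = 0.5: n = {wald_n_fixed_margin(0.5, eps):,}")
        for p in p_grid:
            print(f"eps = {eps:g}, p* = p = {p:g}: n = {wald_n_fixed_margin(p, eps):,}")
```

With $p^*=p$ the computed n collapses towards 1 as p decreases, whereas with $p^*=0.5$ it stays constant regardless of p; neither behavior adapts the precision to the magnitude of the proportion.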

The Wald sample sizes and coverage probabilities associated with the relative margin of error schemes $\epsilon_R \in \{0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75\}$ are given in Table 3.

From Table 3, we see that by considering $\epsilon$ in relation to the magnitude of p, the coverage probabilities are reasonable across the p range, but now, the analyst must choose a scheme such that the resulting interval’s width is appropriate. For example, consider $p^*=p=10^{-1}$ and $\epsilon_R=0.05$ where the resulting interval is [0.095, 0.105]. This interval is very narrow and the large sample size of $1.4\cdot10^{4}$ reflects this quite stringent margin of error. Moving to $\epsilon_R=0.75$ has the advantage of significantly reducing the sample size, but, of course, the interval is significantly wider at [0.025, 0.175]. To obtain intervals that are neither too liberal nor too conservative, that are reasonable in terms of coverage performance, and which avoid excessively large sample sizes, we recommend $\epsilon_R \in [0.1, 0.5]$ as a reasonable scheme.

The coverage values for the Clopper-Pearson, Wilson and Agresti-Coull intervals are similar to those shown in Table 3 for $\epsilon_R \le 0.5$. A comparison of coverage probabilities for $\epsilon_R=0.75$ is given in Appendix B.

4.4 Suitability of $\epsilon_R$ Scheme

A range of qualifications/criteria are often used to check the validity of using approximate CI estimators. Fleiss et al. (2003) state that the normal distribution provides excellent approximations to exact binomial procedures when $np \ge 5$ and $n(1-p) \ge 5$. Leemis and Trivedi (1996) also discuss the $np \ge 5$ (or 10) and $n(1-p) \ge 5$ (or 10) qualification.

We examine the proposed $\epsilon_R$ scheme to assess its compatibility with the qualification $np^* \ge a$ and $n(1-p^*) \ge a$, where $a \in \{5, 10\}$, in relation to the Wald sample size equation.

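Multiplying equation (3) by $p^*$ gives

$$np^*=\frac{z_{\alpha/2}^{2}\,(1-p^*)}{\epsilon_R^{2}} \ge a \quad\Longrightarrow\quad \epsilon_R \le \sqrt{\frac{z_{\alpha/2}^{2}\,(1-p^*)}{a}}. \tag{4}$$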

For a given n, $np^* < n(1-p^*)$ when $p^*<0.5$ and, therefore, equation (4) is sufficient in the small-p regime to ensure both $np^*$ and $n(1-p^*)$ are greater than a. An evaluation of $\epsilon_R$ for $p^* \in [10^{-6}, 10^{-1}]$ and $\alpha \in \{0.1, 0.05, 0.01\}$ is provided in Table 4, which shows that our suggested relative margin of error scheme, $\epsilon_R \in [0.1, 0.5]$, lies below the threshold of equation (4). (The case of $\epsilon_R=0.5$ negligibly exceeds the threshold of 0.493 for $p^*=10^{-1}$, $\alpha=0.1$.)
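The upper bound on $\epsilon_R$ in equation (4) is simple to tabulate. In the sketch below the function name is ours; evaluating it at $p^*=10^{-1}$, $\alpha=0.1$ with $a=10$ gives approximately 0.493, the threshold quoted above.

```python
from math import sqrt
from statistics import NormalDist


def eps_rel_threshold(p_star, alpha, a):
    """Largest relative margin of error for which n * p_star >= a (equation (4))."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt((1 - p_star) / a)


if __name__ == "__main__":
    for alpha in (0.1, 0.05, 0.01):
        for p_star in (1e-1, 1e-3, 1e-6):
            bound = eps_rel_threshold(p_star, alpha, a=10)
            print(f"alpha = {alpha}, p* = {p_star:g}: eps_R <= {bound:.3f}")
```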

4.5 Tolerances for Assessing CI Performance

In this section we suggest suitable tolerances for assessing interval performance in terms of coverage probability and relative margin of error. In relation to achieving a desired coverage probability, one usually considers $(1-\alpha)100 \pm \epsilon^*\%$, where $\epsilon^*$ denotes a predefined coverage tolerance. The definition of such a tolerance is dependent on the individual researcher and particular study, and is thus difficult to quantify. In one study $(1-\alpha)100 \pm 4\%$ might be acceptable, whilst in another, one might require $(1-\alpha)100 \pm 0.5\%$. We suggest that $\epsilon^* \in \{1, 2, 3\}$ would be reasonable tolerances for most analyses, and as such, consider acceptable expected coverage probabilities as $\mathrm{CPr} \in (1-\alpha)100 \pm 3\%$, where CPr is described in equation (1).

A tolerance is also necessary with regard to the relative margin of error. As with the coverage, the desired margin of error is dependent on the particular research question and hence can not be rigidly prescribed. However, as previously discussed, it is important that the magnitude of the margin of error reflect the magnitude of the estimated proportion.

Table 5 provides suggested tolerances for the assessment of $\epsilon_R$ and CPr for $(1-\alpha)100\%$ confidence intervals which could be considered reasonable in most settings.

5 Relative Margin of Error Central to Performance

Next we illustrate how the relative margin of error is fundamental to CI performance evaluation. We show that when a valid confidence interval is defined as achieving a desired coverage probability whilst simultaneously satisfying a minimum relative margin of error, the four interval estimators perform similarly for a given $n$-$p^*$-$\alpha$ combination.

To demonstrate CI performance we consider the expected relative margin of error, which we define as $\tilde{\epsilon}_R=\mathrm{EMoE}/p$, where EMoE is half of the expected width (see equation (2)). As discussed in Section 4.2, a relative margin of error exceeding 1 is not acceptable from a practical perspective, and, as per Section 4.4, we suggest that it should not exceed 0.5.

Figure 1 provides a 99% CI performance comparison for $p^*=p=10^{-1}$, and shows that the Wilson, Agresti-Coull and Clopper-Pearson intervals all achieve satisfactory coverage across the sample size range, whereas for $n \le 260$, the coverage of the Wald interval oscillates around the lower limit of 98%. For example, the coverage is satisfactory at $n=240$, but then falls below 98% at $n=260$. This phenomenon of coverage oscillation relates to the discreteness of the binomial distribution and has been previously discussed in the literature, e.g., Agresti and Coull (1998), Andersson (2023), Blyth and Still (1983), Brown et al. (2001), and Vollset (1993). For a given p, the empirical coverage does converge to the $(1-\alpha)100\%$ level with n as one would expect, but it does so in an oscillatory fashion for neighboring values of n. Figures 1 and 2 show that all four estimators suffer from this erratic behavior.

Whilst the coverage performance of the Wald interval is inferior to the other three intervals for $n<240$, none of the intervals satisfy the $\tilde{\epsilon}_R \le 0.5$ requirement at these lower sample sizes. Thus, by stipulating a minimum requirement for $\tilde{\epsilon}_R$, the poor coverage at small n is rendered irrelevant and the performance of the Wald interval is more comparable to the other three intervals when $\tilde{\epsilon}_R \le 0.5$.

A comparison of the performance of a 95% CI for $p^*=p=10^{-2}$ is given in Figure 2, which shows that the Wald interval achieves $\tilde{\epsilon}_R \le 0.5$ for $n \ge 1{,}600$. For $n \ge 1{,}600$, the Wald interval encounters five sample sizes where the coverage drops below the lower limit of 94%. The Clopper-Pearson interval requires a sample size of $n=2{,}000$ to satisfy both $\tilde{\epsilon}_R \le 0.5$ and $\mathrm{CPr} \in [94\%, 96\%]$, with the coverage exceeding the upper limit of 96% on seven occasions for $n>2{,}000$. The Agresti-Coull interval performs very well for $n \ge 1{,}800$, with just one value ($n=2{,}600$) failing to satisfy both CPr and $\tilde{\epsilon}_R$ thereafter. The Wilson interval provides the best performance, satisfying both the CPr and $\tilde{\epsilon}_R$ requirements for all $n \ge 1{,}600$.

Figures 1 and 2 highlight the similarities in performance when one considers $\tilde{\epsilon}_R$. In general, moderate-to-large sample sizes are required to satisfy both CPr and $\tilde{\epsilon}_R$ criteria, and at these sample sizes the performance across the four intervals is reasonably comparable.

6 CI Performance Tables

Tables 6 through 9 provide a 95% CI comparison for $p^*=p=10^{-1}$ and $p^*=p=10^{-6}$ across a range of sample sizes, and further illustrate the performance similarities among the estimators. The table cells are color coded according to the tolerances discussed in Table 5: target (green), acceptable (yellow), minimally acceptable (orange) and unacceptable (red).

We first consider $p^*=p=10^{-1}$ and $10 \le n \le 140$, with Table 6 showing that none of the intervals satisfy the desired CPr and $\tilde{\epsilon}_R$ requirements simultaneously. The importance of considering the relative margin of error in CI evaluation is clearly evident. In several cases the coverage probability lies within the desired tolerance but the excessive relative margin of error renders the estimate impractical. For example, referring to the Wilson interval, $\mathrm{CPr}(20, 10^{-1})=95.7\%$; however, $\tilde{\epsilon}_R=1.32$, which is not acceptable.

We see from Table 7 that by increasing n, $\tilde{\epsilon}_R \le 0.5$ is satisfied (at $n=150$, the Clopper-Pearson interval marginally exceeds 0.5, but is less than 0.5 thereafter). Each interval encounters sample sizes where CPr exceeds the bounds of ±1%, but overall, CPr is satisfactory.

As per Table 8, none of the intervals satisfy $\mathrm{CPr} \in [94\%, 96\%]$ and $\tilde{\epsilon}_R \le 0.5$ for $p^*=p=10^{-6}$ and $n \le 14\cdot10^{6}$. The Wilson method performs best in this scheme, and if the tolerances of $\mathrm{CPr} \in [93\%, 97\%]$ and $\tilde{\epsilon}_R \le 0.75$ were considered, it would produce a valid interval for all $n$.

Referring to Table 9, for $15\cdot10^{6} \le n \le 25\cdot10^{6}$ and $p^*=p=10^{-6}$, the Wald interval has the worst coverage, with four CPr values exceeding 95±2%. The Wilson and Agresti-Coull intervals perform the best, but overall, all four intervals perform well in this large sample size scheme, particularly if the coverage tolerance is taken as $\mathrm{CPr} \in 95 \pm 2\%$.

The CPr and $\tilde{\epsilon}_R$ performance across the $p^*$ range is given in Tables 10 and 11. Shown is the performance of each CI estimator at a selection of sample sizes of interest. The Wald, Wilson and Agresti-Coull methods perform similarly for a given $n$-$p^*$ combination, as shown in Table 10. The CPr and $\tilde{\epsilon}_R$ values of the Clopper-Pearson interval slightly exceed the desired limits, but overall, the performance is quite reasonable.

Table 11 gives the sample sizes required to maintain a desired level of CPr and $\tilde{\epsilon}_R$. Three performance schemes were investigated: (i) $\mathrm{CPr} \in 95 \pm 3\%$, $\tilde{\epsilon}_R \le 1$; (ii) $\mathrm{CPr} \in 95 \pm 2\%$, $\tilde{\epsilon}_R \le 0.75$; and (iii) $\mathrm{CPr} \in 95 \pm 1\%$, $\tilde{\epsilon}_R \le 0.5$. For each scheme, the Wilson interval required the smallest sample size to achieve (and maintain) the desired performance, thus providing further evidence of its overall superiority among the four estimators.
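As a small illustration of how the three performance schemes can be applied to a computed pair of coverage and expected relative margin of error, the sketch below returns the strictest scheme that a result satisfies. The function name, the ordering logic and the example inputs other than the Wilson result for n = 20 quoted in Section 6 are ours.

```python
# (label, coverage tolerance in percentage points, maximum relative margin of error)
SCHEMES = [
    ("iii", 1.0, 0.5),
    ("ii", 2.0, 0.75),
    ("i", 3.0, 1.0),
]


def strictest_scheme(coverage_pct, rel_margin, nominal_pct=95.0):
    """Return the label of the strictest scheme satisfied, or None if none is."""
    for label, tolerance, max_rel_margin in SCHEMES:
        within_coverage = abs(coverage_pct - nominal_pct) <= tolerance
        if within_coverage and rel_margin <= max_rel_margin:
            return label
    return None


if __name__ == "__main__":
    # Wilson interval at n = 20, p = 0.1: good coverage, but margin far too wide.
    print(strictest_scheme(95.7, 1.32))
    # Hypothetical result with tight coverage and a moderate relative margin.
    print(strictest_scheme(95.2, 0.4))
```

The first call returns `None` because a relative margin of error of 1.32 fails every scheme even though the coverage is within ±1% of nominal, which mirrors the paper's point that coverage alone is not sufficient.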

Table 12 shows the $\tilde{\epsilon}_R$ values pertaining to the sample sizes displayed in Table 11 ($\epsilon_R$ values were found to be very similar to the given $\tilde{\epsilon}_R$ values). It can be seen that in relation to a 95% CI, to ensure that the coverage remains within ±2%, the Wald interval requires $\tilde{\epsilon}_R \le 0.37$, whereas the Wilson interval requires $\tilde{\epsilon}_R=0.75$. The $\tilde{\epsilon}_R$ values corresponding to maintaining the coverage within 95±1% (green table cells) are in close agreement with our recommendation to use $\tilde{\epsilon}_R \in [0.1, 0.5]$.

7 Estimating a Rare Event with a Small Sample Size

It is clear from the above results that, as expected a priori, quite large sample sizes are required to accurately estimate rare-event probabilities. Achieving accuracy on the order of magnitude for a small p is usually most relevant in large populations, where it will also be possible to collect large samples. For example, a quality engineer may have little problem in obtaining high-throughput process data of order $n=10^{6}$ or greater, and, in this large-scale production setting, it will be critical to know whether the defect rate is, say, one in one thousand, or one in ten thousand. In Section 8, we assess three data-rich scenarios from the literature, i.e., cases that involve estimating a small proportion whilst utilizing large samples.

Notwithstanding the fact that accurate estimation of a small p is most important in large populations, an analyst may be faced with the challenge of estimating a small p with a limited sample size. We touched on this problem in Section 2.2, but now consider CI performance in more detail using the approach of Section 6.

Assume that the true proportion is of order 10⁻². As we have seen previously in Table 11, sample sizes of order n = 10³ will be needed to accurately estimate p. However, here, we assume that the analyst is dealing with a hard-to-reach population where n ≤ 100; the performance of the four intervals is displayed in Table 13.

It is clear that all four intervals perform quite poorly in this scenario, both in terms of coverage and relative margin of error. The coverage of the Wilson interval is notably better than the others and is reasonable for some sample sizes, albeit still somewhat erratic. This interval does achieve excellent coverage for n = 80, for example, but the relative margin of error is ϵ˜R ≈ 3, i.e., the margin of error of 0.03 is much larger than p = 0.01. If the analyst only requires a rough estimate of p, for example, to answer the question of whether or not it is less than 0.1, then such a large margin of error will be acceptable. On the other hand, if the aim is to accurately estimate the order of magnitude of p, this will not be achievable for such small sample sizes (and, clearly, performance will degrade further for even smaller p). This again highlights the importance of considering relative margin of error in the small-p setting, and our suggestion is to use ϵ˜R ∈ [0.1, 0.5].
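The n = 80 case can be checked with a short exact calculation. The following is a minimal sketch rather than the authors' code: it assumes the standard Wilson score interval, computes coverage by summing the binomial probabilities of all outcomes whose interval contains p, and takes the relative margin of error to be the Wilson half-width evaluated at the true p, divided by p (an assumption; ϵ˜R as used in the tables may be defined slightly differently). All function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist


def wilson_interval(x, n, z):
    """Wilson score interval for x successes in n trials."""
    p_hat = x / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def wilson_coverage(n, p, conf=0.95):
    """Exact coverage: binomial probability of outcomes whose interval contains p."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    total = 0.0
    for x in range(n + 1):
        lo, hi = wilson_interval(x, n, z)
        if lo <= p <= hi:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


def wilson_relative_margin(n, p, conf=0.95):
    """Wilson half-width at the true p, divided by p (assumed definition)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return half / p


# Small-sample scenario from this section: p = 0.01 and n <= 100.
for n in (20, 40, 60, 80, 100):
    print(n, round(wilson_coverage(n, 0.01), 3), round(wilson_relative_margin(n, 0.01), 2))
```

For n = 80 this gives coverage of roughly 95% together with a relative margin of error of about 3, which is the combination described in the text: good coverage, but an interval far too wide to pin down the order of magnitude of p.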

An anonymous reviewer advised us of two modern CI estimation approaches: an asymptotic method based on generalized fiducial inference (GFI) (Hannig, 2009), and a recently-developed exact method known as the “repro samples” method (Xie and Wang, 2022). The GFI method is an extension of the fiducial argument proposed by Fisher (1930), and the repro method is a simulation-based method that provides a finite sample CI coverage guarantee, which is particularly useful in small samples. We have tested both of these more modern methods (see Appendix C), and have found that they provide reasonable coverage (starting from a conservative position akin to the exact Clopper-Pearson method). However, when p and n are small, the methods experience the same issues as the classical methods we have considered; in particular, the relative margin of error is too large to be used in settings where the order of magnitude of a small p is of interest. (It is noteworthy, however, that the GFI and repro sampling methods are general inference procedures that provide good finite-sample performance in a wide range of problems beyond proportion estimation.) Ultimately, all of our work points to the fact that large samples are required in this small-p setting, and we have provided guidelines in Section 6.

8 Case Studies

We now consider the use of the relative margin of error in the estimation of small/rare-event proportions using data from the literature. More specifically, we consider: (i) a study on the prevalence of ADHD prescriptions in adolescents, (ii) a clinical trial relating to COVID-19 vaccine efficacy, and (iii) accident data from commercial jet aircraft records.

Using the values of n and p̂ reported in each of the aforementioned studies, we evaluate the validity of a 95% Wald CI in terms of the interval’s relative margin of error. We discuss the Wald interval as it is the most commonly used interval estimator, and for each of these case studies, it produces similar results to the Clopper-Pearson, Wilson and Agresti-Coull intervals. We also refer to our sample size calculations/CI performance analyses to assess the suitability of the sample size in terms of achieving the desired coverage.

8.1 Assessing Prevalence of ADHD Medication

The first study we consider was conducted by Sawyer et al. (2017) to assess the prevalence of stimulant and antidepressant medication in Australian children and adolescents with symptoms of ADHD (Attention-Deficit/Hyperactivity Disorder) and major depressive disorder (MDD). A nationally representative sample of n = 6,310 children between the ages of 4 and 17 was obtained, and the study found that 13.7% of those with symptoms meeting the criteria of ADHD had used stimulant medications.

For a sample size of n = 6,310 and an estimated proportion of p̂ = 0.137, the 95% Wald CI is [0.129, 0.145], with a realized relative margin of error of ϵR̂ = ϵ̂/p̂ ≈ 0.062. Using the Delta method (see Appendix D), a 95% CI for ϵR is given as [0.060, 0.064]. (For each case study, a CI for ϵ˜R was obtained using Monte Carlo simulation in conjunction with the Delta method, and each was found to be in agreement with the corresponding CI estimate for ϵR.) For this study, the ϵR CI values fall outside of our recommended range of ϵR ∈ [0.1, 0.5], meaning that the interval is somewhat narrower than what we recommend, i.e., one could achieve an acceptable result with fewer observations. Indeed, Table 14 displays the sample sizes for a selection of ϵR̂ values in this range; note, for example, that ϵR̂ = 0.4 leads to a sample size approximately forty times smaller than the sample size used in the study.

It can also be seen from Table 14 that a relative margin of error of ϵR̂ = 0.2 corresponds to a CI which is similar to that computed at ϵR̂ = 0.06, but uses a sample size that is approximately ten times smaller. Had the order of magnitude of p been known in advance (e.g., if it was known that p ≈ 1/10, rather than p ≈ 1/100), then a smaller sample size would have sufficed. When p is of the order 10⁻², very good coverage is achieved for n = 1,800 (see Table 10), hence the study sample size of n = 6,310 is more than adequate for the scenarios where p = 10⁻¹ and p = 10⁻².
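The figures quoted for this case study can be reproduced approximately from the Wald formula alone. The sketch below is illustrative and makes two assumptions: the sample size for a target relative margin of error is obtained by solving z·√(p(1−p)/n) = ϵR·p for n and rounding up, and the reported p̂ = 0.137 is plugged in for p. The entries of Table 14 may have been computed differently, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def wald_summary(n, p_hat, conf=0.95):
    """Wald interval and its realized relative margin of error."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - margin, p_hat + margin), margin / p_hat


def wald_n_for_relative_margin(p, eps_r, conf=0.95):
    """Smallest n with z*sqrt(p(1-p)/n) <= eps_r * p (Wald formula solved for n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil(z * z * (1 - p) / (p * eps_r**2))


# Values reported in Section 8.1: n = 6,310 and p_hat = 0.137.
ci, eps_r = wald_summary(6310, 0.137)
print(ci, eps_r)
for target in (0.1, 0.2, 0.4, 0.5):
    n_req = wald_n_for_relative_margin(0.137, target)
    print(target, n_req, round(6310 / n_req, 1))
```

Under these assumptions the required sample size scales as 1/ϵR², which is why moving from ϵR ≈ 0.06 to ϵR = 0.2 or 0.4 reduces n by roughly one order of magnitude or more.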

8.2 COVID-19 Vaccine Efficacy

An efficacy trial of the BNT162b2 mRNA COVID-19 vaccine was conducted by Polack et al. (2020). In this placebo-controlled, observer-blind trial, 43,548 participants were randomly assigned either the BNT162b2 vaccine or a placebo treatment. Of the 21,720 participants who received the vaccine, there were 8 cases of COVID-19 after the second dose. This leads to an estimated proportion of p̂ = x/n = 8/21,720 ≈ 3.7·10⁻⁴, and, therefore, the 95% Wald CI is given by [1.1·10⁻⁴, 6.2·10⁻⁴], with ϵR̂ ≈ 0.69, and a 95% CI for ϵR of [0.421, 0.875]. As per Section 4.2 we recommend ϵR ∈ [0.1, 0.5], which overlaps with the above interval. However, the CI lower bound is very close to our recommended ϵR upper bound, and thus, in our suggested scheme, the computed interval could be questioned with regard to its width.

Aside from having a somewhat large margin of error (relative to p̂), we need to consider the sample size in relation to the order of magnitude of p̂. That is, we must assess whether the sample size is large enough to provide acceptable coverage. For example, if p were 10⁻⁴, Table 11 indicates that a sample size of the order 10⁵ is required, whereas, here, the sample size is of the order 10⁴. Indeed, we have calculated that, with p = 3.7·10⁻⁴ and n = 21,720, the expected coverage of the Wald interval is just 89.2%. The Clopper-Pearson, Wilson and Agresti-Coull intervals perform better for this (n, p) combination, achieving coverage of 96.9%, 95.2% and 95.2% (respectively). However, all three of these intervals have ϵR̂ > 0.72. Thus, to obtain a CI estimate where the margin of error is more consistent with p̂, and/or to enhance the coverage probability, a larger sample size would be required.
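The coverage figures for this (n, p) combination can be approximated by an exact binomial calculation. The sketch below is not the authors' code: it assumes the standard Wald and Wilson formulas, sets the true p equal to 8/21,720 (which the text rounds to 3.7·10⁻⁴), and evaluates the binomial probabilities on the log scale so that n = 21,720 poses no numerical difficulty. Function names are illustrative, and the Clopper-Pearson and Agresti-Coull intervals are omitted for brevity.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def log_binom_pmf(x, n, p):
    """log P(X = x) for X ~ Binomial(n, p); stable for large n."""
    return (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
            + x * log(p) + (n - x) * log(1 - p))


def wald(x, n, z):
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson(x, n, z):
    p_hat = x / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def exact_coverage(interval, n, p, conf=0.95):
    """Sum the probabilities of all outcomes x whose interval contains p."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n, z)
        if lo <= p <= hi:
            total += exp(log_binom_pmf(x, n, p))
    return total


n, p = 21720, 8 / 21720  # vaccine-arm sample size; p set to the observed proportion
print("Wald:", round(exact_coverage(wald, n, p), 3))
print("Wilson:", round(exact_coverage(wilson, n, p), 3))
```

Under these assumptions the Wald coverage comes out close to 89% and the Wilson coverage close to 95%, in line with the values quoted above.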

8.3 Commercial Aircraft Accidents

A summary of annual commercial jet aircraft flight hours, departures and accidents is provided by Boeing (2022). In the year 2021, there were 21.6 million aircraft departures with a total of 23 recorded incidents/accidents. Although all aircraft departures and accidents are recorded here, we may still view this as a sample from a larger population of flights that might have taken place (had demand been higher) or indeed for flights in upcoming years (provided that conditions such as aviation regulations and the composition of aircraft fleets remain similar). Therefore, it is still of interest to compute a confidence interval in this scenario, and, irrespective of the specific target population, the data still suffice for the purpose of demonstrating our proposed scheme.

For these data, p̂ ≈ 1.1·10⁻⁶, which gives a 95% Wald CI of [0.66·10⁻⁶, 1.54·10⁻⁶], with ϵR̂ ≈ 0.40 (95% CI for ϵR of [0.328, 0.494]). This relative margin of error is consistent with our recommendation of ϵR ∈ [0.1, 0.5], and as discussed in Section 4.3, provides a satisfactory estimator. Indeed, following the approach of Table 10, we have found that, when p = 1.1·10⁻⁶, all four estimators satisfy CPr = 95±1% and ϵ˜R ≤ 0.5 when n = 21.6·10⁶.

In the context of estimating the proportion of aircraft accidents, the analyst has no control over the sample size. That is to say, had a larger sample size been required, one would simply have to wait for more aircraft departures to occur. However, our analysis provides us with reassurance that our computed interval will perform satisfactorily.

9 Discussion

When constructing confidence intervals for small success probabilities it is important that the margin of error, ϵ, be considered relative to the magnitude of the proportion, p. Incompatibilities between ϵ and p can lead to completely unsatisfactory coverage or unnecessarily narrow intervals that require extremely large sample sizes. When dealing with moderate success probabilities, say p ≥ 0.2, this is less important, but in the context of small or rare-event success probabilities, the consideration of ϵ relative to p is crucial to reduce the possibility of substantial mismatching between ϵ and p. For example, ϵ = 0.05 might be considered as valid precision for p = 10⁻¹, but such a margin of error is not compatible with a proportion of the order p = 10⁻³.
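The mismatch in this example is easiest to see by expressing the same ϵ relative to each value of p, using the definition ϵR = ϵ/p:

```latex
\epsilon_R = \frac{\epsilon}{p}: \qquad
\frac{0.05}{10^{-1}} = 0.5, \qquad
\frac{0.05}{10^{-3}} = 50 .
```

In the first case the margin of error is half the size of p; in the second it is fifty times larger than p, so the resulting interval conveys essentially nothing about the order of magnitude of p.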

To ensure ϵ is compatible with the order of magnitude of p, we recommend using a relative margin of error scheme, ϵR. We suggest restricting the range of values to ϵR ∈ [0.1, 0.5], as higher values lead to imprecision and poor interval coverage, whereas lower values lead to sample sizes that are likely to be impractically large for many studies. Our recommendation of ϵR ∈ [0.1, 0.5] avoids intervals that are impractically wide or restrictively narrow in terms of sample size requirements, and we show that adequate performance is achieved within this range. In contrast to the existing literature, we have highlighted the importance of the relative margin of error, ϵR, in conjunction with the empirical coverage, when assessing CI performance in the small-p setting. When both criteria are considered simultaneously, the Wald, Clopper-Pearson (exact), Wilson and Agresti-Coull intervals perform similarly in many cases. In general, all four intervals fail to satisfy both criteria when the sample size is small, with improved performance at larger sample sizes as expected. For example, for a 95% confidence interval when p = 10⁻¹, none of the methods produce a satisfactory interval for 10 ≤ n ≤ 140. Each interval achieves the nominal coverage of 95% at some (albeit not all) sample sizes in this range, but in each case the desired limit of ϵR ≤ 0.5 is exceeded. Once the sample size is increased (n ≥ 150), and the ϵR requirement is satisfied, all four intervals perform well in terms of coverage.

The coverage probabilities of the Wald and Clopper-Pearson intervals for small n are generally poor, particularly in comparison to the Wilson and Agresti-Coull intervals. However, the considerable difference in coverage in such situations is rendered immaterial once the (we believe reasonable) requirement that ϵR ≤ 0.5 is considered. When satisfactory performance is defined as achieving a desired CPr and ϵR, the performance across these commonly-used intervals is much more comparable, particularly if one considers empirical coverage in the range (1−α)100 ± 2%. In this relative margin of error framework the criticisms of inadequate coverage for the Wald interval, and excessive conservatism for the Clopper-Pearson interval, are somewhat alleviated, and all four intervals perform quite similarly. Although there are performance similarities, the Wilson and Agresti-Coull intervals are generally superior to the intervals of Wald and Clopper-Pearson. The Wilson and Agresti-Coull intervals achieve similar CPr and ϵR values for given (n, p, α) combinations; however, the Wilson interval is narrower and achieves favorable performance at lower sample sizes.

When the success probability is small, failure to consider the margin of error relative to the order of magnitude of the estimated proportion can result in poor coverage, and/or intervals which are unnecessarily narrow or excessively wide. As shown in the case studies presented in Section 8, the relative margin of error criterion provides a simple and effective assessment of the validity of an estimated interval in terms of its width/margin of error. For example, we have shown that all of the interval estimators considered in this paper performed poorly for the COVID-19 study (Section 8.2) in terms of the relative margin of error, meaning that the confidence intervals were all impractically wide, and the Wald interval also had notably poor coverage. It is important to ensure that the interval precision is compatible with the order of magnitude of p. The relative margin of error serves as a useful evaluation criterion in this regard, and as such, we suggest that it should be considered when planning statistical studies.

Acknowledgments

The authors would like to thank two reviewers, an associate editor, and the editor for their useful comments and suggestions. The paper was much improved as a result of their feedback.

Funding

This work was funded by the Technological University of the Shannon and Science Foundation Ireland (via the Confirm Smart Manufacturing Research Centre, grant number: 16/RC/3918).

Declaration of Interest

The authors report that there are no competing interests to declare.

References

  • Agresti, A. and Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2):119–126.
  • Andersson, P. G. (2023). The Wald confidence interval for a binomial p as an illuminating “bad” example. The American Statistician, 74(4):443–448.
  • Blyth, C. R. and Still, H. A. (1983). Binomial confidence intervals. Journal of the American Statistical Association, 78(381):108–116.
  • Boeing (2022). Statistical summary of commercial jet airplane accidents. https://www.boeing.com/resources/boeingdotcom/company/about_bca/pdf/statsum.pdf. [Online] [Accessed on 04-December-2023].
  • Böhning, D. (1994). Better approximate confidence intervals for a binomial parameter. The Canadian Journal of Statistics, 22(2):207–218.
  • Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2):101–133.
  • Clopper, C. J. and Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(1):404–413.
  • Evans, J. R. and Lindsay, W. M. (2015). An introduction to Six Sigma & process improvement. CENGAGE Learning, USA, 2nd edition.
  • Fisher, R. (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, xxvi:528–535.
  • Fleiss, J. L., Levin, B., and Cho Paik, M. (2003). Statistical methods for rates and proportions. John Wiley & Sons, New Jersey, 3rd edition.
  • Gonçalves, L., De Oliveira, M. R., Pascoal, C., and Pires, A. (2012). Sample size for estimating a binomial proportion: comparison of different methods. Journal of Applied Statistics, 39(11):2453–2473.
  • Hannig, J. (2009). On generalized fiducial inference. Statistica Sinica, 19(2):491–544.
  • Kabaila, P., Welsh, A. H., and Abeysekera, W. (2016). Model-averaged confidence intervals. Scandinavian Journal of Statistics, 43(1):35–48.
  • Korn, E. L. (1986). Sample size tables for bounding small proportions. Biometrics, 42(1):213–216.
  • Krishnamoorthy, K. and Peng, J. (2007). Some properties of the exact and score methods for binomial proportion and sample size calculation. Communications in Statistics - Simulation and Computation, 36(6):1171–1186.
  • Leemis, L. M. and Trivedi, K. S. (1996). A comparison of approximate interval estimators for the Bernoulli parameter. The American Statistician, 50(1):63–68.
  • Liu, W. and Bailey, B. J. R. (2002). Sample size determination for constructing a constant width confidence interval for a binomial success probability. Statistics & Probability Letters, 56(1):1–5.
  • Lwanga, S. K. and Lemeshow, S. (1991). Sample size determination in health studies: A practical manual. World Health Organisation, Geneva.
  • Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine, 17(1):857–872.
  • Park, H. and Leemis, L. M. (2019). Ensemble confidence intervals for binomial proportions. Statistics in Medicine, 38(1):3460–3475.
  • Pires, A. M. and Amado, C. (2008). Interval estimators for a binomial proportion: comparison of twenty methods. REVSTAT - Statistical Journal, 6(2):165–197.
  • Polack, F. P., Thomas, S. J., Kitchin, N., Absalon, J., Gurtman, A., Lockhart, S., Perez, J. L., Pérez Marc, G., Moreira, E. D., Zerbini, C., Bailey, R., Swanson, K. A., Roychoudhury, S., Koury, K., Li, P., Kalina, W. V., Cooper, D., Frenck, R. W., Hammitt, L. L., and …Gruber, W. C. (2020). Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. The New England Journal of Medicine, 383(27):2603–2615.
  • Sawyer, M. G., Reece, C. E., Sawyer, A. C., Johnson, S., Lawrence, D., and Zubrick, S. R. (2017). The prevalence of stimulant and antidepressant use by Australian children and adolescents with attention-deficit/ hyperactivity disorder and major depressive disorder: a national survey. Journal of Child and Adolescent Psychopharmacology, 27(2):177–184.
  • Thulin, M. (2014). Coverage-adjusted confidence intervals for a binomial proportion. Scandinavian Journal of Statistics, 41(1):291–300.
  • Turek, D. and Fletcher, D. (2012). Model-averaged Wald confidence intervals. Computational Statistics and Data Analysis, 56(1):2809–2815.
  • Vollset, S. E. (1993). Confidence intervals for a binomial proportion. Statistics in Medicine, 12(1):809–824.
  • Vos, P. and Hudson, S. (2005). Evaluation criteria for discrete confidence intervals: beyond coverage and length. The American Statistician, 59(2):137–142.
  • Wilson, E. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(1):209–212.
  • Woodall, W. H. and Montgomery, D. C. (2014). Some current directions in the theory and application of statistical process monitoring. Journal of Quality Technology, 46(1):78–94.
  • Xie, M. and Wang, P. (2022). Repro samples method for finite- and large-sample inferences. arXiv preprint arXiv:2206.06421.

Appendix

A Sample Size Formulas

Clopper-Pearson Interval

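Letting $n_{LB}$ be the minimum n satisfying $p^* - \mathrm{Beta}\left(\alpha/2;\, np^*,\, n(1-p^*)+1\right) \le \epsilon$, and letting $n_{UB}$ be the minimum n satisfying $\mathrm{Beta}\left(1-\alpha/2;\, np^*+1,\, n(1-p^*)\right) - p^* \le \epsilon$, the sample size is given by $n = \max\{n_{LB}, n_{UB}\}$.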

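Wilson Interval

$$\epsilon = \frac{z_{\alpha/2}\sqrt{p(1-p)/n + z_{\alpha/2}^2/(4n^2)}}{1 + z_{\alpha/2}^2/n}$$

$$n = \max\left\{\frac{z_{\alpha/2}^2}{2\epsilon^2}\left(p^*(1-p^*) - 2\epsilon^2 \pm \sqrt{\epsilon^2\left(1 - 4p^*(1-p^*)\right) + \left(p^*(1-p^*)\right)^2}\right)\right\}$$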
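The closed form for the Wilson sample size can be checked numerically by substituting the resulting n back into the half-width expression. The sketch below is a consistency check under the formulas stated in this appendix, not the authors' implementation; it returns the real-valued root (rounding up to an integer would be applied in practice), and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_half_width(n, p, z):
    """Wilson margin of error for sample size n at proportion p."""
    return z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)


def wilson_n(p_star, eps, conf=0.95):
    """Larger root of the quadratic in n implied by the half-width equation."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    q = p_star * (1 - p_star)
    root = sqrt(eps**2 * (1 - 4 * q) + q**2)
    return z * z / (2 * eps**2) * (q - 2 * eps**2 + root)


# Example: p* = 0.01 with a relative margin of error of 0.5, i.e. eps = 0.005.
z95 = NormalDist().inv_cdf(0.975)
n_real = wilson_n(0.01, 0.005)
print(n_real, wilson_half_width(n_real, 0.01, z95))
```

The second printed value should equal the requested ϵ up to floating-point error, confirming that the "+" root is the one that solves the original equation for positive n.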

Agresti-Coull Interval

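Letting $\tilde{p} = (np + z_{\alpha/2}^2/2)/(n + z_{\alpha/2}^2)$ and $\tilde{n} = n + z_{\alpha/2}^2$:

$$\epsilon = z_{\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}$$

$$n = \max\left\{-\frac{1}{3a}\left(b + \xi^k C + \frac{\Delta_0}{\xi^k C}\right)\right\}; \quad k \in \{0, 1, 2\}$$

where

$$a = 4\epsilon^2$$

$$b = 4z_{\alpha/2}^2\left(3\epsilon^2 - p^*(1-p^*)\right)$$

$$\xi = \frac{-1 + \sqrt{-3}}{2}$$

$$C = \sqrt[3]{\frac{\Delta_1 \pm \sqrt{\Delta_1^2 - 4\Delta_0^3}}{2}}$$

$$\Delta_0 = 16z_{\alpha/2}^4\left(3\epsilon^2 - p^*(1-p^*)\right)^2 - 24\epsilon^2 z_{\alpha/2}^4\left(6\epsilon^2 - 1\right)$$

$$\Delta_1 = 128z_{\alpha/2}^6\left(3\epsilon^2 - p^*(1-p^*)\right)^3 + 432\epsilon^4 z_{\alpha/2}^6\left(4\epsilon^2 - 1\right) - 288\epsilon^2 z_{\alpha/2}^6\left(3\epsilon^2 - p^*(1-p^*)\right)\left(6\epsilon^2 - 1\right)$$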
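The cubic solution for the Agresti-Coull sample size can be evaluated with complex arithmetic and then verified against the half-width formula. The following sketch assumes the quantities a, b, ξ, C, Δ0 and Δ1 as defined in this appendix and uses the principal complex cube root for C; it is an illustration of the general cubic formula, not the authors' code, and it does not handle the degenerate case in which C vanishes for both signs. Function names are hypothetical.

```python
import cmath
from math import sqrt
from statistics import NormalDist


def agresti_coull_half_width(n, p, z):
    """Agresti-Coull margin of error at proportion p."""
    n_tilde = n + z * z
    p_tilde = (n * p + z * z / 2) / n_tilde
    return z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)


def agresti_coull_n(p_star, eps, conf=0.95):
    """Largest real root of the cubic in n, via the general cubic formula."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    z2 = z * z
    g = 3 * eps**2 - p_star * (1 - p_star)
    a = 4 * eps**2
    b = 4 * z2 * g
    d0 = 16 * z2**2 * g**2 - 24 * eps**2 * z2**2 * (6 * eps**2 - 1)
    d1 = (128 * z2**3 * g**3 + 432 * eps**4 * z2**3 * (4 * eps**2 - 1)
          - 288 * eps**2 * z2**3 * g * (6 * eps**2 - 1))
    disc = cmath.sqrt(d1 * d1 - 4 * d0**3)
    c_cubed = (d1 + disc) / 2
    if abs(c_cubed) == 0:  # use the other sign if this choice vanishes
        c_cubed = (d1 - disc) / 2
    big_c = c_cubed ** (1 / 3)  # principal complex cube root
    xi = complex(-0.5, sqrt(3) / 2)
    roots = [-(b + xi**k * big_c + d0 / (xi**k * big_c)) / (3 * a) for k in range(3)]
    # Keep roots that are real up to rounding error, then take the largest.
    real_roots = [r.real for r in roots if abs(r.imag) < 1e-6 * max(1.0, abs(r.real))]
    return max(real_roots)


z95 = NormalDist().inv_cdf(0.975)
n_real = agresti_coull_n(0.01, 0.005)
print(n_real, agresti_coull_half_width(n_real, 0.01, z95))
```

As with the Wilson case, the printed half-width should match the requested ϵ up to floating-point error; in practice a simple numerical root-finder on the half-width equation would be an equally reasonable alternative to the closed form.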

B Fixed Margin of Error Schemes

Table 15 shows that for the fixed margin of error scheme ϵ = 4·10⁻⁴, p* = 0.5, all four intervals produce coverage close to the nominal value. For the remaining three margin of error schemes the Wald interval produces very poor coverage in comparison to the other intervals. For example, in Scheme 1 (ϵ = 4·10⁻², p* = p), the Wald coverage for p = 10⁻² is just 21.4%, whereas the coverage of the Clopper-Pearson, Wilson and Agresti-Coull intervals is 97.7%.

Whilst the Clopper-Pearson, Wilson and Agresti-Coull intervals achieve better coverage, all four intervals fail to produce satisfactory coverage if the margin of error scheme is not compatible with the magnitude of p. For example, in Scheme 3 (ϵ = 4·10⁻⁴, p* = p) the Wald interval produces a coverage of just 0.2% for p = 10⁻⁵. However, the Clopper-Pearson, Wilson and Agresti-Coull intervals achieve coverage of 99.8%, 99.8% and 100% respectively. Whilst these coverage probabilities are much closer to the desired 95% coverage than the Wald’s 0.2%, they are too far from the nominal value to be of practical use.
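The collapse of the Wald coverage under an incompatible fixed margin of error can be illustrated with a short calculation. The sketch below assumes the usual Wald sample size n = z²p*(1−p*)/ϵ² rounded up to the next integer (the rounding rule is an assumption) and then computes the exact coverage at that n; it is an illustration, not the code used to produce Table 15.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value


def wald_n_fixed_margin(p_star, eps):
    """Wald sample size for an absolute margin of error eps (rounded up)."""
    return ceil(Z * Z * p_star * (1 - p_star) / eps**2)


def wald_coverage(n, p):
    """Exact coverage of the 95% Wald interval at true proportion p."""
    total = 0.0
    for x in range(n + 1):
        p_hat = x / n
        half = Z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half <= p <= p_hat + half:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


# Scheme 1: eps = 4e-2 with p* = p = 1e-2; Scheme 3: eps = 4e-4 with p* = p = 1e-5.
for eps, p in ((4e-2, 1e-2), (4e-4, 1e-5)):
    n = wald_n_fixed_margin(p, eps)
    print(eps, p, n, round(wald_coverage(n, p), 4))
```

In both schemes the chosen ϵ is several times larger than p, so the implied n is small and x = 0 is by far the most likely outcome. The Wald interval then degenerates to the single point 0 and cannot contain p, which accounts for almost all of the lost coverage.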

Referring to Table 16, for ϵR = 0.75 and p* = p ≤ 10⁻², the Wald coverage is considerably poorer than that of the other three intervals. Both the Clopper-Pearson and Agresti-Coull intervals produce reasonable coverage at approximately 97%. The Wilson interval performs the best, producing coverage very close to the nominal 95% for all p.

C CI Estimation using GFI and Repro Samples Methods

To compare the performance of the Wald, Clopper-Pearson, Wilson and Agresti-Coull methods against the more modern GFI (Hannig, 2009) and repro sampling (Xie and Wang, 2022) methods, we consider p ∈ [10⁻¹, 10⁻⁴], with sample sizes ranging from n = 20 to n = 10⁴. This covers a range of scenarios from the main paper, including larger samples as per Section 6 and smaller samples as per Section 7. Because these more modern methods are simulation-based, exact performance calculations of the kind carried out in the main paper are more challenging. Therefore, here, we have carried out a simulation study with a large number of replicates (5000). The results are presented in Table 17.

We can see from Table 17 that, in terms of coverage, both of these methods converge towards the nominal 95% level as the sample size increases (starting from a conservative point akin to the exact Clopper-Pearson interval). Interestingly, in some of the more challenging scenarios presented here, e.g., p = 10⁻³ with n = 10³ and p = 10⁻⁴ with n = 10⁴, one or both of these methods provide the best empirical coverage (albeit still 2-3 percentage points away from the nominal level in these scenarios). In any case, these methods do not appear to offer dramatic improvements on the other methods considered, and, in particular, the key issue of ϵ˜R being large for small n is intrinsic to all of the methods. It is important to recognize, however, that the GFI and repro sampling methods are not limited only to proportion estimation, but, rather, are general inferential techniques that extend far beyond this to many other problems.

D CI For ϵR

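By the Delta method, $f(\hat{p}) \sim N\left(f(p),\, f'(p)^2 \sigma_{\hat{p}}^2\right)$ as $n \to \infty$. Given $f(\hat{p}) = \hat{\epsilon}_R = \hat{\epsilon}/\hat{p}$ and $\sigma_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}$, a $(1-\alpha)100\%$ Wald CI for $\epsilon_R$ is given by

$$\frac{\tilde{z}}{\sqrt{n}}\sqrt{\frac{1-\hat{p}}{\hat{p}}} \pm z_{\alpha/2}\,\frac{\tilde{z}}{\sqrt{4n\hat{p}^4}}\sqrt{\frac{\hat{p}}{1-\hat{p}}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$$

where $\tilde{z}$ is the $(1-\tilde{\alpha}/2)$ quantile of the standard normal distribution, and $\tilde{\alpha}$ is the significance level pertaining to $\hat{\epsilon}$.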
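The Delta-method interval is straightforward to evaluate for reported values of n and p̂. The sketch below is a direct transcription of the expression above under the assumption that ϵ̂ is the Wald margin of error; the function and argument names are illustrative. It is checked here only against the Section 8.1 values.

```python
from math import sqrt
from statistics import NormalDist


def relative_margin_ci(n, p_hat, conf_margin=0.95, conf_ci=0.95):
    """Delta-method Wald interval for the relative margin of error of a Wald CI."""
    z_tilde = NormalDist().inv_cdf(1 - (1 - conf_margin) / 2)  # level used for eps itself
    z = NormalDist().inv_cdf(1 - (1 - conf_ci) / 2)            # level of the CI for eps_R
    eps_r = z_tilde / sqrt(n) * sqrt((1 - p_hat) / p_hat)
    # |f'(p_hat)| times sigma_p_hat; the product simplifies to z_tilde / (2 n p_hat)
    deriv = z_tilde / sqrt(4 * n * p_hat**4) * sqrt(p_hat / (1 - p_hat))
    sigma = sqrt(p_hat * (1 - p_hat) / n)
    half = z * deriv * sigma
    return eps_r, (eps_r - half, eps_r + half)


# ADHD case study (Section 8.1): n = 6,310 and p_hat = 0.137.
eps_r, (lo, hi) = relative_margin_ci(6310, 0.137)
print(round(eps_r, 3), round(lo, 3), round(hi, 3))
```

With n = 6,310 and p̂ = 0.137 this returns an interval of roughly [0.060, 0.064], as reported in Section 8.1. Note that the standard error term reduces to z̃/(2np̂), so the width of the interval for ϵR is driven by the expected number of events np̂.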

Figure 1: CPr versus ϵ˜R for p* = p = 10⁻¹, α = 0.01. Dashed (gray) line represents the nominal CPr value. Dot-dashed (red) lines represent the target CPr and ϵ˜R tolerances from Table 5. Sample size range is from n = 100 to n = 800, in steps of 20. Labels shown adjacent to data points depict the first n where both CPr and ϵ˜R requirements are satisfied. Additional data labels are referred to within the text.

Figure 2: CPr versus ϵ˜R for p* = p = 10⁻², α = 0.05. Dashed (gray) line represents the nominal CPr value. Dot-dashed (red) lines represent the target CPr and ϵ˜R tolerances from Table 5. Sample size range is from n = 1,000 to n = 8,000, in steps of 200. Labels shown adjacent to data points depict the first n where both CPr and ϵ˜R requirements are satisfied. Additional data labels are referred to within the text.

Table 1: Sample size comparison for ϵR=0.4

Table 2: Wald-based sample size comparison - fixed ϵ

Table 3: Wald-based sample size comparison - variable ϵ

Table 4: ϵR thresholds as per equation (4) for a{5,10}

Table 5: Coverage & relative margin of error tolerances

Table 6: 95% CI performance - p* = p = 10⁻¹, “small” n

Table 7: 95% CI performance - p* = p = 10⁻¹, “large” n

Table 8: 95% CI performance - p* = p = 10⁻⁶, “small” n

Table 9: 95% CI performance - p* = p = 10⁻⁶, “large” n

Table 10: 95% CI performance comparison - n

Table 11: 95% CI sample size comparison

Table 12: ϵ˜R values corresponding to maintained coverage

Table 13: 95% CI performance - p* = p = 10⁻², small n

Table 14: 95% Wald CI - ϵR̂ comparison

Table 15: Coverage comparison - fixed ϵ

Table 16: Coverage comparison - ϵR=0.75

Table 17: Comparison of classical and modern CI methods