Evidential Calibration of Confidence Intervals

Pages 47-57 | Received 16 Jan 2023, Accepted 14 May 2023, Published online: 26 Jun 2023

Abstract

We present a novel and easy-to-use method for calibrating error-rate based confidence intervals to evidence-based support intervals. Support intervals are obtained from inverting Bayes factors based on a parameter estimate and its standard error. A k support interval can be interpreted as “the observed data are at least k times more likely under the included parameter values than under a specified alternative.” Support intervals depend on the specification of prior distributions for the parameter under the alternative, and we present several types that allow different forms of external knowledge to be encoded. We also show how prior specification can to some extent be avoided by considering a class of prior distributions and then computing so-called minimum support intervals which, for a given class of priors, have a one-to-one mapping with confidence intervals. We also illustrate how the sample size of a future study can be determined based on the concept of support. Finally, we show how the bound for the Type I error rate of Bayes factors leads to a bound for the coverage of support intervals. An application to data from a clinical trial illustrates how support intervals can lead to inferences that are both intuitive and informative.

1 Introduction

A pervasive problem in data analysis is to draw inferences about unknown parameters of statistical models. For instance, data analysts are often interested in identifying a set of parameter values which are relatively compatible with the observed data. Here we focus on a particular method for doing so—the support set—that arguably represents a natural evidential answer to the problem both from a likelihoodist (Edwards Citation1971; Royall Citation1997; Blume Citation2002) and a Bayesian (Wagenmakers et al. Citation2022) point of view. In either paradigm, statistical evidence may be defined via the Law of Likelihood (Hacking Citation1965), that is, data constitute evidence for one parameter value over an alternative parameter value if the likelihood of the data under that parameter value is larger than under the alternative parameter value. The likelihood ratio (or Bayes factor) measures the strength of evidence, and it also plays a central role in the construction of support sets, as we will explain in the following.

Let f(x | θ) denote the likelihood of the observed data x, let θ be an unknown parameter, and denote by (1) BF01(x; θ0) = f(x | H0)/f(x | H1) = f(x | θ0) / ∫ f(x | θ) f(θ | H1) dθ the Bayes factor quantifying the strength of evidence which the observed data x provide for the simple null hypothesis H0: θ = θ0 relative to a (possibly composite) alternative hypothesis H1: θ ≠ θ0, with f(x | H1) the marginal likelihood of x obtained from integrating the likelihood f(x | θ) with respect to the prior density of the parameter f(θ | H1) under the alternative H1 (Jeffreys Citation1961; Kass and Raftery Citation1995). For constructing a support interval, one views the Bayes factor (1) as a function of the null value θ0 for fixed data x. A k support set for θ is then given by the set of parameter values for which the data are at least k times more likely than under the alternative hypothesis H1 (Wagenmakers et al. Citation2022), that is, (2) SIk = {θ0 : BF01(x; θ0) ≥ k}.

The support set thus includes the parameter values for which the observed data provide statistical evidence of at least level k.

Figure 1 illustrates different support sets (in this case intervals) for a log hazard ratio parameter θ quantifying the effect of the drug dexamethasone on the mortality of hospitalized patients with Covid-19 enrolled in the RECOVERY trial (RECOVERY Collaborative Group Citation2021). Shown is also the Bayes factor for testing H0: θ = θ0 versus H1: θ ≠ θ0 viewed as a function of the null value θ0. A k support set is obtained from “cutting” this function at height k and taking the parameter values with a Bayes factor of at least k as part of the set. In practice, it is not clear which value of k should be chosen. One possibility is to select k based on conventional classifications of Bayes factors or likelihood ratios; Table 1 lists three of them. For instance, using the classification from Jeffreys (Citation1961, Appendix B), the k = 10 support interval ranging from −0.27 to −0.1 can be interpreted to contain log hazard ratios that are strongly supported by the data, whereas the k = 1/10 support interval ranging from −0.37 to 0 can be interpreted to contain log hazard ratios that are at least not strongly contradicted by the data.


Figure 1: The RECOVERY trial (RECOVERY Collaborative Group Citation2021) found that dexamethasone treatment reduced mortality compared to usual care in hospitalized Covid-19 patients (estimated log hazard ratio θ̂ = −0.19 with standard error σ = 0.05 and 95% confidence interval from −0.29 to −0.07). Assuming a normal likelihood θ̂ | θ ∼ N(θ, σ²), the Bayes factor for contrasting H0: θ = θ0 to H1: θ ≠ θ0 is shown as a function of the null value θ0. A unit-information normal distribution θ | H1 ∼ N(μθ = −0.22, σθ² = 4) centered around the clinically relevant log hazard ratio is used as prior for θ under H1. Support intervals for different support levels k indicate the range of log hazard ratios supported by the data.
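The Bayes factor function in Figure 1 can be reproduced from the reported summary statistics alone. The following minimal R sketch assumes the rounded values from the caption (θ̂ = −0.19, σ = 0.05) and uses the fact that, under a normal prior, the marginal likelihood of θ̂ under H1 is normal with mean μθ and variance σ² + σθ²; the resulting endpoints differ slightly from those quoted in the text because of the rounded inputs.

## Bayes factor function and k = 10 support interval for the RECOVERY summary data
thetahat <- -0.19  # estimated log hazard ratio (rounded)
sigma <- 0.05      # standard error (rounded)
mu <- -0.22        # prior mean under H1
tau2 <- 4          # unit-information prior variance under H1
## BF01(thetahat; theta0): normal likelihood at theta0 divided by the marginal
## likelihood under H1, which is N(mu, sigma^2 + tau2)
bf01 <- function(theta0) {
  dnorm(thetahat, mean = theta0, sd = sigma) /
    dnorm(thetahat, mean = mu, sd = sqrt(sigma^2 + tau2))
}
## "cut" the Bayes factor function at height k over a fine grid of null values
theta0 <- seq(-0.5, 0.1, length.out = 100000)
k <- 10
range(theta0[bf01(theta0) >= k])  # approximately -0.27 to -0.11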

Table 1: Classifications of evidence for H0 provided by Bayes factors BF01=k.

The construction of support sets thus parallels the construction of frequentist confidence sets: A (1 − α)100% confidence set corresponds to the set of parameter values which are not rejected by a null hypothesis significance test at level α. It can equally be displayed and obtained from a so-called p-value function, which is the p-value of the data viewed as a function of the null value (Fraser Citation2019; Rafi and Greenland Citation2020). Despite these similarities, the interpretation of support and confidence sets is rather different; support sets contain parameter values for which there is at least a certain amount of statistical evidence, whereas confidence sets are defined through the long-run frequency of including the unknown parameter θ with probability equal to their confidence level. The parameter values in a confidence set are typically interpreted as being “compatible” with a particular dataset, but this is debatable as the confidence level is concerned with the confidence set as a procedure over multiple replications.

Although support sets are conceptually simple and intuitive, they have not been applied to many problems. It is also unclear how they relate to the more widely used confidence sets. In this article we thus shed light on the connection between support and confidence sets. Specifically, we provide methods for calibrating approximate confidence sets to approximate support sets and vice versa in the important case when the data consist of an estimate of a univariate parameter θ with approximate normal likelihood (Section 2). To do so, we derive novel and easy-to-use formulas for computing support intervals that only require summary statistics typically reported in research articles, for example, point estimates, standard errors, or confidence intervals. This scenario is highly relevant as many commonly used estimators satisfy the approximate normality assumption, and also because one often does not have access to the raw data but only to the summary statistics. Computing a support interval requires the specification of a prior distribution for θ under the alternative H1, and we compare several classes of distributions. We also show how bounding the evidence against the null hypothesis for a certain class of prior distributions leads to the novel concept of a minimum support set. Our minimum support sets are directly related to well-known bounds of Bayes factors (Berger and Sellke Citation1987; Sellke, Bayarri, and Berger Citation2001; Held and Ott Citation2018). In Section 3, we show how minimum support sets provide confidence sets with an evidential interpretation with respect to certain classes of priors. We then illustrate how the sample size of a future study can be determined based on support, which provides a novel alternative to the conventional approaches based on either power or precision of an interval estimator (Section 5). Finally, we show how the universal bound for the Type I error rate of Bayes factors can be used for bounding the coverage of support sets, even under sequential analyses with optional stopping (Section 6). As a running example, we use data from the RECOVERY trial (RECOVERY Collaborative Group Citation2021), as already introduced in Figure 1.

2 Support Intervals under Normality

Denote by θ̂ an asymptotically normal estimator of an unknown univariate parameter θ, possibly the maximum likelihood estimator (MLE). Suppose its squared standard error σ² is an estimate of the asymptotic variance of θ̂, so that an approximate normal likelihood θ̂ | θ ∼ N(θ, σ²) is justifiable. For example, θ̂ could be an estimated regression coefficient from a generalized linear model and σ its standard error. In many simple settings, the standard error is of the form σ = λ/√n, where λ² is the variance corresponding to one effective unit and n is the effective sample size, for example, the number of measurements or the number of events (Spiegelhalter, Abrams, and Myles Citation2004, sec. 2.4), see also Berger, Bayarri, and Pericchi (Citation2013) for a generalization of effective sample size to more complex settings with dependent data. An approximate (1 − α)100% confidence interval for θ is given by (3) θ̂ ± σ × Φ⁻¹(1 − α/2) with Φ⁻¹(·) the quantile function of the standard normal distribution. The confidence level (1 − α)100% represents the long run frequency with which the true parameter is included in the confidence interval (assuming that the sampling model is correct). Note that the interval (3) also corresponds to the (1 − α)100% posterior credible interval based on an (improper) uniform prior for θ, corresponding to Jeffreys’s transformation invariant prior (Jeffreys Citation1961; Ly et al. Citation2017) and thus also representing the default interval estimate for θ from a Bayesian estimation perspective. We will now contrast the confidence interval (3) to several types of support intervals.

2.1 Normal Prior Under the Alternative

To construct a support interval for θ using the data summary θ̂ with θ̂ | θ ∼ N(θ, σ²), specification of a prior for θ under the alternative H1 is required. Specifying a normal prior θ | H1 ∼ N(μθ, σθ²) results in the Bayes factor (4) BF01(θ̂; θ0) = √(1 + σθ²/σ²) × exp[−(1/2){(θ̂ − θ0)²/σ² − (θ̂ − μθ)²/(σ² + σθ²)}].

Now, fixing the Bayes factor (4) to k and solving for θ0 leads to the k support interval (5) θ̂ ± σ × √{log(1 + σθ²/σ²) + (θ̂ − μθ)²/(σ² + σθ²) − 2 log k}.

Similar to the confidence interval (3), the support interval (5) is centered around the parameter estimate θ̂. However, while the width of the confidence interval is only determined by the confidence level (1 − α)100% and the standard error σ, the width of the support interval also depends on the specified prior for θ under H1. Moreover, for k > 1 the support interval may be empty, as the term below the square root in (5) becomes negative when k is too large. This means that in order to find the desired level of support k > 1, the data have to be sufficiently informative (relative to the prior), that is, the squared standard error σ² has to be sufficiently small relative to the prior variance σθ².
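As a minimal sketch, the closed-form interval (5) can be coded directly in R; the call below again assumes the rounded RECOVERY summary statistics from Figure 1, so the slight difference from the −0.27 to −0.1 interval reported in the text is presumably due to rounding of the inputs.

## k support interval (5) under a normal prior for theta under H1
si_normal <- function(thetahat, sigma, mu, tau2, k) {
  sq <- log(1 + tau2 / sigma^2) + (thetahat - mu)^2 / (sigma^2 + tau2) - 2 * log(k)
  if (sq < 0) return(c(NA, NA))  # the k support interval is empty
  thetahat + c(-1, 1) * sigma * sqrt(sq)
}
si_normal(thetahat = -0.19, sigma = 0.05, mu = -0.22, tau2 = 4, k = 10)
## about -0.27 to -0.11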

In the following, we will discuss how different prior means μθ and variances σθ² affect the resulting support intervals. When the prior variance decreases (σθ² → 0), the prior approaches a point mass at μθ. The width of the support interval is then fully determined by the difference between the parameter estimate θ̂ and the prior mean μθ divided by the standard error σ. A smaller difference between θ̂ and μθ leads to a tighter support interval. In contrast, for priors that become increasingly diffuse (σθ² → ∞), the k ≥ 1 support interval (5) extends to the entire real line, indicating that all values θ ∈ ℝ receive more support from the data than the diffuse alternative, regardless of the data, that is, the observed estimate θ̂, standard error σ, and the location of the prior mean μθ. This particular behavior provides another perspective on the well-known Jeffreys-Lindley paradox (Wagenmakers and Ly Citation2023); the confidence interval from (3) only spans a finite range around the parameter estimate θ̂, so that the corresponding null hypothesis significance tests would reject the parameter values outside, whereas for the same values the Bayes factor would indicate evidence for the null hypothesis. Finally, centering the prior around the parameter estimate (μθ = θ̂) and setting the prior variance equal to the variance of one effective observation (σθ² = n × σ² with n the effective sample size) produces the support interval for Jeffreys’s approximate Bayes factor (Wagenmakers Citation2022), which is equal to the well-known approximation of the Bayes factor based on the Bayesian information criterion (Raftery Citation1999). In this case, the standard error multiplier has a particularly simple form M = √{log(1 + n) − 2 log k}, showing that n ≥ k² − 1 effective observations are required for the respective support interval with k ≥ 1 to be nonempty.

2.2 Local Normal Prior Under the Alternative

The support interval based on the normal prior (5) depends on the specification of a prior mean and prior variance. A different approach is to use a so-called local prior, that is, a unimodal and symmetric prior centered around the null value θ0 (Berger and Delampady Citation1987). Choosing a local normal prior with variance σθ² corresponds to setting μθ = θ0 in (4), which leads to the Bayes factor (6) BF01(θ̂; θ0) = √(1 + σθ²/σ²) × exp[−(1/2) (θ̂ − θ0)²/{σ²(1 + σ²/σθ²)}].

The k support interval based on the Bayes factor (6) is then given by (7) θ̂ ± σ × √[{log(1 + σθ²/σ²) − 2 log k}(1 + σ²/σθ²)].

While the Bayes factor (6) is a special case of the Bayes factor (4), the support interval (7) is not a special case of the support interval (5). This is because the prior for θ under H1 is different for each null value θ0, whereas it is always the same under the two-parameter normal prior approach. To fully specify the support interval (7), the prior variance σθ² needs to be chosen. One standard choice is to set it equal to the variance of a single observation (σθ² = n × σ²), known as the unit-information prior (Kass and Wasserman Citation1995). This approach leads to the k support interval (8) θ̂ ± σ × √[{log(1 + n) − 2 log k}(1 + 1/n)].

For this type of support interval, the standard error multiplier M = √[{log(1 + n) − 2 log k}(1 + 1/n)] is wider than for Jeffreys’s approximate Bayes factor by a factor of √(1 + 1/n), but the condition n ≥ k² − 1 for the k ≥ 1 support interval to be nonempty is the same.
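A minimal R sketch of the unit-information interval (8): the effective sample size is backed out from σ = λ/√n, assuming λ = 2 (the unit variance 4 for a log hazard ratio used later in Section 4) and the rounded RECOVERY summary statistics.

## k support interval (8) with a unit-information local normal prior
si_local_ui <- function(thetahat, sigma, lambda, k) {
  n <- (lambda / sigma)^2  # effective sample size implied by sigma = lambda / sqrt(n)
  sq <- (log(1 + n) - 2 * log(k)) * (1 + 1 / n)
  if (sq < 0) return(c(NA, NA))  # empty when k > sqrt(1 + n)
  thetahat + c(-1, 1) * sigma * sqrt(sq)
}
si_local_ui(thetahat = -0.19, sigma = 0.05, lambda = 2, k = 10)
## nearly identical to the normal-prior interval computed above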

2.3 Nonlocal Normal Moment Prior Under the Alternative

Another attractive class of priors for θ under the alternative is given by so-called nonlocal priors. These priors are characterized by having zero density at the null value θ0, thereby leading to a faster accumulation of evidence than local priors when the null hypothesis is actually true (Johnson and Rossell Citation2010). One popular type of nonlocal prior is the normal moment prior θ ∼ NM(θ0, σθ), with symmetry point θ0 and spread σθ, which has density f(θ | θ0, σθ) = N(θ; θ0, σθ²) × (θ − θ0)²/σθ², where N(· ; θ0, σθ²) denotes the density function of a normal distribution with mean θ0 and variance σθ². The Bayes factor employing a prior θ | H1 ∼ NM(θ0, σθ) is then given by BF01(θ̂; θ0) = (1 + σθ²/σ²)^(3/2) × exp[−(1/2) (θ̂ − θ0)²/{σ²(1 + σ²/σθ²)}] × [1 + (θ̂ − θ0)²/{σ²(1 + σ²/σθ²)}]⁻¹, from which the corresponding k support interval can be derived to be (9) θ̂ ± σ × √([2 W0{(1 + σθ²/σ²)^(3/2) √e/(2k)} − 1](1 + σ²/σθ²)) with W0(·) denoting the principal branch of the Lambert W function. The Lambert W function is the (complex) multivalued function W(·) satisfying W(x) exp{W(x)} = x. For real x, it is defined for x ∈ [−1/e, ∞). For x ≥ 0 the function has a unique value, whereas in the interval x ∈ (−1/e, 0) the function has two branches: W0(x) > −1 for all x ∈ (−1/e, 0), termed the principal branch, and W−1(x) < −1 for all x ∈ (−1/e, 0), see Corless et al. (Citation1996) for more details. It is possible that the support interval (9) is empty, as for the other two types of support intervals. This happens when the Lambert W term is smaller than one half, so that the square root is undefined. Since W0(0.82) ≈ 1/2, this situation occurs when (1 + σθ²/σ²)^(3/2) < 0.82 × 2k/√e, meaning that the standard error σ has to be sufficiently small relative to the prior spread parameter σθ and the support level k for the interval to be nonempty.
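Base R has no built-in Lambert W function, but the principal branch can be obtained numerically. The sketch below implements the interval (9) this way and, for concreteness, evaluates it with the prior spread σθ = 0.28 that will be used for the RECOVERY data in Section 4.

## k support interval (9) under a nonlocal normal moment prior
lambertW0 <- function(x) {  # principal branch, computed numerically
  uniroot(function(w) w * exp(w) - x, lower = -1, upper = 100, tol = 1e-12)$root
}
si_moment <- function(thetahat, sigma, tau, k) {
  A <- (1 + tau^2 / sigma^2)^(3 / 2)
  sq <- (2 * lambertW0(A * sqrt(exp(1)) / (2 * k)) - 1) * (1 + sigma^2 / tau^2)
  if (sq < 0) return(c(NA, NA))  # empty when the Lambert W term is below one half
  thetahat + c(-1, 1) * sigma * sqrt(sq)
}
si_moment(thetahat = -0.19, sigma = 0.05, tau = 0.28, k = 10)
## roughly -0.28 to -0.10 with these rounded inputs (Section 4 reports -0.28 to -0.09)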

2.4 Comparison of Priors

To better understand the advantages and disadvantages of the previously discussed priors, the resulting support intervals can be compared in terms of their width as a function of the sample size n (Figure 2, top). For small sample sizes, the normal prior with mean equal to the observed parameter estimate produces the narrowest k = 1 support intervals, followed by the local normal prior, the normal prior with mean one standard deviation away from the observed estimate, and lastly the nonlocal normal moment prior. Thus, a well-chosen normal prior can increase the precision of support inference, whereas a poorly chosen normal prior can decrease precision. However, the differences in width between the priors mostly disappear with increasing sample size. In the realistic range between 10 and a few hundred samples, the local normal prior seems to be a reasonable default choice, as it leads to support intervals almost as narrow as the normal (correct mean) prior, without the need to specify a mean.


Figure 2: Comparison of prior distributions for the parameter θ under the alternative H1 in terms of the resulting support interval width and the highest level for which it is nonempty. A data model θ̂ | θ ∼ N(θ, λ²/n = 4/n) is assumed in all cases. The prior scale/spread parameter is set to σθ = 2. The normal prior (correct mean) has a mean equal to the parameter estimate θ̂, while the normal prior (wrong mean) has a mean one standard deviation λ = 2 away from θ̂.

Another aspect in which the priors can be compared is the highest support level k for which the resulting support intervals are nonempty (Figure 2, bottom). We see that for the same sample size n, the highest support levels from the normal and local normal priors are similar and show the same growth rates. In contrast, the highest support level from the nonlocal moment prior is higher and grows much faster. This is expected because nonlocal priors are designed to produce Bayes factors with faster accumulation of evidence for the null hypothesis. Thus, although nonlocal moment priors result in wider support intervals than the other priors, for small sample sizes they may be the only type of prior that can produce a support interval at, say, Jeffreys’s strong evidence level k = 10.

3 Support Intervals based on Bayes Factor Bounds

In some situations it is clear which prior for θ should be chosen under the alternative H1, for example, when a parameter estimate from a previous dataset is available. In other situations it is less clear and different priors may produce drastically different results. To provide a more objective assessment of evidence in the latter situation, several authors have proposed to instead specify only a class of prior distributions and then select the one prior among them that leads to the Bayes factor providing the strongest possible evidence against the null hypothesis H0 (Edwards, Lindman, and Savage Citation1963; Berger and Sellke Citation1987; Sellke, Bayarri, and Berger Citation2001; Held and Ott Citation2018). Here we refer to these Bayes factor bounds as minimum Bayes factors for the null H0 over the alternative H1, as we are interested in the support for null values θ0.

We will now show how minimum Bayes factors can be used for obtaining so-called minimum support sets. Specifically, a k minimum support set is given by (10) minSIk = {θ0 : minBF01(x; θ0) ≥ k}, where minBF01(x; θ0) is the smallest possible Bayes factor for testing H0: θ = θ0 versus H1: θ ≠ θ0 that can be obtained from a class of prior distributions for θ under the alternative H1. That is, given the data, for each θ0 the prior for θ under H1 is cherry-picked from a class of priors to obtain the lowest evidence for H0: θ = θ0 possible. Minimum support intervals thus provide a Bayes/non-Bayes compromise (Good Citation1992) as they do not require specification of a specific prior distribution but still allow for an evidential interpretation of the resulting interval.

One property of minimum Bayes factors is that they can only be used to assess the maximum evidence against the null hypothesis but not the evidence for it. Minimum support sets inherit this property, meaning that they can only be obtained for support levels k ≤ 1. For instance, a k = 1/3 minimum support set includes the parameter values under which the observed data are at most 3 times less likely than under any prior from the specified class of alternatives. Being unable to obtain support intervals with k > 1 is the price that needs to be paid for having to specify only a class of prior distributions but not a specific prior itself. We will now discuss minimum support intervals for several important classes of distributions.

3.1 Class of All Distributions Under the Alternative

Among the class of all possible priors under H1, the prior which is most favorable toward the alternative is a point mass at the observed effect estimate, H1: θ = θ̂ (Edwards, Lindman, and Savage Citation1963). The resulting minimum Bayes factor is given by (11) minBF01(θ̂; θ0) = exp{−(1/2)(θ̂ − θ0)²/σ²}, for which twice the negative log equals the standard likelihood ratio test statistic when θ̂ is the MLE. Inverting (11) for θ0 leads to the k minimum support interval (12) θ̂ ± σ × √(−2 log k).

Interestingly, defining a support interval relative to the likelihood of the data under the MLE has already been suggested by Fisher (Citation1956). Table 1 shows Fisher’s classification of evidence for this type of interval. Royall also made use of the minimum support interval (12), usually with support levels k = 1/8 and k = 1/32. He noted: “The 1/8 and 1/32 likelihood intervals are not confidence intervals, in general, but they truly represent what confidence intervals are often mistaken to represent, namely parameter values that the sample does not represent evidence against, that is, values that are ‘consistent with the observations.’ We can speak in this way, asserting that there is not strong evidence against a point inside the interval, without reference to an alternative value, because the statement is true for all alternatives. Every point inside the 1/8 interval is consistent with the observations in the strong sense that there is no other possible value of the parameter that is better supported by a factor as large as 8” (Royall Citation1997, p. 101). While we agree that the support interval (12) is a useful bound, it is important to note that from a Bayesian perspective it represents the most blatantly biased assessment of support in the sense that assigning a point prior at the observed parameter estimate hardly reflects prior knowledge about θ but can rather be considered cheating (Berger and Sellke Citation1987). This is reflected by the fact that for a given estimate (i.e., dataset) and fixed support level k, the interval represents the narrowest support interval among all possible support intervals. When minimizing over the class of all two-parameter normal priors, that is, the Bayes factor (4), we also obtain the same minimum Bayes factor (11) and consequently the same minimum support interval (12).

3.2 Class of Local Normal Alternatives

When the class of priors for θ under the alternative H1 is given by normal distributions centered around the null value θ0, choosing the variance to be σθ² = max{(θ̂ − θ0)² − σ², 0} maximizes the marginal likelihood of the data under H1. Plugging this variance into the Bayes factor (6) leads to the minimum Bayes factor over the class of local normal priors (13) minBF01(θ̂; θ0) = (|θ̂ − θ0|/σ) exp{−(θ̂ − θ0)²/(2σ²)} √e if |θ̂ − θ0|/σ > 1, and 1 otherwise, as first shown by Edwards, Lindman, and Savage (Citation1963). Equating (13) to k and solving for θ0 then leads to the k minimum support interval (14) θ̂ ± σ × √{−W−1(−k²/e)}, with W−1(·) the branch of the Lambert W function that satisfies W(y) < −1 for y ∈ (−e⁻¹, 0). For k = 1, the standard error multiplier becomes M = √{−W−1(−1/e)} = 1. Hence, the data provide support for all parameter values within one standard error around the observed parameter estimate θ̂ when the class of priors for the parameter is given by local normal alternatives.

3.3 Class of p-based Alternatives

Vovk (Citation1993) and Sellke, Bayarri, and Berger (Citation2001) proposed a minimum Bayes factor where the data are summarized through a p-value. The idea is that under the null hypothesis H0: θ = θ0, a p-value should be uniformly distributed, whereas under the alternative it should have a monotonically decreasing density characterized by the class of Beta(ξ, 1) distributions (with ξ ≤ 1). Choosing ξ such that the marginal likelihood of the data under H1 is maximized leads to the well-known “eplogp” minimum Bayes factor (15) minBF01(p; θ0) = −e p log p if p ≤ e⁻¹, and 1 otherwise, with p = 2{1 − Φ(|θ̂ − θ0|/σ)}. Equating (15) to k and solving for θ0 leads to the k minimum support interval (16) θ̂ ± σ × Φ⁻¹[1 − exp{W−1(−k/e)}/2].

For k = 1, the standard error multiplier is given by M = Φ⁻¹[1 − exp{W−1(−1/e)}/2] = Φ⁻¹{1 − 1/(2e)} ≈ 0.90, so the k = 1 minimum support interval is just slightly tighter than the one based on local normal alternatives.
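As a minimal sketch, the three minimum support intervals (12), (14), and (16) can be computed in a few lines of R; the lower Lambert W branch is again obtained numerically, and the rounded RECOVERY summary statistics are used for illustration.

## k <= 1 minimum support intervals (12), (14), and (16)
lambertWm1 <- function(x) {  # lower branch W-1, defined for x in (-1/e, 0)
  uniroot(function(w) w * exp(w) - x, lower = -700, upper = -1, tol = 1e-12)$root
}
minsi_all <- function(thetahat, sigma, k) {     # class of all priors, (12)
  thetahat + c(-1, 1) * sigma * sqrt(-2 * log(k))
}
minsi_local <- function(thetahat, sigma, k) {   # class of local normal priors, (14)
  thetahat + c(-1, 1) * sigma * sqrt(-lambertWm1(-k^2 / exp(1)))
}
minsi_eplogp <- function(thetahat, sigma, k) {  # eplogp calibration, (16)
  thetahat + c(-1, 1) * sigma * qnorm(1 - exp(lambertWm1(-k / exp(1))) / 2)
}
minsi_all(-0.19, 0.05, k = 1/10)
minsi_local(-0.19, 0.05, k = 1/10)
minsi_eplogp(-0.19, 0.05, k = 1/10)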

3.4 Mapping between Confidence and Minimum Support Levels

For all types of minimum support intervals discussed so far, there is a one-to-one mapping between their minimum support level k and the confidence level (1 − α)100% of the approximate confidence interval (3), see Figure 3. The conventional default level of 95% corresponds to a k = 1/6.8 support level for the class of all priors under the alternative, a k = 1/2.5 support level for the eplogp, and a k = 1/2.1 support level for the local normal prior calibration. Conversely, the k = 1/10 minimum support interval corresponds to the 96.81% confidence interval for the class of all priors, the 99.25% confidence interval for the eplogp, and the 99.43% confidence interval for the local normal prior calibration. Similar to the mappings between Bayes factor bounds and p-values (Held and Ott Citation2018), the mappings displayed in Figure 3 provide confidence intervals with an evidential interpretation. Specifically, they enhance their long-term frequency interpretation with an interpretation that directly relates to the minimum support that the observed data provide for the parameter values in the interval.
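The mapping follows from evaluating each minimum Bayes factor at the limit of a (1 − α)100% confidence interval, where |θ̂ − θ0|/σ = Φ⁻¹(1 − α/2) and p = α. A minimal R sketch for α = 0.05:

## minimum support level k corresponding to a 95% confidence interval
alpha <- 0.05
z <- qnorm(1 - alpha / 2)                  # standardized distance of the CI limits
k_all <- exp(-z^2 / 2)                     # class of all priors, from (11)
k_local <- z * exp(1 / 2) * exp(-z^2 / 2)  # class of local normal priors, from (13)
k_eplogp <- -exp(1) * alpha * log(alpha)   # eplogp calibration, from (15)
round(1 / c(all = k_all, local = k_local, eplogp = k_eplogp), 1)
## approximately 6.8, 2.1, and 2.5, matching the levels quoted above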


Figure 3: Mapping between confidence level (1−α)100% and minimum support level k for different types of minimum support intervals.

4 Example: RECOVERY Trial

We now compute the above (minimum) support intervals for the data from the RECOVERY trial (RECOVERY Collaborative Group Citation2021). With the standard error σ known, the minimum support intervals are fully specified and can be readily computed. For the normal, local normal, and the nonlocal normal moment prior we choose their parameters as follows. The trial steering committee determined the sample size of the trial based on an assumed clinically relevant log hazard ratio of log 0.8 ≈ −0.22. This effect size can be used to inform the normal prior under the alternative H1, that is, we specify the mean μθ = −0.22 along with the unit-information variance σθ² = 4 for a log hazard ratio (Spiegelhalter, Abrams, and Myles Citation2004, sec. 2.4.2). Likewise, we use the unit-information variance σθ² = 4 as the variance of the local normal prior. The spread parameter of the nonlocal moment prior σθ is elicited with a similar approach as in Pramanik and Johnson (Citation2022): the value σθ = 0.28 is selected so that 90% probability mass is assigned to log hazard ratios between θ0 − log 2 and θ0 + log 2, representing effect sizes that at most halve or double the mortality hazard relative to the null value θ0.
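One way to reproduce this elicitation numerically is sketched below; the closed-form prior mass follows from integrating the normal moment density over θ0 ± log 2, and the root is the spread that assigns 90% mass to that range.

## spread sigma_theta of the normal moment prior with 90% mass in theta0 +/- log(2)
nm_mass <- function(tau, c = log(2)) {  # NM(theta0, tau) prior mass within theta0 +/- c
  a <- c / tau
  2 * pnorm(a) - 1 - 2 * a * dnorm(a)
}
uniroot(function(tau) nm_mass(tau) - 0.9, interval = c(0.01, 2))$root
## approximately 0.28, the value used in the analysis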

Figure 4 shows the corresponding k support intervals for different values of k. The support intervals based on the normal (second row) and local normal prior (third row) mostly coincide for all considered support levels k. The k = 10 support intervals (blue) from both types indicate that log hazard ratios between −0.27 and −0.1 receive strong support from the data compared to alternative parameter values. In contrast, the k = 10 support interval (blue) based on the nonlocal normal moment prior (fourth row) is slightly wider, indicating that values between −0.28 and −0.09 are strongly supported by the data. For smaller support levels (k < 10) this trend reverses and the normal and local normal prior support intervals are wider than the one based on the nonlocal normal prior. Finally, each parameter value not included in a k support interval corresponds to a point-null hypothesis for which the respective Bayes factor is smaller than k, similar to the relationship between confidence intervals and p-values. For instance, one can immediately see that the Bayes factor based on the nonlocal moment prior indicates strong evidence (BF01 < 1/10) against H0: θ = 0 as the value is not included in the interval, whereas this is not the case for the Bayes factors based on normal and local normal priors.


Figure 4: Different support intervals for the data from the RECOVERY trial. The normal prior is centered around μθ = −0.22 and has unit-information variance σθ² = 4. The local normal prior also has unit-information variance σθ² = 4. The spread parameter of the nonlocal normal moment prior is σθ = 0.28.

The three bottom rows in Figure 4 show different types of k minimum support intervals computed for the data from the RECOVERY trial. Since minimum support intervals are only nonempty for k ≤ 1, only such support levels are shown. The (yellow) k = 1 minimum support interval for the class of all priors (fifth row) is just a point at the observed effect estimate θ̂ = −0.19. In contrast, the (yellow) k = 1 minimum support intervals based on local normal priors (sixth row) and the eplogp calibration (last row) span about one standard error around the effect estimate. Also for k = 1/3 (orange) and k = 1/10 (red), the minimum support interval based on the class of all priors is much narrower than the ones based on local normal and eplogp, yet all of them are narrower than the ordinary support intervals. This illustrates that minimum support intervals provide an overly pessimistic assessment of support for parameter values, in the same way that Bayes factor bounds provide an overly pessimistic quantification of evidence for the null hypothesis.

5 Design of New Studies based on Support

The sample size of a future study is typically derived to achieve (i) a targeted power of a hypothesis test, or (ii) a targeted precision of a future confidence/credible interval. Here, we provide an alternative where the sample size of a future study is determined to achieve a desired level of support.

Assume we wish to conduct a study and analyze the resulting parameter estimate θ̂ using the support interval based on a normal prior (5). Further assume that we either specify a reasonable prior from existing knowledge or use the prior for Jeffreys’s approximate Bayes factor. The goal is now to determine the sample size n such that we can identify the parameter values which are strongly supported by the future data, for instance, with a support level k = 10 representing “strong” support in the classification from Jeffreys (Citation1961). In order for the k > 1 support interval (5) to be nonempty, the standard error σ of the parameter estimate θ̂ needs to be sufficiently small so that the term in the square root becomes nonnegative, that is, it must hold that (17) log(1 + σθ²/σ²) + (θ̂ − μθ)²/(σ² + σθ²) ≥ 2 log k.

The sample size n can now be determined such that the standard error σ is small enough for (17) to hold. The resulting sample size then guarantees that parameter values with the desired level of support will be identified. In general, this needs to be done numerically, but for the Jeffreys’s approximate Bayes factor prior (μθ = θ̂ and σθ² = nσ²), the simple expression n ≥ k² − 1 exists. For instance, if we want a k = 10 support interval to be nonempty, we must take at least 10² − 1 = 99 samples.
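A minimal R sketch of the numerical route, assuming σ = λ/√n with unit variance λ² = 4, a unit-information prior variance σθ² = 4, and a hypothetical anticipated estimate theta_plan used in place of θ̂ at the planning stage; setting theta_plan equal to the prior mean mimics the Jeffreys approximate Bayes factor case and recovers n ≥ k² − 1 = 99.

## smallest n for which the k = 10 support interval (5) is nonempty
lambda <- 2; tau2 <- 4; mu <- -0.22; theta_plan <- -0.22; k <- 10
g <- function(n) {  # left-hand side of (17) minus 2 log(k), as a function of n
  sigma2 <- lambda^2 / n
  log(1 + tau2 / sigma2) + (theta_plan - mu)^2 / (sigma2 + tau2) - 2 * log(k)
}
uniroot(g, interval = c(1, 1e6))$root  # approximately 99, that is, n >= k^2 - 1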

While the previously described approach guarantees that a k > 1 support interval is nonempty and includes at least one parameter value θ, one may also want to guarantee that the resulting k support interval will span a desired length (18) l = 2σ × Mk, with Mk the standard error multiplier of a k support interval. In general, numerical methods are required for computing the n such that (18) is satisfied, yet again for the support interval based on Jeffreys’s approximate Bayes factor there are explicit solutions available (19) n = k² exp{−W(−k²l²/(4λ²))} with λ² the variance of one (effective) observation and assuming log(1 + n)/log(n) ≈ 1. From (19) two things are apparent: (i) the argument to W(·) has to be larger than −1/e for the function value to be defined, meaning that the possible width is limited by l ≤ √{4λ²/(ek²)}, (ii) since the argument to W(·) is negative, there are always two solutions given by the two real branches of the Lambert W function, if any exist at all. For instance, for a standard error of σ = λ/√n with λ = 2, a support level k = 10, and a desired width l = 0.2, Equation (19) leads to the sample sizes n1 = 143 and n2 = 862 (when rounded to the next larger integer). Both lead to the k = 10 support interval spanning the desired width l = 0.2, yet for the study employing the larger sample size n2 other support intervals with higher support levels k can be computed compared to a study employing the smaller sample size n1.
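A minimal R sketch of this calculation, with both real Lambert W branches computed numerically:

## sample sizes from (19) for lambda = 2, k = 10, and desired width l = 0.2
lamW <- function(x, lower, upper) {  # Lambert W on a chosen branch interval
  uniroot(function(w) w * exp(w) - x, lower = lower, upper = upper, tol = 1e-12)$root
}
lambda <- 2; k <- 10; l <- 0.2
x <- -k^2 * l^2 / (4 * lambda^2)                    # must be larger than -1/e
n1 <- k^2 * exp(-lamW(x, lower = -1, upper = 0))    # principal branch W0
n2 <- k^2 * exp(-lamW(x, lower = -50, upper = -1))  # lower branch W-1
ceiling(c(n1, n2))  # 143 and 862, as reported in the text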

6 Error Control via the Universal Bound

The universal bound (Royall Citation1997, sec. 1.4) ensures that for k < 1 and when the null hypothesis H0: θ = θ0 is true, the probability of finding evidence of level k or less for H0 cannot be larger than k, that is, (20) Pr{BF01(x; θ0) ≤ k | H0} ≤ k for any prior of θ under the alternative H1. Remarkably, the universal bound is also valid under sequential analyses with optional stopping as soon as a Bayes factor smaller than k is obtained (Robbins Citation1970; Pace and Salvan Citation2020). In contrast, frequentist tests and confidence sets typically have to be adjusted for sequential analyses to guarantee appropriate error rates, and the theory and applicability can become quite involved.

Lindon and Malek (Citation2020) proved that k support sets with k < 1 are also valid (1 − k)100% confidence sets. Their proof and the related “safe and anytime valid inference” theory (see, e.g., Grünwald, de Heide, and Koolen Citation2019) is based on relatively technical results from martingale theory. We now briefly show how the universal bound can also be used to derive error rate guarantees for support intervals. Assume there is a true parameter θ = θ*. For any (data-independent) prior for θ under the alternative hypothesis H1, the coverage of the corresponding k support set SIk with k < 1 is bounded by (21) Pr(SIk ∋ θ* | θ = θ*) = Pr{BF01(x; θ*) ≥ k | θ = θ*} = 1 − Pr{BF01(x; θ*) < k | θ = θ*} ≥ 1 − k, where the first equality follows from the definition of a k support set (2), whereas the inequality follows from the universal bound (20). This shows that a k support set with k < 1 is also a valid (1 − k)100% confidence set, even under sequential analyses with optional stopping, so that computing support intervals based on accumulating data leads to a (1 − k)100% confidence sequence (Lai Citation1976; Howard et al. Citation2021). Of course, the coverage bound rests on the assumption that the data model is correctly specified and a misspecified data model will result in incorrect coverage. Furthermore, the bound is based on simple null hypotheses, but it can also be shown to hold for composite null hypotheses when special types of priors are assigned to the nuisance parameters (Hendriksen, de Heide, and Grünwald Citation2021).
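The fixed-n case of the bound is easy to check by simulation. The following minimal R sketch assumes an arbitrary true parameter, standard error, and normal prior (all values hypothetical) and estimates the coverage of the k = 1/20 support interval, which should be at least 95%.

## simulation check of the coverage bound (21) for a fixed-n analysis
set.seed(42)
thetastar <- 0; sigma <- 0.1; mu <- 0.2; tau2 <- 1; k <- 1 / 20
covered <- replicate(10000, {
  thetahat <- rnorm(1, mean = thetastar, sd = sigma)  # simulated estimate
  sq <- log(1 + tau2 / sigma^2) + (thetahat - mu)^2 / (sigma^2 + tau2) - 2 * log(k)
  si <- thetahat + c(-1, 1) * sigma * sqrt(sq)        # k support interval (5)
  si[1] <= thetastar && thetastar <= si[2]
})
mean(covered)  # well above the guaranteed 1 - k = 0.95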

For the case of a univariate parameter θ as considered earlier, construction of a (1 − k)100% approximate confidence interval via the normal prior support interval from (5) corresponds to the proposal by Pace and Salvan (Citation2020). These authors studied this particular case in detail and also gave frequentist motivations for the prior distributions, interpreting them as weighting functions. Moreover, they found that the method is also applicable to parameter estimates from marginal, conditional, and profile likelihoods, and that the coverage of the intervals is controlled even under slight model misspecifications. We refer to Pace and Salvan (Citation2020) for further details.

A k < 1 support interval will usually be wider than a standard (1 − k)100% confidence interval. On the other hand, a k < 1 support interval has at least (1 − k)100% coverage, even under optional stopping (at least for point null hypotheses as is the case here), which is not satisfied by a standard (1 − k)100% confidence interval. Due to their valid coverage under an arbitrary number of looks at the data, k < 1 support intervals will also typically be wider than (1 − k)100% confidence intervals adjusted via group sequential or adaptive trial methodology, which are more fine-tuned to specific interim analysis strategies (Wassmer and Brannath Citation2016). These strategies are, however, typically more restrictive and computationally involved compared to the flexible and easily computable k < 1 support intervals which we present here.

It must be noted that the coverage bound (21) only holds for support intervals but not for minimum support intervals. This is because minimum support intervals are derived based on priors that depend on the data, which violates the assumption of the universal bound. Minimum support intervals are thus only useful for giving confidence intervals an evidential interpretation, but a k minimum support interval with k < 1 does not itself provide (1 − k)100% coverage under optional stopping.

7 Discussion

Misinterpretations and misconceptions of confidence intervals are common (Hoekstra et al. Citation2014; Greenland et al. Citation2016). We showed how confidence intervals can be reinterpreted as minimum support intervals which have an intuitive interpretation in terms of the minimum evidence that the data provide for the included parameter values. We also obtained easy-to-use formulas for different types of support intervals for an unknown parameter based on an estimate and standard error thereof. Table 2 summarizes our results, their limitation being the reliance on the normality assumption, which may be inadequate for small sample sizes. More appropriate support intervals can be obtained from considering the exact likelihood of the data instead of a normal approximation; however, the support interval will then typically no longer be available in closed form and will require the raw data rather than only the point estimate and standard error.

Table 2: Summary of confidence intervals (CI), support intervals (SI), and minimum support intervals (minSI) for an unknown parameter θ based on a parameter estimate θ̂ with standard error σ.

Which type of support interval should data analysts use in practice? We believe that the support interval based on a normal prior distribution is the most intuitive for encoding external knowledge. This type should therefore be preferably used whenever external knowledge is available. At the same time, the support interval based on a local normal prior with unit-information variance (Kass and Wasserman Citation1995) seems to be a reasonable “default” choice in cases where no external knowledge is available. Finally, we believe that minimum support intervals are mostly useful for giving confidence intervals an evidential interpretation due to the one-to-one mapping between the two.

It is also not clear which support level k should be used for computing support intervals. If space permits, we recommend visualizing the Bayes factor as a function of the null value as in Figure 1. A similar approach has also been proposed by Grünwald (Citation2023) under the name of E-posterior. The Bayes factor visualization provides readers with a more gradual assessment of support, and any desired k support interval can be read off from it. If there are space constraints, a compromise is to report support intervals for different levels (e.g., k ∈ {1/10, 1, 10}) or to present a forest plot with “telescope” style support intervals with ascending support levels stacked on top of each other, as in Figure 4. We are hesitant to recommend a “default” support level because any classification of support is arbitrary, just like the 95% confidence level convention. We believe that k = 1 is perhaps the least arbitrary default level, as it represents the tipping point at which the included parameter values begin to receive support from the data (although not necessarily strong support).

Other approaches for reinterpreting confidence intervals have been proposed. For instance, Rafi and Greenland (Citation2020) propose to rename confidence intervals to “compatibility” intervals and give their confidence level an information-theoretic interpretation. For example, a 95% confidence interval contains parameter values with at most 4.3 bits of refutational “surprisal.” This notion of compatibility is logically weaker than the notion of support considered in this article, as a failure to refute a parameter value cannot establish that this parameter value is supported without reference to alternatives (Greenland Citation2023). Compatibility intervals are in this sense similar to minimum support intervals; without a specified prior under the alternative hypothesis, only the maximum surprisal/evidence against the included parameter values can be quantified.

We also showed how the coverage of k support intervals with k < 1 is bounded by (1 − k)100%, which holds even under sequential analyses with optional stopping. For instance, a k = 1/20 support interval has valid 95% coverage. Of course, such error rate guarantees rest on the assumption that the data model has been correctly specified, which in most real world applications will be violated to some extent. We do not see this as a problem for the evidential interpretation of support intervals, which is usually of more concern to data analysts. Evidential inference does not rely on a statistical model being “true” in some abstract sense. Bayes factors and support intervals simply quantify the relative predictive performance that the combination of data model and parameter distribution yield on out-of-sample data (Kass and Raftery Citation1995; O’Hagan and Forster Citation2004; Gneiting and Raftery Citation2007; Fong and Holmes Citation2020). Such “descriptive inferential statistics” are especially important for the analysis of convenience data samples which typically violate assumptions of the underlying statistical model (Amrhein, Trafimow, and Greenland Citation2019; Shafer Citation2021). In fact, even one of the best known proponents of p-values—R.A. Fisher—noted “For all purposes, and more particularly for the communication of the relevant evidence supplied by a body of data, the values of the Mathematical Likelihood are better fitted to analyze, summarize, and communicate statistical evidence of types too weak to supply true probability statements” (Fisher Citation1956, p. 70), clearly recognizing the importance of inferential tools based on relative likelihood for making sense out of data.

Acknowledgments

We thank Leonhard Held for helpful comments on an earlier version of the manuscript. We thank Michael Lindon for interesting discussions and for letting us know about his work on the connection between support and confidence sets. We thank Glenn Shafer for making us aware of R.A. Fisher’s work on relative likelihood. We thank Sander Greenland for valuable feedback on the first version of the manuscript. We thank the editor Joshua Tebbs, the anonymous associate editor, and the anonymous reviewer for useful comments. Our acknowledgment of these individuals does not imply their endorsement of this article.

Disclosure Statement

The authors report that there are no conflicts of interest to declare.

Data Availability Statement

The point estimate and 95% confidence interval of the adjusted log hazard ratio were extracted from the abstract of RECOVERY Collaborative Group (Citation2021). All analyses were conducted in the R programming language version 4.3.0 (R Core Team Citation2023). Code and data for reproducing the results in this manuscript are available at https://github.com/SamCH93/ECoCI. A snapshot of the GitHub repository at the time of writing this article is archived at https://doi.org/10.5281/zenodo.6723249. An R package for calibration of confidence intervals to (minimum) support intervals is available at https://CRAN.R-project.org/package=ciCalibrate, see the Appendix for an illustration.

Additional information

Funding

This work was supported in part by an NWO Vici grant (016.Vici.170.083) to EJW, and a Swiss National Science Foundation mobility grant (part of 189295) to SP.

References

  • Amrhein, V., Trafimow, D., and Greenland, S. (2019), “Inferential Statistics as Descriptive Statistics: There is No Replication Crisis If We Don’t Expect Replication,” The American Statistician, 73, 262–270. DOI: 10.1080/00031305.2018.1543137.
  • Berger, J., Bayarri, M. J., and Pericchi, L. R. (2013), “The Effective Sample Size,” Econometric Reviews, 33, 197–217. DOI: 10.1080/07474938.2013.807157.
  • Berger, J. O., and Delampady, M. (1987), “Testing Precise Hypotheses,” Statistical Science, 2, 317–335. DOI: 10.1214/ss/1177013238.
  • Berger, J. O., and Sellke, T. (1987), “Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence,” Journal of the American Statistical Association, 82, 112–122. DOI: 10.2307/2289131.
  • Blume, J. D. (2002), “Likelihood Methods for Measuring Statistical Evidence,” Statistics in Medicine, 21, 2563–2599. DOI: 10.1002/sim.1216.
  • Corless, R. M., Gonnet, G. H., Hare, D. E. G., Jeffrey, D. J., and Knuth, D. E. (1996), “On the Lambert W Function,” Advances in Computational Mathematics, 5, 329–359. DOI: 10.1007/BF02124750.
  • Edwards, A. W. F. (1971). Likelihood, London: Cambridge University Press.
  • Edwards, W., Lindman, H., and Savage, L. J. (1963), “Bayesian Statistical Inference for Psychological Research,” Psychological Review, 70, 193–242. DOI: 10.1037/h0044139.
  • Fisher, R. A. (1956), Statistical Methods and Scientific Inference, Edinburgh: Oliver & Boyd.
  • Fong, E., and Holmes, C. C. (2020), “On the Marginal Likelihood and Cross-validation,” Biometrika, 107, 489–496. DOI: 10.1093/biomet/asz077.
  • Fraser, D. A. S. (2019), “The p-value Function and Statistical Inference,” The American Statistician, 73, 135–147. DOI: 10.1080/00031305.2018.1556735.
  • Gneiting, T., and Raftery, A. E. (2007), “Strictly Proper Scoring Rules, Prediction, and Estimation,” Journal of the American Statistical Association, 102, 359–377. DOI: 10.1198/016214506000001437.
  • Good, I. J. (1992), “The Bayes/non-Bayes Compromise: A Brief Review,” Journal of the American Statistical Association, 87, 597–606. DOI: 10.1080/01621459.1992.10475256.
  • Greenland, S. (2023), “Divergence versus Decision P-values: A Distinction Worth Making in Theory and Keeping in Practice: Or, How Divergence P-values Measure Evidence Even When Decision P-values Do Not,” Scandinavian Journal of Statistics, 50, 54–88. DOI: 10.1111/sjos.12625.
  • Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., and Altman, D. G. (2016), “Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations,” European Journal of Epidemiology, 31, 337–350. DOI: 10.1007/s10654-016-0149-3.
  • Grünwald, P., de Heide, R., and Koolen, W. (2019), “Safe Testing,” DOI: 10.48550/ARXIV.1906.07801., preprint.
  • Grünwald, P. (2023), “The E-posterior,” Philosophical Transactions of the Royal Society A, 381. DOI: 10.1098/rsta.2022.0146.
  • Hacking, I. (1965), Logic of Statistical Inference, New York: Cambridge University Press.
  • Held, L., and Ott, M. (2018), “On p-values and Bayes Factors,” Annual Review of Statistics and Its Application, 5, 393–419. DOI: 10.1146/annurev-statistics-031017-100307.
  • Hendriksen, A., de Heide, R., and Grünwald, P. (2021), “Optional Stopping with Bayes Factors: A Categorization and Extension of Folklore Results, with an Application to Invariant Situations,” Bayesian Analysis, 16, 961–989. DOI: 10.1214/20-BA1234.
  • Hoekstra, R., Morey, R. D., Rouder, J. N., and Wagenmakers, E.-J. (2014), “Robust Misinterpretation of Confidence Intervals,” Psychonomic Bulletin & Review, 21, 1157–1164. DOI: 10.3758/s13423-013-0572-3.
  • Howard, S. R., Ramdas, A., McAuliffe, J., and Sekhon, J. (2021), “Time-Uniform, Nonparametric, Nonasymptotic Confidence Sequences,” The Annals of Statistics, 49, 1055–1080. DOI: 10.1214/20-AOS1991.
  • Jeffreys, H. (1961), Theory of Probability (3rd ed.), Oxford: Clarendon Press.
  • Johnson, V. E., and Rossell, D. (2010), “On the Use of Non-local Prior Densities in Bayesian Hypothesis Tests,” Journal of the Royal Statistical Society, Series B, 72, 143–170. DOI: 10.1111/j.1467-9868.2009.00730.x.
  • Kass, R. E., and Raftery, A. E. (1995), “Bayes Factors,” Journal of the American Statistical Association, 90, 773–795. DOI: 10.1080/01621459.1995.10476572.
  • Kass, R. E., and Wasserman, L. (1995), “A Reference Bayesian Test for Nested Hypotheses and its Relationship to the Schwarz Criterion,” Journal of the American Statistical Association, 90, 928–934. DOI: 10.1080/01621459.1995.10476592.
  • Lai, T. L. (1976), “On Confidence Sequences,” The Annals of Statistics, 4, 265–280. DOI: 10.1214/aos/1176343406.
  • Lindon, M., and Malek, A. (2020), “Sequential Testing of Multinomial Hypotheses with Applications to Detecting Implementation Errors and Missing Data in Randomized Experiments,” available at https://arxiv.org/abs/2011.03567v1.
  • Ly, A., Marsman, M., Verhagen, J., Grasman, R. P., and Wagenmakers, E.-J. (2017), “A Tutorial on Fisher Information,” Journal of Mathematical Psychology, 80, 40–55. DOI: 10.1016/j.jmp.2017.05.006.
  • O’Hagan, A., and Forster, J. J. (2004), Kendall’s Advanced Theory of Statistics, volume 2B: Bayesian Inference (2nd ed.), London, UK: Arnold.
  • Pace, L., and Salvan, A. (2020), “Likelihood, Replicability and Robbins’ Confidence Sequences,” International Statistical Review, 88, 599–615. DOI: 10.1111/insr.12355.
  • Pramanik, S., and Johnson, V. E. (2022), “Efficient Alternatives for Bayesian Hypothesis Tests in Psychology,” Psychological Methods. DOI: 10.1037/met0000482.
  • R Core Team (2023), R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
  • Rafi, Z., and Greenland, S. (2020), “Semantic and Cognitive Tools to Aid Statistical Science: Replace Confidence and Significance by Compatibility and Surprise,” BMC Medical Research Methodology, 20, 244. DOI: 10.1186/s12874-020-01105-9.
  • Raftery, A. E. (1999), “Bayes Factors and BIC,” Sociological Methods & Research, 27, 411–427. DOI: 10.1177/0049124199027003005.
  • RECOVERY Collaborative Group. (2021), “Dexamethasone in Hospitalized Patients with Covid-19,” New England Journal of Medicine, 384, 693–704. DOI: 10.1056/nejmoa2021436.
  • Robbins, H. (1970), “Statistical Methods Related to the Law of the Iterated Logarithm,” The Annals of Mathematical Statistics, 41, 1397–1409. DOI: 10.1214/aoms/1177696786.
  • Royall, R. (1997), Statistical Evidence: A Likelihood Paradigm, London; New York: Chapman & Hall.
  • Sellke, T., Bayarri, M. J., and Berger, J. O. (2001), “Calibration of p Values for Testing Precise Null Hypotheses,” The American Statistician, 55, 62–71. DOI: 10.1198/000313001300339950.
  • Shafer, G. (2021), “Descriptive Probability,” working paper #59 (version September 30, 2021). Available at http://probabilityandfinance.com/articles/59.pdf.
  • Spiegelhalter, D. J., Abrams, K. R., and Myles, J. P. (2004), Bayesian Approaches to Clinical Trials and Health-Care Evaluation, New York: Wiley.
  • Vovk, V. G. (1993), “A Logic of Probability, With Application to the Foundations of Statistics,” Journal of the Royal Statistical Society, Series B, 55, 317–341. DOI: 10.1111/j.2517-6161.1993.tb01904.x.
  • Wagenmakers, E.-J. (2022), “Approximate Objective Bayes Factors from P-values and Sample Size: The 3pn Rule,” DOI: 10.31234/osf.io/egydq.
  • Wagenmakers, E.-J., Gronau, Q. F., Dablander, F., and Etz, A. (2022), “The Support Interval,” Erkenntnis, 87, 589–601. DOI: 10.1007/s10670-019-00209-z.
  • Wagenmakers, E.-J., and Ly, A. (2023), “History and Nature of the Jeffreys-Lindley Paradox,” Archive for History of Exact Sciences, 77, 25–72. DOI: 10.1007/s00407-022-00298-3.
  • Wassmer, G., and Brannath, W. (2016), Group Sequential and Confirmatory Adaptive Designs in Clinical Trials, Cham: Springer. DOI: 10.1007/978-3-319-32562-0.

Appendix:

The ciCalibrate Package

We provide an R implementation of the support intervals and underlying Bayes factor functions from Table 2. The package is available at https://CRAN.R-project.org/package=ciCalibrate and can be installed by executing install.packages("ciCalibrate") in an R console. The following code snippet illustrates the computation and plotting of a support interval and the Bayes factor function.
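As a minimal stand-in (the package's own interface and argument names are documented on CRAN and not assumed here), the base-R sketch below performs the same type of calibration directly from the 95% confidence interval reported by the RECOVERY trial and plots the Bayes factor function together with the k = 10 support interval.

## base-R sketch of the calibration that ciCalibrate automates
ci <- c(-0.29, -0.07)                   # reported 95% CI for the log hazard ratio
thetahat <- mean(ci)                    # point estimate recovered from the CI
sigma <- diff(ci) / (2 * qnorm(0.975))  # standard error recovered from the CI
mu <- -0.22; tau2 <- 4                  # normal prior N(mu, tau2) under H1
## Bayes factor function (4) and k = 10 support interval (5)
bf01 <- function(theta0) {
  dnorm(thetahat, mean = theta0, sd = sigma) /
    dnorm(thetahat, mean = mu, sd = sqrt(sigma^2 + tau2))
}
k <- 10
sq <- log(1 + tau2 / sigma^2) + (thetahat - mu)^2 / (sigma^2 + tau2) - 2 * log(k)
si <- thetahat + c(-1, 1) * sigma * sqrt(sq)
## plot the Bayes factor function, the support level k, and the interval limits
curve(bf01(x), from = -0.5, to = 0.1, log = "y",
      xlab = "Null value", ylab = "BF01")
abline(h = k, lty = 2)
abline(v = si, lty = 3)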