Research Article

Beyond the Classical Type I Error: Bayesian Metrics for Bayesian Designs Using Informative Priors

Received 31 May 2023, Accepted 27 Mar 2024, Published online: 31 May 2024

Abstract

There is growing interest in Bayesian clinical trial designs with informative prior distributions, for example for extrapolation of adult data to pediatrics, or use of external controls. While the classical Type I error is commonly used to evaluate such designs, it cannot be strictly controlled and it is acknowledged that other metrics may be more appropriate. We focus on two common situations—borrowing control data or information on the treatment contrast—and discuss several fully probabilistic metrics to evaluate the risk of false positive conclusions. Each metric requires specification of a design prior, which can differ from the analysis prior and permits understanding of the behavior of a Bayesian design under scenarios where the analysis prior differs from the true data generation process. The metrics include the average Type I error and the pre-posterior probability of a false positive result. When borrowing control data, our empirical cases demonstrate that the average Type I error is asymptotically controlled (in certain cases strictly) when the analysis and design prior coincide. We illustrate use of these Bayesian metrics with real applications, and discuss how they could facilitate discussions between sponsors, regulators and other stakeholders about the appropriateness of Bayesian borrowing designs for pivotal studies.

1 Introduction

Large, randomized clinical studies, known as pivotal trials, are the litmus test for new drugs. These trials are conducted to provide convincing evidence of a drug’s efficacy and safety and to support its cost-effectiveness evaluation (Petrou 2012). The costs and the operational complexity of these studies are remarkable: in extreme cases, they can cost hundreds of millions of dollars (Moore et al. 2018) and take multiple years to conduct. Unsurprisingly, this has led to the development of many statistical methods that aim to increase the efficiency, reduce the sample size or generally allow greater flexibility in the design of pivotal studies. Historically, most of these methods were formulated in the frequentist statistical framework. This is due to the dominant role that hypothesis testing (and the associated p-value) has been given in the assessment of novel therapies, with the common understanding that two adequately powered pivotal studies that show statistically significant results at a 2-sided α-level of 0.05 are required to establish effectiveness (Food and Drug Administration 1998). In this context, strict Type I error control has established itself as a guiding principle to judge whether the results of a trial can contribute to the required level of evidence or not. Consequently, this precludes any method that formally incorporates existing evidence into the analysis of a pivotal trial, such as Bayesian designs with informative priors. Of note, here and subsequently we refer to Bayesian designs as clinical trials that use Bayesian inference for the parameter of interest rather than to experiments that invoke Bayes risk or a Bayesian utility function for decision making.

Over the last few years, however, there has been increasing awareness of the substantial limitations that this strong focus on the frequentist framework and on Type I error control entails. For example, in its recent draft guidance Demonstrating Substantial Evidence of Effectiveness for Human Drugs and Biological Products (Food and Drug Administration 2019a), the US Food and Drug Administration (FDA) specifically discusses study designs that use external controls. Similarly, in its guidance Adaptive Design Clinical Trials for Drugs and Biologics (Food and Drug Administration 2019b), the FDA highlights Bayesian designs that use informative priors. Both are situations in which strict control of the Type I error is not usually possible. Furthermore, several approvals have been granted, both in the United States and in Europe, based on non-randomized studies using external controls (Goring et al. 2019). Even though these approvals were typically for rare diseases, they show the increasing willingness of regulators to consider evidence that has been generated outside the classical framework.

A particularly appealing approach to incorporating external evidence is via the use of informative prior distributions in the Bayesian framework. Since the prior distribution can be specified at the design stage, adherence to the principle of pre-specification is guaranteed, which ensures both transparency and avoidance of bias due to post-hoc decisions. Furthermore, through appropriate methods such as mixture priors (Schmidli et al. 2014), a possible prior-data conflict can also be mitigated at the design stage. Therefore, discussions and negotiations between the stakeholders (typically, regulators and pharmaceutical companies) can happen early in the process, that is, before the study is started. Such discussions will concentrate on relevant aspects including, for example, the selection of the external data, the decision rule for claiming study success, and the metrics to evaluate the study design. Again, the FDA has published a draft guidance (Food and Drug Administration 2020) that describes in greater detail what information on the study design it considers important in such cases.

In descriptions of Bayesian designs with informative priors, however, it is common to find an evaluation of the associated classical (i.e., frequentist or conditional) Type I error. This Type I error, if defined in the traditional way by considering only the sampling distribution of the observed data from the current trial, cannot be strictly controlled (Viele et al. 2014; Psioda and Ibrahim 2019; Kopp-Schneider, Calderazzo, and Wiesenfarth 2020) and, depending on several factors, it can be above, below or equal to its nominal level. The question is of course whether this is a problem of the Bayesian approach, or of the metric itself. Intuitively, we might think that there must be a Bayesian error metric that bears the same essential property for Bayesian designs that the classical Type I error does for frequentist designs, namely that it is controlled at a pre-specified level.

The goal of this article is to investigate this question in more detail for the archetypal setting of a 2-arm randomized trial, while acknowledging that it matters whether information on one group (typically, the control), or on the treatment contrast, is borrowed. For these two situations, building upon previous works (Spiegelhalter and Friedman 1986; Pennello and Thompson 2007; Chuang-Stein and Kirby 2017; Psioda and Ibrahim 2019), we will introduce several metrics, synthesize them within the same framework, describe their relationships and clarify the distinct situations in which each metric is valuable. We also illustrate their use with real applications. Importantly, we prove that, under certain conditions, one particular metric, the average (i.e., unconditional) Type I error, is actually controlled for Bayesian designs that leverage information on the control group. This result addresses the overarching question of whether a Bayesian error metric with analogous properties to the classical Type I error exists. We believe that this systematic exploration of metrics, their interconnections, and the establishment of important properties collectively fill a gap in the evaluation of Bayesian designs, as emphasized for example in one of FDA’s guidances (Food and Drug Administration 2020).

The article is structured as follows. First, we provide a short overview of the main ideas for constructing informative priors with a focus on the meta-analytic predictive prior. We then introduce the different metrics for Bayesian design evaluation. We illustrate them using two case studies, and conclude with a discussion.

2 Methods

2.1 Introductory Considerations

We consider the case of a novel test treatment (t) being compared to a control treatment (c) in a pivotal study. We will refer to this pivotal study as the new study. The underlying true treatment effects are denoted by θt,new and θc,new, respectively. For example, θc,new and θt,new could be the (true) mean changes from baseline or the (true) log-odds of response under the test and control treatment. The key quantity of interest will be the treatment effect (or treatment contrast), which, without loss of generality, we denote as δnew = θt,new − θc,new. In the Bayesian framework, we quantify our information about the treatment contrast before the data from the new study are available by the prior distribution p(δnew). Once the data ynew are available, the prior distribution is updated according to Bayes theorem to obtain the posterior distribution, that is, p(δnew | ynew) ∝ p(ynew | δnew) × p(δnew).

We note that the above involves an explicit formulation for the prior of δnew in the sense that a probability distribution for δnew is specified, and that the likelihood is also for the treatment contrast (such as a difference in means between treatment and control with corresponding standard error). However, this posterior distribution could also be derived in an implicit way by updating a prior distribution p(θc,new, θt,new) to obtain a posterior p(δnew = θt,new − θc,new | yt,new, yc,new).

In practical applications, the choice between an explicit and an implicit formulation will predominantly depend on whether information about the group-specific treatment effect(s) θt,new, θc,new, or about the treatment contrast δnew, is available.

We are now interested in leveraging information yh from one or more previous studies which will inform us about the (true) treatment effect of the control group or the (true) treatment contrast. While a number of approaches exist for leveraging this information, here we will focus on robust meta-analytic predictive priors as described in Schmidli et al. (2014). The robust meta-analytic predictive prior for the parameter βnew (equal to θc,new or δnew in the current setting) has the following form: p(βnew | yh) = w pMAP(βnew | yh) + (1 − w) probust(βnew), where pMAP(βnew | yh) is the meta-analytic predictive prior estimated from the historical data, probust(βnew) is the robust (vague, for example, unit-information) prior, and w is the a priori weight. For further details regarding the derivation of the meta-analytic predictive prior, we refer to Schmidli et al. (2014). As it is often overlooked, we note that the weights w, 1 − w will be updated to weights w̃, 1 − w̃ once data are available, where the update follows mixture calculus (Schmidli et al. 2014).
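Because the weight update is often overlooked, the following minimal sketch (in Python with numpy/scipy; our own illustrative code rather than the authors’ implementation) spells out the mixture calculus in the conjugate normal case: each component is updated analytically, and the prior weights are re-weighted in proportion to each component’s prior-predictive density of the observed data. The numbers in the example call are purely hypothetical.

import numpy as np
from scipy.stats import norm

def update_normal_mixture(weights, means, sds, y_bar, se):
    """Conjugate update of a normal mixture prior given y_bar ~ N(theta, se^2).
    Each component is updated analytically; the mixture weights are re-weighted
    in proportion to each component's prior-predictive density of y_bar."""
    weights, means, sds = map(np.asarray, (weights, means, sds))
    post_var = 1.0 / (1.0 / sds**2 + 1.0 / se**2)
    post_mean = post_var * (means / sds**2 + y_bar / se**2)
    marg = norm.pdf(y_bar, loc=means, scale=np.sqrt(sds**2 + se**2))
    post_w = weights * marg
    post_w = post_w / post_w.sum()
    return post_w, post_mean, np.sqrt(post_var)

# Hypothetical example: an 80/20 robust mixture prior for a control mean,
# updated with an observed control mean of -30 and standard error 20.
w_new, m_new, s_new = update_normal_mixture([0.8, 0.2], [-50.0, -50.0], [15.0, 88.0],
                                            y_bar=-30.0, se=20.0)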

Regardless of the situation, that is, whether data on the control treatment or on the treatment contrast are leveraged, decision-making will typically be based on the posterior distribution of the treatment contrast. A typical success criterion based on this Bayesian posterior distribution takes the form
\[
\text{Study success} = \mathbf{1}\{\Pr(\delta_{\mathrm{new}} > \delta_{\mathrm{null}} \mid y_{\mathrm{new}}) \ge 1 - \alpha\} \tag{1}
\]
where 1{·} is the indicator function and δnull is a clinically meaningful minimum threshold for the treatment contrast of interest. If δnull = 0, then this is a canonical success criterion for superiority, whereas for non-inferiority, δnull will be a pre-specified non-inferiority (NI) margin. Importantly, the Bayesian success criterion in (1) leads to the same success region as a classical 1-sided significance test of the null hypothesis δnew = δnull at level α when an improper constant prior for δnew is used. Furthermore, as with frequentist designs, success criteria can be specified that use values other than 0 or the NI margin for δnull, or, indeed, that use a value other than α = 0.025. In these cases, the choice of δnull and of the corresponding α requires further consideration and typically involves extended discussions among various stakeholders. Examples of other Bayesian success criteria that have been discussed and are used in practice can be found in Food and Drug Administration (2020), Walley et al. (2015), and Roychoudhury, Scheuer, and Neuenschwander (2018). In what follows, we focus on Bayesian designs with the success criterion defined as in (1), with δnull and α chosen to be any suitable values.
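For a posterior that is itself a mixture of normals (as produced, for instance, by a conjugate update such as the one sketched above), criterion (1) can be evaluated in closed form. A minimal sketch, with hypothetical function names and default thresholds:

import numpy as np
from scipy.stats import norm

def prob_exceeds(weights, means, sds, delta_null=0.0):
    """Pr(delta > delta_null) for a normal mixture posterior on the treatment contrast."""
    weights, means, sds = map(np.asarray, (weights, means, sds))
    return float(np.sum(weights * norm.sf(delta_null, loc=means, scale=sds)))

def study_success(weights, means, sds, delta_null=0.0, alpha=0.025):
    """Success rule (1): success if Pr(delta > delta_null | y) >= 1 - alpha."""
    return prob_exceeds(weights, means, sds, delta_null) >= 1.0 - alpha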

2.2 Metrics to Evaluate Bayesian Designs

For simplicity, in the following we omit the subscript denoting historical or new study and use θt,θc,δ, which will correspond to the parameters in the new study. Similarly, we also omit explicit conditioning on historical data in the prior distribution and just use p(θt),p(θc) and p(δ), but when necessary we will make clear what information a prior depends on.

Classical frequentist Type I error and power are well-established metrics to evaluate frequentist designs. Additionally, assurance (O’Hagan, Stevens, and Campbell 2005) is nowadays often used to acknowledge uncertainty about the true treatment effect when evaluating frequentist or Bayesian designs. However, in fact all of these metrics are special cases of a common metric m, which can be expressed as
\[
m(\mathrm{CP}(\delta), p(\delta)) = \int \Pr(\text{Study success} \mid \delta)\, p(\delta)\, d\delta \tag{2}
\]
where

  • “Study success” is the success rule for the design—this could be defined in terms of frequentist or Bayesian criteria, but here our main focus is on Bayesian designs with success defined as in (1).

  • CP(δ) = Pr(Study success | δ) = ∫ 1{Study success} p(ynew | δ) dynew is the conditional power, that is, the probability (with respect to the sampling distribution of the new study data) of study success conditional on an assumed value for the treatment contrast δ (and treating any other nuisance parameters, such as the sampling variance of the new study data and any historical data yh used in a Bayesian design, as fixed).

  • p(δ) is a function defining assumptions about the true value of δ.

A few interesting insights follow immediately. For example, by using a Dirac measure for p(δ) with point mass at δnull (denoted Δδnull(·)), we obtain:
\[
m(\mathrm{CP}(\delta), \Delta_{\delta_{\mathrm{null}}}) = \int \Pr(\text{Study success} \mid \delta)\, \Delta_{\delta_{\mathrm{null}}}(\delta)\, d\delta = \underbrace{\Pr(\text{Study success} \mid \delta = \delta_{\mathrm{null}})}_{\text{Classical Type I error}} \tag{3}
\]

When the study success criterion is based on a frequentist significance test of the null hypothesis δ=δnull, (3) is simply the classical Type I error. For a Bayesian design with success defined as in (1), metric (3) can also be viewed as a classical repeated sampling Type I error rate associated with the specified Bayesian design, and we will use the term “classical Type I error” to refer to both situations. Similarly, for the classical power (of either a frequentist or Bayesian design) we take a Dirac measure with point mass at the hypothesized value for the treatment contrast under the alternative. For assurance, we finally use a distribution for p(δ) that reflects our uncertainty around the hypothesized treatment effect. At this point, it becomes obvious that p(δ) is a prior distribution, but one that plays a different role to the prior specified as part of the Bayesian analysis in Section 2.1. This has sometimes led to confusion in Bayesian designs, as there are now two priors present: the prior used for the actual analysis (which we will refer to as the analysis prior), and the prior for the design evaluation (which we will refer to as the design prior). This semantic differentiation should help to bring clarity with regards to which prior is the object of interest. Note that the analysis prior is not seen explicitly in (2), nor in any of the subsequent equations in this section, as it is embedded in the definition of Bayesian study success (1) considered here, which is a function of the posterior distribution for δ.
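Metric (2) lends itself to a simple Monte Carlo approximation: draw the true contrast from the design prior, simulate new study data, apply the success rule, and average. The sketch below is a generic skeleton under a normal sampling model with known standard error; the vague-prior success rule and the standard error are illustrative assumptions, not quantities fixed by the paper.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def metric_m(success_rule, design_prior_draw, se_new, n_sim=100_000):
    """Monte Carlo estimate of metric (2): average of Pr(success | delta) over p(delta).
    `design_prior_draw(rng, n)` samples true contrasts from the design prior;
    `success_rule(y)` applies the (frequentist or Bayesian) success criterion."""
    delta_true = design_prior_draw(rng, n_sim)
    y_new = rng.normal(delta_true, se_new)        # sampling distribution of the new study
    return float(np.mean([success_rule(y) for y in y_new]))

# Classical Type I error = Dirac design prior at delta_null = 0, as in (3).
# With a vague analysis prior, success reduces to a one-sided z-test at alpha = 0.025.
se_new = 0.4                                       # assumed standard error of the contrast
success = lambda y: norm.sf(0.0, loc=y, scale=se_new) >= 0.975
type_one_error = metric_m(success, lambda rng, n: np.zeros(n), se_new)   # approx. 0.025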

We note that, from a very stringent viewpoint, the separation between analysis prior and design prior does not entirely adhere to the Bayesian paradigm: the analysis prior is the best reflection of the evidence and the corresponding uncertainty and thus, this is the prior that should be used throughout. However, we acknowledge that calibration of Bayesian designs under different assumptions about the true parameter value(s) is useful (Grieve 2016), and is typically expected by regulatory agencies (Food and Drug Administration 2020). Indeed, drift (Viele et al. 2018; Lim et al. 2019), defined as the difference between the true parameter value in the new study and the historical prior value, is the key quantity that drives bias (and hence the risk of incorrect conclusions) in Bayesian borrowing designs. Assumptions about how much drift is plausible therefore require careful consideration and discussion by the trial sponsor and regulatory agency. These assumptions could be reflected by fixing different values for δ, as is done to evaluate classical Type I error or power. Yet, which values should one pick, and how likely are they? A natural extension is the aforementioned differentiation between the two priors, which facilitates the exploration of “what if” sensitivity scenarios for calibration purposes and has been advocated by several authors (Wang and Gelfand 2002; Spiegelhalter, Abrams, and Myles 2004). It essentially permits understanding of the behavior of a Bayesian design under “drift” scenarios where the prior information used for analysis may differ from the true process generating the data.

When analysis prior information is available on θc and/or θt, we typically wish to evaluate the design using design priors p(θc) and/or p(θt) instead of using directly a design prior p(δ). In this case, it is convenient to also express the study success rule as a function of θt, θc to make explicit the dependence on the design parameters of interest. This leads to the following formulation of metric (2):
\[
m(\mathrm{CP}(\theta_c, \theta_t), p(\theta_c, \theta_t)) = \iint \Pr(\text{Study success} \mid \theta_c, \theta_t)\, p(\theta_c, \theta_t)\, d\theta_c\, d\theta_t. \tag{4}
\]

The traditional definitions of Type I error and power impose deterministic constraints on the relationship between θt and θc (i.e., θt − θc = δ*, where δ* is a deterministic value and equals the null or alternative value of the treatment contrast, respectively). Because the parameters θt and θc can be expressed as functions of each other and of this fixed value δ*, it follows that a design prior for only one of them is necessary in order to evaluate metric (4), as the prior for the other can be derived from their relationship. If we assume a design prior p(θc), this leads to
\[
m(\mathrm{CP}(\theta_c, \theta_t = \theta_c + \delta^*), p(\theta_c)) = \int \underbrace{\Pr(\text{Study success} \mid \theta_c, \theta_t = \theta_c + \delta^*)}_{\text{Classical Type I error or power}}\, p(\theta_c)\, d\theta_c. \tag{5}
\]

Two interesting insights are as follows:

  1. As already noted, for a Bayesian design with prior information only on the treatment contrast, the metric defined by m(CP(δ),Δδnull) in (3) for a specified choice of δnull and α in the success rule reduces to a single value analogous to the classical Type I error.

  2. For a Bayesian design with prior information on the control treatment, the classical Type I error (or power) is not a single value but a “pointwise” rate that varies as a function of the true value of θc , while the metric defined by m(CP(θc,θt=θc+δnull),p(θc)) in (5) for a specified choice of δnull and α in the success rule is the average (unconditional, or marginal) of this classical Type I error with respect to the design prior distribution p(θc).

2.3 Specific Considerations When Borrowing Information on the Control Response

In the special case of a normal prior and normal likelihood with known standard deviation, the average Type I error defined in (5) is strictly controlled at level α if the analysis prior is also used as the design prior p(θc) (see proof in the supplementary material). Moreover, we show empirically in the examples that this control also appears to hold in more general cases with normal mixture priors. We find this an important and remarkable point, as it provides a bridge (and consistency) between two quite different worlds: in the frequentist world, we obtain strict Type I error control as we are conditioning on the assumption of no treatment effect. In the Bayesian world, we obtain asymptotic (as shown empirically) or, in some cases, even strict control as we are marginalizing over the analysis prior distribution, assuming this is the best reflection of the available information. We think this is particularly reassuring since it means that, once agreement has been reached on the analysis prior for the control response, asymptotic Type I error control with respect to the average (i.e., unconditional) Type I error over the same prior (now viewed as the design prior) is guaranteed.
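The conjugate normal case can be checked directly by simulation. The minimal sketch below draws the true control mean from a design prior equal to the analysis prior, sets the treatment mean equal to it (no true effect), and evaluates the success rule (1); the resulting average Type I error should then match α, consistent with the result referenced above. All design constants are illustrative assumptions, not values from the paper.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Illustrative setting: known sigma, n_c control / n_t treated patients,
# a single normal analysis prior N(m0, s0^2) for theta_c, flat prior for theta_t.
sigma, n_c, n_t = 88.0, 20, 40
m0, s0, alpha, delta_null = -50.0, 10.0, 0.025, 0.0
se_c, se_t = sigma / np.sqrt(n_c), sigma / np.sqrt(n_t)

# Design prior for theta_c equals the analysis prior; theta_t = theta_c + delta_null.
n_sim = 500_000
theta_c = rng.normal(m0, s0, n_sim)
yc = rng.normal(theta_c, se_c)                  # observed control arm mean
yt = rng.normal(theta_c + delta_null, se_t)     # observed treatment arm mean

# Posterior for theta_c (conjugate normal update), flat prior for theta_t.
post_var_c = 1 / (1 / s0**2 + 1 / se_c**2)
post_mean_c = post_var_c * (m0 / s0**2 + yc / se_c**2)
post_mean_d = yt - post_mean_c                  # posterior mean of delta
post_sd_d = np.sqrt(se_t**2 + post_var_c)       # posterior sd of delta

# Success rule (1) and the average Type I error (5).
success = norm.sf(delta_null, loc=post_mean_d, scale=post_sd_d) >= 1 - alpha
print(success.mean())    # approximately 0.025 in this conjugate case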

Instead of integrating the classical Type I error or power over the design prior as in (5), we may also consider a subset of the parameter space, or the maximum or minimum value over the range of the parameter space, for example: max_{θc ∈ Θc} Pr(Study success | θc, θt = θc + δ*).

We also note that it may be of interest to evaluate the conditional power at true values other than the point null δnull. For example, we might be interested in the conditional power if the investigational treatment is assumed to be harmful, that is δ*<δnull. This concept will become important in the next section.

2.4 Specific Considerations When Borrowing Information on the Treatment Contrast

When borrowing on the treatment contrast, if the prior information supports a positive effect of the investigational treatment (as is typically the case), this corresponds to a scenario where the prior is in conflict with the null treatment effect, resulting in an inflated classical Type I error rate relative to what is typically considered acceptable. Kopp-Schneider, Calderazzo, and Wiesenfarth (2020) and Psioda and Ibrahim (2019) have shown that if strict control of the Type I error is required in this setting, then prior information is effectively discarded. This is also reflected in FDA’s guidance on Complex Innovative Trial Designs (Food and Drug Administration 2020), where it is stated that when the Type I error probability is not applicable (e.g., some Bayesian designs that borrow external information), appropriate alternative trial characteristics should be considered.

We concur with Psioda and Xue (2020) that in cases where stakeholders believe reliable and relevant prior information on a treatment contrast is pertinent to the analysis of a new study, then to disregard this evidence in the name of ensuring strict Type I error control is questionable. On the other hand, if relaxation of classical frequentist Type I error control is to be permitted, other metrics are needed that help balance the potential efficiency gains of borrowing with the unintended negative consequences that will arise when, in reality, despite sound rationale, the prior information is not pertinent.

One such alternative metric, proposed by Psioda and Ibrahim (2019), is an average Type I error that is somewhat analogous to (5):
\[
m(\mathrm{CP}(\delta), p_{\mathrm{null}}(\delta)) = \int \Pr(\text{Study success} \mid \delta)\, p_{\mathrm{null}}(\delta)\, d\delta \tag{6}
\]
where the average is with respect to a design prior, pnull(δ), which must be chosen to be consistent with the assumed null treatment effect. Psioda and Ibrahim (2019) suggest that a logical choice is to use the normalized analysis prior truncated to the range of values for the treatment effect that are consistent with the null:
\[
p_{\mathrm{null}}(\delta) = \frac{p(\delta)\, \mathbf{1}\{\delta \le \delta_{\mathrm{null}}\}}{\int_{\delta \le \delta_{\mathrm{null}}} p(\delta)\, d\delta} = \frac{p(\delta)\, \mathbf{1}\{\delta \le \delta_{\mathrm{null}}\}}{\Pr(\delta \le \delta_{\mathrm{null}})}. \tag{7}
\]

Psioda and Ibrahim (2019) show that the average Type I error (6) can be controlled at a specified level α through appropriate calibration of design parameters (e.g., prior weight on historical data and current trial sample size), implemented via a simulation-based grid search over the design space. They note, however, that in scenarios where the tail of the null design prior has negligible mass on values of δ smaller than δnull, little or no historical information can be borrowed if the average Type I error is to be controlled at conventional levels for α. Importantly, for analysis priors specified directly on the treatment contrast, there is no equivalent of the result proved in the supplementary material for strong control of the average Type I error when borrowing on the control response. This is because, in situations where the analysis prior favors non-null treatment effects, there is a fundamental inconsistency between the analysis prior and the null treatment effect, and so the null design prior cannot be chosen to be consistent with the analysis prior.

Here, we propose an alternative metric which addresses the inconsistency between the prior information and the null treatment effect by explicitly accounting for the probability that the treatment effect is null or harmful under a suitably-chosen design prior:
\[
\begin{aligned}
\tilde{m}(\mathrm{CP}(\delta), \Pr(\delta \le \delta_{\mathrm{null}}), p(\delta))
&= \underbrace{m(\mathrm{CP}(\delta), p_{\mathrm{null}}(\delta))}_{\text{Average Type I error under null design prior}} \times \underbrace{\Pr(\delta \le \delta_{\mathrm{null}})}_{\text{Prob.\ treatment effect is null/harmful}} \\
&= \int \Pr(\text{Study success} \mid \delta)\, \frac{p(\delta)\, \mathbf{1}\{\delta \le \delta_{\mathrm{null}}\}}{\Pr(\delta \le \delta_{\mathrm{null}})}\, d\delta \times \Pr(\delta \le \delta_{\mathrm{null}}) \\
&= \underbrace{\int \Pr(\text{Study success} \mid \delta)\, \mathbf{1}\{\delta \le \delta_{\mathrm{null}}\}\, p(\delta)\, d\delta}_{\text{Joint probability that trial is a success and true treatment effect is null or harmful}} \\
&= \int_{\delta \le \delta_{\mathrm{null}}} \Pr(\text{Study success} \mid \delta)\, p(\delta)\, d\delta.
\end{aligned} \tag{8}
\]

Metric (8) is equal to the average Type I error (6) (where the average is with respect to the null design prior (7)) multiplied by the prior probability (under the corresponding untruncated version of the design prior p(δ)) of the treatment effect being null or harmful. Closer inspection of the penultimate row of equation (8) shows that this metric can also be interpreted as the joint probability of the true treatment effect being null or harmful and the study being declared a success. Spiegelhalter and Friedman (1986) call this the Type III error of actually drawing a false positive conclusion, and it is sometimes also referred to as the pre-posterior probability of a false positive result. Metric (8) is also closely related to a metric proposed by Chuang-Stein and Kirby (2017) to support decision-making for clinical trials. Their proposal is to calculate the probability of a correct decision for a trial design, which is the sum of two quantities: (a) the joint probability of the true treatment effect being beneficial and the trial being declared a success, plus (b) the joint probability of the true treatment effect being null or harmful and the trial failing to meet the success criteria. These various joint probabilities are illustrated in Table 1. It can be seen that the probability of a correct decision equals pTP + pTN, whilst our proposed metric in (8) to assess the pre-posterior probability of obtaining a false positive outcome is simply pFP.

Table 1 Joint probabilities of (truth, decision) for a clinical trial (modified from Chuang-Stein and Kirby 2017).

Calculation of (8) (and of the other probabilities in Table 1) requires specification of a suitable design prior for δ. As with the average Type I error defined in (5), the analysis prior may be used for this design prior, or other choices could be considered, such as a prior representing skeptical beliefs about the true treatment contrast. One appealing choice is a “spike and slab” design prior, that is, p(δ) = πnull Δδnull(δ) + (1 − πnull) pbenefit(δ), where the prior probability density on null or harmful values is a Dirac measure concentrated on a point mass (the “spike”) at δnull with weight πnull, and pbenefit(δ) represents the design prior for non-null (beneficial) values of the treatment effect. Under this design prior, metric (8) becomes:
\[
\begin{aligned}
\tilde{m}\bigl(\mathrm{CP}(\delta), \pi_{\mathrm{null}}, \pi_{\mathrm{null}}\Delta_{\delta_{\mathrm{null}}}(\delta) + (1-\pi_{\mathrm{null}})\,p_{\mathrm{benefit}}(\delta)\bigr)
&= \int_{\delta \le \delta_{\mathrm{null}}} \Pr(\text{Study success} \mid \delta)\,\bigl\{\pi_{\mathrm{null}}\Delta_{\delta_{\mathrm{null}}}(\delta) + (1-\pi_{\mathrm{null}})\,p_{\mathrm{benefit}}(\delta)\bigr\}\, d\delta \\
&= \pi_{\mathrm{null}} \int_{\delta \le \delta_{\mathrm{null}}} \Pr(\text{Study success} \mid \delta)\,\Delta_{\delta_{\mathrm{null}}}(\delta)\, d\delta + (1-\pi_{\mathrm{null}}) \int_{\delta \le \delta_{\mathrm{null}}} \Pr(\text{Study success} \mid \delta)\,p_{\mathrm{benefit}}(\delta)\, d\delta \\
&= \underbrace{\pi_{\mathrm{null}}}_{\text{Prob.\ treatment effect is null}} \times \underbrace{\Pr(\text{Study success} \mid \delta = \delta_{\mathrm{null}})}_{\text{Classical Type I error}} + 0.
\end{aligned} \tag{9}
\]

The spike and slab design prior reduces sensitivity of metric (8) to how quickly the design prior down-weights values of δ that are more extreme than the point null, since all the mass on non-beneficial values is concentrated on the point null value. Closer inspection of (9) shows that this is also equivalent to the classical Type I error in (3) multiplied by the probability (under the spike and slab design prior) that the treatment effect is null.
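The metrics of this section are again simple to approximate by Monte Carlo. The sketch below uses purely illustrative numbers (a two-component normal mixture analysis prior on δ and an assumed standard error for the new study): it estimates the pre-posterior false-positive probability (8), recovers the null-design-prior average Type I error (6) from it, and shows the spike-and-slab simplification (9).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Illustrative setting for borrowing on the treatment contrast: the analysis prior
# on delta is a two-component mixture and y_new ~ N(delta, se^2). All numbers below
# are assumptions chosen for illustration only.
se, delta_null, alpha = 0.4, 0.0, 0.025
aw, am, asd = np.array([0.7, 0.3]), np.array([0.5, 0.0]), np.array([0.12, 2.9])

def success(y):
    """Success rule (1) under the mixture analysis prior (conjugate update)."""
    post_var = 1 / (1 / asd**2 + 1 / se**2)
    post_mean = post_var * (am / asd**2 + y / se**2)
    w = aw * norm.pdf(y, am, np.sqrt(asd**2 + se**2))
    w = w / w.sum()
    return np.sum(w * norm.sf(delta_null, post_mean, np.sqrt(post_var))) >= 1 - alpha

# Draw the true contrast from a full design prior (here, for illustration, the
# analysis prior itself), simulate the new study, and estimate metric (8).
n_sim = 100_000
comp = rng.choice(2, size=n_sim, p=aw)
delta = rng.normal(am[comp], asd[comp])
y = rng.normal(delta, se)
succ = np.array([success(v) for v in y])
pre_posterior_fp = np.mean(succ & (delta <= delta_null))          # metric (8)
avg_t1e_null = pre_posterior_fp / np.mean(delta <= delta_null)    # metric (6) with (7)

# Spike-and-slab design prior: by (9), the pre-posterior false-positive probability
# is simply pi_null times the classical Type I error.
pi_null = np.mean(delta <= delta_null)
classical_t1e = np.mean([success(v) for v in rng.normal(delta_null, se, n_sim)])
spike_and_slab_fp = pi_null * classical_t1e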

3 Case Study 1: Borrowing Historical Placebo Data

We illustrate several metrics on a case study inspired by a double-blind, randomized, placebo-controlled proof-of-concept study testing whether secukinumab, a human anti-IL-17A monoclonal antibody, was safe and effective for the treatment of moderate to severe Crohn’s disease (Hueber et al. 2012; Weber et al. 2020). The disease status is assessed with the Crohn’s Disease Activity Index (CDAI), which consists of a weighted sum of eight clinical or laboratory variables (Best et al. 1976). CDAI scores can range from 0 to about 600, where an asymptomatic condition corresponds to a value below 150, and severe disease is defined as a value greater than 450. The study’s primary endpoint is the CDAI change from baseline at week 6, with negative values indicating an improvement of the patient’s condition. Historical placebo data are available from 6 studies with a total of 671 patients on placebo (see supplementary material, Table S1). The primary endpoint is assumed normally distributed with a known standard deviation of σ = 88, estimated from the literature and historical studies.

3.1 Design

We discuss a Bayesian Dynamic Borrowing (BDB) design for the new study, using a 2:1 randomization ratio (treatment to control) and an informative prior for the placebo arm to supplement the data from the concurrently randomized placebo subjects. The planned sample size is 40 patients in the active treatment group and 20 patients in the placebo group. Assuming a true difference of –70 in the change from baseline at 6 weeks as compared to placebo, the study provides 83% power in a frequentist framework with 1-sided α = 0.025. While the original study was designed with a dual-criterion (Roychoudhury, Scheuer, and Neuenschwander 2018), we use here for simplicity a single Bayesian success criterion equivalent to classical statistical significance under an improper prior, which we define as Pr(θt,new − θc,new < 0 | ynew) ≥ 0.975, where θt,new and θc,new denote the true mean CDAI change from baseline at week 6 in the active and control arms, respectively, in the new study.

We derive an informative prior for the control arm using the meta-analytic-predictive (MAP) approach to summarize the historical placebo data (Schmidli et al. 2014; Neuenschwander et al. 2010) (see details in the supplementary material), and approximate it with a mixture normal distribution (Weber et al. 2020) with three components:
\[
p_{\mathrm{MAP}}(\theta_{c,\mathrm{new}} \mid y_{c,1},\ldots,y_{c,6}) = 0.51 \times N(-51.0,\, 19.9^2) + 0.44 \times N(-46.8,\, 7.6^2) + 0.05 \times N(-54.1,\, 51.7^2) \tag{10}
\]
where pMAP denotes the density of the MAP prior. Finally, we robustify the MAP prior to mitigate a possible prior-data conflict (Schmidli et al. 2014) by adding a robust unit-information component to the mixture distribution (Figure 1):
\[
p_{\mathrm{RMAP}}(\theta_{c,\mathrm{new}} \mid y_{c,1},\ldots,y_{c,6}) = 0.8 \times p_{\mathrm{MAP}} + 0.2 \times N(-50.0,\, 88^2), \tag{11}
\]
where pRMAP denotes the density of the robust MAP prior. The mean of the vague component is set to –50 to be consistent with the historical data, and its standard deviation is set to σ = 88, so that the vague prior component is approximately equivalent to one subject’s worth of information. The choice of the weight of 20% on the robust component reflects our prior belief about the possibility of non-exchangeability between the placebo effect estimated from historical studies and the placebo effect in the new study.

Fig. 1 Crohn’s disease application. Left: Forest plot of observed data for the historical studies, and posterior and predictive (MAP) estimates of the true placebo effect. Right: Robust MAP prior distribution with prior weights of 80% on the historical data (MAP) and 20% on the vague (robust) component.


We consider three different analysis priors for the placebo arm:

  • A (very) vague prior, defined as θc,new ~ N(−50, 8800²), not borrowing historical information

  • The MAP prior (10)

  • The robust MAP prior (11)

The vague prior is also used for the active arm in all designs.

3.2 Classical Type I Error

Figure 2 presents the pointwise classical (frequentist) Type I error rate for each of the three Bayesian designs over a large but plausible range of true values of CDAI mean change from baseline between –150 and 50, when both the treatment and the placebo are assumed to have the same effect (i.e., θt,new = θc,new). First, it can be seen that, when using vague priors for both arms (i.e., no borrowing of historical data), the classical Type I error rate does not depend on the true value of the placebo effect θc,new and is controlled at 2.5%. On the other hand, when borrowing historical placebo data (using either the MAP or the robust MAP analysis prior), the pointwise classical Type I error rate can be increased or decreased, depending on the difference between the true placebo effect in the new study and the observed effect in historical data. In particular, when the true placebo effect is better (more negative) than the observed effect in historical data, the classical Type I error rate can be increased compared to the nominal level. In this case, using historical information penalizes the observed placebo effect, leading to a more pronounced treatment contrast estimate and increasing the classical Type I error. Over the range of values for the true placebo response shown in Figure 2, the maximum of the classical Type I error rate is 19% with the MAP prior, but it can theoretically reach 100% for biologically implausible values of θc,new (lower than –1000, see supplementary material, Figure S1). As expected, the robust MAP prior reduces this Type I error inflation to a maximum of 11% on this range (and across the entire range), as the informative components in the prior are entirely discarded for large prior-data conflicts. Readers may also have noted that the Type I error rates for both the MAP and robust MAP prior designs are slightly lower than the nominal level when the true placebo effect is identical or very similar to the observed historical value; this relates to the fact that the sampling distribution of the posterior test statistic (i.e., the indicator of whether the posterior success criterion is met or not) is unbiased (or nearly unbiased) and has less variability (due to the prior information) than under an improper prior, so the tail area will be less than α.
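A rough way to reproduce the shape of the curves in Figure 2 is by simulation. The sketch below uses the robust MAP prior (11) for the control arm and treats the active-arm prior as effectively flat; the Monte Carlo setup (grid, number of simulations) is our own assumption rather than the authors’ exact computation.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

sigma, n_t, n_c, alpha = 88.0, 40, 20, 0.025
se_t, se_c = sigma / np.sqrt(n_t), sigma / np.sqrt(n_c)

# Robust MAP analysis prior (11) for theta_c,new, written as a 4-component mixture.
pw = np.array([0.8 * 0.51, 0.8 * 0.44, 0.8 * 0.05, 0.2])
pm = np.array([-51.0, -46.8, -54.1, -50.0])
ps = np.array([19.9, 7.6, 51.7, 88.0])

def success(yc_bar, yt_bar):
    """Pr(theta_t,new - theta_c,new < 0 | data) >= 0.975, with the mixture prior on
    the control mean and an effectively flat prior on the treatment mean."""
    post_var = 1 / (1 / ps**2 + 1 / se_c**2)
    post_mean = post_var * (pm / ps**2 + yc_bar / se_c**2)
    w = pw * norm.pdf(yc_bar, pm, np.sqrt(ps**2 + se_c**2))
    w = w / w.sum()
    pr_benefit = np.sum(w * norm.cdf(0.0, loc=yt_bar - post_mean,
                                     scale=np.sqrt(se_t**2 + post_var)))
    return pr_benefit >= 1 - alpha

def pointwise_t1e(theta_c, n_sim=10_000):
    """Classical Type I error at a fixed true placebo effect (theta_t,new = theta_c,new)."""
    yc = rng.normal(theta_c, se_c, n_sim)
    yt = rng.normal(theta_c, se_t, n_sim)
    return np.mean([success(c, t) for c, t in zip(yc, yt)])

grid = np.linspace(-150, 50, 9)
t1e_curve = [pointwise_t1e(v) for v in grid]

Averaging this pointwise rate over a design prior for the control mean, as in Section 3.3 below, simply amounts to drawing theta_c from that design prior instead of fixing it on a grid.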

Fig. 2 Crohn’s disease application: classical Type I error for Bayesian designs with three different analysis priors for the control arm.


3.3 Average Type I Error

We evaluate the average (unconditional) Type I error proposed in (5) over four design priors (see Figure 3):

Fig. 3 Crohn’s disease application. Different design priors used for the average Type I error calculations, overlaid on the Type I error curve (from Figure 2) for Bayesian designs with three different analysis priors for the control arm.

  1. A “vague” design prior chosen to be the same as the vague prior used for the analysis.

  2. A “skeptical” design prior θc,new ~ N(−90, 25²) that corresponds to the posterior distribution of a stand-alone analysis of the historical study with the “most extreme” placebo effect (APhTh04).

  3. A “realistic” design prior based on all relevant historical data, chosen to be the same as the MAP analysis prior.

  4. A “robust” design prior chosen to be the same as the robust MAP analysis prior.

The results are presented in Table 2. A consistent viewpoint is to use the same distribution as analysis prior and design prior, meaning that the distribution of assumed values for the placebo effect used to evaluate the design corresponds to the prior assumption about the placebo effect used in the analysis. The values in bold in Table 2 provide empirical evidence that, under this viewpoint, the average Type I error rate is controlled at 2.5%, as discussed in Section 2.3. As sensitivity analyses, we assume some skepticism about the distribution of the true placebo effect, to help understand how the analysis prior behaves when using another design prior.

Table 2 Crohn’s disease application. Average Type I error.

First, we observe that the average Type I error is always controlled at its nominal level when the analysis prior is vague, regardless of which design prior is assumed. This is expected since there is no increase in the pointwise frequentist Type I error rate with this analysis prior.

On the other hand, when the BDB design is evaluated under a scenario where all possible values of the true placebo effect are considered (almost) equally likely (vague design prior), we observe increased average Type I error rates (48.5% and 45.6% for designs using the MAP or robust MAP analysis prior respectively). This is because extreme values for the true control effect are considered to be possible with the vague design prior, resulting in large, but implausible, prior data conflicts.

The Type I error rate is much lower when it is averaged over more informative design priors compared to the vague design prior. Assuming the skeptical design prior for the placebo effect results in an average Type I error of 13.4% with the MAP analysis prior, reduced to 8.8% with the robust MAP analysis prior thanks to the robustification. Averaging over the plausible, but conservative, robust MAP design prior leads to a small Type I error increase with the MAP analysis prior (3.2%). On the other hand, the Type I error is reduced to 2.2% when using the robust MAP analysis prior but averaging over the MAP design prior.

In summary, assuming the same distribution for the analysis and the design priors is the most consistent assumption, and our results provide empirical confirmation that the average Type I error of BDB designs is controlled in these scenarios. Sensitivity analyses were conducted to assess the impact of evaluating the design using assumptions (design priors) for the true placebo effect that are not consistent with the analysis prior: they show that increases in the average Type I error are possible, but they are large only for highly implausible assumptions (vague design prior). Plots such as Figure 3 can be a helpful way to illustrate how much credence is placed on different values of the Type I error under different design priors, and we recommend including visual representations such as this to aid communication of the average Type I error to stakeholders.

4 Case Study 2: Borrowing Historical Information on a Treatment Contrast

The second case study is inspired by a recent FDA approval of sBLA 125370/S-064 for the belimumab (Benlysta) IV formulation for use in children aged 5–17 years. Benlysta was approved by the FDA for adult patients with active, seropositive systemic lupus erythematosus (SLE) in 2011. A pediatric post-marketing study was required, and the applicant undertook to conduct a randomized, double-blind, placebo-controlled trial aiming to enroll 100 pediatric subjects 5–17 years of age with active SLE. The pediatric study was not fully powered by design; efficacy was planned to be descriptive, and no formal statistical hypothesis testing was proposed. The study was completed in 2018, with a total of 92 subjects.

To facilitate the review of Benlysta, the FDA requested a post-hoc Bayesian analysis to further evaluate the efficacy of Benlysta in pediatric SLE patients by using relevant information from the adult studies. The rationale was to provide more reliable efficacy estimates in the pediatric study in a setting where the clinical review team believed that the disease and patient response to treatment are likely to be similar between adults and pediatrics (see Food and Drug Administration (2018) for details). For the purposes of the present (hypothetical) case-study, we consider how a study of Benlysta in pediatric subjects could have been prospectively designed using a pre-specified BDB analysis borrowing efficacy data from the adult pivotal trials to provide confirmatory evidence of a positive benefit-risk of Benlysta in children with SLE.

4.1 Design of Pediatric Trial

Evidence of efficacy has been established in adults in two independent pivotal Phase 3 trials, which are pooled and considered to be one single source of historical data, indexed by h. The primary endpoint was response at week 52 on the SLE responder index (SRI; Food and Drug Administration 2018), and the summary measure of treatment effect was the odds ratio for Benlysta compared to placebo. The pooled odds ratio based on a total of Nh = 1125 subjects from these studies was 1.62 (95% CI 1.27–2.05), which on the log odds ratio scale corresponds to a point estimate of yh = 0.48 with standard error sh = 0.121.

A pediatric trial is proposed using a BDB design, to draw inference about the treatment effect in the pediatric population by supplementing the pediatric data with data from the pivotal adult studies. The planned sample size is 100 patients (50 patients per arm), with study success defined as having at least 97.5% posterior probability that the true log odds ratio of response in pediatrics at week 52 on the SRI disease activity index on Benlysta compared to placebo exceeds 0: Pr(δnew > 0 | ynew) ≥ 0.975, where δnew denotes the true pediatric log odds ratio of response.

Let ynew denote the observed log odds ratio of response from a logistic regression of the pediatric data and snew its estimated standard error. We assume a normal sampling distribution ynew ~ N(δnew, snew²) with snew treated as fixed, and we build a Bayesian robust mixture prior for the pediatric treatment contrast δnew (Figure 4): p(δnew | yh, sh) = 0.7 × N(0.48, 0.121²) + 0.3 × N(0, 2.87²).

Fig. 4 Pediatric lupus application: robust mixture prior distribution with prior weights of 70% on the adult data (= posterior distribution of the odds ratio from the pooled adult studies, assuming an initial non-informative prior) and 30% on the vague (robust) component.


The informative component represents the posterior distribution for the adult treatment contrast (log odds ratio) obtained from a Bayesian analysis of the pooled adult data yh, assuming a normal sampling distribution with known variance sh² and an improper prior on the mean. A prior weight of w = 70% is assigned to this adult component, representing the prior degree of belief that the adult treatment effect estimated from the pivotal studies provides relevant information about the treatment effect in pediatric patients. The mean of the vague component is set to 0 (i.e., centered at the null hypothesis of no effect) and the variance is set to Nh/2 × sh² = 2.87², so that the effective sample size of the vague component is worth just one subject per arm.

For comparison, we also consider a Bayesian analysis with a vague prior defined as δnew ~ N(0, 100²), not borrowing any adult information.

4.2 Classical Type I Error

Figure 5 (top) and Table 3 (top section, final column, showing the point mass design prior) present the classical frequentist operating characteristics of the designs under the two different analysis priors. In contrast to Case Study 1 (where borrowing was on the placebo response and the pointwise classical Type I error depended on the drift between the true placebo response and the historical placebo data), there is only one value of the classical Type I error for a given BDB design when borrowing prior information directly on the treatment contrast. Therefore, the Type I error and power do not vary with the true pediatric odds ratio on the x-axis of Figure 5, but correspond to selected values on this axis.


Fig. 5 Pediatric lupus application. Top: Probability of success (PoS) curves showing classical Type I error (true pediatric odds ratio = 1) and power (true pediatric odds ratio = 1.6) for the Bayesian study designs using robust mixture analysis prior or vague analysis prior. Middle: PoS curves for both analysis priors, overlaid with the three different null design priors used to calculate average Type I error. Bottom: PoS curves for δ≤δnull(= log (1)) for both analysis priors, overlaid with the three different full design priors used to calculate the pre-posterior probability of actually declaring a false positive result.

Table 3 Pediatric lupus application. Operating characteristics of the Bayesian study designs using different analysis and design priors.

The prior information from adult data supports a positive treatment effect. Since, by definition, the Type I error is evaluated by assuming the true treatment effect is null (i.e., a true odds ratio equal to 1), it is calculated under a scenario where the prior is in conflict with the null treatment effect, resulting in a large inflation of the Type I error with the BDB design using the robust adult prior (33%, as compared to 2.5% for the design with the vague analysis prior that does not borrow adult information). As indicated in Section 2.4, strict control of the Type I error while borrowing on historical data is not possible in this context. Figure 5 (top) also illustrates the “flip side” of borrowing the adult prior information in terms of the power gain (blue lines): for example, the probability of meeting the trial success criteria if the true pediatric odds ratio is 1.6 is 77% for the BDB design and only 21% for the Bayesian design with the vague analysis prior.
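A simulation sketch of the probability-of-success curve underlying these numbers is given below. The standard error of the pediatric log odds ratio is not reported in the text, so the value used here is an assumption, and the resulting Type I error and power are only indicative of the pattern described above.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

se_new, alpha = 0.42, 0.025              # se_new is an assumed standard error
w0 = np.array([0.7, 0.3])                # robust mixture analysis prior on delta_new
m0 = np.array([0.48, 0.0])
s0 = np.array([0.121, 2.87])

def success(y):
    """Success rule: Pr(delta_new > 0 | y_new) >= 0.975 under the mixture prior."""
    post_var = 1 / (1 / s0**2 + 1 / se_new**2)
    post_mean = post_var * (m0 / s0**2 + y / se_new**2)
    w = w0 * norm.pdf(y, m0, np.sqrt(s0**2 + se_new**2))
    w = w / w.sum()
    return np.sum(w * norm.sf(0.0, post_mean, np.sqrt(post_var))) >= 1 - alpha

def prob_success(true_or, n_sim=50_000):
    """Probability of meeting the success criterion at a given true pediatric odds ratio."""
    y = rng.normal(np.log(true_or), se_new, n_sim)
    return np.mean([success(v) for v in y])

t1e = prob_success(1.0)      # classical Type I error of the borrowing design
power = prob_success(1.6)    # power at the alternative used in the text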

4.3 Average Type I Error

For the Bayesian design using each of the two analysis priors, Table 3 (top section) shows the average Type I error (6) evaluated under three alternative null design priors:

  1. Truncated adult design prior, chosen to be the normalized lower tail (truncated at δ ≤ 0) of the posterior from the pooled adult studies under an initial improper prior.

  2. Truncated robust design prior, chosen to be the normalized lower tail (truncated at δ ≤ 0) of the robust adult analysis prior.

  3. Point mass at δ=log(1), which is equivalent to the assumption used to calculate the classical Type I error discussed in the previous subsection.

Figure 5 (middle) shows these three null design priors superimposed on the PoS curves for each of the analysis priors. Recall that the average Type I error is calculated by integrating each of the PoS curves with respect to the relevant null design prior. From Figure 5 (middle), we see that the adult null design prior is highly concentrated near the null odds ratio of exp(δ) = 1, with negligible weight on values of the odds ratio below 0.9. As a consequence, the average Type I error under this design prior is very close to the classical Type I error values, represented as the average PoS under a point mass design prior on the odds ratio exp(δ) = 1. By contrast, the robust null design prior has a very heavy left tail, and is approximately uniform over log odds ratios in the range (−∞, log(1)). This results in much lower average Type I errors for both analysis models compared to those under the null adult design prior. Such a heavy tail may be viewed as unreasonable for the null design prior, and so other options could be considered, such as truncating the lower tail of the null design prior at a plausible value (Psioda and Ibrahim 2019). However, judgments about what this value should be may be hard to elicit.

4.4 Other Metrics

In the context of a pediatric bridging study, where there is scientific reason to expect the treatment effect in children to be similar to that demonstrated in adults, the prior probability of the null effect being true is expected to be low. Indeed, if there were a high prior probability of such a large drift between the true pediatric effect and the adult evidence, then a bridging strategy would not seem to be an option in the first place.

Using the Bayesian framework, the prior probability of treatment benefit in children can be formally quantified, and is one of the metrics identified by Pennello and Thompson (2007) and Travis, Rothman, and Thomson (2023) as being helpful for evaluating Bayesian designs in regulatory settings. For the pediatric lupus example, the prior probability of efficacy under the vague analysis prior is 50%, which represents a position of clinical equipoise, but could be considered overly pessimistic in a setting where there is already confirmatory evidence of efficacy in adults. Under the robust mixture analysis prior, this probability is 85%, which reflects the available adult evidence and known similarities and differences between adults and children, whilst still leaving open the non-negligible possibility (15%) that the drug is not effective in children. By contrast, if the adult prior were to be used directly as the analysis prior without any down-weighting, the prior probability of efficacy in children would be >99.9%. This prior probability is greater than the decision threshold (which requires at least 97.5% probability of efficacy) and so, if this analysis prior were considered justified, then an efficacy trial in children may not be considered necessary.
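These prior probabilities of efficacy follow directly from the analysis priors stated in Section 4.1; a short check (no simulation required) is:

from scipy.stats import norm

# Prior probability of efficacy Pr(delta_new > 0) under each analysis prior.
p_vague  = norm.sf(0, loc=0.0, scale=100)                                  # 0.50
p_robust = 0.7 * norm.sf(0, 0.48, 0.121) + 0.3 * norm.sf(0, 0.0, 2.87)     # approx. 0.85
p_adult  = norm.sf(0, 0.48, 0.121)                                         # > 0.999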

The above discussion about the prior probability of efficacy focuses on the prior information being used in the analysis of the pediatric trial. It is also helpful to consider the probability of efficacy (or conversely, the probability of a null or harmful treatment effect) from the perspective of the design prior. As proposed in Section 2.4, in settings where there is reliable and relevant evidence favoring a non-null treatment effect, a useful metric is the (pre-posterior) probability of actually drawing a false positive conclusion (8). The bottom part of Table 3 reports these two probabilities under three different design priors that are defined on the full range of support for δ (i.e., not truncated to the null/harmful region): the adult design prior, the robust design prior and a “spike & slab” prior chosen to be equal to the robust design prior but replacing the left tail for δ ≤ log(1) by a point mass at δ = log(1) with weight equal to the probability of δ ≤ log(1) under the robust design prior. Figure 5 (bottom) shows these three design priors superimposed on the PoS curves for values of δ ≤ δnull for each of the analysis priors. Comparison with the middle row of this figure illustrates the difference between the pre-posterior probability of drawing a false positive conclusion (bottom row) and the average Type I error (middle row). The latter averages the PoS curve over a design prior with support restricted to null or harmful values of δ, while the former averages the portion of the PoS curve associated with null or harmful values of δ with respect to a design prior with support on the full range of δ.

The pre-posterior probability of actually declaring a false positive result is negligible for both analysis models under the adult design prior, reflecting the fact that the probability of no treatment benefit (i.e., the null being true) under this design prior is only 0.004%. Even under the robust design prior, which has a 15% probability of the null being true, the chances of actually declaring a false positive result are less than 1% for both analysis priors. This probability is somewhat higher under the spike and slab design prior, but still below 5%, which may be considered a reasonable level of risk in challenging settings such as pediatrics.

These findings emphasize the limitations of relying solely on the average Type I error in designs involving information borrowing on the treatment contrast. We suggest considering the additional metrics outlined in Table 3 to gain a more comprehensive understanding of the design’s operating characteristics under various assumptions. Additionally, visual representations, as demonstrated in Figure 5, are recommended to enhance the overall understanding of the design’s behavior.

5 Discussion

Bayesian clinical trials that leverage historical, or more generally study-external, data have become increasingly popular over time, offering a new toolbox in drug development. Many fully Bayesian clinical studies that leveraged historical data have already been conducted, and some of them were published in top-tier clinical journals (Baeten et al. 2013; Böhm et al. 2020; Richeldi et al. 2022). However, the vast majority of these studies were either done in early development (phase I/II), post-approval (phase IV) or undertaken by academic groups. The few examples used for registration purposes were essentially for pediatric indications or in rare and ultra-rare diseases (Goring et al. 2019). One likely reason for this is that strict control of the classical (frequentist) Type I error, which is not possible when leveraging historical data, has proven to be a sticking point in discussions with regulators. However, just as the frequentist (long-run) interpretation of probability is not the only way to interpret probabilities, we argue in this article that the frequentist viewpoint on the Type I error is not the only one we should adopt. Instead, we suggest that Bayesian metrics should be used for Bayesian designs. While the concept of assurance (O’Hagan, Stevens, and Campbell 2005), or Bayesian average power (Chuang-Stein and Kirby 2017), has become instrumental nowadays in supporting decision-making, we argue that the average Type I error, which is its equivalent under the null hypothesis, is also a relevant metric to inform decision-makers about the risk of a false positive result associated with a Bayesian design. In designs where information is borrowed on the treatment contrast, we further recommend calculation of the probability of actually declaring a false positive result, to provide further insights on the design performance. Altogether, these metrics provide a comprehensive and reliable way to assess the properties of Bayesian studies, as illustrated in the two case studies that we presented.

An important feature of our work is that it relies on a willingness to adopt the Bayesian approach also for assessing the risk of, for example, declaring a treatment effective when in reality it is ineffective. Consequently, it is necessary to agree on analysis and design priors, with the latter being used to assess the operating characteristics of the trial design under various scenarios (“what if” situations). This is a somewhat elaborate task, in particular since many professionals may still be unfamiliar with it, given that assessing the classical Type I error rate has no comparable requirement (the null hypothesis is, in a sense, trivial). Thus, more upfront discussion and alignment may be needed before an agreement is reached. Furthermore, when borrowing on the treatment contrast, the choice of the null design prior to evaluate the unconditional Type I error can be particularly challenging. By its nature, borrowing on the treatment contrast usually implies that the treatment has some effect, that is, there is possibly substantial evidence against the null hypothesis. Therefore, we embraced ideas outlined in the work by Spiegelhalter and Friedman (1986) and Chuang-Stein and Kirby (2017) to calculate certain joint (or pre-posterior) probabilities that reflect the probability of the null hypothesis actually being true. While these joint probabilities also require specification of a design prior, this prior is not restricted to be consistent with the null hypothesis. To further simplify specification of the design prior in this setting, we also proposed using the upper bound on the pre-posterior probability of actually drawing a false positive conclusion. This just requires specification of the tail area probability of the null being true under the design prior (i.e., the shape of the null tail is irrelevant and so it can be thought of as a point mass at the null), which is then multiplied by the classical Type I error.

In their discussion of regulatory perspectives on informative Bayesian methods for pediatric efficacy trials, Travis, Rothman, and Thomson (2023) comment on the importance of taking into consideration the likelihood of the true parameter value being in a particular region when assessing the Type I error. They also note that an important factor for regulators is the ability to implement consistent standards across studies. We argue that use of Bayesian metrics involving explicit definition of a design prior can help to address both these requirements. The design prior provides a mechanism to pre-specify and make explicit what assumptions are being made about the chances of the true parameter values of interest (e.g., true control response; true treatment contrast) being in any particular region when evaluating the operating characteristics of a clinical trial design. The design prior would need to be agreed between sponsors and regulators on a case-by-case basis to reflect plausible assumptions about the disease and the treatment effects. However, conditional on an agreed design prior, regulators could then require control of appropriate Bayesian metrics, such as those we discuss here, at a consistent level across products. For example, designs that borrow prior information on the control arm could be required to control average Type I error at a prescribed level that is consistent for all products in a certain class or disease area or category of unmet need. For designs that borrow information on the treatment contrast, one approach could be to require the upper bound on the pre-posterior probability of actually making a Type I error to be consistent between different products. As already noted, agreeing a design prior in this case reduces to requiring sponsors and regulators to agree on the prior probability of the true treatment effect being null. The sponsor could then optimize the trial design balancing sample size against the amount of external information to borrow in order to maintain the upper bound on a false positive outcome at the agreed level. Such an approach could also address Travis, Rothman, and Thomson’s (2023) concern that using the same classical Type I error rate across products will force less borrowing for more effective products. Whilst a more effective product would, indeed, have a higher conditional Type I error if borrowing the same amount of prior information as a less effective product, we would expect the design prior for the more effective product to have a lower probability of the null being true. Hence, the pre-posterior probability of actually making a Type I error would not necessarily be higher for the more effective product.
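
A purely hypothetical calculation illustrates the last point; the conditional Type I errors and null probabilities below are invented for illustration only.

```python
# Hypothetical illustration: a more effective product (B) with a higher
# conditional Type I error can still have a lower upper bound on the
# pre-posterior probability of a false positive, because its design prior
# places less mass on the null. Numbers are invented for illustration.
alpha_cond = {"A": 0.05, "B": 0.10}   # conditional (classical) Type I error
p_null     = {"A": 0.40, "B": 0.10}   # design-prior probability that the null is true

for product in ("A", "B"):
    bound = alpha_cond[product] * p_null[product]
    print(f"Product {product}: upper bound on false positive probability = {bound:.3f}")
# Product A: 0.020, Product B: 0.010
```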

It is worth noting that more than one design prior may be specified if desired, in order to reflect a range of stakeholder opinions. Thus, we could envisage a situation where regulators agree on two different design priors, representing, say, optimistic and skeptical judgments about the likelihood of the true parameters being in certain regions. Consistent thresholds for relevant Bayesian metrics could be set for each type of design prior, with a more stringent level of control being required under the optimistic design prior.

One reason for differences between the historical prior and the new trial data that could lead to an increased risk of erroneous conclusions in a Bayesian borrowing design is systematic variation in baseline prognostic factors. If these factors are measured in both the historical and the new data, this source of drift can be addressed through statistical modeling such as regression or propensity score weighting (Banbeta, Lesaffre, and van Rosmalen 2022; Fu et al. 2023). All the metrics presented in this paper extend naturally to this situation. The target parameter for information borrowing in the new study (i.e., θc or δ) is then the marginal covariate-adjusted control treatment effect or treatment contrast, respectively. Since the exact distribution of covariates in the new study is usually unknown at the design stage, an explicit assumption must be made about the expected covariate distribution. The metrics in Section 2 then require specification of a design prior describing assumptions about the marginal covariate-adjusted treatment effect or treatment contrast in the new study under the specified covariate distribution.
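
As an illustration of the kind of standardization involved, the following hypothetical sketch fits an outcome model to historical controls and averages its predictions over an assumed covariate distribution for the new study to obtain a marginal covariate-adjusted control rate; the model, covariate, and assumed distributions are our own illustrative assumptions, not the analyses of the case studies.

```python
# Minimal, hypothetical sketch of a marginal covariate-adjusted control rate
# via standardization (G-computation); all numbers are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated historical control data: binary outcome depending on one prognostic covariate.
n_hist = 400
x_hist = rng.normal(0.0, 1.0, n_hist)
p_hist = 1 / (1 + np.exp(-(-1.0 + 0.8 * x_hist)))
y_hist = rng.binomial(1, p_hist)

# Outcome regression fitted to the historical controls.
X = sm.add_constant(x_hist)
fit = sm.GLM(y_hist, X, family=sm.families.Binomial()).fit()

# Assumed covariate distribution in the new study (e.g. a shift in prognosis).
x_new = rng.normal(0.5, 1.0, 100_000)
p_pred = fit.predict(sm.add_constant(x_new))

# Marginal covariate-adjusted control rate under the assumed new-study population;
# a design prior for theta_c would then be specified around values such as this.
print("Marginal adjusted control rate:", p_pred.mean())
```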

The evaluation of designs that leverage historical data may look quite involved. However, it is worth noting that many designs—whether they leverage historical data or not—are complex enough to require simulations to ‘stress-test’ assumptions and understand their operating characteristics. Therefore, it is probably fair to assume that the technical skills required to perform the evaluations of Bayesian designs proposed in this work should no longer present a barrier. The more challenging task might be to familiarize stakeholders with metrics that go beyond the (well-known) classical Type I error (European Special Interest Group on Historical Data 2022). Explaining which risks the metrics help to assess, and how they complement existing metrics, may thus be an important part of presenting the evaluation of a design that uses historical data. In addition, it is often very useful to discuss so-called data scenarios with key stakeholders, that is, hypothetical data and accompanying results as they could actually occur in the study. This often gives stakeholders a very concrete impression of what could happen, as opposed to the average statements obtained from metrics.
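
The following minimal sketch illustrates what such data scenarios could look like for a binary endpoint with an informative control prior; the prior, sample sizes, and hypothetical outcomes are invented for illustration and do not correspond to the case studies.

```python
# Illustrative "data scenarios": hypothetical new-trial outcomes and the
# resulting posterior summaries under an informative control prior.
import numpy as np

a_c, b_c = 30, 70          # informative control prior, as if from historical data
n_c, n_t = 50, 100         # new-trial sample sizes

scenarios = {
    "consistent with prior":  (15, 45),   # (control responders, treatment responders)
    "control drift upwards":  (25, 45),
    "strong treatment effect": (15, 60),
}

rng = np.random.default_rng(7)
for name, (y_c, y_t) in scenarios.items():
    p_c = rng.beta(a_c + y_c, b_c + n_c - y_c, 10_000)   # posterior control rate
    p_t = rng.beta(1 + y_t, 1 + n_t - y_t, 10_000)       # posterior treatment rate
    prob = np.mean(p_t > p_c)
    print(f"{name}: P(p_t > p_c | data) = {prob:.3f}")
```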

Finally, it might also be useful to remind ourselves that the fundamental questions around studies that leverage historical data have not changed over time. In his seminal work Designing for nonparametric Bayesian survival analysis using historical controls, Van Ryzin (1980) nicely summarized that an elementary bias-variance tradeoff should ultimately guide whether—and to what extent—historical data are to be used. The strong focus on classical (frequentist) Type I error control for pivotal studies, however, has shifted much of the consideration to the bias question alone. It is thus important to reconsider which metrics should be used for evaluating designs that—by construction—aim at optimizing the bias-variance tradeoff, such as Bayesian designs using historical data, or adaptive designs with treatment or population selection at interim (Robertson et al. 2023). Similarly, it is paramount to better understand how existing metrics could be used in a more principled way (Grieve 2016; Walley and Grieve 2021). The current situation is in some sense similar to what happened when Stein (1956) and James and Stein (1961) showed that the “usual” least-squares estimator for the mean is inadmissible in dimensions three or higher. While the James-Stein estimator is biased, it outperforms the least-squares estimator with respect to mean squared error. Thus, a more holistic viewpoint was required, from a metrics perspective, to judge which estimator to prefer. Similarly, we conclude that a more holistic viewpoint is required to judge which study design best serves the purpose of a registration trial.
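
For readers who wish to verify this numerically, the short simulation below (illustrative settings only, using the positive-part variant of the James-Stein estimator) reproduces the phenomenon.

```python
# Quick numerical check of the James-Stein phenomenon: in dimension p >= 3,
# shrinking the usual estimator of a multivariate normal mean reduces mean
# squared error despite introducing bias. Settings are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
p, n_sim = 10, 20_000
theta = rng.normal(0, 1, p)                   # arbitrary true mean vector

x = rng.normal(theta, 1.0, size=(n_sim, p))   # one observation per replicate, unit variance
shrink = np.maximum(0.0, 1 - (p - 2) / np.sum(x**2, axis=1, keepdims=True))
js = shrink * x                               # positive-part James-Stein estimator

mse_ls = np.mean(np.sum((x - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(f"MSE least squares: {mse_ls:.2f}  (theory: {p})")
print(f"MSE James-Stein:   {mse_js:.2f}")
```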

Disclosure Statement

Nicky Best is an employee of GSK and holds shares of GSK. Maxine Ajimi is an employee of AstraZeneca and holds shares of AstraZeneca. Beat Neuenschwander is an employee of Novartis and holds shares of Novartis and Sandoz. Gaëlle Saint-Hilary is president and sole associate of Saryga. Simon Wandel is an employee of Novartis and holds shares of Novartis, Alcon and Sandoz.

Funding

The author(s) reported there is no funding associated with the work featured in this article.

References

  • Baeten, D., Baraliakos, X., Braun, J., Sieper, J., Emery, P., van der Heijde, D., et al. (2013), “Anti-interleukin-17a Monoclonal Antibody Secukinumab in Treatment of Ankylosing Spondylitis: A Randomised, Double-Blind, Placebo-Controlled Trial,” Lancet, 382, 1705–1713. DOI: 10.1016/S0140-6736(13)61134-4.
  • Banbeta, A., Lesaffre, E., and van Rosmalen, J. (2022), “The Power Prior with Multiple Historical Controls for the Linear Regression Model,” Pharmaceutical Statistics, 21, 418–438. DOI: 10.1002/pst.2178.
  • Best, W., Becktel, J., Singleton, J., Kern Jr., F. (1976), “Development of a Crohn’s Disease Activity Index. National Cooperative Crohn’s Disease Study,” Gastroenterology, 70, 439–444. DOI: 10.1016/S0016-5085(76)80163-1.
  • Böhm, M., Kario, K., Kandzari, D., Mahfoud, F., Weber, M. A., Schmieder, R. E., et al. (2020), “Efficacy of Catheter-based Renal Denervation in the Absence of Antihypertensive Medications (spyral htn-off med pivotal): A Multicentre, Randomised, Sham-Controlled Trial,” Lancet, 395, 1444–1451. DOI: 10.1016/S0140-6736(20)30554-7.
  • Chuang-Stein, C., and Kirby, S. (2017), Quantitative Decisions in Drug Development, Cham, Switzerland: Springer.
  • European Special Interest Group on Historical Data. (2022), “A Framework for Evaluation of Bayesian Dynamic Borrowing Designs in Pivotal Studies,” PSI conference. Available at https://www.psiweb.org/sigs-special-interest-groups/historical-data.
  • Food and Drug Administration. (1998), “Providing Clinical Evidence of Effectiveness for Human Drugs and Biological Products (Guidance for Industry),” available at https://www.fda.gov/media/71655/download.
  • ——– (2018), “BLA 125370/s-064 and BLA 761043/s-007 Multi-disciplinary Review and Evaluation,” available at https://www.fda.gov/media/127912/download.
  • ——– (2019a), “Demonstrating Substantial Evidence of Effectiveness for Human Drug and Biological Products (Guidance for Industry),” available at https://www.fda.gov/media/133660/download.
  • ——– (2019b), “Adaptive Designs for Clinical Trials of Drugs and Biologics (Guidance for Industry),” available at https://www.fda.gov/media/78945/download.
  • ——– (2020), “Interacting with the FDA on Complex Innovative Trial Designs for Drugs and Biological Products (Guidance for Industry),” available at https://www.fda.gov/media/130897/download.
  • Fu, C., Pang, H., Zhou, S., and Zhu, J. (2023), “Covariate Handling Approaches in Combination with Dynamic Borrowing for Hybrid Control Studies,” Pharmaceutical Statistics, 22, 619–632. DOI: 10.1002/pst.2297.
  • Goring, S., Taylor, A., Müller, K., Li, T. J. J., Korol, E. E., Levy, A. R., and Freemantle, N. (2019), “Characteristics of Non-Randomised Studies Using Comparisons with External Controls Submitted for Regulatory Approval in the US and Europe: A Systematic Review,” BMJ Open, 9, e024895. DOI: 10.1136/bmjopen-2018-024895.
  • Grieve, A. (2016), “Idle Thoughts of a ‘Well-Calibrated’ Bayesian in Clinical Drug Development,” Pharmaceutical Statistics, 15, 96–108. DOI: 10.1002/pst.1736.
  • Hueber, W., Sands, B. E., Lewitzky, S. (2012), “Secukinumab, A Human Anti-IL-17A Monoclonal Antibody, for Moderate to Severe Crohn’s Disease: Unexpected Results of a Randomised, Double-Blind Placebo-Controlled Trial,” Gut, 61, 1693–1700. DOI: 10.1136/gutjnl-2011-301668.
  • James, W., and Stein, C. (1961), “Estimation with Quadratic Loss,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1), pp. 361–379.
  • Kopp-Schneider, A., Calderazzo, S., and Wiesenfarth, M. (2020), “Power Gains by Using External Information in Clinical Trials are Typically Not Possible When Requiring Strict Type I Error Control,” Biometrical Journal, 62, 361–374. DOI: 10.1002/bimj.201800395.
  • Lim, J., Wang, L., Best, N., et al. (2019), “Minimizing Patient Burden through the Use of Historical Subject-Level Data in Innovative Confirmatory Clinical Trials: Review of Methods and Opportunities,” Therapeutic Innovation & Regulatory Science, 52, 546–559. DOI: 10.1177/2168479018778282.
  • Moore, T., Zhang, H., Anderson, G., et al. (2018), “Estimated Costs of Pivotal Trials for Novel Therapeutic Agents Approved by the US Food and Drug Administration, 2015–2016,” JAMA Internal Medicine, 178, 1451–1457. DOI: 10.1001/jamainternmed.2018.3931.
  • Neuenschwander, B., Capkun-Niggli, G., Branson, M., Spiegelhalter, D. J. (2010), “Summarizing Historical Information on Controls in Clinical Trials,” Clinical Trials, 7, 5–18. DOI: 10.1177/1740774509356002.
  • O’Hagan, A., Stevens, J. W., and Campbell, M. J. (2005), “Assurance in Clinical Trial Design,” Pharmaceutical Statistics, 4, 187–201. DOI: 10.1002/pst.175.
  • Pennello, G., and Thompson, L. (2007), “Experience with Reviewing Bayesian Medical Device Trials,” Journal of Biopharmaceutical Statistics, 18, 81–115. DOI: 10.1080/10543400701668274.
  • Petrou, S. (2012), “Rationale and Methodology for Trial-based Economic Evaluation,” Clinical Investigation, 2, 1191–1200. DOI: 10.4155/cli.12.121.
  • Psioda, M., and Ibrahim, J. (2019), “Bayesian Clinical Trial Design Using Historical Data that Inform the Treatment Effect,” Biostatistics, 20, 400–415. DOI: 10.1093/biostatistics/kxy009.
  • Psioda, M., and Xue, X. (2020), “A Bayesian Adaptive Two-Stage Design for Pediatric Clinical Trials,” Journal of Biopharmaceutical Statistics, 30, 1091–1108. DOI: 10.1080/10543406.2020.1821704.
  • Richeldi, L., Azuma, A., Cottin, V., Hesslinger, C., Stowasser, S., Valenzuela, C., et al. (2022), “Trial of a Preferential Phosphodiesterase 4b Inhibitor for Idiopathic Pulmonary Fibrosis,” The New England Journal of Medicine, 386, 2178–2187. DOI: 10.1056/NEJMoa2201737.
  • Robertson, D., Choodari-Oskooei, B., Dimairo, M., Flight, L., Pallmann, P., and Jaki, T. (2023), “Point Estimation for Adaptive Trial Designs I: A Methodological Review,” Statistics in Medicine, 42, 122–145. DOI: 10.1002/sim.9605.
  • Roychoudhury, S., Scheuer, N., and Neuenschwander, B. (2018), “Beyond p-values: A Phase II Dual-Criterion Design with Statistical Significance and Clinical Relevance,” Clinical Trials, 15, 452–461. DOI: 10.1177/1740774518770661.
  • Schmidli, H., Gsteiger, S., Roychoudhury, S., O’Hagan, A., Spiegelhalter, D., and Neuenschwander, B. (2014), “Robust Meta-Analytic-Predictive Priors in Clinical Trials with Historical Control Information,” Biometrics, 70, 1023–1032. DOI: 10.1111/biom.12242.
  • Spiegelhalter, D., and Friedman, L. (1986), “A Predictive Approach to Selecting the Size of a Clinical Trial, based on Subjective Clinical Opinion,” Statistics in Medicine, 5, 1–13. DOI: 10.1002/sim.4780050103.
  • Spiegelhalter, D., Abrams, K., and Myles, J. (2004), Bayesian Approaches to Clinical Trials and Health-Care Evaluation, Chichester, UK: Wiley.
  • Stein, C. (1956), “Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution,” in Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1), pp. 197–206.
  • Travis, J., Rothman, M., and Thomson, A. (2023), “Perspectives on Informative Bayesian Methods in Pediatrics,” Journal of Biopharmaceutical Statistics, 15, 96–108. DOI: 10.1080/10543406.2023.2170405.
  • Van Ryzin, J. (1980), “Designing for Nonparametric Bayesian Survival Analysis Using Historical Controls,” Cancer Treatment Reports, 64, 503–506.
  • Viele, K., Berry, S., Neuenschwander, B., et al. (2014), “Use of Historical Control Data for Assessing Treatment Effects in Clinical Trials,” Pharmaceutical Statistics, 13, 41–54. DOI: 10.1002/pst.1589.
  • Viele, K., Mundy, L., Noble, R., Li, G., Broglio, K., and Wetherington, J. D. (2018), “Phase 3 Adaptive Trial Design Options in Treatment of Complicated Urinary Tract Infection,” Pharmaceutical Statistics, 17, 811–822. DOI: 10.1002/pst.1892.
  • Walley, R., and Grieve, A. (2021), “Optimising the Trade-Off between Type I and II Error Rates in the Bayesian Context,” Pharmaceutical Statistics, 20, 710–720. DOI: 10.1002/pst.2102.
  • Walley, R. J., Smith, C. L., Gale, J. D., and Woodward, P. (2015), “Advantages of a Wholly Bayesian Approach to Assessing Efficacy in Early Drug Development: A Case Study,” Pharmaceutical Statistics, 14, 205–215. DOI: 10.1002/pst.1675.
  • Wang, F., and Gelfand, A. (2002), “A Simulation-based Approach to Bayesian Sample Size Determination for Performance Under a Given Model and For Separating Models,” Statistical Science, 17, 193–208. DOI: 10.1214/ss/1030550861.
  • Weber, S., Neuenschwander, B., Schmidli, H., et al. (2020), RBesT: R Bayesian Evidence Synthesis Tools. R package version 1.6-1. Available at https://cran.r-project.org/web/packages/RBesT/index.html.