Research Article

Bayesian Optimal Designs for Multi-Arm Multi-Stage Phase II Randomized Clinical Trials with Multiple Endpoints

Received 31 Jan 2023, Accepted 10 Apr 2024, Published online: 17 May 2024

Abstract

There is a growing need to evaluate multiple competing drugs in phase II trials, where the number of patients is often limited and simultaneous assessment of both efficacy and toxicity is crucial. To avoid wasting research resources, it is more efficient to screen multiple drugs at once in a platform phase II setting. We aim to adapt the Bayesian optimal phase II (BOP2) design to multi-arm trials in both uncontrolled and controlled settings. The binary efficacy and toxicity endpoints are modeled by a Dirichlet distribution over a vector of four outcomes. Posterior marginal distributions at each analysis are used to derive a monitoring threshold that varies during the trial. We control the family-wise Type I error rate for multiple comparisons against a common reference value or a shared control. We conduct simulation studies under both uncontrolled and controlled settings to evaluate the operating characteristics of the proposed design. Our simulations demonstrate that the design exhibits better operating characteristics than a design using a constant threshold and is less sensitive to changes in accrual rate relative to what was planned. The design had promising operating characteristics and could be used in phase II oncology clinical trials for evaluating multiple drugs at a time.

1 Introduction

In oncology, phase II trials have long employed single-arm designs to assess the efficacy of new drugs. However, over the past few decades, the development of drugs in oncology, particularly immunotherapy agents, has been on the rise, creating a growing need for enhanced evaluation of their toxicity and efficacy. Evaluating multiple new treatments separately poses challenges in estimating the relative effect of each treatment, due to "treatment-trial" confounding (Estey and Thall 2003): trials for different drugs are often designed with distinct eligibility criteria, standards of care, and outcome measures, which can confound the estimation of the relative treatment effect. Moreover, this one-at-a-time evaluation of treatments in early phases appears to lack specificity in detecting effective drugs, as evidenced by the reported low success rates in phase II and III trials and in subsequent market approvals of cancer therapeutics (Sutter and Lamotta 2011).

With the aim of avoiding confounding and shortening the duration of drug development, platform trials that evaluate multiple drugs (either simultaneously or at different times) have been proposed (Yu, Hubbard-Lucey, and Tang 2019; Franklin et al. 2022). Platform trials extend the concept of randomized phase II trials that include a control group, first proposed in the early 1990s (Simon, Thall, and Ellenberg 1994). They offer the well-documented advantages of randomized trials over single-arm trials (Sharma, Stadler, and Ratain 2011; Wason, Stecher, and Mander 2014; Hobbs, Chen, and Lee 2018). One notable benefit is the ability to screen several candidate treatments allocated through randomization and identify the most promising ones. More specifically, multi-arm multi-stage (MAMS) designs (Hobbs, Chen, and Lee 2018) establish early decision rules based on sequential analyses of the effects of multiple treatments compared to the control arm. These designs aim to control the false positive rate for the entire trial, not just for each arm separately (Jaki, Pallmann, and Magirr 2019). Additionally, MAMS designs address both ethical and economic concerns by allowing the early termination of treatments showing evidence of futility and/or excessive toxicity, as well as the early graduation of promising treatments.

The literature proposing Bayesian approaches for adaptive platform trials is expanding (Berry 2006; Berger, Wang, and Shen 2014; Ryan et al. 2019). Among these approaches, the Bayesian Optimal design for Phase II clinical trials (BOP2) allows multiple endpoints to be evaluated sequentially, although it is restricted to the single-arm setting (Zhou, Lee, and Yuan 2017). In brief, the BOP2 design assesses a new treatment against reference values of $m$ binary outcome measures. The resulting $2^m$ distinct and mutually exclusive outcomes define a multinomial random variable. Inference on the model parameters is conducted within a Bayesian framework, using a Dirichlet prior chosen for its conjugacy with the multinomial distribution. Thanks to the aggregation property of the Dirichlet distribution, the marginal Beta distributions of each of the $m$ binary outcomes can be computed easily, allowing the derivation of stopping rules for futility or excessive toxicity (Zhou, Lee, and Yuan 2017).

We aimed to extend the BOP2 design to accommodate multiple treatment arms (or multiple doses of the same treatment), all evaluated concurrently for both efficacy and toxicity within the same trial. To illustrate this extension, we retrospectively applied the proposed design to the AZA-PLUS trial (NCT01342692), conducted in patients with high-risk myelodysplastic syndrome (MDS) (Ades et al. 2018).

The article is organized as follows: First, we present the models and design algorithms. Next, we conduct a simulation study to assess the performance of the designs. We then illustrate the proposed designs retrospectively using the AZA-PLUS trial (NCT01342692). Finally, we provide a discussion.

2 Motivating Example: AZA-PLUS Trial

In the treatment of adult patients with high-risk (defined by an intermediate-2 or high IPSS score) myelodysplastic syndrome, the AZA-PLUS trial (NCT01342692) aimed to evaluate whether the efficacy and toxicity of the standard-of-care azacitidine could be improved by adding a new drug (Ades et al. 2018). The trial was initially planned to assess the combination of azacitidine with lenalidomide or valproic acid, and later with idarubicin, using Jung's two-stage design (Jung 2008). It was designed with a Type I error rate of 0.15 and a Type II error rate of 0.20, and was scheduled to recruit 80 patients in each arm, with an interim analysis conducted after 40 patients per arm.

In total, 322 patients were enrolled from June 2011 to July 2017 across 37 participating centers. The treatment arms included 81 patients receiving azacitidine (AZA), 80 receiving AZA + valproic acid (AZA + VPA), 80 receiving AZA + lenalidomide (AZA + LEN), and 81 receiving AZA + idarubicin (AZA + IDA). Unfortunately, there was no evidence of any benefit from any of the combinations.

Our focus was on the three-arm randomized comparison of the control (AZA, n = 81) against both AZA + LEN (n = 80) and AZA + VPA (n = 80). We considered two binary outcome measures: the efficacy endpoint was the overall response rate (ORR), defined by the achievement of complete, partial, or medullary remission, and hematological improvement after 6 treatment cycles (a 6-month period); the toxicity endpoint was defined as treatment discontinuation due to any reason other than progression or relapse. Based on the terminal analysis of the 241 enrolled patients, the pooled ORR was estimated at 41.1%, with arm-specific estimates of 42.0%, 41.2%, and 40.0% in the AZA, AZA + VPA, and AZA + LEN arms, respectively. The toxicity rates ranged from 59.3% in the AZA arm to 65.0% in the AZA + VPA arm and 67.5% in the AZA + LEN arm. To assess whether the generalized BOP2 design for multi-arm multi-stage trials could have allowed the trial to be interrupted earlier, we retrospectively applied the proposed design to the trial data.

3 Methods

We extended the BOP2 design to a multi-arm multi-stage trial, where patients are randomized to $K$ experimental arms in an uncontrolled setting, or to $K$ experimental arms plus a control arm in a controlled setting. For simplicity, we assumed balanced randomization across the investigational arms. We considered only two binary outcomes, $(Y_T, Y_E)$, where $Y_T=1$ indicates toxicity (0 otherwise) and $Y_E=1$ indicates efficacy (0 otherwise). It is worth noting that the design can be readily extended to handle unequal randomization, adaptive randomization schemes, or more than two endpoints.

Let $Y_k=(Y_{k,T},Y_{k,E})$ represent the co-primary toxicity and efficacy endpoints observed in arm $k=0,\dots,K$ (with $k=0$ denoting the control arm in a controlled setting). Thus, $Y_k$ follows a multinomial distribution with probability vector $\theta_k=(\theta_{k,TE},\theta_{k,\bar{T}E},\theta_{k,T\bar{E}},\theta_{k,\bar{T}\bar{E}})$. These probabilities correspond to the four possible events: $TE$ for efficacy with toxicity, $\bar{T}E$ for efficacy without toxicity, $T\bar{E}$ for toxicity without efficacy, and $\bar{T}\bar{E}$ for neither toxicity nor efficacy. The prior on these parameters was a Dirichlet$(\pi_{0,TE},\pi_{0,\bar{T}E},\pi_{0,T\bar{E}},\pi_{0,\bar{T}\bar{E}})$, with $\pi_{0,\cdot}$ denoting the probabilities under the inefficacy/toxicity hypothesis. We set $\pi_{0,TE}+\pi_{0,\bar{T}E}+\pi_{0,T\bar{E}}+\pi_{0,\bar{T}\bar{E}}=1$ so that the prior effective sample size is 1. Due to the aggregation property of the Dirichlet distribution, the marginal prior distributions of the efficacy and toxicity outcomes in arm $k$, $p_{k,E}$ and $p_{k,T}$, are easily derived as Beta$(\pi_{0,TE}+\pi_{0,\bar{T}E},\,\pi_{0,T\bar{E}}+\pi_{0,\bar{T}\bar{E}})$ and Beta$(\pi_{0,TE}+\pi_{0,T\bar{E}},\,\pi_{0,\bar{T}E}+\pi_{0,\bar{T}\bar{E}})$, respectively. By conjugacy, the posterior distribution of $\theta_k$ also follows a Dirichlet distribution. Therefore, the posterior distributions of $p_{k,E}$ and $p_{k,T}$ follow Beta distributions, easily obtained from the numbers of efficacy and toxicity outcomes, $x_{k,E}$ and $x_{k,T}$, among the $n_k$ enrolled patients in arm $k$. More specifically, letting $D_{n,k}=\{x_{k,E},x_{k,T},n_k\}$ denote the observed data in arm $k$, we have $p_{k,E}\mid D_{n,k}\sim\mathrm{Beta}(\pi_{0,TE}+\pi_{0,\bar{T}E}+x_{k,E},\,\pi_{0,T\bar{E}}+\pi_{0,\bar{T}\bar{E}}+n_k-x_{k,E})$ and $p_{k,T}\mid D_{n,k}\sim\mathrm{Beta}(\pi_{0,TE}+\pi_{0,T\bar{E}}+x_{k,T},\,\pi_{0,\bar{T}E}+\pi_{0,\bar{T}\bar{E}}+n_k-x_{k,T})$.
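As a quick sketch of these posterior updates (the helper below and its names are ours, not the authors' code; it simply adds the observed counts to the aggregated prior cells):

```python
from scipy.stats import beta

def posterior_marginals(pi0, x_E, x_T, n):
    """Posterior Beta marginals of efficacy and toxicity in one arm,
    given a Dirichlet prior pi0 = (pi_TE, pi_TbarE, pi_TEbar, pi_TbarEbar)."""
    pi_TE, pi_TbarE, pi_TEbar, pi_TbarEbar = pi0
    # Efficacy aggregates the two cells containing E; toxicity the two containing T.
    post_E = beta(pi_TE + pi_TbarE + x_E, pi_TEbar + pi_TbarEbar + n - x_E)
    post_T = beta(pi_TE + pi_TEbar + x_T, pi_TbarE + pi_TbarEbar + n - x_T)
    return post_E, post_T

# Example: the skeptical prior of the uncontrolled simulation study,
# with 7 responses and 4 toxicities among 15 patients.
post_E, post_T = posterior_marginals((0.15, 0.30, 0.15, 0.40), 7, 4, 15)
print(post_E.mean(), post_T.mean())  # posterior mean efficacy and toxicity rates
```

Because the prior cell probabilities sum to 1, each Beta posterior has total mass $n_k + 1$, making the prior's influence negligible once a handful of patients are observed.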

Similar to the original BOP2 design, we considered $I$ interim analyses, defining early futility and toxicity stopping rules. The maximum sample size for each arm was set at $N_k=\sum_{i=1}^{I} n_{i,k}$ patients, where $n_{i,k}$ denotes the number of patients enrolled in arm $k$ during the $i$th stage, $i=1,\dots,I$. We first derived the generalized BOP2 design in the uncontrolled setting, where each treatment was compared with prespecified historical efficacy and toxicity rates. Subsequently, we adapted the BOP2 design to the controlled setting, in which each treatment was compared to a common control.

3.1 Uncontrolled Setting

Define the prespecified target efficacy and toxicity rates as $\phi_E$ and $\phi_T$, respectively. These values are elicited from expert opinion and historical data. The hypotheses for each treatment arm $k$ are as follows:

Futility or Toxicity$_k$: $p_{k,E}\le\phi_E$ or $p_{k,T}>\phi_T$

Efficacy and Non-toxicity$_k$: $p_{k,E}>\phi_E$ and $p_{k,T}\le\phi_T$.

At each interim analysis, conducted once $n_k$ patients have been treated and assessed for both efficacy and toxicity endpoints, the stopping decision for futility and/or toxicity is made using the following rules: stop if $P(p_{k,E}\le\phi_E\mid D_{n,k})>C_n$ (futility) or $P(p_{k,T}>\phi_T\mid D_{n,k})>C_n$ (toxicity), where $P(\cdot\mid D_{n,k})$ is derived from the posterior Beta distribution of the marginal probability $p_{k,E}$ or $p_{k,T}$, and $0<C_n<1$ is the decision threshold (defined in Section 3.3).
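Assuming posterior Beta marginals of the form above, the interim decision can be sketched as follows (a hypothetical helper with illustrative targets, not the authors' code):

```python
from scipy.stats import beta

def stop_uncontrolled(x_E, x_T, n, pi0, phi_E, phi_T, C_n):
    """Return True if the arm should stop for futility and/or toxicity."""
    pi_TE, pi_TbarE, pi_TEbar, pi_TbarEbar = pi0
    # P(p_E <= phi_E | data): posterior Beta cdf at the efficacy target
    p_futile = beta.cdf(phi_E, pi_TE + pi_TbarE + x_E,
                        pi_TEbar + pi_TbarEbar + n - x_E)
    # P(p_T > phi_T | data): posterior Beta survival function at the toxicity target
    p_toxic = beta.sf(phi_T, pi_TE + pi_TEbar + x_T,
                      pi_TbarE + pi_TbarEbar + n - x_T)
    return bool(p_futile > C_n or p_toxic > C_n)
```

For instance, with the targets $\phi_E=0.45$ and $\phi_T=0.30$ used in the uncontrolled simulations, 2 responses and 9 toxicities among 15 patients trigger stopping at a cutoff of 0.9, while 10 responses and 1 toxicity do not.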

3.2 Controlled Setting

In the setting of a controlled trial, which includes a common control arm ($k=0$), the hypotheses for each experimental arm $k>0$ are as follows:

Futility or Toxicity$^c_k$: $p_{k,E}\le p_{0,E}$ or $p_{k,T}>p_{0,T}$

Efficacy and Non-toxicity$^c_k$: $p_{k,E}>p_{0,E}$ and $p_{k,T}\le p_{0,T}$.

The two stopping rules to halt arm $k$ ($k>0$) for futility and/or toxicity are: stop if $P(p_{k,E}\le p_{0,E}\mid D_n)>C_n$ or $P(p_{k,T}>p_{0,T}\mid D_n)>C_n$, where $D_n=\{x_{k,E},x_{0,E},x_{k,T},x_{0,T},n_k,n_0\}$ is the observed data at the interim analysis for arm $k$, $k=0,\dots,K$, and $C_n$ is the common decision threshold defined in Section 3.3.

Computations of the above two probabilities are carried out through integration, as follows (Jacob et al. 2016; Hobbs, Chen, and Lee 2018): $$P(p_{k,\cdot}>p_{0,\cdot}\mid D_n)=\int_0^1 \bigl(1-F(p\mid D_{n,k})\bigr)\,f(p\mid D_{n,0})\,dp,$$ where $F(\cdot)$ is the Beta cumulative distribution function and $f(\cdot)$ is the Beta density function.
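This one-dimensional integral can be evaluated numerically; a minimal sketch, assuming $(a_k, b_k)$ and $(a_0, b_0)$ are the posterior Beta parameters of the experimental and control arms (the function name is ours):

```python
from scipy.integrate import quad
from scipy.stats import beta

def prob_greater(a_k, b_k, a0, b0):
    """P(p_k > p_0) where p_k ~ Beta(a_k, b_k) and p_0 ~ Beta(a0, b0)."""
    # Integrand: (1 - F(p | arm k)) * f(p | control arm)
    integrand = lambda p: beta.sf(p, a_k, b_k) * beta.pdf(p, a0, b0)
    value, _abserr = quad(integrand, 0.0, 1.0)
    return value
```

By symmetry, identical posteriors give a probability of 0.5, e.g. `prob_greater(5, 5, 5, 5)`, which provides a simple sanity check of the quadrature.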

3.3 Choice of the Decision Threshold

When planning the multi-arm trial, the hypotheses regarding the multinomial distribution of $Y_{k,\cdot}$ for the efficacy and toxicity of the new treatment, as well as the maximum acceptable Family-Wise Error Rate (FWER), are first elicited with the clinicians. Following the standard BOP2 design by Zhou, Lee, and Yuan (2017), the null hypothesis $H_0$ corresponds to an inadmissible treatment (ineffective and overly toxic), while the alternative hypothesis $H_1$ corresponds to a promising treatment (effective and not excessively toxic). Additional discussion of the choice of the null hypothesis for testing co-primary toxicity and efficacy endpoints is provided at the end of this section.

Unlike a single-arm trial, we must address the presence of $K$ distinct experimental arms when optimizing the decision threshold. In this context, the FWER is defined as the probability of claiming an inadmissible arm promising, that is, concluding efficacy and acceptable toxicity for at least one experimental arm when all arms are inefficacious and toxic (the global null hypothesis, where each arm is at $H_0$). The (least) power is defined as the probability of concluding efficacy and acceptable toxicity for the efficacious and safe arm when that arm is the only efficacious and safe arm, and all others are inefficacious and toxic. This scenario is also referred to as the Least Favourable Configuration (LFC), where only one arm is at $H_1$ and all others are at $H_0$ (Thall, Simon, and Ellenberg 1989; Jaki, Pallmann, and Magirr 2019).

To define the decision boundaries, we first considered the same threshold function as the original BOP2 design (Zhou, Lee, and Yuan 2017), $C_n^s=1-\lambda\,(n/N)^{\gamma}$, where $\lambda$ ($0<\lambda<1$) and $\gamma$ are positive design parameters. When there is only one experimental arm, the values of $\lambda$ and $\gamma$ can be optimized by grid search to control the false positive rate and maximize power, in both the uncontrolled and controlled settings (Zhou, Lee, and Yuan 2017; Zhao et al. 2023). We further constrained $\gamma\ge 1$, ensuring a convex shape for the decision boundary. The shape of the threshold function reflects the principle that early stopping based on sparse data should be avoided, with the rules becoming less stringent as more data accumulate over the trial (Jennison and Turnbull 1999).
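For illustration, the boundary shape can be computed directly (the $\lambda$ and $\gamma$ values below are arbitrary placeholders, not the optimized ones):

```python
def threshold(n, N, lam=0.6, gam=1.0):
    """BOP2-type cutoff C_n = 1 - lam * (n / N) ** gam."""
    return 1.0 - lam * (n / N) ** gam

# With N = 60 and looks at 15, 30, 45, and 60 patients, the cutoff decreases
# from 0.85 down to 0.40, so stopping on sparse early data requires
# overwhelming posterior evidence.
print([threshold(n, 60) for n in (15, 30, 45, 60)])
```

Since an arm is stopped when a posterior probability exceeds $C_n$, a high early cutoff protects against premature termination, while the lower final cutoff makes the terminal go/no-go decision achievable.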

We then defined a threshold specifically designed for the multi-arm trial, denoted hereafter $C_n^m$. We employed the same threshold function, but the design parameters $(\lambda,\gamma)$ defining $C_n^m$ were optimized through simulation, via a grid search that accounts for the multiple experimental arms. Among all pairs $(\lambda,\gamma)$ satisfying the prespecified FWER constraint (Magirr, Jaki, and Whitehead 2012; Bratton et al. 2016; Jaki, Pallmann, and Magirr 2019) under the global null hypothesis, we selected the one that maximizes power under the LFC. The grid search can be performed by simulation; see Section 2.3 of Zhou, Lee, and Yuan (2017) for details. It is worth noting that, because the cutoff function was optimized under the global null and depends only on the sample size, this approach also maintained the arm-specific Type I error rate.

Alternatively, another decision threshold, $C_n^{m,a}$, has been proposed; it depends on the number $a$ of ongoing active experimental arms (those still open to patient accrual) at the time of the interim analysis. It is defined as $C_n^{m,a}=1-\lambda^{1/\eta}\,(n/N)^{\gamma}$, where $\eta=K+1-a$. This correction leads to a less stringent threshold, particularly when multiple arms are truly promising, as the threshold function is increasing in $a$. To optimize the parameters $\lambda$ and $\gamma$, mimicking the procedure used for $C_n^m$, we aimed to control the FWER under the global null and maximize power under the LFC. However, due to the construction of $C_n^{m,a}$, merely controlling the FWER is insufficient to maintain a desired arm-specific Type I error rate, especially when only one experimental arm is unpromising while the rest are promising (because many arms, including the unpromising one, can pass the monitoring criteria). Two approaches can address this issue. The first is to simultaneously control the FWER under the global null and the arm-specific Type I error rate under the scenario where only one arm is unpromising; however, the additional arm-specific constraint requires a larger parameter search within a well-defined region, making it time-consuming. The second approach is more straightforward: by taking the final threshold as the minimum of $C_n^{m,a}$ and the single-arm $C_n^s$ of Zhou, Lee, and Yuan (2017), it is trivial to show that the design based on $\min(C_n^{m,a},C_n^s)$ not only maintains the FWER but also controls the arm-specific Type I error rate. We chose the latter approach as it requires fewer computational resources.
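A sketch of the $\min(C_n^{m,a},C_n^s)$ rule. We write the active-arm coefficient as $\lambda^{1/\eta}$, which matches the stated behavior (a cutoff increasing in $a$, coinciding with the single-arm form when all arms are active); the exact functional form and the optimized parameter pairs should be taken from the design calibration, so everything below is an illustrative assumption:

```python
def threshold_single(n, N, lam, gam):
    """Single-arm BOP2 cutoff: 1 - lam * (n / N) ** gam."""
    return 1.0 - lam * (n / N) ** gam

def threshold_min(n, N, a, K, lam_ma, gam_ma, lam_s, gam_s):
    """min of the active-arm-dependent cutoff and the single-arm cutoff.
    (lam_ma, gam_ma) and (lam_s, gam_s) come from separate grid searches."""
    eta = K + 1 - a  # fewer active arms -> larger eta -> more stringent cutoff
    c_ma = 1.0 - lam_ma ** (1.0 / eta) * (n / N) ** gam_ma
    # Taking the minimum with the single-arm cutoff controls the
    # arm-specific Type I error rate as well as the FWER.
    return min(c_ma, threshold_single(n, N, lam_s, gam_s))
```

With all $K$ arms active ($a=K$, so $\eta=1$), the active-arm cutoff reduces to the single-arm form with its own $\lambda$; as arms drop out, the cutoff tightens.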

Lastly, due to the formulation of the null hypothesis $H_0$ (i.e., the treatment being ineffective and overly toxic), the optimal multi-arm design may not maintain tight control over the false positive rate when only one of the co-primary endpoints is not met, that is, when the treatment is safe but ineffective, or effective yet overly toxic. This issue is evident in scenarios 6 and 7 of Tables 1 and 2, where the arm-specific Type I error rate or the FWER may exceed the nominal level. To address this concern, a more rigorous calibration process involving two null hypotheses, $H_{01}$ and $H_{02}$, could be implemented: $H_{01}$ would consider the treatment safe but ineffective, whereas $H_{02}$ would view it as effective but overly toxic. The initial step of the calibration process outlined above could then be extended to identify all feasible $(\lambda,\gamma)$ pairs meeting the predefined FWER constraints under both $H_{01}$ and $H_{02}$ (as well as the arm-specific Type I error constraints for the $C_n^{m,a}$ approach). This method, while potentially less powerful, offers improved control over false positives.

Table 1 Operating characteristics for each arm under the 13 scenarios (Sc) for the uncontrolled designs: Family-Wise Error Rate (FWER), percent of conclusion of efficacy and no toxicity (ENT), percent of early stopping (ES), and mean sample size (SS).

Table 2 Operating characteristics for each experimental (Exp.) arm under the 13 scenarios (Sc) for the controlled designs: Family-Wise Error Rate (FWER), percent of conclusion of efficacy and no toxicity (ENT), percent of early stopping (ES), and mean sample size (SS).

4 Simulation Study

4.1 Simulation Settings

We conducted various simulation studies to assess the operating characteristics of the proposed generalized BOP2 design in both uncontrolled and controlled settings. A total of K = 3 experimental treatment arms were considered, with a maximum sample size of n = 60 patients in each group. We assessed the uncontrolled (3 arms) and controlled (4 arms) settings separately, resulting in a maximum total number of included patients of N = 180 and N = 240, respectively. Three interim analyses were planned, one after every 15 additional patients enrolled in each arm, plus the terminal analysis. Of note, we also examined various sample sizes. As expected, with a smaller planned sample size (e.g., 30 patients per arm with interim analyses at 10 and 20), power decreased under fixed scenarios; desirable performance could be achieved with a larger planned effect size, in both uncontrolled and controlled settings (data not shown). All arms were assumed to be of equal size in the main simulation study. A sensitivity analysis based on different accrual rates among the arms was also conducted, as shown below.

A total of 13 different scenarios were constructed similarly for both uncontrolled and controlled settings (see Supplementary Table S1). These scenarios were derived from two real clinical trials: one evaluating the efficacy and safety of lenalidomide combined with rituximab for recurrent non-follicular lymphoma (Sacchi et al. 2016), in an uncontrolled setting, and one comparing TAS-102, a nucleoside analogue, with topotecan/amrubicin for refractory small cell lung cancer (Scagliotti et al. 2016), in a controlled setting. These motivating examples allowed us to derive realistic hypotheses for efficacy and toxicity targets. In both settings, scenario 1 corresponded to the global null hypothesis $H_0$, while scenario 2 represented the LFC. Scenarios 6, 7, 10, and 11 explored undesirable treatments, including scenarios with no efficacy and toxicity, efficacy and toxicity, and no efficacy and no toxicity. Scenarios 5 and 8 involved treatments with more efficacy or less toxicity than expected. Finally, scenarios 12 and 13 illustrated treatments with intermediate efficacy and toxicity.

We compared the performance of the proposed decision thresholds ($C_n^m$, $C_n^{m,a}$) with Thall, Simon, and Estey's approach (Thall, Simon, and Estey 1995), which is similar to that of Hobbs, Chen, and Lee (2018) and uses constant boundaries applied to a multi-arm trial, denoted hereafter $\epsilon^m$, and with the original single-arm BOP2 threshold $C_n^s$. All decision thresholds ($C_n^m$, $C_n^s$, $C_n^{m,a}$, and $\epsilon^m$) were computed as described in Section 3.3, using the following hypotheses: for the uncontrolled design, $H_0$: $\theta_k=(0.15,0.30,0.15,0.40)$ and $H_1$: $\theta_k=(0.18,0.42,0.02,0.38)$, with prespecified target toxicity and efficacy rates $\phi_T=0.30$ and $\phi_E=0.45$; and for the controlled design, $H_0$: $\theta_k=(0.30,0.30,0.10,0.30)$ and $H_1$: $\theta_k=(0.25,0.50,0.05,0.20)$. Decision thresholds were optimized to maximize power while controlling the FWER at 10%.

Fig. 1 Decision thresholds used during the trial at the interim and terminal analyses, either for futility (A, C) or (over-)toxicity (B, D), based on the uncontrolled design (plots A and B) or the controlled design (plots C and D). $C_n^m$ is the threshold with the same form as BOP2 applied in a multi-arm setting, $C_n^{m,a}$ is the multi-arm threshold depending on the number of remaining ongoing arms, and $\epsilon^m$ is the multi-arm constant threshold maintained throughout the trial.


The prior distributions of the designs were set to reflect the null hypotheses ($H_0$) described in Section 3, resulting in a so-called "skeptical" prior approach (Spiegelhalter, Abrams, and Myles 2004). Consistent with the two real trials used to define realistic scenarios, the prior was set to Dir$(0.15, 0.30, 0.15, 0.40)$ for the uncontrolled setting and to Dir$(0.30, 0.30, 0.10, 0.30)$ for the controlled design.

For each scenario, in both the uncontrolled and controlled settings, we conducted 10,000 independent repetitions of each trial, each with K = 3 experimental arms. We computed various performance metrics, including the percentage of selection for both efficacy and non-toxicity (globally and for each arm separately), the percentage of correct selection, the empirical FWER of the designs (under the null scenario 1), and the power (under the LFC of scenario 2). Additionally, we calculated the percentage of early stopping, defined as any stopping decision regarding arm k before the terminal analysis, and recorded the reason for stopping (toxicity or futility). The number of repetitions was determined to ensure a Monte Carlo standard error of 0.003 for a 0.1 Type I error rate and 0.004 for a power of 0.8, following previous work (Koehler, Brown, and Haneuse 2009; Morris, White, and Crowther 2019).
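The quoted Monte Carlo precisions follow directly from the binomial standard error of a simulated proportion:

```python
from math import sqrt

def mc_se(p, reps):
    """Binomial Monte Carlo standard error for an estimated proportion p."""
    return sqrt(p * (1.0 - p) / reps)

# 10,000 replications give the precisions quoted in the text:
# 0.003 at a Type I error rate of 0.1 and 0.004 at a power of 0.8.
print(mc_se(0.1, 10_000), mc_se(0.8, 10_000))
```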

Results for the uncontrolled setting are reported in Table 1, and results for the controlled setting are provided in Table 2.

We conducted additional analyses to assess the robustness of our results with respect to the total sample size, accrual rate, and imbalanced sample sizes across arms at interim analyses. First, to simulate lower accrual than expected, we reduced the maximum sample size $N$ in each arm, ranging from 20 to 60, while using the threshold optimized for 60 patients per arm; interim analyses were performed every $N/4$ patients in each arm. The thresholds were optimized based on the prespecified sample size of 60, rather than the actual sample size, to mimic a real-world scenario where the accrual rate is slower than anticipated. Second, to simulate fluctuations in the accrual rate, we performed the $j$th interim analysis once $60\times(j/4)^{\psi}$ patients had been enrolled per arm, with $\psi$ varying from 0.25 to 1.75. A value of $\psi=1$ represented the planned accrual, with 15 patients recruited in each arm between analyses; lower values of $\psi$ corresponded to a fast accrual rate at the beginning, while higher values indicated a slower accrual rate at the beginning. Lastly, to assess the impact of imbalances in the number of patients recruited at each interim analysis, we conducted interim analyses after the recruitment of $15+\{-u,\dots,0,\dots,u\}$ patients across arms, with $u$ ranging from 0 to 5. Here, $2u$ represents the maximum imbalance across arms.
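The perturbed accrual schedule can be sketched as follows (our own illustrative helper; rounding to whole patients is our assumption):

```python
def schedule(psi, N=60, looks=4):
    """Cumulative patients per arm at each analysis: N * (j / looks) ** psi."""
    return [round(N * (j / looks) ** psi) for j in range(1, looks + 1)]

# psi = 1 reproduces the planned schedule (15 patients per arm between looks);
# psi < 1 front-loads accrual, psi > 1 delays it.
print(schedule(1.0), schedule(0.25), schedule(1.75))
```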

4.2 Operating Characteristics

The operating characteristics of the different approaches in the uncontrolled setting are reported in Table 1. In scenario 1, the use of the three decision thresholds $C_n^{m,a}$, $C_n^m$, and $\epsilon^m$ allowed for the desired control of the FWER. The FWER of $C_n^{m,a}$ was closest to the prespecified level, with 9.52% of conclusions indicating a promising treatment, whereas the FWER was 8.53% for $C_n^m$ and 9.36% for $\epsilon^m$. However, the FWER increased to 22.87% for $C_n^s$, because it was optimized for a single-arm study only. When the number of arms with efficacy and no toxicity increased (scenarios 2 to 5), both $C_n^m$ and $C_n^{m,a}$ exhibited improved power compared to $\epsilon^m$: under the LFC, the empirical power was 72.43% for $C_n^m$ and 73.22% for $C_n^{m,a}$, versus 53.96% with $\epsilon^m$ (Table 1). Overall, $C_n^{m,a}$ outperformed the other thresholds when there was more than one truly promising arm (scenarios 3, 4, 5, 9, and 10). However, in scenario 8, where one arm was more effective than $H_1$ and two arms were at $H_0$, $C_n^m$ outperformed the other thresholds, including $C_n^{m,a}$: in the late stages, fewer arms may remain active because the majority of arms are unpromising, resulting in a more stringent threshold for $C_n^{m,a}$ at the final analysis. Scenarios involving arms with discordant profiles of efficacy and toxicity (scenarios 6, 7, 10, and 11) resulted in a non-negligible proportion of false positive conclusions. In scenarios 6 and 7, where no arm was truly promising, $\epsilon^m$ exhibited a lower arm-specific false positive rate than the other methods for arms with mismatched efficacy and toxicity. The FWERs for $C_n^{m,a}$, $C_n^m$, and $\epsilon^m$ were similar in these scenarios and were all lower than that of $C_n^s$ (52.73% in scenario 6 and 18.52% in scenario 7). In contrast, in scenarios 10 and 11, which featured both promising and non-promising arms, $C_n^{m,a}$ produced slightly higher false positive rates than both $C_n^m$ and $\epsilon^m$.

Lastly, in scenarios 12 and 13, characterized by intermediate probabilities of efficacy and toxicity, $C_n^{m,a}$ exhibited a higher rate of false positives than $C_n^m$, while $\epsilon^m$ had the lowest proportion of false positives. In general, the higher false positive rate of $C_n^{m,a}$ was due to its less stringent threshold; its arm-specific Type I error rate nevertheless remained under 10%.

The operating characteristics of the different approaches in the controlled setting are displayed in Table 2. Similar to the uncontrolled setting, the empirical FWER was controlled at 8.75% for $C_n^m$, 9.46% for $C_n^{m,a}$, and 9.62% for $\epsilon^m$; however, it was not controlled for $C_n^s$ (21.09%). In scenario 2 there was a reduction in power when using the constant $\epsilon^m$ compared to $C_n^m$ (under the LFC, power was 55.52% for $C_n^m$, 50.01% for $C_n^{m,a}$, and 43.73% for $\epsilon^m$). Similar results were observed for scenarios 3, 4, 5, 8, and 9. $C_n^{m,a}$ outperformed $C_n^m$ when all three experimental arms were efficacious and not toxic (approximately 66% in each arm in scenario 4, compared to around 55% for $C_n^m$). Conversely, with only one truly promising arm, $C_n^m$ exhibited greater power (55.52% vs. 50.01% in scenario 2, arm C). As in the uncontrolled setting, scenarios with discordant arms resulted in an increased proportion of false positives (scenarios 6, 7, 10, and 11). Lastly, in scenarios 12 and 13, characterized by intermediate probabilities of efficacy and toxicity, $C_n^{m,a}$ exhibited a higher rate of false positives than $C_n^m$ due to its less stringent threshold, although the FWER still remained controlled at the desired level.

In conclusion, $\epsilon^m$ led to the highest proportion of early stopping, while $C_n^{m,a}$ resulted in the lowest.

When the sample size was smaller than planned, the estimated FWER remained close to the prespecified level, reaching no more than 12.32% even with only one third of the total expected sample size (20 patients per arm instead of 60; see Supplementary Table S2). Figure 2 depicts the proportion of conclusions of efficacy and acceptable toxicity in each arm under scenario 13 for different sample sizes (see Supplementary Table S2 for complete results with lower sample sizes in each arm). Scenario 13 exemplified a situation covering the design hypotheses across the 3 experimental arms (arm A: $H_0$; arm B: $H_1$; arm C: intermediate efficacy and toxicity). $C_n^m$, $C_n^{m,a}$, and $\epsilon^m$ behaved similarly: the proportion of correct selection decreased with lower sample sizes while the empirical FWER remained around the specified level. The power and the percentage of correct selection were highest for $C_n^m$.

Fig. 2 Results for scenario 13 in a controlled setting, calibrated for a maximum sample size of 60 patients per arm. Panel A: percent of conclusions of efficacy and absence of toxicity in each arm according to the maximum number of enrolled patients. Panel B: proportion of correct selection relative to the accrual rate; the table below the plot gives the additional number of enrolled patients at each interim analysis.


When assessing values of $\psi$ ranging from 0.25 to 1.75, the $C_n^m$ and $C_n^{m,a}$ designs had stable operating characteristics across these different recruitment rates (see Figure 2). In contrast, $\epsilon^m$ was more sensitive to accrual rates that deviated from the planned rate. It is also worth noting that the percentage of correct selection using $C_n^{m,a}$ slightly increased with higher values of $\psi$: with larger $\psi$, the sample size at the interim analysis is smaller than expected, and with sparser data the probability of early termination of an arm decreases, allowing a less stringent cutoff for $C_n^{m,a}$. As a result, fewer arms were terminated incorrectly at the interim analyses, increasing the percentage of correct selection. Performing interim analyses at fixed time points across arms, instead of after a fixed number of patients, had limited impact on the proportion of conclusions of efficacy and acceptable toxicity, as well as on the proportion of correct selections in each arm. The empirical FWER slightly increased when an imbalance appeared across arms ($u=1,2$), but it remained stable regardless of the magnitude of the imbalance (see Supplementary Table S2).

5 The AZA-PLUS Trial

To illustrate the design, we retrospectively planned three interim analyses (at least 20, 40, and 60 patients in each arm) along with the terminal analysis (assuming a maximum of 80 patients in each arm), using the AZA-PLUS trial as an example. We used the following multinomial cell probabilities θk to calibrate the design: (0.15, 0.25, 0.15, 0.45) under inefficacy/toxicity and (0.15, 0.40, 0.05, 0.40) under efficacy/non-toxicity. The inefficacy/toxicity hypothesis H0 was used as the prior for the endpoint distributions, and decision rules were based on comparisons with the control arm (Azacitidine). Consequently, the prior distribution was Dir(0.15, 0.25, 0.15, 0.45). The optimized parameters for the multi-arm Cnm were λ=0.63 and γ = 1. These parameters controlled the FWER under 15% as planned (actual = 14.84%) and corresponded to a simulated power of 73.78%.
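The monitoring logic can be sketched as follows. This is a minimal, uncontrolled single-arm illustration of a BOP2-style check, not the controlled comparison against Azacitidine used in the actual design: the cell ordering (efficacy & toxicity, efficacy & no toxicity, no efficacy & toxicity, no efficacy & no toxicity), the reference rates `phi_eff` and `phi_tox`, and the interim counts are all illustrative assumptions, while the cutoff C(n) = 1 − λ(n/N)^γ follows the BOP2 form with the calibrated λ = 0.63 and γ = 1. Because the marginals of a Dirichlet are Beta distributions, the posterior probabilities can be approximated with standard-library Beta sampling:

```python
import random

# Dirichlet prior from H0; assumed cell order (illustrative):
# (eff & tox, eff & no tox, no eff & tox, no eff & no tox)
ALPHA = (0.15, 0.25, 0.15, 0.45)

def beta_prob_le(a, b, cut, n_draws=20000, seed=1):
    """Monte Carlo estimate of Pr(Beta(a, b) <= cut)."""
    rng = random.Random(seed)
    return sum(rng.betavariate(a, b) <= cut for _ in range(n_draws)) / n_draws

def stop_arm(x, n, n_max, phi_eff=0.30, phi_tox=0.30, lam=0.63, gamma=1.0):
    """BOP2-style futility/toxicity stop: terminate the arm when the
    posterior probability of inefficacy or of excessive toxicity exceeds
    the time-varying cutoff C(n) = 1 - lam * (n / n_max) ** gamma."""
    post = [a + xi for a, xi in zip(ALPHA, x)]  # Dirichlet posterior
    a0 = sum(post)
    a_eff = post[0] + post[1]  # efficacy marginal is Beta(a_eff, a0 - a_eff)
    a_tox = post[0] + post[2]  # toxicity marginal is Beta(a_tox, a0 - a_tox)
    c_n = 1.0 - lam * (n / n_max) ** gamma
    p_futile = beta_prob_le(a_eff, a0 - a_eff, phi_eff)        # Pr(p_E <= phi_eff | data)
    p_toxic = 1.0 - beta_prob_le(a_tox, a0 - a_tox, phi_tox)   # Pr(p_T > phi_tox | data)
    return max(p_futile, p_toxic) > c_n
```

For example, at the first interim (n = 20 of 80), an arm with no efficacy responses would be stopped, whereas an arm with many responses and little toxicity would continue; at small n the cutoff C(n) is close to 1, so early stopping requires strong evidence.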

The interim analyses had slightly different sample sizes across arms due to randomization (). The retrospective analysis using the proposed design suggested stopping both the Azacitidine + Lenalidomide and Azacitidine + Valproic acid arms after the 2nd analysis. Compared to the initial trial, this would have reduced the actual sample size by 120 patients overall (40 fewer patients in each arm). Of note, given the trial population, the accrual rate was approximately 20 patients per year per arm. Assuming that both efficacy and toxicity criteria would require a follow-up of 6.5 months before the go/no-go decision could be made, the proposed design would have led to an approximately 2-year reduction in trial duration compared to the original design.
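The stated time saving follows from simple accrual arithmetic. As a back-of-envelope sketch, assuming arms accrue in parallel at roughly 20 patients per year per arm and that the go/no-go endpoints require 6.5 months of follow-up after the last enrolled patient (the function name is ours):

```python
ACCRUAL_RATE = 20     # patients per year per arm (approximate)
FOLLOW_UP = 6.5 / 12  # years of follow-up needed for the go/no-go endpoints

def arm_duration_years(n_enrolled):
    """Time from an arm's first enrollment to its final decision."""
    return n_enrolled / ACCRUAL_RATE + FOLLOW_UP

# Stopping at the 2nd interim analysis (40 patients) instead of running
# to the maximum of 80 patients shortens the arm by the accrual time of
# the 40 unenrolled patients; the follow-up term cancels.
saving = arm_duration_years(80) - arm_duration_years(40)
print(saving)  # about 2.0 years
```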

Fig. 3 AZA-PLUS trial with the multi-arm Cnm threshold: Posterior probabilities of efficacy and no toxicity, along with decision rules, at 3 interim analyses and the final analysis.

6 Discussion

We have proposed a Bayesian design to control the FWER in multi-arm multi-stage phase II clinical trials with a joint assessment of efficacy and toxicity. We adapted the BOP2 design (Zhou, Lee, and Yuan Citation2017) to this setting, using group-sequential decision boundaries that depend on the fraction of enrolled patients, either alone or in combination with the number of active arms at a given analysis. The proposed decision thresholds demonstrated good operating characteristics with increased power, as shown in single-arm trials, when compared to constant thresholds (Zhou, Lee, and Yuan Citation2017). This finding is consistent with the work of Jiang et al. (Citation2020), who found that boundaries varying over the course of the study outperformed constant thresholds in terms of power. Their approach, however, used separate constant thresholds for efficacy and toxicity, which may sometimes result in thresholds that are more stringent for one endpoint than the other; it might be beneficial to consider additional constraints on these constant thresholds.

Two functions were defined for the decision boundaries, with the one depending on the number of active arms at the time of interim analysis outperforming the one depending only on the overall fraction of included patients. Allowing less stringent boundaries while some unpromising arms remained active ensured greater power for truly promising arms. However, this came at the cost of a slightly increased false positive rate in trials that included truly promising arms alongside ineffective or toxic ones.

The design appeared robust to departures from the planned setting, including variations in maximum sample size, accrual rate, and balance across arms at each analysis. The design, however, encountered challenges when dealing with a drug exhibiting a discordant profile of efficacy and toxicity (i.e., efficacious but toxic, or inefficacious but safe). This could be attributed to the computation of the decision thresholds, which are optimized to control the FWER under the global null hypothesis, where all treatments are assumed to be inefficacious and toxic. Such situations may arise, for example, when evaluating treatment combinations that could lead to antagonistic interactions. Additionally, the method demonstrated robustness to the choice of prior. A pessimistic prior, as in the original BOP2 design, was selected to ensure that interim analysis conclusions are robust, and changing the prior, provided its effective sample size is small, does not alter the results significantly. One should nevertheless exercise caution regarding the weight of the prior relative to the interim data, although the calibration process further diminishes the influence of the prior distributions. As in the BOP2 design, the proposed thresholds are optimized over the entire joint distribution of efficacy and toxicity, thereby taking the correlation between efficacy and toxicity into account. This provides adaptability, allowing the assumed correlation between efficacy and toxicity to be adjusted. However, one should exercise caution when choosing this correlation, as an overestimated correlation may result in a slight inflation of the FWER (results not shown).
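The efficacy–toxicity correlation implied by a set of joint cell probabilities can be computed directly from the four cells. A small sketch, assuming the cell ordering (efficacy & toxicity, efficacy & no toxicity, no efficacy & toxicity, no efficacy & no toxicity); the function name is ours:

```python
import math

def endpoint_corr(theta):
    """Correlation of the binary efficacy and toxicity endpoints implied
    by joint cell probabilities theta, in the assumed order
    (eff & tox, eff & no tox, no eff & tox, no eff & no tox)."""
    t_et, t_en, t_nt, t_nn = theta
    p_e = t_et + t_en              # marginal probability of efficacy
    p_t = t_et + t_nt              # marginal probability of toxicity
    cov = t_et - p_e * p_t         # covariance of the two binary endpoints
    return cov / math.sqrt(p_e * (1 - p_e) * p_t * (1 - p_t))
```

For instance, the efficacy/non-toxicity cell probabilities used in Section 5, (0.15, 0.40, 0.05, 0.40), imply a positive correlation of about 0.20 between the two endpoints, whereas setting the joint efficacy-and-toxicity cell equal to the product of the marginals would encode independence.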

Future research directions can be explored in conjunction with the proposed method. Jiang et al. (Citation2021) recently proposed a seamless phase I/II design in which patients are split into indication-specific parallel subgroups for the phase II part of the design, similar to an uncontrolled basket trial. They rely on Bayesian hierarchical modeling to borrow information across subgroups. Further evaluation is needed to quantify the benefit of such an approach in our controlled setting. Furthermore, while we have only implemented futility stopping rules, efficacy stopping rules could also be considered for the multi-arm multi-stage trial, where a promising treatment showing signals of efficacy without toxicity could graduate early (Blenkinsop, Parmar, and Choodari-Oskooei Citation2019). Of note, the proposed method could be considered a selection design with multiple treatment candidates for a common indication. Since there is no strong consensus on Type I and II error rates for phase II trials, emphasis should be placed on the FWER or on power according to the purpose and setting of the study (Stallard Citation2012), compromising somewhat between the risks of false positives and false negatives, prior to formal efficacy assessment in phase III.

In conclusion, the proposed design for multi-arm multi-stage trials has demonstrated promising operating characteristics and could be employed to screen multiple treatments in phase II trials. Accounting for the available fraction of information and for the number of active arms allowed for an improvement in power, particularly in situations with multiple promising treatments. The R package to implement the methods proposed in this article is available at https://github.com/GuillaumeMulier/multibrasBOP2.

Supplementary Materials

The Supplementary Materials include the simulation scenarios and additional simulation results.

Acknowledgments

The authors would like to extend their gratitude to the Editor, Associate Editor, and two knowledgeable reviewers for their valuable and constructive comments, which have significantly improved the quality of the article.

Disclosure Statement

The authors report that there are no conflicts of interest to declare.

Additional information

Funding

Lin’s research was partly supported by grants from the National Cancer Institute (5P30CA016672 and 1R01CA261978).

References

  • Ades, L., Guerci, A., Laribi, K., Peterlin, P., Vey, N., Thepot, S., Wickenhauser, S., Zerazhi, H., Stamatoullas, A., Wattel, E., et al. (2018), “A Randomized Phase II Study of Azacitidine (AZA) Alone or with Lenalidomide (LEN), Valproic Acid (VPA) or Idarubicin (IDA) in Higher-Risk MDS: GFM’s ‘Pick a Winner’ Trial,” Blood, 132, 467. DOI: 10.1182/blood-2018-99-111756.
  • Berger, J. O., Wang, X., and Shen, L. (2014), “A Bayesian Approach to Subgroup Identification,” Journal of Biopharmaceutical Statistics, 24, 110–129. DOI: 10.1080/10543406.2013.856026.
  • Berry, D. A. (2006), “Bayesian Clinical Trials,” Nature Reviews Drug Discovery, 5, 27–36. DOI: 10.1038/nrd1927.
  • Blenkinsop, A., Parmar, M. K. B., and Choodari-Oskooei, B. (2019), “Assessing the Impact of Efficacy Stopping Rules on the Error Rates Under the Multi-Arm Multi-Stage Framework,” Clinical Trials, 16, 132–141. DOI: 10.1177/1740774518823551.
  • Bratton, D. J., Parmar, M. K. B., Phillips, P. P. J., and Choodari-Oskooei, B. (2016), “Type I Error Rates of Multi-Arm Multi-Stage Clinical Trials: Strong Control and Impact of Intermediate Outcomes,” Trials, 17, 1–8. DOI: 10.1186/s13063-016-1382-5.
  • Estey, E. H., and Thall, P. F. (2003), “New Designs for Phase 2 Clinical Trials,” Blood, 102, 442–448. DOI: 10.1182/blood-2002-09-2937.
  • Franklin, M. R., Platero, S., Saini, K. S., Curigliano, G., and Anderson, S. (2022), “Immuno-Oncology Trends: Preclinical Models, Biomarkers, and Clinical Development,” Journal for Immunotherapy of Cancer, 10, e003231. DOI: 10.1136/jitc-2021-003231.
  • Hobbs, B. P., Chen, N., and Jack Lee, J. (2018), “Controlled Multi-Arm Platform Design Using Predictive Probability,” Statistical Methods in Medical Research, 27, 65–78. DOI: 10.1177/0962280215620696.
  • Jacob, L., Uvarova, M., Boulet, S., Begaj, I., and Chevret, S. (2016), “Evaluation of a Multi-Arm Multi-Stage Bayesian Design for Phase II Drug Selection Trials–An Example in Hemato-Oncology,” BMC Medical Research Methodology, 16, 1–15. DOI: 10.1186/s12874-016-0166-7.
  • Jaki, T. F., Pallmann, P. S., and Magirr, D. (2019), “The R package MAMS for Designing Multi-Arm Multi-Stage Clinical Trials,” Journal of Statistical Software, 88, 1–25. DOI: 10.18637/jss.v088.i04.
  • Jennison, C., and Turnbull, B. W. (1999), Group Sequential Methods with Applications to Clinical Trials, Boca Raton, FL: CRC Press.
  • Jiang, L., Yan, F., Thall, P. F., and Huang, X. (2020), “Comparing Bayesian Early Stopping Boundaries for Phase II Clinical Trials,” Pharmaceutical Statistics, 19, 928–939. DOI: 10.1002/pst.2046.
  • Jiang, L., Li, R., Yan, F., Yap, T. A., and Yuan, Y. (2021), “Shotgun: A Bayesian Seamless Phase I-II Design to Accelerate the Development of Targeted Therapies and Immunotherapy,” Contemporary Clinical Trials, 104, 106338. DOI: 10.1016/j.cct.2021.106338.
  • Jung, S.-H. (2008), “Randomized Phase II Trials with a Prospective Control,” Statistics in Medicine, 27, 568–583. DOI: 10.1002/sim.2961.
  • Koehler, E., Brown, E., and Haneuse, S. J. P. A. (2009), “On the Assessment of Monte Carlo Error in Simulation-based Statistical Analyses,” The American Statistician, 63, 155–162. DOI: 10.1198/tast.2009.0030.
  • Magirr, D., Jaki, T., and Whitehead, J. (2012), “A Generalized Dunnett Test for Multi-Arm Multi-Stage Clinical Studies with Treatment Selection,” Biometrika, 99, 494–501. DOI: 10.1093/biomet/ass002.
  • Morris, T. P., White, I. R., and Crowther, M. J. (2019), “Using Simulation Studies to Evaluate Statistical Methods,” Statistics in Medicine, 38, 2074–2102. DOI: 10.1002/sim.8086.
  • Ryan, E. G., Bruce, J., Metcalfe, A. J., Stallard, N., Lamb, S. E., Viele, K., Young, D., and Gates, S. (2019), “Using Bayesian Adaptive Designs to Improve Phase III Trials: A Respiratory Care Example,” BMC Medical Research Methodology, 19, 1–10. DOI: 10.1186/s12874-019-0739-3.
  • Sacchi, S., Marcheselli, R., Bari, A., Buda, G., Molinari, A. L., Baldini, L., Vallisa, D., Cesaretti, M., Musto, P., Ronconi, S., et al. (2016), “Safety and Efficacy of Lenalidomide in Combination with Rituximab in Recurrent Indolent Non-follicular Lymphoma: Final Results of a Phase II Study Conducted by the Fondazione Italiana Linfomi,” Haematologica, 101, e196. DOI: 10.3324/haematol.2015.139329.
  • Scagliotti, G., Nishio, M., Satouchi, M., Valmadre, G., Niho, S., Galetta, D., Cortinovis, D., Benedetti, F., Yoshihara, E., Makris, L., et al. (2016), “A Phase 2 Randomized Study of TAS-102 versus Topotecan or Amrubicin in Patients Requiring Second-Line Chemotherapy for Small Cell Lung Cancer Refractory or Sensitive to Frontline Platinum-based Chemotherapy,” Lung Cancer, 100, 20–23. DOI: 10.1016/j.lungcan.2016.06.023.
  • Sharma, M. R., Stadler, W. M., and Ratain, M. J. (2011), “Randomized Phase II Trials: A Long-Term Investment with Promising Returns,” Journal of the National Cancer Institute, 103, 1093–1100. DOI: 10.1093/jnci/djr218.
  • Simon, R., Thall, P. F., and Ellenberg, S. S. (1994), “New Designs for the Selection of Treatments to be Tested in Randomized Clinical Trials,” Statistics in Medicine, 13, 417–429. DOI: 10.1002/sim.4780130506.
  • Spiegelhalter, D. J., Abrams, K. R., and Myles, J. P. (2004), Bayesian Approaches to Clinical Trials and Health-Care Evaluation (Vol. 13), Chichester: Wiley.
  • Stallard, N. (2012), “Optimal Sample Sizes for Phase II Clinical Trials and Pilot Studies,” Statistics in Medicine, 31, 1031–1042. DOI: 10.1002/sim.4357.
  • Sutter, S., and Lamotta, L. (2011), “Cancer Drugs have Worst Phase III Track Record,” https://www.mdedge.com/internalmedicine/article/24676/oncology/cancer-drugs-have-worst-phase-iii-track-record.
  • Thall, P. F., Simon, R., and Ellenberg, S. S. (1989), “A Two-Stage Design for Choosing Among Several Experimental Treatments and a Control in Clinical Trials,” Biometrics, 45, 537–547. DOI: 10.2307/2531495.
  • Thall, P. F., Simon, R. M., and Estey, E. H. (1995), “Bayesian Sequential Monitoring Designs for Single-Arm Clinical Trials with Multiple Outcomes,” Statistics in Medicine, 14, 357–379. DOI: 10.1002/sim.4780140404.
  • Wason, J. M. S., Stecher, L., and Mander, A. P. (2014), “Correcting for Multiple-Testing in Multi-Arm Trials: Is It Necessary and Is It Done?” Trials, 15, 1–7. DOI: 10.1186/1745-6215-15-364.
  • Yu, J. X., Hubbard-Lucey, V. M., and Tang, J. (2019), “Immuno-Oncology Drug Development Goes Global,” Nature Reviews Drug Discovery, 18, 899–901. DOI: 10.1038/d41573-019-00167-9.
  • Zhao, Y., Li, D., Liu, R., and Yuan, Y. (2023), “Bayesian Optimal Phase II Designs with Dual-Criterion Decision Making,” Pharmaceutical Statistics. DOI: 10.1002/pst.2296.
  • Zhou, H., Lee, J., and Yuan, Y. (2017), “BOP2: Bayesian Optimal Design for Phase II Clinical Trials with Simple and Complex Endpoints,” Statistics in Medicine, 36, 3302–3314. DOI: 10.1002/sim.7338.