Single-arm phase II three-outcome designs with handling of over-running/under-running

Abstract

Phase II clinical trials are commonly conducted as pilot studies to evaluate the efficacy and safety of an investigational drug in the targeted patient population with the disease or condition to be treated or prevented. When designing such a trial with efficacy conclusions in mind, people naturally think as follows: if the efficacy evidence is very strong, a go decision should be made; if the efficacy evidence is very weak, a no-go decision should be made; if the efficacy evidence is neither strong nor weak, no decision can be made (inconclusive). The designs presented in this paper match this natural thinking process with go/no-go/inconclusive outcomes. Both two- and three-stage designs with three outcomes are developed. Additionally, a general approach based on the conditional error function is implemented so that new decision boundaries can be calculated to handle a mid-course sample size change, which results in either ‘over-running’ or ‘under-running’, while ensuring control of the overall type I error. A free open-source R package, tsdf, that calculates the proposed two- and three-stage designs is available on CRAN.

1. Introduction

Phase II clinical trials are commonly conducted as pilot studies to evaluate the efficacy and safety of an investigational drug in the targeted patient population with the disease or condition to be treated or prevented. The initial assessment of efficacy plays an important role in determining whether a compound should be studied further in the targeted patient population. When designing such a trial with efficacy conclusions in mind, people naturally think as follows: if the efficacy evidence is very strong, a go decision should be made; if the efficacy evidence is absent or very weak, a no-go decision should be made; if the efficacy evidence is neither very strong nor very weak, no clear decision can be made (inconclusive). This decision process is essential in proof-of-concept (POC) studies and is fundamentally different from that of Phase III confirmatory studies. It is also quite common in real-world practice. For example, in one oncology study (the actual tumour type and compound are not revealed here since the compound is still under development), the decision tree is as follows: if the response rate is above 50%, a Phase III study will be launched with the single agent under study as the active treatment against an active control; if the response rate is below 30%, no further studies of the compound will be conducted for the cancer under study; if the response rate is between 30% and 50%, a study of the agent in combination with an existing treatment will be conducted. The question is how to design a trial that matches this natural decision process and supports it statistically.

The topic has been discussed in B. Zhong (2012) and W. Zhong and Zhong (2013), who proposed a design that accommodates this natural and practical thinking process and can provide a more realistic basis for decision making in Phase II trials than the most commonly used Simon's two-stage design. The design is outlined as follows. First, the minimal effective response rate, $p_{c}$, is selected. Then the hypotheses are set up as $H_{0} : p = p_{c}$ vs. $H_{1} : p \neq p_{c}$: if the data support the conclusion $p < p_{c}$, a no-go decision should be made, and if the data support the conclusion $p > p_{c}$, a go decision should be made. If the data support neither conclusion, neither a go nor a no-go decision can be made, resulting in an inconclusive outcome. Similar designs with three outcomes under different hypothesis settings were developed to overcome the issue that the union of the null and alternative hypotheses does not cover the whole parameter space (Hong & Wang, 2007; Sargent et al., 2001). In practice, Zhong's two-stage designs have some limitations, which motivated this paper. The objective of our paper is to modify and extend Zhong's two-stage designs to address the following. (1) Allow early stopping for efficacy. If the intervention proves to be superior, the study can be stopped early to maximize the number of patients who will benefit from the intervention, to save time and to reduce costs in drug development. (2) Apply the alpha spending method (Demets & Lan, 1994) to two- and three-stage designs so that the type I error allocated to each stage is controlled. (3) Extend the design so that an interval $[p_{l}, p_{u}]$ is used instead of a single point $p_{c}$, to accommodate the uncertainty in specifying $p_{c}$ in practice. Specifically, if the true response rate of the test treatment is p, then the test treatment is clearly not effective (no-go) if $p < p_{l}$ and clearly effective (go) if $p > p_{u}$. (4) Allow for mid-course sample size changes. In classical group sequential designs, the sample sizes for the interim and final analyses need to be pre-specified and followed strictly throughout the trial to control the overall type I error rate. In practice, however, the actual sample sizes often deviate from the planned ones, for example in multi-centre trials. These situations are referred to as ‘over-running’ and ‘under-running’ (Whitehead, 1992). To this end, we derive new decision boundaries based on the conditional error function that ensure strict control of the overall type I error rate in these ‘over-running’ or ‘under-running’ situations. Various procedures have been proposed to deal with unplanned sample sizes at the interim or final stage (Koyama & Chen, 2008; Li et al., 2002; Shan & Chen, 2018). However, these existing methods are either based on Simon's two-stage design, do not provide a flexible rule for early stopping for efficacy, or do not control the type I error rate at each stage.

The remainder of the paper is organized as follows. Section 2 explains how to calculate two- and three-stage designs. Section 3 explains the handling of ‘over-running’ and ‘under-running’ using the conditional error function. Section 4 introduces the R package tsdf, which implements the proposed two-/three-stage designs. Section 5 provides examples that demonstrate the proposed designs and the pragmatic handling of ‘over-running’ and ‘under-running’. A discussion is given in Section 6.

2. Two- and three-stage designs

The key to Zhong's two-stage design (B. Zhong, 2012) is to correctly specify the minimal effective response rate, $p_{c}$. Intuitively, any response rate below $p_{c}$ is considered ineffective and hence does not warrant further development, whereas any response rate above $p_{c}$ is considered effective and hence may warrant further development. The selection of the minimal effective response rate (threshold) is therefore critical. General guidelines on how to select the threshold in single-arm trials are given below. There are three scenarios in practice.

1. When a standard of care for the disease and population under investigation is available, a reliable point estimate of the response rate of the standard of care is commonly available. In this scenario, if the test treatment is to be developed as ‘similar’ to the standard of care, one choice is to select the point estimate of the standard of care as the minimal effective response rate. This is based on the fact that the standard of care is an effective therapy. If the test treatment is intended to be developed as superior to the standard treatment, a rate equal to or higher than the point estimate of the standard of care can be chosen according to clinical and statistical judgment. For oncology studies where the objective response rate is the endpoint, the minimal effective response rate may be chosen as 5% to 10% above the point estimate.
2. When a standard of care is not available but several treatments are available, the selection of the minimal effective response rate depends on the choice of control in future randomized registration studies. If one of the available treatments will be the active control, that treatment can serve as the ‘standard of care’ for the purpose of determining the minimal effective response rate. The patient population should mimic the population of the control as well. If the objective is to beat all available treatments, a possible choice is the highest point estimate of the response rates among them.
3. When there is no treatment available, the minimal effective response rate can be set based on clinical and statistical judgment. Historical data under best supportive care are commonly used to support such a choice.

Once the minimal effective response rate $p_{c}$ is chosen, the hypotheses are set as

$$H_{0} : p = p_{c} \quad \text{vs.} \quad H_{1} : p \neq p_{c}. \tag{1}$$

Note that the alternative can be decomposed into two parts: $H_{1}^{-} : p < p_{c}$ and $H_{1}^{+} : p > p_{c}$. There are three possible conclusions for the above hypothesis test: do not reject the null hypothesis, reject the null and conclude $H_{1}^{+}$, or reject the null and conclude $H_{1}^{-}$. That is, we may conclude that the response rate is equal to, higher than, or lower than the minimal effective response rate. This hypothesis test can also be generalized to the case where the minimal effective response rate is not a single value but a pre-specified interval. The hypotheses become

$$H_{0} : p \in [p_{l}, p_{u}] \quad \text{vs.} \quad H_{1} : p \notin [p_{l}, p_{u}], \tag{2}$$

where $p_{u} > p_{l}$. Similarly, we decompose the alternative as $H_{1}^{-} : p < p_{l}$ and $H_{1}^{+} : p > p_{u}$. The interval setup is motivated by situations where a single point cannot be confidently and accurately determined. One example is when a standard of care exists but more than one large, reliable trial is available and the trials yield different estimates of its response rate. Another example is when there is no available treatment and clinical judgment is expressed as an interval: a response rate below 15% is clearly not of interest and a response rate above 25% is clearly of interest for further development. In these situations, the threshold can be set as an interval $[p_{l}, p_{u}]$. That is, any response rate below $p_{l}$ is considered ineffective and hence does not warrant further development; any response rate above $p_{u}$ is considered effective and hence may warrant further development. The hypothesis test in (2) is equivalent to the test in (1) when $p_{l} = p_{u} = p_{c}$, so we proceed with (2) hereafter.

Denote by $x_{i}$ the cumulative number of responders among the $n_{1} + n_{2} + \dots + n_{i}$ patients treated through stage i. The corresponding left-side critical values are denoted $r_{i}$ and the right-side critical values $s_{i}$. At each stage, one of the following decisions is made.

If $x_{i} \leq r_{i}$, conclude $H_{1}^{-}$ and stop the trial for inefficacy.
If $x_{i} > s_{i}$, conclude $H_{1}^{+}$ and stop the trial for efficacy.
If $r_{i} < x_{i} \leq s_{i}$, proceed to the next stage and treat an additional $n_{i + 1}$ subjects.

The critical values $r_{i}$ and $s_{i}$ are determined by the significance levels, i.e., the left-side and right-side type I errors. Denote the overall left-side type I error by $α_{1}$, the overall right-side type I error by $α_{2}$, and the type II error by β. We use an α-spending function to distribute the overall type I error over the two or three stages, which prevents most of α from being spent in the early stages. We denote the cumulative left-side type I errors at stage i by $α_{1 i}$ ($α_{11} \leq α_{12}$) and the cumulative right-side type I errors by $α_{2 i}$ ($α_{21} \leq α_{22}$). We have the following constraints:

if $p = p_{l}$, the probability of concluding $H_{1}^{-}$ should not exceed $α_{1 i}$ (left-side type I error);
if $p = p_{u}$, the probability of concluding $H_{1}^{+}$ should not exceed $α_{2 i}$ (right-side type I error);
if the expected response rate is $p_{e}$ ($> p_{u}$), the probability of not concluding $H_{1}^{+}$ should not exceed β.

Before giving the details of two-stage and three-stage designs in the following sections, we discuss how the error constraints affect the trial design. The type I error is the probability of rejecting a true null hypothesis. On the left side, a high type I error means that it is more likely to conclude $H_{1}^{-}$, so designs with a high left-side type I error have a higher chance of terminating the trial for inefficacy. The right side works analogously: a low right-side type I error is more conservative, because it is harder to reject the null hypothesis in favour of $H_{1}^{+}$, i.e., harder to declare efficacy. Investigators can choose a suitable design by specifying the left-side type I error, the right-side type I error and the type II error.

We describe two-stage designs in Section 2.1 and three-stage designs in Section 2.2.

2.1. Two-stage designs

The two-stage design is set up as follows: $n_{1}$ patients are treated in the first stage and, if the trial continues to the second stage, an additional $n_{2}$ patients are treated. Recall that $x_{i}$ is the cumulative number of responders through stage i. The procedure is as follows, where $r_{i} \leq s_{i}$, $r_{1} \leq r_{2}$, $s_{1} \leq s_{2}$, $r_{i} \leq \sum_{k = 1}^{i} n_{k}$ and $s_{i} \leq \sum_{k = 1}^{i} n_{k}$.

Stage 1: treat $n_{1}$ patients
- If $x_{1} \leq r_{1}$, terminate the trial and conclude $H_{1}^{-}$.
- If $x_{1} > s_{1}$, terminate the trial and conclude $H_{1}^{+}$.
- If $r_{1} < x_{1} \leq s_{1}$, continue the trial and go to stage 2.
Stage 2: treat an additional $n_{2}$ patients
- If $x_{2} \leq r_{2}$, terminate the trial and conclude $H_{1}^{-}$.
- If $x_{2} > s_{2}$, terminate the trial and conclude $H_{1}^{+}$.
- If $r_{2} < x_{2} \leq s_{2}$, terminate the trial and conclude ‘the data do not contradict the null hypothesis’.

Denote the binomial cumulative distribution function by $B (\cdot, n, p)$ and the probability mass function by $b (\cdot, n, p)$, where n is the number of Bernoulli trials and p is the probability of success. We first calculate the probability of each conclusion at each stage. If the true response rate is p then, given $r_{1}$ and $n_{1}$, the probability of concluding $H_{1}^{-}$ at the first stage is

$$L_{1} (p) = B (r_{1}, n_{1}, p), \tag{3}$$

and, given $r_{2}$ and $n_{2}$, the probability of concluding $H_{1}^{-}$ at the second stage is

$$L_{2} (p) = \sum_{t_{1} = r_{1} + 1}^{s_{1}} b (t_{1}, n_{1}, p) \, B (r_{2} - t_{1}, n_{2}, p). \tag{4}$$

Similarly, the probabilities of concluding $H_{1}^{+}$ at stages 1 and 2 are

$$R_{1} (p) = 1 - B (s_{1}, n_{1}, p) \quad \text{and} \quad R_{2} (p) = \sum_{t_{1} = r_{1} + 1}^{s_{1}} b (t_{1}, n_{1}, p) \, [1 - B (s_{2} - t_{1}, n_{2}, p)].$$

The left-side and right-side type I errors $α_{1}, α_{2}$ are spent following an error spending method. For any chosen error spending function, the cumulative error rates $α_{i 1} \leq α_{i 2} = α_{i}$ allowed at each stage can be calculated. Therefore, $r_{i}, s_{i}$ must satisfy the type I error constraints

$$L_{1} (p_{l}) \leq α_{11}, \quad L_{1} (p_{l}) + L_{2} (p_{l}) \leq α_{12} = α_{1} \tag{5}$$

and

$$R_{1} (p_{u}) \leq α_{21}, \quad R_{1} (p_{u}) + R_{2} (p_{u}) \leq α_{22} = α_{2}. \tag{6}$$

The type II error constraint is

$$\sum_{i = 1}^{2} R_{i} (p_{e}) \geq 1 - β. \tag{7}$$

We search all combinations of $(r_{i}, s_{i}, n_{i})$ satisfying the type I and type II error constraints. Among these, we look for designs whose actual left-side and right-side type I errors are as close as possible to the desired levels, subject to not exceeding $α_{i 1}$ at stage 1 and $α_{i 2} = α_{i}$, $i = 1, 2$, at stage 2 as given by the chosen α-spending function, and that attain the minimal total sample size n with many admissible choices of $n_{1}$ under the expected response rate. In addition, it is desirable to maximize the power for a given n.

There are usually many designs $(r_{i}, s_{i}, n_{i})$ that satisfy constraints (5), (6) and (7), so additional selection criteria are needed to choose designs that meet practical considerations. Given $n_{1}, n_{2}$, all combinations of $(r_{i}, s_{i})$, $i \leq 2$, satisfying the constraints are collected in a matrix in R. The designs are then sorted in descending order by left-side type I error, right-side type I error and power, and the first design is chosen. For the given $n_{1}, n_{2}$, this design has the type I errors closest to $α_{11}$, $α_{21}$, $α_{12}$ and $α_{22}$ and the maximum power. We increment the total sample size n from 2 until a predetermined number of distinct choices of $n_{1}$ is found. Denote the smallest possible $n_{1}$ by $n_{1 s}$ and the largest by $n_{1 l}$. Then all integers between $n_{1 s}$ and $n_{1 l}$ can be used as the first-stage sample size. In practice, this means that once the total sample size is fixed, the stage 1 sample size can vary within a range instead of being fixed, which makes trial conduct easier to manage.
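To make the stage-wise probabilities and the error constraints of this section concrete, here is a minimal Python sketch using only the standard library. It is not the tsdf implementation: the helper names (`pmf`, `cdf`, `two_stage_probs`, `meets_constraints`) are hypothetical, and the example design uses an assumed stage 1 boundary $r_{1} = 6$ together with the Case 1 values $n = 50$, $r_{2} = 17$, $s_{2} = 24$ and $s_{1} = n_{1}$ (no early stopping for efficacy).

```python
from math import comb


def pmf(k, n, p):
    """Binomial probability b(k, n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def cdf(k, n, p):
    """Binomial CDF B(k, n, p) = P(X <= k)."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(pmf(j, n, p) for j in range(k + 1))


def two_stage_probs(p, n1, n2, r1, s1, r2, s2):
    """Return (L1, L2, R1, R2) at true response rate p."""
    L1 = cdf(r1, n1, p)
    R1 = 1.0 - cdf(s1, n1, p)
    L2 = R2 = 0.0
    for t1 in range(r1 + 1, min(s1, n1) + 1):   # stage 1 outcomes that continue
        w = pmf(t1, n1, p)
        L2 += w * cdf(r2 - t1, n2, p)
        R2 += w * (1.0 - cdf(s2 - t1, n2, p))
    return L1, L2, R1, R2


def meets_constraints(design, pl, pu, pe, a11, a12, a21, a22, beta):
    """Check the type I error constraints at pl, pu and the power at pe."""
    L1, L2, _, _ = two_stage_probs(pl, **design)
    _, _, R1, R2 = two_stage_probs(pu, **design)
    _, _, P1, P2 = two_stage_probs(pe, **design)
    return (L1 <= a11 and L1 + L2 <= a12 and
            R1 <= a21 and R1 + R2 <= a22 and
            P1 + P2 >= 1.0 - beta)


# Illustrative design; r1 = 6 is an assumption, not a value from the text.
design = dict(n1=22, n2=28, r1=6, s1=22, r2=17, s2=24)
L1, L2, R1, R2 = two_stage_probs(0.40, **design)
print("left-side type I error :", round(L1 + L2, 4))
print("right-side type I error:", round(R1 + R2, 4))
print("power at p = 0.55      :", round(sum(two_stage_probs(0.55, **design)[2:]), 4))
```

A full design search would loop this check over candidate $(r_{i}, s_{i}, n_{i})$ and rank the survivors; that ranking step is omitted here.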

With so many choices of stage 1 sample size, a natural question is where to set the target stage 1 sample size. One possible choice is the stage 1 sample size that minimizes the expected sample size under the null hypothesis. That is, we target the design that minimizes $n_{1} + n_{2} \times (1 - B (r_{1}, n_{1}, p_{l}))$ (without early stopping for efficacy) or $n_{1} + n_{2} \times (B (s_{1}, n_{1}, p_{u}) - B (r_{1}, n_{1}, p_{l}))$ (with early stopping for efficacy). The chosen design is called the optimal design.
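The expected sample size criterion can be evaluated directly, as in the following Python sketch. The function name `expected_sample_size` and its `stop_eff` argument are hypothetical names chosen for this example, and the stage 1 boundary $r_{1} = 6$ in the usage line is an assumption (it is not stated in the text); the other inputs are the Case 1 values of Section 5.

```python
from math import comb


def cdf(k, n, p):
    """Binomial CDF B(k, n, p) = P(X <= k)."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def expected_sample_size(n1, n2, r1, s1, pl, pu, stop_eff=False):
    """n1 + n2 * P(continue to stage 2), following the two formulas above."""
    p_stop_low = cdf(r1, n1, pl)                  # stop for inefficacy
    p_below_upper = cdf(s1, n1, pu) if stop_eff else 1.0
    return n1 + n2 * (p_below_upper - p_stop_low)


# Case 1 setting with an assumed r1 = 6 and no early stopping for efficacy.
ess = expected_sample_size(n1=22, n2=28, r1=6, s1=22, pl=0.40, pu=0.40)
print(round(ess, 3))   # about 45.56 under these assumed inputs
```

Under these assumed inputs the result is close to the expected sample size of 45.564 reported for the optimal Case 1 design.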

2.2. Three-stage designs

The three-stage design extends the two-stage design: an additional $n_{3}$ patients are treated if the decision at the end of stage 2 is to continue. Thus the sample size is at most $n_{1} + n_{2} + n_{3} = n$. The complete three-stage design is as follows.

Stage 1: treat $n_{1}$ patients
- If $x_{1} \leq r_{1}$, terminate the trial and conclude $H_{1}^{-}$.
- If $x_{1} > s_{1}$, terminate the trial and conclude $H_{1}^{+}$.
- If $r_{1} < x_{1} \leq s_{1}$, go to stage 2.
Stage 2: treat an additional $n_{2}$ patients
- If $x_{2} \leq r_{2}$, terminate the trial and conclude $H_{1}^{-}$.
- If $x_{2} > s_{2}$, terminate the trial and conclude $H_{1}^{+}$.
- If $r_{2} < x_{2} \leq s_{2}$, go to stage 3.
Stage 3: treat an additional $n_{3}$ patients
- If $x_{3} \leq r_{3}$, terminate the trial and conclude $H_{1}^{-}$.
- If $x_{3} > s_{3}$, terminate the trial and conclude $H_{1}^{+}$.
- If $r_{3} < x_{3} \leq s_{3}$, terminate the trial and conclude ‘the data do not contradict the null hypothesis’.

We only need to calculate the probabilities at the third stage, in addition to those for the first two stages given in the previous subsection. Given $n_{1}, n_{2}, n_{3}, r_{1}, r_{2}, r_{3}, s_{1}, s_{2}$, the probability of concluding $H_{1}^{-}$ at the third stage is

$$L_{3} (p) = \sum_{t_{1} = r_{1} + 1}^{s_{1}} \sum_{t_{2} = r_{2} - t_{1} + 1}^{s_{2} - t_{1}} b (t_{1}, n_{1}, p) \, b (t_{2}, n_{2}, p) \, B (r_{3} - t_{1} - t_{2}, n_{3}, p), \tag{8}$$

and the probability of concluding $H_{1}^{+}$ at the third stage is

$$R_{3} (p) = \sum_{t_{1} = r_{1} + 1}^{s_{1}} \sum_{t_{2} = r_{2} - t_{1} + 1}^{s_{2} - t_{1}} b (t_{1}, n_{1}, p) \, b (t_{2}, n_{2}, p) \, [1 - B (s_{3} - t_{1} - t_{2}, n_{3}, p)].$$

Combining (8) with the first two stages (3) and (4), the boundaries $r_{i}, s_{i}$ and sample sizes $n_{i}$ must satisfy the constraints

$$\sum_{k = 1}^{i} L_{k} (p_{l}) \leq α_{1 i} \quad \text{and} \quad \sum_{k = 1}^{i} R_{k} (p_{u}) \leq α_{2 i}, \quad i = 1, 2, 3,$$

and

$$\sum_{i = 1}^{3} R_{i} (p_{e}) \geq 1 - β.$$

The optimal design is chosen as described at the end of Section 2.1.
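The three-stage probabilities can be accumulated with a nested loop over the stage 1 and stage 2 responder counts. The Python sketch below is illustrative only (it is not the tsdf code); the function name `three_stage_probs` and the example design are hypothetical.

```python
from math import comb


def pmf(k, n, p):
    """Binomial probability b(k, n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def cdf(k, n, p):
    """Binomial CDF B(k, n, p) = P(X <= k)."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(pmf(j, n, p) for j in range(k + 1))


def three_stage_probs(p, n, r, s):
    """Return ([L1, L2, L3], [R1, R2, R3]) for 3-tuples n, r, s."""
    n1, n2, n3 = n
    r1, r2, r3 = r
    s1, s2, s3 = s
    L = [cdf(r1, n1, p), 0.0, 0.0]
    R = [1.0 - cdf(s1, n1, p), 0.0, 0.0]
    for t1 in range(r1 + 1, min(s1, n1) + 1):          # continue after stage 1
        w1 = pmf(t1, n1, p)
        L[1] += w1 * cdf(r2 - t1, n2, p)
        R[1] += w1 * (1.0 - cdf(s2 - t1, n2, p))
        lo, hi = max(r2 - t1 + 1, 0), min(s2 - t1, n2)
        for t2 in range(lo, hi + 1):                   # continue after stage 2
            w2 = w1 * pmf(t2, n2, p)
            L[2] += w2 * cdf(r3 - t1 - t2, n3, p)
            R[2] += w2 * (1.0 - cdf(s3 - t1 - t2, n3, p))
    return L, R


# Hypothetical design, for illustration only.
L, R = three_stage_probs(0.40, n=(15, 15, 20), r=(4, 10, 17), s=(11, 18, 24))
print("P(conclude H1-) by stage:", [round(v, 4) for v in L])
print("P(conclude H1+) by stage:", [round(v, 4) for v in R])
print("P(inconclusive)         :", round(1.0 - sum(L) - sum(R), 4))
```

A quick sanity check on any such implementation is that the go, no-go and inconclusive probabilities sum to one.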

3. Adjustment for over-running and under-running

To assure control of the type I error rate in a sequential clinical trial, the sample size at each stage needs to be specified in the protocol and adhered to strictly during study conduct. However, a considerable amount of uncertainty is usually expected during the planning stage of a study, and such strong restrictions on the sample sizes imposed by common two-/three-stage designs can make study conduct very challenging. For example, stopping recruitment exactly after a certain number of patients have been accrued can be difficult, especially in a multi-centre trial. Patients may already have been screened, and it is unethical to withhold treatment and exclude such patients from the study. As a result, over-running may occur. Moreover, even if exactly the number of patients pre-specified in the protocol has been recruited, not all patients may be evaluable for the primary endpoint, leading to a situation we refer to as ‘under-running’ of the study. In all of these cases, the predetermined sample sizes are violated and control of the type I error is no longer guaranteed.

In general, there are three possible scenarios: (1) the sample size is modified only at the interim analysis; (2) the sample size is modified only at the final analysis; (3) the sample size is modified at both the interim and the final analysis. For the first scenario, recall that our proposed two-/three-stage designs generate a range of designs satisfying the criteria. In other words, the proposed design allows a change of the sample size $n_{1}$ in the first stage of a two-stage design, or of $(n_{1}, n_{2})$ in the first two stages of a three-stage design, while controlling the type I error, as long as the new sample size $n_{1}^{*}$ or $(n_{1}^{*}, n_{2}^{*})$ is within the range and the total sample size remains the same as in the initial design. For the third scenario, it is easy to locate a design in our list of proposed designs that matches the modified sample size at the interim analysis, so the third scenario reduces to the second. Therefore, we only need to consider the case where the sample size at the final analysis is modified, and we discuss possible remedies for such mid-course modifications while strictly controlling the overall type I error rate.

We apply the conditional error function approach of Englert and Kieser (2012, 2015), which allows arbitrary modifications of the sample size based on the results of the interim analysis or on external information while controlling the overall type I error. The concept of the conditional error function (Proschan & Hunsberger, 1995) was introduced as a method to test the null hypothesis within a two-stage design while allowing data-dependent modifications of the sample size after the first stage. The conditional significance level of the second stage depends on the outcome of the first stage. At the same time, conditional on the results of the interim analysis, the rejection region of the final decision is invariant to mid-course modifications, so that the unconditional overall type I error is controlled. This flexibility is referred to as the ‘conditional invariance principle’ (Brannath et al., 2007).

For single-arm oncology trials with a binary endpoint, the conditional error function is the type I error rate used at the final stage given the number of responses observed in the previous stages. We first apply the conditional error function approach to the proposed two-stage design. Assume the sample size modification happens at the second stage. After completion of the first stage, the trial proceeds to the second stage and the second-stage sample size may be changed from $n_{2}$ to $n_{2}^{*}$, so we need to find new boundaries $r_{2}^{*}$ and $s_{2}^{*}$ that assure control of the overall type I error. Recall that our hypotheses are $H_{0} : p \in [p_{l}, p_{u}]$ vs. $H_{1} : p \notin [p_{l}, p_{u}]$ and that the alternative hypothesis decomposes into $H_{1}^{-} : p < p_{l}$ and $H_{1}^{+} : p > p_{u}$. We allocate type I errors $α_{1}$ and $α_{2}$ to the left side and the right side, respectively. Denote the number of responses at the first stage by $t_{1}$. The conditional error function for the left-side test of $H_{1}^{-}$, i.e., the conditional probability of concluding $H_{1}^{-}$ given $t_{1}$ when $p = p_{l}$, is

$$f_{l} (t_{1}) = \begin{cases} 1, & \text{if } t_{1} \leq r_{1}, \\ B (r_{2}^{*} - t_{1}, n_{2}^{*}, p_{l}), & \text{if } r_{1} < t_{1} \leq s_{1}, \\ 0, & \text{if } t_{1} > s_{1}, \end{cases}$$

and, for the right-side test of $H_{1}^{+}$ when $p = p_{u}$,

$$f_{r} (t_{1}) = \begin{cases} 0, & \text{if } t_{1} \leq r_{1}, \\ 1 - B (s_{2}^{*} - t_{1}, n_{2}^{*}, p_{u}), & \text{if } r_{1} < t_{1} \leq s_{1}, \\ 1, & \text{if } t_{1} > s_{1}. \end{cases}$$

The conditional error functions can be regarded as the conditional significance levels to be used at the final analysis. The values 0 and 1 correspond to early termination at stage 1: when the trial stops for inefficacy ($t_{1} \leq r_{1}$), $f_{l} = 1$ and $f_{r} = 0$; when it stops for efficacy ($t_{1} > s_{1}$), $f_{l} = 0$ and $f_{r} = 1$. We then have the left-side type I error

$$α_{1}^{*} = \sum_{t_{1} = 0}^{n_{1}} f_{l} (t_{1}) \cdot b (t_{1}, n_{1}, p_{l}),$$

and the right-side type I error

$$α_{2}^{*} = \sum_{t_{1} = 0}^{n_{1}} f_{r} (t_{1}) \cdot b (t_{1}, n_{1}, p_{u}).$$

With these definitions, $α_{1}^{*}$ and $α_{2}^{*}$ equal $L_{1} (p_{l}) + L_{2} (p_{l})$ and $R_{1} (p_{u}) + R_{2} (p_{u})$ of Section 2.1 evaluated with $n_{2}^{*}$, $r_{2}^{*}$ and $s_{2}^{*}$ in place of $n_{2}$, $r_{2}$ and $s_{2}$. The boundaries $r_{2}^{*}$ and $s_{2}^{*}$ are selected such that the actual type I error $α_{i}^{*}$ is at most $α_{i}$, $i = 1, 2$, under the null hypothesis. Among all possible combinations of $r_{2}^{*}$ and $s_{2}^{*}$, the one with type I errors closest to the desired significance levels and the smallest type II error is selected.
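The boundary recalculation for a changed stage 2 sample size can be sketched as follows in Python. This is a simplified illustration, not the authors' code. It takes the left-side conditional error to be 1 when the trial stops for inefficacy at stage 1 and 0 when it stops for efficacy (and the reverse for the right side), so that the sums reproduce $L_{1} + L_{2}$ and $R_{1} + R_{2}$ of Section 2.1. It picks the largest $r_{2}^{*}$ and the smallest $s_{2}^{*}$ that keep each error within its budget, which is one simple way to get close to the nominal levels. The function names and the stage 1 boundary $r_{1} = 6$ are assumptions.

```python
from math import comb


def pmf(k, n, p):
    """Binomial probability b(k, n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def cdf(k, n, p):
    """Binomial CDF B(k, n, p) = P(X <= k)."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(pmf(j, n, p) for j in range(k + 1))


def overall_errors(n1, r1, s1, n2_new, r2_new, s2_new, pl, pu):
    """Left- and right-side type I errors after changing stage 2 to n2_new."""
    a1 = a2 = 0.0
    for t1 in range(n1 + 1):
        if t1 <= r1:              # stopped for inefficacy: H1- concluded
            f_l, f_r = 1.0, 0.0
        elif t1 > s1:             # stopped for efficacy: H1+ concluded
            f_l, f_r = 0.0, 1.0
        else:                     # continued: conditional errors at the final look
            f_l = cdf(r2_new - t1, n2_new, pl)
            f_r = 1.0 - cdf(s2_new - t1, n2_new, pu)
        a1 += f_l * pmf(t1, n1, pl)
        a2 += f_r * pmf(t1, n1, pu)
    return a1, a2


def adjust_boundaries(n1, r1, s1, n2_new, alpha1, alpha2, pl, pu):
    """Largest r2* and smallest s2* whose errors stay within alpha1, alpha2."""
    n = n1 + n2_new
    # The left error depends only on r2*, the right error only on s2*.
    r_ok = [r for r in range(r1, n + 1)
            if overall_errors(n1, r1, s1, n2_new, r, n, pl, pu)[0] <= alpha1]
    s_ok = [s for s in range(r1, n + 1)
            if overall_errors(n1, r1, s1, n2_new, r1, s, pl, pu)[1] <= alpha2]
    if not r_ok or not s_ok:
        return None
    return max(r_ok), min(s_ok)


# Case 1 setting with an assumed r1 = 6; stage 2 under- and over-runs.
for n2_new in (25, 28, 31):
    print(n2_new, adjust_boundaries(22, 6, 22, n2_new, 0.3, 0.1, 0.40, 0.40))
```

Because the left-side error can only decrease and the right-side error can only increase as $n_{2}^{*}$ grows with the boundaries held fixed, both recalculated boundaries are non-decreasing in $n_{2}^{*}$.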

For three-stage designs, the third-stage sample size may be changed from $n_{3}$ to $n_{3}^{*}$ after completion of stages 1 and 2. We need to obtain new boundaries $r_{3}^{*}$ and $s_{3}^{*}$ to control the overall type I error. As in the two-stage case, the right-side type I error is

$$α_{2}^{* *} = \sum_{t_{1} = 0}^{n_{1}} \sum_{t_{2} = 0}^{n_{2}} f_{r} (t_{1}, t_{2}) \cdot b (t_{1}, n_{1}, p_{u}) \cdot b (t_{2}, n_{2}, p_{u}),$$

where

$$f_{r} (t_{1}, t_{2}) = \begin{cases} 0, & \text{if } t_{1} \leq r_{1}, \text{ or } r_{1} < t_{1} \leq s_{1} \text{ and } t_{1} + t_{2} \leq r_{2}, \\ 1, & \text{if } t_{1} > s_{1}, \text{ or } r_{1} < t_{1} \leq s_{1} \text{ and } t_{1} + t_{2} > s_{2}, \\ 1 - B (s_{3}^{*} - t_{1} - t_{2}, n_{3}^{*}, p_{u}), & \text{otherwise}, \end{cases}$$

and the left-side type I error is

$$α_{1}^{* *} = \sum_{t_{1} = 0}^{n_{1}} \sum_{t_{2} = 0}^{n_{2}} f_{l} (t_{1}, t_{2}) \cdot b (t_{1}, n_{1}, p_{l}) \cdot b (t_{2}, n_{2}, p_{l}),$$

where

$$f_{l} (t_{1}, t_{2}) = \begin{cases} 1, & \text{if } t_{1} \leq r_{1}, \text{ or } r_{1} < t_{1} \leq s_{1} \text{ and } t_{1} + t_{2} \leq r_{2}, \\ 0, & \text{if } t_{1} > s_{1}, \text{ or } r_{1} < t_{1} \leq s_{1} \text{ and } t_{1} + t_{2} > s_{2}, \\ B (r_{3}^{*} - t_{1} - t_{2}, n_{3}^{*}, p_{l}), & \text{otherwise}. \end{cases}$$

We solve for $r_{3}^{*}$ and $s_{3}^{*}$ with the same approach as in the two-stage case.
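The three-stage errors can be evaluated with a double sum over the stage 1 and stage 2 responder counts. The Python sketch below is illustrative only: the names `cond_err_3` and `overall_errors_3` and the example design are hypothetical. As for two stages, it assigns a left-side conditional error of 1 to paths that stop early for inefficacy and 0 to paths that stop early for efficacy (the reverse on the right side), and it evaluates the left-side error at $p_{l}$ and the right-side error at $p_{u}$.

```python
from math import comb


def pmf(k, n, p):
    """Binomial probability b(k, n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def cdf(k, n, p):
    """Binomial CDF B(k, n, p) = P(X <= k)."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    return sum(pmf(j, n, p) for j in range(k + 1))


def cond_err_3(t1, t2, r, s, n3_new, r3_new, s3_new, pl, pu):
    """(f_l, f_r) given t1 stage 1 and t2 stage 2 responders."""
    (r1, r2), (s1, s2) = r, s
    if t1 <= r1:
        return 1.0, 0.0           # stopped at stage 1 for inefficacy
    if t1 > s1:
        return 0.0, 1.0           # stopped at stage 1 for efficacy
    x2 = t1 + t2
    if x2 <= r2:
        return 1.0, 0.0           # stopped at stage 2 for inefficacy
    if x2 > s2:
        return 0.0, 1.0           # stopped at stage 2 for efficacy
    return (cdf(r3_new - x2, n3_new, pl),
            1.0 - cdf(s3_new - x2, n3_new, pu))


def overall_errors_3(n1, n2, r, s, n3_new, r3_new, s3_new, pl, pu):
    """Left- and right-side type I errors with the modified third stage."""
    a1 = a2 = 0.0
    for t1 in range(n1 + 1):
        for t2 in range(n2 + 1):
            f_l, f_r = cond_err_3(t1, t2, r, s, n3_new, r3_new, s3_new, pl, pu)
            a1 += f_l * pmf(t1, n1, pl) * pmf(t2, n2, pl)
            a2 += f_r * pmf(t1, n1, pu) * pmf(t2, n2, pu)
    return a1, a2


# Hypothetical design: stage 3 over-runs from a planned 20 to 22 patients.
print(overall_errors_3(15, 15, (4, 10), (11, 18), 22, 18, 25, 0.40, 0.40))
```

The new boundaries can then be searched exactly as in the two-stage sketch, by scanning $r_{3}^{*}$ and $s_{3}^{*}$ and keeping the values whose errors stay within $α_{1}$ and $α_{2}$.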

By applying the conditional error function technique, we re-calculate the decision boundaries at the final stage of two-stage and three-stage designs when the sample size is changed only at the final analysis. Together with the range of admissible interim sample sizes described above, the proposed design therefore allows free sample size modification at every stage while controlling the overall type I error rate. Examples are given in Section 5.

4. Software

We provide an R package that calculates the proposed two-/three-stage designs. This section gives a general sense of the package and introduces some basic usage. The complete documentation is available on CRAN (Guo et al., 2019). To install tsdf from CRAN, run install.packages("tsdf") in the R console.

The main function for Phase II designs is opt.design. We briefly go over this function, some basic operations and its output. opt.design requires at least five inputs: alpha1, the left-side type I error; alpha2, the right-side type I error; beta, the type II error; pc, the minimal effective response rate used in the null hypothesis, which can be a single value or an interval; and pe, the expected response rate. So a simple example would be

The above code returns an object that contains all feasible designs and prints out the optimal one as below.

By default, this function calculates two-stage designs that do not include early stopping for efficacy and do not apply the alpha-spending method. Other key options include

stage: A single value indicating whether two- or three-stage designs should be returned.
stop.eff: A logical flag indicating whether the trial allows early stopping for efficacy.
sf.param: A single real value specifying the gamma parameter of the Hwang-Shih-DeCani spending function.

tsdf supports the Hwang-Shih-DeCani α-spending function (Hwang et al., 1990), which takes the form $f (t, α, γ) = α (1 - \exp (- t γ)) / (1 - \exp (- γ))$, where α is the overall type I error, t is the proportion of the sample size (information) at which the spending function is evaluated, and γ is a parameter that controls how α is distributed over the stages. In opt.design, sf.param specifies the choice of γ. Increasing γ means that more error is spent at the early stages and less is available at the late stages. For example, a value of $γ = - 4$ is used to approximate an O'Brien–Fleming design (O'Brien & Fleming, 1979), while a value of $γ = 1$ approximates a Pocock design (Jennison & Turnbull, 2000). The maximum sample size is set to 100 by default, because computation takes longer when n is large; users can change it according to their design settings and computing power. More details on the package can be found in the R documentation (Guo et al., 2019).
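For readers who want to see how γ shifts the error between stages, the spending function can be evaluated directly. The Python sketch below is a stand-alone illustration, not the tsdf code; the function name `hsd_spending` is hypothetical, and the $γ = 0$ branch uses the limiting value $α t$ of the formula above.

```python
from math import exp


def hsd_spending(t, alpha, gamma):
    """Cumulative alpha spent at information fraction t (0 <= t <= 1)."""
    if gamma == 0:
        return alpha * t          # limit of the formula as gamma -> 0
    return alpha * (1.0 - exp(-gamma * t)) / (1.0 - exp(-gamma))


# Cumulative right-side error (alpha2 = 0.1) spent halfway through the trial.
for gamma in (-4, 0, 1):
    print(gamma, round(hsd_spending(0.5, 0.1, gamma), 4))
# gamma = -4 spends about 0.0119 by t = 0.5, gamma = 1 about 0.0622
```

Whatever γ is chosen, the whole α is spent at $t = 1$, so only the split between the interim and final looks changes.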

5. Evaluation

Simon's two-stage design (Simon, 1989) is the most commonly used design for Phase II oncology trials, and investigators and institutional review boards (IRBs) may ask why a new design is needed and how it differs from Simon's design. We summarize the differences in Table 1. In this section, we implement the proposed two-/three-stage designs and illustrate the procedure involved.

Table 1. Differences between Simon's two-stage design and the new design.


We look at two case studies, one with the minimal effective response rate specified as a single point and one with it specified as an interval.

Case 1

The minimal effective response rate is 0.40. The postulated response rate is at least 0.55. The type I errors for concluding that the compound is ineffective (conclude p<0.4) when p = 0.4 and effective (conclude p>0.4) when p = 0.4 are 0.3 and 0.1, respectively. The power to conclude that the compound is effective (conclude p>0.4) when p = 0.55 is at least 0.80.

We used the Pocock spending function and calculated designs with and without early stopping for efficacy. We increase the total sample size n by 1, starting from 2, and stop once there are at least five choices of $n_{1}$ in the range of 30–60% of the total.

Table 2(a) summarizes two-stage designs without early stopping for efficacy. The study will enroll a total of 50 response-evaluable subjects, with a stage 1 sample size between 15 and 30. The response-evaluable subjects consist of all subjects who receive at least one dose of study drug and have at least one post-treatment disease assessment. At the end of the second stage, efficacy is declared when there are more than 24 responders among the 50 response-evaluable subjects, while inefficacy is declared when there are 17 or fewer responders. In this example, the boundaries at the second stage are not affected by the stage 1 sample size. In fact, $r_{2}, s_{2}$ usually do not change much when the stage 1 sample size varies, because the total sample size remains the same. The optimal design has a stage 1 sample size of 22, with an expected sample size of 45.564. Table 3 provides designs in which the sample size at stage 2 is changed from $n_{2} = 28$ to $n_{2}^{*} = 25, \dots, 31$ while the stage 1 sample size remains the same. As can be observed, the new boundaries $r_{2}^{*}$ and $s_{2}^{*}$ are adjusted along with the new sample size $n_{2}^{*}$: efficacy and inefficacy are claimed with a smaller number of responders in the under-running cases, while a larger number of responders is needed to reach a conclusion in the over-running cases. Table 2(b) presents two-stage designs with early stopping for efficacy. The total sample size is again 50 and the stage 1 sample size ranges from 15 to 30. The optimal design is the same except for the added efficacy boundary at stage 1. For example, when the stage 1 sample size is $n_{1} = 15$, efficacy is declared at stage 1 when there are more than 11 responders. The inefficacy boundaries are similar to those of the designs without early stopping for efficacy in Table 2(a).

Table 2. Two-stage designs for Case 1: the minimal effective response rate is 0.40. The postulated response rate is at least 0.55. The type I errors to conclude the compound is ineffective (conclude p<0.4) when p = 0.4 and effective (conclude p>0.4) when p = 0.4 are 0.3 and 0.1, respectively. The power to conclude the compound is effective (conclude p>0.4) when p = 0.55 is at least 0.80.


Table 3. Two-stage designs for Case 1 with over-running and under-running at stage 2 (no stopping for efficacy). The planned design has $n_{1} = 22, n_{2} = 28$ . The sample size at stage 2 is changed from 28 to 25–27 (under-running) and 29–31 (over-running).

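The direction of the adjustment for over-running and under-running can be illustrated with a conditional error calculation. The sketch below uses one common form of the conditional error principle: given the stage 1 count, the new stage 2 cutoff is chosen so that the conditional probability of declaring efficacy under the null rate does not exceed its planned value. It is not necessarily the exact algorithm behind Table 3, which reports a single pair of boundaries per $n_{2}^{*}$. The stage 1 count `x1 = 9` is a hypothetical observation and the function names are ours.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def adjusted_efficacy_cutoff(x1, n2, n2_new, eff2, p0):
    """Smallest total responder count that declares efficacy after n2_new
    stage 2 subjects, keeping the conditional type I error given x1 at or
    below the planned one (planned rule: total responders > eff2)."""
    planned_error = upper_tail(eff2 + 1 - x1, n2, p0)
    for need in range(n2_new + 2):  # need = stage 2 responders required
        if upper_tail(need, n2_new, p0) <= planned_error:
            return x1 + need
    raise AssertionError("unreachable: need = n2_new + 1 always qualifies")


# Planned design from the text: n2 = 28, efficacy if total > 24, null rate 0.40.
# x1 = 9 is a hypothetical stage 1 result used only for illustration.
for n2_new in range(25, 32):
    print(n2_new, adjusted_efficacy_cutoff(x1=9, n2=28, n2_new=n2_new, eff2=24, p0=0.40))
```

When `n2_new` equals the planned 28, the function returns 25, which is the planned rule of more than 24 responders. Smaller stage 2 sizes can only lower the cutoff and larger ones can only raise it, in line with the pattern described for the under-running and over-running cases.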

Case 2

The minimal effective response rate is between 0.40 and 0.45. The expected response rate is at least 0.60. The type I errors to conclude the compound is ineffective (conclude p<0.4) when p = 0.4 and effective (conclude p>0.45) when p = 0.45 are 0.3 and 0.1, respectively. The power to conclude the compound is effective (conclude p>0.4) when p = 0.60 is at least 0.80.

Using the assumptions in Case 2, the study will need 53 response-evaluable subjects, with a stage 1 sample size between 15 and 32. The inefficacy boundary for stage 1 varies from 3 to 10. At the end of the second stage, efficacy is declared when there are more than 28 responders among the 53 response-evaluable subjects, and inefficacy is declared when there are 18 or fewer responders. The optimal design has a stage 1 sample size of 22, with an expected sample size of 48.088. The maximum expected sample size is 50.633, attained when the stage 1 sample size is 19. The expected sample size therefore differs by only about 2.5 subjects across the candidate designs, even though the stage 1 sample size can differ considerably. This indicates that it is not necessary to insist on the optimal design.
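The role of the two null values in Case 2 can be checked roughly with final-stage binomial tails. The sketch below deliberately ignores the stage 1 look, so it does not reproduce the design's exact operating characteristics: early stopping for inefficacy can only lower the go probabilities and raise the no-go probability relative to these figures. The boundaries (more than 28, at most 18) and the total of 53 come from the text; the variable names are ours.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(max(k, 0), n + 1))


n_total, eff2, ineff2 = 53, 28, 18        # final-stage values quoted in the text
p_low, p_high, p_alt = 0.40, 0.45, 0.60   # interval null [0.40, 0.45] and alternative

# Inefficacy error is evaluated at the lower end of the interval.
no_go_at_low = 1.0 - upper_tail(ineff2 + 1, n_total, p_low)   # P(X <= 18 | p = 0.40)
# Efficacy error is evaluated at the upper end of the interval.
go_at_high = upper_tail(eff2 + 1, n_total, p_high)            # P(X > 28 | p = 0.45)
# Probability of an efficacy conclusion at the expected response rate.
go_at_alt = upper_tail(eff2 + 1, n_total, p_alt)              # P(X > 28 | p = 0.60)

print(f"P(no-go | p=0.40) = {no_go_at_low:.4f}")
print(f"P(go    | p=0.45) = {go_at_high:.4f}")
print(f"P(go    | p=0.60) = {go_at_alt:.4f}")
```

Using a separate null value for each error is what distinguishes the interval formulation of Case 2 from Case 1, where both errors are evaluated at the single rate 0.40.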

Table 4. Two-stage designs for Case 2: the minimal effective response rate is between 0.40 and 0.45. The expected response rate is at least 0.60. The type I errors to conclude the compound is ineffective (conclude p<0.4) when p = 0.4 and effective (conclude p>0.45) when p = 0.45 are 0.3 and 0.1, respectively. The power to conclude the compound is effective (conclude p>0.4) when p = 0.60 is at least 0.80.


6. Summary and discussion

This paper presented modifications and extensions of the two-stage designs of B. Zhong (2012) to solve some practical issues that arise when conducting single-arm Phase II oncology trials. We provide two-/three-stage designs that allow a flexible interim sample size, an early stopping rule for efficacy, a hypothesized interval for the response rate when a single minimal response rate is not available, and handling of over-running and under-running while controlling the overall type I error. Moreover, we provide an open-source R package that integrates the aforementioned features.

In Zhong's design, an inconclusive outcome is added as a natural outcome of the designs. The inconclusive outcome corresponds to a trial where definitive go/no-go decisions cannot be made. This inconclusive result is unavoidable because of the small sample size as well as the intrinsic nature of these uncontrolled clinical trials. Our design can be further extended by integrating an additional hypothesis test on a secondary endpoint and dividing the ‘grey zone’ into a ‘sub-go zone’ and a ‘sub-no-go zone’. Moreover, the inefficacy/efficacy boundaries are established with the new two-stage design.

The concept of inefficacy is used instead of the commonly used term ‘futility’. The term ‘futility’ in clinical trials refers to the inability of a clinical trial to achieve its objectives. For example, suppose a trial is designed assuming a 65% response rate for the test treatment group and a 50% response rate for the control group (standard of care), with the intention of establishing superiority. The study objective is therefore to demonstrate that the test treatment is superior to the standard of care. In the middle of the study, an interim analysis is performed and reveals a 55% response rate for the new treatment group and a 50% response rate for the control group. Based on the interim results, it is concluded that the study is unlikely to reach statistical significance at its end, and the trial is terminated. Note that, by the observed interim response rates, the test treatment is still better than the standard of care. If the sample size were large enough, statistical significance might still be demonstrated. In other words, futility does not mean inefficaciousness. In this example, the observed response rate at the interim analysis is 55%, which is higher than the 50% response rate of the standard of care. The test treatment is still effective even if it has a 50% response rate. Ineffectiveness therefore cannot be concluded from the interim data, although a futility conclusion can be made.

The designs presented in this paper can also be used for multi-arm trials with the intention of identifying or ruling out ineffective doses/treatments, where go/no-go decisions are based on the data from individual arms. Finally, the framework of the paper is based on hypothesis testing. Other extensions, such as using a confidence-interval approach or extending our framework to other types of endpoints and to randomized trials, will be investigated in the future.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

Brannath, W., Koenig, F., & Bauer, P. (2007). Multiplicity and flexibility in clinical trials. Pharmaceutical Statistics, 6(3), 205–216. https://doi.org/10.1002/pst.302
Demets, D. L., & Lan, K. K. G. (1994). Interim analysis: The alpha spending function approach. Statistics in Medicine, 13(13–14), 1341–1352. https://doi.org/10.1002/(ISSN)1097-0258
Englert, S., & Kieser, M. (2012). Improving the flexibility and efficiency of phase II designs for oncology trials. Biometrics, 68(3), 886–892. https://doi.org/10.1111/j.1541-0420.2011.01720.x
Englert, S., & Kieser, M. (2015). Methods for proper handling of overrunning and underrunning in phase II designs for oncology trials. Statistics in Medicine, 34(13), 2128–2137. https://doi.org/10.1002/sim.v34.13
Guo, W., Hui, J., & Zhong, B. (2019). TSDF: Two-/three-stage designs for phase 1&2 clinical trials. R package version 1.1-7.
Hong, S., & Wang, Y. (2007). A three-outcome design for randomized comparative phase II clinical trials. Statistics in Medicine, 26(19), 3525–3534. https://doi.org/10.1002/(ISSN)1097-0258
Hwang, I. K., Shih, W. J., & De Cani, J. S. (1990). Group sequential designs using a family of type I error probability spending functions. Statistics in Medicine, 9(12), 1439–1445. https://doi.org/10.1002/(ISSN)1097-0258
Jennison, C., & Turnbull, B. W. (2000). Group sequential methods with applications to clinical trials. Chapman and Hall.
Koyama, T., & Chen, H. (2008). Proper inference from Simon's two-stage designs. Statistics in Medicine, 27(16), 3145–3154. https://doi.org/10.1002/sim.v27:16
Li, G., Shih, W. J., Xie, T., & Lu, J. (2002). A sample size adjustment procedure for clinical trials based on conditional power. Biostatistics, 3(2), 277–287. https://doi.org/10.1093/biostatistics/3.2.277
O'Brien, P. C., & Fleming, T. R. (1979). A multiple testing procedure for clinical trials. Biometrics, 35(3), 549–556. https://doi.org/10.2307/2530245
Proschan, M. A., & Hunsberger, S. A. (1995). Designed extension of studies based on conditional power. Biometrics, 51(4), 1315–1324. https://doi.org/10.2307/2533262
Sargent, D. J., Chan, V., & Goldberg, R. M. (2001). A three-outcome design for phase II clinical trials. Controlled Clinical Trials, 22(2), 117–125. https://doi.org/10.1016/S0197-2456(00)00115-X
Shan, G., & Chen, J. J. (2018). Optimal inference for Simon's two-stage design with over or under enrollment at the second stage. Communications in Statistics - Simulation and Computation, 47(4), 1157–1167. https://doi.org/10.1080/03610918.2017.1307398
Simon, R. (1989). Optimal two-stage designs for phase II clinical trials. Controlled Clinical Trials, 10(1), 1–10. https://doi.org/10.1016/0197-2456(89)90015-9
Whitehead, J. (1992). Overrunning and underrunning in sequential clinical trials. Controlled Clinical Trials, 13(2), 106–121. https://doi.org/10.1016/0197-2456(92)90017-T
Zhong, B. (2012). Single-arm phase IIa clinical trials with go/no-go decisions. Contemporary Clinical Trials, 33(6), 1272–1279. https://doi.org/10.1016/j.cct.2012.07.006
Zhong, W., & Zhong, B. (2013). One-sample proportion testing procedure for hypothesis of inequality. Journal of Biopharmaceutical Statistics, 23(3), 604–617. https://doi.org/10.1080/10543406.2012.756501
