Methodological Studies

Using a Multi-Site RCT to Predict Impacts for a Single Site: Do Better Data and Methods Yield More Accurate Predictions?

Pages 184-210 | Received 13 Feb 2022, Accepted 01 Feb 2023, Published online: 13 Apr 2023
 

Abstract

Multi-site randomized controlled trials (RCTs) provide unbiased estimates of the average impact in the study sample. However, their ability to accurately predict the impact for individual sites outside the study sample, to inform local policy decisions, is largely unknown. To extend prior research on this question, we analyzed six multi-site RCTs and tested modern prediction methods—lasso regression and Bayesian Additive Regression Trees (BART)—using a wide range of moderator variables. The main study findings are that: (1) all of the methods yielded accurate impact predictions when the variation in impacts across sites was close to zero (as expected); (2) none of the methods yielded accurate impact predictions when the variation in impacts across sites was substantial; and (3) BART typically produced “less inaccurate” predictions than lasso regression or than the Sample Average Treatment Effect. These results raise concerns that when the impact of an intervention varies considerably across sites, statistical modeling using the data commonly collected by multi-site RCTs will be insufficient to explain the variation in impacts across sites and accurately predict impacts for individual sites.

Notes

1 Cross-validation is a technique used to protect against overfitting in a predictive model, particularly when the amount of data is limited. In cross-validation, the dataset is divided into several subsets, typically referred to as “folds.” A given model is fit multiple times, excluding a single fold from each model run and then using the fitted model to generate predictions for the excluded fold. This process is repeated while varying model hyperparameters, most often hyperparameters that reduce or increase the number of variables included in the model. The “best” model is typically identified as the one with the lowest average out-of-sample prediction error. Averaging the prediction error across multiple folds reduces its variance, so the selected model is less likely to be influenced by outliers or by an unlucky split of the data.
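For concreteness, the sketch below illustrates this procedure with scikit-learn, tuning a lasso penalty by 5-fold cross-validation on simulated data. The variable names and dimensions are hypothetical; this is an illustration of the technique, not the code used in the study.

```python
# Minimal sketch of k-fold cross-validation for tuning a lasso penalty.
# Data are simulated; in the paper's setting the rows would be sites and the
# columns candidate site-level moderator variables (names here are hypothetical).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))                        # 40 "sites", 10 candidate moderators
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=40)   # site-level impact estimates

# Candidate values of the penalty; larger values drop more moderators.
param_grid = {"alpha": np.logspace(-3, 1, 30)}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(Lasso(max_iter=10_000), param_grid,
                      scoring="neg_mean_squared_error", cv=cv)
search.fit(X, y)

# The "best" model is the one with the lowest average out-of-fold MSE.
print("selected alpha:", search.best_params_["alpha"])
print("cross-validated MSE:", -search.best_score_)
```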

2 For some purposes, aggregating student-level data to the site level would limit the analysis (e.g., remove the ability to test for subgroup differences within site). However, for our purposes, only site-level variation is helpful in predicting site-level impacts—that is, the average impact in a site. Individual-level covariates cannot explain any additional variation across sites beyond what is explained by the site-level aggregates.
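As a minimal illustration of this aggregation (with hypothetical column names, not the study’s data files), student-level covariates can be collapsed to site-level means before modeling:

```python
# Sketch of collapsing student-level covariates to site-level means.
import pandas as pd

students = pd.DataFrame({
    "site_id": [1, 1, 2, 2, 2],
    "pretest": [0.2, -0.1, 0.5, 0.3, 0.0],
    "female":  [1, 0, 1, 1, 0],
})

# Site-level aggregates are the only student-derived quantities that can help
# predict site-level average impacts.
site_covariates = students.groupby("site_id").mean().add_prefix("mean_")
print(site_covariates)
```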

3 Ten of the grantees enrolled students into the study in two successive years. Unlike the original study authors, we treated both cohorts of students for a single grantee as belonging to the same site.

4 For state tests, the study authors standardized the scores relative to all test takers in the same state and grade level. For nationally normed tests, the study authors standardized the scores relative to all test takers in the national norming population.
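A minimal sketch of this kind of standardization, with hypothetical columns and with the analysis sample standing in for the full norming population of test takers:

```python
# Sketch of standardizing test scores within state-by-grade norming groups
# (z-scores). In the study, the reference means and standard deviations come
# from all test takers in the state or national norming population; here the
# sample itself is used as a stand-in.
import pandas as pd

scores = pd.DataFrame({
    "state": ["MA", "MA", "MA", "NY", "NY", "NY"],
    "grade": [6, 6, 6, 6, 6, 6],
    "score": [230.0, 245.0, 250.0, 310.0, 300.0, 320.0],
})

grp = scores.groupby(["state", "grade"])["score"]
scores["z_score"] = (scores["score"] - grp.transform("mean")) / grp.transform("std")
print(scores)
```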

5 These summary statistics include those students who contributed to this study. In constructing impact estimates for a particular outcome, we excluded sites with missing values for all students in that site.

6 Missing data in the pre-intervention measure of the outcome was addressed using the dummy variable method (e.g., Jones, Citation1996; Puma et al., Citation2009). Specifically, the model included an indicator variable that equaled 1 if the pre-intervention measure of the outcome was missing and equaled 0 otherwise. When the pre-intervention measure of the outcome was missing, it was set to 0 for the analysis.
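A minimal sketch of the dummy variable method on simulated data (the column names are illustrative, not the study’s):

```python
# Sketch of the dummy-variable adjustment for a missing pre-intervention
# measure: add a missingness indicator and zero-fill the missing values.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "y":       rng.normal(size=200),
    "treat":   rng.integers(0, 2, size=200),
    "pretest": rng.normal(size=200),
})
df.loc[rng.random(200) < 0.15, "pretest"] = np.nan     # inject some missingness

df["pretest_missing"] = df["pretest"].isna().astype(int)  # indicator = 1 if missing
df["pretest"] = df["pretest"].fillna(0.0)                  # set missing values to 0

model = smf.ols("y ~ treat + pretest + pretest_missing", data=df).fit()
print(model.params)
```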

7 If our goal were to obtain the most accurate estimate of the impact in each site using data from all sites, we would have used Empirical Bayes methods to estimate site-specific impacts. However, our goal was to use OLS to produce unbiased impact estimates for the meta-analysis described later in the paper.

8 The original analysis plan was posted on June 12, 2020 (see https://osf.io/vbs36). Minor revisions were made to the analysis plan during the conduct of the analysis; a revised analysis plan was posted on November 29, 2021 (see https://osf.io/yqfzt). Finally, the paper deviated from the analysis plan to address feedback during the peer review process.

9 At earlier stages in the analysis, using the approach to estimating the RMSPE from Orr et al. (Citation2019), we also tested precision-weighted averages of the site-specific impact estimates. In that analysis, we found little difference in RMSPE between the simple average of the site-specific impact estimates and the precision-weighted average of those estimates.
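The sketch below illustrates this comparison on simulated site-level estimates, using leave-one-site-out predictions; it omits the adjustment for estimation error in the benchmark impacts described by Orr et al. (2019) and is not the study’s code.

```python
# Sketch comparing a simple average with a precision-weighted average of
# site-specific impact estimates as leave-one-site-out predictors, scored by RMSPE.
import numpy as np

rng = np.random.default_rng(2)
impacts = rng.normal(loc=0.10, scale=0.15, size=25)   # estimated impact per site
ses = rng.uniform(0.05, 0.20, size=25)                # standard error per site

def loo_rmspe(estimates, weights):
    errors = []
    for j in range(len(estimates)):
        mask = np.arange(len(estimates)) != j
        pred = np.average(estimates[mask], weights=weights[mask])  # predict left-out site
        errors.append(estimates[j] - pred)
    return np.sqrt(np.mean(np.square(errors)))

simple = loo_rmspe(impacts, np.ones_like(impacts))
precision = loo_rmspe(impacts, 1.0 / ses**2)
print(f"RMSPE, simple average:             {simple:.3f}")
print(f"RMSPE, precision-weighted average: {precision:.3f}")
```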

10 This could be due either to the lack of any real signal in the data for predicting site-specific impacts or to lasso’s inability to distinguish signal from noise when confronting a large number of potential moderators, many of which may be weak moderators at best.

11 The meta-analysis can decompose the error variance into the cross-site variance and the within-site variance because evidence on the within-site variance was available based on the estimated standard errors of the unbiased impact estimates for each site. While the small sample sizes in each site could potentially lead to biased random effects meta-analysis estimates through correlations between the estimated effects and their standard errors, a simulation study on this issue found the bias to be small for difference-in-means and standardized mean difference estimators (Lin, Citation2018).
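As an illustration of this decomposition, the sketch below applies the standard DerSimonian-Laird method-of-moments estimator of the cross-site variance to simulated site-level estimates; the paper’s random effects estimator may differ in detail.

```python
# Sketch of decomposing the variance of site-specific impact estimates into
# within-site (sampling) variance and cross-site variance, using the
# DerSimonian-Laird method-of-moments estimator of tau^2.
import numpy as np

rng = np.random.default_rng(3)
est = rng.normal(0.10, 0.15, size=25)   # unbiased site impact estimates
se = rng.uniform(0.05, 0.20, size=25)   # their estimated standard errors

w = 1.0 / se**2
mu_fixed = np.sum(w * est) / np.sum(w)        # inverse-variance weighted mean
Q = np.sum(w * (est - mu_fixed)**2)           # Cochran's Q statistic
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (len(est) - 1)) / c)     # estimated cross-site variance

print(f"within-site variance (mean se^2): {np.mean(se**2):.4f}")
print(f"cross-site variance (tau^2):      {tau2:.4f}")
```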

12 Kraft (Citation2020) reported the unweighted mean, the weighted mean—weighted by the inverse of the variance of the impact estimates—and the median. We used the weighted mean for consistency with the other meta-analyses, which reported only weighted means.

13 While the meta-analysis included estimates from studies that the authors acknowledged were based on weak study designs, their analysis suggests that the inclusion of these estimates did not substantially bias their estimates of average effects (Wilson et al., Citation2001, pp. 262–263).

14 While this meta-analysis reported separate estimates by outcome domain—estimates we could have matched to the different outcome domains explored in our study—the meta-analysis found little variation in impacts by outcome domain (DuBois et al., Citation2011, pp. 67–68).

15 Ongoing research by the authors of this paper directly estimates the probability that site-level predictions from all six studies would lead local policymakers to the wrong conclusions about the effectiveness of the interventions.

16 The study of summer reading could only be included in analyses focused on moderators based on the characteristics of participating students: The other types of moderator variables were not collected by the original study.

17 These differences were typically statistically significant at the 90% level using two-tailed tests. For example, when lasso regression was applied to moderator variables on intervention features, lasso yielded a larger RMSPE than the simple average of the impacts in the other sites for 994 out of 1,000 bootstrap samples.
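The sketch below illustrates a paired bootstrap comparison of this kind, resampling sites with replacement and counting how often one method’s RMSPE exceeds the other’s; the per-site prediction errors are simulated, and the authors’ exact procedure may differ.

```python
# Sketch of a paired bootstrap comparison of two predictors' RMSPEs,
# resampling sites with replacement.
import numpy as np

rng = np.random.default_rng(4)
n_sites = 25
err_lasso = rng.normal(0.0, 0.20, size=n_sites)    # site-level prediction errors, lasso
err_simple = rng.normal(0.0, 0.15, size=n_sites)   # site-level errors, simple average

B = 1000
count = 0
for _ in range(B):
    idx = rng.integers(0, n_sites, size=n_sites)   # resample sites with replacement
    rmspe_lasso = np.sqrt(np.mean(err_lasso[idx]**2))
    rmspe_simple = np.sqrt(np.mean(err_simple[idx]**2))
    count += rmspe_lasso > rmspe_simple

# Under a two-tailed test at the 90% level, a difference is significant if one
# method wins in at least 950 of 1,000 bootstrap samples.
print(f"lasso RMSPE larger in {count} of {B} bootstrap samples")
```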

18 While the differences were typically not statistically significant, for reading achievement in year 1 of the charter middle school study, lasso with site characteristics, counterfactual characteristics, and all four types of moderators yielded larger RMSPEs than the simple average of the impacts in the other sites for at least 950 out of 1,000 bootstrap samples.

19 Some sources suggest refining the lambda sequence near the apparent minimum MSPE, to test whether it is possible to achieve an even lower minimum. Given the number of models we were running, any slight gains in prediction due to this refinement were far outweighed by the operational challenges of implementing this testing.

20 Some sources recommend selecting the simplest model whose MSPE is within one standard error of the minimum, on the grounds that it is not statistically distinguishable from the “best” model. Since our goal here was optimal prediction, not model simplicity or interpretability, we selected the model with the minimum average MSPE.
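The sketch below contrasts the two selection rules using scikit-learn’s LassoCV on simulated data; it is an illustration under assumed data, not the study’s implementation.

```python
# Sketch contrasting the minimum-MSE choice of the lasso penalty with the
# "one-standard-error" rule.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 10))
y = 0.5 * X[:, 0] + rng.normal(size=40)

fit = LassoCV(cv=5, alphas=np.logspace(-3, 1, 50), max_iter=10_000).fit(X, y)

mean_mse = fit.mse_path_.mean(axis=1)                         # average MSE per alpha
se_mse = fit.mse_path_.std(axis=1) / np.sqrt(fit.mse_path_.shape[1])

i_min = np.argmin(mean_mse)
alpha_min = fit.alphas_[i_min]                                # minimum-MSE choice (as in the paper)
threshold = mean_mse[i_min] + se_mse[i_min]
alpha_1se = fit.alphas_[mean_mse <= threshold].max()          # 1-SE rule: most penalized model within 1 SE

print(f"alpha at minimum MSE:      {alpha_min:.4f}")
print(f"alpha under the 1-SE rule: {alpha_1se:.4f}")
```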

Additional information

Funding

The study was supported by a grant from the William T. Grant Foundation and National Institute of Mental Health P50MH115842. The opinions expressed are those of the authors and do not represent views of the funders. The authors would like to acknowledge the excellent programming support provided by Vasiliy Sergueev and William Zhu.
