
Model Mixing Using Bayesian Additive Regression Trees

Pages 196-207 | Received 16 Dec 2022, Accepted 06 Sep 2023, Published online: 18 Oct 2023

Abstract

In modern computer experiment applications, one often encounters the situation where various models of a physical system are considered, each implemented as a simulator on a computer. An important question in such a setting is determining the best simulator, or the best combination of simulators, to use for prediction and inference. Bayesian model averaging (BMA) and stacking are two statistical approaches used to account for model uncertainty by aggregating a set of predictions through a simple linear combination or weighted average. Bayesian model mixing (BMM) extends these ideas to capture the localized behavior of each simulator by defining input-dependent weights. One possibility is to define the relationship between inputs and the weight functions using a flexible nonparametric model that learns the local strengths and weaknesses of each simulator. This article proposes a BMM model based on Bayesian Additive Regression Trees (BART). The proposed methodology is applied to combine predictions from Effective Field Theories (EFTs) associated with a motivating nuclear physics application. Supplementary materials for this article are available online. Source code is available at https://github.com/jcyannotty/OpenBT.

1 Introduction

In statistical learning problems, one often considers a set of plausible models, each designed to explain the system of interest. A common practice is to select a best-performing model based on some pre-specified criteria. The ensuing inference for quantities of interest is then carried out using the selected model as if it were the true data-generating mechanism. The resulting uncertainty quantification ignores any variability due to the underlying model structure (Draper 1995). The misrepresentation of uncertainties associated with such quantities can ultimately lead to misguided interpretation or inappropriate decisions. Another shortcoming of the typical approach to modeling is that the resulting inference may strongly depend on the selection criteria. In other words, different sets of criteria could lead to noticeably different final models and inferential results. To account for such uncertainties, one may elect to combine information across the set of models in some manner.

Any model set can be classified as M-closed, M-complete, or M-open (Bernardo and Smith 1994). These three categories differ in their underlying assumptions regarding a true model, M, and its relation to the model set. The M-closed setting assumes a mathematical representation of M can be formulated and is included in the model set. In this setting, model selection is appropriate because M can be recovered from the set of models under consideration. The M-complete setting also assumes it is possible to construct M; however, it is not included in the model set. For example, an expression for M may exist but be computationally intensive or intractable compared to the models under consideration. The M-open case assumes the true model may exist; however, a lack of knowledge or resources prevents one from constructing its mathematical representation. Consequently, M is excluded from the model set. This work is motivated by applications in nuclear physics, which tend to fall within the M-open class, as the underlying truth regarding the physical system may not yet be understood. In such cases, one may desire to leverage the known information about the physical system contained in the model set, along with experimental data, to further understand the nuclear phenomena.

Assume a set of K models are considered when studying a particular system of interest. One approach to account for model uncertainty is to combine the information across these K models. This may involve combining the individual point predictions or probability density functions from each model, usually in some additive manner. Traditional frequentist and Bayesian approaches use global weighting schemes, where each model is weighted by a value intended to reflect overall (global) model performance. For example, a classical global weighting scheme is Bayesian model averaging (BMA) (Raftery, Madigan, and Hoeting 1997), which combines the individual posterior densities from each model using a convex combination. The BMA weights are given by the individual posterior model probabilities, each of which can be interpreted as the probability that the corresponding model is the true data-generating model. Hence, BMA implicitly assumes the true model is contained within the model set, which renders this method inappropriate outside of the M-closed setting (Bernardo and Smith 1994). More recent Bayesian global weighting schemes adopt a model stacking approach, where model weights are assigned to minimize a specified posterior expected loss. This decision-theoretic viewpoint of global weighting can be used for combining point predictions (Le and Clarke 2017) or probability densities (Yao et al. 2018). Under some assumptions, stacking methods have been shown to be more appropriate for both the M-open and M-closed settings (Yao et al. 2018).

Though global weighting methods are effective, they still might lead to poor approximations of the true system when the individual model performance is localized. In such a case, one may wish to select a weighting scheme that reflects the localized characteristics of the models by constructing input-dependent weights. With input-dependent weights, one would expect an individual model to receive a higher weight in input regions where it exhibits strong predictive performance, while receiving a weight close to 0 in regions of poor performance. Localized weighting schemes are more appropriate for the M-open or M-complete settings where the true model may be better characterized as a localized mixture of the model set under consideration.

This work is motivated by problems in nuclear physics modeled using a technique known as Effective Field Theory (EFT) (Georgi 1993; Petrov and Blechman 2016; Burgess 2020). EFTs are designed to perform well in particular subregion(s) of the input domain, yet diverge in the rest of the input domain. Prototypes of such models are the weak and strong coupling finite-order expansions for the partition function of the zero-dimensional ϕ4 theory presented by Honda (2014). Examples of this problem are shown in Figure 1, where the various dashed and dotted lines represent the mean predictions from a finite-order expansion and the solid line denotes the true physical system. One can see that these models are highly accurate descriptions of the true system in some regions of the domain, yet they are unable to provide a globally accurate model. Most EFT problems fall within the M-open setting, as the true underlying description of the system across the entire domain is unknown and thus is not contained within the model set. Instead, multiple EFTs can be constructed based on the known physics to recover the true system across subsets of the domain. This raises the question of how to combine the predictions from multiple EFTs in order to obtain a globally accurate prediction. Various interpolation methods (Honda 2014) exist; however, no data-driven approaches are currently available for EFTs.

Fig. 1 Three different EFT experimental settings. Each panel displays the true physical system (solid) and the mean predictions from the EFTs under consideration (non-solid).

To demonstrate why problems falling in the M-open class may not be suited for model averaging schemes, consider applying BMA to the model set containing the two expansions shown in Figure 1(a). The posterior mean prediction from BMA results in a poor estimate of the true system, as shown in Figure 2. Essentially, BMA selects the dashed model rather than leveraging the localized strengths contained in the model set. Given the characteristics of EFTs and the M-open setting associated with these problems, a simple weighted average of the predictions from each model is insufficient for recovering the true physical system. A better approach is to use an input-dependent weighting scheme which leverages the localized behaviors of each model to obtain appropriate mean prediction and uncertainty quantification. Such an approach falls under the general class of problems known as Bayesian model mixing (BMM) (Yao et al. 2021).

Fig. 2 The posterior mean prediction of f†(x) when applying BMA to the 2nd order weak and 4th order strong coupling expansions.

A key challenge in BMM is to define the relationship between the inputs and the weight functions. This work proposes a Bayesian treed model which specifies the weight functions as a sum-of-trees. This representation relies on tree bases which are used to learn the localized model behavior. Additionally, this flexible and nonparametric approach allows the user to avoid specifying a more restrictive model for the weight functions, such as a generalized linear model. Maintaining the traditional conjugacy properties associated with Bayesian Additive Regression Tree (BART) models, the weight functions are regularized via a multivariate Gaussian prior. The prior is calibrated so that the weight functions prefer the interval [0,1] without imposing any further constraints. Additionally, this framework includes a simple strategy for incorporating prior information about localized model performance when available. Altogether, this approach highlights the localized behaviors of the candidate models and yields significant improvements in prediction, interpretation, and uncertainty quantification compared to traditional model averaging methods.

In addition to proposing a novel nonparametric BMM method, this work introduces a new data-driven approach for combining predictions from various EFTs. This is not only important for prediction of the system, but also for the resulting inference. In particular, practitioners can better understand the accuracy of each EFT while also advancing their knowledge about the underlying physical system across areas which are not well explained by the EFTs under consideration.

The remainder of this article is organized in the following manner. Section 2 highlights some relevant work related to model averaging, model mixing, and BART. Section 3 introduces the essential features of EFTs, while Section 4 outlines the specifics of the proposed BART-based framework. Three motivating EFT examples are presented in Section 5. Finally, Section 6 provides a detailed discussion of the results presented throughout this work. The online supplement includes full derivations of the methodology along with additional information regarding EFTs. The source code is available at https://github.com/jcyannotty/OpenBT.

2 Background

This section provides an overview of the primary statistical methods discussed throughout this work. Section 2.1 details popular model averaging and model mixing techniques. Section 2.2 summarizes the primary features of Bayesian tree models, which play an integral role in the proposed model mixing approach described in this work.

2.1 Model Averaging and Model Mixing

Methods to address model uncertainty have been widely studied throughout the past few decades. The majority of work in this area combines competing models through either mean or density estimation. In either case, the combined result is generally computed via a linear combination of the individual predictive means or densities from the models under consideration. The weights in this linear combination may or may not depend on the inputs for each model and are learned using the set of training data D = {(x_1, y_1),…,(x_n, y_n)}. Many frequentist and Bayesian methods exist for estimating the model weights. Popular frequentist approaches such as stacking (Breiman 1996) and model aggregation (Bunea, Tsybakov, and Wegkamp 2007) estimate the weights by minimizing a specified loss function. Additionally, one may elect to impose constraints such as a non-negativity or sum-to-one constraint on the weights or apply regularization techniques. Other frequentist approaches estimate the weights using evaluation metrics such as the Akaike information criterion (Burnham and Anderson 1998) or Mallows' Cp (Hansen 2007). These methods generally fall under the model averaging regime, as the weights are independent of the model inputs, with the exception of Sill et al. (2009). The remainder of this section reviews popular Bayesian methods in further detail.

Bayesian Model Averaging: A classical approach for combining models M_1,…,M_K is Bayesian model averaging (Raftery, Madigan, and Hoeting 1997). Suppose Q is a quantity of interest. The posterior density of Q is defined by

π(Q | D) = Σ_{l=1}^{K} w_l π(Q | D, M_l),

which is a weighted average of the posterior densities with respect to each model. Each weight is defined in terms of its corresponding posterior model probability, that is, w_l = π(M_l | D), where

π(M_l | D) = p(D | M_l) π(M_l) / Σ_{k=1}^{K} p(D | M_k) π(M_k),

and p(D | M_l) is the marginal likelihood of the data with respect to the lth model. Though BMA is useful, it has been criticized for emphasizing fit to the training data as opposed to out-of-sample prediction, asymptotically selecting a single model (inappropriate in the M-complete and M-open settings), and being sensitive to prior specification.
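As a toy illustration of these formulas (a minimal sketch, not tied to the method proposed in this article), the posterior model probabilities can be computed from the log marginal likelihoods and the prior model probabilities on the log scale; the numerical values below are hypothetical.

import numpy as np

def bma_weights(log_marginal_liks, prior_probs):
    # w_l = p(D|M_l) pi(M_l) / sum_k p(D|M_k) pi(M_k), computed on the log scale
    log_post = np.asarray(log_marginal_liks) + np.log(prior_probs)
    log_post -= log_post.max()          # guard against numerical underflow
    w = np.exp(log_post)
    return w / w.sum()

# hypothetical log marginal likelihoods for K = 3 models with equal priors
print(bma_weights([-102.3, -98.7, -120.5], [1/3, 1/3, 1/3]))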

Bayesian Mean Stacking: Recent work has extended stacking to the Bayesian paradigm as an approach for mean estimation (Clyde and Iversen 2013; Le and Clarke 2017). Given K competing models, the stacked mean for a future observation ỹ at input x̃ is constructed as a linear combination of the individual model predictors,

E[ỹ | x̃, D] = Σ_{l=1}^{K} w_l f_l(x̃),

where f_l(x̃) = E[ỹ | x̃, D, M_l]. When the individual models are unknown, stacking is conducted in a two-step procedure: (i) independently fitting the individual models M_l, l = 1,…,K, given the set of training data D, and (ii) estimating the weights w = (w_1,…,w_K) for the stacked predictor given the fitted models.

In the first step, each model is fit and the corresponding mean predictions, f̂_l(x_i), are obtained at each of the training points. In practice, cross-validation techniques are used to reduce the risk of overfitting the stacked predictor to the training data. In the second step, the coefficient vector w = (w_1,…,w_K) is defined as the minimizer of a specified posterior expected loss. Additionally, one may impose various constraints such as a simplex, nonnegativity, or sum-to-m constraint on the weights (Le and Clarke 2017). Other approaches include regularization via a penalty term or a prior (Breiman 1996; Yang and Dunson 2014).
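For concreteness, the sketch below illustrates the second step under squared-error loss with nonnegativity and sum-to-one constraints (one of the constraint choices mentioned above); it assumes the cross-validated predictions f̂_l(x_i) are already available and is only a simplified stand-in for the full procedure of Le and Clarke (2017).

import numpy as np
from scipy.optimize import minimize

def stacking_weights(F_cv, y):
    # F_cv: (n, K) cross-validated mean predictions; y: (n,) responses
    n, K = F_cv.shape
    loss = lambda w: np.sum((y - F_cv @ w) ** 2)          # squared-error loss
    cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)  # sum-to-one
    bounds = [(0.0, 1.0)] * K                              # nonnegativity
    res = minimize(loss, np.full(K, 1.0 / K), bounds=bounds, constraints=cons)
    return res.x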

Bayesian Complete Stacking: Complete stacking was motivated by the shortcomings of BMA (Yao et al. 2018). This Bayesian stacking model emphasizes prediction, as the weights are selected to minimize the Kullback-Leibler (KL) divergence between the true predictive density and the stacked predictive density

p(ỹ | x̃) = Σ_{l=1}^{K} w_l p(ỹ | x̃, D, M_l),

where ỹ is a future observation with input x̃. Similar to mean stacking, the leave-one-out (LOO) cross-validated predictive density can be used in place of p(ỹ | x̃, D, M_l) when the individual models are unknown. Given training data, the weights are constrained to the K-dimensional simplex S_K and estimated as

ŵ = argmax_{w ∈ S_K} Σ_{i=1}^{n} log Σ_{l=1}^{K} w_l p(y_i | x_i, D^(−i), M_l),

where D^(−i) denotes the training set excluding the pair (x_i, y_i).
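The optimization above can be illustrated with a short sketch; it assumes the LOO predictive density values have already been computed and handles the simplex constraint through a softmax reparameterization (an implementation convenience, not part of the original formulation).

import numpy as np
from scipy.optimize import minimize

def complete_stacking_weights(loo_dens):
    # loo_dens: (n, K) matrix with entries p(y_i | x_i, D^(-i), M_l)
    n, K = loo_dens.shape

    def neg_score(z):
        w = np.exp(z - z.max())
        w /= w.sum()                         # softmax keeps w on the simplex
        return -np.sum(np.log(loo_dens @ w))

    z_hat = minimize(neg_score, np.zeros(K)).x
    w = np.exp(z_hat - z_hat.max())
    return w / w.sum()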

Bayesian Hierarchical Stacking: Hierarchical stacking (Yao et al. 2021) is a model mixing approach which extends complete stacking by defining input-dependent weights that are estimated in a fully Bayesian manner. One way to define the weight functions is through a parametric model. First, K−1 unconstrained weight functions are defined as

w_l*(x_i) = μ_l + Σ_{j=1}^{J} α_{lj} g_j(x_i),

which depend on the sets of hyperparameters {α_lj} and {μ_l} along with user-specified basis functions g_j(x_i), where j = 1,…,J and l = 1,…,K−1. The Kth function w_K*(x_i) is set to 0 to serve as a baseline. Then, a softmax transformation is applied to the unconstrained weights in order to confine each model weight to the K-dimensional simplex, namely

w_l(x_i) = exp(w_l*(x_i)) / [exp(w_1*(x_i)) + … + exp(w_K*(x_i))],   l = 1,…,K.
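A minimal sketch of this construction is given below; the basis function and coefficient values are hypothetical, and in hierarchical stacking the coefficients are estimated in a fully Bayesian manner rather than fixed as here.

import numpy as np

def hs_weights(x, mu, alpha, basis):
    # mu: (K-1,) intercepts; alpha: (K-1, J) basis coefficients;
    # basis: callable returning the (n, J) matrix [g_j(x_i)]
    G = basis(x)                                          # (n, J)
    w_star = mu + G @ alpha.T                             # (n, K-1) unconstrained
    w_star = np.column_stack([w_star, np.zeros(len(x))])  # w*_K(x) = 0 baseline
    w_star -= w_star.max(axis=1, keepdims=True)           # stabilize the softmax
    W = np.exp(w_star)
    return W / W.sum(axis=1, keepdims=True)               # rows lie on the simplex

# toy example: K = 2 models, one linear basis function g_1(x) = x
x = np.linspace(0.03, 0.5, 5)
print(hs_weights(x, mu=np.array([0.0]), alpha=np.array([[-4.0]]), basis=lambda x: x[:, None]))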

The methods discussed above outline a number of strategies one can take to combine the information across multiple models. In the setting of EFT experiments, the localized nature of the predictions suggests an input-dependent weighting scheme like Bayesian hierarchical stacking is more suitable. However, specifying the required basis functions may not be trivial. Thus, the proposed method adopts the notion of mean stacking within an additive tree basis model to achieve localized weighting in a flexible and nonparametric manner.

2.2 Bayesian Tree Models

Bayesian additive regression trees (BART) have become increasingly popular for modeling complex and high-dimensional systems (Chipman, George, and McCulloch 2010). This additive approach involves summing together the predictions made from m trees and is facilitated through a Bayesian backfitting algorithm (Hastie and Tibshirani 2000). Each tree T_j is characterized by its structure, comprised of internal and terminal nodes, along with its associated set of terminal node parameters, M_j. The internal nodes define binary partitions of the input space according to a specified splitting rule. A given node η is defined to be an internal node with probability p(η is internal) = α(1 + d_η)^(−β), where d_η is the depth of η and α and β are tuning parameters. By construction, this prior penalizes tree complexity and thus ensures each tree maintains a shallow and simple structure. Given d different predictors, x_1,…,x_d, splitting rules are of the form x_v < c for v ∈ {1,…,d} and cutpoint c from a discretized subset of ℝ. In the simplest approach, the predictor and cutpoint associated with each splitting rule are randomly selected from discrete uniform distributions. The probabilities associated with the designation of each node along with the splitting rules for internal nodes are used to define the stochastic tree-generating prior for each tree.
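For concreteness, a small sketch of the depth-dependent prior and a uniform splitting-rule draw is shown below (toy code; the α and β values are common defaults from the BART literature, not values prescribed by this work).

import numpy as np

rng = np.random.default_rng(1)

def p_internal(depth, alpha=0.95, beta=2.0):
    # prior probability that a node at the given depth is internal: alpha * (1 + d)^(-beta)
    return alpha * (1.0 + depth) ** (-beta)

def draw_split_rule(cutpoint_grid):
    # pick a predictor and a cutpoint uniformly from the discretized grid;
    # cutpoint_grid is a list of 1-d arrays, one per predictor
    v = rng.integers(len(cutpoint_grid))   # predictor index
    c = rng.choice(cutpoint_grid[v])       # cutpoint for the rule x_v < c
    return v, c

print([round(p_internal(d), 3) for d in range(4)])  # shallow trees are strongly favored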

The m trees are learned through MCMC, where a slight modification to each structure is proposed at every iteration of the simulation. Generally, such modifications to the tree include birth, death, perturb, or rotate as described by Chipman, George, and McCulloch (Citation1998) and Pratola (Citation2016). Proposals are then accepted or rejected using a Metropolis-Hastings step. To avoid a complex reversible jump MCMC, the algorithm depends on the integrated likelihood, which is obtained by integrating over the terminal node parameters associated with the given tree. A closed form expression for this density can be obtained with conditional conjugate priors for the terminal node parameters.

Given the tree structure, prior distributions can be assigned to each terminal node parameter. In the BART model, the priors ensure each tree explains a small yet different source of variation in the data. For continuous data, BART assigns Gaussian priors to the terminal node parameters. Assuming the data is mean centered, the prior assigned to the terminal node parameter μ_pj in node η_pj is given by μ_pj | T_j ~ N(0, τ²), where τ = (y_max − y_min)/(2k√m) with tuning parameter k. Additionally, a scaled inverse chi-squared prior is assigned to the variance, that is, σ² ~ νλ/χ²_ν.

The traditional Bayesian regression tree model can be extended to allow for a more complex structure in the terminal nodes. Existing extensions include linear regression (Chipman, George, and McCulloch 2002; Prado, Moral, and Parnell 2021) and Gaussian processes (Gramacy and Lee 2008). For the setting of model mixing, this work uses a multivariate Gaussian terminal node model.

3 Toward Model Mixing with EFTs

An EFT forms an expansion (or multiple expansions) in a ratio of an input parameter to a physically relevant scale. Computer models implement EFTs as simulators. The theoretical predictions of the physical system are approximations from each simulator plus a discrepancy term, which is designed to account for the remaining unexplained portion of the system. These two components may have specific properties which can be leveraged when working with observational data. This section summarizes these details in the context of EFTs (see also Supplement C).

3.1 Motivating EFT Example

Consider the EFT example where the true physical system is the partition function of the zero-dimensional ϕ4 theory defined by

(1) f†(x) = ∫_{−∞}^{∞} exp(−u²/2 − x²u⁴) du,

where x denotes the coupling constant (Honda 2014). Two types of finite-order expansions exist for this partition function and are given by (2) and (3) for n_s or n_l ≥ 1, namely

(2) h_s^(n_s)(x) = Σ_{t=0}^{n_s} s_t x^t,   where s_t = √2 Γ(t + 1/2) (−4)^{t/2} / (t/2)! if t is even and s_t = 0 if t is odd,

(3) h_l^(n_l)(x) = Σ_{t=0}^{n_l} l_t x^{−(2t+1)/2},   where l_t = Γ(t/2 + 1/4) (−1/2)^t / (2 t!),   t = 0,…,n_l.

The weak coupling expansion in (2) is an asymptotic Taylor-like series of order n_s centered about zero. Thus, h_s^(n_s)(x) will yield high-fidelity predictions for smaller coupling constants and diverge as the value increases. The reverse behavior is observed for the strong coupling expansion in (3), h_l^(n_l)(x), which is convergent. Example predictions of the physical system using these finite-order expansions can be seen in Figure 1 and are discussed in detail in Section 3.2.
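To make (1)-(3) concrete, the following sketch (toy code, based on the expressions reconstructed above, not the authors' implementation) evaluates the exact partition function by quadrature together with the weak and strong coupling expansions.

import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, factorial

def f_true(x):
    # exact partition function f(x) = int exp(-u^2/2 - x^2 u^4) du
    return quad(lambda u: np.exp(-0.5 * u**2 - x**2 * u**4), -np.inf, np.inf)[0]

def h_weak(x, ns):
    # weak-coupling (small-x) expansion of order ns, eq. (2); odd coefficients vanish
    total = 0.0
    for t in range(0, ns + 1, 2):
        k = t // 2
        s_t = np.sqrt(2.0) * gamma(t + 0.5) * (-4.0)**k / factorial(k)
        total += s_t * x**t
    return total

def h_strong(x, nl):
    # strong-coupling (large-x) expansion of order nl, eq. (3)
    total = 0.0
    for t in range(nl + 1):
        l_t = gamma(0.5 * t + 0.25) / (2.0 * factorial(t)) * (-0.5)**t
        total += l_t * x**(-(2.0 * t + 1.0) / 2.0)
    return total

for x in [0.05, 0.2, 0.5]:
    print(x, f_true(x), h_weak(x, 2), h_strong(x, 4))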

The theoretical predictions of the physical system using the weak and strong coupling expansions are expressed using (4) and (5), respectively:

(4) f_s^(n_s)(x) = h_s^(n_s)(x) + δ_s^(n_s)(x),

(5) f_l^(n_l)(x) = h_l^(n_l)(x) + δ_l^(n_l)(x),

where the truncation errors δ_s^(n_s)(x) and δ_l^(n_l)(x) are modeled with Gaussian processes (GPs) (Santner, Williams, and Notz 2018; Gramacy 2020). As described by Melendez et al. (2019), the parameters in both truncation error models depend on evaluations of the corresponding finite-order expansions (described in (2) and (3), respectively) over a sparse grid of points. The discrepancy model also depends on physical quantities, Q(x) and y_ref(x), which are chosen based on domain expertise. The relationship between these quantities and the discrepancy is summarized in the supplementary material (Supplement C). When Q(x) and y_ref(x) are unknown, one can alternatively use the error approximation described by Semposki, Furnstahl, and Phillips (2022).

The features present in this example from Honda (2014) are commonly found across the landscape of EFT problems. For instance, the physical system can be expressed as an additive model involving a finite-order expansion and the induced truncation error. The finite-order expansions are designed to provide high-fidelity predictions in specific subregions of the domain. There exists a subregion of the domain where none of the finite-order expansions yield accurate theoretical predictions. Altogether, this motivating example serves as a prototype for the EFTs that may be encountered in a general experimental setting.

3.2 The Model Set for EFT Experiments

One may encounter various experimental settings when working with EFTs. Such scenarios are introduced in the context of the motivating example presented in Section 3.1. First, consider the most basic case where the model set contains a single EFT. With one EFT, the overall predictive accuracy for the true system is poor, despite the good performance in a localized region. For example, suppose the model set M contains the 2nd order weak coupling expansion f_s^(2)(x). Mean predictions constructed from (2) and (4) are shown by the dashed line in Figure 1(a). Clearly, this model is limited to strong predictive accuracy in only the left subregion of the domain.

When available, one can consider different finite-order approximations of the same EFT. For example, consider the 2nd, 4th, and 6th order weak coupling expansions shown in Figure 1(c). The three models are very similar for lower coupling constants yet drastically differ in the remainder of the domain. Despite each expansion's poor theoretical predictions, one can still leverage the available information to improve the overall prediction of the physical system. For instance, the 2nd and 6th order expansions (dashed and dashed-dotted) are concave functions while the 4th order expansion (dotted) is convex. This suggests the true physical system lies between the expansions under consideration and can be recovered by re-weighting the corresponding predictions.

A third situation is to consider EFTs centered about different areas of the domain. For example, a model set can contain a finite-order weak coupling expansion (dashed) and the 4th order strong coupling expansion (dotted), as shown in panels (a) and (b) of Figure 1. The addition of the strong coupling expansion allows a high-fidelity approximation of the physical system to be considered in the rightmost subregion of the domain. The model set in panel (a) implies the true system lies between the two expansions. This is particularly useful in the intermediate range where neither of the EFTs is accurate. Meanwhile, the set in panel (b) presents an interesting case where the physical system lies below both EFTs in the intermediate range. In this case, the information in the observational data can be leveraged to help recover the true system.

In this example, the predictions from the weak coupling expansion degrade slowly compared to those from the strong coupling expansions. Consequently, the weak coupling expansions generally appear to have better overall predictive performance across the entirety of the domain. When combining these two types of EFTs using global weighting schemes such as BMA, the resulting prediction will favor the weak coupling expansion due to its drastic advantage in overall model performance. The undesirability of the BMA solution is evident in Figure 2, which demonstrates that BMA effectively matches the 2nd order weak coupling expansion. Hence, a weighting scheme which captures the localized behaviors of each model is preferred in the EFT setting.

The following sections consider a general set of K different EFTs, which are denoted by f_1(x),…,f_K(x). In this motivating example, f_l(x) = h_l(x) + δ_l(x), where h_l(x) can denote either a weak or strong coupling expansion of order N_l, with l = 1,…,K. Meanwhile, δ_l(x) is the associated truncation error and is modeled by a GP (Supplement C).

3.3 Predictions from EFTs

Prior to model mixing, each of the K EFTs is independently emulated. Without loss of generality, consider the lth EFT, denoted by f_l(x). It is assumed this EFT is accompanied by a set of simulator runs across a fixed set of inputs x_{l1}^c,…,x_{l n_l}^c. Information regarding the design of the computer experiment for each EFT can be found in Melendez et al. (2021). The simulator runs are evaluations of the finite-order expansion, h_l(·), at the specified inputs. Using these runs, one can extract the set of N_l + 1 coefficients c_0(·),…,c_{N_l}(·) at each of the fixed inputs. The training set for the lth EFT is then defined by D_l = {(x_{l1}^c, C(x_{l1}^c)),…,(x_{l n_l}^c, C(x_{l n_l}^c))}, where C(·) denotes the vector of known finite-order coefficients at the specified model input. The resulting coefficients and the set of inputs can differ across the K models; thus, the sets D_1,…,D_K will contain different information.

As described in Supplement C, an EFT is fit using the finite-order coefficients to learn the unknown parameters which characterize the GP assigned to the truncation error. This information can be extracted from D_l, which implies the set of field observations is not required to fit each EFT. Consequently, the desired theoretical predictions across the input domain can be obtained without using any of the observational data. The resulting posterior predictive distribution is a Gaussian process, which can be characterized by the corresponding mean and covariance functions as described in Melendez et al. (2019). The predictions for an EFT are then computed through the posterior mean.

4 Bayesian Additive Model Mixing Trees

4.1 Defining a Mixed Model

The proposed BMM model is trained using a set of observational data, Y_1,…,Y_n, which are assumed to be independently generated at fixed inputs x_1,…,x_n according to

(6) Y_i = f†(x_i) + ε_i,   ε_i ~iid N(0, σ²),

where f†(x_i) represents the true and unknown physical system and σ² denotes the constant variance associated with the independent observational errors. Conditional on the theoretical predictions at a given point, f_1(x_i),…,f_K(x_i), the data can be modeled as

(7) Y_i | f(x_i), w(x_i), σ² ~ind N(f(x_i)ᵀ w(x_i), σ²),

where f(x_i) = (f_1(x_i),…,f_K(x_i))ᵀ and w(x_i) = (w_1(x_i),…,w_K(x_i))ᵀ. This formulation is an example of Bayesian mean stacking with an input-dependent weighting scheme. In practice, the predictions from each model are unknown and must be estimated.

The proposed BMM model relies on a two-step approach for combining the predictions across K EFTs. This implies each EFT is first fit independently and the estimated predictions f̂_l(x_i) are obtained for l = 1,…,K and i = 1,…,n prior to learning the weight functions w_1(x_i),…,w_K(x_i). The proposed two-step approach is tailored to EFTs by taking advantage of the sources of data described in Section 3.3 as well as the properties described in Supplement C. Conditional on the estimated predictions, the model for the observational data becomes

Y_i | f̂(x_i), w(x_i), σ² ~ind N(f̂(x_i)ᵀ w(x_i), σ²),

where f̂(x_i) = (f̂_1(x_i),…,f̂_K(x_i))ᵀ. The weight functions are then learned using the set of field data. The next section outlines the proposed model mixing scheme, which defines the weight functions using Bayesian Additive Regression Trees (BART).

4.2 Model Mixing Using BART

The weight functions w(x) = (w_1(x),…,w_K(x))ᵀ are modeled using a sum-of-trees,

(8) w(x_i) = Σ_{j=1}^{m} g(x_i, T_j, M_j),

where g(x_i, T_j, M_j) is the K-dimensional output of the jth tree using the set of terminal node parameters, M_j, at the input x_i. This approach defines the weight functions using tree bases which are learned from the data. The amount of flexibility in the weight functions can be controlled by changing the number of trees or tuning the hyperparameters in the prior distributions.

In this application of BART, each terminal node parameter is a K-dimensional vector which is assigned a multivariate Gaussian prior. The parameter is regularized so that each tree accounts for a small amount of variation in the weight functions. In what follows, let η_pj represent the pth terminal node of the jth tree and define its corresponding parameter by μ_pj = (μ_pj1,…,μ_pjK)ᵀ. Now assume the observations (x_1, y_1),…,(x_{n_p}, y_{n_p}) lie in the hyper-rectangle defined by η_pj, where n_p is the number of observations assigned to this subregion. The model at each terminal node amounts to fitting a localized Bayesian linear regression with parameter vector μ_pj. Due to conditional independence, the likelihood in this node is defined by

L(r_1,…,r_{n_p} | T_j, μ_pj, σ²) = (2πσ²)^(−n_p/2) exp( −(1/(2σ²)) Σ_{i=1}^{n_p} (r_i − f̂(x_i)ᵀ μ_pj)² ),

where f̂(x_i) = (f̂_1(x_i),…,f̂_K(x_i))ᵀ is the vector of mean predictions from each EFT and r_i is the ith residual, given by r_i = y_i − Σ_{q≠j} f̂(x_i)ᵀ g(x_i, T_q, M_q).

Conditional on the tree structure, T_j, the terminal node parameter μ_pj is assigned a conjugate multivariate Gaussian prior, namely

(9) μ_pj | T_j ~ind N_K(β, τ² I_K),

where β = (β_1,…,β_K)ᵀ is a K-dimensional mean vector and I_K is the identity matrix. This prior is noninformative in the sense that the mean is fixed regardless of how the input space is partitioned.

In model mixing, each simulator may perform strongly in one subregion of the input space but weakly in another. This belief can be reflected in the prior distribution of μ_pj by allowing the hyperparameters to depend on the partition of the input space assigned to the given terminal node. Thus, an informative prior for μ_pj can be constructed as

μ_pj | T_j ~ind N_K(β_pj, τ² I_K),

where β_pj = (β_pj1,…,β_pjK)ᵀ. This allows the prior mean to vary depending on the tree partitions and thus reflect some sense of localized model performance. Meanwhile, the assumed covariance structure implies the K vector components μ_pj1,…,μ_pjK are independent a priori.

Both of the proposed priors are conjugate, which is an important choice in BART, as it allows for a closed form expression for the marginal likelihood of the vector of residuals R_pj = (r_1,…,r_{n_p})ᵀ. Additionally, the conjugate priors result in closed form expressions for the full conditional distributions of the terminal node parameters and the error variance. The derivations of these distributions are found in the online supplement (Supplement B). In particular, the full conditional distribution for the pth terminal node parameter in T_j is given by

μ_pj | R_pj, T_j, σ² ~ind N_K( (F̂_pjᵀ F̂_pj / σ² + I_K / τ²)^(−1) (β_pj / τ² + F̂_pjᵀ R_pj / σ²),  (F̂_pjᵀ F̂_pj / σ² + I_K / τ²)^(−1) ),

where F̂_pj is the n_p × K design matrix whose ith row is the vector f̂(x_i)ᵀ. The full conditional distribution for σ² is a scaled inverse chi-squared, that is, σ² | · ~ ν′λ′/χ²_{ν′}, where

ν′ = n + ν   and   λ′ = ( Σ_{i=1}^{n} (y_i − f̂(x_i)ᵀ w(x_i))² + νλ ) / (n + ν),

with ν and λ denoting the prior shape and scale parameters, respectively.
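The two full conditionals above translate directly into Gibbs updates; a minimal sketch is given below (illustrative code under the stated priors, not the OpenBT implementation).

import numpy as np

rng = np.random.default_rng(0)

def draw_terminal_node_param(F_hat, r, beta_pj, tau2, sigma2):
    # F_hat: (n_p, K) EFT mean predictions in this node; r: (n_p,) partial residuals
    K = F_hat.shape[1]
    prec = F_hat.T @ F_hat / sigma2 + np.eye(K) / tau2     # posterior precision
    cov = np.linalg.inv(prec)
    mean = cov @ (beta_pj / tau2 + F_hat.T @ r / sigma2)
    return rng.multivariate_normal(mean, cov)

def draw_sigma2(y, mixed_mean, nu, lam):
    # scaled inverse chi-squared full conditional for the error variance
    n = len(y)
    nu_post = n + nu
    lam_post = (np.sum((y - mixed_mean) ** 2) + nu * lam) / nu_post
    return nu_post * lam_post / rng.chisquare(nu_post)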

4.3 Calibrating Priors

First consider the prior for the terminal node parameters. The calibration of the hyperparameters differs for the noninformative and informative priors; however, both approaches are designed to ensure that each model weight w_l(x) prefers the interval [0,1] and is centered at a value within this region. Moreover, the functions w_1(x),…,w_K(x) are assumed to be independent a priori at a fixed input. This enables the prior for each weight to be calibrated marginally.

4.3.1 Noninformative Prior

Consider a noninformative prior for the terminal node parameters. In this setting, μ_pj | T_j ~iid N_K(β, τ² I_K) for the pth terminal node parameter in the jth tree. First, fix l ∈ {1,…,K} and i ∈ {1,…,n} to calibrate the prior for w_l(x_i). Since the terminal node parameters are independent and identically distributed with a diagonal covariance structure, the prior induced on w_l(x_i) is the same for the remaining weight and input combinations. From (8) and (9), the induced prior on the lth model weight is w_l(x_i) ~ N(mβ_l, mτ²). Since it is believed w_l(x_i) ∈ [0,1] with high probability, it is plausible to set mβ_l = 0.5, and consequently β_l = 0.5/m. Thus, each weight has an equal chance to reach the "extreme" values of 0 or 1 regardless of the input location. The prior standard deviation, τ, can be selected so that w_l(x_i) ∈ [0,1] with high probability. To do this, a confidence interval for w_l(x_i) is constructed such that 0 = 0.5 − kτ√m and 1 = 0.5 + kτ√m. Subtracting the first equation from the second and solving for τ yields τ = 1/(2k√m). This calibration approach is very similar to the one proposed by Chipman, George, and McCulloch (2010). The main difference is due to the context of the problem, as it is believed the weights are predominately contained in the interval [0,1] rather than the observed range of the data, [y_min, y_max].
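A quick numeric check of this calibration (example values of m and k, not those used later in Section 5): the per-tree settings induce w_l(x) ~ N(0.5, (1/(2k))²), so [0,1] is approximately a ±k standard deviation interval.

import numpy as np

m, k = 10, 2.0                       # example values
beta_l = 0.5 / m                     # prior mean per tree
tau = 1.0 / (2.0 * k * np.sqrt(m))   # prior sd per tree
print(m * beta_l)                    # induced prior mean of w_l(x): 0.5
print(np.sqrt(m) * tau)              # induced prior sd of w_l(x): 1/(2k) = 0.25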

4.3.2 Informative Prior

In the informative setting, the prior mean directly depends on the partitions of the input space induced by the given tree, that is, μ_pj | T_j ~ N_K(β_pj, τ² I_K). This prior is tailored toward EFTs, where the functional variance, v_l(x_i), indicates the severity of the truncation error. A larger variance within a particular subregion of the domain indicates the presence of larger truncation error, meaning the EFT provides a poor approximation of the true system there.

Given this interpretation of the truncation error variances, one strategy for combining EFTs is precision weighting (Phillips et al. 2021). For example, the precision weight for the lth EFT at x_i is given by

β_l(x_i) = (1/v_l(x_i)) / ( 1/v_1(x_i) + … + 1/v_K(x_i) ).

The precision weight β_l(x_i) can be interpreted as an initial guess for the weight function w_l(x_i) for l = 1,…,K and i = 1,…,n.

Since the prior of the terminal node parameter changes conditional on the tree structure, each β_pj is chosen separately from the other terminal node parameters. Given the precision weights for the EFTs and the n training points, each component of the prior mean vector, β_pjl, is chosen by

β_pjl = (1/m) × [ Σ_{i=1}^{n} β_l(x_i) 1(x_i ∈ η_pj) ] / [ Σ_{i=1}^{n} 1(x_i ∈ η_pj) ],

where 1(x_i ∈ η_pj) is the indicator that x_i is assigned to the terminal node η_pj. A confidence interval for each terminal node parameter can be set to have a length of 1/m in order to ensure each tree is a weak learner. This is done by setting τ = 1/(2km).
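A small sketch of this construction (assumed array layout, not OpenBT code): the precision weights are computed from the truncation-error variances and then averaged over the training points assigned to a terminal node.

import numpy as np

def precision_weights(v):
    # v: (n, K) truncation-error variances; returns (n, K) precision weights
    prec = 1.0 / v
    return prec / prec.sum(axis=1, keepdims=True)

def node_prior_mean(beta, in_node, m):
    # beta: (n, K) precision weights; in_node: boolean mask of points in eta_pj
    return beta[in_node].mean(axis=0) / m     # beta_pj, shape (K,)

# toy example with K = 2 EFTs and m = 10 trees
v = np.array([[0.01, 4.0], [0.02, 1.0], [3.0, 0.05]])
beta = precision_weights(v)
print(node_prior_mean(beta, np.array([True, True, False]), m=10))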

4.3.3 Variance Prior

A conjugate scaled inverse chi-squared distribution with hyperparameters ν and λ is assigned to the error variance σ². To calibrate the prior, first select a value of ν to reflect the desired shape of the distribution. Common values of ν range from 3 to 10. Before selecting a value for λ, one needs an initial estimate of the error variance to help center the prior around a range of plausible values of σ². Given the model set and the corresponding point predictions at each of the training points, f̂_l(x_i), one can use a lightly data-informed prior by setting

σ̂² = max_{l=1,…,K} { min_{i=1,…,n} (y_i − f̂_l(x_i))² }.

Since a common belief is that each model yields accurate approximations of the true system over some subregion of the domain, one should expect the set of minimum squared differences across the K models to reveal reliable information about the true error variance. Given this information, one strategy is to set σ̂² to be the mean or mode of a νλ/χ²_ν distribution. The value of λ is then found by solving the resulting equation.
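As a short sketch of this calibration (here the prior mean, νλ/(ν − 2), is matched to σ̂²; matching the mode would use νλ/(ν + 2) instead):

import numpy as np

def calibrate_lambda(y, f_hat, nu=6.0):
    # y: (n,) observations; f_hat: (n, K) EFT mean predictions at the training inputs
    sq_err = (y[:, None] - f_hat) ** 2        # (n, K) squared residuals
    sigma2_hat = sq_err.min(axis=0).max()     # max over models of min over points
    lam = sigma2_hat * (nu - 2.0) / nu        # match the prior mean to sigma2_hat
    return sigma2_hat, lam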

5 EFT Examples

This section applies the proposed model mixing methodology to three different examples. Section 5.1 demonstrates the success of the BART-based mixing approach on two univariate EFT examples, which are introduced in Section 3. A multi-dimensional example is highlighted in Section 5.2 using simulators based on Taylor series expansions of a trigonometric function. Though this last example does not involve a true underlying physical system, the model set considers simulators with qualities similar to those of EFTs with double expansions (see Burgess 2020). Each example highlights specific features of the proposed BART-based mixing model, such as flexible basis functions for the weights and the associated prior regularization.

5.1 Example 1: Mixing Univariate EFTs

This section applies the BART model mixing (BART-BMM) method to various EFTs over a one-dimensional domain. For comparison, Hierarchical Stacking (HS) is also applied to the same set of EFTs. In both EFT examples, 20 observations are independently generated according to

Y_i = f†(x_i) + ε_i,   ε_i ~ N(0, σ²),

where i = 1,…,20, σ = 0.005, and f†(x) is defined in (1). The 20 training points are located at inputs which are evenly spaced over the interval [0.03, 0.50]. The error standard deviation of 0.005 was selected to mimic a controlled experiment setting. Each EFT model is fit using nc = 4 evaluations of the corresponding finite-order expansion.

5.1.1 Example 1a: Mixing Two EFTs

First consider mixing the EFTs based on the 2nd order weak coupling expansion, fs(2)(x), and the 4th order strong coupling expansion, fl(4)(x), as shown in Figure 1(a). The true system f†(x) lies between both EFTs across the entire domain; hence, a convex combination of the predictions from both EFTs is appropriate for recovering the true system. The BART-BMM model is fit using 10 trees and k = 5.0. Meanwhile, the HS unconstrained weight function is defined by w_1*(x) = μ_1 + α_1 x. The results of the BART-BMM method and HS are shown in Figure 3.

Fig. 3 The predicted mean (dark gray) and 95% credible intervals (shaded) when mixing fs(2)(x) (dashed) and fl(4)(x) (dotted). Results are obtained from a BART-BMM model with 10 trees (top) and a Hierarchical Stacking model with a linear unconstrained weight function (bottom).

In terms of the root mean squared error (RMSE) between the predicted system and the true f†(x), the BART-BMM model results in more accurate mean predictions than HS, with RMSE values of 0.0053 and 1.9460, respectively. The RMSE for the HS result is inflated by the diverging mixed prediction in the left portion of the domain. For example, the RMSE for the HS model over the interval [0.1, 0.5] drops to 0.0717. Additionally, from Figure 3 it is evident that BART-BMM results in predictions of f†(x) which have lower uncertainty than those from HS.

The weight functions in Figure 3 also take similar sigmoid-like shapes; however, the HS solution displays a high degree of uncertainty. The most noticeable difference between the two methods can be seen in the weight function of fl(4)(x) (dotted). In particular, the curve in the BART-BMM result increases at a quicker rate in the sub-region [0.3, 0.4] compared to the HS result. This slower rate of increase contributes to the poor prediction from HS in this sub-region. Another difference is observed in the region [0.03, 0.15], as the weight of fl(4)(x) under the HS approach is near 0; however, it is not small enough to negate the effect of the drastically diverging mean prediction from fl(4)(x). Meanwhile, the BART-BMM weight is shrunk close to 0 with minimal uncertainty due to the mean estimation objective, which directly re-weights the mean prediction from an individual model, and the lack of a simplex constraint.

Another advantage of BART-BMM is that the weight functions are learned throughout the MCMC via the tree models. This differs from HS, which requires specification of a basis for the unconstrained weights a priori. In this example, one may consider a different basis function, as the specified linear basis appears to be inadequate for ascertaining high-fidelity mean predictions across the entire domain.

5.1.2 Example 1b: Mixing Two Convex EFTs

Now, consider a second model set, shown in Figure 1(b), which replaces fs(2)(x) with fs(4)(x). Both EFTs overestimate f†(x) in the intermediate range; hence, weights which are confined to a simplex are unable to recover the true system. In this case, a piecewise basis function is assigned to the unconstrained HS weight as shown below:

w_1*(x) = μ_1 + α_1 1(x < 0.15) + α_2 1(0.15 ≤ x < 0.25) + α_3 1(0.25 ≤ x < 0.35).

This basis was chosen to roughly reflect the areas where the mean predictions begin to change at differing rates. Other selections of the partitions for a piecewise basis are equally valid.

The BART-BMM and HS results are shown in Figure 4. Once more, the BART-BMM approach outperforms HS in terms of mean prediction, with RMSE values of 0.0057 and 0.1141, respectively. Most notably, the HS solution is unable to accurately predict the true system in the intermediate range of the domain due to the simplex constraint on the model weights. Meanwhile, the BART-BMM approach is able to recover the system across the entirety of the domain due to the prior regularization approach taken with the weights, which does not impose such strict constraints.

Fig. 4 The predicted mean (dark gray) and 95% credible intervals (shaded) when mixing fs(4)(x) (dashed) and fl(4)(x) (dotted). Results are obtained from a BART-BMM model with 10 trees (top) and a Hierarchical Stacking model with a piecewise unconstrained weight function (bottom).

In this HS result, it appears the piecewise basis was more effective than the linear basis in terms of predicting the true system in the left and right portion of the domain. This further poses the question of how to select the partitions induced by the piecewise basis, as different choices may lead to drastically different results. This question served as the motivation for defining a BART-based model, which adaptively learns these partitions based on the observational data and the model set.

5.2 Example 2: Multi-Dimensional Mixing

The proposed model mixing approach is also applicable to computer experiments which depend on multi-dimensional inputs. To demonstrate this, consider a two-dimensional problem where the true underlying system is defined by

f†(x) = sin(x_1) + cos(x_2),   where x = (x_1, x_2) ∈ [−π, π] × [−π, π].

A set of 80 training points is generated from this true system with an observational error standard deviation of 0.1. Additionally, two candidate models are considered, each with simulators defined in terms of Taylor series expansions of s(x_1) := sin(x_1) and c(x_2) := cos(x_2). For this example, the simulators are defined by

h_1(x) = Σ_{j=0}^{7} [s^(j)(π)/j!] (x_1 − π)^j + Σ_{k=0}^{10} [c^(k)(π)/k!] (x_2 − π)^k,

h_2(x) = Σ_{j=0}^{13} [s^(j)(−π)/j!] (x_1 + π)^j + Σ_{k=0}^{6} [c^(k)(−π)/k!] (x_2 + π)^k,

where s^(j)(·) and c^(k)(·) denote the jth and kth derivatives of s and c, respectively. Note, the first simulator h_1(x) centers both Taylor series expansions about π; hence, it produces relatively accurate predictions of the system in the upper right corner of the domain and diverges when moving toward the negative portion of the domain. Meanwhile, h_2(x) is composed of Taylor series expansions centered about −π, which produce accurate predictions in the negative portion of the domain. One key difference between the simulators is that h_2(x) contains a highly accurate approximation of sin(x_1) across the entire interval [−π, π] because its corresponding Taylor series expansion is composed of seven nonzero terms. Thus, even though the expansions in h_2(x) are centered about −π, one would expect h_2(x) to result in accurate predictions of f†(x) across the rectangle [−π, π] × [−π, 0].
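A small sketch of the two simulators (toy code, based on the expressions reconstructed above, not the authors' implementation); the Taylor coefficients use the fact that the derivatives of sin and cos cycle with period four.

import math
import numpy as np

def taylor(fun_derivs, a, x, order):
    # Taylor expansion about a: sum_j f^(j)(a) (x - a)^j / j!
    return sum(fun_derivs(j, a) * (x - a) ** j / math.factorial(j) for j in range(order + 1))

def d_sin(j, a):
    return [np.sin(a), np.cos(a), -np.sin(a), -np.cos(a)][j % 4]

def d_cos(j, a):
    return [np.cos(a), -np.sin(a), -np.cos(a), np.sin(a)][j % 4]

def h1(x1, x2):
    # expansions centered about pi: accurate in the upper right corner
    return taylor(d_sin, np.pi, x1, 7) + taylor(d_cos, np.pi, x2, 10)

def h2(x1, x2):
    # expansions centered about -pi: accurate in the lower half of the domain
    return taylor(d_sin, -np.pi, x1, 13) + taylor(d_cos, -np.pi, x2, 6)

# h2 tracks f(x) = sin(x1) + cos(x2) well in the bottom half of the domain
print(h2(1.0, -2.0), np.sin(1.0) + np.cos(-2.0))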

The theoretical predictions from each model, f_1(x) and f_2(x), can be defined using the additive form f_l(x) = h_l(x) + δ_l(x), where δ_l(x) represents the unknown higher-order corrections and l = 1, 2. Due to the nature of this example, no model is postulated for δ_l(x). Consequently, the estimated theoretical predictions at each training point x_i are obtained by f̂_l(x_i) = h_l(x_i). Note, in a multi-dimensional EFT setting, the strategy discussed in Section 3 remains applicable.

The results from a 30-tree BART-BMM model are shown in Figure 5. The leftmost plot displays the absolute value of the mean residuals, |f̂†(x) − f†(x)|, where f̂†(x) denotes the mean prediction from the BMM model. Based on the residual plot, it appears f†(x) is adequately recovered across the majority of the domain, with an RMSE of 0.2575. As expected, the error in the mean prediction noticeably increases in the upper left corner of the domain, where only two training points are included and both simulators are inaccurate.

Fig. 5 (Left) The mean difference between the predicted system f̂†(x) and the true system f†(x). (Center) The mean weight function for h1(x). (Right) The mean weight function for h2(x).

The second and third plots illustrate the posterior mean weight functions for each simulator. Based on the middle plot, the first simulator has increasing utility as x_1 and x_2 both increase. This is to be expected, as h_1(x) is composed of two expansions centered about π. Note that the mean value of w_1(x) does not reach 1 in the upper right corner of the domain because the simulator slightly overestimates the peak of f†(x) in this region. Meanwhile, the posterior mean of w_2(x) indicates h_2(x) has high utility for x ∈ [−π, π] × [−π, 0], which is to be expected given the nature of the expansions included in this simulator. Moreover, the predictions from h_2(x) appear to align closely with the data, and thus f†(x), as evidenced by the weights approaching values near 1 in the bottom half of the domain.

6 Discussion

A variety of frequentist and Bayesian approaches are available for model averaging and mixing. Each method involves estimating the overall predictive mean or density based on the individual models. The selection between these two objectives should ultimately be guided by the underlying statistical inference one wishes to ascertain. In computer experiments, a primary objective is to recover the underlying system, which is generally expressed as the mean function in an additive model for the observational data. Hence, a mean estimation approach is more desirable in this setting than predictive density estimation, which is oriented toward predicting a future observation ỹ.

Example 5.1 compares the proposed mean estimation method with a density estimation method, Hierarchical Stacking (HS). In HS, the weight functions are learned relative to leave-one-out (LOO) predictive densities under a simplex constraint. These LOO densities incorporate information regarding the mean and variance of each EFT at a given x. In portions of the domain where a model rapidly diverges, the resulting LOO predictive density is shrunk toward 0. In turn, the corresponding weight function will approach 0; however, it may struggle to obtain a small enough value to shrink out the effect of the diverging mean. Meanwhile, shrinking the effect of a diverging prediction appears to be easier when mixing the mean predictions from each EFT.

The primary objective of the weight functions is to re-scale the predictions given by each individual model so that a linear combination of these predictions can adequately recover the true system. Given the prior regularization method applied to the weight functions, exact interpretation of the resulting values can be unclear. However, using this regularization perspective, one can conclude that weight functions which fall close to 0 within a particular subregion indicate that the corresponding model is unnecessary for the overall prediction. Meanwhile, a model which is the unique local expert within a particular region should be weighted by values close to 1. Overall, a joint interpretation of the weight functions is appropriate, particularly in regions where the weights concentrate around values away from 0 or 1. These features are observed across each example.

The benefit of the proposed regularization approach can further be understood through the posterior distribution of the sum of the weight functions, w_sum(x) = Σ_{l=1}^{K} w_l(x), as shown in Figure 6. The posterior of w_sum(x) from Example 1a (left panel, solid) is centered very close to 1 with relatively small amounts of uncertainty. This results from (i) the prior regularization and (ii) the fact that f†(x) lies between the selected EFTs, which indicates a convex combination is appropriate. Even though a sum-to-one property is not strictly imposed, it appears to occur naturally in this situation where an interpolation of the competing models is appropriate. Meanwhile, the posterior of w_sum(x) from Example 1b (left, dashed) drops significantly below 1 in the intermediate range of the domain because both EFTs overestimate the true system, which renders a convex combination inappropriate. Similar features are observed in the two-dimensional example, as the mean of w_sum(x) concentrates around 1 in areas where at least one of the simulators aligns well with the true system. Meanwhile, when neither simulator is accurate (i.e., the top left corner), the mean value of w_sum(x) is drastically below 1. From these observations, it appears the BART-BMM approach benefits from not imposing strict assumptions, such as a simplex constraint, on the weights. The examples in Section 5 emphasize the two-model case; however, analogous interpretations of the weight functions are applicable when considering model sets of size K.

Fig. 6 (Left) The posterior mean estimates and 95% credible intervals (shaded) of the sum of weight functions from Examples 1a and 1b (solid and dashed). (Right) The posterior mean estimate of the sum of weight functions in Example 2.

Finally, the weight functions can be used to better understand the M-open assumption associated with the model set. An initial confirmation of the M-open setting can be made when the weight functions noticeably change as a function of the inputs. This observation indicates localized performance of each model, hence, one can confirm the true system is not contained in the set. If the weight functions are nearly constant, one may also wish to check the posterior of wsum(x) to see if the sum of the weights is fixated close to 1. Such a case may suggest model averaging with a simplex constraint could also be an appropriate solution. This alone is not enough to confirm or deny the M-open assumption, however, it may indicate that the M-complete or M-closed assumptions are possible for the model set. A final case to consider is the situation where a single model receives a weight near 1 while the effects of the competing models are shrunk to 0 across a subregion of the domain. This situation may indicate the model set is M-closed conditional on the subregion of interest despite falling in the M-open case when considering the entire domain.

In conclusion, this work proposes a Bayesian treed framework to mix predictions from a set of competing models, each of which is intended to explain the physical system across a subregion of the domain. This approach falls within the class of problems referred to as Bayesian model mixing, as input-dependent weights are defined to reflect the localized behavior of each model. The weight functions are modeled using a sum-of-trees and are regularized via a multivariate Gaussian prior. The tree bases coupled with the regularization approach allow the weights to be learned in a flexible nonparametric manner free of strict constraints. Using the weight functions, predictions from the individual models are mixed via a linear combination. The success of this mixing approach is demonstrated on three examples, each of which considers models with localized predictive performance. Leveraging the localized behavior of the individual models leads to significant improvements in the posterior prediction and uncertainty quantification of f†(x) and in the overall interpretation of the system compared to existing global and local weighting schemes.

Supplementary Materials

The supplementary material includes the essential derivations of the methodology along with additional information regarding EFTs. Code implementing the method and reproducing the examples is also available online.


Acknowledgments

The authors would like to thank the Editor, an Associate Editor, and two referees for helpful comments on this work.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The work of JCY and RJF was supported in part by the National Science Foundation under Agreement OAC-2004601. The work of MTP was supported in part by the National Science Foundation under Agreements DMS-1916231, DMS-1564395, and OAC-2004601, and in part by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-2018-CRG7-3800.3. The work of TJS was supported in part by the National Science Foundation under Agreement DMS-1564395 (The Ohio State University).

References

  • Bernardo, J. M., and Smith, A. F. (1994), Bayesian Theory, Chichester: Wiley.
  • Breiman, L. (1996), “Stacked Regressions,” Machine Learning, 24, 49–64. DOI: 10.1007/BF00117832.
  • Bunea, F., Tsybakov, A. B., and Wegkamp, M. H. (2007), “Aggregation for Gaussian Regression,” The Annals of Statistics, 35, 1674–1697. DOI: 10.1214/009053606000001587.
  • Burgess, C. P. (2020), Introduction to Effective Field Theory: Thinking Effectively about Hierarchies of Scale, Cambridge: Cambridge University Press.
  • Burnham, K. P., and Anderson, D. R. (1998), “Practical Use of the Information-Theoretic Approach,” in Model Selection and Inference, pp. 75–117, New York: Springer.
  • Chipman, H. A., George, E. I., and McCulloch, R. E. (1998), “Bayesian CART Model Search,” Journal of the American Statistical Association, 93, 935–948. DOI: 10.1080/01621459.1998.10473750.
  • ———(2002), “Bayesian Treed Models,” Machine Learning, 48, 299–320.
  • Chipman, H., George, E., and McCulloch, R. (2010), “BART: Bayesian Additive Regression Trees,” The Annals of Applied Statistics, 4, 266–298. DOI: 10.1214/09-AOAS285.
  • Clyde, M., and Iversen, E. S. (2013), “Bayesian Model Averaging in the M-Open Framework,” in Bayesian Theory and Applications, eds. P. Damien, P. Dellaportas, N. G. Polson, and D. A. Stephens, 484–498, Oxford: Oxford University Press.
  • Draper, D. (1995), “Assessment and Propagation of Model Uncertainty,” Journal of the Royal Statistical Society, Series B, 57, 45–70. DOI: 10.1111/j.2517-6161.1995.tb02015.x.
  • Georgi, H. (1993), “Effective Field Theory,” Annual Review of Nuclear and Particle Science, 43, 209–252. DOI: 10.1146/annurev.ns.43.120193.001233.
  • Gramacy, R. B. (2020), Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences, Boca Raton, FL: Chapman and Hall/CRC.
  • Gramacy, R. B., and Lee, H. K. H. (2008), “Bayesian Treed Gaussian Process Models with an Application to Computer Modeling,” Journal of the American Statistical Association, 103, 1119–1130. DOI: 10.1198/016214508000000689.
  • Hansen, B. E. (2007), “Least Squares Model Averaging,” Econometrica, 75, 1175–1189. DOI: 10.1111/j.1468-0262.2007.00785.x.
  • Hastie, T., and Tibshirani, R. (2000), “Bayesian Backfitting” (with comments and a rejoinder), Statistical Science, 15, 196–223. DOI: 10.1214/ss/1009212815.
  • Honda, M. (2014), “On Perturbation Theory Improved by Strong Coupling Expansion,” Journal of High Energy Physics, 2014, 1–44. DOI: 10.1007/JHEP12(2014)019.
  • Le, T., and Clarke, B. (2017), “A Bayes Interpretation of Stacking for M-complete and M-open Settings,” Bayesian Analysis, 12, 807–829. DOI: 10.1214/16-BA1023.
  • Melendez, J. A., Furnstahl, R. J., Phillips, D. R., Pratola, M. T., and Wesolowski, S. (2019), “Quantifying Correlated Truncation Errors in Effective Field Theory,” Physical Review C, 100, 044001. DOI: 10.1103/PhysRevC.100.044001.
  • Melendez, J., Furnstahl, R., Grießhammer, H., McGovern, J., Phillips, D., and Pratola, M. (2021), “Designing Optimal Experiments: An Application to Proton Compton Scattering,” The European Physical Journal A, 57, 1–24. DOI: 10.1140/epja/s10050-021-00382-2.
  • Petrov, A. A., and Blechman, A. E. (2016), Effective Field Theories, Singapore: World Scientific. https://www.worldscientific.com/doi/abs/10.1142/8619
  • Phillips, D., Furnstahl, R., Heinz, U., Maiti, T., Nazarewicz, W., Nunes, F., Plumlee, M., Pratola, M., Pratt, S., Viens, F. et al. (2021), “Get on the BAND Wagon: A Bayesian Framework for Quantifying Model Uncertainties in Nuclear Dynamics,” Journal of Physics G: Nuclear and Particle Physics, 48, 072001. DOI: 10.1088/1361-6471/abf1df.
  • Prado, E. B., Moral, R. A., and Parnell, A. C. (2021), “Bayesian Additive Regression Trees with Model Trees,” Statistics and Computing, 31, 1–13. DOI: 10.1007/s11222-021-09997-3.
  • Pratola, M. T. (2016), “Efficient Metropolis–Hastings Proposal Mechanisms for Bayesian Regression Tree Models,” Bayesian Analysis, 11, 885–911. DOI: 10.1214/16-BA999.
  • Raftery, A., Madigan, D., and Hoeting, J. (1997), “Bayesian Model Averaging for Linear Regression Models,” Journal of the American Statistical Association, 92, 179–191. DOI: 10.1080/01621459.1997.10473615.
  • Santner, T. J., Williams, B. J., and Notz, W. I. (2018), The Design and Analysis of Computer Experiments (2nd ed.), New York: Springer.
  • Semposki, A. C., Furnstahl, R. J., and Phillips, D. R. (2022), “Interpolating Between Small- and Large-g Expansions Using Bayesian Model Mixing,” Physical Review C, 106, 044002. DOI: 10.1103/PhysRevC.106.044002.
  • Sill, J., Takács, G., Mackey, L., and Lin, D. (2009), “Feature-Weighted Linear Stacking,” arXiv preprint arXiv:0911.0460 .
  • Yang, Y., and Dunson, D. B. (2014), “Minimax Optimal Bayesian Aggregation,” arXiv preprint arXiv:1403.1345.
  • Yao, Y., Pirš, G., Vehtari, A., and Gelman, A. (2021), “Bayesian Hierarchical Stacking: Some Models are (somewhere) Useful,” Bayesian Analysis, 1, 1–29. DOI: 10.1214/21-BA1287.
  • Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018), “Using Stacking to Average Bayesian Predictive Distributions,” Bayesian Analysis, 13, 917–1007. DOI: 10.1214/17-BA1091.