Research Article

Partially fixed Bayesian additive regression trees

Received 09 Jan 2024, Accepted 05 Apr 2024, Published online: 18 Apr 2024

Abstract

Bayesian Additive Regression Trees (BART) is a widely used nonparametric regression model known for its accurate predictions. In certain situations, prior knowledge suggests the existence of dominant variables, but the BART model fails to fully utilize such knowledge. To tackle this problem, this paper introduces a modification of BART called the Partially Fixed BART model. By fixing a portion of the trees' structure, the model makes more efficient use of prior knowledge, resulting in improved estimation accuracy. Moreover, even when such prior knowledge is absent, the Partially Fixed BART model can offer more precise estimates and valuable insights for further analysis. Empirical results substantiate the improvement of the proposed model over the original BART.

1. Introduction

Bayesian Additive Regression Trees (BART) (Chipman et al., Citation2010) is a nonparametric regression model known for its superior accuracy compared to other tree-based methods such as random forests (Breiman, Citation2001) and XGBoost (Chen & Guestrin, Citation2016). Furthermore, the BART model deviates from the strict parametric assumptions of classical models and combines the flexibility of machine learning algorithms with the rigor of likelihood-based inference, making it a potent inferential tool. Another advantage of the BART model is its robustness to hyperparameter selection.

When setting up a data analysis model, we often possess prior knowledge, obtained through logical deduction or background research, indicating significant relationships between certain explanatory variables (predictors) and the response. Particularly in spatial-temporal models, time or spatial variables are presumed to play crucial roles. If we know a portion of the model structure, we can construct a parametric or semi-parametric model (Tan & Roy, Citation2019), with the parametric component representing the known structure. However, in most situations, the model structure is not known with certainty. How can we fully utilize this type of prior knowledge?

In the BART model, a uniform distribution prior is commonly used to select active predictors for splitting, resulting in equal selection probabilities for each variable. This contradicts our understanding that certain variables are more important than others. One approach to incorporate prior knowledge is to assign higher prior probabilities to important variables, although determining the prior is challenging. In this paper, we propose fixing the important variables at the root of trees, introducing a new model called Partially Fixed BART (PFBART). The PFBART model improves estimation accuracy compared to the original BART model when appropriate prior knowledge is incorporated.

The paper is structured as follows: Section 2 provides a review of BART, including the MCMC algorithm elements used for posterior inference. In Section 3, we present a detailed introduction to PFBART. Section 4 describes the conducted experiments, comparing and examining PFBART alongside the original BART. Finally, Section 5 presents the paper's conclusions and suggests future research directions.

2. Bayesian additive regression trees (BART)

2.1. Model

This section motivates and describes the BART framework. We begin with the basic BART model for independent continuous outcomes, as this is the most natural setting in which to explain BART.

For data with n samples, the ith sample consists of a p-dimensional vector of predictors $X_i$ and a response $Y_i$ ($1 \le i \le n$), and the BART model posits
(1) $Y_i = f(X_i) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n.$
To estimate $f(X)$, a sum of regression trees is specified as
(2) $f(X_i) = \sum_{j=1}^{m} g(X_i; T_j, M_j),$
where $T_j$ is the jth binary tree structure and $M_j = \{\mu_{1j}, \ldots, \mu_{b_j j}\}$ is the set of parameters associated with the $b_j$ terminal nodes of $T_j$. $T_j$ contains the information of which variable to split on, the cutoff value, and the locations of the internal nodes. The number of trees m is a hyperparameter, usually set to 200.
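
To make the sum-of-trees form in Equation (2) concrete, here is a minimal sketch in Python. The nested-dict tree encoding and the field names ('var', 'cut', 'left', 'right', 'mu') are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def eval_tree(tree, x):
    """Route a single sample x down one binary tree and return its leaf value mu."""
    node = tree
    while 'mu' not in node:                 # internal node: follow the split rule
        node = node['left'] if x[node['var']] <= node['cut'] else node['right']
    return node['mu']

def sum_of_trees(trees, X):
    """f(X_i) = sum_j g(X_i; T_j, M_j) evaluated for every row of X."""
    return np.array([sum(eval_tree(t, x) for t in trees) for x in X])

# Toy usage: two stumps splitting on variables 0 and 1.
trees = [
    {'var': 0, 'cut': 0.5, 'left': {'mu': -1.0}, 'right': {'mu': 1.0}},
    {'var': 1, 'cut': 0.3, 'left': {'mu': 0.2},  'right': {'mu': 0.7}},
]
X = np.random.rand(5, 2)
print(sum_of_trees(trees, X))
```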

2.2. Prior

BART is designed within the Bayesian framework, so we denote the prior distribution of the BART model as $P(T_1, M_1, \ldots, T_m, M_m, \sigma)$. The pairs $\{(T_1, M_1), \ldots, (T_m, M_m)\}$ are assumed independent of $\sigma$, and $(T_1, M_1), \ldots, (T_m, M_m)$ are independent of each other, so we have
(3) $P(T_1, M_1, \ldots, T_m, M_m, \sigma) = P(T_1, M_1, \ldots, T_m, M_m)\, P(\sigma) = \Big[\prod_{j=1}^{m} P(T_j, M_j)\Big] P(\sigma) = \Big[\prod_{j=1}^{m} P(M_j \mid T_j) P(T_j)\Big] P(\sigma) = \Big[\prod_{j=1}^{m} \Big\{\prod_{k=1}^{b_j} P(\mu_{kj} \mid T_j)\Big\} P(T_j)\Big] P(\sigma).$
From Equation (3), we need to specify the priors $P(\mu_{kj} \mid T_j)$, $P(\sigma)$, and $P(T_j)$. For computational convenience, we use the conjugate normal distribution $N(\mu_\mu, \sigma_\mu^2)$ as the prior for $\mu_{kj} \mid T_j$; the prior parameters $\mu_\mu$ and $\sigma_\mu$ can be set by rough calculation. We also use a conjugate prior for $\sigma^2$, the scaled inverse chi-square distribution $\sigma^2 \sim \nu\lambda/\chi^2_\nu$, where the two hyperparameters $\lambda$ and $\nu$ can likewise be derived by rough calculation. The prior for $T_j$ is specified through three aspects.

  1. The probability for a node at depth d to split: given by $\alpha(1+d)^{-\beta}$. We can confine the depth of each tree by controlling the splitting probability and thus avoid overfitting. Usually $\alpha$ is set to 0.95 and $\beta$ to 2 (see the sketch after this list).

  2. The probability of splitting-variable assignment at each interior node: uniform by default. A Dirichlet prior has been introduced for high-dimensional variable-selection scenarios (Linero, Citation2018; Linero & Yang, Citation2018).

  3. The probability of cutoff-value assignment: uniform by default.
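
As a small illustration of the first component of the tree prior, the depth-dependent splitting probability can be computed as in the sketch below; the default parameter values follow those stated above.

```python
def split_prob(depth, alpha=0.95, beta=2.0):
    """Prior probability that a node at the given depth is non-terminal (is split)."""
    return alpha * (1.0 + depth) ** (-beta)

# Deeper nodes are increasingly unlikely to split, which keeps trees shallow:
# depths 0..4 give roughly 0.95, 0.24, 0.11, 0.06, 0.04.
print([split_prob(d) for d in range(5)])
```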

2.3. Posterior distribution

With the priors specified in Equation (3), the posterior distribution is
(4) $P[(T_1, M_1), \ldots, (T_m, M_m), \sigma \mid Y] \propto P(Y \mid (T_1, M_1), \ldots, (T_m, M_m), \sigma) \times P((T_1, M_1), \ldots, (T_m, M_m), \sigma),$
and draws from Equation (4) can be obtained by Gibbs sampling. First, m successive draws of
(5) $P[(T_j, M_j) \mid T_{(j)}, M_{(j)}, Y, \sigma]$
are made, where $T_{(j)}$ and $M_{(j)}$ consist of the information of all trees except the jth. Then $P[\sigma \mid (T_1, M_1), \ldots, (T_m, M_m), Y]$ can be drawn from an explicit inverse gamma distribution.

How do we draw from Equation (5)? Note that $(T_j, M_j)$ depends on $T_{(j)}$, $M_{(j)}$, and $Y$ only through the partial residual $R_j = Y - \sum_{w \ne j} g(X; T_w, M_w)$, so it is equivalent to draw from the single-tree posterior
(6) $P[(T_j, M_j) \mid R_j, \sigma].$
We can carry out Equation (6) in two steps. First we obtain a draw from $P(T_j \mid R_j, \sigma)$, then we draw from $P(M_j \mid T_j, R_j, \sigma)$. In the first step, we have
(7) $P(T_j \mid R_j, \sigma) \propto P(T_j)\, P(R_j \mid T_j, \sigma),$
with the marginal likelihood $P(R_j \mid T_j, \sigma) = \int P(R_j \mid M_j, T_j, \sigma)\, P(M_j \mid T_j, \sigma)\, dM_j$. Because a conjugate normal prior is placed on $M_j$, this marginal likelihood has an explicit expression.

We carry out Equation (7) by generating a candidate tree $T_j^*$ from the previous tree structure with a Metropolis-Hastings (MH) step. We accept the new tree structure with probability
(8) $\min\left\{1, \frac{q(T_j^*, T_j)}{q(T_j, T_j^*)} \frac{P(R_j \mid X, T_j^*)}{P(R_j \mid X, T_j)} \frac{P(T_j^*)}{P(T_j)}\right\},$
where $q(T_j, T_j^*)$ is the probability of proposing a move from the previous tree $T_j$ to the new tree $T_j^*$.

The candidate tree is proposed using four types of moves.

  1. Grow: splitting a current leaf into two new leaves, with probability 0.25.

  2. Prune: collapsing two adjacent leaves back into a single leaf, with probability 0.25.

  3. Swap: swapping the decision rules assigned to two connected interior nodes, with probability 0.1.

  4. Change: reassigning the decision rule attached to an interior node, with probability 0.4.

Once we have sampled from $P(T_j \mid R_j, \sigma)$, we can sample the kth leaf parameter $\mu_{kj}$ of the jth tree from $N\!\left(\frac{\sigma_\mu^2 \sum R_{kj}}{n_k \sigma_\mu^2 + \sigma^2},\ \frac{\sigma^2 \sigma_\mu^2}{n_k \sigma_\mu^2 + \sigma^2}\right)$, where $R_{kj}$ denotes the subset of $R_j$ allocated to the leaf node with parameter $\mu_{kj}$, $\sum R_{kj}$ is the sum of these residuals, and $n_k$ is their number. With all m updates of $(T_j, M_j)$ and one update of $\sigma$, we complete one iteration of the MCMC process. We repeat this process for many iterations, discard the first unstable (burn-in) iterations, and keep the remaining stable iterations as the nonparametric estimator.
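
The leaf-parameter draw can be sketched as follows, assuming, as in the expression above, a zero-mean normal prior $N(0, \sigma_\mu^2)$ on the leaf values; the function name is illustrative.

```python
import numpy as np

def draw_leaf_mu(resid_in_leaf, sigma, sigma_mu, rng):
    """Draw mu_kj from its conditional normal posterior, given the residuals
    R_kj allocated to this leaf, the error s.d. sigma, and the prior s.d. sigma_mu."""
    n_k = resid_in_leaf.size
    denom = n_k * sigma_mu**2 + sigma**2
    mean = sigma_mu**2 * resid_in_leaf.sum() / denom
    var = sigma**2 * sigma_mu**2 / denom
    return rng.normal(mean, np.sqrt(var))

rng = np.random.default_rng(0)
print(draw_leaf_mu(np.array([0.4, 0.1, -0.2]), sigma=1.0, sigma_mu=0.5, rng=rng))
```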

3. Partially fixed BART

As mentioned earlier, a uniform distribution is typically employed as the prior for selecting splitting variables, resulting in an equal probability for each variable to be chosen. Through logical inference or background analysis, we may identify certain variables as more important than others in specific models. In such cases, it is necessary to assign higher probabilities to these variables, such as the time variable in a time-related model or location variables in a spatial-related model. In these situations, simply applying the BART model fails to fully utilize this prior knowledge. We applied the BART model to data generated from scenario F1(X) in Section 4.1, in which X1 is related to every part of the function, so it is natural to force X1 to appear in every regression tree. Figure 1 illustrates the frequency of each variable in the model during the final iteration. It reveals that the important variable X1 is not the most frequently selected; on the contrary, certain irrelevant variables like X7 exhibit higher frequencies than X1.

Figure 1. The frequency of each variable used in the BART model. X1 is an important variable; X6, …, X10 are irrelevant variables.


When we possess such prior knowledge, we can anchor these variables at the topmost levels of the trees. Note that in the case of ordinal splitting variables, samples with $x \le c$ (where c represents the cut point for the splitting variable) are directed to the left child node, while samples with $x > c$ are assigned to the right child node. When there is a need to fix multiple layers of variables, it is common to assign the same splitting variable to the left and right child nodes, thus establishing variable fixing across layers. For instance, if we identify two variables as crucial in the model, we can fix these two variables at the topmost two levels of the trees, effectively preventing other variables from appearing at these levels.

The four moves for generating a new tree structure are modified as follows.

  1. Grow: If a node in the fixed layers needs to be grown, only the assigned important variables are allowed to be chosen as splitting variables.

  2. Prune: No changes are made unless a logical hyperparameter is in effect. Detailed information will be provided later.

  3. Swap: The tree structure will not be changed if swapping two nodes violates the fixing rule.

  4. Change: If a node in the fixed layer needs to be changed, the variable to be split is confined to the fixed variable scope.

The details of PFBART are given in Algorithm 1.
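
Since Algorithm 1 itself is not reproduced here, the sketch below illustrates only the constraint on the splitting-variable choice for a Grow (or Change) proposal. The node representation, `fixed_vars`, `fixed_depth`, and the `swap` flag (anticipating the Swap hyperparameter introduced below) are illustrative assumptions consistent with the description above, not the paper's code.

```python
import random

def propose_split_variable(node_depth, n_predictors, fixed_vars, fixed_depth, swap=True):
    """Pick the splitting variable for a node proposed to grow (or change).

    Nodes inside the fixed layers (depth < fixed_depth) may only use the
    pre-specified important variables; other nodes follow the uniform prior."""
    if node_depth < fixed_depth:
        if swap:
            return random.choice(fixed_vars)      # any fixed variable, in any order
        return fixed_vars[node_depth]             # layer d must use the d-th fixed variable
    return random.randrange(n_predictors)         # outside the fixed layers

# Example: fix variables 0 and 1 at the top two layers of every tree.
print(propose_split_variable(node_depth=0, n_predictors=10, fixed_vars=[0, 1], fixed_depth=2))
print(propose_split_variable(node_depth=3, n_predictors=10, fixed_vars=[0, 1], fixed_depth=2))
```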

Three logical hyperparameters are introduced in PFBART to enhance control over the fixing activity.

The first logical hyperparameter, Prune, controls the prune process. If Prune is False and the node to be pruned is in the fixed layers, the prune process will not alter the tree structure.

When dealing with multiple important variables, fixing each layer to a specific variable may be too restrictive. If Swap is True, these variables can appear at any fixed layer. Otherwise, the variables to be fixed must follow a specific order: the first important variable can only be selected at the first layer of the trees, and so on.

Given the BART model's restriction on tree depth, fixing multiple variables at the tree's upper levels may hinder the inclusion of other variables at lower levels. Therefore, we introduce a logical parameter called ChangePrior. When ChangePrior is False, the splitting probability is unchanged. When ChangePrior is True, nodes in the fixed layers adopt the same splitting probability as the root node of the trees, and nodes outside the fixed layers use the adjusted probability $\alpha(1 + d - h)^{-\beta}$, where h denotes the height (number) of the fixed layers.
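
A sketch of the adjusted splitting probability under ChangePrior, writing the number of fixed layers as `fixed_height` (= h); this illustrates the formula above and is not the paper's implementation.

```python
def pfbart_split_prob(depth, fixed_height, change_prior, alpha=0.95, beta=2.0):
    """Splitting probability for a node at the given depth.

    With change_prior=True, nodes in the fixed layers keep the root probability
    (alpha) and nodes below grow as if the fixed layers were absent."""
    if not change_prior:
        return alpha * (1.0 + depth) ** (-beta)
    if depth < fixed_height:
        return alpha                                    # same as the root node
    return alpha * (1.0 + depth - fixed_height) ** (-beta)

# With three fixed layers (as in Section 4.3), a node at depth 4 behaves like depth 1.
print(pfbart_split_prob(depth=4, fixed_height=3, change_prior=True))
```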

A toy example is used to demonstrate PFBART and the effect of the logical hyperparameters. We use data generated from scenario F1(X) and take two trees from the two hundred trees as a brief example. If we use the BART model to fit the data, X1 may not appear in every regression tree, as can be seen from the second tree in part A of Figure 2. X1 and X2 are the two variables we fix in PFBART (X2 is fixed only to demonstrate the effect of the hyperparameters). In part B of Figure 2, we set Swap to False, which means the order is fixed: X1 is fixed at the first layer and X2 at the second layer of the regression tree. In part C, Swap is True, so the variables in the first two layers must be X1 or X2, but they need not appear in a specific order. By setting Prune to True in part D, the second tree exhibits a single-layer structure, whereas in parts B and C the tree structure always consists of more than one layer.

Figure 2. Toy example for PFBART.


4. Illustrations

4.1. Simulation experiment

Initially, we illustrate the advantages of PFBART over BART in various scenarios. The data are generated from the function
$F_1(X) = 10\sin(\pi X_1 X_2) + 5 X_1^2 (X_3 - 0.5) + 10 X_1^3 X_3 X_4 + 5 X_1^4 X_5.$
For comparison, we consider two additional scenarios in which data are generated from the functions
$F_2(X) = 10\sin(\pi X_1 X_2) + 5 X_2^2 (X_3 - 0.5) + 10 X_1^3 X_3 X_4 + 5 X_1^4 X_5$
and
$F_3(X) = 10\sin(\pi X_6 X_2) + 5 X_6^2 (X_3 - 0.5) + 10 X_6^3 X_3 X_4 + 5 X_6^4 X_5.$
In scenario F1(X), X1 is associated with every part of the function, indicating its crucial role. In scenario F2(X), the second part is unrelated to X1, enabling us to evaluate PFBART's performance when the fixed variable is less significant. In scenario F3(X), X1 is an irrelevant variable in the model. To demonstrate that PFBART's effectiveness is independent of the variable selection process, we also run the model using only X1, …, X5 with data from F1(X); this scenario is labelled F4(X).

We generate 100 datasets for each function, with a sample size of 4000 in each dataset. Each dataset comprises 10 variables, $X_1, \ldots, X_{10}$, randomly sampled from a uniform distribution U(0,1). The datasets are split equally into training and testing subsets. In both BART and PFBART, the initial 500 unstable iterations are excluded, and the following 1000 iterations are considered as the model result. The remaining parameters utilize the default settings.
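
A sketch of the data-generating process for scenario F1(X); the noise standard deviation (set to 1 here) is an assumption, since the section does not state it.

```python
import numpy as np

def f1(X):
    """Scenario F1 from Section 4.1: X1 enters every term."""
    return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 5 * X[:, 0]**2 * (X[:, 2] - 0.5)
            + 10 * X[:, 0]**3 * X[:, 2] * X[:, 3]
            + 5 * X[:, 0]**4 * X[:, 4])

rng = np.random.default_rng(2024)
X = rng.uniform(0.0, 1.0, size=(4000, 10))        # X1, ..., X10 ~ U(0, 1)
y = f1(X) + rng.normal(0.0, 1.0, size=4000)        # assumed unit-variance Gaussian noise
X_train, y_train = X[:2000], y[:2000]              # equal train/test split
X_test, f_test = X[2000:], f1(X[2000:])            # keep noiseless values for the RMSE below
```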

For each scenario, models were fitted on the training set and used to predict the corresponding test set. The predictions were evaluated using the root mean squared error (RMSE),
$\mathrm{RMSE} = \sqrt{\frac{1}{2000} \sum_{i=1}^{2000} \big(\hat f(x_i) - f(x_i)\big)^2}.$
In this experiment, two competitors, eXtreme Gradient Boosting (XGB) and random forests (RF), with default settings are also included. BART outperforms XGB and RF in all four scenarios, which indicates that the two competitors cannot recognize this special structure, so we mainly focus on the comparison between BART and PFBART in this section.
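
A brief sketch of this RMSE computed against the noiseless function values; the fitted-model predictions are assumed to come from any of the methods above.

```python
import numpy as np

def rmse(f_hat, f_true):
    """Root mean squared error between predictions and the noiseless function values."""
    f_hat, f_true = np.asarray(f_hat), np.asarray(f_true)
    return float(np.sqrt(np.mean((f_hat - f_true) ** 2)))

# e.g. rmse(predictions_on_X_test, f_test) with the quantities from the previous sketch
```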

Table 1 lists the four combinations of logical hyperparameters under which we run PFBART. Figure 3 shows boxplots of the 100 RMSE values for each scenario.

Figure 3. Boxplots of the RMSE values for each method across the 100 data sets.


Table 1. Settings for the logical hyperparameters.

Some findings can be derived from Figure 3.

  1. The performance of different logical parameters follows a specific order in the four scenarios: SET1 ≈ SET2 > SET3 > SET4. Setting the logical parameter ChangePrior to True is a trade-off for easier growth of deeper trees at the cost of overfitting. When there is only one layer to fix, changing the splitting priority is unnecessary and leads to overfitting. When ChangePrior is True, setting the logical parameter Prune to False increases the probability of overfitting. However, when the splitting priority remains unchanged, allowing or disallowing pruning in the fixed layer has little effect on the model. There is almost no difference between SET1 and SET2. Therefore, the following discussion primarily focuses on comparing PFBART SET1 and BART.

  2. In scenario F1(X), PFBART reduces the median RMSE by approximately 15% compared to BART. This indicates that when we possess correct prior information that the fixed variable is related to every part of the model, PFBART can achieve more accurate estimations.

  3. In scenario F2(X), where a portion of the model is unrelated to the assigned fixed variable X1, PFBART reduces the median RMSE by approximately 9%. This suggests that PFBART can perform effectively in a wider range of scenarios as long as the fixed variable is correlated with a large part of the model.

  4. In scenario F3(X), where the fixed variable X1 is irrelevant to the model, PFBART performs poorly, since fixing an irrelevant variable introduces additional error into the model.

  5. In scenario F4(X), where only $X_1, \ldots, X_5$ are used in the model, PFBART still outperforms BART by approximately 10%. This indicates that the effectiveness of PFBART is not solely attributed to the variable selection process.

4.2. UCI data sets

In the previous simulation, we demonstrated how prior knowledge can be utilized to achieve better estimations. In this section, we illustrate the use of PFBART on data without prior knowledge.

From the UCI repository (Dua & Graff, Citation2017), we selected 14 datasets based on the following criteria: (1) sample size ranging from 240 to 5500; (2) number of attributes ranging from 5 to 13; (3) regression datasets, excluding time series datasets. The details of the datasets are given in Table 2.

Table 2. UCI data sets information.

For simplicity, we randomly removed samples from each dataset so that the total sample size is divisible by 10. Each dataset was evaluated using 10-fold cross-validation, and we performed 10 randomizations per dataset. Each variable in turn is fixed at the top of the trees. We use the relative RMSE, defined as the ratio of the PFBART RMSE to the BART RMSE on the same dataset, as a measure of variable importance. Thus we obtain 10 such statistics for each covariate, presented in Figure 4.
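
The relative-RMSE importance measure can be sketched as follows; `fit_bart` and `fit_pfbart` are hypothetical stand-ins for the actual model-fitting routines, and the fold construction is a plain random 10-fold split.

```python
import numpy as np

def cv_rmse(fit_predict, X, y, n_folds=10, seed=0):
    """10-fold cross-validated RMSE for a generic fit-and-predict callable."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errs = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        pred = fit_predict(X[train_mask], y[train_mask], X[test_idx])
        errs.append(np.sqrt(np.mean((pred - y[test_idx]) ** 2)))
    return float(np.mean(errs))

def relative_rmse(X, y, fixed_var, fit_bart, fit_pfbart):
    """Ratio of PFBART RMSE (one covariate fixed at the top of the trees) to BART RMSE."""
    pfbart = cv_rmse(lambda Xtr, ytr, Xte: fit_pfbart(Xtr, ytr, Xte, fixed_vars=[fixed_var]), X, y)
    bart = cv_rmse(fit_bart, X, y)
    return pfbart / bart
```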

Figure 4. Relative RMSE for every covariate in UCI data sets.


For the datasets Abalone, Forest Fire, Wine Quality, QSAR Aquatic Toxicity, and QSAR Fish Toxicity, fixing every variable had a similar effect on the BART model. This suggests that these variables all contribute to the model, and no single variable plays a dominant role.

For the Airfoil Self Noise dataset, the variable X1, frequency, is highly correlated with the dependent variable sound pressure level, as observed in Brooks et al. (Citation1989).

In the Auto MPG dataset, the variable X6 (model year) is an important variable in the model, as it reflects changes in the MPG model due to scientific and technological advancements over different model years.

In the Bike Rental dataset, two variables, X2 (month) and X7 (feeling temperature), interact with other independent variables to influence bike rental behaviour.

For the Concrete Compressive Strength dataset, fixing each variable results in slightly worse estimation. However, these variables are not irrelevant variables, so we can incorporate this information along with background knowledge for future use.

In the Energy Efficiency dataset, X8 (Glazing Area Distribution) is an important variable as different types of area distributions lead to different energy efficiency models.

In the Real Estate Valuation dataset, fixing X5 (latitude) and X6 (longitude) improves estimation accuracy. Considering the common knowledge that these variables interact with other variables such as X1 (transaction date) and X2 (house age) to predict house prices, the results seem reasonable. In the next section, we will examine the performance of PFBART on a larger real estate dataset.

In the Strike dataset, the two important variables, X1 (country) and X6 (union centralization), interact with other independent variables to influence the strike volume.

The Tecator dataset is used to predict the fat content of a meat sample from its near-infrared absorbance spectrum. The predictors are principal components derived from the spectrum. No dominant variable can be identified among the principal components, although the first four components appear to be more important than the others.

In the Yacht Hydrodynamics dataset, fixing X5 (Froude number) improves the estimation. Based on background knowledge in hydrodynamics, X5 plays a significant role in predicting residuary resistance. Fixing covariates other than X5 leads to worse estimation, especially for X4. However, removing X4 from the model also results in worse estimation, suggesting that X4 should be included in the model; it is simply not a variable with global influence, similar to X4 in the Airfoil Self Noise and Bike Rental datasets. This indicates that variables with a high relative RMSE are not necessarily useless in the model.

4.3. Beijing housing price

The Beijing house price data (Lin et al., Citation2023) is used to demonstrate the process of fixing multiple variables in a spatial-temporal model. The response variable is the unit house price, and the covariates include location, floor, number of living rooms and bathrooms, presence of an elevator, and other variables. Based on prior knowledge, we assume that location and year of trading have a significant influence on the model. In this study, the longitude, latitude, and year of trading are fixed at the top three layers of the regression trees.

After preprocessing, the dataset consists of 296,255 valid samples. Due to the large sample size and the time-consuming nature of MCMC iterations, a random selection of 30% of the samples is used for training, while the remaining 70% is used for testing. This process generates 10 datasets, and for each dataset PFBART is run with the eight combinations of logical hyperparameters listed in Table 3. The relative RMSE is used as the evaluation metric.
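
A sketch of the experimental loop for the Beijing data: ten random 30%/70% splits, with PFBART run under each logical-hyperparameter combination of Table 3. The `fit_pfbart` call and its keyword names are hypothetical placeholders.

```python
import numpy as np
from itertools import product

def random_split(n, train_frac=0.3, rng=None):
    """Indices for one random 30% train / 70% test split."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(n)
    n_train = int(train_frac * n)
    return idx[:n_train], idx[n_train:]

# The eight combinations of the logical hyperparameters (Prune, Swap, ChangePrior).
settings = [dict(prune=p, swap=s, change_prior=c)
            for p, s, c in product([False, True], repeat=3)]

rng = np.random.default_rng(1)
splits = [random_split(296255, rng=rng) for _ in range(10)]
# For each split and setting, one would fit PFBART with longitude, latitude and
# trading year fixed at the top three layers, e.g. (hypothetical call):
# fit_pfbart(X[tr], y[tr], X[te], fixed_vars=["lng", "lat", "trade_year"], **settings[k])
```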

Table 3. Hyperparameter combinations.

Figure 5 presents the results of the eight PFBART models with different hyperparameter settings. All eight models outperform BART, with SET6 yielding the best performance.

Figure 5. Relative RMSE for PFBART with Beijing house price data.


The results confirm our hypothesis that spatial-temporal variables play a crucial role in the model. In other words, a significant portion of the variance in house prices is related to these three variables.

The good performance of SET6 can be explained as follows.

  1. Fixing multiple layers in the tree has the side effect of making it difficult for other covariates to be included in the model. To address this, we can adjust the splitting probability in a way that allows non-fixed layers to grow as if without the fixed layers, thus facilitating deeper growth.

  2. Preventing nodes in the fixed layers from being pruned results in regression trees with more than two layers. Conversely, allowing pruning may lead to unexpectedly shallow trees that do not align with our expectations.

  3. When fixing more than one layer, should the order of fixing be considered? By setting Swap to True, we can relax this restriction and make the model more flexible to approximate the true model effectively. This change allows the three variables (longitude, latitude, and year of trading) to grow at the fixed layers without considering their order.

5. Conclusion and looking forward

When constructing statistical models, particularly those related to spatial-temporal analysis, we often know, through logical deduction or background knowledge, that certain variables are strongly correlated with the majority of the model. This paper presents a method, referred to as Partially Fixed BART (PFBART), that leverages this prior knowledge by fixing these important variables at the top of the regression trees. Through simulation experiments and real-world examples, it is demonstrated that this approach leads to improved performance compared to the original BART model. Additionally, even in the absence of prior information, the proposed model can still be employed to achieve more accurate estimations or to serve as a measure of variable importance.

The primary contribution of this paper is the development of PFBART, an extension of the BART model. In a previous work by Linero and Yang (Citation2018), a soft BART model was introduced, which is better suited for approximating continuous or differentiable functions. Building upon this, we plan to incorporate the fixing of important variables based on the soft BART model and investigate whether this modification yields further improvement.

PFBART demonstrates superior performance in datasets where certain dominant variables exert significant influence. However, in most scenarios, each variable is only correlated with a portion of the overall variation, and there is no dominant variable. Currently, we are focussed on analysing the model structure and leveraging this information to enhance its performance.

Acknowledgements

The authors are grateful to the Editor, an Associate Editor, and two anonymous referees for their insightful comments and suggestions, which have led to significant improvements.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
  • Brooks, T. F., Pope, D. S., & Marcolini, M. A. (1989). Airfoil self-noise and prediction [Tech. Rep]. NASA.
  • Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). Association for Computing Machinery.
  • Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298. https://doi.org/10.1214/09-AOAS285
  • Dua, D., & Graff, C. (2017). UCI machine learning repository. https://archive.ics.uci.edu/ml
  • Lin, W., Shi, Z., Wang, Y., & Yan, T. H. (2023). Unfolding Beijing in a hedonic way. Computational Economics, 61(1), 1–24. https://doi.org/10.1007/s10614-021-10209-3
  • Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522), 626–636. https://doi.org/10.1080/01621459.2016.1264957
  • Linero, A. R., & Yang, Y. (2018). Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(5), 1087–1110. https://doi.org/10.1111/rssb.12293
  • Tan, Y. V., & Roy, J. (2019). Bayesian additive regression trees and the general BART model. Statistics in Medicine, 38(25), 5048–5069. https://doi.org/10.1002/sim.v38.25