Abstract
Likelihood-free methods are an essential tool for performing inference for implicit models which can be simulated from, but for which the corresponding likelihood is intractable. However, common likelihood-free methods do not scale well to a large number of model parameters. A promising approach to high-dimensional likelihood-free inference involves estimating low-dimensional marginal posteriors by conditioning only on summary statistics believed to be informative for the low-dimensional component, and then combining the low-dimensional approximations in some way. In this article, we demonstrate that such low-dimensional approximations can be surprisingly poor in practice for seemingly intuitive summary statistic choices. We describe an idealized low-dimensional summary statistic that is, in principle, suitable for marginal estimation. However, a direct approximation of the idealized choice is difficult in practice. We thus suggest an alternative approach to marginal estimation which is easier to implement and automate. Given an initial choice of low-dimensional summary statistic that might only be informative about a marginal posterior location, the new method improves performance by first crudely localizing the posterior approximation using all the summary statistics to ensure global identifiability, followed by a second step that hones in on an accurate low-dimensional approximation using the low-dimensional summary statistic. We show that the posterior this approach targets can be represented as a logarithmic pool of posterior distributions based on the low-dimensional and full summary statistics, respectively. The good performance of our method is illustrated in low to moderate dimensional examples. Computer code to implement the methods for the examples of this article is available at https://github.com/cdrovandi/ABC-marginal-approximations.
1 Introduction
Likelihood-free statistical methods such as approximate Bayesian computation (ABC, Sisson, Fan, and Beaumont Citation2018) and Bayesian synthetic likelihood (BSL, Wood Citation2010; Price et al. Citation2018) are now commonly applied to conduct inference on the parameters of computationally expensive models for which simulation of synthetic data is easy, but likelihood computation is impractical. Such approaches aim to find values of the model parameters for which simulated and observed data are close, typically on the basis of a set of summary statistics.
Standard likelihood-free techniques such as ABC and BSL are known to scale poorly to a large number of summary statistics (Blum Citation2010; Frazier et al. Citation2022). Although BSL methods scale more readily to higher-dimensional summaries (Priddle et al. Citation2022), its performance can still degrade as the dimension of the summaries increases. In spite of this, the use of a high-dimensional summary statistic might seem desirable for several reasons. First, a high-dimensional summary statistic might be necessary to mitigate the inevitable information loss incurred by reducing the full dataset. Second, when the model contains a large number of parameters, a large number of summary statistics are required by default: when the dimension of the summary statistic is less than the parameter, it is not possible to point-identify all parameters. From a computational perspective, however, matching of simulated and observed summary statistics and posterior exploration become more difficult in higher dimensions.
In the ABC framework, a promising approach for performing likelihood-free analyses for high-dimensional parameter problems is marginal adjustment (Nott et al. Citation2014) and the closely related copula ABC approach (Li et al. Citation2017; Chen and Gutmann Citation2019). These approaches estimate low-dimensional (e.g., one or two dimensional) marginal posterior distributions, and then combine them in some way to obtain an approximation to the joint posterior distribution. The idea is that each parameter, or pair of parameters, is likely to be informed by only a small number of summary statistics. Thus, in principle, it is possible to accurately estimate these low-dimensional posterior distributions, and avoid the curse of dimensionality associated with the use of high-dimensional summary statistics. Other reasons why accurate marginal approximations might be important are that researchers often report inferences for individual parameters and that in initial modeling phases the identifiability of the model can be explored.
The purpose of this article is to demonstrate that the approximation of low-dimensional marginal posteriors based on intuitive low-dimensional summary statistic choices can be surprisingly poor in some cases, and less accurate than the corresponding approximation using all the summary statistics. Motivated by this observation, we describe a summary statistic choice for marginal posterior estimation that would, in principle, deliver precise marginal posterior inferences. Unfortunately, this statistic is not practically feasible to construct in the situations to which ABC is generally applied. Our more practical approach is based on crudely localizing the posterior approximation using all the summary statistics, and then subsequently conducting marginal inferences by matching low-dimensional summary statistics that are informative for different marginal parameters. We show that our two-stage approach can be thought of as sampling a logarithmic pool of posterior distributions based on low-dimensional and full summary statistics, respectively. In Appendix A of the supplementary material, we describe how our article is related to other parts of the literature, including other high-dimensional likelihood-free methods, machine learning approaches and how our approach is relevant beyond the ABC setting. We demonstrate our findings and method on low to moderate dimensional simulated examples. We leave the significant challenges of extending this work to high-dimensional models and model misspecification for future research.
2 Marginal Approximations in Likelihood-Free Inference
2.1 Approximate Bayesian Computation
We denote the parameter of the statistical model of interest as , where p is the number of parameters. The observed data is given by where n denotes the number of observations. Ideally we wish to base our statistical inferences on the posterior density
where is the likelihood function and the prior density. However, for many complex models of interest, may be too computationally expensive to permit the application of exact methods, up to Monte Carlo error, for approximating expectations with respect to the desired posterior. In such situations, we can resort to likelihood-free methods, which replace likelihood evaluations with model simulations, to obtain an approximation to the posterior.
Perhaps the most popular statistical likelihood-free method is ABC. In ABC, we often first choose a summary statistic function, where and d is the number of summary statistics, to map the data to a lower dimensional space. We then compare observed data y and simulated data on the basis of the corresponding observed and simulated summary statistics. Overloading notation, we write the summary statistic function evaluated at a generic value for the data simply by S. When it is necessary to distinguish between the summary statistic function evaluated at the observed data y and simulated data x, we denote these by , and , respectively.
Write for the partial posterior distribution conditioned on the summary statistic Sy, where is the summary statistic likelihood evaluated at . In ABC, the posterior density is approximated as follows: (1) (1) where denotes the ABC posterior, and (2) (2) with a distance function that compares observed and simulated data, and a kernel function that is designed to be relatively large when ρ is small. We will refer to defined in (2) as the ABC likelihood. It is a kernel smoothed version of the true summary statistic likelihood . Ultimately, for a given S and ρ, the bandwidth parameter of the kernel ϵ, also referred to as the ABC tolerance, defines when the observed and simulated data are considered close. ABC can be more easily understood when using the indicator kernel function, , which is equal to one when and 0 otherwise. The integral in (1) for a given θ can be unbiasedly estimated by taking a single draw from the model and evaluating .
Although there are many algorithms for sampling from the ABC posterior (Sisson and Fan Citation2018), in this article we use the sequential Monte Carlo (SMC) ABC algorithm of Drovandi and Pettitt (Citation2011a). This method generates samples (often referred to as particles in the SMC context) from an adaptive sequence of ABC posteriors with decreasing ABC thresholds. It does this by eliminating a proportion of particles with the highest discrepancy at each iteration. The population of particles is rejuvenated via a resampling and move step, the latter achieved with an MCMC ABC kernel to preserve the distribution of particles. The number of MCMC iterations applied to each particle adapts with the overall MCMC acceptance rate. In this article we stop the SMC ABC algorithm when the acceptance rate in the MCMC step drops below 1%, since it implies that a great deal more computation is required to progress to smaller values of ϵ. The proposal distribution in the MCMC ABC kernel is a Gaussian random walk, with a covariance tuned based on the population of SMC particles, which is reasonably efficient for the examples considered in this article.
2.2 Marginal Approximations
It is well known that simple ABC methods struggle to produce accurate approximations of the (partial) posterior as the dimension of the summary statistic increases. A Monte Carlo estimate for the ABC likelihood (2) results in a conditional kernel density estimator of the likelihood for the summaries, the statistical behavior of which is known to severely degrade as the dimension of the summary statistic increases, even if the ABC tolerance is favorably chosen. Results in Blum (Citation2010), Blum and François (Citation2010), and Barber, Voss, and Webster (Citation2015) make the effect of the dimension on simple ABC algorithms more precise.
Further theoretical results for ABC methods are described in Li and Fearnhead (Citation2018b), Frazier et al. (Citation2018), and Li and Fearnhead (Citation2018a) consider regression adjusted ABC methods. Li and Fearnhead (Citation2018a, Citation2018b) obtain interesting results about both point estimation for the ABC posterior mean, and accurate uncertainty quantification of the ABC posterior, for rejection and importance sampling ABC algorithms with and without regression adjustment for an appropriately chosen sample-size dependent tolerance. With an adaptively chosen and informative proposal distribution, so long as the tolerance goes to zero fast enough, the authors demonstrate that ABC regression adjustment methods can control the Monte Carlo error of the posterior approximation and provide correct uncertainty quantification. However, it is not so easy to disentangle the impact of the summary statistic dimension in this theory, since the construction of the required proposal becomes more difficult as the parameter dimension increases, which is one situation when high-dimensional summary statistics are needed. Furthermore, in the case of simple accept/reject ABC, results in Frazier et al. (Citation2018) demonstrate that to control the Monte Carlo error the tolerance used in ABC must be chosen as an increasing function of the summary statistic dimension. We note that, in the BSL framework, Frazier et al. (Citation2022) obtain similar results to those of Li and Fearnhead (Citation2018a) for regression adjusted ABC, showing that the two methods behave similarly from a computational standpoint.
An appealing approach for overcoming issues associated with likelihood-free methods in high dimensions is to approximate low-dimensional posterior distributions (Nott et al. Citation2014; Li et al. Citation2017). Let θj be some component of θ. Note that θj may consist of more than one parameter, but for the purposes of this article it suffices to treat θj as a scalar. We note that θj may be informed by a relatively small number of summary statistics. We denote the corresponding summary statistic function as where , and the observed statistic as . The motivation for this is that could be a much better approximation to compared to for two inter-related reasons: first, since Sj is lower dimensional, it is easier to find matches between observed and simulated summary statistic vectors within a given tolerance region; second, since we are attempting to match summaries in a lower-dimensional space, we can more readily control the approximation error introduced through the use of a positive tolerance (i.e., it is generally the case that the tolerance value ϵj can be much smaller than ϵ). For these reasons, there are meaningful benefits to employing low-dimensional summaries when targeting marginal inference in ABC.
In this article we demonstrate that, even with an intuitively reasonable choice of Sj, sometimes the approximation can be less accurate than , and substantially so. As an example of an intuitively reasonable choice, the semi-automatic ABC method of Fearnhead and Prangle (Citation2012) produces summary statistics which are estimated posterior means for θ, and Sj could be obtained by extracting estimates for θj. That such an intuitive choice can perform badly is perhaps well-known, but it is worthwhile to demonstrate this in a simple normal example where using sufficient statistics to produce posterior approximations may not produce accurate inferences for θj.
2.2.1 The Normal Case
Write for a normal distribution with mean μ and variance , and write for the value of its density at z. Let be a random sample drawn from a distribution with unknown μ and . We assume that μ and are independent a priori and allocate priors: and where denotes the inverse-gamma distribution with shape α and scale β and are fixed. A minimal and Bayes sufficient statistic for estimating jointly is and , that is, . Since and s2 are direct point estimates of μ and , respectively, it might be tempting to consider approximating the marginals of these parameters conditioning only on their respective point estimate statistics.
A certain realization of y with produces the results shown in .Footnote1 It is evident that produces a diffuse approximation of the actual marginal posterior, whereas the approximation is highly accurate for the actual -marginal.
To understand the findings in , first rewrite the normal likelihood as
For inference on μ, the above decomposition makes clear that the marginal posterior for μ will depend on s2: under the inverse-gamma prior for we have In contrast, if we take as our “likelihood” for inference on μ, , under the same prior for , we have the marginal posterior While the two posterior locations are similar, their scales are not, with the use of the simpler likelihood resulting in a loss of precision, see .
Now we turn to the marginal approximation of . The term is nonconstant for all except in the case where . Hence, the likelihood contribution of is informative for inference on based on this likelihood; see, for example, Zhu and Reid (Citation1994) for additional discussion on this example.
However, the marginal posterior for is , while the marginal posterior for is While the two posteriors appear different, the integrand in can be shown to be constant as a function of under a diffuse prior for μ. Hence, even though the likelihood contribution containing , that is, , does carry meaningful information about , the “likelihood” that results for integrating out μ, under a diffuse prior, ensures that the integrated component has no information about that is not already contained in . This explains the accurate results for in . In more substantive problems pertinent to likelihood-free inference, the integrated likelihood term that results from marginalization is unlikely to simplify, and we should not expect accurate marginal approximations even for a seemingly obvious choice of low-dimensional summary statistic.
On a more intuitive level, s2 is obviously highly informative about , and its distribution does not depend on μ. This might be the best situation that we can hope for when conditioning on a low-dimensional summary statistic consisting of point estimates of parameters in likelihood-free inference. In contrast, even though is informative about μ, its distribution depends on .
2.2.2 A General Issue
The simple normal example clearly illustrates the potential inaccuracy of using marginal approximations based on summary statistics, and, thus, why the use of marginal approximations in ABC may produce inaccurate inferences. Therefore, it is helpful to give a more general formulation of the lessons of this example.
Consider that we wish to conduct inference on the unknown parameter conditional on a vector of observed summary statistics Sy, which we partition as , and which has dimension at least as large as θ. Furthermore, consider the unrealistic scenario where the likelihood for the observed summary Sy can be factorized as (3) (3)
The factorization in (3) allows us to write the joint posterior as
We define the “integrated likelihood” for ψ, evaluated at , and conditional on , as where the notation clarifies that we have integrated out λ from the likelihood contribution using the prior . The marginal posterior can then be written as
Even if (3) is valid, still depends on through . The only way that the posterior does not depend on is when, for the observed value Sy, the term is a constant function of ψ. The latter requirement is essentially the requirement that the likelihood contribution is S-nonformative (short for non-informative), a concept due to Barndorff-Nielsen (Citation1976), when conducting inference on ψ.
Furthermore, the toy normal example makes clear that even in linear exponential families where the resulting decomposition in (3) is satisfied, the likelihood can depend on ψ. That is, while linear exponential families have minimally sufficient summaries, those same summaries are not generally S-nonformative when considering inference for a sub-vector of parameters, for example, ψ. Given this finding, there is no reason to suspect that marginal approximations by themselves will yield accurate partial posterior inference unless the summary statistics are chosen very carefully. The example in Appendix B of the supplementary material demonstrates how such a careful summary selection can lead to accurate marginal approximations. However, as we demonstrate in Section 4, finding summary statistics that satisfy the conditions outlined in this section is not an easy task.
3 Accurate Marginal Approximations via Localization
In Appendix C of the supplementary materials we show that ideal summary statistics for estimating the marginal partial posterior of θj are its corresponding marginal posterior moments. For example, suppose we are interested in estimating the one-dimensional parameter θj, and define an additional two-dimensional summary statistic vector . Conforming to our previous notation, we write and for the value of for simulated data x and observed data y, respectively. We show the posterior mean and variance of is exactly the posterior mean and variance of . That is, we can match the mean and variance of the partial posterior with a much lower dimensional summary statistic, and this result extends to higher order moments. However, we also explain in Appendix C of the supplementary materials why it is practically difficult to approximate these statistics in the likelihood-free setting, especially for higher-order moments. Thus, as a more practical method, we propose the below localisation approach.
3.1 Localization Approach
We show how the issue described in Section 2.2.2 can be mitigated by first localizing the approximation using the full set of summary statistics S and then focusing on matching the summary subset Sj. More specifically, we consider the following approximate posterior for estimating the θj-marginal (4) (4)
We abuse notation here and write for the metric on both full and reduced summary statistic spaces. Here, the role of is to provide an initial crude approximation of the posterior, so that the parameters are more tightly constrained compared to the prior. Note that we do not use the data twice; in the second constraint we simply impose a tighter restriction on Sj that is designed to be informative about θj. Our approach is related to other ABC methods which use a pilot run to truncate the prior such as those of Fearnhead and Prangle (Citation2012) and Blum and François (Citation2010). However, the key difference in our approach is that different summary statistics are used at the pilot and analysis stage.
We now provide some intuition on why this idea is helpful for marginal approximations. The joint selection condition is needed to ensure that values of θ are in the high density region for the approximation only if they produce reasonable global agreement for the entire vector of summaries; while the marginal selection condition for Sj allows us to focus on specific regions of the marginal parameter space where is particularly informative about the unknown θj. However, in many cases the likelihood for the marginal summary statistic will be a complex function of all the parameters, and the joint selection step is critical as otherwise conducting inference using only the marginal selection condition could lead to diffuse posteriors, or, at worst, a complete lack of point identification. We later present examples of both types of behavior, which emphasizes the importance of including the joint selection condition when conducting marginal inferences.
To implement this in practice we use a pilot run of SMC ABC with the full set of statistics S for a relatively short time until the MCMC acceptance rate drops below some threshold much greater than 1%. Then we perform a second SMC ABC step initialized at the pilot approximation and aim to produce closer matches based on Sj (i.e., reduce ϵj) until the MCMC acceptance rate drops below 1%. Note that we also check in the second stage if a proposal satisfies the pilot constraint . If it does not, the proposal is rejected even if it matches the current constraint based on Sj. It is important that the acceptance rate threshold for the pilot run is set relatively high. First, we do not want to use much computation time on the pilot run. Second, in the continuation run, we want to reduce the tolerance associated with the marginal summary statistics greatly, and this will be difficult to do computationally if it is already difficult to match on the pilot tolerance. In our experiments, we use a pilot % stopping rule of 10%–30%, depending on the parameter dimensionality. Another possible stopping rule for the pilot run is when a certain number of model simulations have been exceeded, which may be a more explicit way to control the computational effort imposed in the pilot run. The second stage starts from the pilot result, and does not require many additional SMC iterations as it aims to match only on a low-dimensional statistic. The approach is summarized in Algorithm 1.
We note that there is no guarantee that our approach will improve on marginal ABC approximations using all the statistics. Ultimately, there is still more information contained in S about θj compared to Sj, and it will depend on how useful Sj is. Furthermore, the two stage approach is computationally partly hindered by having to match also on the pilot tolerance. However, in the examples of this article and the supplementary material, we obtain very good empirical performance in the large majority of cases.
Algorithm 1
SMC ABC approach for accurate estimation of .
Inputs: Acceptance rate thresholds for the pilot, , and continued, , SMC ABC runs. Note that . Observed data y, prior distribution , discrepancy functions and .
Outputs: Tolerances ϵ0 and ϵj. Samples from the approximate posterior .
1: Run SMC ABC, reducing the tolerance with respect to the discrepancy function until the acceptance rate falls below . This produces a sample from where and .
2: Compute for . This produces the initial value of . We can now think of as an ABC sample from .
2: Continue running SMC ABC, reducing the tolerance ϵj with respect to the discrepancy function , whilst still matching on for a fixed ϵ0, until the acceptance rate drops below . This produces samples based on the approximate posterior with kernel .
3.2 Interpretation in Terms of Logarithmic Pooling
The above approach has an interesting interpretation in terms of logarithmic pooling of posterior densities conditioned on summary statistic vectors Sy and , respectively. For two densities and their log-pooled density with weight α, , is where is a normalizing constant, and we note that this definition can be extended to accommodate any finite number of densities (Genest, McConway, and Schervish Citation1986). Logarithmic pools and other related pooling methods are often used for constructing consensus priors when prior distributions from multiple individuals are available, in the literature on optimal (Bayesian) decision making (see Genest and Zidek Citation1986 for a review) and in the literature on combinations of forecast distributions (see Clements and Harvey Citation2011 for a review).Footnote2
A key feature of logarithmic pooling is that a high density value in the pooled density must necessarily have a high density value in all the densities being pooled. We now argue that that this feature can be used to obtain the joint matching condition in the SMC algorithm described above. Consider once again the ABC posterior density (1). Suppose we use the uniform kernel, so that . Considering the summary statistic Sy, and denoting the tolerance in our approximation by ϵ0, the joint ABC density of parameter and summary is given by (5) (5)
Integrating out Sx in (5) gives the θ-marginal density given by (1). A similar ABC posterior density on θ and Sx conditional on , with tolerance denoted ϵj is (6) (6)
It is immediate that for any , then for (5) and (6), their log-pooled density is (7) (7) and the SMC-ABC algorithm described above targets sampling from the density (7), which involves the joint matching condition. Note that if we had integrated out Sx first in (5) and (6) and then pooled, this does not give the same result as pooling (5) and (6) and integrating out Sx afterwards. Based on the earlier interpretation of logarithmic pooling, a parameter value will only have high density under the pooled posterior if it has high density under the pilot ABC posterior and the ABC posterior where only Sj is matched.
The above makes it clear that, at least in the case of the uniform kernel, so long as , the weights in the logarithmic pool do not influence the resulting posterior. The same result will not apply if one uses a bounded kernel function such as the triangular or Epanechnikov kernels. However, in the case of the commonly used Gaussian kernel, the effect of the pooling parameter α is just to change the effective kernel bandwidth parameters, since raising a Gaussian density to a power changes the scale, up to a normalizing factor. Hence, a joint matching condition for the Gaussian kernel also has an interpretation in terms of logarithmic pooling. While logarithmic pooling gives an interpretation of the posterior distribution sampled by Algorithm 1, for both uniform and Gaussian kernels, we stress that it is unnecessary to choose a pooling weight in both cases.
4 Examples
In the first two examples we consider only using a small number of summary statistics, so it is possible to use ABC to generate a gold standard approximation of the posterior distribution (and its corresponding marginals) conditional on these statistics. This is intentional so that we can properly assess the performance of the various marginal approximations. In the examples we also show the distribution of the simulated summary statistics for the accepted SMC-ABC posterior samples, so that we can compare how well the different approaches can match on individual summary statistics. For simplicity we refer to such distributions as the posterior distributions of the summaries. We provide two additional low-dimensional examples in Appendices E and F of the supplementary material, respectively. These examples qualitatively have similar results to the g-and-k example in Section 4.2. In the third example in this section we show that our approach can outperform SMC ABC with all summaries on a moderate dimensional example. All examples support our assertion that seemingly obvious choices of low-dimensional summary statistics do not necessarily lead to accurate marginal approximations without localization.
4.1 MA(2) Example
Consider data generated from the following moving average model of order 2 (MA(2)) model (8) (8) where, say, and the unknown parameters are assumed to obey the invertibility region . Our prior information on is uniform over the invertibility region. A useful choice of summary statistics for the MA(2) model are the sample autocovariances , for j = 0, 1, 2. Then we have for j = 1, 2, 3. The binding functions corresponding to these summary functions (expected values of the summary statistics as a function of θ) are given by:
Since θ1 only appears in and , it is tempting to approximate the marginal for θ1 matching only S1 and S2. This turns out to be a poor choice generally as there are two solutions when we solve for θ1 and θ2 as a function of b1 and b2 in the binding function; this suggests that the posterior conditional on S1 and S2 is likely to concentrate around two values in the parameter space. Regarding estimation of the θ2-marginal, it may be sensible to match only on S3.
We consider a dataset of size generated with and . When using only S1 and S2 as summary statistics, there is a second value of matching the value of b1 and b2 obtained for the true parameter value, , which is well separated from the true parameter. We consider six ABC approximations: (1) matching all three summary statistics S1, S2, and S3, (2) matching only on S1 and S2, (3) matching only on S3, (4) a pilot ABC approximation matching S1, S2, and S3, (5) continuing from the pilot approximation matching only S1 and S2 and (6) continuing from the pilot approximation matching only S3. We stop the pilot run when the MCMC acceptance rate drops below 30%.
The results are shown in . We first focus on estimating the θ1-marginal (left plot). It can be seen that the marginal approximation of θ1 is poor when conditioning on only S1 and S2. The posterior is multi-modal, as expected. The pilot run (shown as blue dash) produces a diffuse approximation but effectively eliminates the second mode. Then, continuing from the pilot run with S1 and S2 produces an accurate marginal approximation of θ1.
Next we focus on the θ2-marginal (right plot). This time, the standard marginal approximation based on only S3 produces an accurate marginal approximation. S3 is a highly informative statistic for θ2, and the posterior for θ1 is close to the prior and shows little posterior dependence with θ2 (results not shown). As we see with the θ1 marginal results and the examples below, obtaining an accurate marginal approximation without the pilot is an exception rather than the rule. Here pilot gives a reasonable estimate of the θ2-marginal but it is not highly accurate. We suggest that this is a result of having to match on both the pilot and marginal tolerances, and so the pilot approximation may still have some influence. We will see in the below examples that continuing from a pilot ABC run is crucial to obtaining accurate marginal approximations.
As alluded to above, the MA(2) example clearly demonstrates the identification issues that can arise when doing marginal adjustments in an ad-hoc fashion. Namely, since the distribution of the summaries S1, S2 results in a multi-modal likelihood function as a consequence of the multiplicity of roots discussed above, the information in these summaries is not enough by themselves to identify the true value of θ1. However, once we have adequately restricted the parameter space for θ2, which can be achieved through a pilot selection step based on all the summaries, the multiplicity of roots is alleviated, and the summaries S1, S2 are highly informative for the true value of the unknown parameter θ1.
Since the parameter θ2 clearly influences the distribution of S1, S2, there is no hope of obtaining marginal sufficiency for θ1, when we base our inference for θ1 on the summaries S1, S2. Consequently, a joint selection step based on all the summaries will be required if we are to have any hope of identifying the unknown value of θ1.
In contrast, the asymptotic distribution of the third summary statistic is (in the limit) independent of θ1, and hence inference for θ2 based solely on S3 is likely to deliver reasonable results, as accords with the result in . In particular, we have that where the comes about since is independent of for all t. Thus, for n large which has mean θ2, variance , and does not depend on θ1.Footnote3 Hence, in large samples,
Therefore, if we conduct inference on θ2 using just S3 via the likelihood , the results are likely to be quite accurate, especially in the case where there is little information about θ2 in the conditional distribution .
In general, the marginal posterior for in this example is given by
In the extremal case where is S-nonformative for θ2, the integrated likelihood is constant in θ2, and the posterior reduces to . The latter posterior is plotted in as the green dotted curve, which we see is virtually identical to the marginal posterior based on the full set of summaries.
4.2 Univariate g-and-k Example
The g-and-k distribution is a popular test example for ABC (e.g., Drovandi and Pettitt Citation2011b; Fearnhead and Prangle Citation2012), and is defined by its quantile function (9) (9) where , b > 0, and is the standard normal quantile function. The parameters control the location, scale, skewness and kurtosis, respectively. It is common practice to set c = 0.8. As with previous studies, we consider analyzing a simulated dataset of length with a = 3, b = 1, g = 2 and k = 0.5. Writing for the uniform distribution on , the prior on each component of θ is set as with independent components.
Drovandi and Pettitt (Citation2011b) consider using as summary statistics robust measures of location, scale, skewness, and kurtosis, with where Li denotes the ith quartile and Ei denotes the ith octile. Given the natural interpretation of the g-and-k parameters, it might be intuitively appealing for each component of θ to use the corresponding robust summary statistic. That is, we use S1, S2, S3, and S4 for estimating the approximate posterior marginals of a, b, g, and k, respectively. For the pilot run we use an MCMC acceptance rate threshold of 20% (results are similar for thresholds of 15% and 10%, see Appendix D of the supplementary materials).
The results are shown in . (a) shows the univariate marginal parameter posterior approximations based on various approaches, whereas (b) shows the corresponding results for the four summary statistics. The solid red densities in (a) show the typical ABC approximation based on all the statistics S. Since there are only a small number of summary statistics, this approximation should be fairly close to . The purple dash results show the marginal approximations when only using the Sj statistic. Apart from the parameter k, these marginal approximations are substantially worse than the marginal approximations based on S, despite producing closer matches to Sj as demonstrated in (b). The pilot approximation results are shown in green dash. Our approach using the pilot run as the initial approximation followed by close matching with Sj produces the results shown as blue dash with univariate posterior approximations in agreement with the gold standard typical ABC approximation. (b) shows that this new approach results in very close matches with the Sj statistics.
One might expect the posterior of a given S1 to be accurate since the median of the g-and-k distribution is exactly a. Yet (a) reveals a poor approximation. We demonstrate in Appendix D of the supplementary material that the poor approximation relies on the fact that the distribution of S1 also depends on b, and that there is dependence between a and b in the posterior given S1.
4.3 Sparse Vector Autoregression Example
Here we consider the sparse vector autoregression example proposed in Thomas et al. (Citation2020). The model is given by where is the k-dimensional observation of the time series at time t, is the transition matrix and is a k-dimensional noise vector with being the noise parameter. Thomas et al. (Citation2020) consider a sparse matrix X where each variable is coupled with only one other variable, so that the only off-diagonal entries that are nonzero are the ones associated with these couplings. That is, if variable i is coupled with variable j, , then and we also set . We set the diagonal elements of X to be –0.1, which seems to ensure stability of the simulated vector autoregression.
The parameters are the nonzero off-diagonal terms of X and σ. Here we consider k = 20, which leads to p = 21 parameters, producing a moderate-dimensional example. We generate observed data with , and we assign a prior . For the elements of X, we draw the couplings randomly and simulate the true value of each non-diagonal entry of X from a distribution, which we also set as the prior for these parameters. We choose uniform priors for illustrative purposes; typically a prior on σ would not be bounded from above. We simulate n = 500 observations from the k-dimensional time series model. For the summary statistic corresponding to parameter we use the lag 1 autocovariance, where is the tth observation of the ith time series. To inform σ we use the sample standard deviation of the k times series concatenated together. Thus, for each parameter, there is a corresponding informative summary Sj.
Here we run SMC ABC with the k = 21 summaries, SMC ABC with each individual Sj summary and also our two stage procedure with each Sj used in the second stage and based on a pilot run with all summaries using a 10% acceptance rate as the stopping rule. The results are shown in . It is evident that running ABC to estimate each marginal posterior using its corresponding informative summary does not lead to accurate inferences. Unlike the previous low-dimensional examples, our two stage approach leads to more accurate inferences than standard SMC ABC for several of the parameters, and comparable inference for the other parameters.
5 Discussion
In this article we have presented a new approach for improving the accuracy of marginal approximations in likelihood-free inference when conditioning on a low-dimensional summary statistic. We showed through several examples and sufficiency arguments that marginal approximations can be poor unless summary statistics are chosen very carefully, even if the summaries are highly informative about the marginal posterior location. Our analysis demonstrates that, even if we have a vector of sufficient summaries, conducting ABC-based inference on a single (marginal) parameter using a well-intentioned subset of the summaries will not necessarily deliver inferences that agree with the partial posterior for the same (marginal) parameter based on the entire vector of summaries, even as .
We have described some ideal summary statistics which are low-dimensional and which can achieve accurate estimation of marginal posterior means and variances given the full set of summary statistics S as the ABC tolerance goes to zero. Given that such statistics may be difficult to approximate well, in practice, we find that joint matching on the full summary statistics with a loose tolerance and low-dimensional summary statistics Sj informative about marginal posterior location with a much more stringent tolerance results in greatly improved marginal posterior estimation compared to matching using Sj alone. The joint matching criterion can be motivated from the point of view of logarithmic pooling of ABC approximations based on the full posterior and the summary statistics Sj.
Algorithmically, our approach used a crude pilot run of likelihood-free inference based on all the summary statistics in order to help identify the parameters, before honing in on the low-dimensional statistic intended to be informative for its corresponding low-dimensional parameter in a second stage. Our approach should combine well with the popular summary statistic selection approach of Fearnhead and Prangle (Citation2012), which forms summary statistics by estimating the posterior means via regressions. Fearnhead and Prangle (Citation2012) recommend performing a pilot run of ABC first in order to improve the fit of the regressions, which ultimately improves the derived summary statistics. This pilot run can be exploited in our approach, which also requires a pilot run, and then the individual posterior mean estimates from the regression can be use as low-dimensional summaries for the corresponding low-dimensional parameters.
Our article does not rigorously answer the question of how long the pilot ABC should be run. In general, we suggest to run the pilot ABC until the acceptance rate starts to dramatically drop. We want to perform the pilot run to eliminate very poor parts of the parameter space, but we do not want to run it so long that it becomes computationally difficult to satisfy the pilot discrepancy in the subsequent ABC run that focuses on matching on a low-dimensional summary statistic.
In this article, several examples used a small number of summary statistics so that we could obtain a gold standard marginal approximation to compare our method against. These examples were sufficient to illustrate the concepts and ideas of the article. We also illustrated the benefits of our approach in a moderate dimensional example. However, we suggest that the principles of this article can be extended to high-dimensional likelihood-free problems, and we leave this for future work. In truly high-dimensional problems, it may not suffice to crudely localize on the full summary statistic S: instead, and following the discussion in Appendix C of the supplementary materials, methods will be needed to define subsets of S which determine our idealized posterior mean and variance statistics for each marginal for use in the pilot run. These subsets do not need to be very low-dimensional, but should not be so high-dimensional that ABC methods are infeasible. Furthermore, the ideas of our article could be combined with machine learning approaches to likelihood-free inference (e.g., Cranmer, Pavez, and Louppe Citation2015), which are generally more scalable.
We also leave extensions to the challenging problem of misspecification in the likelihood-free setting (e.g., Frazier, Robert, and Rousseau Citation2020) for future research. We believe there is a role to play for modular inference ideas in likelihood-free inference (e.g., Chakraborty et al. Citation2023). In this case, if marginal inference was desired on some subset of the parameters say, and it was felt that some components of the full summary statistic vector provided misleading information about that parameter, then these components of the full summary statistic vector would need to be excluded from the pilot run. There are diagnostics in the ABC literature for highlighting which summaries cannot be matched (e.g., Frazier et al. Citation2018); when all summaries cannot be matched, one may need to choose which summaries to match, or try to exclude only a sparse set of summaries. What option to choose depends on the problem at hand and what the model will be used for.
Supplemental Material
Download PDF (450.4 KB)Acknowledgments
The authors are grateful to three anonymous referees whose comments led to improvements in this article. CD is affiliated with the QUT Centre for Data Science. DJN is affiliated with the NUS Institute of Operations Research and Analytics, National University of Singapore.
Disclosure Statement
The authors report there are no competing interests to declare.
Funding
CD gratefully acknowledges support from the Australian Research Council Future Fellowship Award (FT210100260). DTF gratefully acknowledges support by the Australian Research Council through grant DE200101070.
Notes
1 Since most of the marginal posteriors do not have closed-form distribution, we normalize the densities using trapezoidal numerical integration for convenience, except for which has an inverse-gamma distribution.
2 In general, the weights for the logarithmic pool must be specified. This can be done in several ways, with the most common way being to obtain point estimates of the weights using data (Poole and Raftery 2000). Alternatively, in certain settings, a prior distribution over the weights can be specified and a posterior distribution for the weights obtained via Bayes’ theorem (see, e.g., Carvalho et al. 2022).
3 From this result we also see that, for n large, and the asymptotic distribution of the scaled and centered statistic only depends on θ2.
References
- Barber, S., Voss, J., and Webster, M. (2015), “The Rate of Convergence for Approximate Bayesian Computation,” Electronic Journal of Statistics, 9, 80–105. DOI: 10.1214/15-EJS988.
- Barndorff-Nielsen, O. (1976), “Nonformation,” Biometrika, 63, 567–571. DOI: 10.1093/biomet/63.3.567.
- Blum, M. G. B. (2010), “Approximate Bayesian Computation: A Non-parametric Perspective,” Journal of the American Statistical Association, 105, 1178–1187. DOI: 10.1198/jasa.2010.tm09448.
- Blum, M. G. B., and François, O. (2010), “Non-linear Regression Models for Approximate Bayesian Computation,” Statistics and Computing, 20, 63–73. DOI: 10.1007/s11222-009-9116-0.
- Carvalho, L. M., Villela, D. A., Coelho, F. C., and Bastos, L. S. (2022), “Bayesian Inference for the Weights in Logarithmic Pooling,” Bayesian Analysis, 1, 1–29. DOI: 10.1214/22-BA1311.
- Chakraborty, A., Nott, D. J., Drovandi, C. C., Frazier, D. T., and Sisson, S. A. (2023), “Modularized Bayesian Analyses and Cutting Feedback in Likelihood-Free Inference,” Statistics and Computing, 33, 33. DOI: 10.1007/s11222-023-10207-5.
- Chen, Y., and Gutmann, M. U. (2019), “Adaptive Gaussian Copula ABC,” in Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Volume 89 of Proceedings of Machine Learning Research, eds. K. Chaudhuri and M. Sugiyama, pp. 1584–1592. PMLR.
- Clements, M. P., and Harvey, D. I. (2011), “Combining Probability Forecasts,” International Journal of Forecasting, 27, 208–223. DOI: 10.1016/j.ijforecast.2009.12.016.
- Cranmer, K., Pavez, J., and Louppe, G. (2015), “Approximating Likelihood Ratios with Calibrated Discriminative Classifiers,” arXiv:1506.02169.
- Drovandi, C. C., and Pettitt, A. N. (2011a), “Estimation of Parameters for Macroparasite Population Evolution Using Approximate Bayesian Computation,” Biometrics, 67, 225–233. DOI: 10.1111/j.1541-0420.2010.01410.x.
- Drovandi, C. C., and Pettitt, A. N. (2011b), “Likelihood-Free Bayesian Estimation of Multivariate Quantile Distributions,” Computational Statistics and Data Analysis, 55, 2541–2556. DOI: 10.1016/j.csda.2011.03.019.
- Fearnhead, P., and Prangle, D. (2012), “Constructing Summary Statistics for Approximate Bayesian Computation: Semi-Automatic Approximate Bayesian Computation,” Journal of the Royal Statistical Society, Series B, 74, 419–474. DOI: 10.1111/j.1467-9868.2011.01010.x.
- Frazier, D., Nott, D. J., Drovandi, C., and Kohn, R. (2022), “Bayesian Inference Using Synthetic Likelihood: Asymptotics and Adjustments,” Journal of the American Statistical Association (to appear), DOI: 10.1080/01621459.2022.2086132.
- Frazier, D. T., Martin, G. M., Robert, C. P., and Rousseau, J. (2018), “Asymptotic Properties of Approximate Bayesian Computation,” Biometrika, 105, 593–607. DOI: 10.1093/biomet/asy027.
- Frazier, D. T., Robert, C. P., and Rousseau, J. (2020), “Model Misspecification in Approximate Bayesian Computation: Consequences and Diagnostics,” Journal of the Royal Statistical Society, Series B, 82, 421–444. DOI: 10.1111/rssb.12356.
- Genest, C., McConway, K. J., and Schervish, M. J. (1986), “Characterization of Externally Bayesian Pooling Operators,” The Annals of Statistics, 14, 487–501. DOI: 10.1214/aos/1176349934.
- Genest, C., and Zidek, J. V. (1986), “Combining Probability Distributions: A Critique and an Annotated Bibliography,” Statistical Science, 1, 114–135. DOI: 10.1214/ss/1177013825.
- Li, J., Nott, D., Fan, Y., and Sisson, S. (2017), “Extending Approximate Bayesian Computation Methods to High Dimensions via a Gaussian Copula Model,” Computational Statistics & Data Analysis, 106, 77–89. DOI: 10.1016/j.csda.2016.07.005.
- Li, W., and Fearnhead, P. (2018a), “Convergence of Regression-Adjusted Approximate Bayesian Computation,” Biometrika, 105, 301–318. DOI: 10.1093/biomet/asx081.
- Li, W., and Fearnhead, P. (2018b), “On the Asymptotic Efficiency of Approximate Bayesian Computation Estimators,” Biometrika, 105, 285–299. DOI: 10.1093/biomet/asx078.
- Nott, D. J., Fan, Y., Marshall, L., and Sisson, S. (2014), “Approximate Bayesian Computation and Bayes Linear Analysis: Toward High-Dimensional ABC,” Journal of Computational and Graphical Statistics, 23, 65–86. DOI: 10.1080/10618600.2012.751874.
- Poole, D., and Raftery, A. E. (2000), “Inference for Deterministic Simulation Models: The Bayesian Melding Approach,” Journal of the American Statistical Association, 95, 1244–1255. DOI: 10.1080/01621459.2000.10474324.
- Price, L. F., Drovandi, C. C., Lee, A., and Nott, D. J. (2018), “Bayesian Synthetic Likelihood,” Journal of Computational and Graphical Statistics, 27, 1–11. DOI: 10.1080/10618600.2017.1302882.
- Priddle, J. W., Sisson, S. A., Frazier, D. T., and Drovandi, C. (2022), “Efficient Bayesian Synthetic Likelihood with Whitening Transformations,” Journal of Computational and Graphical Statistics, 31, 50–63. DOI: 10.1080/10618600.2021.1979012.
- Sisson, S., and Fan, Y. (2018), “ABC Samplers,” in Handbook of Approximate Bayesian Computation, eds. S. A. Sisson, Y. Fan, and M. Beaumont, pp. 87–123, New York: Chapman and Hall/CRC.
- Sisson, S. A., Fan, Y., and Beaumont, M. (2018), Handbook of Approximate Bayesian Computation (1st ed.). Boca Raton, FL: Chapman and Hall/CRC.
- Thomas, O., Corander, J., Sá-Leão, R., de Lencastre, H., Kaski, S., and Pesonen, H. (2020), “Split-BOLFI for Misspecification-Robust Likelihood Free Inference in High Dimensions,” arXiv preprint arXiv:2002.09377v1.
- Wood, S. N. (2010), “Statistical Inference for Noisy Nonlinear Ecological Dynamic Systems,” Nature, 466, 1102–1107. DOI: 10.1038/nature09319.
- Zhu, Y., and Reid, N. (1994), “Information, Ancillarity, and Sufficiency in the Presence of Nuisance Parameters,” Canadian Journal of Statistics, 22, 111–123. DOI: 10.2307/3315827.