Full article: Improving the Accuracy of Marginal Approximations in Likelihood-Free Inference via Localization

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

Abstract

Likelihood-free methods are an essential tool for performing inference for implicit models which can be simulated from, but for which the corresponding likelihood is intractable. However, common likelihood-free methods do not scale well to a large number of model parameters. A promising approach to high-dimensional likelihood-free inference involves estimating low-dimensional marginal posteriors by conditioning only on summary statistics believed to be informative for the low-dimensional component, and then combining the low-dimensional approximations in some way. In this article, we demonstrate that such low-dimensional approximations can be surprisingly poor in practice for seemingly intuitive summary statistic choices. We describe an idealized low-dimensional summary statistic that is, in principle, suitable for marginal estimation. However, a direct approximation of the idealized choice is difficult in practice. We thus suggest an alternative approach to marginal estimation which is easier to implement and automate. Given an initial choice of low-dimensional summary statistic that might only be informative about a marginal posterior location, the new method improves performance by first crudely localizing the posterior approximation using all the summary statistics to ensure global identifiability, followed by a second step that hones in on an accurate low-dimensional approximation using the low-dimensional summary statistic. We show that the posterior this approach targets can be represented as a logarithmic pool of posterior distributions based on the low-dimensional and full summary statistics, respectively. The good performance of our method is illustrated in low to moderate dimensional examples. Computer code to implement the methods for the examples of this article is available at https://github.com/cdrovandi/ABC-marginal-approximations.

KEYWORDS:

1 Introduction

Likelihood-free statistical methods such as approximate Bayesian computation (ABC, Sisson, Fan, and Beaumont Citation2018) and Bayesian synthetic likelihood (BSL, Wood Citation2010; Price et al. Citation2018) are now commonly applied to conduct inference on the parameters of computationally expensive models for which simulation of synthetic data is easy, but likelihood computation is impractical. Such approaches aim to find values of the model parameters for which simulated and observed data are close, typically on the basis of a set of summary statistics.

Standard likelihood-free techniques such as ABC and BSL are known to scale poorly to a large number of summary statistics (Blum Citation2010; Frazier et al. Citation2022). Although BSL methods scale more readily to higher-dimensional summaries (Priddle et al. Citation2022), its performance can still degrade as the dimension of the summaries increases. In spite of this, the use of a high-dimensional summary statistic might seem desirable for several reasons. First, a high-dimensional summary statistic might be necessary to mitigate the inevitable information loss incurred by reducing the full dataset. Second, when the model contains a large number of parameters, a large number of summary statistics are required by default: when the dimension of the summary statistic is less than the parameter, it is not possible to point-identify all parameters. From a computational perspective, however, matching of simulated and observed summary statistics and posterior exploration become more difficult in higher dimensions.

In the ABC framework, a promising approach for performing likelihood-free analyses for high-dimensional parameter problems is marginal adjustment (Nott et al. Citation2014) and the closely related copula ABC approach (Li et al. Citation2017; Chen and Gutmann Citation2019). These approaches estimate low-dimensional (e.g., one or two dimensional) marginal posterior distributions, and then combine them in some way to obtain an approximation to the joint posterior distribution. The idea is that each parameter, or pair of parameters, is likely to be informed by only a small number of summary statistics. Thus, in principle, it is possible to accurately estimate these low-dimensional posterior distributions, and avoid the curse of dimensionality associated with the use of high-dimensional summary statistics. Other reasons why accurate marginal approximations might be important are that researchers often report inferences for individual parameters and that in initial modeling phases the identifiability of the model can be explored.

The purpose of this article is to demonstrate that the approximation of low-dimensional marginal posteriors based on intuitive low-dimensional summary statistic choices can be surprisingly poor in some cases, and less accurate than the corresponding approximation using all the summary statistics. Motivated by this observation, we describe a summary statistic choice for marginal posterior estimation that would, in principle, deliver precise marginal posterior inferences. Unfortunately, this statistic is not practically feasible to construct in the situations to which ABC is generally applied. Our more practical approach is based on crudely localizing the posterior approximation using all the summary statistics, and then subsequently conducting marginal inferences by matching low-dimensional summary statistics that are informative for different marginal parameters. We show that our two-stage approach can be thought of as sampling a logarithmic pool of posterior distributions based on low-dimensional and full summary statistics, respectively. In Appendix A of the supplementary material, we describe how our article is related to other parts of the literature, including other high-dimensional likelihood-free methods, machine learning approaches and how our approach is relevant beyond the ABC setting. We demonstrate our findings and method on low to moderate dimensional simulated examples. We leave the significant challenges of extending this work to high-dimensional models and model misspecification for future research.

2 Marginal Approximations in Likelihood-Free Inference

2.1 Approximate Bayesian Computation

We denote the parameter of the statistical model of interest as $θ \in Θ \subseteq R^{p}$ , where p is the number of parameters. The observed data is given by $y = (y_{1}, \dots, y_{n}) \in Y^{n}$ where n denotes the number of observations. Ideally we wish to base our statistical inferences on the posterior density $π (θ | y) \propto p (y | θ) π (θ),$

where $p (y | θ)$ is the likelihood function and $π (θ)$ the prior density. However, for many complex models of interest, $p (y | θ)$ may be too computationally expensive to permit the application of exact methods, up to Monte Carlo error, for approximating expectations with respect to the desired posterior. In such situations, we can resort to likelihood-free methods, which replace likelihood evaluations with model simulations, to obtain an approximation to the posterior.

Perhaps the most popular statistical likelihood-free method is ABC. In ABC, we often first choose a summary statistic function, $S : Y^{n} \to S$ where $S \subseteq R^{d}$ and d is the number of summary statistics, to map the data to a lower dimensional space. We then compare observed data y and simulated data $x \in Y^{n}$ on the basis of the corresponding observed and simulated summary statistics. Overloading notation, we write the summary statistic function evaluated at a generic value for the data simply by S. When it is necessary to distinguish between the summary statistic function evaluated at the observed data y and simulated data x, we denote these by $S_{y} = S (y)$ , and $S_{x} = S (x)$ , respectively.

Write $π (θ | S_{y}) \propto π (θ) p (S_{y} | θ)$ for the partial posterior distribution conditioned on the summary statistic S_y, where $p (S | θ)$ is the summary statistic likelihood evaluated at $S \in S$ . In ABC, the posterior density is approximated as follows: (1) $π (θ | y) \approx π (θ | S_{y}) \approx π_{ϵ} (θ | S_{y}) \propto π (θ) p_{ϵ} (S_{y} | θ),$ (1) where $π_{ϵ} (θ | S_{y})$ denotes the ABC posterior, and (2) $p_{ϵ} (S_{y} | θ) = \int_{Y^{n}} p (S_{x} | θ) K_{ϵ} {ρ (S_{y}, S_{x})} d S_{x},$ (2) with $ρ : S \times S \to R^{+}$ a distance function that compares observed and simulated data, and $K_{ϵ} : R^{+} \to R^{+}$ a kernel function that is designed to be relatively large when ρ is small. We will refer to $p_{ϵ} (S_{y} | θ)$ defined in (2) as the ABC likelihood. It is a kernel smoothed version of the true summary statistic likelihood $p (S_{y} | θ)$ . Ultimately, for a given S and ρ, the bandwidth parameter of the kernel ϵ, also referred to as the ABC tolerance, defines when the observed and simulated data are considered close. ABC can be more easily understood when using the indicator kernel function, $K_{ϵ} {ρ (S_{y}, S_{x})} = I {ρ (S_{y}, S_{x}) \leq ϵ}$ , which is equal to one when $ρ (S_{y}, S_{x}) \leq ϵ$ and 0 otherwise. The integral in (1) for a given θ can be unbiasedly estimated by taking a single draw $x \sim p (\cdot | θ)$ from the model and evaluating $K_{ϵ} {ρ (S_{y}, S_{x})}$ .

Although there are many algorithms for sampling from the ABC posterior (Sisson and Fan Citation2018), in this article we use the sequential Monte Carlo (SMC) ABC algorithm of Drovandi and Pettitt (Citation2011a). This method generates samples (often referred to as particles in the SMC context) from an adaptive sequence of ABC posteriors with decreasing ABC thresholds. It does this by eliminating a proportion of particles with the highest discrepancy at each iteration. The population of particles is rejuvenated via a resampling and move step, the latter achieved with an MCMC ABC kernel to preserve the distribution of particles. The number of MCMC iterations applied to each particle adapts with the overall MCMC acceptance rate. In this article we stop the SMC ABC algorithm when the acceptance rate in the MCMC step drops below 1%, since it implies that a great deal more computation is required to progress to smaller values of ϵ. The proposal distribution in the MCMC ABC kernel is a Gaussian random walk, with a covariance tuned based on the population of SMC particles, which is reasonably efficient for the examples considered in this article.

2.2 Marginal Approximations

It is well known that simple ABC methods struggle to produce accurate approximations of the (partial) posterior as the dimension of the summary statistic increases. A Monte Carlo estimate for the ABC likelihood (2) results in a conditional kernel density estimator of the likelihood for the summaries, the statistical behavior of which is known to severely degrade as the dimension of the summary statistic increases, even if the ABC tolerance is favorably chosen. Results in Blum (Citation2010), Blum and François (Citation2010), and Barber, Voss, and Webster (Citation2015) make the effect of the dimension on simple ABC algorithms more precise.

Further theoretical results for ABC methods are described in Li and Fearnhead (Citation2018b), Frazier et al. (Citation2018), and Li and Fearnhead (Citation2018a) consider regression adjusted ABC methods. Li and Fearnhead (Citation2018a, Citation2018b) obtain interesting results about both point estimation for the ABC posterior mean, and accurate uncertainty quantification of the ABC posterior, for rejection and importance sampling ABC algorithms with and without regression adjustment for an appropriately chosen sample-size dependent tolerance. With an adaptively chosen and informative proposal distribution, so long as the tolerance goes to zero fast enough, the authors demonstrate that ABC regression adjustment methods can control the Monte Carlo error of the posterior approximation and provide correct uncertainty quantification. However, it is not so easy to disentangle the impact of the summary statistic dimension in this theory, since the construction of the required proposal becomes more difficult as the parameter dimension increases, which is one situation when high-dimensional summary statistics are needed. Furthermore, in the case of simple accept/reject ABC, results in Frazier et al. (Citation2018) demonstrate that to control the Monte Carlo error the tolerance used in ABC must be chosen as an increasing function of the summary statistic dimension. We note that, in the BSL framework, Frazier et al. (Citation2022) obtain similar results to those of Li and Fearnhead (Citation2018a) for regression adjusted ABC, showing that the two methods behave similarly from a computational standpoint.

An appealing approach for overcoming issues associated with likelihood-free methods in high dimensions is to approximate low-dimensional posterior distributions (Nott et al. Citation2014; Li et al. Citation2017). Let θ_j be some component of θ. Note that θ_j may consist of more than one parameter, but for the purposes of this article it suffices to treat θ_j as a scalar. We note that θ_j may be informed by a relatively small number of summary statistics. We denote the corresponding summary statistic function as $S_{j} : Y^{n} \to S_{j}$ where $S_{j} \subseteq S$ , and the observed statistic as $S_{j, y} = S_{j} (y)$ . The motivation for this is that $π_{ϵ_{j}} (θ_{j} | S_{j, y})$ could be a much better approximation to $π (θ_{j} | S_{y})$ compared to $π_{ϵ} (θ_{j} | S_{y})$ for two inter-related reasons: first, since S_j is lower dimensional, it is easier to find matches between observed and simulated summary statistic vectors within a given tolerance region; second, since we are attempting to match summaries in a lower-dimensional space, we can more readily control the approximation error introduced through the use of a positive tolerance (i.e., it is generally the case that the tolerance value ϵ_j can be much smaller than ϵ). For these reasons, there are meaningful benefits to employing low-dimensional summaries when targeting marginal inference in ABC.

In this article we demonstrate that, even with an intuitively reasonable choice of S_j, sometimes the approximation $π_{ϵ_{j}} (θ_{j} | S_{j, y})$ can be less accurate than $π_{ϵ} (θ_{j} | S_{y})$ , and substantially so. As an example of an intuitively reasonable choice, the semi-automatic ABC method of Fearnhead and Prangle (Citation2012) produces summary statistics which are estimated posterior means for θ, and S_j could be obtained by extracting estimates for θ_j. That such an intuitive choice can perform badly is perhaps well-known, but it is worthwhile to demonstrate this in a simple normal example where using sufficient statistics to produce posterior approximations may not produce accurate inferences for θ_j.

2.2.1 The Normal Case

Write $N (μ, ϕ)$ for a normal distribution with mean μ and variance $ϕ$ , and write $N (z; μ, ϕ)$ for the value of its density at z. Let $y = (y_{1}, y_{2}, \dots, y_{n})$ be a random sample drawn from a $N (μ, ϕ)$ distribution with unknown μ and $ϕ$ . We assume that μ and $ϕ$ are independent a priori and allocate priors: $μ \sim N (μ_{0}, ϕ_{0})$ and $ϕ \sim I G (α, β)$ where $I G (α, β)$ denotes the inverse-gamma distribution with shape α and scale β and $μ_{0}, ϕ_{0}, α, β$ are fixed. A minimal and Bayes sufficient statistic for estimating $(μ, ϕ)$ jointly is $\bar{y} = \frac{1}{n} \sum_{i = 1}^{n} y_{i}$ and $s^{2} = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}$ , that is, $π (μ, ϕ | y) = π (μ, ϕ | \bar{y}, s^{2})$ . Since $\bar{y}$ and s² are direct point estimates of μ and $ϕ$ , respectively, it might be tempting to consider approximating the marginals of these parameters conditioning only on their respective point estimate statistics.

A certain realization of y with $n = 10, μ = 0, ϕ = 1, μ_{0} = 0, ϕ_{0} = 1, α = 1, β = 1$ produces the results shown in .Footnote¹ It is evident that $π (μ | \bar{y})$ produces a diffuse approximation of the actual marginal posterior, whereas the approximation $π (ϕ | s^{2})$ is highly accurate for the actual $ϕ$ -marginal.

Fig. 1. Posteriors for toy normal example.

To understand the findings in , first rewrite the normal likelihood as $\begin{matrix} N (y; μ, ϕ) \propto p_{1} (s^{2} | ϕ) p_{2} (\bar{y} | μ, ϕ) \\ = {ϕ^{- (n - 1) / 2} exp {- s^{2} / 2 ϕ}} {\frac{1}{\sqrt{2 π ϕ}} exp {- n {(\bar{y} - μ)}^{2} / 2 ϕ}} . \end{matrix}$

For inference on μ, the above decomposition makes clear that the marginal posterior for μ will depend on s²: under the inverse-gamma prior for $ϕ$ we have $π (μ | s^{2}, \bar{y}) \propto π (μ) {[β + \frac{n {(μ - \bar{y})}^{2}}{(n - 1) s^{2}}]}^{- α - n / 2} .$ In contrast, if we take as our “likelihood” for inference on μ, $exp {- n {(\bar{y} - μ)}^{2} / 2 ϕ}$ , under the same prior for $π (ϕ)$ , we have the marginal posterior $π (μ | \bar{y}) \propto π (μ) {[β + n {(μ - \bar{y})}^{2}]}^{- α - 1 / 2} .$ While the two posterior locations are similar, their scales are not, with the use of the simpler likelihood resulting in a loss of precision, see .

Now we turn to the marginal approximation of $ϕ$ . The term $p_{2} (\bar{y} | μ, ϕ)$ is nonconstant for all $ϕ$ except in the case where $\bar{y} = μ$ . Hence, the likelihood contribution of $p_{2} (\bar{y} | μ, ϕ)$ is informative for inference on $ϕ$ based on this likelihood; see, for example, Zhu and Reid (Citation1994) for additional discussion on this example.

However, the marginal posterior for $ϕ | s^{2}$ is $π (ϕ | s^{2}) \propto p_{1} (s^{2} | ϕ) π (ϕ)$ , while the marginal posterior for $ϕ | s^{2}, \bar{y}$ is $π (ϕ | s^{2}, \bar{y}) \propto p_{1} (s^{2} | ϕ) π (ϕ) \frac{1}{\sqrt{2 π ϕ}} \int exp {- n {(μ - \bar{y})}^{2} / 2 ϕ} π (μ) d μ .$ While the two posteriors appear different, the integrand in $π (ϕ | s^{2}, \bar{y})$ can be shown to be constant as a function of $ϕ$ under a diffuse prior for μ. Hence, even though the likelihood contribution containing $\bar{y}$ , that is, $p_{2} (\bar{y} | μ, ϕ)$ , does carry meaningful information about $ϕ$ , the “likelihood” that results for integrating out μ, under a diffuse prior, ensures that the integrated component $\int p_{2} (\bar{y} | μ, ϕ) π (μ) d μ$ has no information about $ϕ$ that is not already contained in $p_{1} (s^{2} | ϕ)$ . This explains the accurate results for $ϕ$ in . In more substantive problems pertinent to likelihood-free inference, the integrated likelihood term that results from marginalization is unlikely to simplify, and we should not expect accurate marginal approximations even for a seemingly obvious choice of low-dimensional summary statistic.

On a more intuitive level, s² is obviously highly informative about $ϕ$ , and its distribution does not depend on μ. This might be the best situation that we can hope for when conditioning on a low-dimensional summary statistic consisting of point estimates of parameters in likelihood-free inference. In contrast, even though $\bar{y}$ is informative about μ, its distribution depends on $ϕ$ .

2.2.2 A General Issue

The simple normal example clearly illustrates the potential inaccuracy of using marginal approximations based on summary statistics, and, thus, why the use of marginal approximations in ABC may produce inaccurate inferences. Therefore, it is helpful to give a more general formulation of the lessons of this example.

Consider that we wish to conduct inference on the unknown parameter $θ = {(ψ^{⊤}, λ^{⊤})}^{⊤}$ conditional on a vector of observed summary statistics S_y, which we partition as $S_{y} = {(S_{1, y}^{⊤}, S_{2, y}^{⊤})}^{⊤}$ , and which has dimension at least as large as θ. Furthermore, consider the unrealistic scenario where the likelihood for the observed summary S_y can be factorized as (3) $p (S_{y} | θ) \propto p_{1} (S_{1, y} | ψ) p_{2} (S_{2, y} | S_{1, y}, ψ, λ) .$ (3)

The factorization in (3) allows us to write the joint posterior as $π (θ | S_{y}) \propto π (ψ) π (λ) p_{1} (S_{1, y} | ψ) p_{2} (S_{2, y} | S_{1, y}, ψ, λ) .$

We define the “integrated likelihood” for ψ, evaluated at $S_{2, y}$ , and conditional on $S_{1, y}$ , as $p_{2}^{λ} (S_{2, y} | ψ, S_{1, y}) : = \int p_{2} (S_{2, y} | S_{1, y}, ψ, λ) π (λ) d λ,$ where the notation $p_{2}^{λ} (S_{2, y} | ψ, S_{1, y})$ clarifies that we have integrated out λ from the likelihood contribution $p_{2} (S_{2, y} | S_{1, y}, ψ, λ)$ using the prior $π (λ)$ . The marginal posterior $π (ψ | S_{y})$ can then be written as $\begin{matrix} π (ψ | S_{y}) = π (ψ) p_{1} (S_{1, y} | ψ) p_{2}^{λ} (S_{2, y} | ψ, S_{1, y}) \\ / \int p_{1} (S_{1, y} | ψ) p_{2}^{λ} (S_{2, y} | ψ, S_{1, y}) π (ψ) d ψ . \end{matrix}$

Even if (3) is valid, $π (ψ | S_{y})$ still depends on $S_{2, y}$ through $p_{2}^{λ} (S_{2, y} | ψ, S_{1, y})$ . The only way that the posterior $π (ψ | S_{y})$ does not depend on $S_{2, y}$ is when, for the observed value S_y, the term $p_{2}^{λ} (S_{2, y} | ψ, S_{1, y})$ is a constant function of ψ. The latter requirement is essentially the requirement that the likelihood contribution $p_{2} (S_{2, y} | S_{1, y}, ψ, λ)$ is S-nonformative (short for non-informative), a concept due to Barndorff-Nielsen (Citation1976), when conducting inference on ψ.

Furthermore, the toy normal example makes clear that even in linear exponential families where the resulting decomposition in (3) is satisfied, the likelihood $p_{2} (S_{2, y} | S_{1, y}, ψ, λ)$ can depend on ψ. That is, while linear exponential families have minimally sufficient summaries, those same summaries are not generally S-nonformative when considering inference for a sub-vector of parameters, for example, ψ. Given this finding, there is no reason to suspect that marginal approximations by themselves will yield accurate partial posterior inference unless the summary statistics are chosen very carefully. The example in Appendix B of the supplementary material demonstrates how such a careful summary selection can lead to accurate marginal approximations. However, as we demonstrate in Section 4, finding summary statistics that satisfy the conditions outlined in this section is not an easy task.

3 Accurate Marginal Approximations via Localization

In Appendix C of the supplementary materials we show that ideal summary statistics for estimating the marginal partial posterior of θ_j are its corresponding marginal posterior moments. For example, suppose we are interested in estimating the one-dimensional parameter θ_j, and define an additional two-dimensional summary statistic vector ${\tilde{S}}_{j} = {(E (θ_{j} | S), var (θ_{j} | S))}^{⊤} \in {\tilde{S}}_{j}$ . Conforming to our previous notation, we write ${\tilde{S}}_{x, j}$ and ${\tilde{S}}_{y, j}$ for the value of ${\tilde{S}}_{j}$ for simulated data x and observed data y, respectively. We show the posterior mean and variance of $θ_{j} | {\tilde{S}}_{y, j}$ is exactly the posterior mean and variance of $θ_{j} | S_{y}$ . That is, we can match the mean and variance of the partial posterior with a much lower dimensional summary statistic, and this result extends to higher order moments. However, we also explain in Appendix C of the supplementary materials why it is practically difficult to approximate these statistics in the likelihood-free setting, especially for higher-order moments. Thus, as a more practical method, we propose the below localisation approach.

3.1 Localization Approach

We show how the issue described in Section 2.2.2 can be mitigated by first localizing the approximation using the full set of summary statistics S and then focusing on matching the summary subset S_j. More specifically, we consider the following approximate posterior for estimating the θ_j-marginal (4) $\begin{matrix} π_{ϵ_{j}} (θ, S_{x} | S_{y}) \propto π (θ) p (S_{x} | θ) I {ρ (S_{y}, S_{x}) \leq ϵ_{0}} \\ \cdot I {ρ (S_{j, y}, S_{j, x}) \leq ϵ_{j}} . \end{matrix}$ (4)

We abuse notation here and write $ρ (\cdot, \cdot)$ for the metric on both full and reduced summary statistic spaces. Here, the role of $ϵ_{0} > ϵ_{j}$ is to provide an initial crude approximation of the posterior, so that the parameters are more tightly constrained compared to the prior. Note that we do not use the data twice; in the second constraint we simply impose a tighter restriction on S_j that is designed to be informative about θ_j. Our approach is related to other ABC methods which use a pilot run to truncate the prior such as those of Fearnhead and Prangle (Citation2012) and Blum and François (Citation2010). However, the key difference in our approach is that different summary statistics are used at the pilot and analysis stage.

We now provide some intuition on why this idea is helpful for marginal approximations. The joint selection condition is needed to ensure that values of θ are in the high density region for the approximation only if they produce reasonable global agreement for the entire vector of summaries; while the marginal selection condition for S_j allows us to focus on specific regions of the marginal parameter space where $S_{j, y}$ is particularly informative about the unknown θ_j. However, in many cases the likelihood for the marginal summary statistic will be a complex function of all the parameters, and the joint selection step is critical as otherwise conducting inference using only the marginal selection condition could lead to diffuse posteriors, or, at worst, a complete lack of point identification. We later present examples of both types of behavior, which emphasizes the importance of including the joint selection condition when conducting marginal inferences.

To implement this in practice we use a pilot run of SMC ABC with the full set of statistics S for a relatively short time until the MCMC acceptance rate drops below some threshold much greater than 1%. Then we perform a second SMC ABC step initialized at the pilot approximation and aim to produce closer matches based on S_j (i.e., reduce ϵ_j) until the MCMC acceptance rate drops below 1%. Note that we also check in the second stage if a proposal satisfies the pilot constraint $I {ρ (S_{y}, S_{x}) \leq ϵ_{0}}$ . If it does not, the proposal is rejected even if it matches the current constraint based on S_j. It is important that the acceptance rate threshold for the pilot run is set relatively high. First, we do not want to use much computation time on the pilot run. Second, in the continuation run, we want to reduce the tolerance associated with the marginal summary statistics greatly, and this will be difficult to do computationally if it is already difficult to match on the pilot tolerance. In our experiments, we use a pilot % stopping rule of 10%–30%, depending on the parameter dimensionality. Another possible stopping rule for the pilot run is when a certain number of model simulations have been exceeded, which may be a more explicit way to control the computational effort imposed in the pilot run. The second stage starts from the pilot result, and does not require many additional SMC iterations as it aims to match only on a low-dimensional statistic. The approach is summarized in Algorithm 1.

We note that there is no guarantee that our approach will improve on marginal ABC approximations using all the statistics. Ultimately, there is still more information contained in S about θ_j compared to S_j, and it will depend on how useful S_j is. Furthermore, the two stage approach is computationally partly hindered by having to match also on the pilot tolerance. However, in the examples of this article and the supplementary material, we obtain very good empirical performance in the large majority of cases.

Algorithm 1

SMC ABC approach for accurate estimation of $π (θ_{j} | S_{y})$ .

Inputs: Acceptance rate thresholds for the pilot, $p_{0, acc}$ , and continued, $p_{j, acc}$ , SMC ABC runs. Note that $p_{0, acc} ≫ p_{j, acc}$ . Observed data y, prior distribution $π (θ)$ , discrepancy functions $ρ (S_{y}, S_{x})$ and $ρ (S_{j, y}, S_{j, x})$ .

Outputs: Tolerances ϵ₀ and ϵ_j. Samples ${θ^{(i)}, ρ_{j}^{(i)}, S_{j, x}^{(i)}}_{i = 1}^{N}$ from the approximate posterior $\propto π (θ) p (S_{x} | θ) I {ρ (S_{y}, S_{x}) \leq ϵ_{0}} \cdot I {ρ (S_{j, y}, S_{j, x}) \leq ϵ_{j}}$ .

1: Run SMC ABC, reducing the tolerance with respect to the discrepancy function $ρ (S_{y}, S_{x})$ until the acceptance rate falls below $p_{0, acc}$ . This produces a sample ${θ^{(i)}, ρ^{(i)}, S_{x}^{(i)}}_{i = 1}^{N}$ from $π_{ϵ_{0}} (θ, S_{x} | S_{y})$ where $ρ^{(i)} = ρ (S_{y}, S_{x}^{(i)})$ and $ϵ_{0} \geq max {ρ^{(i)}}_{i = 1}^{N}$ .

2: Compute $ρ_{j}^{(i)} = ρ (S_{j, y}, S_{j, x}^{(i)})$ for $i = 1, \dots, N$ . This produces the initial value of $ϵ_{j} = max {ρ_{j}^{(i)}}_{i = 1}^{N}$ . We can now think of ${θ^{(i)}, ρ^{(i)}, S_{x}^{(i)}}_{i = 1}^{N}$ as an ABC sample from $\propto π (θ) p (S_{x} | θ) I {ρ (S_{y}, S_{x}) \leq ϵ_{0}} \cdot I {ρ (S_{j, y}, S_{j, x}) \leq ϵ_{j}}$ .

2: Continue running SMC ABC, reducing the tolerance ϵ_j with respect to the discrepancy function $ρ (S_{j, y}, S_{j, x})$ , whilst still matching on $I {ρ (S_{y}, S_{x}) \leq ϵ_{0}}$ for a fixed ϵ₀, until the acceptance rate drops below $p_{j, acc}$ . This produces samples ${θ^{(i)}, ρ_{j}^{(i)}, S_{j, x}^{(i)}}_{i = 1}^{N}$ based on the approximate posterior with kernel $I {ρ (S_{y}, S_{x}) \leq ϵ_{0}} \cdot I {ρ (S_{j, y}, S_{j, x}) \leq ϵ_{j}}$ .

3.2 Interpretation in Terms of Logarithmic Pooling

The above approach has an interesting interpretation in terms of logarithmic pooling of posterior densities conditioned on summary statistic vectors S_y and $S_{j, y}$ , respectively. For two densities $f_{1} (w)$ and $f_{2} (w)$ their log-pooled density with weight α, $0 \leq α \leq 1$ , is $f_{p} (w) = C (α) f_{1} {(w)}^{α} f_{2} {(w)}^{1 - α},$ where $C (α)$ is a normalizing constant, and we note that this definition can be extended to accommodate any finite number of densities (Genest, McConway, and Schervish Citation1986). Logarithmic pools and other related pooling methods are often used for constructing consensus priors when prior distributions from multiple individuals are available, in the literature on optimal (Bayesian) decision making (see Genest and Zidek Citation1986 for a review) and in the literature on combinations of forecast distributions (see Clements and Harvey Citation2011 for a review).Footnote²

A key feature of logarithmic pooling is that a high density value in the pooled density must necessarily have a high density value in all the densities being pooled. We now argue that that this feature can be used to obtain the joint matching condition in the SMC algorithm described above. Consider once again the ABC posterior density (1). Suppose we use the uniform kernel, so that $K_{ϵ} {ρ (S_{y}, S_{x})} = I {ρ (S_{y}, S_{x}) \leq ϵ}$ . Considering the summary statistic S_y, and denoting the tolerance in our approximation by ϵ₀, the joint ABC density of parameter and summary is given by (5) $π_{ϵ_{0}} (θ, S_{x} | S_{y}) \propto π (θ) p (S_{x} | θ) I {ρ (S_{y}, S_{x}) \leq ϵ_{0}} .$ (5)

Integrating out S_x in (5) gives the θ-marginal density given by (1). A similar ABC posterior density on θ and S_x conditional on $S_{j, y}$ , with tolerance denoted ϵ_j is (6) $π_{ϵ_{j}} (θ, S_{x} | S_{y}) \propto π (θ) p (S_{x} | θ) I {ρ (S_{j, y}, S_{j, x}) \leq ϵ_{j}} .$ (6)

It is immediate that for any $0 < α < 1$ , then for (5) and (6), their log-pooled density is (7) $\begin{matrix} π_{p, ϵ} (θ, S_{x} | S_{y}) \propto π (θ) p (S_{x} | θ) I {ρ (S_{y}, S_{x}) \leq ϵ_{0}} \\ \cdot I {ρ (S_{j, y}, S_{j, x}) \leq ϵ_{j}}, \end{matrix}$ (7) and the SMC-ABC algorithm described above targets sampling from the density (7), which involves the joint matching condition. Note that if we had integrated out S_x first in (5) and (6) and then pooled, this does not give the same result as pooling (5) and (6) and integrating out S_x afterwards. Based on the earlier interpretation of logarithmic pooling, a parameter value will only have high density under the pooled posterior if it has high density under the pilot ABC posterior and the ABC posterior where only S_j is matched.

The above makes it clear that, at least in the case of the uniform kernel, so long as $α \in (0, 1)$ , the weights in the logarithmic pool do not influence the resulting posterior. The same result will not apply if one uses a bounded kernel function such as the triangular or Epanechnikov kernels. However, in the case of the commonly used Gaussian kernel, the effect of the pooling parameter α is just to change the effective kernel bandwidth parameters, since raising a Gaussian density to a power changes the scale, up to a normalizing factor. Hence, a joint matching condition for the Gaussian kernel also has an interpretation in terms of logarithmic pooling. While logarithmic pooling gives an interpretation of the posterior distribution sampled by Algorithm 1, for both uniform and Gaussian kernels, we stress that it is unnecessary to choose a pooling weight in both cases.

4 Examples

In the first two examples we consider only using a small number of summary statistics, so it is possible to use ABC to generate a gold standard approximation of the posterior distribution (and its corresponding marginals) conditional on these statistics. This is intentional so that we can properly assess the performance of the various marginal approximations. In the examples we also show the distribution of the simulated summary statistics for the accepted SMC-ABC posterior samples, so that we can compare how well the different approaches can match on individual summary statistics. For simplicity we refer to such distributions as the posterior distributions of the summaries. We provide two additional low-dimensional examples in Appendices E and F of the supplementary material, respectively. These examples qualitatively have similar results to the g-and-k example in Section 4.2. In the third example in this section we show that our approach can outperform SMC ABC with all summaries on a moderate dimensional example. All examples support our assertion that seemingly obvious choices of low-dimensional summary statistics do not necessarily lead to accurate marginal approximations without localization.

4.1 MA(2) Example

Consider data $y = (y_{1}, y_{2}, \dots, y_{n})$ generated from the following moving average model of order 2 (MA(2)) model (8) $y_{t} = e_{t} + θ_{1} e_{t - 1} + θ_{2} e_{t - 2},$ (8) where, say, $e_{t} \sim N (0, 1)$ and the unknown parameters $θ = {(θ_{1}, θ_{2})}^{⊤}$ are assumed to obey the invertibility region $- 2 < θ_{1} < 2, θ_{1} + θ_{2} > - 1, θ_{1} - θ_{2} < 1$ . Our prior information on $θ = {(θ_{1}, θ_{2})}^{⊤}$ is uniform over the invertibility region. A useful choice of summary statistics for the MA(2) model are the sample autocovariances $η_{j} (y) = \frac{1}{T} \sum_{t = 1 + j}^{T} y_{t} y_{t - j}$ , for j = 0, 1, 2. Then we have $S_{j} = η_{j - 1}$ for j = 1, 2, 3. The binding functions corresponding to these summary functions (expected values of the summary statistics as a function of θ) are given by: $b_{1} (θ) = 1 + θ_{1}^{2} + θ_{2}^{2}, b_{2} (θ) = θ_{1} + θ_{1} θ_{2}, b_{3} (θ) = θ_{2} .$

Since θ₁ only appears in $b_{1} (θ)$ and $b_{2} (θ)$ , it is tempting to approximate the marginal for θ₁ matching only S₁ and S₂. This turns out to be a poor choice generally as there are two solutions when we solve for θ₁ and θ₂ as a function of b₁ and b₂ in the binding function; this suggests that the posterior conditional on S₁ and S₂ is likely to concentrate around two values in the parameter space. Regarding estimation of the θ₂-marginal, it may be sensible to match only on S₃.

We consider a dataset of size $n = 10, 000$ generated with $θ_{1} = 0.9$ and $θ_{2} = - 0.05$ . When using only S₁ and S₂ as summary statistics, there is a second value of $(θ_{1}, θ_{2})$ matching the value of b₁ and b₂ obtained for the true parameter value, $(θ_{1}, θ_{2}) \approx (0.4860, 0.7591)$ , which is well separated from the true parameter. We consider six ABC approximations: (1) matching all three summary statistics S₁, S₂, and S₃, (2) matching only on S₁ and S₂, (3) matching only on S₃, (4) a pilot ABC approximation matching S₁, S₂, and S₃, (5) continuing from the pilot approximation matching only S₁ and S₂ and (6) continuing from the pilot approximation matching only S₃. We stop the pilot run when the MCMC acceptance rate drops below 30%.

The results are shown in . We first focus on estimating the θ₁-marginal (left plot). It can be seen that the marginal approximation of θ₁ is poor when conditioning on only S₁ and S₂. The posterior is multi-modal, as expected. The pilot run (shown as blue dash) produces a diffuse approximation but effectively eliminates the second mode. Then, continuing from the pilot run with S₁ and S₂ produces an accurate marginal approximation of θ₁.

Fig. 2. Univariate ABC posterior approximations for θ₁ from various approaches for the MA(2) example. Here S_j indicates which statistic is used in the corresponding ABC approximation (the solid black lines are based on all three summary statistics), “pilot” indicates the pilot run with all statistics, and “pilot + x” indicates the continuation of the pilot run with statistics x. The observed summary statistic is $S_{y} \approx {(1.7999, 0.8363, - 0.0658)}^{⊤}$ . The × symbol on the x-axis represents the true value of the parameter.

Fig. 2. Univariate ABC posterior approximations for θ1 from various approaches for the MA(2) example. Here Sj indicates which statistic is used in the corresponding ABC approximation (the solid black lines are based on all three summary statistics), “pilot” indicates the pilot run with all statistics, and “pilot + x” indicates the continuation of the pilot run with statistics x. The observed summary statistic is Sy≈(1.7999,0.8363,−0.0658)⊤. The × symbol on the x-axis represents the true value of the parameter.

Next we focus on the θ₂-marginal (right plot). This time, the standard marginal approximation based on only S₃ produces an accurate marginal approximation. S₃ is a highly informative statistic for θ₂, and the posterior for θ₁ is close to the prior and shows little posterior dependence with θ₂ (results not shown). As we see with the θ₁ marginal results and the examples below, obtaining an accurate marginal approximation without the pilot is an exception rather than the rule. Here pilot $+ S_{3}$ gives a reasonable estimate of the θ₂-marginal but it is not highly accurate. We suggest that this is a result of having to match on both the pilot and marginal tolerances, and so the pilot approximation may still have some influence. We will see in the below examples that continuing from a pilot ABC run is crucial to obtaining accurate marginal approximations.

As alluded to above, the MA(2) example clearly demonstrates the identification issues that can arise when doing marginal adjustments in an ad-hoc fashion. Namely, since the distribution of the summaries S₁, S₂ results in a multi-modal likelihood function as a consequence of the multiplicity of roots discussed above, the information in these summaries is not enough by themselves to identify the true value of θ₁. However, once we have adequately restricted the parameter space for θ₂, which can be achieved through a pilot selection step based on all the summaries, the multiplicity of roots is alleviated, and the summaries S₁, S₂ are highly informative for the true value of the unknown parameter θ₁.

Since the parameter θ₂ clearly influences the distribution of S₁, S₂, there is no hope of obtaining marginal sufficiency for θ₁, when we base our inference for θ₁ on the summaries S₁, S₂. Consequently, a joint selection step based on all the summaries will be required if we are to have any hope of identifying the unknown value of θ₁.

In contrast, the asymptotic distribution of the third summary statistic is (in the limit) independent of θ₁, and hence inference for θ₂ based solely on S₃ is likely to deliver reasonable results, as accords with the result in . In particular, we have that $\begin{matrix} S_{3, x} = n^{- 1} \sum e_{t} e_{t - 2} = θ_{2} n^{- 1} \sum e_{t - 2}^{2} + θ_{1} n^{- 1} \sum e_{t - 1} e_{t - 2} \\ = θ_{2} n^{- 1} \sum e_{t - 2}^{2} + o_{p} (1 / \sqrt{n}), \end{matrix}$ where the $o_{p} (1 / \sqrt{n})$ comes about since $e_{t - 1}$ is independent of $e_{t - 2}$ for all t. Thus, for n large $S_{3, x} | θ \sim \frac{θ_{2}}{n} χ^{2} (n),$ which has mean θ₂, variance $2 θ_{2}^{2} / n$ , and does not depend on θ₁.Footnote³ Hence, in large samples, $p (S | θ) = p_{1} (S_{3} | θ_{1}, θ_{2}) p_{2} (S_{1}, S_{2} | θ, S_{3}) \approx p_{1} (S_{3} | θ_{2}) p_{2} (S_{1}, S_{2} | θ, S_{3}) .$

Therefore, if we conduct inference on θ₂ using just S₃ via the likelihood $p_{1} (S_{3} | θ_{2})$ , the results are likely to be quite accurate, especially in the case where there is little information about θ₂ in the conditional distribution $p_{2} (S_{1}, S_{2} | θ, S_{3})$ .

In general, the marginal posterior for $θ_{2} | S$ in this example is given by $\begin{matrix} π (θ_{2} | S) = π (θ_{2}) p_{1} (S_{3} | θ_{2}) p_{2}^{θ_{1}} (S_{1}, S_{2} | θ_{2}, S_{3}) \\ / \int π (θ_{2}) p_{1} (S_{3} | θ_{2}) p_{2}^{θ_{1}} (S_{1}, S_{2} | θ_{2}, S_{3}) d θ_{2} . \end{matrix}$

In the extremal case where $p_{2}^{θ_{1}} (S_{1}, S_{2} | θ, S_{3})$ is S-nonformative for θ₂, the integrated likelihood is constant in θ₂, and the posterior reduces to $π (θ_{2} | S_{3}) \propto π (θ_{2}) p_{1} (S_{3} | θ_{2})$ . The latter posterior is plotted in as the green dotted curve, which we see is virtually identical to the marginal posterior $π (θ_{2} | S)$ based on the full set of summaries.

4.2 Univariate g-and-k Example

The g-and-k distribution is a popular test example for ABC (e.g., Drovandi and Pettitt Citation2011b; Fearnhead and Prangle Citation2012), and is defined by its quantile function (9) $Q (z (p) | θ) = a + b (1 + c tanh [gz (p) / 2]) {(1 + z {(p)}^{2})}^{k} z (p),$ (9) where $- \infty < a < \infty$ , b > 0, $- \infty < g < \infty, k > - 0.5$ and $z (p) = Φ^{- 1} (p)$ is the standard normal quantile function. The parameters $θ = {(a, b, g, k)}^{⊤}$ control the location, scale, skewness and kurtosis, respectively. It is common practice to set c = 0.8. As with previous studies, we consider analyzing a simulated dataset of length $n = 10, 000$ with a = 3, b = 1, g = 2 and k = 0.5. Writing $U (a, b)$ for the uniform distribution on $[a, b]$ , the prior on each component of θ is set as $U (0, 10)$ with independent components.

Drovandi and Pettitt (Citation2011b) consider using as summary statistics robust measures of location, scale, skewness, and kurtosis, $S = {(S_{1}, S_{2}, S_{3}, S_{4})}^{⊤}$ with $\begin{matrix} S_{1} = L_{2}, S_{2} = L_{3} - L_{1}, S_{3} = \frac{L_{3} + L_{1} - 2 L_{2}}{S_{2}}, \\ S_{4} = \frac{E_{7} - E_{5} + E_{3} - E_{1}}{S_{2}}, \end{matrix}$ where L_i denotes the ith quartile and E_i denotes the ith octile. Given the natural interpretation of the g-and-k parameters, it might be intuitively appealing for each component of θ to use the corresponding robust summary statistic. That is, we use S₁, S₂, S₃, and S₄ for estimating the approximate posterior marginals of a, b, g, and k, respectively. For the pilot run we use an MCMC acceptance rate threshold of 20% (results are similar for thresholds of 15% and 10%, see Appendix D of the supplementary materials).

The results are shown in . (a) shows the univariate marginal parameter posterior approximations based on various approaches, whereas (b) shows the corresponding results for the four summary statistics. The solid red densities in (a) show the typical ABC approximation based on all the statistics S. Since there are only a small number of summary statistics, this approximation should be fairly close to $π (θ | S_{y})$ . The purple dash results show the marginal approximations when only using the S_j statistic. Apart from the parameter k, these marginal approximations are substantially worse than the marginal approximations based on S, despite producing closer matches to S_j as demonstrated in (b). The pilot approximation results are shown in green dash. Our approach using the pilot run as the initial approximation followed by close matching with S_j produces the results shown as blue dash with univariate posterior approximations in agreement with the gold standard typical ABC approximation. (b) shows that this new approach results in very close matches with the S_j statistics.

Fig. 3. Univariate ABC posterior densities for the (a) parameters and (b) summary statistics obtained from various approaches for the g-and-k example. Here “all summs” means ABC with all four summary statistics, “pilot (all summs)” means ABC with all four summary statistics but using a relatively large tolerance, “summ” means ABC using the summary statistic relevant for its corresponding parameter and “pilot + summ” means ABC using the summary statistic relevant for its corresponding parameter but also satisfying the tolerance from “pilot (all summs)”. The dot on the x-axis shows the true parameter value in (a) and the observed summary statistic in (b).

One might expect the posterior of a given S₁ to be accurate since the median of the g-and-k distribution is exactly a. Yet (a) reveals a poor approximation. We demonstrate in Appendix D of the supplementary material that the poor approximation relies on the fact that the distribution of S₁ also depends on b, and that there is dependence between a and b in the posterior given S₁.

4.3 Sparse Vector Autoregression Example

Here we consider the sparse vector autoregression example proposed in Thomas et al. (Citation2020). The model is given by $y_{t} = X y_{t - 1} + η_{t},$ where $y_{t} \in R^{k}$ is the k-dimensional observation of the time series at time t, $X \in R^{k \times k}$ is the transition matrix and $η_{t} \sim N (0, σ^{2} I)$ is a k-dimensional noise vector with $σ^{2}$ being the noise parameter. Thomas et al. (Citation2020) consider a sparse matrix X where each variable is coupled with only one other variable, so that the only off-diagonal entries that are nonzero are the ones associated with these couplings. That is, if variable i is coupled with variable j, $i \neq j$ , then $X_{i, j} \neq 0$ and we also set $X_{i, j} \neq X_{j, i}$ . We set the diagonal elements of X to be –0.1, which seems to ensure stability of the simulated vector autoregression.

The parameters are the nonzero off-diagonal terms of X and σ. Here we consider k = 20, which leads to p = 21 parameters, producing a moderate-dimensional example. We generate observed data with $σ = 0.1$ , and we assign a prior $σ \sim U (0, 1)$ . For the elements of X, we draw the couplings randomly and simulate the true value of each non-diagonal entry of X from a $U (- 1, 1)$ distribution, which we also set as the prior for these parameters. We choose uniform priors for illustrative purposes; typically a prior on σ would not be bounded from above. We simulate n = 500 observations from the k-dimensional time series model. For the summary statistic corresponding to parameter $X_{i, j} \neq 0$ we use the lag 1 autocovariance, $\frac{1}{n} \sum_{t = 2}^{n} y_{t}^{i} y_{t - 1}^{j}$ where $y_{t}^{i}$ is the tth observation of the ith time series. To inform σ we use the sample standard deviation of the k times series concatenated together. Thus, for each parameter, there is a corresponding informative summary S_j.

Here we run SMC ABC with the k = 21 summaries, SMC ABC with each individual S_j summary and also our two stage procedure with each S_j used in the second stage and based on a pilot run with all summaries using a 10% acceptance rate as the stopping rule. The results are shown in . It is evident that running ABC to estimate each marginal posterior using its corresponding informative summary does not lead to accurate inferences. Unlike the previous low-dimensional examples, our two stage approach leads to more accurate inferences than standard SMC ABC for several of the parameters, and comparable inference for the other parameters.

Fig. 4. Univariate posterior approximations for the sparse vector autoregession example. The first four rows correspond to the parameters in X while the bottom plot corresponds to σ. Here “all summs” means ABC with all summary statistics, “pilot (all summs)” means ABC with all summary statistics but using a relatively large tolerance, “S” means ABC using the summary statistic relevant for its corresponding parameter and “pilot + S” means ABC using the summary statistic relevant for its corresponding parameter but also satisfying the tolerance from “pilot (all summs).” The true parameter value is shown as a cross on the x-axis.

5 Discussion

In this article we have presented a new approach for improving the accuracy of marginal approximations in likelihood-free inference when conditioning on a low-dimensional summary statistic. We showed through several examples and sufficiency arguments that marginal approximations can be poor unless summary statistics are chosen very carefully, even if the summaries are highly informative about the marginal posterior location. Our analysis demonstrates that, even if we have a vector of sufficient summaries, conducting ABC-based inference on a single (marginal) parameter using a well-intentioned subset of the summaries will not necessarily deliver inferences that agree with the partial posterior for the same (marginal) parameter based on the entire vector of summaries, even as $ϵ \to 0$ .

We have described some ideal summary statistics which are low-dimensional and which can achieve accurate estimation of marginal posterior means and variances given the full set of summary statistics S as the ABC tolerance goes to zero. Given that such statistics may be difficult to approximate well, in practice, we find that joint matching on the full summary statistics with a loose tolerance and low-dimensional summary statistics S_j informative about marginal posterior location with a much more stringent tolerance results in greatly improved marginal posterior estimation compared to matching using S_j alone. The joint matching criterion can be motivated from the point of view of logarithmic pooling of ABC approximations based on the full posterior and the summary statistics S_j.

Algorithmically, our approach used a crude pilot run of likelihood-free inference based on all the summary statistics in order to help identify the parameters, before honing in on the low-dimensional statistic intended to be informative for its corresponding low-dimensional parameter in a second stage. Our approach should combine well with the popular summary statistic selection approach of Fearnhead and Prangle (Citation2012), which forms summary statistics by estimating the posterior means via regressions. Fearnhead and Prangle (Citation2012) recommend performing a pilot run of ABC first in order to improve the fit of the regressions, which ultimately improves the derived summary statistics. This pilot run can be exploited in our approach, which also requires a pilot run, and then the individual posterior mean estimates from the regression can be use as low-dimensional summaries for the corresponding low-dimensional parameters.

Our article does not rigorously answer the question of how long the pilot ABC should be run. In general, we suggest to run the pilot ABC until the acceptance rate starts to dramatically drop. We want to perform the pilot run to eliminate very poor parts of the parameter space, but we do not want to run it so long that it becomes computationally difficult to satisfy the pilot discrepancy in the subsequent ABC run that focuses on matching on a low-dimensional summary statistic.

In this article, several examples used a small number of summary statistics so that we could obtain a gold standard marginal approximation to compare our method against. These examples were sufficient to illustrate the concepts and ideas of the article. We also illustrated the benefits of our approach in a moderate dimensional example. However, we suggest that the principles of this article can be extended to high-dimensional likelihood-free problems, and we leave this for future work. In truly high-dimensional problems, it may not suffice to crudely localize on the full summary statistic S: instead, and following the discussion in Appendix C of the supplementary materials, methods will be needed to define subsets of S which determine our idealized posterior mean and variance statistics for each marginal for use in the pilot run. These subsets do not need to be very low-dimensional, but should not be so high-dimensional that ABC methods are infeasible. Furthermore, the ideas of our article could be combined with machine learning approaches to likelihood-free inference (e.g., Cranmer, Pavez, and Louppe Citation2015), which are generally more scalable.

We also leave extensions to the challenging problem of misspecification in the likelihood-free setting (e.g., Frazier, Robert, and Rousseau Citation2020) for future research. We believe there is a role to play for modular inference ideas in likelihood-free inference (e.g., Chakraborty et al. Citation2023). In this case, if marginal inference was desired on some subset of the parameters $ϕ$ say, and it was felt that some components of the full summary statistic vector provided misleading information about that parameter, then these components of the full summary statistic vector would need to be excluded from the pilot run. There are diagnostics in the ABC literature for highlighting which summaries cannot be matched (e.g., Frazier et al. Citation2018); when all summaries cannot be matched, one may need to choose which summaries to match, or try to exclude only a sparse set of summaries. What option to choose depends on the problem at hand and what the model will be used for.

Supplemental material

Supplemental Material

Download PDF (450.4 KB)

Acknowledgments

The authors are grateful to three anonymous referees whose comments led to improvements in this article. CD is affiliated with the QUT Centre for Data Science. DJN is affiliated with the NUS Institute of Operations Research and Analytics, National University of Singapore.

Disclosure Statement

The authors report there are no competing interests to declare.

Funding

CD gratefully acknowledges support from the Australian Research Council Future Fellowship Award (FT210100260). DTF gratefully acknowledges support by the Australian Research Council through grant DE200101070.

Notes

1 Since most of the marginal posteriors do not have closed-form distribution, we normalize the densities using trapezoidal numerical integration for convenience, except for

π (ϕ | s^{2})

which has an inverse-gamma distribution.

2 In general, the weights for the logarithmic pool must be specified. This can be done in several ways, with the most common way being to obtain point estimates of the weights using data (Poole and Raftery 2000). Alternatively, in certain settings, a prior distribution over the weights can be specified and a posterior distribution for the weights obtained via Bayes’ theorem (see, e.g., Carvalho et al. 2022).

3 From this result we also see that, for n large, $\sqrt{n} {S_{3, x} - θ_{2}} | θ \sim N (0, θ_{2}^{2})$ and the asymptotic distribution of the scaled and centered statistic only depends on θ₂.

References

Barber, S., Voss, J., and Webster, M. (2015), “The Rate of Convergence for Approximate Bayesian Computation,” Electronic Journal of Statistics, 9, 80–105. DOI: 10.1214/15-EJS988.
Web of Science ®Google Scholar
Barndorff-Nielsen, O. (1976), “Nonformation,” Biometrika, 63, 567–571. DOI: 10.1093/biomet/63.3.567.
Web of Science ®Google Scholar
Blum, M. G. B. (2010), “Approximate Bayesian Computation: A Non-parametric Perspective,” Journal of the American Statistical Association, 105, 1178–1187. DOI: 10.1198/jasa.2010.tm09448.
Web of Science ®Google Scholar
Blum, M. G. B., and François, O. (2010), “Non-linear Regression Models for Approximate Bayesian Computation,” Statistics and Computing, 20, 63–73. DOI: 10.1007/s11222-009-9116-0.
Web of Science ®Google Scholar
Carvalho, L. M., Villela, D. A., Coelho, F. C., and Bastos, L. S. (2022), “Bayesian Inference for the Weights in Logarithmic Pooling,” Bayesian Analysis, 1, 1–29. DOI: 10.1214/22-BA1311.
Google Scholar
Chakraborty, A., Nott, D. J., Drovandi, C. C., Frazier, D. T., and Sisson, S. A. (2023), “Modularized Bayesian Analyses and Cutting Feedback in Likelihood-Free Inference,” Statistics and Computing, 33, 33. DOI: 10.1007/s11222-023-10207-5.
Web of Science ®Google Scholar
Chen, Y., and Gutmann, M. U. (2019), “Adaptive Gaussian Copula ABC,” in Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Volume 89 of Proceedings of Machine Learning Research, eds. K. Chaudhuri and M. Sugiyama, pp. 1584–1592. PMLR.
Google Scholar
Clements, M. P., and Harvey, D. I. (2011), “Combining Probability Forecasts,” International Journal of Forecasting, 27, 208–223. DOI: 10.1016/j.ijforecast.2009.12.016.
Web of Science ®Google Scholar
Cranmer, K., Pavez, J., and Louppe, G. (2015), “Approximating Likelihood Ratios with Calibrated Discriminative Classifiers,” arXiv:1506.02169.
Google Scholar
Drovandi, C. C., and Pettitt, A. N. (2011a), “Estimation of Parameters for Macroparasite Population Evolution Using Approximate Bayesian Computation,” Biometrics, 67, 225–233. DOI: 10.1111/j.1541-0420.2010.01410.x.
PubMed Web of Science ®Google Scholar
Drovandi, C. C., and Pettitt, A. N. (2011b), “Likelihood-Free Bayesian Estimation of Multivariate Quantile Distributions,” Computational Statistics and Data Analysis, 55, 2541–2556. DOI: 10.1016/j.csda.2011.03.019.
Web of Science ®Google Scholar
Fearnhead, P., and Prangle, D. (2012), “Constructing Summary Statistics for Approximate Bayesian Computation: Semi-Automatic Approximate Bayesian Computation,” Journal of the Royal Statistical Society, Series B, 74, 419–474. DOI: 10.1111/j.1467-9868.2011.01010.x.
Google Scholar
Frazier, D., Nott, D. J., Drovandi, C., and Kohn, R. (2022), “Bayesian Inference Using Synthetic Likelihood: Asymptotics and Adjustments,” Journal of the American Statistical Association (to appear), DOI: 10.1080/01621459.2022.2086132.
PubMed Web of Science ®Google Scholar
Frazier, D. T., Martin, G. M., Robert, C. P., and Rousseau, J. (2018), “Asymptotic Properties of Approximate Bayesian Computation,” Biometrika, 105, 593–607. DOI: 10.1093/biomet/asy027.
Web of Science ®Google Scholar
Frazier, D. T., Robert, C. P., and Rousseau, J. (2020), “Model Misspecification in Approximate Bayesian Computation: Consequences and Diagnostics,” Journal of the Royal Statistical Society, Series B, 82, 421–444. DOI: 10.1111/rssb.12356.
Google Scholar
Genest, C., McConway, K. J., and Schervish, M. J. (1986), “Characterization of Externally Bayesian Pooling Operators,” The Annals of Statistics, 14, 487–501. DOI: 10.1214/aos/1176349934.
Web of Science ®Google Scholar
Genest, C., and Zidek, J. V. (1986), “Combining Probability Distributions: A Critique and an Annotated Bibliography,” Statistical Science, 1, 114–135. DOI: 10.1214/ss/1177013825.
Google Scholar
Li, J., Nott, D., Fan, Y., and Sisson, S. (2017), “Extending Approximate Bayesian Computation Methods to High Dimensions via a Gaussian Copula Model,” Computational Statistics & Data Analysis, 106, 77–89. DOI: 10.1016/j.csda.2016.07.005.
Web of Science ®Google Scholar
Li, W., and Fearnhead, P. (2018a), “Convergence of Regression-Adjusted Approximate Bayesian Computation,” Biometrika, 105, 301–318. DOI: 10.1093/biomet/asx081.
Web of Science ®Google Scholar
Li, W., and Fearnhead, P. (2018b), “On the Asymptotic Efficiency of Approximate Bayesian Computation Estimators,” Biometrika, 105, 285–299. DOI: 10.1093/biomet/asx078.
Web of Science ®Google Scholar
Nott, D. J., Fan, Y., Marshall, L., and Sisson, S. (2014), “Approximate Bayesian Computation and Bayes Linear Analysis: Toward High-Dimensional ABC,” Journal of Computational and Graphical Statistics, 23, 65–86. DOI: 10.1080/10618600.2012.751874.
Web of Science ®Google Scholar
Poole, D., and Raftery, A. E. (2000), “Inference for Deterministic Simulation Models: The Bayesian Melding Approach,” Journal of the American Statistical Association, 95, 1244–1255. DOI: 10.1080/01621459.2000.10474324.
Web of Science ®Google Scholar
Price, L. F., Drovandi, C. C., Lee, A., and Nott, D. J. (2018), “Bayesian Synthetic Likelihood,” Journal of Computational and Graphical Statistics, 27, 1–11. DOI: 10.1080/10618600.2017.1302882.
Web of Science ®Google Scholar
Priddle, J. W., Sisson, S. A., Frazier, D. T., and Drovandi, C. (2022), “Efficient Bayesian Synthetic Likelihood with Whitening Transformations,” Journal of Computational and Graphical Statistics, 31, 50–63. DOI: 10.1080/10618600.2021.1979012.
Web of Science ®Google Scholar
Sisson, S., and Fan, Y. (2018), “ABC Samplers,” in Handbook of Approximate Bayesian Computation, eds. S. A. Sisson, Y. Fan, and M. Beaumont, pp. 87–123, New York: Chapman and Hall/CRC.
Google Scholar
Sisson, S. A., Fan, Y., and Beaumont, M. (2018), Handbook of Approximate Bayesian Computation (1st ed.). Boca Raton, FL: Chapman and Hall/CRC.
Google Scholar
Thomas, O., Corander, J., Sá-Leão, R., de Lencastre, H., Kaski, S., and Pesonen, H. (2020), “Split-BOLFI for Misspecification-Robust Likelihood Free Inference in High Dimensions,” arXiv preprint arXiv:2002.09377v1.
Google Scholar
Wood, S. N. (2010), “Statistical Inference for Noisy Nonlinear Ecological Dynamic Systems,” Nature, 466, 1102–1107. DOI: 10.1038/nature09319.
PubMed Web of Science ®Google Scholar
Zhu, Y., and Reid, N. (1994), “Information, Ancillarity, and Sufficiency in the Presence of Nuisance Parameters,” Canadian Journal of Statistics, 22, 111–123. DOI: 10.2307/3315827.
Web of Science ®Google Scholar

Improving the Accuracy of Marginal Approximations in Likelihood-Free Inference via Localization

Abstract

1 Introduction