Bayesian and Monte Carlo Methods

Distilling Importance Sampling for Likelihood Free Inference

Dennis Prangle and Cecilia Viscardi
Pages 1461-1471 | Received 12 Apr 2022, Accepted 27 Jan 2023, Published online: 15 Mar 2023

Abstract

Likelihood-free inference involves inferring parameter values given observed data and a simulator model. The simulator is computer code which takes parameters, performs stochastic calculations, and outputs simulated data. In this work, we view the simulator as a function whose inputs are (1) the parameters and (2) a vector of pseudo-random draws. We attempt to infer all these inputs conditional on the observations. This is challenging as the resulting posterior can be high dimensional and involves strong dependence. We approximate the posterior using normalizing flows, a flexible parametric family of densities. Training data is generated by likelihood-free importance sampling with a large bandwidth value ϵ, which makes the target similar to the prior. The training data is “distilled” by using it to train an updated normalizing flow. The process is iterated, using the updated flow as the importance sampling proposal, and slowly reducing ϵ so the target becomes closer to the posterior. Unlike most other likelihood-free methods, we avoid the need to reduce data to low-dimensional summary statistics, and hence can achieve more accurate results. We illustrate our method in two challenging examples, on queuing and epidemiology. Supplementary materials for this article are available online.

1 Introduction

Many statistical models are specified by simulators, which can be used to generate data under the model given values of parameters. Probability calculations are often not possible for such models. In particular, it can be infeasible to evaluate the likelihood function L(θ): the probability (or density) of the observed data under parameters θ. This makes it difficult to use standard methods of statistical inference such as maximum likelihood and many Monte Carlo methods for Bayesian inference.

Likelihood-free inference (LFI)—also known as simulation-based inference—describes methods to perform inference using simulators without evaluating the likelihood function. One popular class of LFI methods is approximate Bayesian computation (ABC) (Marin et al. Citation2012). Here one runs the simulator under many θ values, and each time calculates a distance between the simulated data y and the observed data y0. The distances are used to produce a sample (or weighted sample) of θ values from an approximation to the posterior, for example, by selecting the θ values with the smallest distances.

A drawback of ABC is that it suffers from a curse of dimensionality: the cost of producing an accurate posterior approximation rises rapidly with dim(y). An intuitive explanation is that, even under the best θ choices, close matches of y and y0 are rare unless dim(y) is low. Therefore, it is common to use dimension reduction, replacing y with a vector of low dimensional summary statistics s(y) when calculating distances. However, using summary statistics typically results in a loss of posterior accuracy.

An alternative class of LFI methods use conditional density estimation (Grazian and Fan Citation2019). These estimate a density based on simulated pairs of parameters and data. Then one can condition on the observed data to approximate its posterior distribution. These methods avoid the ABC curse of dimensionality, but typically still require dimension reduction of the data to summary statistics. So the associated loss of information remains a problem.

We aim to improve the efficiency of LFI algorithms by parameter augmentation. The idea is to infer not only the model parameters θ, but also all the random variables the simulator samples during its operation. Effectively this learns to control the simulator to produce output similar to the observations. Now it is no longer the case that outputting close matches will always be rare. In principle this can avoid the ABC curse of dimensionality without the need for summary statistics.

However, inference under parameter augmentation is challenging. The posterior is high dimensional and may have strong dependencies, for example, lying on a lower dimensional manifold. This makes it hard to design effective proposals for Monte Carlo methods. We use approximate versions of the posterior, which correspond to inference under noisy observation of the data, with a bandwidth parameter ϵ controlling the variance of the noise. The resulting distribution is the prior for ϵ = ∞, and the posterior for ϵ = 0. Thus, large ϵ produces a less challenging inference problem.

We approximate these distributions using normalizing flows (Papamakarios et al. Citation2021), a family of flexible and computationally tractable distributions. They transform a random vector from a simple random distribution to a random vector from a complex distribution, by applying a sequence of learnable bijections. Normalizing flows can be trained using batches of data sampled from a target distribution.

We propose a method which alternates two steps. The first is importance sampling, using the current normalizing flow as the proposal. The target distribution is an approximate posterior with large ϵ, chosen so that importance sampling produces reasonably accurate results. The second step is to use the resulting weighted sample to train the normalizing flow further. Following Li, Turner, and Liu (Citation2017), we refer to this as distilling the importance sampling results. By iteratively distilling importance sampling results, we can target increasingly accurate posterior approximations, that is, reduce ϵ.

We demonstrate our algorithm on two challenging examples. In a queuing example, we obtain approximate inference results which are much more accurate than two ABC baselines: with and without summary statistics. In an epidemiology example, we are able to target the exact posterior.

In the remainder of the article, Section 2 presents background material. Section 3 discusses LFI parameter augmentation. Sections 4 and 5 describe our method. Section 6 illustrates it on a simple two dimensional inference task. Sections 7 and 8 present our main examples. Section 9 concludes with a discussion, including limitations and opportunities for extensions. Further details are available in the supplement, including a discussion of recommended tuning choices (Section E). Code for the examples can be found at https://github.com/dennisprangle/DistillingImportanceSampling, written using PyTorch (Paszke et al. Citation2019). All examples were run on a 16-core desktop PC.

1.1 Related Work and Novelty

Several recent articles (Müller et al. Citation2019; Arbel, Matthews, and Doucet Citation2021; Duan Citation2021; Naesseth, Lindsten, and Blei Citation2021) train a normalizing flow to use as an importance sampling proposal. Variations include training a Gaussian mixture (Jerfel et al. Citation2021) or a distribution transformed using polynomials (Cotter, Kevrekidis, and Russell Citation2020), rather than the usual neural network approaches in the normalizing flows literature (reviewed in Section 2.3). The novelty of our work is an application to likelihood-free inference. To do so, we sequentially target a sequence of increasingly accurate approximate distributions, unlike the prior work listed above.

Parameter augmentation methods for likelihood-free inference have been used previously. Graham and Storkey (Citation2017) propose a method using constrained Hamiltonian Monte Carlo (HMC). This alternates HMC steps with projection steps to move the MCMC state back onto a target manifold. A limitation of this method is that it requires a differentiable simulator. Optimization Monte Carlo (OMC) (Meeds and Welling Citation2015) generates random seeds to use within the simulator, then uses optimization to find the corresponding θ values which minimize the distance of the resulting y to y0, and computes appropriate weights. Robust OMC (Ikonomov and Gutmann Citation2020) produces multiple θ values given a random seed. Rare event ABC (Prangle, Everitt, and Kypraios Citation2018) is an MCMC algorithm which, whenever a new θ is proposed, uses rare event methods to estimate the probability of the simulator producing a close match.

Let x denote the internal random variables used by the simulator (defined more precisely in Section 3). Then OMC methods essentially sample x from its prior, which could be inefficient if the posterior for x is concentrated compared to the prior. Rare event ABC infers p(x|θ,y) for each proposed θ, which becomes expensive for long MCMC chains. In contrast to these two methods, our approach aims to infer p(θ,x|y), which we argue is more efficient.

Joint inference of θ and x can also sometimes be performed by sophisticated MCMC algorithms specialized to particular applications. For instance Shestopaloff and Neal (Citation2014) give an MCMC algorithm for the queuing example we consider in Section 7. Our goal in this article is to provide a more generic approach.

More broadly, our approach has connections to several inference methods. Concentrating on its importance sampling component, it is closely related to adaptive importance sampling (Cornuet et al. Citation2012). Concentrating on training an approximate density, it is related to the cross-entropy method (Rubinstein Citation1999), an estimation of distribution algorithm (Larrañaga and Lozano Citation2002), and reweighted wake-sleep (Bornschein and Bengio Citation2014).

2 Background

2.1 Likelihood-Free Inference

Suppose we observe data y0, assumed to be the output of a probability model with density p(y|θ) under some parameters θ, and we have access to a prior density π(θ). Bayesian inference aims to find the corresponding posterior density p(θ|y0) ∝ π(θ)p(y0|θ). When the likelihood function p(y|θ) can be evaluated, many approaches to posterior inference are available.

However, many models are specified using a simulator, typically some computer code. The simulator’s inputs are parameters θ, and the output is data y. This effectively defines a distribution (see note 1) for y conditional on θ. However, probability calculations under this distribution are typically not feasible, including evaluation of the likelihood function. So standard likelihood-based methods of inference are not possible. Likelihood-free inference (LFI) methods attempt to perform approximate inference in this setting without directly evaluating the likelihood function.

One popular LFI approach is approximate Bayesian computation (ABC). The simplest ABC algorithm, rejection ABC, samples (θ, y) pairs from the prior and simulator, and accepts those where a distance between y and y0 (e.g., the Euclidean distance ||y − y0|| if data is a vector of fixed length) is below some threshold ϵ. The accepted θ values form a sample from an approximation to the posterior. More efficient ABC algorithms exist, for instance using more sophisticated methods of sampling θ values, or using importance sampling ideas to allow weighted samples.
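As a concrete illustration, rejection ABC can be sketched in a few lines. The sketch below is written in PyTorch (used for the article's examples); `simulate` and `sample_prior` are hypothetical user-supplied functions, not part of the authors' code.

```python
import torch

def rejection_abc(simulate, sample_prior, y0, eps, n_accept):
    """Minimal rejection ABC sketch: keep theta whenever the simulated
    data lies within eps of the observations in Euclidean distance."""
    accepted = []
    while len(accepted) < n_accept:
        theta = sample_prior()                # draw theta from the prior
        y = simulate(theta)                   # run the simulator
        if torch.linalg.norm(y - y0) <= eps:  # accept close matches
            accepted.append(theta)
    return torch.stack(accepted)              # approximate posterior sample
```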

Despite these improvements, close matches of y and y0 are rare unless dim(y) is low. As a consequence, ABC suffers from a curse of dimensionality: in the asymptotic case ϵ → 0, the cost of ABC rapidly increases with dim(y). So ABC typically uses dimension reduction. That is, acceptance now occurs when ||s(y) − s(y0)|| is small, for some function s mapping data to a vector of summary statistics. The loss of information entailed by dimension reduction reduces inference accuracy (as sufficient statistics are rarely available).

See Prangle, Everitt, and Kypraios (Citation2018) (end of Section 2.1) for a brief review of the ABC curse of dimensionality, and Marin et al. (Citation2012) for a review of other relevant aspects of ABC. The supplement (Section C.1) gives more details of a specific ABC algorithm we use as a baseline in one of our examples.

As mentioned in Section 1, an alternative approach to LFI is through conditional density estimation methods (CDE), as reviewed by Grazian and Fan (Citation2019). CDE methods estimate a density, often a normalizing flow, using simulated (θ,y) pairs. The density estimated could be the joint π(θ,y) or a conditional: p(θ|y) or p(y|θ). One can then condition on the observed data to approximate its posterior distribution. The method for conditioning depends on which density was estimated. CDE methods avoid the ABC curse of dimensionality, but typically still require dimension reduction of the data to summary statistics. So the resulting loss of information remains a problem.

2.2 Importance Sampling

Let p(ξ) = p̃(ξ)/Z be a target density where only p̃(ξ) can be evaluated, and the value of the normalizing constant Z = ∫p̃(ξ)dξ is unknown. Importance sampling (IS) is a Monte Carlo method to estimate expectations of the form I = Eξ∼p[h(ξ)], for some function h. Here we review relevant aspects. For full details see, for example, Robert and Casella (Citation2013) and Rubinstein and Kroese (Citation2016).

IS requires a proposal density λ(ξ) which can easily be sampled from, and must satisfy

$$\mathrm{supp}(p) \subseteq \mathrm{supp}(\lambda), \tag{1}$$

where supp denotes support. Then

$$I = E_{\xi \sim \lambda}\left[\frac{p(\xi)}{\lambda(\xi)}\, h(\xi)\right]. \tag{2}$$

So an unbiased Monte Carlo estimate of I is

$$\hat{I}_1 = \frac{1}{NZ} \sum_{i=1}^{N} w_i\, h(\xi^{(i)}), \tag{3}$$

where ξ(1), ξ(2), …, ξ(N) are independent samples from λ, and wi = p̃(ξ(i))/λ(ξ(i)) is an importance weight.

Typically Z is estimated as $\frac{1}{N}\sum_{i=1}^{N} w_i$, giving

$$\hat{I}_2 = \sum_{i=1}^{N} w_i\, h(\xi^{(i)}) \Big/ \sum_{i=1}^{N} w_i, \tag{4}$$

a biased, but consistent, estimate of I. Equivalently,

$$\hat{I}_2 = \sum_{i=1}^{N} s_i\, h(\xi^{(i)}),$$

for normalized importance weights $s_i = w_i / \sum_{i=1}^{N} w_i$.

A drawback of IS is that it can produce estimates with large, or infinite, variance if λ is a poor approximation to p. Hence, diagnostics for the quality of the results are useful. A popular diagnostic is the effective sample size (ESS),

$$N_{ESS} = \left(\sum_{i=1}^{N} w_i\right)^2 \Big/ \sum_{i=1}^{N} w_i^2. \tag{5}$$

Under various assumptions, Var(Î2) roughly equals the variance of an idealized Monte Carlo estimate based on NESS independent samples from p(ξ) (Liu Citation1996). Elvira, Martino, and Robert (Citation2022), among others, note that ESS can be an unreliable diagnostic in practice and research is needed to propose better alternatives.
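For concreteness, here is a minimal sketch of self-normalized importance sampling with the ESS diagnostic, implementing (4) and (5); `proposal` is assumed to expose `.sample` and `.log_prob` methods, as distributions in `torch.distributions` do.

```python
import torch

def importance_estimate(h, log_p_tilde, proposal, N):
    """Self-normalized importance sampling: returns estimate (4) of
    E[h] under the target, plus the ESS diagnostic (5)."""
    xi = proposal.sample((N,))              # xi^(i) ~ lambda
    log_w = log_p_tilde(xi) - proposal.log_prob(xi)
    w = (log_w - log_w.max()).exp()         # stabilized weights w_i
    s = w / w.sum()                         # normalized weights s_i
    I_hat = (s * h(xi)).sum()               # estimate (4)
    ess = w.sum() ** 2 / (w ** 2).sum()     # diagnostic (5)
    return I_hat, ess
```

Subtracting the maximum log weight before exponentiating leaves both (4) and (5) unchanged but avoids numerical overflow.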

2.3 Normalizing Flows

A normalizing flow represents a random vector ξ with a complicated distribution as an invertible transformation of a random vector z with a simple base distribution, typically N(0,I).

Recent research has developed flexible learnable families of normalizing flows. See Papamakarios et al. (Citation2021) for a review. We focus on autoregressive neural spline flows (Durkan et al. Citation2019), which we will refer to as “spline flows” as a shorthand. See Section 5.5 for comments on exploratory work with an alternative choice.

Spline Flows Description

An autoregressive transformation transforms input vector u to output vector v using vi=τ(ui,h(u<i)) (where u<i refers to the uj values with j<i). Here τ is a monotonic and invertible transformation of its first argument. The particular transformation used depends on the second argument. So vi is a transformation of ui, and the transformation used depends only on u<i. Therefore, the overall transformation has a triangular Jacobian matrix, facilitating quick density calculations. Furthermore, a masked autoregressive neural network (Germain et al. Citation2015; Nash and Durkan Citation2019) is typically used for h. This allows h(u<i) to be computed in parallel for all i values.
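To make the autoregressive structure concrete, below is a minimal sketch with an affine choice of τ; the monotonic spline τ described next, and a masked network for h, replace these ingredients in the flows we actually use. The class and layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class ToyAutoregressiveFlow(nn.Module):
    """Sketch of an autoregressive transformation v_i = tau(u_i, h(u_{<i}))
    with affine tau. One small conditioner per coordinate stands in for
    the masked network h."""
    def __init__(self, dim, hidden=20):
        super().__init__()
        self.h = nn.ModuleList([
            nn.Sequential(nn.Linear(max(i, 1), hidden), nn.ReLU(),
                          nn.Linear(hidden, 2)) for i in range(dim)])
        self.dim = dim

    def forward(self, u):
        cols, log_det = [], 0.0
        for i in range(self.dim):
            # h(u_{<i}); a constant dummy input is used when i == 0
            ctx = u[:, :i] if i > 0 else torch.ones(u.shape[0], 1)
            log_scale, shift = self.h[i](ctx).chunk(2, dim=1)
            cols.append(u[:, i:i + 1] * log_scale.exp() + shift)  # affine tau
            log_det = log_det + log_scale.squeeze(1)
        return torch.cat(cols, dim=1), log_det
```

Because vi depends only on u≤i, the Jacobian is triangular, so its log determinant is simply the accumulated sum of log scales.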

Durkan et al. (Citation2019) propose using a spline transformation for τ. This transformation is a piecewise function based on partitioning a bounding box [−B, B] into several bins. The type of function used is chosen so it can be fully defined by: the location of knots (bin boundaries); the function values at the knots; and (in some cases) derivatives at the knots. All these details should be the output of h(u<i). We use piecewise rational quadratic transformations, following Durkan et al. (Citation2019). Outside the bounding box the function is defined to be the identity.

Spline Flows Properties

Some relevant properties of spline flows are as follows. See Papamakarios et al. (Citation2021) for full discussion of the points made here.

  • Universality. An autoregressive flow can represent any distribution meeting some simple conditions, if any τ and h functions can be used. In practice this means the expressiveness of spline flows is only limited by the complexity of the splines and neural networks used.

  • Sampling. Samples of ξ can rapidly be drawn from q(ξ;ϕ).

  • Gradient calculation. It is reasonably fast to compute ∇log q(ξ;ϕ). (Throughout the article ∇ represents the gradient operator with respect to ϕ.)

Universality means that spline flows can approximate many distributions well. Sampling and gradient calculation are required in our algorithm. Alternative types of flow can allow faster gradient calculation, but are often less expressive.

We note that Gaussian mixture models are an alternative to flows. However, their cost of sampling and density evaluation grows rapidly with dim ξ, as it involves an O([dim ξ]³) matrix inversion cost.

3 Parameter Augmentation

Here we outline an approach to likelihood-free inference which our algorithm will use. The idea is to infer not just the parameters θ, but also all the internal stochastic behavior of the simulator. To do so, we introduce a random variable x. This represents all outputs of an underlying random number generator used by the simulator. We suppose all the sampled random variables required during simulation can be expressed as transformations of θ and x. So the simulator is now a deterministic function y(ξ), where ξ represents the augmented parameters (θ,x).

Throughout this article we work with simulators where x can be expressed as a fixed length random vector x ∼ N(0, I). We denote its density as π(x). Future work on more complex simulators could consider alternative choices of x. First, x could be allowed to be dependent on θ. Second, x could be an infinite sequence of independent N(0,1) random variables. This is helpful if the number of random draws required by the simulator is unknown in advance. For instance the simulator could perform a loop for a number of iterations which depends on θ in a complex fashion.
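As an illustration, a hypothetical simulator written in this form might look as follows; the random walk model and all names here are purely illustrative.

```python
import torch

def simulator(xi, dim_y=20):
    """Toy deterministic simulator y(xi) of augmented parameters
    xi = (theta, x): a Gaussian random walk with step scale exp(theta),
    where x ~ N(0, I) supplies all the stochasticity."""
    theta, x = xi[..., 0], xi[..., 1:1 + dim_y]
    steps = torch.exp(theta).unsqueeze(-1) * x  # scaled N(0, 1) increments
    return torch.cumsum(steps, dim=-1)          # simulated data y(xi)
```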

3.1 Advantages and Challenges

An advantage of parameter augmentation is avoiding the ABC curse of dimensionality. We first give an intuitive explanation. Sampling ξ from the posterior p(ξ|y0), or a close approximation, can be viewed as controlling the simulator through x so that y(ξ) is a close match to y0. This avoids the problem of close matches being rare when only θ is controlled, as discussed in Section 2.1. Second, we give a more direct explanation. Suppose we can produce a good approximation q(ξ) of the augmented posterior (see note 2) p(ξ|y), which is suitable as an importance sampling proposal. This allows straightforward inference using standard importance sampling.

A difficulty of parameter augmentation is that the posterior for ξ often has a challenging form. A first challenge is that ξ usually has high dimension. MacKay (Citation2003) argues that importance sampling is not suitable for high-dimensional targets due to the difficulty of producing suitable proposals. Recent advances in approximating distributions have made progress on this challenge. For instance Müller et al. (Citation2019) demonstrate that normalizing flows can learn good importance sampling proposals for difficult target distributions, including high-dimensional cases. A theoretical justification for this is the universality property discussed in Section 2.3. This is the motivation for using flows in this article.

A second challenge is that parameter augmentation may result in extremely strong posterior dependence. Indeed the posterior is often a singular distribution whose mass lies on a lower dimensional manifold (Graham and Storkey Citation2017). Monte Carlo inference for such posteriors is challenging due to the difficulty of producing proposals exactly on an unknown manifold. We make progress by performing inference on a sequence of increasingly accurate approximate posteriors, which are all nonsingular.

3.2 Approximate Posteriors

We define approximate posterior densities for ξ, controlled by a bandwidth value ϵ > 0, as pϵ(ξ) = p̃ϵ(ξ)/Zϵ where

$$\tilde{p}_\epsilon(\xi) = \pi(\xi) \exp\left[-\frac{1}{2\epsilon^2} \|y(\xi) - y_0\|^2\right], \qquad Z_\epsilon = \int \tilde{p}_\epsilon(\xi)\, d\xi. \tag{6}$$

Here π(ξ) = π(θ)π(x) and ||y(ξ) − y0|| is the Euclidean distance between the simulated and observed data. Densities of the form (6) are often used in ABC and correspond to the posterior for data observed with independent N(0, ϵ²) additive errors (Wilkinson Citation2013).
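In code, the unnormalized log density of (6) is cheap to evaluate given one simulator run per ξ. A minimal sketch, assuming user-supplied `log_prior` and `simulator` functions:

```python
import torch

def log_p_tilde(xi, y0, eps, log_prior, simulator):
    """Unnormalized log density of (6):
    log pi(xi) - ||y(xi) - y0||^2 / (2 eps^2)."""
    sq_dist = ((simulator(xi) - y0) ** 2).sum(dim=-1)
    return log_prior(xi) - sq_dist / (2 * eps ** 2)
```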

Other similar approximate posteriors can be defined. For instance, the most common ABC approximation takes p̃ϵ(ξ) = π(ξ)𝟙[||y(ξ) − y0|| ≤ ϵ], where 𝟙 denotes an indicator function. We prefer the approximation (6) in this article since it has the same support as π(ξ). This makes the occurrence of zero importance weights in our algorithm less likely (or impossible if π(ξ) has full support). Such zero weights can cause numerical problems for our algorithm.

It will be useful later to extend the unnormalized density (6) to ϵ = 0 by taking

$$\tilde{p}_0(\xi) = \pi(\xi)\, \mathbb{1}[y(\xi) = y_0]. \tag{7}$$

A valid density p0 results when Z0 > 0. This is typically the case when the data y is discrete, but not when it is continuous. When Z0 = 0, the distribution for ϵ → 0 is singular with respect to π(ξ).

4 Objective and Gradient

Our algorithm approximates pϵ using a family of densities q(ξ;ϕ), typically a normalizing flow. This section introduces an objective function to judge how well q approximates pϵ. It then discusses how to estimate the gradient of this objective with respect to ϕ. Section 5 presents our algorithm, which uses these gradients to update ϕ while also reducing ϵ.

4.1 Objective

Given pϵ, we aim to minimize the inclusive Kullback-Leibler (KL) divergence,

$$\mathrm{KL}(p_\epsilon \,\|\, q) = E_{\xi \sim p_\epsilon}[\log p_\epsilon(\xi) - \log q(\xi;\phi)]. \tag{8}$$

This is equivalent to maximizing a scaled negative cross-entropy, which is our objective,

$$J_\epsilon(\phi) = Z_\epsilon\, E_{\xi \sim p_\epsilon}[\log q(\xi;\phi)].$$

(Scaling by Zϵ avoids this intractable constant appearing in our gradient estimates below.)

The inclusive KL divergence penalizes ϕ values which produce small q(ξ;ϕ) when pϵ(ξ) is large. Hence, the optimal ϕ tends to make q(ξ;ϕ) nonnegligible where pϵ(ξ) is nonnegligible, known as the zero-avoiding property. This is an intuitively attractive feature for importance sampling proposal distributions. Indeed recent theoretical work shows that, under some conditions, the sample size required in importance sampling scales exponentially with the inclusive KL divergence (Chatterjee and Diaconis Citation2018).

Our work could be adapted to use the χ² divergence (Dieng et al. Citation2017; Müller et al. Citation2019), which also has theoretical links to the sample size needed by IS (Agapiou et al. Citation2017).

4.2 Basic Gradient Estimate

Assuming standard regularity conditions (Mohamed et al. Citation2020, Section 4.3.1), the objective has gradient

$$\nabla J_\epsilon(\phi) = Z_\epsilon\, E_{\xi \sim p_\epsilon}[\nabla \log q(\xi;\phi)]. \tag{9}$$

Using (2), an importance sampling form is

$$\nabla J_\epsilon(\phi) = E_{\xi \sim \lambda}\left[\frac{\tilde{p}_\epsilon(\xi)}{\lambda(\xi)}\, \nabla \log q(\xi;\phi)\right],$$

where λ(ξ) is a proposal density satisfying (1). For now we assume λ is not a function of ϕ. An unbiased Monte Carlo gradient estimate of (9) is

$$g_1 = \frac{1}{N} \sum_{i=1}^{N} w_i\, \nabla \log q(\xi^{(i)};\phi), \tag{10}$$

where ξ(i) ∼ λ(ξ) are independent samples and wi = p̃ϵ(ξ(i))/λ(ξ(i)) are importance sampling weights. We calculate ∇log q(ξ(i);ϕ) by backpropagation (see, e.g., Baydin et al. Citation2018).

Choice of Proposal Density

In our main algorithm, defined below in Section 5, when gradient estimates are required we have an existing estimate of ϕ (i.e., at the start of the outer for loop in Algorithm 1 we have access to ϕt−1). Therefore, it is appealing to take λ(ξ) = q(ξ;ϕ). Since we use choices of q with full support, (1) is satisfied.

This λ is a function of ϕ, so the ξ(i) terms may have some dependence on ϕ. Interpreting ∇ as full differentiation could cause terms involving ∇ξ(i) to appear in g1, preventing it from being an unbiased estimate of ∇Jϵ(ϕ). To avoid this, henceforth we take ∇ as the gradient operator based on partial differentiation with respect to ϕ. (Or equivalently we take λ(ξ) = q(ξ; ⊥ϕ), where ⊥ is the “stop-gradient” operator in the notation of Foerster et al. (Citation2018). This matches the implementation in code, as ⊥ corresponds to the torch command detach.)
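In PyTorch, this convention amounts to detaching the proposal samples and computing the weights under no_grad. A sketch of estimate (10), reusing the hypothetical `log_p_tilde` sketch from Section 3.2 and assuming `flow` exposes `.sample` and `.log_prob` (with N, eps, y0, and so on as in the surrounding text):

```python
xi = flow.sample(N).detach()   # xi^(i) ~ lambda = q(.; phi); detach = stop-gradient
log_q = flow.log_prob(xi)      # log q(xi^(i); phi), tracked for backprop
with torch.no_grad():          # weights carry no gradient
    w = (log_p_tilde(xi, y0, eps, log_prior, simulator) - log_q).exp()
loss = -(w * log_q).mean()     # descending this loss ascends J_eps
loss.backward()                # the gradient of loss is -g_1
```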

4.3 Improved Gradient Estimates

Here we discuss reducing the variance and cost of g1.

Truncating Weights

To avoid high variance gradient estimates we apply truncated importance sampling (Ionides Citation2008). This truncates the weights at a maximum value ω, giving truncated importance weights w̃i = min(wi, ω). The resulting gradient estimate is

$$g_2 = \frac{1}{N} \sum_{i=1}^{N} \tilde{w}_i\, \nabla \log q(\xi^{(i)};\phi).$$

This typically has lower variance than g1, but has some bias. See the supplement (Section A) for more details and discussion, including how we choose ω automatically, and a heuristic argument why truncation bias is not problematic.

One can also think of g2 as replacing p̃ϵ(ξ(i)) with min[p̃ϵ(ξ), ωλ(ξ)]. Thus, we attempt to optimize q in a local neighborhood of λ.

A more sophisticated alternative to truncation is Pareto smoothing the largest importance weights (Yao et al. Citation2018). We do not use this as it is more expensive to implement than truncation, but it would be interesting to investigate in future work.

Resampling

Calculating g2 requires evaluating logq(ξ(i);ϕ) for 1iN. Each of these has a computational cost, but often many receive small weights and so contribute little to g2.

To reduce this cost we can discard many low weight samples, using importance resampling (Smith and Gelfand Citation1992) as follows. We sample n ≤ N times, with replacement, from the ξ(i) values with probabilities $\tilde{s}_i = \tilde{w}_i / S$ where $S = \sum_{i=1}^{N} \tilde{w}_i$. Denote the resulting samples as ξ̃(j). Then an unbiased estimate of g2 is

$$g_3 = \frac{S}{nN} \sum_{j=1}^{n} \nabla \log q(\tilde{\xi}^{(j)};\phi). \tag{11}$$
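Continuing the sketch of g1 from Section 4.2, truncation and resampling can be added in a few lines, with `omega` and `n` as tuning values:

```python
w_trunc = torch.clamp(w, max=omega)   # truncated weights w-tilde_i
S = w_trunc.sum()
idx = torch.multinomial(w_trunc / S, n, replacement=True)  # resample indices
loss = -(S / (n * N)) * flow.log_prob(xi[idx]).sum()
loss.backward()                       # the gradient of loss is -g_3, eq. (11)
```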

5 Algorithm

Algorithm 1 gives our main algorithm, and this section discusses various details of it. The algorithm outputs an ϵ value and a density q(ξ;ϕ) trained to be an importance sampling proposal for pϵ(ξ). We recommend using q in a final stage of importance sampling with a large sample size. While the density q is itself an approximation of pϵ (and therefore, of the exact posterior), in our experience it can have artifacts: see discussion at the end of Section 6.

Algorithm 1

Distilled importance sampling (DIS)

1: Input: importance sampling size N, target ESS M, batch size n, number of batches B.

2: Initialize ϕ0 (followed by pretraining if necessary) and let ϵ0 = ∞.

3: for t = 1, 2, … do

4: Sample (ξ(i))1≤i≤N from q(ξ;ϕt−1).

5: Select a new value ϵtϵt1 (see Section 5.3 for details).

6: Calculate weights wi = p̃ϵt(ξ(i))/q(ξ(i);ϕt−1) and truncate to w̃i values (see the supplement, Section A, for details).

7: for j = 1, 2, …, B do

8: Resample (ξ̃(j))1≤j≤n from (ξ(i))1≤i≤N using the normalized w̃i values as probabilities, with replacement.

9: Calculate gradient estimate g using (11).

10: Calculate ϕt,j from ϕt,j−1 and g using a step of an optimization algorithm such as Adam (where ϕt,0 = ϕt−1).

11: end for

12: Let ϕt=ϕt,B.

13: if ϵt = 0 or runtime above prespecified limit then

14: return ϵt, q(ξ;ϕt) (typically to be used as an importance sampling proposal)

15: end if

16: end for

5.1 Algorithm Overview

A summary of the algorithm follows. Steps 4–6 perform importance sampling with target pϵ. Steps 7–11 use the output to train ϕ, aiming for the resulting density to approximate the importance sampling target. As the algorithm progresses, step 5 reduces ϵ, aiming for the importance sampling ESS to equal M (a tuning choice). The goal is to make the importance sampling target closer to the posterior, slowly enough to avoid high variance gradient estimates.

To use the importance sampling output efficiently, we sample from it (with replacement) several times to create training batches, and use each batch for one update of ϕ. We fix the training batch size to n = 100, and the number of batches B to M/n. So approximately M importance sampling outputs are used for training. Limiting the number of outputs used aims to avoid overfitting from too much reuse of the same training data.

An update of ϕ is done by a step of an optimization algorithm based on gradient estimates, aiming to increase Jϵ. The gradient estimate is calculated from the current batch of training data, and is calculated using (11). We use the Adam optimization algorithm (Kingma and Ba Citation2015), and found it worked well, but many alternatives could also be used (see the review of Ruder Citation2016 for instance).

Steps 13–14 terminate the algorithm after a fixed runtime or once ϵ=0 is reached (when this is possible, i.e., for discrete data.) Alternatively, the termination decision could be based on approximate inference diagnostics, such as those of Yao et al. (Citation2018) or Huggins et al. (Citation2020).

We investigate the remaining tuning choices empirically in Section 7. For now note that N must be reasonably large since our method to update ϵt relies on making an accurate ESS estimate, as detailed in Section 5.3. A further summary and discussion of tuning choices appears in the supplement (Section E).
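Putting the pieces together, one pass of Algorithm 1 can be sketched as below; this reuses the hypothetical helpers from the Section 3.2 and 4.2 sketches, together with the `select_eps` sketch in Section 5.3 below, and is not the authors' implementation.

```python
# assumes: flow, log_p_tilde, select_eps and tuning values N, M, n, B, omega
# defined as in the surrounding text; eps starts at float("inf")
opt = torch.optim.Adam(flow.parameters())
for t in range(n_iterations):
    with torch.no_grad():                      # steps 4-6: importance sampling
        xi = flow.sample(N)
        log_q = flow.log_prob(xi)
        log_w = lambda e: log_p_tilde(xi, y0, e, log_prior, simulator) - log_q
        eps = select_eps(log_w, eps, M)        # step 5 (see Section 5.3)
        w = torch.clamp(log_w(eps).exp(), max=omega)  # step 6: truncate
    S = w.sum()
    for j in range(B):                         # steps 7-11: distillation
        idx = torch.multinomial(w / S, n, replacement=True)
        loss = -(S / (n * N)) * flow.log_prob(xi[idx]).sum()  # -g_3, eq. (11)
        opt.zero_grad(); loss.backward(); opt.step()
    if eps == 0:                               # steps 13-15: termination
        break
```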

5.2 Pretraining

Our initial target is the prior π(ξ), since we take ϵ0 = ∞. The initial q should be similar to this. Otherwise the first gradient estimates produced by importance sampling are likely to have high variance. To achieve this we often make use of pretraining. We assume the prior can easily be sampled from. Then pretraining iterates the following steps:

  1. Sample (ξ(i))1≤i≤n from π(ξ).

  2. Update ϕ using gradient $\frac{1}{n}\sum_{i=1}^{n} \nabla \log q(\xi^{(i)};\phi)$.

This maximizes the negative cross-entropy Eξ∼π[log q(ξ;ϕ)]. We use n = 100, and terminate once q(ξ;ϕ) achieves an ESS of 75 when targeting the prior in importance sampling.
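A minimal sketch of this pretraining loop, with an illustrative `sample_prior` function and a fixed iteration budget standing in for the ESS-based stopping rule:

```python
opt = torch.optim.Adam(flow.parameters())
for step in range(5000):
    xi = sample_prior(100)               # step 1: draw a batch from pi(xi)
    loss = -flow.log_prob(xi).mean()     # step 2: negative cross-entropy
    opt.zero_grad(); loss.backward(); opt.step()
    # in practice, stop once the prior-targeting ESS is high enough
```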

5.3 Selecting ϵt

We select ϵt using ESS, as in Del Moral, Doucet, and Jasra (Citation2012). Given (ξ(i))1≤i≤N sampled from q(ξ;ϕt−1), the ESS value for target pϵ(ξ) is

$$N_{ESS}(\epsilon) = \left[\sum_{i=1}^{N} w(\xi^{(i)}, \epsilon)\right]^2 \Big/ \sum_{i=1}^{N} w(\xi^{(i)}, \epsilon)^2,$$

where w(ξ, ϵ) = p̃ϵ(ξ)/q(ξ;ϕt−1).

(Also, to avoid numerical errors, we define the ESS to be zero if all weights are zero.)

In Step 5 of Algorithm 1 we first check whether NESS(ϵt−1) < M, the target ESS value. If so we set ϵt = ϵt−1. Otherwise we set ϵt to an estimate of the minimal ϵ such that NESS(ϵ) ≥ M, computed by a bisection algorithm (see note 3). We perform at least 50 iterations of bisection, and then terminate once NESS(ϵt) ∈ [M − 0.01, M + 0.01].
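A sketch of this ϵ selection step is below; `log_w_fn(eps)` returns the log importance weights for the current samples, the iteration budget is simplified, and all names are illustrative rather than the authors' code.

```python
def select_eps(log_w_fn, eps_prev, M, iters=50):
    """Bisection sketch for step 5 of Algorithm 1: find the smallest
    eps <= eps_prev whose ESS is approximately the target M."""
    def ess(eps):
        lw = log_w_fn(eps)
        w = (lw - lw.max()).exp()       # stabilized weights
        return (w.sum() ** 2 / (w ** 2).sum()).item()
    if ess(eps_prev) < M:
        return eps_prev                 # cannot reduce eps this iteration
    lo, hi = 0.0, eps_prev              # (for an infinite endpoint the paper
    for _ in range(iters):              #  takes midpoints a + 100; see note 3)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if ess(mid) < M else (lo, mid)
    return hi
```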

5.4 Reaching ϵt=0: Exact Inference

For continuous data, the set of ξ such that y(ξ) = y0 typically has probability zero (see note 4) under any choice of q(ξ;ϕ). Hence, DIS cannot reach ϵ = 0, since the unnormalized density p̃0 from (7) will almost surely result in all weights being zero. Instead, like ABC, DIS produces increasingly good posterior approximations as ϵ → 0.

However, for discrete data, it is plausible to reach ϵt=0. When this happens, the algorithm produces an approximation of the augmented posterior p0(ξ) which is a suitable proposal density for exact importance sampling.

5.5 Choice of q

For all our examples we use an autoregressive neural spline flow for q(ξ;ϕ) with five bins for each variable, and bounding box [−10, 10]. The spline details are output by an autoregressive residual neural network (Nash and Durkan Citation2019) with 3 blocks, 20 hidden features and ReLU activation. More comments on the choice of q are in Section E of the supplement.

6 Illustration: Sinusoidal Simulator

As a simple illustration, consider the simulator y(θ, x) = sin θ + x with independent priors θ ∼ U(−π, π), x ∼ N(0, 1). As observations we take y0 = 0. Thus, the exact posterior is supported on the line x = −sin θ.

It is helpful to reparametrize θ to a distribution with unbounded support. So we introduce ϑ ∼ N(0, 1) and let θ = π[2Φ(ϑ) − 1], where Φ is the N(0, 1) cumulative distribution function. We then infer ξ = (ϑ, x).
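In code, the reparametrization and simulator are one line each; a sketch using `torch.distributions` for Φ:

```python
import math
import torch

std_normal = torch.distributions.Normal(0.0, 1.0)

def simulator(xi):
    """Sinusoidal simulator on augmented parameters xi = (vartheta, x)."""
    vartheta, x = xi[..., 0], xi[..., 1]
    theta = math.pi * (2 * std_normal.cdf(vartheta) - 1)  # theta ~ U(-pi, pi)
    return torch.sin(theta) + x                           # y(theta, x)
```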

We use Algorithm 1 with N=4000 training samples and a target ESS of M=2000. These values give a clear visual illustration: we investigate efficient tuning choices later. We pretrain so that q(ξ;ϕ0) approximates the prior.

Figure 1 shows our results. The normalizing flow quickly adapts to meet the importance sampling results, and after 30 iterations we reach ϵ = 0.008, taking roughly 0.1 min. Visually, the samples lie close to the exact posterior support described above. In some panels, q has artifacts—an unwanted “tail” at its right hand end—which are removed by importance sampling. This illustrates our recommendation (see start of Section 5) to use q as an importance proposal, rather than as a posterior estimate.

Figure 1: Sinusoidal example output. In each frame, dots are 300 samples from the current density q. Crosses are 300 samples targeting pϵ(θ). These are based on the importance sampling stage of Algorithm 1 (steps 3–6) using q as a proposal, with importance resampling (see Section 4.3) used to get unweighted samples.

7 Example: M/G/1 Queue

7.1 Model

Consider an M/G/1 queuing model of a single queue of customers. Times between arrivals at the back of the queue are Exp(θ1). On reaching the front of the queue, a customer’s service time is U(θ2, θ3). All these random variables are independent. We consider a setting where only inter-departure times are observed: times between departures from the queue.

We sample a synthetic dataset of 20 observations from parameter values θ1 = 0.1, θ2 = 4, θ3 = 5. We attempt to infer these parameters under the prior θ1 ∼ U(0, 1/3), θ2 ∼ U(0, 10), θ3 − θ2 ∼ U(0, 10) (all independent).

7.2 Existing Methods

The M/G/1 model with inter-departure observations is a common benchmark for likelihood-free inference, which can often provide fast approximate inference. See Papamakarios, Sterratt, and Murray (Citation2019) for a comparison of methods for an M/G/1 example. A potential disadvantage of existing likelihood-free methods is that they typically require low dimensional summary statistics to be efficient, and this can lose some information from the data. As a likelihood-free inference baseline, we investigate an ABC-PMC algorithm. This is a modification of existing ABC-PMC methods to use the unnormalized target (6). See Section C of the supplement for full details.

We also include results from a sophisticated MCMC scheme (Shestopaloff and Neal Citation2014) for this model. This provides near-exact samples from the posterior, which is a useful gold standard comparison for the approximate inference methods. A limitation of this MCMC scheme is the difficulty in adapting it to other observation regimes—for example, observing the queue length at various times (Pickands and Stine Citation1997)—which would be relatively easy for likelihood-free methods.

7.3 DIS Implementation

We introduce ϑ and x, vectors of length 3 and 2 dim(y), and take ξ as the collection (ϑ, x). Our simulator transforms these inputs to θ(ϑ) and y(ξ), with the property that ξ ∼ N(0, I) yields a sample from the prior and the model. See the supplement (Section B) for details. We pretrain q to approximate this distribution.
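A sketch consistent with this description is given below; the authors' exact transformation is in the supplement (Section B), and the details here (such as mapping entries of x to uniforms via Φ) are illustrative.

```python
import torch

std_normal = torch.distributions.Normal(0.0, 1.0)

def mg1_simulator(theta, x, n_obs=20):
    """M/G/1 sketch driven by x ~ N(0, I): the first n_obs entries drive
    Exp(theta1) inter-arrival times, the rest U(theta2, theta3) services."""
    u = std_normal.cdf(x)                                     # uniforms
    arrivals = torch.cumsum(-torch.log(u[:n_obs]) / theta[0], dim=0)
    services = theta[1] + (theta[2] - theta[1]) * u[n_obs:]
    departures = torch.zeros(n_obs)
    prev = torch.tensor(0.0)
    for i in range(n_obs):
        prev = torch.maximum(arrivals[i], prev) + services[i]  # departure i
        departures[i] = prev
    return torch.diff(departures, prepend=torch.tensor([0.0]))  # inter-departures
```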

7.4 Results

Figure 2 (left) compares different choices of N (number of importance samples) and M (target ESS). It shows that ϵ reduces more quickly for smaller M/N and smaller N, with M/N having most influence. Note that small N also has the advantage of lower memory requirements.

Figure 2: M/G/1 results. Left: The ϵ value reached by DIS on the M/G/1 example against computation time, for various choices of N (number of importance samples) and M/N (ratio of target effective sample size to N). Right: Marginal posterior histograms for M/G/1 example. The DIS output shown is for ϵ=2.30, which took 60 min to reach. To produce DIS histograms we ran a final stage of importance sampling with 750,000 samples, and then used importance resampling (see Section 4.3) to get unweighted samples.

This tuning study suggests taking N and M/N as small as possible, while ensuring M is at least a few hundred to avoid high variance in the importance sampling estimates. We use this guidance here and in the next section’s example. In both these examples, the cost of a single simulator run is low. For more expensive models efficient tuning choices may differ.

Figure 2 (right) is based on the DIS results with N = 5000 and M = 250, followed by a final importance sampling step with 750,000 samples (see note 5), which takes a further 7.2 min. The results are a reasonably close match to near-exact MCMC output using the algorithm of Shestopaloff and Neal (Citation2014). The DIS results for θ2 and θ3 are more diffuse than the MCMC results. Also the θ2 DIS results lack the sharp truncation of the right tail present in the MCMC results. (Truncation occurs at the minimum observed inter-departure time of 4.01, as the minimum service time cannot be greater than this.)

We also compare DIS to our ABC-PMC baseline, under a similar runtime. For full details of methods and results see the supplement (Section C). Table 1 summarizes the results. ABC-PMC without summary statistics targets (6), just as DIS does, but cannot reach as low an ϵ value in the same time, so it produces inaccurate estimates for all parameters. To improve ABC-PMC performance, we consider replacing the data with summaries: quartiles of the inter-departure times, as used in Papamakarios, Sterratt, and Murray (Citation2019). This improves results for θ1 and θ2, but still produces severe bias for θ3.

Table 1: Estimated posterior means and 95% quantile intervals for the M/G/1 example using various methods: MCMC (near-exact), ABC without summary statistics for ϵ=6.32, ABC with quartile summaries for ϵ=0.16, DIS for ϵ=2.30.

8 Example: Susceptible-Infective Process on a Random Network

This section illustrates how DIS can be used to infer discrete variables. The key idea is to represent them as transformations of continuous latent variables. Such a mapping must be noninjective, and this is supported by the DIS algorithm without a need for any modifications.

We demonstrate this in the context of a spreading process model of an infectious disease on a random network, specifically a susceptible-infective (SI) compartmental model on an Erdős–Rényi network. See Dutta, Mira, and Onnela (Citation2018) for a discussion of similar models, although not the exact one we consider.

8.1 Model

Our model represents a population of m individuals as a network. Each individual corresponds to a node. The individuals are labeled 0, 1, …, m − 1. An edge indicates contact between the corresponding individuals. The network is assumed to be an instance of the Erdős–Rényi random network (Erdős and Rényi Citation1959), so each pair of nodes forms an edge with probability θ1.

The spreading process of the infectious disease is modeled as a discrete-time process. We consider a compartmental model where at any time each individual is in one of three states: susceptible, infective or immune. Initially there is a single infective node, individual 0. Susceptible neighbors of an infective individual at time t are exposed and become infective at time t+1 with probability θ2. Exposed individuals who are not infected instead become immune. (All infection and edge formation events are assumed independent.)

Parameters θ1 and θ2 have independent U(0,1) prior distributions. We assume that the observations are a binary vector indicating infective status for each node/time combination.

8.2 Likelihood-Based Inference

The likelihood function of the model can be calculated by summing contributions from all possible networks, as detailed in the supplement (Section D). This is computationally feasible for small m, enabling near-exact posterior inference. Below, we use likelihood-based importance sampling as a benchmark for an example with m = 5. However, likelihood calculations become infeasibly expensive for larger m, as the number of networks to sum over is $2^{m(m-1)/2}$. For instance, below we also investigate an example with m = 10, giving $2^{45} \approx 3 \times 10^{13}$ total networks.

8.3 DIS Implementation

Latent Variables and Simulator

Similarly to Section 7, we introduce vectors of latent variables ϑ, xedge, xinfect. Here dim(ϑ) = 2, dim(xedge) = $\binom{m}{2}$, and dim(xinfect) = m. Now ξ is the collection (ϑ, xedge, xinfect).

Our simulator transforms these inputs to θ(ϑ) and y(ξ). As in Section 7, the simulator has the property that ξ ∼ N(0, I) produces samples from the prior and model. We pretrain q to approximate this distribution. Full details of the simulator transformation are in the supplement (Section D). In brief, xedge has an entry for each possible edge. An edge occurs when the corresponding entry falls below a threshold controlled by ϑ1. Similarly, xinfect has an entry for each individual. When that individual is exposed, infection occurs if the corresponding entry falls below a threshold controlled by ϑ2. (Only one random variable is needed per individual as they can only be exposed once. Afterwards they are either infective or immune.)

To summarize, as well as continuous parameters θ we also wish to infer discrete variables: presence of edges and infection status. As noted at the start of Section 8, the discrete variables are inferred by representing them as a mapping of continuous latent variables ξ.
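One simple thresholding scheme consistent with this description is sketched below; the authors' exact transformation is in the supplement (Section D), so treat this as illustrative.

```python
import torch

std_normal = torch.distributions.Normal(0.0, 1.0)

def edges_from_latents(x_edge, theta1):
    """An edge exists when its latent entry falls below the threshold,
    i.e. with probability theta1 under x_edge ~ N(0, 1). Infection
    indicators are built analogously from x_infect and theta2."""
    return (std_normal.cdf(x_edge) < theta1).float()
```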

8.4 Results

We first consider a small network with m = 5 nodes, observed for 5 time steps. This has dim(ξ) = 17. Following Section 7 we use N = 5000 and M = 250 in DIS. As discussed in Section 3.2, it is plausible here for DIS to reach ϵ = 0, since the data are discrete. Our experiment reaches ϵ = 0 after 1.4 min. Then we perform importance sampling using this DIS proposal with 100,000 samples, which takes a further 1.2 min. Figure 3 shows that the posterior histograms for θ match those from likelihood-based importance sampling, which produces 100,000 samples in only 4 sec.

Figure 3: Output for SI example with m=5, showing marginal posterior histograms for parameters θ1 and θ2 from IS and DIS output. To produce DIS histograms we ran a final stage of importance sampling with 100,000 samples and then used importance resampling (see Section 4.3) to get unweighted samples.

Next we consider a larger network with m = 10 nodes, observed for 10 time steps. Now dim(ξ) = 57. As described in Section 8.2, exact likelihood calculation is infeasible here. However, DIS with N = 5000, M = 250 can perform exact inference, reaching ϵ = 0 after 393 min. Figure 4 (left) shows the resulting posterior histograms for θ. Then we perform importance sampling using this DIS proposal with 20,000 samples, which takes a further 1.7 min.

Figure 4: DIS output for SI example with m=10. To produce the plots we ran a final stage of importance sampling with 20,000 samples, and then used importance resampling (see Section 4.3) to get unweighted samples. Left: Marginal posterior histograms for parameters θ1 and θ2. Vertical lines show the true values. Right: Posterior distribution of the network structure. Colors (in electronic version)/shades (in printed version) indicate the posterior probability of the existence of an edge and that a node will be infected upon exposure. Nodes represented by shaded circles are infective at some point, and dashed edges are those existing in the true network.

DIS also outputs posterior distributions of the latent variables xedge and xinfect, controlling the network structure and whether individuals become immune on exposure. These are summarized in Figure 4 (right). In the observed data, only individuals 1 and 7 do not become infected. The figure shows these individuals have a low posterior probability of becoming infected on exposure. Individual 0 has a moderate probability of infection on exposure. This is because they are already infective initially, so there is little information in the data on what would happen to them on exposure. The remaining individuals have a high posterior probability of becoming infected. In the data, individuals 3, 6, and 8 are infected in the first time period. Consequently the edges from these to individual 0 have the highest posterior probabilities, as this is the only possible route of infection. Edges representing plausible routes to individuals infected later have lower probabilities, presumably as there are more possibilities for how this happens. The lowest probability edges are those which would result in individuals being infected before this is observed to happen in the data, for example, the edge (0, 2).

9 Conclusion

We present distilled importance sampling for likelihood-free inference, and show its application in several examples. An M/G/1 queuing example demonstrates that the method produces approximate results which are much more accurate than ABC, without needing the use of summary statistics. An epidemic model example shows that the method can also produce inference for discrete data. Indeed it demonstrates that for discrete data, it is possible for DIS to target the exact posterior. We also investigate tuning choices, and propose some default choices. The supplement (Section E) has a summary of tuning recommendations, and discussion of some alternative possibilities.

The remainder of the section discusses possible extensions of our methodology, as well as limitations.

Likelihood-Based Inference

This article concentrates on likelihood-free inference. DIS can also be applied to some likelihood-based inference problems. However, exploratory work shows that here it performs less well than the competing methods listed at the start of Section 1.1. See the supplement (Section F) for further discussion.

More Efficient Training

Every iteration of Algorithm 1 generates new model simulations, and uses them for a few iterations of training. This results in a computationally feasible algorithm for our examples, as the simulators are reasonably fast. However, for more expensive simulators, it would be desirable to use simulations more efficiently in the training process. For instance, training could continue with the same training data until convergence—although this risks overtraining. Alternatively, training data from all previous iterations could be reused for training—although ensuring the correct target distribution may be difficult. We plan to explore these approaches in future work.

Amortized Flows

Throughout the article, q(ξ;ϕ) is a generic normalizing flow producing the vector ξ. We refer to this as a black-box approach, since it involves no knowledge of how x is used by the simulator. This is less practical for large dim(ξ), as an increasingly large number of simulations is required to learn such a high dimensional distribution.

An alternative is leveraging amortization in the flow, using knowledge about the simulator. Suppose a simulator requests a random variable. An amortized flow would determine the distribution to sample from using knowledge of: previously returned random variables (as for a black-box flow); the current simulator state; the line of code making the request. The complexity of the flow now depends on the number of sampling statements in the code, rather than the total number of samples made. This can be a big reduction in complexity when the simulator makes many iterations over a few sample statements. Similar ideas appear in inference networks (Le, Baydin, and Wood Citation2017).

Overriding Random Number Generators

For the examples in this article, we wrote custom simulator code which allowed x values sampled from q to be supplied as input. For large simulator codebases, it would be ideal to work with the original code, without needing to develop a custom version. Following Baydin et al. (Citation2019), a possible approach is to override the random number generator that the code uses to instead provide x values proposed by q.

Use with CDE

As mentioned in Sections 1 and 2.1, one approach to likelihood-free inference is conditional density estimation (CDE). Unlike DIS, this allows amortized inference: after training it can be used for inference of many different observed datasets. Future work could combine the methods: use CDE to produce an initial approximate posterior estimate, then refine further using DIS. This makes use of complementary advantages of the methods: CDE is amortized, while DIS relies less on summary statistics.


Acknowledgments

The authors gratefully acknowledge: Alex Shestopaloff for providing MCMC code for the M/G/1 model; Sammy Ragy for spotting several typos; Andrew Golightly, Chris Williams and anonymous reviewers for helpful suggestions. This work was supported by the EPSRC under grant EPSRC EP/V049127/1 (Dennis Prangle); and Italian Ministry of University and Research (MIUR) under grant 58523_DIPECC (Cecilia Viscardi). The authors have no potential conflict of interest to report.

Supplementary Material

Distilling Importance Sampling: Supplementary Material

This document contains additional information and figures. (PDF)


Notes

1 If the simulator is deterministic the distribution is a point mass. However, it is more common to consider stochastic simulators with LFI.

2 Or of an approximation pϵ(ξ) as defined below in (6).

3 We must sometimes bisect intervals of the form [a, ∞]. To do so we take the midpoint to be a + 100.

4 For instance because this set has Lebesgue measure zero and q has Lebesgue dominating measure.

5 The large number of samples is to meet a reviewer request to estimate the tails in Figure 2 accurately.

References

  • Agapiou, S., Papaspiliopoulos, O., Sanz-Alonso, D., and Stuart, A. M. (2017), “Importance Sampling: Intrinsic Dimension and Computational Cost,” Statistical Science, 32, 405–431.
  • Arbel, M., Matthews, A., and Doucet, A. (2021), “Annealed Flow Transport Monte Carlo,” in International Conference on Machine Learning, pp. 318–330.
  • Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2018), “Automatic Differentiation in Machine Learning: A Survey,” Journal of Machine Learning Research, 18, 1–43.
  • Baydin, A. G., Shao, L., Bhimji, W., Heinrich, L., Naderiparizi, S., Munk, A., Liu, J., Gram-Hansen, B., Louppe, G., Meadows, L., Torr, P., Lee, V., Prabhat, Cranmer, K., and Wood, F. (2019), “Efficient Probabilistic Inference in the Quest for Physics Beyond the Standard Model,” in Advances in Neural Information Processing Systems (Vol. 32).
  • Bornschein, J., and Bengio, Y. (2014), “Reweighted Wake-Sleep,” arXiv preprint arXiv:1406.2751.
  • Chatterjee, S., and Diaconis, P. (2018), “The Sample Size Required in Importance Sampling,” The Annals of Applied Probability, 28, 1099–1135.
  • Cornuet, J.-M., Marin, J.-M., Mira, A., and Robert, C. P. (2012), “Adaptive Multiple Importance Sampling,” Scandinavian Journal of Statistics, 39, 798–812.
  • Cotter, S. L., Kevrekidis, I. G., and Russell, P. (2020), “Transport Map Accelerated Adaptive Importance Sampling, and Application to Inverse Problems Arising from Multiscale Stochastic Reaction Networks,” SIAM/ASA Journal on Uncertainty Quantification, 8, 1383–1413.
  • Del Moral, P., Doucet, A., and Jasra, A. (2012), “An Adaptive Sequential Monte Carlo Method for Approximate Bayesian Computation,” Statistics and Computing, 22, 1009–1020.
  • Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. (2017), “Variational Inference via χ Upper Bound Minimization,” in Advances in Neural Information Processing Systems (Vol. 30).
  • Duan, L. L. (2021), “Transport Monte Carlo: High-Accuracy Posterior Approximation via Random Transport,” Journal of the American Statistical Association, DOI: 10.1080/01621459.2021.2003201.
  • Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. (2019), “Neural Spline Flows,” in Advances in Neural Information Processing Systems (Vol. 32).
  • Dutta, R., Mira, A., and Onnela, J.-P. (2018), “Bayesian Inference of Spreading Processes on Networks,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474, 20180129.
  • Elvira, V., Martino, L., and Robert, C. P. (2022), “Rethinking the Effective Sample Size,” International Statistical Review, 90, 525–550.
  • Erdős, P., and Rényi, A. (1959), “On Random Graphs,” Publicationes Mathematicae, 6, 290–297.
  • Foerster, J., Farquhar, G., Al-Shedivat, M., Rocktäschel, T., Xing, E., and Whiteson, S. (2018), “Dice: The Infinitely Differentiable Monte Carlo Estimator,” in International Conference on Machine Learning, pp. 1529–1538.
  • Germain, M., Gregor, K., Murray, I., and Larochelle, H. (2015), “MADE: Masked Autoencoder for Distribution Estimation,” in International Conference on Machine Learning, pp. 881–889.
  • Graham, M. M., and Storkey, A. J. (2017), “Asymptotically Exact Inference in Differentiable Generative Models,” Electronic Journal of Statistics, 11, 5105–5164.
  • Grazian, C., and Fan, Y. (2019), “A Review of Approximate Bayesian Computation Methods via Density Estimation: Inference for Simulator-Models,” Wiley Interdisciplinary Reviews: Computational Statistics, 12, e1486.
  • Huggins, J. H., Kasprzak, M., Campbell, T., and Broderick, T. (2020), “Practical Posterior Error Bounds from Variational Objectives,” in Artificial Intelligence and Statistics, pp. 1792–1802.
  • Ikonomov, B., and Gutmann, M. U. (2020), “Robust Optimisation Monte Carlo,” in International Conference on Artificial Intelligence and Statistics, pp. 2819–2829.
  • Ionides, E. L. (2008), “Truncated Importance Sampling,” Journal of Computational and Graphical Statistics, 17, 295–311.
  • Jerfel, G., Wang, S., Fannjiang, C., Heller, K. A., Ma, Y., and Jordan, M. I. (2021), “Variational Refinement for Importance Sampling Using the Forward Kullback-Leibler Divergence,” in Uncertainty in Artificial Intelligence, pp. 1819–1829.
  • Kingma, D. P., and Ba, J. (2015), “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations.
  • Larrañaga, P., and Lozano, J. A. (2002), Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation, New York, NY: Springer.
  • Le, T. A., Baydin, A. G., and Wood, F. (2017), “Inference Compilation and Universal Probabilistic Programming,” in Artificial Intelligence and Statistics, pp. 1338–1348.
  • Li, Y., Turner, R. E., and Liu, Q. (2017), “Approximate Inference with Amortised MCMC,” arXiv preprint arXiv:1702.08343.
  • Liu, J. S. (1996), “Metropolized Independent Sampling with Comparisons to Rejection Sampling and Importance Sampling,” Statistics and Computing, 6, 113–119.
  • MacKay, D. J. C. (2003), Information Theory, Inference and Learning Algorithms, Cambridge, UK: Cambridge University Press.
  • Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. (2012), “Approximate Bayesian Computational Methods,” Statistics and Computing, 22, 1167–1180.
  • Meeds, T., and Welling, M. (2015), “Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference,” Advances in Neural Information Processing Systems (Vol. 28).
  • Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A. (2020), “Monte Carlo Gradient Estimation in Machine Learning,” Journal of Machine Learning Research, 21, 1–62.
  • Müller, T., Mcwilliams, B., Rousselle, F., Gross, M., and Novák, J. (2019), “Neural Importance Sampling,” ACM Transactions on Graphics (TOG), 38, 1–19.
  • Naesseth, C. A., Lindsten, F., and Blei, D. (2021), “Markovian Score Climbing: Variational Inference with KL(p||q),” in Advances in Neural Information Processing Systems, Vol. 33, pp. 15499–15510.
  • Nash, C., and Durkan, C. (2019), “Autoregressive Energy Machines,” in International Conference on Machine Learning, pp. 1735–1744.
  • Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. (2021), “Normalizing Flows for Probabilistic Modeling and Inference,” Journal of Machine Learning Research, 22, 1–64.
  • Papamakarios, G., Sterratt, D., and Murray, I. (2019), “Sequential Neural Likelihood: Fast Likelihood-Free Inference with Autoregressive Flows,” in Artificial Intelligence and Statistics, pp. 837–848.
  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al. (2019), “Pytorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems, Vol. 32.
  • Pickands III, J., and Stine, R. A. (1997), “Estimation for an M/G/∞ Queue with Incomplete Information,” Biometrika, 84, 295–308.
  • Prangle, D., Everitt, R. G., and Kypraios, T. (2018), “A Rare Event Approach to High-Dimensional Approximate Bayesian Computation,” Statistics and Computing, 28, 819–834.
  • Robert, C. P., and Casella, G. (2013), Monte Carlo Statistical Methods, New York, NY: Springer.
  • Rubinstein, R. (1999), “The Cross-Entropy Method for Combinatorial and Continuous Optimization,” Methodology and Computing in Applied Probability, 1, 127–190.
  • Rubinstein, R. Y., and Kroese, D. P. (2016), Simulation and the Monte Carlo Method, Hoboken, NJ: John Wiley & Sons.
  • Ruder, S. (2016), “An Overview of Gradient Descent Optimization Algorithms,” arXiv preprint arXiv:1609.04747.
  • Shestopaloff, A. Y., and Neal, R. M. (2014), “On Bayesian Inference for the M/G/1 Queue with Efficient MCMC Sampling,” arXiv preprint arXiv:1401.5548.
  • Smith, A. F. M., and Gelfand, A. E. (1992), “Bayesian Statistics without Tears: A Sampling–Resampling Perspective,” The American Statistician, 46, 84–88.
  • Wilkinson, R. D. (2013), “Approximate Bayesian Computation (ABC) Gives Exact Results Under the Assumption of Model Error,” Statistical Applications in Genetics and Molecular Biology, 12, 129–141. DOI: 10.1515/sagmb-2013-0010.
  • Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018), “Yes, but Did It Work?: Evaluating Variational Inference,” in International Conference on Machine Learning, pp. 5581–5590.