Bayesian and Monte Carlo Methods

The Apogee to Apogee Path Sampler

Pages 1436-1446 | Received 10 Jan 2022, Accepted 01 Mar 2023, Published online: 24 Apr 2023

Abstract

Among Markov chain Monte Carlo algorithms, Hamiltonian Monte Carlo (HMC) is often the algorithm of choice for complex, high-dimensional target distributions; however, its efficiency is notoriously sensitive to the choice of the integration-time tuning parameter. When integrating both forward and backward in time using the same leapfrog integration step as HMC, the set of apogees, local maxima in the potential along a path, is the same whatever point (position and momentum) along the path is chosen to initialize the integration. We present the Apogee to Apogee Path Sampler (AAPS), which uses this invariance to create a simple yet generic methodology for constructing a path, proposing a point from it and accepting or rejecting that proposal so as to target the intended distribution. We demonstrate empirically that AAPS has a similar efficiency to HMC but is much more robust to the setting of its equivalent tuning parameter, the number of apogees that the path crosses. Supplementary materials for this article are available online.

1 Introduction

Markov chain Monte Carlo (MCMC) is often the method of choice for estimating expectations with respect to complex, high-dimensional targets (e.g., Gilks et al. 1996; Brooks et al. 2011). Among MCMC algorithms, Hamiltonian Monte Carlo (HMC, also known as Hybrid Monte Carlo; Duane et al. 1987) is known to offer performance that scales better with the dimension of the state space than many of its rivals (Neal 2011b; Beskos et al. 2013).

Given a target density π(x), x ∈ ℝ^d, with respect to Lebesgue measure, and a current position, at each iteration HMC samples a momentum and numerically integrates Hamiltonian dynamics on a potential surface
(1) U(x) = −log π(x)
to create a proposal that will either be accepted or rejected. As such, HMC has two main tuning parameters: the numerical integration step size, ϵ, and the total integration time, T. Given T, guidelines for tuning ϵ have been available for some time (Beskos et al. 2013); however, the integration time itself is notoriously difficult to tune (e.g., Neal 2011b), with algorithm efficiency often dropping sharply following only slight changes from the optimum T value, and usually exhibiting approximately cyclic behavior as T increases.

The sensitivity is illustrated in the left panel of Figure 1, in which HMC is applied to a modified Rosenbrock distribution (see Section 4.1) of dimension d = 40. In the top half of this plot, the efficiency (see (8) for a precise definition) is given as a function of ϵ and the number of numerical integration steps, L. The bottom half of the panel shows the analogous plot for a modification of the HMC algorithm (Mackenzie 1989; Neal 2011b), which we refer to as blurred HMC, where at each iteration the actual step size is sampled (independently of all previous choices) uniformly from the interval [0.8ϵ, 1.2ϵ]. This step was designed to mitigate the near reducibility of HMC that can occur when T is a rational multiple of the integration time required to return close to the starting point, but as is visible from the plots, it also makes the performance of the algorithm more robust to the choice of T, and, we have found, often leads to a slightly more efficient algorithm. In both cases, the optimal tuning choice appears as a narrow ridge of roughly constant T = Lϵ. Blurred HMC can be viewed as sampling the integration time T uniformly from [2T*/3, T*], where T* = 1.2Lϵ. The approach of using a random integration time to introduce robustness has recently been extended to sampling T uniformly from [0, T*] (Hoffman et al. 2021) and as an exponential variable with expectation T* (Bou-Rabee and Sanz-Serna 2017).

Fig. 1 Efficiency, according to (8), as a function of the tuning parameters for the 40-dimensional modified Rosenbrock target of Section 4.1. Left panel: HMC, with positive values of L corresponding to the standard HMC algorithm and negative values of L corresponding to |L| leapfrog steps of blurred HMC. Right panel: AAPS. Optimal parameter settings in red.

Motivated by the difficulty of tuning T, Hoffman and Gelman (2014) introduced the no U-turn sampler (NUTS). This uses the same numerical integration scheme as standard HMC, the leapfrog step, to integrate Hamiltonian dynamics both forward and backward from the current point, recursively doubling the size of the path until the distance between the points farthest forward and farthest backward in time stops increasing. Considerable care must be taken to ensure that the intended posterior is targeted, making the algorithm relatively complex; however, the no U-turn sampler is the default engine behind the popular Stan package (Stan Development Team 2020), which has a relatively straightforward user interface.

Integration of Hamiltonian dynamics can be thought of as describing the position and momentum of a particle as it traverses the potential surface U. During its journey, provided T is not too small, the particle will pass one or more local maxima, or apogees, in the potential. The leapfrog scheme creates a path which is a discrete set of points rather than a continuum, and so (with probability 1) apogees occur between consecutive points on the path; however, they are straightforward to detect. We call the set of points (positions and momenta) between two apogees a segment. The discrete dynamics using the leapfrog step share several properties with the true dynamics, including the following: if we take the position and momentum of any point along the path and integrate forward and backward for appropriate lengths of time, we will create exactly the same path, and hence the same set of apogees and the same set of segments. This invariance is vital to the correctness and flexibility of the algorithm presented in this article.

In Section 3 we present the Apogee to Apogee Path Sampler (AAPS). Like HMC, AAPS is straightforward to implement and has two tuning parameters, one of which is an integration step size, ϵ. As with the no U-turn sampler, given a current point (position and momentum), AAPS uses the leapfrog step to integrate forwards and backwards in time. However, the integration stops when the path contains the segment in which the current point lies as well as K additional segments, where K is a user-defined tuning parameter. A point is then proposed from this set of K + 1 segments and either accepted or rejected. The positioning of the current segment within the K + 1 segments and the accept-reject probability are chosen precisely so that the algorithm targets the intended density. The invariance of the path to the starting position and momentum leads to considerable flexibility in the method for proposing a point from the set of segments, which in turn allows us to create an algorithm which enjoys a similar efficiency to HMC yet is extremely robust to the choice of ϵ and K. These properties are evident from the right-hand panel of Figure 1 and from analogous plots for other distributions in Section 4.1. The robustness arises mainly from the proposal scheme; see the end of Section 3.3. The definition of K, however, also naturally caters to intrinsic properties of the target, such as its eccentricity, with any externally imposed length (or time) scale largely irrelevant; the theoretical analysis of Section 3.6 makes this explicit in a simplified setting.

Section 2 describes Hamiltonian dynamics and HMC. The AAPS is detailed in Section 3, and is compared empirically against HMC and the no U-turn sampler in Section 4. We conclude in Section 5 with a discussion.

2 Hamiltonian Dynamics and Hamiltonian Monte Carlo

2.1 Hamiltonian Dynamics

The position, x, and momentum, p, of a particle on a frictionless potential surface U(x) evolve according to Hamilton’s equations:
(2) dx/dt = M^{−1} p   and   dp/dt = −∇_x U.

For a real object, M is the mass of the object, a scalar, but the properties of the equations themselves that we will require hold more generally, when M is a symmetric, positive-definite mass matrix. The choice of M, whether for HMC or AAPS, is discussed at the end of Section 3.5. We define z_t = (x_t, p_t) and the map ϕ_t which integrates the dynamics forwards for a time t, so z_t = ϕ_t(z_0). The map ϕ_t has the following fundamental properties (e.g., Neal 2011b):

  1. It is deterministic.

  2. It has a Jacobian of 1.

  3. It is skew reversible: ϕ_t(x_t, −p_t) = (x_0, −p_0).

  4. It preserves the total energy H(x, p) = U(x) + (1/2) p^⊤ M^{−1} p.

Except in a few special cases, the dynamics are intractable and must be integrated numerically, with a user-chosen time step which we will denote by ϵ. The default method for Hamiltonian Monte Carlo, and the method used throughout this article, is the leapfrog step, which is detailed in Appendix A. Throughout the main text of this article we denote the action of a single leapfrog step of length ϵ on a current state z as LeapFrog(z; ϵ). The leapfrog step satisfies Properties 1–3 above (see Appendix A), but it does not preserve the total energy.

Consider using L leapfrog steps of size ϵ=t/L to approximately integrate the dynamics forward for time t. Since each individual leapfrog step satisfies Properties 1–3, so does the composition of L such steps, which we denote ϕ̂t(z0;ϵ).
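
To make the step concrete, the following sketch implements a single leapfrog step for a diagonal mass matrix; the State and Grad types, the function names and the diagonal-mass restriction are our own illustrative choices, and Appendix A gives the general statement of the integrator.

#include <vector>

// Position and momentum of the particle.
struct State { std::vector<double> x, p; };

// Gradient of the potential U(x) = -log pi(x), supplied by the user.
using Grad = std::vector<double> (*)(const std::vector<double>&);

// One leapfrog step of length eps with a diagonal mass matrix m:
// half step in momentum, full step in position, half step in momentum.
State LeapFrog(State z, double eps, Grad gradU, const std::vector<double>& m) {
    std::vector<double> g = gradU(z.x);
    for (size_t i = 0; i < z.x.size(); ++i) z.p[i] -= 0.5 * eps * g[i];
    for (size_t i = 0; i < z.x.size(); ++i) z.x[i] += eps * z.p[i] / m[i];
    g = gradU(z.x);
    for (size_t i = 0; i < z.x.size(); ++i) z.p[i] -= 0.5 * eps * g[i];
    return z;
}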

2.2 Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) creates a Markov chain which has a stationary distribution of π(x) = exp{−U(x)}. Given a current position, x_curr, and tuning parameters ϵ and L, a single iteration of the algorithm proceeds as follows:

  1. Sample a momentum p_0 ∼ N(0, M) and set z_0 = (x_curr, p_0).

  2. For i in 1 to L:

  3. z_i = LeapFrog(z_{i−1}; ϵ).

  4. Let z_prop = z_L and set α = min{1, π̃(z_prop)/π̃(z_0)}.

  5. With probability α, z_curr ← z_prop; else z_curr is left unchanged.

Here, with ρ(p) denoting the density of a N(0, M) random variable,
(3) π̃(z) ≡ π̃(x, p) = π(x) ρ(p) ∝ exp{−H(x, p)}.

If x_curr is in its stationary distribution then π̃ is the density of z_0 = (x_curr, p_0).
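
As an illustration, the sketch below spells out one iteration of the algorithm above, reusing the State and Grad types and the LeapFrog step from the sketch in Section 2.1; the helper names and the diagonal mass matrix are our own simplifications rather than the paper's implementation.

#include <cmath>
#include <random>
#include <vector>

// Assumes State, Grad and LeapFrog from the sketch in Section 2.1, plus a
// user-supplied potential U(x) = -log pi(x) and a diagonal mass matrix m.
double hamiltonian(const State& z, double (*U)(const std::vector<double>&),
                   const std::vector<double>& m) {
    double k = 0.0;                              // kinetic energy p' M^{-1} p / 2
    for (size_t i = 0; i < z.p.size(); ++i) k += 0.5 * z.p[i] * z.p[i] / m[i];
    return U(z.x) + k;
}

// One HMC iteration: momentum refreshment, L leapfrog steps, accept/reject.
std::vector<double> hmc_iteration(std::vector<double> x, double eps, int L,
                                  double (*U)(const std::vector<double>&), Grad gradU,
                                  const std::vector<double>& m, std::mt19937& rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    State z0{x, std::vector<double>(x.size())};
    for (size_t i = 0; i < x.size(); ++i) z0.p[i] = std::sqrt(m[i]) * gauss(rng); // p0 ~ N(0, M)
    State z = z0;
    for (int l = 0; l < L; ++l) z = LeapFrog(z, eps, gradU, m);                   // L leapfrog steps
    double alpha = std::exp(hamiltonian(z0, U, m) - hamiltonian(z, U, m));        // pi~(zprop)/pi~(z0)
    if (unif(rng) < alpha) return z.x;                                            // accept
    return x;                                                                     // reject
}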

3 The Apogee to Apogee Path Sampler

3.1 Apogees and Segments

The left panel of Figure 2 shows L = 50 leapfrog steps of size ϵ = 0.1 from a current position, x_0, simulated randomly from a two-dimensional posterior π(x) (with contours of U(x) shown), and momentum p_0 simulated from a N(0, I_2) distribution. Different symbols and colors are used along the path, with both of these changing from step l to step l + 1 if and only if
(4) p_l^⊤ M^{−1} ∇U(x_l) > 0   and   p_{l+1}^⊤ M^{−1} ∇U(x_{l+1}) < 0.

Fig. 2 Left panel: L = 50 leapfrog steps of size ϵ = 0.1 from the current point. Right panel: the current segment, S_0, and two segments forward and one segment backward using ϵ = 0.1. The current point, x_0, is simulated from a target with density π(x) ∝ exp(−x_1²/2 − 6x_2²); p_0 is simulated from N(0, I_2). Different colors and symbols are used for each segment along the path.

Intuitively, condition (4) indicates when the “particle” has switched from moving “uphill” to moving “downhill” with respect to the potential surface U(x). By (2), p^⊤ M^{−1} ∇U = (dx/dt) · ∇U = dU/dt, so (4) indicates a local maximum in U(x_t) between x_l and x_{l+1}.

Between such a pair of points at l and l + 1 there is a hypothetical point, a, with l < a < l + 1, where the particle’s potential reaches a local maximum, which we call an apogee. Under the exact, continuous dynamics this point would, of course, be realized, but under the discretized dynamics the probability of this is 0. We call each of the realized sections of the path between a pair of consecutive apogees (i.e., each portion with a different color and symbol in Figure 2) a segment. Each segment consists of the time-ordered list of positions and momenta at each point between two consecutive apogees.
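
In code, condition (4) amounts to checking for a sign change in p^⊤ M^{−1} ∇U(x) between consecutive leapfrog points; a minimal check (our own helper, again assuming a diagonal mass matrix) might look as follows.

#include <vector>

// dU/dt along the trajectory: p' M^{-1} grad U(x), for a diagonal mass matrix m.
double uphill_rate(const std::vector<double>& p, const std::vector<double>& gradUx,
                   const std::vector<double>& m) {
    double s = 0.0;
    for (size_t i = 0; i < p.size(); ++i) s += p[i] * gradUx[i] / m[i];
    return s;
}

// Condition (4): an apogee lies between steps l and l+1 iff the particle was moving
// uphill (dU/dt > 0) at step l and downhill (dU/dt < 0) at step l+1.
bool apogee_between(double rate_l, double rate_lplus1) {
    return rate_l > 0.0 && rate_lplus1 < 0.0;
}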

Instead of integrating forward for L steps, one can imagine integrating both forwards and backwards in time from z_0 = (x_0, p_0) until a certain number of apogees have been found. The right panel of Figure 2 shows the segment to which the current point belongs, which we denote S_0(z_0), together with two segments forward and one segment backward. We denote the jth segment forward by S_j(z_0) and the jth segment backward by S_{−j}(z_0). We abbreviate the ordered collection of segments from S_a(z_0) to S_b(z_0) as S_{a:b}(z_0). Thus, the right panel of Figure 2 shows the positions from S_{−1:2}(z_0). For a particular point z = (x, p) ∈ S_{a:b}, we denote the segment to which it belongs by S_#(z; z_0).

The following segment invariance property is vital to both the simplicity and correctness of our algorithm. For any K ≥ 0 and c ∈ {0, 1, …, K}, set a = −c and b = K − c. For any z and any z′ ∈ S_{a:b}(z),
(5) S_{a:b}(z) ≡ S_{a′:b′}(z′),   where a′ = −c′, b′ = K − c′ and c′ = c + S_#(z′; z).

The quantities a′, b′, and c′ correspond to a, b, and c but from the point of view of z′ rather than z. For the right panel of Figure 2, for example, picking any z′ = (x′, p′) from S_1(z_0), S_{−2:1}(z′) would give the same ordered set of segments as illustrated in the figure. This is because the numerical integration scheme is deterministic and skew reversible, so the apogees would all occur in exactly the same positions with exactly the same (up to a possible sign flip) momenta.

3.2 The AAPS Algorithm

We now introduce our algorithm, the Apogee to Apogee Path Sampler (AAPS). The algorithm requires a weight function w: ℝ^{4d} → [0, ∞), where d is the dimension of the target. Weight functions are investigated in more detail in Section 3.3, but for now it might be helpful to keep in mind the simplest that we consider: w(z, z′) = π̃(z′).

Given a step-size ϵ, a nonnegative integer, K, a mass matrix, M and a current position xcurr, one iteration of AAPS proceeds as follows:

  1. Sample a momentum p_0 ∼ N(0, M) and set z_curr = z_0 = (x_curr, p_0).

  2. Simulate c uniformly from {0, 1, …, K}; set a = −c and b = K − c.

  3. Create S_{a:b} by leapfrog stepping forward from (x_0, p_0) and then backward from (x_0, −p_0).

  4. Propose z_prop with probability proportional to w(z_curr, z_prop) 1(z_prop ∈ S_{a:b}(z_curr)).

  5. With probability
(6) α = min{1, [π̃(z_prop) w(z_prop, z_curr) ∑_{z ∈ S_{a:b}(z_curr)} w(z_curr, z)] / [π̃(z_curr) w(z_curr, z_prop) ∑_{z ∈ S_{a:b}(z_curr)} w(z_prop, z)]}
set z_curr ← z_prop; else z_curr is left unchanged.

  6. Discard pcurr and retain xcurr.

The Metropolis–Hastings formula (6) arises because, out of the allowable proposals once c has been chosen, the probability of proposing z_prop is q(z_prop | z_curr) = w(z_curr, z_prop) / ∑_{z ∈ S_{a:b}(z_curr)} w(z_curr, z).
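
The sketch below assembles the pieces above into one AAPS iteration with weighting Scheme 3 of Section 3.3, w(z, z′) = π̃(z′)‖x′ − x‖². For clarity it stores the whole path and omits the stability check (7) discussed later in this section; Appendix B describes how Schemes 1–3 can instead be implemented with O(1) memory. It reuses State, Grad, LeapFrog and uphill_rate from the earlier sketches, and all helper names and the diagonal-mass restriction are our own.

#include <cmath>
#include <random>
#include <vector>

// One AAPS iteration with weighting Scheme 3 (a sketch; not the paper's code).
std::vector<double> aaps_iteration(std::vector<double> x, double eps, int K,
                                   double (*U)(const std::vector<double>&), Grad gradU,
                                   const std::vector<double>& m, std::mt19937& rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::uniform_int_distribution<int> unifK(0, K);
    const size_t d = x.size();

    State z0{x, std::vector<double>(d)};
    for (size_t i = 0; i < d; ++i) z0.p[i] = std::sqrt(m[i]) * gauss(rng);   // step 1
    const int c = unifK(rng);                                                // step 2: a = -c, b = K - c

    auto energy = [&](const State& z) {                                      // H(z) = U(x) + p'M^{-1}p/2
        double k = 0.0;
        for (size_t i = 0; i < d; ++i) k += 0.5 * z.p[i] * z.p[i] / m[i];
        return U(z.x) + k;
    };

    std::vector<State> path{z0};
    std::vector<double> energies{energy(z0)};
    // Step 3: walk from a starting state until n_apogees apogees have been crossed,
    // keeping every point before the final apogee.  The backward half of the path is
    // produced by flipping the momentum (skew reversibility); the flipped momentum
    // signs do not change H and hence do not change the weights below.
    auto extend = [&](State z, int n_apogees) {
        double rate = uphill_rate(z.p, gradU(z.x), m);
        int found = 0;
        while (found < n_apogees) {
            State znew = LeapFrog(z, eps, gradU, m);
            double rate_new = uphill_rate(znew.p, gradU(znew.x), m);
            if (rate > 0.0 && rate_new < 0.0 && ++found == n_apogees) break; // condition (4)
            path.push_back(znew);
            energies.push_back(energy(znew));
            z = znew;
            rate = rate_new;
        }
    };
    extend(z0, K - c + 1);                                 // forward: segments 0, 1, ..., K - c
    State zflip = z0;
    for (size_t i = 0; i < d; ++i) zflip.p[i] = -zflip.p[i];
    extend(zflip, c + 1);                                  // backward: segments -1, ..., -c

    // Step 4: propose with weight Scheme 3, w(z, z') = pi~(z') * ||x' - x||^2.
    const double H0 = energies[0];
    std::vector<double> w(path.size());
    double S_curr = 0.0;
    for (size_t j = 0; j < path.size(); ++j) {
        double sq = 0.0;
        for (size_t i = 0; i < d; ++i) sq += (path[j].x[i] - x[i]) * (path[j].x[i] - x[i]);
        w[j] = std::exp(H0 - energies[j]) * sq;            // pi~ up to a constant that cancels
        S_curr += w[j];
    }
    if (S_curr <= 0.0) return x;                           // degenerate path: stay put
    std::discrete_distribution<size_t> pick(w.begin(), w.end());
    const size_t jprop = pick(rng);

    // Step 5: for Scheme 3 the acceptance probability (6) reduces to min{1, S_curr / S_prop},
    // where S_prop is the same weighted sum of squared distances measured from x_prop.
    double S_prop = 0.0;
    for (size_t j = 0; j < path.size(); ++j) {
        double sq = 0.0;
        for (size_t i = 0; i < d; ++i)
            sq += (path[j].x[i] - path[jprop].x[i]) * (path[j].x[i] - path[jprop].x[i]);
        S_prop += std::exp(H0 - energies[j]) * sq;
    }
    if (unif(rng) < S_curr / S_prop) return path[jprop].x; // accept
    return x;                                              // reject (step 6 discards the momentum)
}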

Proposition 1.

The AAPS algorithm satisfies detailed balance with respect to the extended posterior π˜(x,p).

Proof.

Step 1 preserves π˜ because p is independent of x and is sampled from its marginal.

It will be helpful to define the system from the point of view of starting at z_prop. Let a′, b′ and c′ be as defined in (5), but with z′ = z_prop. Then, since S_#(z_prop; z_curr) = −S_#(z_curr; z_prop),
−c ≤ 0 ≤ K − c and −c ≤ S_#(z_prop; z_curr) ≤ K − c   ⟺   −c′ ≤ S_#(z_curr; z_prop) ≤ K − c′ and −c′ ≤ 0 ≤ K − c′.

Equivalently, 1(c ∈ {0, …, K}) 1(z_prop ∈ S_{a:b}(z_curr)) = 1(c′ ∈ {0, …, K}) 1(z_curr ∈ S_{a′:b′}(z_prop)). Moreover, segment invariance (5) is equivalent to the statement that z ∈ S_{a:b}(z_curr) ⟺ z ∈ S_{a′:b′}(z_prop).

The resulting chain satisfies detailed balance with respect to π̃ because
π̃(z_curr) × P(c) × P(propose z_prop | c) × P(accept z_prop | propose z_prop, c)
is
π̃(z_curr) × [1(c ∈ {0, …, K}) / (K + 1)] × [w(z_curr, z_prop) 1(z_prop ∈ S_{a:b}(z_curr)) / ∑_{z ∈ S_{a:b}(z_curr)} w(z_curr, z)] × min{1, [π̃(z_prop) w(z_prop, z_curr) ∑_{z ∈ S_{a:b}(z_curr)} w(z_curr, z)] / [π̃(z_curr) w(z_curr, z_prop) ∑_{z ∈ S_{a′:b′}(z_prop)} w(z_prop, z)]},
which is
[1(c ∈ {0, …, K}) 1(z_prop ∈ S_{a:b}(z_curr)) / (K + 1)] × min{ π̃(z_curr) w(z_curr, z_prop) / ∑_{z ∈ S_{a:b}(z_curr)} w(z_curr, z),  π̃(z_prop) w(z_prop, z_curr) / ∑_{z ∈ S_{a′:b′}(z_prop)} w(z_prop, z) }.

This expression is invariant to the interchange (z_curr, a, b, c) ↔ (z_prop, a′, b′, c′). □

Remark 1.

The AAPS could be applied with a numerical integration scheme that does not have a Jacobian of 1. In such a scheme, however, the Jacobian would need to be included in the acceptance probability; moreover, a Jacobian other than 1 would lead to greater variability in π˜ along a path and, we conjecture, to a reduced efficiency.

The path S_{a:b} may visit parts of the state space where the numerical solution to (2) is unstable and the error in the Hamiltonian may increase without bound, leading to wasted computational effort, as large chunks of the path may be very unlikely to be proposed and accepted. Indeed, it is even possible for the error to increase beyond machine precision. The no U-turn sampler suffers from a similar problem and we introduce a stability condition similar to that in Hoffman and Gelman (2014). We require that
(7) max_{z ∈ S_{a:b}(z_curr)} H(z) − min_{z ∈ S_{a:b}(z_curr)} H(z) < Δ
for some prespecified parameter Δ. This criterion can be monitored as the forward and backward integrations proceed, and if at any point the criterion is breached, the path is thrown away: the proposal is automatically rejected. Segment invariance means that z ∈ S_{a:b}(z_curr) ⟺ z ∈ S_{a′:b′}(z_prop), so the same rejection would have occurred if we had created the path from any proposal in S_{a:b}(z_curr), and detailed balance is still satisfied. For the experiments in Section 4 we found that a value of Δ = 1000 was sufficiently large that the criterion only interfered when something was seriously wrong with the integration. Step 3 of the algorithm then becomes:

  3. Create S_{a:b} by leapfrog stepping forward from (x_0, p_0) and then backward from (x_0, −p_0). If condition (7) fails then go to step 6.

For a d-dimensional Markov chain, {X_t}_{t=0}^∞, with a stationary distribution of π, the asymptotic variance is V_f := lim_{n→∞} n var[(1/n) ∑_{i=1}^n f(X_i)]. Thus, after burn-in, var[(1/n) ∑_{i=1}^n f(X_i)] ≈ V_f / n. The effective sample size is the number of iid samples from π that would lead to the same variance: ESS_f = n var_π[f] / V_f. Let ESS_i be an empirical estimate of the ESS for the ith component of X (i.e., f gives the ith component) and let n_leap be the total number of leapfrog steps taken during the run of the algorithm. We measure the efficiency of an algorithm as
(8) Eff = (1 / n_leap) min_{i=1,…,d} ESS_i.

Since the leapfrog step is by far the most computationally intensive part of the algorithm, Eff is proportional to the number of effective iid samples generated per second for the worst-mixing component.
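
As a small illustration, computing (8) from a run is a one-liner once per-component ESS estimates (e.g., from the R package coda) and the leapfrog count are available; the wrapper below is our own.

#include <algorithm>
#include <vector>

// Efficiency (8): worst-component effective sample size per leapfrog step.
double efficiency(const std::vector<double>& ess, long n_leap) {
    return *std::min_element(ess.begin(), ess.end()) / static_cast<double>(n_leap);
}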

3.3 Choice of Weight Function

The weight function w(z, z′) = π̃(z′) mentioned previously is a natural choice, and substitution into (6) shows that this leads to an acceptance probability of 1; however, it turns out not to be the most efficient choice. For example, intuitively the algorithm might be more efficient if points on the path that are further away from the current point are more likely to be proposed. To investigate the effects of the choice of w we examine six possibilities:

  1. w(z, z′) = π̃(z′), which leads to α = 1.

  2. w(z, z′) = ‖x′ − x‖², squared jumping distance (SJD).

  3. w(z, z′) = π̃(z′) ‖x′ − x‖², SJD modulated by the target.

  4. w(z, z′) = ‖x′ − x‖, absolute jumping distance (AJD).

  5. w(z, z′) = π̃(z′) ‖x′ − x‖, AJD modulated by the target.

  6. w(z, z′) = π̃(z′) 1(z′ ∈ H(z)), where H(z) is described below; this also leads to α = 1.

Scheme 6 essentially partitions S_{a:b}(z_curr) into two halves of roughly equal total π̃ and then proposes only values from the half, H(z_curr), that does not contain z_curr; details are given in Appendix H.

The left panel of Figure 3 shows, for a particular choice of ϵ and target, the efficiency of AAPS as a function of K for each of the six weight schemes. Scheme 6 is the most efficient; however, Schemes 3 and 5, each of which involves some measure of jumping distance modulated by π̃, are not far behind. Indeed, there is only a factor of two between the least and most efficient. Very similar relative performances were found for all the other toy targets from Section 4.1 and across a variety of choices of ϵ, except that when ϵ becomes small, modulation of SJD or AJD by π̃ makes little difference since relative changes in π̃ are small.

Fig. 3 Left: efficiency (see (8)) of AAPS as a function of K when ϵ = 1.2. Right: acceptance rate as a function of ϵ when K = 15. There is one curve for each of the six choices of weight function. The single horizontal line associated with each curve in the left panel indicates the maximum efficiency achieved. The target is π_G^H(x) (see Section 4.1) with d = 40 and ξ = 20. Weighting functions 1 and 6 allow z_prop = z_curr, which we count as equivalent to a rejection.

A simple heuristic can explain the difference of a factor of nearly two between Scheme 1 and Scheme 6 in the case where π̃ is approximately constant; for example, when ϵ is small. S_{a:b} is constructed prior to making the proposal, so the computational effort does not depend on the scheme; thus, efficiency can be measured crudely in terms of the squared jumping distance of the proposal (e.g., Roberts and Rosenthal 2001; Sherlock and Roberts 2009; Beskos et al. 2013), since it is accepted with probability 1. Without loss of generality, we rescale S_{a:b} to have unit length. For two independent Unif(0,1) variables U_1 and U_2, Scheme 1 is equivalent to an expected squared jumping distance (ESJD) of E[(U_1 − U_2)²] = 1/6, whereas Scheme 6 is equivalent to an ESJD of E[{(1 − U_1/2) − U_2/2}²] = 7/24; the ratio between the two ESJDs is 7/4.
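
For completeness, both expectations in this heuristic can be verified directly from the moments of a Unif(0,1) variable:
E[(U_1 − U_2)²] = E[U_1²] − 2E[U_1]E[U_2] + E[U_2²] = 1/3 − 1/2 + 1/3 = 1/6,
E[{(1 − U_1/2) − U_2/2}²] = 1 − E[U_1 + U_2] + (1/4) E[(U_1 + U_2)²] = 1 − 1 + (1/4)(1/6 + 1) = 7/24,
and (7/24)/(1/6) = 7/4.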

A naive implementation of each of the above weighting schemes would require storing each of the points in S_{a:b}(z_curr), which has an associated memory cost of O(K), and an exact cost that varies from one iteration to the next as the number of points in each segment is not fixed. For all schemes except the last there is a simple mechanism for sampling z_prop with a fixed O(1) memory cost; however, this would be useless if calculation of the acceptance ratio α in (6) still required O(K) storage. For Schemes 1, 2, and 3, however, it is also possible to calculate α with a fixed O(1) memory usage. We have found that this has a negligible effect on the CPU cost, but the impact on the peak memory footprint of the code is substantial, decreasing by a factor of around 17 in d = 40, 27 in d = 100 and 42 in d = 800. Since there is otherwise little to choose between the schemes, we opt for the most efficient of these, Scheme 3, w(z, z′) = π̃(z′) ‖x′ − x‖², and apply it throughout the remainder of this article. Appendix B details the implementation of Schemes 1–3 with O(1) memory cost.
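
Appendix B gives the paper's O(1)-memory implementation; the sketch below merely illustrates the standard single-pass sampling trick that makes such an implementation possible for the proposal step: as each path point is generated it replaces the stored candidate with probability equal to its weight over the running total of weights, so that once the whole path has been offered the candidate is distributed proportionally to the weights. This is our own illustration and is not taken from Appendix B.

#include <random>
#include <vector>

// Single-pass ("reservoir"-style) proposal selection with O(1) memory in the path length.
struct StreamingProposal {
    double total_w = 0.0;
    std::vector<double> candidate;   // position of the current proposal candidate

    void offer(const std::vector<double>& x, double w, std::mt19937& rng) {
        total_w += w;
        if (total_w <= 0.0) return;
        std::uniform_real_distribution<double> unif(0.0, 1.0);
        if (unif(rng) < w / total_w) candidate = x;   // keep point x with probability w / total
    }
};

For the acceptance ratio under Scheme 3, the sum ∑_z π̃(z)‖x_z − x_prop‖² can similarly be accumulated in a single pass by expanding the square and tracking ∑ π̃(z), ∑ π̃(z) x_z and ∑ π̃(z)‖x_z‖²; we offer this only as a plausible route, not as the exact construction in Appendix B.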

3.4 Robustness and Efficiency

The robustness of AAPS with Scheme 3, which is visible in Figure 1 and in similar figures in Section 4.1, arises from two aspects of the algorithm. We first consider robustness to the choice of ϵ, and then robustness to the choice of K.

For Scheme 2, as with HMC itself, the acceptance probability contains the ratio π̃(z_prop)/π̃(z_curr). The error in the total energy of HMC is proportional to ϵ², so as ϵ increases the acceptance rate drops substantially once the error becomes O(1). By contrast, the acceptance probability for Scheme 3 contains only weighted sums of π̃ in both the numerator and denominator, so the empirical acceptance rate is much more robust to increasing ϵ. This effect can be seen in the right panel of Figure 3.

Robustness to the choice of K arises because AAPS samples a point from the whole set of segments rather than choosing the endpoint of the integration as HMC does. A figure in Appendix D (bottom left) shows the analogous efficiency plot for an alternative version of AAPS which always sets c = 0 and always samples the proposal from segment K via weighting Scheme 3. At its optimum, this algorithm is slightly more efficient than the version of AAPS that we recommend; however, the efficiency drops much more sharply when K is larger or smaller than its optimal value.

Both of the above robustness properties are shared by Scheme 1 (as also demonstrated in Figure 3); however, Scheme 3 is more efficient than Scheme 1 precisely because it preferentially targets proposals that are further from the current value.

3.5 Tuning

Since AAPS with weighting Scheme 3 is much more robust to the choice of either tuning parameter than HMC is, precise tuning is less important than it is for HMC. Thus, we present only brief guidelines here; further heuristics and empirical evidence for them are presented in Appendices E and F.

For any given K > 0, we recommend increasing ϵ from some small value until the empirical acceptance rate starts to change, stopping at the largest ϵ for which the change from the small-ϵ acceptance rate is no more than 3%. Empirically, we have found that such minor changes in the acceptance rate correspond to substantial changes in the total energy over the K + 1 segments, such that the modulation of SJD by π̃ starts to have a negative impact on the choice of z_prop.
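
A minimal sketch of this rule, assuming a user-supplied routine that estimates the empirical acceptance rate from a short pilot run at a given ϵ; the doubling factor and the pilot-run interface are our own choices rather than the paper's procedure.

#include <cmath>
#include <functional>

// Keep doubling eps while the pilot-run acceptance rate stays within 3% of its
// small-eps value; return the last eps for which it did.
double tune_eps(double eps0, const std::function<double(double)>& acceptance_rate) {
    const double baseline = acceptance_rate(eps0);
    double eps = eps0;
    while (std::fabs(acceptance_rate(2.0 * eps) - baseline) <= 0.03) eps *= 2.0;
    return eps;
}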

For a given value of ϵ we recommend a short tuning run of AAPS using a large value, K*, of K, and then choosing a sensible K ∈ {0, …, K*} according to the most popularly proposed segment number, using a diagnostic that we describe in Appendix F.

As with HMC and the no U-turn sampler, the mass matrix, M, used by AAPS can also be tuned, and mixing will be optimized if M^{−1} ≈ var_π[X], as approximated from a tuning run. However, the matrix–vector multiplications required for simulating p_0 and in each leapfrog step can be expensive, so it is usual to choose a diagonal M instead.
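
A simple sketch of this choice, assuming componentwise sample variances from a pilot run are an adequate stand-in for var_π[X]; the function is our own illustration.

#include <vector>

// Diagonal mass matrix from pilot draws: set M_ii = 1 / var(X_i), deliberately
// ignoring off-diagonal structure.
std::vector<double> diagonal_mass(const std::vector<std::vector<double>>& pilot) {
    const size_t n = pilot.size(), d = pilot[0].size();
    std::vector<double> mean(d, 0.0), var(d, 0.0), mass(d);
    for (const auto& x : pilot)
        for (size_t i = 0; i < d; ++i) mean[i] += x[i] / n;
    for (const auto& x : pilot)
        for (size_t i = 0; i < d; ++i) var[i] += (x[i] - mean[i]) * (x[i] - mean[i]) / (n - 1);
    for (size_t i = 0; i < d; ++i) mass[i] = 1.0 / var[i];
    return mass;
}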

3.6 Gaussian Process Limit

Intuitively, the more eccentric a target, the more apogees one might expect per unit time. This section culminates in an expression which makes this relationship concrete for a product target where each component has its own length scale. With the initial condition (X_0, P_0) sampled from π̃, P_t^⊤ M^{−1} ∇U(X_t) can be considered as a random function of time. We first show that, when the true Hamiltonian dynamics are used and subject to conditions, a rescaling of this dot product tends to a stationary Gaussian process as the number of components in the product tends to infinity. A formula for the expected number of apogees per unit time follows as a corollary.

We consider a d-dimensional product target with a potential of
(9) U^{(d)}(x) = ∑_{i=1}^d g(√ν_i^{(d)} x_i^{(d)}) + constant
for some g: ℝ → ℝ and values ν_i^{(d)} > 0, i = 1, …, d, and where ∫ exp{−U^{(d)}(x)} dx = 1.

HMC using a diagonal mass matrix, M, and a product target is equivalent to HMC using an identity mass matrix and the distribution of M^{1/2}X as the target (e.g., Neal 2011b), which, in the case of (9), is also a product target, but with different ν_i. For simplicity, therefore, we assume the identity mass matrix throughout this section. We also consider the true Hamiltonian dynamics, which the leapfrog dynamics approach in the limit as ϵ → 0 and approximate reasonably well for small to moderate ϵ.

Define the scaled dot product at time t, given an initial position of x_0^{(d)} and momentum of p_0^{(d)}, as
(10) D^{(d)}(t; x_0^{(d)}, p_0^{(d)}) := (1/√d) p^{(d)}(t) · ∇_x U|_{x^{(d)}(t)} = (1/√d) ∑_{i=1}^d √ν_i^{(d)} g′(√ν_i^{(d)} x_i^{(d)}(t)) p_i^{(d)}(t).

We define the one-dimensional densities with respect to Lebesgue measure
π_ν(x) = √ν exp{−g(√ν x)}   and   ρ_1(p) = (1/√(2π)) exp{−p²/2},
and let π_ν and ρ_1 also denote the corresponding measures. The joint density and measure are π̃_ν(x, p) = π_ν(x) ρ_1(p) and π̃_ν.

Assumptions 1.

We assume that g ∈ C¹, that there is a δ > 0 such that
(11) E_{X∼π_1}[|g′(X)|^{2+δ}] < ∞,
and that, for each (y_0, ẏ_0) ∈ ℝ², there is a unique, nonexplosive solution a(t; y_0, ẏ_0) to the initial value problem:
(12) d²y/dt² = −g′(y);   y(0) = y_0,   ẏ(0) = ẏ_0.

Theorem 1 is proved in Appendix C.1.

Theorem 1.

Let the potential be defined as in (9), where g satisfies the assumptions around (11) and (12). Further, let μ be a distribution with support on ℝ⁺ with
(13) E_{ν∼μ}[ν^{1+δ/2}] < ∞
for some δ > 0, and let ν_i ∼ μ independently. Define
(14) V(t) = E[ν g′(X) P g′(a(√ν t; X, P)) ȧ(√ν t; X, P)],
where the expectation is over the independent variables X ∼ π_1, P ∼ ρ_1 and ν ∼ μ, and assume that for any finite sequence of n distinct times (t_1, …, t_n), the n × n matrix Σ with Σ_{i,j} = V(t_j − t_i) is positive definite.

Let D^{(d)} be the scaled dot product defined in (10), and let (X_0, P_0) ∼ π̃; then
(15) D^{(d)} ⇒ D̃ ∼ SGP(0, V)
as d → ∞, where Y ∼ SGP(b, V) denotes that Y is a one-dimensional stationary Gaussian process with an expectation of b and a covariance function of cov[Y_s, Y_t] = V(t − s).

Ylvisaker (1965) shows that the expected number of zeros over a unit interval of a stationary Gaussian process with unit variance is (1/π)√(−C″(0)), where C is its covariance function. Zeros can be either apogees or perigees (local minima in U(x) along the path), and these alternate. Hence, with some work (see Appendix C.2), we obtain the following:

Corollary 1.

The expected number of apogees of D̃ over a time T is E[N(T)] ≈ T √(E[ν]) × √(E[ν²] / E[ν]²).

Since P_0 ∼ N(0, I_d), with an identity mass matrix, the root-mean-square speed in any direction is 1. As ν is a squared inverse-scale parameter, the first term in the product simply relates the time interval to the overall length scale of the target (1/√E[ν]). The second part of the product increases with the relative variability in the squared inverse length scales of the components of the target. Choosing K fixes N(T), and approximately fixes the product of the two terms on the right. For a given relative variability in length scales, the integration time automatically flexes according to the overall absolute length scale; however, if that relative variability increases then T, and hence the path length, reduces. This makes it plain that the tuning parameter K relates to properties intrinsic to the target; unlike L, its impact is relatively unaffected by a uniform redefinition of the length scale of the target or by the choice of ϵ.

4 Numerical Experiments

In this section we compare AAPS with HMC, blurred HMC (see Section 1) and the no U-turn sampler over a variety of targets. For fairness we use the basic no U-turn sampler from Hoffman and Gelman (2014), without its adaptive tuning of ϵ; instead we tune ϵ using a fine grid of values. For both varieties of HMC we use a grid of ϵ and L values, and for AAPS we use a grid of ϵ and K values; in each case we choose the combination that leads to the optimal efficiency. To ensure that the various toy targets come close to representing the difficulties encountered with targets from complex statistical models, all algorithms use the identity mass matrix.

All code (AAPS/HMC/no U-turn sampler) was written in C++; the effective sample size was estimated using the R package coda, and the mean number of leapfrog steps per iteration (for AAPS and the no U-turn sampler) was output from each run as a diagnostic.

4.1 Toy Targets

Here we investigate performance across a variety of targets with a relatively simple functional form, and across a variety of dimensions. Many are product densities with independent components: Gaussian, logistic and skew-Gaussian. We consider different, relatively large ratios ξ between the largest and smallest length scales of the components in a target, as well as different sequences of scales from the smallest to the largest. We also consider a modification of the Rosenbrock banana-shaped target.

For a target of dimension d, given a sequence of scale parameters σ_1, …, σ_d, we consider the following four forms:
π_G(x) = ∏_{i=1}^d (1/(σ_i √(2π))) exp(−x_i²/(2σ_i²)),
π_L(x) = ∏_{i=1}^d (1/σ_i) exp(−x_i/σ_i) / {1 + exp(−x_i/σ_i)}²,
π_SG(x) = ∏_{i=1}^d (2/(σ_i √(2π))) exp(−x_i²/(2σ_i²)) Φ(α x_i/σ_i),
π_MR(x) ∝ ∏_{i=1}^{d/2} exp{−(1/(2s_i²)) (x_{2i−1} − 2βs_i)²} × exp{−(1/2) (x_{2i} − 2s_i x_{2i−1}²/(1 + x_{2i−1}²/(4s_i²)))²},
where Φ is the distribution function of a N(0,1) variable, α = 3, β = 1 and s_i² = 99(i−1)/(d/2−1) + 1. The targets π_G, π_L, and π_SG are products of one-dimensional Gaussian, logistic, and skew-Gaussian distributions, respectively. The potential surface of the target π_MR is a modified version of the Rosenbrock function (Rosenbrock 1960), a well-known, difficult target for optimization algorithms, which is also a challenging target used to benchmark MCMC algorithms (e.g., Heng and Jacob 2019; Pagani et al. 2021). The tails of the standard Rosenbrock potential increase quartically, but all algorithms which use a standard leapfrog step and Gaussian momentum are unstable in the tails of a target where the potential increases faster than quadratically. We have, therefore, modified the standard Rosenbrock target to keep the difficult banana shape while ensuring quadratic tails.

For π_G, π_L and π_SG, we denote the largest ratio of length scales by ξ := max_{i,j} σ_i/σ_j and define four different patterns between the lowest and highest scaling, depending on which of σ_i, σ_i², 1/σ_i² or 1/σ_i increases approximately linearly with component number. To minimize the odd behavior that HMC (but not AAPS or the no U-turn sampler) can exhibit when scalings are rational multiples of each other (a phenomenon which is rarely seen for targets in practice), we jitter the scales for all the intermediate components. Specifically, let v be a vector with v_1 = 0, v_d = 1 and, for i = 2, …, d−1, v_i = (i − 1 + U_i)/(d − 1), where U_2, …, U_{d−1} are independent Unif(−0.5, 0.5) variables. Then we define the following four progressions for i = 1, …, d:
SD: σ_i = (ξ − 1)v_i + 1;   VAR: σ_i² = (ξ² − 1)v_i + 1;   H: 1/σ_i² = (1 − 1/ξ²)v_i + 1/ξ²;   invSD: 1/σ_i = (1 − 1/ξ)v_i + 1/ξ.
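
For illustration, the sketch below generates the jittered SD progression (the other three progressions follow the same pattern); the function name is our own.

#include <random>
#include <vector>

// Jittered "SD" progression: v_1 = 0, v_d = 1, v_i = (i - 1 + U_i)/(d - 1) with
// U_i ~ Unif(-0.5, 0.5) for intermediate components, then sigma_i = (xi - 1) v_i + 1,
// so the scales run from 1 to xi.
std::vector<double> scales_SD(int d, double xi, std::mt19937& rng) {
    std::uniform_real_distribution<double> jitter(-0.5, 0.5);
    std::vector<double> v(d), sigma(d);
    v[0] = 0.0;
    v[d - 1] = 1.0;
    for (int i = 1; i < d - 1; ++i) v[i] = (i + jitter(rng)) / (d - 1.0);
    for (int i = 0; i < d; ++i) sigma[i] = (xi - 1.0) * v[i] + 1.0;
    return sigma;
}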

A final target, πGRN, arises from an online comparison between HMC and the no U-turn sampler at https://radfordneal.wordpress.com/2012/01/27/evaluation-of-nuts-more-comments-on-the-paper-by-hoffman-and-gelman/.

Figure 4 uses ξ = 20 and repeats Figure 1 in d = 40 for each of the main product target types: Gaussian, skew-Gaussian and logistic. It demonstrates the robustness of AAPS when compared with HMC and blurred HMC. Appendix D contains more details on these experiments. A figure in Appendix D plots efficiency against step size when the popular no U-turn sampler is used on the targets of Figures 1 and 4. The peaks for the logistic and modified Rosenbrock targets are broad, suggesting some robustness; however, the peaks for the Gaussian and skew-Gaussian targets are much sharper.

Fig. 4 Efficiency, measured via (8), as a function of the tuning parameters for π_G^H (top), π_SG^VAR (middle), and π_L^VAR (bottom), all with (ξ, d) = (20, 40), as defined in Section 4.1. Left panels: HMC, with positive values of L corresponding to the standard HMC algorithm and negative values of L corresponding to |L| leapfrog steps of blurred HMC. Right panels: AAPS. Optimal tuning in red.

We next investigate products of Gaussians using each of the four scale-parameter progressions with ξ = 20. Dimension d = 40 is high enough that, for this and higher dimensions, the component scales can be well approximated as arising from some continuous distribution and Corollary 1 applies.

Table 1 shows the efficiencies of HMC, blurred HMC and the no U-turn sampler relative to that of AAPS. Among the Gaussian targets, the relative performance of AAPS compared with HMC and NUTS is best for π_G^invSD and π_G^H, which have just a few components with large scales and many components with small scales; weighting Scheme 3 ensures that the large components are explored preferentially. In contrast, the large number of components with large scales in π_G^VAR leads to the worst relative performance of AAPS; we, therefore, investigate this regime further using alternative component distributions and choices of dimension and ξ. Across the range of targets, no algorithm is more than 1.7 times as efficient as AAPS. Empirical acceptance rates at the optimal parameter settings are provided in Appendix E. All ESS estimates were at least 1000.

Table 1 Relative efficiency compared with AAPS of HMC, blurred HMC (HMC-bl) and the no U-turn sampler (NUTS); raw efficiencies were taken as the minimum over the d components of the number of effective samples per leapfrog step.

Table 3 of Appendix I provides the equivalent efficiencies when the algorithms are tuned according to recommended guidelines, and shows AAPS remaining competitive with blurred HMC and the no U-turn sampler.

4.2 Stochastic Volatility Model

Consider the following model for zero-centered, Gaussian data y = (y_1, …, y_T), where the variance depends on a zero-mean, Gaussian AR(1) process started from stationarity (e.g., Girolami and Calderhead 2011; Wu et al. 2019):
Y_t ∼ N(0, κ² exp(x_t)),   t = 1, …, T,
X_1 ∼ N(0, σ²/(1 − ϕ²)),   X_t | (X_{t−1} = x_{t−1}) ∼ N(ϕ x_{t−1}, σ²),   t = 2, …, T.

Parameter priors are π_0(κ) ∝ 1/κ, 1/σ² ∼ Gamma(10, 0.05) and (1 + ϕ)/2 ∼ Beta(20, 1.5), and we reparameterize to give a parameter vector in ℝ³:
α = log{(1 + ϕ)/(1 − ϕ)},   β = log κ,   and   γ = 2 log σ.

As in Girolami and Calderhead (2011) and Wu et al. (2019), we generate T = 1000 observations using parameters ϕ = 0.98, κ = 0.65 and σ = 0.15. We then apply blurred HMC, AAPS and the no U-turn sampler to perform inference on the 1003-dimensional posterior. We ran AAPS using two different K values, one found by optimizing the choice of (ϵ, K) over a numerical grid and one obtained using the tuning mechanism mentioned in Section 3.5. Tuning standard HMC (not blurred HMC) on such a high-dimensional target was extremely difficult; owing to the algorithm's sensitivity to the integration time, we could not identify a suitable range for L. Widely used statistical packages such as Stan (Stan Development Team 2020) and PyMC3 (Salvatier et al. 2016) perform the blurring by default, and so we only present the results for HMC-bl.
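
For reference, a sketch of the data-generating process just described (simulation of the data only, not of the posterior inference):

#include <cmath>
#include <random>
#include <vector>

// Simulate the stochastic volatility data of Section 4.2: a latent, stationary
// Gaussian AR(1) log-volatility with conditionally Gaussian observations.
// The experiments use T = 1000, phi = 0.98, kappa = 0.65 and sigma = 0.15.
std::vector<double> simulate_sv(int T, double phi, double kappa, double sigma,
                                std::mt19937& rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<double> y(T);
    double x = sigma / std::sqrt(1.0 - phi * phi) * gauss(rng);   // X_1 from stationarity
    for (int t = 0; t < T; ++t) {
        y[t] = kappa * std::exp(0.5 * x) * gauss(rng);            // Y_t ~ N(0, kappa^2 exp(x_t))
        if (t + 1 < T) x = phi * x + sigma * gauss(rng);          // X_{t+1} | x_t
    }
    return y;
}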

Each algorithm was then run for 10 replicates of 10^5 iterations using the optimal tuning parameters. Efficiency was calculated for each parameter, and the minimum efficiencies over the latent variables and over all 1003 components were also calculated. For each algorithm the mean and standard deviation (over the replicates) of these efficiencies were ascertained; Table 2 reports these values normalized by the mean efficiency of AAPS for that parameter or parameter combination. Overall, on this complex, high-dimensional posterior, AAPS is slightly less efficient than blurred HMC, and slightly more efficient than the no U-turn sampler.

Table 2 Relative efficiency, compared with AAPSg (AAPS tuned using a grid), of AAPSa (AAPS tuned using the advice from Section 3.5), blurred HMC (HMC-bl) and the no U-turn sampler (NUTS) for the stochastic volatility model, using 10 replicates of 10^5 iterations.

4.3 Multimodality

The AAPS algorithm with K = 0 is close to reducible on a multimodal one-dimensional target, and this might lead to concerns about the algorithm's performance on multimodal targets in general. However, in d dimensions, because p^⊤ ∇U(x) is a sum of d terms, even with K = 0, AAPS is not reducible on multimodal targets with d > 1. This is illustrated in Appendix G, which also details a short simulation study on three 40-dimensional bimodal targets where AAPS is more efficient than the no U-turn sampler and is never less than two-thirds as efficient as blurred HMC.

5 Discussion

We have presented the Apogee to Apogee Path Sampler (AAPS), and demonstrated empirically that it has a similar efficiency to HMC but is much easier to tune. From a current point, AAPS uses the leapfrog step to create a path consisting of a fixed number of segments; it then proposes a point from these segments and uses an accept-reject step to ensure that it targets the intended distribution.

We investigated six possible mechanisms for proposing a point from the path, and for the numerical experiments we chose the probability of proposing a point to be proportional to the product of the extended target density at that point and the proposal's squared distance from the current point, which was possible with an O(1) memory cost. However, the flexibility in the proposal mechanism allows other possibilities, such as a Mahalanobis distance based on an estimated covariance matrix, or (‖x_prop − μ‖² − ‖x_curr − μ‖²)² for some central point, μ, with a similar motivation to the ChEES diagnostic of Hoffman et al. (2021). Indeed, if any scalar or vector function, f, is of particular interest, then a proposal weighting of the form ‖f(x_prop) − f(x_curr)‖² could be used with a memory cost of O(1).

Choosing the current segment's position uniformly at random from the K + 1 segments is not the only way to preserve detailed balance with respect to the intended target. For example, the current segment could be fixed as segment 0 and proposals could be made only from segment K, a choice which bears some resemblance to the window scheme in Neal (1992); however, we found that this had a negative impact on the robustness of the efficiency to the choice of K (see Appendix D).

Because of its simplicity, many extensions to the AAPS algorithm are clear. For example, if the positions along the path are stored, then a delayed-rejection step may increase the acceptance probabilities. A cheap surrogate for π could be substituted within π̃ in any weighting scheme. Indeed, given c, the randomly chosen offset of segment 0, the next value could be chosen conditional on z_curr using any Markov kernel reversible with respect to π̃ (see also Neal 2011a). A nonreversible version of the algorithm could set K = 1 and always choose a path consisting of the current segment and the next segment forward; instead of completely refreshing the momentum at each iteration, the momentum at the start of a new step could be a Crank–Nicolson perturbation of the momentum at the end of the previous step, as in Horowitz (1991). The properties of the leapfrog step required for the validity of AAPS are a subset of those required for HMC (see Section 2.2), so any alternative momentum formulation (e.g., Livingstone et al. 2019) or numerical integration scheme that can be used within HMC could also be used within AAPS.

AAPS with weighting Scheme 1 relates to HMC using windows of states (Neal 2011b), but with K defining the total number of (forward and backward) leapfrog steps taken rather than the number of additional segments. As mentioned in Section 1 and shown in Corollary 1, the number of apogees is a more natural tuning parameter than an integration time as it relates to intrinsic properties of the target: rescaling all coordinates by a constant factor would not change the optimal K.

Supplementary Materials

Appendices A–I are contained in the supplementary materials.


Acknowledgments

We are grateful to two anonymous reviewers for comments and suggestions that have materially improved the article.

Data Availability Statement

Code is available from https://github.com/ChrisGSherlock/AAPS

Disclosure Statement

There are no financial or nonfinancial conflicts of interest to declare.

Additional information

Funding

All three authors gratefully acknowledge funding through EPSRC grant EP/P033075/1.

References

  • Beskos, A., Pillai, N., Roberts, G., Sanz-Serna, J.-M. and Stuart, A. (2013), “Optimal Tuning of the Hybrid Monte Carlo Algorithm,” Bernoulli, 19, 1501–1534. http://www.jstor.org/stable/42919328. DOI: 10.3150/12-BEJ414.
  • Bou-Rabee, N., and Sanz-Serna, J. M. (2017), “Randomized Hamiltonian Monte Carlo,” The Annals of Applied Probability, 27, 2159–2194. http://www.jstor.org/stable/26361544. DOI: 10.1214/16-AAP1255.
  • Brooks, S., Gelman, A., Jones, G. L., and Meng, X.-L., eds. (2011), Handbook of Markov Chain Monte Carlo, Chapman & Hall/CRC Handbooks of Modern Statistical Methods, Boca Raton, FL: CRC Press.
  • Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. (1987), “Hybrid Monte Carlo,” Physics Letters B, 195, 216–222. https://www.sciencedirect.com/science/article/pii/037026938791197X. DOI: 10.1016/0370-2693(87)91197-X.
  • Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London, UK: Chapman and Hall.
  • Girolami, M., and Calderhead, B. (2011), “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods,” Journal of the Royal Statistical Society, Series B, 73, 123–214. DOI: 10.1111/j.1467-9868.2010.00765.x.
  • Heng, J., and Jacob, P. E. (2019), “Unbiased Hamiltonian Monte Carlo with Couplings,” Biometrika, 106, 287–302. DOI: 10.1093/biomet/asy074.
  • Hoffman, M., Radul, A., and Sountsov, P. (2021), “An Adaptive-MCMC Scheme for Setting Trajectory Lengths in Hamiltonian Monte Carlo,” in Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, eds., A. Banerjee and K. Fukumizu, vol. 130 of Proceedings of Machine Learning Research, pp. 3907–3915. PMLR. https://proceedings.mlr.press/v130/hoffman21a.html.
  • Hoffman, M. D., and Gelman, A. (2014), “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15, 1593–1623.
  • Horowitz, A. M. (1991), “A Generalized Guided Monte Carlo Algorithm,” Physics Letters B, 268, 247–252. https://www.sciencedirect.com/science/article/pii/0370269391908125. DOI: 10.1016/0370-2693(91)90812-5.
  • Livingstone, S., Faulkner, M. F., and Roberts, G. O. (2019), “Kinetic Energy Choice in Hamiltonian/Hybrid Monte Carlo,” Biometrika, 106, 303–319. DOI: 10.1093/biomet/asz013.
  • Mackenzie, P. B. (1989), “An Improved Hybrid Monte Carlo Method,” Physics Letters B, 226, 369–371. DOI: 10.1016/0370-2693(89)91212-4.
  • Neal, R. M. (1992), “An Improved Acceptance Procedure for the Hybrid Monte Carlo Algorithm,” Journal of Computational Physics, 111, 194–203. DOI: 10.1006/jcph.1994.1054.
  • Neal, R. M. (2011a), “MCMC Using Ensembles of States for Problems with Fast and Slow Variables such as Gaussian Process Regression.” arXiv 1101.0387.
  • Neal, R. M. (2011b), “MCMC Using Hamiltonian Dynamics,” in Handbook of Markov Chain Monte Carlo, eds. S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, pp. 113–162, Boca Raton, FL: CRC Press.
  • Pagani, F., Wiegand, M., and Nadarajah, S. (2021), “An n-dimensional Rosenbrock Distribution for Markov Chain Monte Carlo Testing,” Scandinavian Journal of Statistics, 49, 657–680. DOI: 10.1111/sjos.12532.
  • Roberts, G. O., and Rosenthal, J. S. (2001), “Optimal Scaling for Various Metropolis-Hastings Algorithms,” Statistical Science, 16, 351–367. DOI: 10.1214/ss/1015346320.
  • Rosenbrock, H. H. (1960), “An Automatic Method for Finding the Greatest or Least Value of a Function,” The Computer Journal, 3, 175–184. DOI: 10.1093/comjnl/3.3.175.
  • Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. (2016), “Probabilistic Programming in Python Using PyMC3,” PeerJ Computer Science, 2, e55. DOI: 10.7717/peerj-cs.55.
  • Sherlock, C., and Roberts, G. (2009), “Optimal Scaling of the Random Walk Metropolis on Elliptically Symmetric Unimodal Targets,” Bernoulli, 15, 774–798. DOI: 10.3150/08-BEJ176.
  • Stan Development Team. (2020), Stan Modeling Language Users Guide and Reference Manual, Version 2.28, http://mc-stan.org/.
  • Wu, C., Stoehr, J., and Robert, C. P. (2019), “Faster Hamiltonian Monte Carlo by Learning Leapfrog Scale.” arXiv 1810.04449.
  • Ylvisaker, N. D. (1965), “The Expected Number of Zeros of a Stationary Gaussian Process,” Annals of Mathematical Statistics, 36, 1043–1046. DOI: 10.1214/aoms/1177700077.