Abstract
Among Markov chain Monte Carlo algorithms, Hamiltonian Monte Carlo (HMC) is often the algorithm of choice for complex, high-dimensional target distributions; however, its efficiency is notoriously sensitive to the choice of the integration-time tuning parameter. When integrating both forward and backward in time using the same leapfrog integration step as HMC, the set of apogees, local maxima in the potential along a path, is the same whatever point (position and momentum) along the path is chosen to initialize the integration. We present the Apogee to Apogee Path Sampler (AAPS), which uses this invariance to create a simple yet generic methodology for constructing a path, proposing a point from it and accepting or rejecting that proposal so as to target the intended distribution. We demonstrate empirically that AAPS has a similar efficiency to HMC but is much more robust to the setting of its equivalent tuning parameter, the number of apogees that the path crosses. Supplementary materials for this article are available online.
1 Introduction
Markov chain Monte Carlo (MCMC) is often the method of choice for estimating expectations with respect to complex, high-dimensional targets (e.g., Gilks et al. 1996; Brooks et al. 2011). Amongst MCMC algorithms, Hamiltonian Monte Carlo (HMC, also known as Hybrid Monte Carlo; Duane et al. 1987) is known to offer a performance that scales better with the dimension of the state space than many of its rivals (Neal 2011b; Beskos et al. 2013).
Given a target density π(x) with respect to Lebesgue measure and a current position, at each iteration HMC samples a momentum and numerically integrates Hamiltonian dynamics on a potential surface U(x) := −log π(x) (1) to create a proposal that will either be accepted or rejected. As such, HMC has two main tuning parameters: the numerical integration step size, ϵ, and the total integration time, T. Given T, guidelines for tuning ϵ have been available for some time (Beskos et al. 2013); however, the integration time itself is notoriously difficult to tune (e.g., Neal 2011b), with algorithm efficiency often dropping sharply following only slight changes from the optimum T value, and usually exhibiting approximately cyclic behavior as T increases.
The sensitivity is illustrated in the left panel of , in which HMC is applied to a modified Rosenbrock distribution (see Section 4.1) of dimension d = 40. In the top half of this plot, the efficiency (see (8) for a precise definition) is given as a function of ϵ and the number of numerical integration steps, L. The bottom half of the panel shows the analogous plot for a modification of the HMC algorithm (Mackenzie 1989; Neal 2011b), which we refer to as blurred HMC, where at each iteration the actual step size is sampled (independently of all previous choices) uniformly from the interval . This step was designed to mitigate the near reducibility of HMC that can occur when T is some rational multiple of the integration time required to return close to the starting point, but as is visible from the plots, it also makes the performance of the algorithm more robust to the choice of T, and, we have found, often leads to a slightly more efficient algorithm. In both cases, the optimal tuning choice appears as a narrow ridge of roughly constant total integration time ϵL. Blurred HMC can be viewed as sampling the integration time T uniformly from , where . The approach of using a random integration time to introduce robustness has been extended recently to sampling uniformly from (Hoffman et al. 2021) and as an exponential variable with an expectation of (Bou-Rabee and Sanz-Serna 2017).
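For concreteness, the step-size blurring can be sketched as follows (a minimal illustration; the ±10% blurring width and the function name are our own assumptions, since the interval itself is not reproduced here):

```python
import random

def blurred_step_size(eps_bar, delta=0.1):
    """Draw this iteration's leapfrog step size uniformly from
    ((1 - delta) * eps_bar, (1 + delta) * eps_bar), independently of
    all previous draws.  The +/-10% width is an illustrative choice."""
    return random.uniform((1.0 - delta) * eps_bar, (1.0 + delta) * eps_bar)
```

Because the draw is independent at each iteration, the effective integration time Lϵ varies from iteration to iteration, breaking the near periodicity described above.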
Motivated by the difficulty of tuning T, Hoffman and Gelman (2014) introduced the no U-turn sampler (NUTS). This uses the same numerical integration scheme as standard HMC, the leapfrog step, to integrate Hamiltonian dynamics both forward and backward from the current point, recursively doubling the size of the path until the distance between the points farthest forward and backward in time stops increasing. Considerable care must be taken to ensure that the intended posterior is targeted, making the algorithm relatively complex; however, the no U-turn sampler is the default engine behind the popular Stan package (Stan Development Team 2020), which has a relatively straightforward user interface.
Integration of Hamiltonian dynamics can be thought of as describing the position and momentum of a particle as it traverses the potential surface U. During its journey, provided T is not too small, the particle will reach one or more local maxima, or apogees, in the potential. The leapfrog scheme creates a path which is a discrete set of points rather than a continuum, and so (with probability 1) apogees occur between consecutive points in the path; however, they are straightforward to detect. We call the set of points (positions and momenta) between two apogees a segment. The discrete dynamics using the leapfrog step share several properties with the true dynamics, including the following: if we take the position and momentum of any point along the path, and integrate forward and backward for appropriate lengths of time we will create exactly the same path, and hence the same set of apogees and the same set of segments. This invariance is vital to the correctness and flexibility of the algorithm presented in this article.
In Section 3 we present the Apogee to Apogee Path Sampler (AAPS). Like HMC, AAPS is straightforward to implement and has two tuning parameters, one of which is an integration step size, ϵ. As with the no U-turn sampler, given a current point (position and momentum), AAPS uses the leapfrog step to integrate forwards and backwards in time. However, the integration stops when the path contains the segment in which the current point lies as well as K additional segments, where K is a user-defined tuning parameter. A point is then proposed from this set of K + 1 segments and either accepted or rejected. The positioning of the current segment within the K + 1 segments and the accept-reject probability are chosen precisely so that the algorithm targets the intended density. The invariance of the path to the starting position and momentum leads to considerable flexibility in the method for proposing a point from the set of segments, which in turn allows us to create an algorithm which enjoys a similar efficiency to HMC yet is extremely robust to the choice of ϵ and K. These properties are evident from the right-hand panel of and analogous plots for other distributions in Section 4.1. The robustness arises mainly from the proposal scheme; see the end of Section 3.3. The definition of K, however, also naturally caters to intrinsic properties of the target, such as its eccentricity, with any externally imposed length (or time) scale largely irrelevant; the theoretical analysis of Section 3.6 makes this explicit in a simplified setting.
Section 2 describes Hamiltonian dynamics and HMC. The AAPS is detailed in Section 3, and is compared empirically against HMC and the no U-turn sampler in Section 4. We conclude in Section 5 with a discussion.
2 Hamiltonian Dynamics and Hamiltonian Monte Carlo
2.1 Hamiltonian Dynamics
The position, x, and momentum, p, of a particle on a frictionless potential surface U(x) evolve according to Hamilton’s equations: dx/dt = M⁻¹p, dp/dt = −∇U(x). (2)
For a real object, M is the mass of the object, a scalar, but the properties of the equations themselves that we will require hold more generally, when M is a symmetric, positive-definite mass matrix. The choice of M, whether for HMC or AAPS, is discussed at the end of Section 3.5. We define z := (x, p) and the map F_t which integrates the dynamics forwards for a time t, so F_t(z_0) = z_t. The map F_t has the following fundamental properties (e.g., Neal 2011b):
It is deterministic.
It has a Jacobian of 1.
It is skew reversible: S ∘ F_t ∘ S ∘ F_t = Id, where S(x, p) := (x, −p).
It preserves the total energy H(x, p) := U(x) + (1/2) pᵀM⁻¹p.
Except in a few special cases, the dynamics are intractable and must be integrated numerically, with a user-chosen time step which we will denote by ϵ. The default method for Hamiltonian Monte Carlo, and the one used throughout this article, is the leapfrog step, which is detailed in Appendix A. Throughout the main text of this article we denote the action of a single leapfrog step of length ϵ on a current state z as . The leapfrog step satisfies Properties 1–3 above (see Appendix A), but it does not preserve the total energy.
Consider using L leapfrog steps of size ϵ := t/L to approximately integrate the dynamics forward for time t. Since each individual leapfrog step satisfies Properties 1–3, so does the composition of L such steps, which we denote .
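The leapfrog step and its L-fold composition can be sketched as follows (a minimal illustration assuming an identity mass matrix; the function names are ours, not notation from this article):

```python
import numpy as np

def leapfrog(x, p, eps, grad_U):
    """One leapfrog step of size eps for H(x, p) = U(x) + ||p||^2 / 2
    (identity mass matrix assumed in this sketch)."""
    p = p - 0.5 * eps * grad_U(x)   # half step in momentum
    x = x + eps * p                 # full step in position
    p = p - 0.5 * eps * grad_U(x)   # half step in momentum
    return x, p

def leapfrog_path(x, p, eps, grad_U, L):
    """Compose L leapfrog steps, returning the whole discrete path."""
    path = [(x, p)]
    for _ in range(L):
        x, p = leapfrog(x, p, eps, grad_U)
        path.append((x, p))
    return path
```

Skew reversibility (Property 3) can be checked numerically: flipping the momentum, taking one step, and flipping again recovers the starting state exactly.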
2.2 Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) creates a Markov chain whose stationary distribution has the density π̃ given in (3). Given a current position and tuning parameters ϵ and L, a single iteration of the algorithm proceeds as follows:
Sample a momentum and set .
For i in 1 to L:
.
Let and set .
With probability α, ; else .
Here, with ψ denoting the density of the N(0, M) random variable, π̃(x, p) := π(x)ψ(p), and the acceptance probability is α := min{1, π̃(x_L, p_L)/π̃(x_0, p_0)}. (3)
If x is in its stationary distribution then π̃ is the density of (x_0, p_0).
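In code, one iteration of the scheme above might look like this (a hedged sketch assuming an identity mass matrix, so that the acceptance probability reduces to min(1, exp(H(z) − H(z′))); the function name is ours):

```python
import numpy as np

def hmc_iteration(x, eps, L, U, grad_U, rng):
    """One HMC iteration (sketch): refresh momentum, take L leapfrog
    steps, flip the momentum, then Metropolis accept/reject."""
    p = rng.standard_normal(len(x))               # step 1: momentum refresh
    H0 = U(x) + 0.5 * p @ p
    x_new, p_new = x.copy(), p.copy()
    for _ in range(L):                            # step 2: integrate
        p_new = p_new - 0.5 * eps * grad_U(x_new)
        x_new = x_new + eps * p_new
        p_new = p_new - 0.5 * eps * grad_U(x_new)
    p_new = -p_new                                # step 3: momentum flip
    H1 = U(x_new) + 0.5 * p_new @ p_new
    if rng.random() < np.exp(min(0.0, H0 - H1)):  # step 4: accept/reject
        return x_new
    return x
```

The momentum flip leaves the acceptance probability unchanged (ψ is symmetric) but makes the proposal skew reversible, which is what guarantees detailed balance.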
3 The Apogee to Apogee Path Sampler
3.1 Apogees and Segments
The left panel of shows L = 50 leapfrog steps of size ϵ from a current position, x_0, simulated randomly from a two-dimensional posterior (with contours of U(x) shown), and momentum p_0 simulated from its N(0, M) distribution. Different symbols and colors are used along the path, with both of these changing from step l to step l + 1 if and only if ∇U(x_l)ᵀM⁻¹p_l > 0 ≥ ∇U(x_{l+1})ᵀM⁻¹p_{l+1}. (4)
Intuitively, condition (4) indicates when the “particle” has switched from moving “uphill” to moving “downhill” with respect to the potential surface U(x). By (2), dU(x_t)/dt = ∇U(x_t)ᵀM⁻¹p_t, so (4) indicates a local maximum in U between x_l and x_{l+1}.
Between such a pair of points at l and l + 1 there is a hypothetical point at which the particle’s potential reaches a local maximum; we call this point an apogee. Under the exact, continuous dynamics, this point would, of course, be realized, but under the discretized dynamics the probability of this is 0. We call each of the realized sections of the path between a pair of consecutive apogees (i.e., each portion with a different color and symbol in ) a segment. Each segment consists of the time-ordered list of position and momentum at each point between two consecutive apogees.
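Detecting apogees along a stored leapfrog path is straightforward; the sketch below (our own illustration, identity mass matrix assumed) counts sign changes of dU(x_t)/dt = ∇U(x)·p from positive to nonpositive, as in condition (4), and labels each point with its segment:

```python
import numpy as np

def count_apogees(path, grad_U):
    """Count apogees on a discrete (x, p) path: steps where the dot
    product grad_U(x).p switches from positive to nonpositive."""
    dots = [grad_U(x) @ p for x, p in path]
    return sum(1 for a, b in zip(dots, dots[1:]) if a > 0.0 >= b)

def segment_labels(path, grad_U):
    """Label each point of the path with the index of the segment
    (stretch between consecutive apogees) that contains it."""
    dots = [grad_U(x) @ p for x, p in path]
    labels, seg = [0], 0
    for a, b in zip(dots, dots[1:]):
        if a > 0.0 >= b:
            seg += 1            # the next point starts a new segment
        labels.append(seg)
    return labels
```

For a one-dimensional standard Gaussian the exact dynamics are rotations, so U along the path has period π and the apogee count per unit time can be checked directly.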
Instead of integrating forward for L steps, one can imagine integrating both forwards and backwards in time from (x_0, p_0) until a certain number of apogees have been found. The right panel of shows the segment to which the current point belongs, which we denote , together with two segments forward and one segment backward. We denote the jth segment forward by and the jth segment backward by . We abbreviate the ordered collection of segments from to as . Thus, the right panel of shows the positions from . For a particular point , we denote the segment to which it belongs by .
The following segment-invariance property is vital to both the simplicity and correctness of our algorithm. For any z and integers a ≤ 0 ≤ b, let S_{a:b}(z) denote the ordered collection of segments from segment a to segment b around z. For any point z† on this path, S_{a†:b†}(z†) = S_{a:b}(z). (5)
The quantities a†, b†, and c† correspond to a, b, and c but from the point of view of z† rather than z. For the right panel of , for example, picking any z† from the path shown would give the same ordered set of segments as illustrated in the figure. This is because the numerical integration scheme is deterministic and skew reversible, so the apogees would all occur in exactly the same positions with exactly the same (up to a possible sign flip) momenta.
3.2 The AAPS Algorithm
We now introduce our algorithm, the Apogee to Apogee Path Sampler (AAPS). The algorithm requires a weight function w(z, z′) ≥ 0 on pairs of points in the 2d-dimensional phase space, where d is the dimension of the target. Weight functions are investigated in more detail in Section 3.3, but for now it might be helpful to keep in mind the simplest that we consider: w(z, z′) = π̃(z′).
Given a step-size ϵ, a nonnegative integer, K, a mass matrix, M, and a current position, one iteration of AAPS proceeds as follows:
Sample a momentum and set .
Simulate c uniformly from {0, 1, …, K}; set a := −c and b := K − c.
Create S_{a:b}(x_0, p_0) by leapfrog stepping forward from (x_0, p_0) and then backward from (x_0, p_0).
Propose z′ from S_{a:b}(x_0, p_0) with probability proportional to w((x_0, p_0), z′).
With a probability of α := min{1, [π̃(z′) w(z′, z_0) Σ_{z″∈S_{a:b}} w(z_0, z″)] / [π̃(z_0) w(z_0, z′) Σ_{z″∈S_{a:b}} w(z′, z″)]}, where z_0 := (x_0, p_0), (6) set (x, p) ← z′; else set (x, p) ← z_0.
Discard the momentum p and retain the position x.
The Metropolis-Hastings formula (6) arises because, out of the allowable proposals once c has been chosen, the probability of proposing z′ is w(z_0, z′)/Σ_{z″∈S_{a:b}} w(z_0, z″).
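Putting Steps 1–6 together, one AAPS iteration might be sketched as follows (our illustration only: identity mass matrix, the simplest weight scheme w(z, z′) = π̃(z′) for which the acceptance probability is 1, a stable step size assumed, and the stability check and fixed-memory tricks of later sections omitted; function names are ours):

```python
import numpy as np

def aaps_iteration(x, eps, K, U, grad_U, rng):
    """One AAPS iteration (hedged sketch, not the authors' implementation)."""

    def leapfrog(x, p):
        p = p - 0.5 * eps * grad_U(x)
        x = x + eps * p
        p = p - 0.5 * eps * grad_U(x)
        return x, p

    def log_pi_tilde(x, p):          # extended target, up to a constant
        return -U(x) - 0.5 * p @ p

    def collect(x, p, n_apogees):
        """Leapfrog from (x, p), keeping every point strictly before the
        n_apogees-th apogee crossing (condition (4) with M = I)."""
        pts, crossings = [], 0
        d_prev = grad_U(x) @ p
        while True:
            x, p = leapfrog(x, p)
            d_cur = grad_U(x) @ p
            if d_prev > 0.0 >= d_cur:        # crossed an apogee
                crossings += 1
                if crossings == n_apogees:
                    return pts
            pts.append((x, p))
            d_prev = d_cur

    p0 = rng.standard_normal(len(x))
    c = rng.integers(0, K + 1)               # current segment is segment c of the K + 1
    fwd = collect(x, p0, K - c + 1)          # through segment K - c forward
    bwd = collect(x, -p0, c + 1)             # back part of segment 0 through segment -c
    points = [(x, p0)] + fwd + [(xb, -pb) for xb, pb in bwd]

    logw = np.array([log_pi_tilde(xi, pi) for xi, pi in points])
    probs = np.exp(logw - logw.max())
    probs /= probs.sum()
    j = rng.choice(len(points), p=probs)     # propose w.p. proportional to pi_tilde
    return points[j][0]                      # this scheme accepts with probability 1
```

Note how the backward integration simply runs the leapfrog forward from (x, −p0) and flips the stored momenta, exploiting skew reversibility.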
Proposition 1.
The AAPS algorithm satisfies detailed balance with respect to the extended posterior π̃.
Proof.
Step 1 preserves π̃ because p is independent of x and is sampled from its marginal.
It will be helpful to define the system from the point of view of starting at . Let and be as defined in (5), but with . Then, since ,
Equivalently, . Moreover, segment invariance (5) is equivalent to .
The resulting chain satisfies detailed balance with respect to because is which is
This expression is invariant to . □
Remark 1.
The AAPS algorithm could be applied with a numerical integration scheme that does not have a Jacobian of 1. In such a scheme, however, the Jacobian would need to be included in the acceptance probability; moreover, a Jacobian other than 1 would lead to greater variability in π̃ along a path and, we conjecture, to a reduced efficiency.
The path may visit parts of the state space where the numerical solution to (2) is unstable and the error in the Hamiltonian may increase without bound, leading to wasted computational effort, as large chunks of the path may be very unlikely to be proposed and accepted. Indeed it is even possible for the error to increase beyond machine precision. The no U-turn sampler suffers from a similar problem and we introduce a similar stability condition to that in Hoffman and Gelman (2014). We require that max_{z′∈S_{a:b}} H(z′) − min_{z′∈S_{a:b}} H(z′) ≤ Δ (7) for some prespecified parameter Δ. This criterion can be monitored as the forward and backward integrations proceed, and if at any point the criterion is breached, the path is thrown away: the proposal is automatically rejected. Segment invariance means that the same rejection would have occurred if we had created the path from any proposal in S_{a:b}, and detailed balance is still satisfied. For the experiments in Section 4 we found that a value of Δ = 1000 was sufficiently large that the criterion only interfered when something was seriously wrong with the integration. Step 3 of the algorithm then becomes:
Create by leapfrog stepping forward from (x0, p0) and then backward from (x0, p0). If condition (7) fails then go to 6.
For a d-dimensional Markov chain X with a stationary distribution of π, the asymptotic variance of a square-integrable functional f is σ_f² := lim_{n→∞} n Var(n⁻¹ Σ_{j=1}^n f(X_j)). Thus, after burn in, Var(n⁻¹ Σ_{j=1}^n f(X_j)) ≈ σ_f²/n. The effective sample size is the number of iid samples from π that would lead to the same variance: ESS := n Var_π(f)/σ_f². Let ESS_i be an empirical estimate of the ESS for the ith component of X (i.e., f gives the ith component) and let N_leap be the total number of leapfrog steps taken during the run of the algorithm. We measure the efficiency of an algorithm as Eff := min_{i∈{1,…,d}} ESS_i / N_leap. (8)
Since the leapfrog step is by far the most computationally intensive part of the algorithm, the efficiency (8) is proportional to the number of effective iid samples generated per second for the worst-mixing component.
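In code, the efficiency measure (8) is simply the following (a trivial helper of our own; it assumes ESS estimates, e.g., from the coda package, are already in hand):

```python
def efficiency(ess_per_component, total_leapfrog_steps):
    """Efficiency as in (8): the smallest per-component effective sample
    size divided by the total number of leapfrog steps used."""
    return min(ess_per_component) / total_leapfrog_steps
```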
3.3 Choice of Weight Function
The weight function w(z, z′) = π̃(z′) mentioned previously is a natural choice, and substitution into (6) shows that this leads to an acceptance probability of 1; however, it turns out not to be the most efficient choice. For example, intuitively the algorithm might be more efficient if points on the path that are further away from the current point are more likely to be proposed. To investigate the effects of the choice of w we examine six possibilities:
w(z, z′) = π̃(z′), which leads to α = 1.
w(z, z′) = ‖x′ − x‖², squared jumping distance (SJD).
w(z, z′) = π̃(z′) ‖x′ − x‖², SJD modulated by the target.
w(z, z′) = ‖x′ − x‖, absolute jumping distance (AJD).
w(z, z′) = π̃(z′) ‖x′ − x‖, AJD modulated by the target.
w(z, z′) = π̃(z′) 1[z′ ∈ A(z)], where A(z) is described below; this also leads to α = 1.
Scheme 6 essentially partitions S_{a:b} into two halves, A(z) and its complement, of roughly equal total π̃ and then proposes only values from the half, A(z), that does not contain z; details are given in Appendix H.
The left panel of shows, for a particular choice of ϵ and target, the efficiency of AAPS as a function of K for each of the six weight schemes. Scheme 6 is the most efficient; however, Schemes 3 and 5, each of which involves some measure of jumping distance modulated by π̃, are not far behind. Indeed there is only a factor of two between the least and most efficient. Very similar relative performances were found for all the other toy targets from Section 4.1 and across a variety of choices of ϵ, except that when ϵ becomes small, modulation of SJD or AJD by π̃ makes little difference since relative changes in π̃ are small.
A simple heuristic can explain the difference of a factor of nearly two between Scheme 1 and Scheme 6 in the case where π̃ is approximately constant; for example, when ϵ is small. The path is constructed prior to making the proposal, so the computational effort does not depend on the scheme; thus, efficiency can be measured crudely in terms of the squared jumping distance of the proposal (e.g., Roberts and Rosenthal 2001; Sherlock and Roberts 2009; Beskos et al. 2013), since it is accepted with probability 1. Without loss of generality, we rescale the path to have unit length. For two independent Unif(0, 1) variables U1 and U2, Scheme 1 is equivalent to an expected squared jumping distance (ESJD) of E[(U1 − U2)²] = 1/6, whereas Scheme 6, which proposes from the half of the interval not containing the current point, is equivalent to an ESJD of 7/24; the ratio between the two ESJDs is 7/4.
A naive implementation of each of the above weighting schemes would require storing each of the points in S_{a:b}, with a memory cost that grows with the path length and varies from one iteration to the next, as the number of points in each segment is not fixed. For all schemes except the last there is a simple mechanism for sampling with a fixed memory cost; however, this would be useless if calculation of the acceptance ratio α in (6) still required storage of the whole path. For Schemes 1, 2, and 3, however, it is also possible to calculate α with a fixed memory usage. We have found that this has a negligible effect on the CPU cost but the impact on the peak memory footprint of the code is substantial, decreasing by a factor of around 17 in d = 40, 27 in d = 100 and 42 in d = 800. Since there is little otherwise to choose between the schemes, we opt for the most efficient of these, Scheme 3, and apply this throughout the remainder of this article. Appendix B details the implementation of Schemes 1–3 with a fixed memory cost.
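The fixed-memory idea can be illustrated as follows (our own sketch, not the scheme of Appendix B): a single-item weighted "reservoir" selects the proposal without storing the path, and, for Scheme 3 with w(z, z′) = π̃(z′)‖x′ − x‖², three running sums allow the weight-normalizing sums in the acceptance ratio to be evaluated for any point after a single pass:

```python
import numpy as np

class StreamingProposal:
    """Keep one candidate such that, after the whole stream has been
    offered, the candidate is distributed proportionally to its weight
    (single-item weighted reservoir sampling; O(1) storage)."""

    def __init__(self, rng):
        self.rng = rng
        self.total = 0.0
        self.candidate = None

    def offer(self, point, weight):
        self.total += weight
        if self.total > 0.0 and self.rng.random() < weight / self.total:
            self.candidate = point

class Scheme3Stats:
    """Running sums s0, s1, s2 from which sum_z'' pi(z'') ||x'' - x||^2
    can be evaluated for any x after a single pass, with O(d) memory."""

    def __init__(self, d):
        self.s0, self.s1, self.s2 = 0.0, np.zeros(d), 0.0

    def add(self, x, pi_t):
        self.s0 += pi_t
        self.s1 += pi_t * x
        self.s2 += pi_t * (x @ x)

    def weight_sum(self, x):
        # expand ||x'' - x||^2 = ||x''||^2 - 2 x''.x + ||x||^2
        return self.s2 - 2.0 * (x @ self.s1) + (x @ x) * self.s0
```

The expansion in `weight_sum` is why Scheme 3, unlike Schemes 4–6, admits a fixed memory footprint: the squared distance is a quadratic in x, so sufficient statistics of fixed dimension exist.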
3.4 Robustness and Efficiency
The robustness of AAPS with Scheme 3, which is visible in and similar figures in Section 4.1, arises from two aspects of the algorithm. We first consider robustness to the choice of ϵ, then robustness to the choice of K.
For Scheme 2, as with HMC itself, the acceptance probability contains the ratio π̃(z′)/π̃(z_0). The error in the total energy of HMC is proportional to ϵ², so as ϵ increases the acceptance rate drops substantially once this error becomes appreciable. By contrast, the acceptance probability for Scheme 3 contains only weighted sums of the π̃(z″) in both the numerator and denominator, so the empirical acceptance rate is much more robust to increasing ϵ. This effect can be seen in the right panel of .
Robustness to the choice of K arises because AAPS samples a point from the whole set of segments rather than choosing the endpoint of the integration as HMC does. (bottom left) of Appendix D echoes , but for an alternative version of AAPS which always sets c = 0 and always samples the proposal from segment K via weighting Scheme 3. At its optimum, the algorithm is slightly more efficient than the version of AAPS that we recommend; however, the efficiency drops much more sharply when K is larger or smaller than its optimal value.
Both of the above robustness properties are shared by Scheme 1 (also demonstrated in ); however, Scheme 3 is more efficient than Scheme 1 precisely because it preferentially targets proposals that are further from the current value.
3.5 Tuning
Since AAPS with weighting Scheme 3 is much more robust to the choice of either tuning parameter than HMC is, precise tuning is less important than it is for HMC. Thus, we present only brief guidelines here; further heuristics and empirical evidence for them are presented in Appendices E and F.
For any given K > 0, we recommend increasing ϵ from some small value until the empirical acceptance rate starts to change, stopping when the change from the small-ϵ acceptance rate is no more than 3%. Empirically, we have found that even such minor changes in the acceptance rate correspond to substantial changes in the total energy over the K + 1 segments, such that the modulation of SJD by π̃ starts to have a negative impact on the quality of the proposal.
For a given value of ϵ we recommend a short tuning run of AAPS using a large value of K and then choosing a sensible K according to the most popularly proposed segment number, using a diagnostic that we describe in Appendix F.
As with HMC and the no U-turn sampler, the mass matrix, M, used by AAPS can also be tuned, and mixing will be optimized if M is approximately the inverse of the target covariance matrix, as approximated from a tuning run. However, the matrix-vector multiplication required for simulating p_0 and for applying M⁻¹ in each leapfrog step can be expensive, so it is usual to choose a diagonal M instead.
3.6 Gaussian Process Limit
Intuitively, the more eccentric a target the more apogees one might expect per unit time. This section culminates in an expression which makes this relationship concrete for a product target where each component has its own length scale. With the initial condition (X_0, P_0) sampled from π̃, the dot product ∇U(X_t)ᵀP_t can be considered as a random function of time. We first show that when the true Hamiltonian dynamics are used, subject to conditions, a rescaling of this dot product tends to a stationary Gaussian process as the number of components in the product tends to infinity. A formula for the expected number of apogees per unit time follows as a corollary.
We consider a d-dimensional product target with a potential of (9) for some and values , and where .
HMC using a diagonal mass matrix, M, and a product target is equivalent to HMC using an identity mass matrix and a target with rescaled components (e.g., Neal 2011b), which, in the case of (9), is also a product target, but with different ν_i. For simplicity, therefore, we assume the identity mass matrix throughout this section. We also consider the true Hamiltonian dynamics, which the leapfrog dynamics approach in the limit as ϵ → 0 and approximate reasonably well for small to moderate ϵ.
Define the scaled dot product at time t, given an initial position of and momentum of , as (10)
We define the one-dimensional densities with respect to Lebesgue measure and let and ρ1 denote the corresponding measures. The joint density and measure are and .
Assumptions 1. We assume that , that there is a such that (11), and that for each , there is a unique, nonexplosive solution to the initial value problem (12).
Theorem 1 is proved in Appendix C.1.
Theorem 1.
Let the potential be defined as in (9), where g satisfies the assumptions around (11) and (12). Further, let μ be a distribution with support on with (13) for some , and let . Define (14) where the expectation is over the independent variables and , and assume that for any finite sequence of n distinct times , the n × n matrix Σ with is positive definite.
Let be the scaled dot product defined in (10), and let ; then (15) as , where denotes that Y is a one-dimensional stationary Gaussian process with an expectation of b and a covariance function of .
Ylvisaker (1965) shows that the expected number of zeros over a unit interval of a stationary Gaussian process with a unit variance is √(−C″(0))/π, where C is its covariance function. Zeros can be either apogees or perigees (local minima in U(x) along the path), and these alternate. Hence, with some work (see Appendix C.2), we obtain the following:
Corollary 1.
The expected number of apogees of over a time T is
Since each component of the momentum has unit variance, with an identity mass matrix the root-mean-square speed in any direction is 1. As ν is a squared inverse-scale parameter, the first term in the product simply relates the time interval to the overall length scale of the target. The second part of the product increases with relative variability in the squared inverse length scales of the components of the target. Choosing K fixes N(T), and approximately fixes the product of the two terms on the right. For a given relative variability in length scales, the integration time automatically flexes according to the overall absolute length scale; however, if that relative variability increases then T and, hence, the path length reduce. This makes it plain that the tuning parameter K relates to properties intrinsic to the target; unlike L, its impact is relatively unaffected by a uniform redefinition of the length scale of the target or by the choice of ϵ.
4 Numerical Experiments
In this section we compare AAPS with HMC, blurred HMC (see Section 1) and the no U-turn sampler over a variety of targets. For fairness we use the basic no U-turn sampler from Hoffman and Gelman (2014) without its adaptive tuning of ϵ; instead we tune ϵ using a fine grid of values. For both varieties of HMC we use a grid of ϵ and L values, and for AAPS we use a grid of ϵ and K values; in each case we choose the combination that leads to the optimal efficiency. To ensure that the various toy targets are close to representing the difficulties encountered with targets from complex statistical models, all algorithms use the identity mass matrix.
All code (AAPS/HMC/no U-turn sampler) was written in C++; the effective sample size was estimated using the R package coda, and the mean number of leapfrog steps per iteration (for AAPS and the no U-turn sampler) was output from each run as a diagnostic.
4.1 Toy Targets
Here we investigate performance across a variety of targets with a relatively simple functional form, and across a variety of dimensions. Many are product densities with independent components: Gaussian, logistic and skew-Gaussian. We consider different, relatively large ratios ξ between the largest and smallest length scales of the components in a target, as well as different sequences of scales from the smallest to the largest. We also consider a modification of the Rosenbrock banana-shaped target.
For a target of dimension d, given a sequence of scale parameters, we consider the following four forms: where is the distribution function of a variable, α = 3, β = 1 and . The targets πG, πL, and πSG are products of one-dimensional Gaussian, logistic, and skew-Gaussian distributions, respectively. The potential surface of the target πMR is a modified version of the Rosenbrock function (Rosenbrock 1960), a well-known, difficult target for optimization algorithms, which is also a challenging target used to benchmark MCMC algorithms (e.g., Heng and Jacob 2019; Pagani et al. 2021). The tails of the standard Rosenbrock potential increase quartically, but all algorithms which use a standard leapfrog step and Gaussian momentum are unstable in the tails of a target where the potential increases faster than quadratically. We have, therefore, modified the standard Rosenbrock target to keep the difficult banana shape while ensuring quadratic tails.
For πG, πL and πSG, we denote the largest ratio of length scales by and define four different patterns between the lowest and highest scaling, depending on which of σi, or increases approximately linearly with component number. To minimize the odd behavior that HMC (but not AAPS or the no U-turn sampler) can exhibit when scalings are rational multiples of each other (a phenomenon which is rarely seen for targets in practice) we jitter the scales for all the intermediate components. Specifically, let be a vector with , vd = 1 and for where are independent variables. Then we define the following four progressions for : SD: ; VAR: ; H: ; invSD: .
A final target, , arises from an online comparison between HMC and the no U-turn sampler at https://radfordneal.wordpress.com/2012/01/27/evaluation-of-nuts-more-comments-on-the-paper-by-hoffman-and-gelman/.
uses ξ = 20 and repeats in d = 40 for each of the main product target types: Gaussian, skew-Gaussian and logistic. It demonstrates the robustness of AAPS when compared with HMC and blurred HMC. Appendix D contains more details on these experiments. in Appendix D plots efficiency against step size when the popular no U-turn sampler is used on the targets in and . The peaks for the logistic and modified Rosenbrock targets are broad, suggesting some robustness; however, the peaks for the Gaussian and skew-Gaussian targets are much sharper.
We next investigate products of Gaussians using each of the four scale-parameter progressions with ξ = 20. Dimension d = 40 is high enough that for this and higher dimensions the components can be well approximated as arising from some continuous distributions and Corollary 1 applies.
shows the efficiencies of HMC, blurred HMC and the no U-turn sampler relative to that of AAPS. Among the Gaussian targets, the relative performance of AAPS compared with HMC and NUTS is best for and , which have just a few components with large scales and many components with small scales; weighting Scheme 3 ensures that the large components are explored preferentially. In contrast, the large number of components with large scales in leads to the worst relative performance of AAPS; we, therefore, investigate this regime further using alternative component distributions, and choice of dimension and ξ. Across the range of targets, no algorithm is more than 1.7 times as efficient as AAPS. Empirical acceptance rates at the optimal parameter settings are provided in Appendix E. All ESS estimates were at least 1000.
Table 3 of Appendix I provides the equivalent efficiencies when the algorithms are tuned according to recommended guidelines, and shows AAPS remaining competitive with blurred HMC and the no U-turn sampler.
4.2 Stochastic Volatility Model
Consider the following model for zero-centered, Gaussian data where the variance depends on a zero-mean, Gaussian AR(1) process started from stationarity (e.g., Girolami and Calderhead 2011; Wu et al. 2019):
Parameter priors are and , and we reparameterise to give a parameter vector in :
As in Girolami and Calderhead (2011) and Wu et al. (2019) we generate T = 1000 observations using parameters and . We then apply AAPS, blurred HMC, and the no U-turn sampler to perform inference on the 1003-dimensional posterior. We ran AAPS using two different K values, one found by optimizing over a numerical grid and one by using the tuning mechanism mentioned in Section 3.5. Tuning standard HMC (not blurred HMC) on such a high-dimensional target was extremely difficult; due to the algorithm’s sensitivity to the integration time, we could not identify a suitable range for L. Widely used statistical packages such as Stan (Stan Development Team 2020) and PyMC3 (Salvatier et al. 2016) perform the blurring by default, and so we only present the results for HMC-bl.
Each algorithm was then run for 10 replicates of 10^5 iterations using the optimal tuning parameters. Efficiency was calculated for each parameter, and the minimum efficiency over the latent variables and over all 1003 components were also calculated. For each algorithm the mean and standard deviation (over the replicates) of these efficiencies were ascertained; reports these values normalized by the mean efficiency for AAPS for that parameter or parameter combination. Overall, on this complex, high-dimensional posterior, AAPS is slightly less efficient than blurred HMC, and slightly more efficient than the no U-turn sampler.
4.3 Multimodality
The AAPS algorithm with K = 0 is close to reducible on a multimodal one-dimensional target, and this might lead to concerns about the algorithm’s performance on multimodal targets in general. However, in d dimensions the quantity whose sign changes define the apogees is a sum of d components, so even with K = 0, AAPS is not reducible on multimodal targets with d > 1. This is illustrated in Appendix G, which also details a short simulation study on three 40-dimensional bimodal targets where AAPS is more efficient than the no U-turn sampler and is never less than two-thirds as efficient as blurred HMC.
5 Discussion
We have presented the Apogee to Apogee Path Sampler (AAPS) and demonstrated empirically that it has a similar efficiency to HMC but is much easier to tune. From the current point, AAPS uses the leapfrog step to create a path consisting of a fixed number of segments; it then proposes a point from these segments and uses an accept-reject step to ensure that it targets the intended distribution.
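To make the construction concrete, the following is a minimal, illustrative sketch of one AAPS iteration (all function names are ours; we assume a unit-mass Gaussian momentum, and the acceptance ratio shown is the generic Metropolis–Hastings correction for this path-proposal, not a transcription of the article's pseudo-code):

```python
import numpy as np

def leapfrog(x, p, grad_U, eps):
    # One leapfrog step for H(x, p) = U(x) + ||p||^2 / 2.
    p = p - 0.5 * eps * grad_U(x)
    x = x + eps * p
    p = p - 0.5 * eps * grad_U(x)
    return x, p

def extend(x, p, grad_U, eps, n_apogees, direction, max_steps=100000):
    # Integrate in one direction (+1 forward, -1 backward in time) until
    # n_apogees local maxima of U along the path (sign changes of
    # dU/dt = p . grad_U(x) from + to -) have been crossed.
    pts, crossed = [], 0
    p = direction * p
    rising = float(np.dot(p, grad_U(x))) > 0.0
    for _ in range(max_steps):
        x, p = leapfrog(x, p, grad_U, eps)
        now_rising = float(np.dot(p, grad_U(x))) > 0.0
        if rising and not now_rising:        # an apogee has just been crossed
            crossed += 1
            if crossed == n_apogees:
                break
        rising = now_rising
        pts.append((x, direction * p))       # store the true-time momentum
    return pts

def aaps_iteration(x0, U, grad_U, eps, K, rng):
    # One AAPS iteration, proposing with weight (extended density) *
    # (squared distance from the current point).
    p0 = rng.standard_normal(x0.shape)
    c = int(rng.integers(K + 1))             # segment holding the current point
    back = extend(x0, p0, grad_U, eps, c + 1, -1)
    fwd = extend(x0, p0, grad_U, eps, K - c + 1, +1)
    path = back[::-1] + [(x0, p0)] + fwd

    log_dens = np.array([-U(x) - 0.5 * float(np.dot(p, p)) for x, p in path])
    dens = np.exp(log_dens - log_dens.max())

    def W(x_ref):                            # total path weight seen from x_ref
        d2 = np.array([float(np.sum((x - x_ref) ** 2)) for x, _ in path])
        return float(np.sum(dens * d2))

    w = dens * np.array([float(np.sum((x - x0) ** 2)) for x, _ in path])
    if w.sum() == 0.0:
        return x0                            # degenerate (single-point) path
    j = int(rng.choice(len(path), p=w / w.sum()))
    x1 = path[j][0]
    if rng.uniform() < min(1.0, W(x0) / W(x1)):  # accept-reject step
        return x1
    return x0
```

Because the set of apogees is invariant to the point on the path used to initialize the integration, the same path (and hence the reverse proposal probability) is recovered from the proposed point, which is what makes the simple weight-ratio correction above valid.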
We investigated six possible mechanisms for proposing a point from the path, and for the numerical experiments we chose the probability of proposing a point to be proportional to the product of the extended target density at that point and the proposal’s squared distance from the current point, which was possible with an O(1) memory cost. However, the flexibility in the proposal mechanism allows other possibilities, such as a Mahalanobis distance based on an estimated covariance matrix, or $||x-\mu||^2$ for some central point, μ, with a similar motivation to the ChEES criterion of Hoffman et al. (Citation2021). Indeed, if any scalar or vector function, f, is of particular interest, then a proposal weighting of the form $\widetilde{\pi}(x,p)\,||f(x)-f(x_{\mathrm{curr}})||^2$ could be used with a memory cost of $O(\mathrm{dim}(f))$.
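Such weightings can be sampled from in a single pass as the path is generated, without storing the whole path, so the memory cost does not grow with the path length. An illustrative sketch of the one-pass draw (the function name is ours):

```python
import numpy as np

def one_pass_pick(points, weights, rng):
    # Sequential weighted sampling of a single item: after seeing item i,
    # replace the current pick with probability w_i / (w_1 + ... + w_i).
    # The final pick is distributed proportionally to the weights, yet
    # only one candidate (not the whole path) is ever stored.
    cum, pick = 0.0, None
    for pt, w in zip(points, weights):
        cum += w
        if cum > 0.0 and rng.uniform() < w / cum:
            pick = pt
    return pick
```

Telescoping the replacement probabilities shows that item i is returned with probability $w_i / \sum_j w_j$, exactly as required.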
Choosing the current segment’s position uniformly at random from the K + 1 segments is not the only way to preserve detailed balance with respect to the intended target. For example, the current segment could be fixed as segment 0 and proposals could only be made from segment K, a choice which bears some resemblance to the window scheme in Neal (Citation1992); however, we found that this had a negative impact on the robustness of the efficiency to the choice of K (see Appendix D).
Because of its simplicity, many extensions to the algorithm are clear. For example, if the positions along the path are stored, then a delayed rejection step may increase the acceptance probabilities. A cheap surrogate for π could be substituted within any weighting scheme. Indeed, given c, the randomly chosen offset of segment 0, the next value of c could be chosen using any Markov kernel that is reversible with respect to its conditional distribution (see also Neal Citation2011a). A nonreversible version of the algorithm could set K = 1 and always choose a path consisting of the current segment and the next segment forward; instead of completely refreshing momentum at each iteration, the momentum at the start of a new step could be a Crank-Nicolson perturbation of the momentum at the end of the previous step, as in Horowitz (Citation1991). The properties of the leapfrog step required for the validity of AAPS are a subset of those required for HMC (see Section 2.2), so any alternative momentum formulation (e.g., Livingstone et al. Citation2019) or numerical integration scheme that can be used within HMC could also be used within AAPS.
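Concretely, the Crank-Nicolson perturbation of the momentum takes the form $p' = \rho p + \sqrt{1-\rho^2}\,\xi$ with $\xi \sim N(0, I)$, which leaves the standard Gaussian momentum distribution invariant. A one-function sketch (the function name and ρ are illustrative):

```python
import numpy as np

def partial_momentum_refresh(p, rho, rng):
    # Crank-Nicolson (Horowitz-style) partial refreshment:
    #   p' = rho * p + sqrt(1 - rho^2) * xi,  xi ~ N(0, I).
    # If p ~ N(0, I) then p' ~ N(0, rho^2 I + (1 - rho^2) I) = N(0, I),
    # so the momentum distribution is preserved; rho = 0 recovers full
    # refreshment, while rho near 1 retains most of the momentum.
    xi = rng.standard_normal(p.shape)
    return rho * p + np.sqrt(1.0 - rho ** 2) * xi
```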
AAPS with weighting Scheme 1 relates to HMC using windows of states (Neal Citation2011b), but with the path length determined by the number of additional segments, K, rather than by a total number of (forward and backward) leapfrog steps. As mentioned in Section 1 and shown in Corollary 1, the number of apogees is a more natural tuning parameter than an integration time as it relates to intrinsic properties of the target: rescaling all coordinates by a constant factor would not change the optimal K.
Supplementary Materials
Appendices A–I are contained in the supplementary materials.
Acknowledgments
We are grateful to two anonymous reviewers for comments and suggestions that have materially improved the article.
Data Availability Statement
Code is available from https://github.com/ChrisGSherlock/AAPS
Disclosure Statement
There are no financial or nonfinancial conflicts of interest to declare.
References
- Beskos, A., Pillai, N., Roberts, G., Sanz-Serna, J.-M. and Stuart, A. (2013), “Optimal Tuning of the Hybrid Monte Carlo Algorithm,” Bernoulli, 19, 1501–1534. http://www.jstor.org/stable/42919328. DOI: 10.3150/12-BEJ414.
- Bou-Rabee, N., and Sanz-Serna, J. M. (2017), “Randomized Hamiltonian Monte Carlo,” The Annals of Applied Probability, 27, 2159–2194. http://www.jstor.org/stable/26361544. DOI: 10.1214/16-AAP1255.
- Brooks, S., Gelman, A., Jones, G. L. and Meng, X.-L., eds. (2011), Handbook of Markov Chain Monte Carlo, Chapman & Hall/CRC Handbooks of Modern Statistical Methods, Boca Raton, FL: CRC Press.
- Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. (1987), “Hybrid Monte Carlo,” Physics Letters B, 195, 216–222. https://www.sciencedirect.com/science/article/pii/037026938791197X. DOI: 10.1016/0370-2693(87)91197-X.
- Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London, UK: Chapman and Hall.
- Girolami, M., and Calderhead, B. (2011), “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods,” Journal of the Royal Statistical Society, Series B, 73, 123–214. DOI: 10.1111/j.1467-9868.2010.00765.x.
- Heng, J., and Jacob, P. E. (2019), “Unbiased Hamiltonian Monte Carlo with Couplings,” Biometrika, 106, 287–302. DOI: 10.1093/biomet/asy074.
- Hoffman, M., Radul, A., and Sountsov, P. (2021), “An Adaptive-MCMC Scheme for Setting Trajectory Lengths in Hamiltonian Monte Carlo,” in Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, eds., A. Banerjee and K. Fukumizu, vol. 130 of Proceedings of Machine Learning Research, pp. 3907–3915. PMLR. https://proceedings.mlr.press/v130/hoffman21a.html.
- Hoffman, M. D., and Gelman, A. (2014), “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15, 1593–1623.
- Horowitz, A. M. (1991), “A Generalized Guided Monte Carlo Algorithm,” Physics Letters B, 268, 247–252. https://www.sciencedirect.com/science/article/pii/0370269391908125. DOI: 10.1016/0370-2693(91)90812-5.
- Livingstone, S., Faulkner, M. F., and Roberts, G. O. (2019), “Kinetic Energy Choice in Hamiltonian/Hybrid Monte Carlo,” Biometrika, 106, 303–319. DOI: 10.1093/biomet/asz013.
- Mackenzie, P. B. (1989), “An Improved Hybrid Monte Carlo Method,” Physics Letters B, 226, 369–371. DOI: 10.1016/0370-2693(89)91212-4.
- Neal, R. M. (1992), “An Improved Acceptance Procedure for the Hybrid Monte Carlo Algorithm,” Journal of Computational Physics, 111, 194–203. DOI: 10.1006/jcph.1994.1054.
- Neal, R. M. (2011a), “MCMC Using Ensembles of States for Problems with Fast and Slow Variables such as Gaussian Process Regression.” arXiv 1101.0387.
- Neal, R. M. (2011b), “MCMC Using Hamiltonian Dynamics,” in Handbook of Markov Chain Monte Carlo, eds. S. Brooks, A. Gelman, G. Jones and X.-L. Meng, pp. 113–162, Boca Raton, FL: CRC Press.
- Pagani, F., Wiegand, M., and Nadarajah, S. (2021), “An n-dimensional Rosenbrock Distribution for Markov Chain Monte Carlo Testing,” Scandinavian Journal of Statistics, 49, 657–680. DOI: 10.1111/sjos.12532.
- Roberts, G. O., and Rosenthal, J. S. (2001), “Optimal Scaling for Various Metropolis-Hastings Algorithms,” Statistical Science, 16, 351–367. DOI: 10.1214/ss/1015346320.
- Rosenbrock, H. H. (1960), “An Automatic Method for Finding the Greatest or Least Value of a Function,” The Computer Journal, 3, 175–184. DOI: 10.1093/comjnl/3.3.175.
- Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. (2016), “Probabilistic Programming in Python Using PyMC3,” PeerJ Computer Science, 2, e55. DOI: 10.7717/peerj-cs.55.
- Sherlock, C., and Roberts, G. (2009), “Optimal Scaling of the Random Walk Metropolis on Elliptically Symmetric Unimodal Targets,” Bernoulli, 15, 774–798. DOI: 10.3150/08-BEJ176.
- Stan Development Team. (2020), Stan Modeling Language Users Guide and Reference Manual, Version 2.28, http://mc-stan.org/.
- Wu, C., Stoehr, J., and Robert, C. P. (2019), “Faster Hamiltonian Monte Carlo by Learning Leapfrog Scale.” arXiv 1810.04449.
- Ylvisaker, N. D. (1965), “The Expected Number of Zeros of a Stationary Gaussian Process,” Annals of Mathematical Statistics, 36, 1043–1046. DOI: 10.1214/aoms/1177700077.