Bayesian and Monte Carlo Methods

Concave-Convex PDMP-based Sampling

Pages 1425-1435 | Received 23 Dec 2021, Accepted 05 Apr 2023, Published online: 30 May 2023

Abstract

Recently nonreversible samplers based on simulating piecewise deterministic Markov processes (PDMPs) have shown potential for efficient sampling in Bayesian inference problems. However, there remains a lack of guidance on how to best implement these algorithms. If implemented poorly, the computational costs of simulating event times can outweigh the statistical efficiency of the nonreversible dynamics. Drawing on the adaptive rejection literature, we propose the concave-convex adaptive thinning approach for simulating a piecewise deterministic Markov process, which we call CC-PDMP. This approach provides a general guide for constructing bounds that may be used to facilitate PDMP-based sampling. A key advantage of this method is its additive structure—adding concave-convex decompositions yields a concave-convex decomposition. This makes the construction of bounds modular, as given a concave-convex decomposition for a class of likelihoods and a family of priors, they can be combined to construct bounds for the posterior. We show that constructing our bounds is simple and leads to computationally efficient thinning. Our approach is well suited to local PDMP simulation where conditional independence of the target can be exploited for potentially huge computational gains. We provide an R package and compare with existing approaches to simulating events in the PDMP literature. Supplementary materials for this article are available online.

1 Introduction

Monte Carlo methods based on continuous-time Markov processes have shown potential for efficient sampling in Bayesian inference challenges (Fearnhead et al. 2018; Goldman and Singh 2021). These new sampling algorithms are based on simulating piecewise-deterministic Markov processes (PDMPs). Such processes evolve deterministically between random event times, with possibly random dynamics at event times (see Davis 1993, for an introduction to PDMPs). Monte Carlo methods based on simulating a PDMP with a desired invariant distribution were first proposed in the computational physics literature, where they were motivated as examples of nonreversible Markov processes (Peters and de With 2012; Michel, Kapfer, and Krauth 2014). These ideas transitioned to the statistics community as an alternative to traditional Markov chain Monte Carlo (MCMC) samplers. Popular algorithms in this development include the bouncy particle sampler (Bouchard-Côté, Vollmer, and Doucet 2018) and the Zig-Zag sampler (Bierkens, Fearnhead, and Roberts 2019), amongst others (Vanetti et al. 2017; Wu and Robert 2020; Michel, Durmus, and Sénécal 2020; Bierkens et al. 2020).

There is substantial theoretical and empirical evidence to support the claim that nonreversible MCMC samplers offer more efficient sampling (Diaconis, Holmes, and Neal 2000; Bierkens and Roberts 2017; Bouchard-Côté, Vollmer, and Doucet 2018), as nonreversibility reduces the chance of moving back to areas that have recently been visited. Samplers based on PDMPs simulate from a density $\pi(\theta)$ by introducing a velocity component $v$ which is used to evolve the position of the state, $\theta$, with deterministic dynamics. At stochastic event times the velocity component is updated using a Markov kernel, after which the process continues with the new velocity. The main barrier to implementing a PDMP-based sampler is the simulation of the random event times. This involves simulating the next event time in a time-inhomogeneous Poisson process whose rate function depends on the current position of the sampler.

Simulation of event times in a Poisson process is often facilitated through a method known as thinning (Lewis and Shedler 1979). The aim of this article is to provide a general framework to aid practitioners in implementing PDMP-based samplers. Specifically, we introduce the concave-convex thinning approach to facilitate simulating a PDMP-based sampler, which we call CC-PDMP. This method can be applied whenever the rate function of the PDMP can be decomposed into the sum of concave and convex functions, and it can be used to facilitate thinning from polynomial rate functions, allowing for broad applicability. We discuss the efficiency of the method compared to alternative approaches in the literature for exact sampling of event times.

Related to our work is the concave-convex adaptive rejection sampler (CC-ARS) of Görür and Teh (2011), which uses the same upper-bounding approach to construct bounds on the log density within a rejection sampler that draws independent samples from univariate density functions. In contrast, by applying this technique within PDMP sampling, the bounds are constructed on the gradient of the log density (i.e., the rate) and used within the nonreversible Markov process framework to generate samples from multivariate densities. Moreover, the CC-PDMP framework allows for both subsampling and local updating schemes to facilitate these methods for high-dimensional sampling. A short comparison between concave-convex adaptive rejection sampling and CC-PDMP for sampling from a univariate density is given in Appendix C, supplementary materials.

Some authors have proposed automatic but approximate methods for simulating a PDMP-based sampler. Among these, Cotter, House, and Pagani (2020) propose numerical integration and root-finding algorithms to facilitate simulation. Others have considered approximate local bounds which can be simulated exactly (Pakman et al. 2017; Goldman and Singh 2021; Goan et al. 2023). Both of these approaches sacrifice exact sampling of the posterior and involve a tradeoff between computational cost and the quality of the approximation to the target distribution.

The article is organized as follows. In Section 2, we introduce the technical details of simulating a PDMP-based sampler and describe some popular samplers. Section 3 reviews the literature on thinning for PDMP-based samplers. We introduce the concave-convex PDMP approach for adaptive thinning in Sections 4 and 5. Empirical evaluation of CC-PDMP, and comparison to existing methods, is given in Section 6. We conclude with a discussion of the limitations and possible extensions of the method. Code for implementing our method and replicating our results is available at https://github.com/matt-sutton/ccpdmp.

2 Sampling using a PDMP

2.1 Piecewise Deterministic Markov Processes

Before we explore PDMP-based samplers, we first review PDMPs. A PDMP is a stochastic process, and we will denote its state at time $t$ by $z_t$. There are three key components that define the dynamics of a PDMP:

  1. A set of deterministic dynamics, $z_{t+s} = \psi(z_t, s)$.

  2. Event times driven by an inhomogeneous Poisson process with rate $\lambda(z_t)$.

  3. A Markov transition kernel $q(z \mid z_{t-})$, where $z_{t-} := \lim_{s \uparrow t} z_s$.

A PDMP with these specifications evolves according to its deterministic dynamics between event times, and at an event time it updates according to the Markov transition kernel $q(z \mid z_{t-})$. The event times must be simulated from a Poisson process with rate $\lambda(z_t)$, which depends on the state of the process at time $t$. The result of simulating this process is a PDMP skeleton $\{(t_k, z_{t_k})\}_{k=1}^n$, where for $k = 1, \ldots, n$, $t_k$ denotes the $k$th event time and $z_{t_k}$ is the state of the process immediately after this event. Given this skeleton it is possible to fill in the values of the state between events using the deterministic dynamics.

2.2 PDMP-based Samplers

Assume we wish to construct a PDMP to sample from a target distribution of interest, $\pi(\theta)$. Current PDMP-based samplers are defined on an augmented space $z \in E$, where $E = \mathcal{X} \times \mathcal{V}$, so the state can be viewed as having a position, $\theta_t$, and a velocity, $v_t$. As described below, the dynamics of the PDMP can be chosen so that its invariant distribution has the form $\nu(z) = \pi(\theta)p(v \mid \theta)$, for some conditional distribution of the velocities, $p(v \mid \theta)$. As a result, the $\theta$-marginal of the invariant distribution is the target distribution we wish to sample from. We will consider target distributions of the form
$$\pi(\theta) \propto \exp(-U(\theta)),$$
where $\theta \in \mathbb{R}^p$ and $U(\theta)$ is the associated potential.

There are two ways to evaluate Monte Carlo estimates of $\int g(z)\nu(z)\,dz$ using the output of a PDMP. The first estimator averages over the continuous-time path of the PDMP,
$$\frac{1}{t_n - t_1}\sum_{k=1}^{n-1}\int_{t_k}^{t_{k+1}} g(\psi(z_{t_k}, s - t_k))\,ds.$$

This requires that the integral of $g(\psi(z_{t_k}, s))$ with respect to $s$ can be computed. The second, simpler, approach is to "discretise" the PDMP path. This involves taking an integer $M > 0$, defining $\Delta = (t_n - t_1)/M$, calculating the value of the state at the discrete time points $z_{t_1+\Delta}, \ldots, z_{t_1+M\Delta}$, and then using the familiar Monte Carlo estimator
$$\frac{1}{M}\sum_{j=1}^{M} g(z_{t_1 + j\Delta}).$$

The points $z_{t_1 + j\Delta}$ may be found by evolving the process along its deterministic dynamics from the preceding event.
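For samplers with linear dynamics, $\theta_{t+s} = \theta_t + sv_t$ (as used by the samplers below), this discretized estimator is straightforward to compute from a stored skeleton. The following R sketch illustrates the idea; the skeleton format and the function name are our own illustrative choices, not the interface of the ccpdmp package:

```r
# Discretized Monte Carlo estimate from a PDMP skeleton with linear
# dynamics theta_{t+s} = theta_t + s * v_t (illustrative sketch).
discretized_estimate <- function(times, thetas, vs, g, M) {
  # times: event times t_1 < ... < t_n; thetas, vs: matrices whose
  # k-th rows hold theta_{t_k} and v_{t_k}
  grid <- seq(times[1], times[length(times)], length.out = M + 1)[-1]
  vals <- sapply(grid, function(t) {
    k <- findInterval(t, times)               # last event at or before t
    g(thetas[k, ] + (t - times[k]) * vs[k, ]) # evolve deterministically
  })
  mean(vals)
}

# Example: estimate E[theta] from a toy one-dimensional skeleton
times  <- c(0, 0.7, 1.1, 2.0)
thetas <- matrix(c(0, 0.7, 0.3, 1.2), ncol = 1)
vs     <- matrix(c(1, -1, 1, 1), ncol = 1)
discretized_estimate(times, thetas, vs, function(th) th[1], M = 100)
```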

2.3 Bouncy Particle Sampler

The bouncy particle sampler (BPS) takes the velocity space $\mathcal{V}$ to be either $\mathbb{R}^p$ or the unit sphere $S^{p-1}$, and targets an invariant distribution $\nu(z) = \pi(\theta)p(v)$ where $p(v)$ is either the standard $p$-dimensional normal distribution or the uniform distribution on $S^{p-1}$. The sampler moves with linear deterministic dynamics $\psi((\theta_t, v_t), s) = (\theta_t + sv_t, v_t)$ until a random time $t$ occurs according to a Poisson process with canonical event rate
$$\lambda_c(z_t) = \max\{0, \langle v_t, \nabla_\theta U(\theta_t)\rangle\}.$$

At an event time $t$ the velocity is updated as $v_t = R_{\theta_t}(v_{t-})$, where
$$R_\theta(v) = v - 2\,\frac{\langle v, \nabla_\theta U(\theta)\rangle}{\|\nabla_\theta U(\theta)\|^2}\,\nabla_\theta U(\theta).$$

This update can be interpreted as reflecting the velocity off the hyperplane tangential to the gradient of $U(\theta)$. Additionally, refreshment events occur according to a homogeneous Poisson process with rate $\lambda_{\mathrm{ref}} > 0$, at which the velocity is refreshed by drawing $v_t \sim p(v)$. Refreshment ensures that the bouncy particle sampler samples the invariant distribution and does not get "stuck" on contours of the potential (Bouchard-Côté, Vollmer, and Doucet 2018). Formally, the bouncy particle sampler has event rate $\lambda(z_t) = \lambda_c(z_t) + \lambda_{\mathrm{ref}}$ with Markov transition kernel
$$q(z \mid z_{t-}) = \frac{\lambda_c(z_{t-})}{\lambda(z_{t-})}\,\delta_{\theta_{t-}}(\theta)\,\delta_{R_{\theta_{t-}}(v_{t-})}(v) + \frac{\lambda_{\mathrm{ref}}}{\lambda(z_{t-})}\,\delta_{\theta_{t-}}(\theta)\,p(v),$$
where $\delta_a(\cdot)$ denotes the Dirac delta measure with point mass at $a$. The BPS is an example of a global PDMP, as all components of the velocity $v$ are changed according to the Markov transition kernel.
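The reflection and refreshment updates can be written in a few lines of R. The sketch below assumes a user-supplied gradient function `grad_U` and the rates `lambda_c` and `lambda_ref` evaluated at the event; it is illustrative only, not the ccpdmp interface:

```r
# Reflect the velocity off the hyperplane orthogonal to the gradient
bps_reflect <- function(v, grad) {
  v - 2 * sum(v * grad) / sum(grad^2) * grad
}

# At a superposed event, a bounce or a refreshment is chosen with
# probability proportional to the two rates (here p(v) = N(0, I_p))
bps_event_update <- function(theta, v, grad_U, lambda_c, lambda_ref) {
  if (runif(1) < lambda_ref / (lambda_c + lambda_ref)) {
    rnorm(length(v))                 # refresh: v ~ N(0, I_p)
  } else {
    bps_reflect(v, grad_U(theta))    # bounce
  }
}
```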

2.4 Zig-Zag Sampler

The Zig-Zag sampler takes the space of velocities, $\mathcal{V}$, to be $\{-1, 1\}^p$ with $p(v)$ uniform on this space. Similarly to the BPS, the Zig-Zag sampler uses linear dynamics $\psi((\theta_t, v_t), s) = (\theta_t + sv_t, v_t)$ between events, but unlike the BPS only a single element of the velocity vector is updated at each event. The overall Zig-Zag process can be viewed as a composition of $p$ local event rates, where each rate has a corresponding Markov transition kernel that operates on a single component of the velocity. Specifically, the $i$th component of the velocity is updated with local event rate
$$\lambda_i(z_t) = \max\{0, v_i\,\partial_{\theta_i} U(\theta_t)\}$$
for $i = 1, \ldots, p$, where $\partial_{\theta_i} U(\theta_t)$ denotes the partial derivative of $U$ with respect to $\theta_i$. Simulating an event time in this local framework consists of simulating an event time for each local rate and taking the minimum as the event time $t$ for the overall process. At an event time $t$ triggered by rate $\lambda_i$, the velocity is updated as $v_t = F_i(v_{t-})$, where
$$F_i(v) = (v_1, \ldots, v_{i-1}, -v_i, v_{i+1}, \ldots, v_p)^T.$$

Once the velocity is updated for component $i$, local event times must be re-simulated only for those rates $\lambda_j(z_t)$ which are functions of $\theta_i$. This fact can induce massive computational savings when there is conditional independence between the components of $\theta$. To see this, consider the extreme case where the target is fully independent across dimensions. In this case each event rate $\lambda_i(z_t)$ depends only on the $i$th components of $\theta_t$ and $v_t$. Thus, at each event, only the time of the next event for the component whose velocity changes needs to be recalculated. The computational cost of simulation in this case scales as $O(1)$ per event, as opposed to $O(p)$ for the global BPS. This computational advantage was explored in the context of simulating diffusion bridges in Bierkens et al. (2021).

2.5 Global and Local PDMP Methods

As described above, we can divide PDMP samplers into two main classes: global methods and local methods. These differ in how they operate on the velocity vector when an event is triggered. Global methods are PDMP-based methods where the transition kernel acts on the entire velocity vector. By contrast, local methods are defined so that the transition kernel only affects a subset of the velocity. In this section we give a general algorithm for constructing a local PDMP-based sampler (though see also Bouchard-Côté, Vollmer, and Doucet 2018). Let $v_{[S]}$ denote the sub-vector of $v$ with elements indexed by $S \subseteq \{1, \ldots, p\}$, and $v_{[-S]}$ denote the sub-vector of $v$ without the elements indexed by $S$. For some set $S$, define the local kernel as one that satisfies
$$q_{\theta[S]}(v \mid v_{t-}) = \delta_{v_{t-}[-S]}(v_{[-S]})\,q_{\theta[S]}(v_{[S]} \mid v_{t-}),$$
that is, $q_{\theta[S]}(v \mid v_{t-})$ only updates the sub-vector $v_{[S]}$. Suppose we have a partition of the components $\theta_1, \ldots, \theta_p$ given by $\mathcal{S} = \{S_1, S_2, \ldots, S_F\}$. Subset $S_f$ of $v$ is updated with local kernel $q_{\theta[S_f]}(v \mid v_{t-})$ and associated event rate
$$\lambda_f(z_t) = \max\{0, \langle v_{[S_f]}, \nabla_{\theta[S_f]} U(\theta_t)\rangle\}, \quad \text{for } f = 1, \ldots, F.$$
Simulating a local PDMP with this structure involves simulating an event time for each of the $F$ rates and applying the local transition kernel corresponding to the smallest simulated event time. For the partition $\mathcal{S}$ define $\mathcal{N} = \{N_1, \ldots, N_F\}$ with
$$N_f = \{m \in \{1, \ldots, F\} : \lambda_m \text{ depends on some element of } \theta_{[S_f]}\}$$
for $f = 1, \ldots, F$. When an event is triggered for a local rate $f$, the velocity $v_{[S_f]}$ is updated and new event times are simulated for the rates which depend on the value of one or more $\theta_i$ for $i \in S_f$. This simulation process is summarized in Algorithm 1.

In this framework the Zig-Zag sampler is a local PDMP with partition $\mathcal{S} = \{\{1\}, \{2\}, \ldots, \{p\}\}$ and local kernels $q_{\theta[S_f]}(v \mid v_{t-}) = \delta_{F_f(v_{t-})}(v)$. The local BPS uses kernel $q_{\theta[S]}(v \mid v_{t-}) = \delta_{v^*}(v)$, where $v^*_{[-S]} = v_{t-,[-S]}$ and $v^*_{[S]} = R_{\theta[S]}(v_{t-,[S]})$ with
$$R_{\theta[S]}(v_{[S]}) = v_{[S]} - 2\,\frac{\langle v_{[S]}, \nabla_{\theta[S]} U(\theta)\rangle}{\|\nabla_{\theta[S]} U(\theta)\|^2}\,\nabla_{\theta[S]} U(\theta).$$

If there is a single factor $\mathcal{S} = \{S\}$ containing all indices, $S = \{1, \ldots, p\}$, we recover the global BPS.

Algorithm 1: Simulating a PDMP with local structure

Inputs: $t_1 = 0$, $\theta_{t_1}$, $v_{t_1}$, factorization $(\mathcal{S}, \mathcal{N})$ with $\mathcal{S} = \{S_1, S_2, \ldots, S_F\}$ and $\mathcal{N} = \{N_1, N_2, \ldots, N_F\}$.

Sample $\tau_f$ for $f = 1, \ldots, F$ where
$$P(\tau_f \geq t) = \exp\left(-\int_0^t \lambda_f(\theta_{t_1} + v_{t_1}s,\, v_{t_1})\,ds\right).$$

For $k = 1, \ldots, n$:

  1. Let $f^* = \operatorname{argmin}_f(\tau_f)$ and update $t_{k+1} = t_k + \tau_{f^*}$.

  2. Update the position: $\theta_{t_{k+1}} = \theta_{t_k} + \tau_{f^*} v_{t_k}$.

  3. Update the velocity: $v_{t_{k+1}} \sim q_{\theta_{t_{k+1}}[S_{f^*}]}(v \mid v_{t_k})$.

  4. For $f \in N_{f^*}$ resample $\tau_f$ where $P(\tau_f \geq t) = \exp\left(-\int_0^t \lambda_f(\theta_{t_{k+1}} + v_{t_{k+1}}s,\, v_{t_{k+1}})\,ds\right)$; for $f \notin N_{f^*}$, set $\tau_f \leftarrow \tau_f - \tau_{f^*}$.

Outputs: PDMP skeleton $\{(t_k, \theta_{t_k}, v_{t_k})\}_{k=1}^n$
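A direct R transcription of Algorithm 1 is given below, assuming linear dynamics. Here `sim_time(f, theta, v)` is a user-supplied routine that draws $\tau_f$ for the $f$th local rate along the current trajectory, and `update_velocity(f, theta, v)` applies the local kernel; both names are illustrative rather than the ccpdmp interface:

```r
# Sketch of Algorithm 1: simulating a PDMP with local structure.
# S: list of index sets S_1, ..., S_F; N: list of neighbour sets.
simulate_local_pdmp <- function(theta, v, S, N, sim_time,
                                update_velocity, n_events) {
  Fn  <- length(S)
  tau <- vapply(seq_len(Fn), function(f) sim_time(f, theta, v), numeric(1))
  t_now <- 0
  skeleton <- vector("list", n_events + 1)
  skeleton[[1]] <- list(t = 0, theta = theta, v = v)
  for (k in seq_len(n_events)) {
    f_star <- which.min(tau)                     # step 1: first local event
    dt     <- tau[f_star]
    t_now  <- t_now + dt
    theta  <- theta + dt * v                     # step 2: linear dynamics
    v      <- update_velocity(f_star, theta, v)  # step 3: local kernel
    tau    <- tau - dt                           # shift remaining clocks
    for (f in N[[f_star]]) {                     # step 4: resample neighbours
      tau[f] <- sim_time(f, theta, v)
    }
    skeleton[[k + 1]] <- list(t = t_now, theta = theta, v = v)
  }
  skeleton
}
```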

3 Algorithms for Event-Time Simulation

Simulating the first event time, $\tau$, from a Poisson process with event rate $\lambda(t)$ is equivalent to simulating $u \sim \text{Uniform}[0, 1]$ and solving
$$\tau = \operatorname{argmin}_t\left\{-\log(u) = \int_0^t \lambda(z_s)\,ds\right\}.$$

If the event rate is sufficiently simple this can be done exactly. For example, if the rate is constant or linear then there are analytical solutions for $\tau$ in terms of $u$. If the rate function is convex, a solution may be found using numerical methods (Bouchard-Côté, Vollmer, and Doucet 2018). For more complex functions, the process can be simulated by a method known as thinning, which makes use of Theorem 1.

Theorem 1

(Lewis and Shedler 1979). Let $\lambda(t)$ and $\bar\lambda(t)$ be continuous functions with $\lambda(t) \leq \bar\lambda(t)$ for all $0 \leq t \leq t_{\max}$. Let $\tau_1, \tau_2, \ldots, \tau_n$ be the event times of a Poisson process with rate $\bar\lambda(t)$ over the interval $[0, t_{\max})$. For $i = 1, \ldots, n$, retain the event time $\tau_i$ with probability $\lambda(\tau_i)/\bar\lambda(\tau_i)$ to obtain a set of event times from a nonhomogeneous Poisson process with rate $\lambda(t)$ over the interval $[0, t_{\max})$.
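As an illustration, the R sketch below thins a rate $\lambda(t)$ using a dominating linear rate $\bar\lambda(t) = a + bt$, which can be inverted analytically. It assumes $a, b \geq 0$ and $\lambda(t) \leq a + bt$ on $[0, t_{\max})$; the function names are ours:

```r
# First arrival of a Poisson process with rate a + b*t (a, b >= 0),
# solving -log(u) = a*t + b*t^2/2; returns Inf if no event by t_max.
sim_linear_rate <- function(a, b, t_max) {
  u <- -log(runif(1))
  t <- if (b > 0) {
    (-a + sqrt(a^2 + 2 * b * u)) / b
  } else if (a > 0) {
    u / a
  } else {
    Inf
  }
  if (t < t_max) t else Inf
}

# Thinning (Theorem 1): propose from the linear bound and accept each
# proposal with probability lambda(t) / (a + b*t).
sim_by_thinning <- function(lambda, a, b, t_max) {
  t <- 0
  repeat {
    s <- sim_linear_rate(a + b * t, b, t_max - t)  # bound restarted at t
    if (!is.finite(s)) return(Inf)                 # no event on [0, t_max)
    t <- t + s
    if (runif(1) < lambda(t) / (a + b * t)) return(t)
  }
}
```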

The efficiency of simulation via thinning depends on how tightly the rate $\bar\lambda$ upper-bounds $\lambda$ and how costly it is to simulate from $\bar\lambda$. Another key tool for simulating event times is known as super-positioning, and the general process is outlined in Theorem 2.

Theorem 2

(Kingman 1992). Suppose that $\Lambda_1, \ldots, \Lambda_n$ are independent Poisson processes with rates $\lambda_1(t), \ldots, \lambda_n(t)$ and first arrival times $\tau^{[1]}, \ldots, \tau^{[n]}$. The Poisson process with rate $\lambda(t) = \sum_{i=1}^n \lambda_i(t)$ may be simulated by returning the first arrival time as $\tau = \min_{i=1,\ldots,n} \tau^{[i]}$.

Both super-positioning and thinning provide useful tools for simulating event times from a nonhomogeneous Poisson process. However, constructing the upper bounds required for thinning is largely an ad hoc and time-consuming process for the practitioner. A common desire is to reuse simulation knowledge: if an event rate can be written as the sum of several sub-rates, each of which can be simulated exactly, this knowledge should ideally be used to simulate from the sum. We refer to this class of event rates as additive rates. Specifically, we will refer to a Poisson process with rate
$$\lambda(z_t) = \max\{0, f_1(t) + f_2(t)\}$$
as a process with an additive rate. The super-positioning method of Theorem 2 facilitates additive event-time simulation by thinning with the upper-bounding event rate
$$\bar\lambda(t) = \max\{0, f_1(t)\} + \max\{0, f_2(t)\},$$
which satisfies $\lambda(z_t) \leq \bar\lambda(t)$. We can simulate the first event from a process with this rate as $\tau = \min(\tau_1, \tau_2)$, where $\tau_1$ is simulated with rate $\bar\lambda_1(t) = \max\{0, f_1(t)\}$ and $\tau_2$ with rate $\bar\lambda_2(t) = \max\{0, f_2(t)\}$. The simulated time is accepted with probability
$$\frac{\max\{0, f_1(\tau) + f_2(\tau)\}}{\max\{0, f_1(\tau)\} + \max\{0, f_2(\tau)\}},$$
which equals one when both $f_1(\tau) > 0$ and $f_2(\tau) > 0$. This procedure is useful since it allows the reuse of thinning procedures for $f_1$ and $f_2$. Thinning via super-positioning is often recommended in the literature; notable examples include Bouchard-Côté, Vollmer, and Doucet (2018), who illustrate this approach when simulating from an exponential family, and Wu and Robert (2020), who discuss the approach generally. It was also recommended broadly by Sen et al. (2020), who suggested it for practical implementation of the Zig-Zag with $f_1$ and $f_2$ corresponding to the terms from the likelihood and prior, respectively.

4 Concave-Convex Adaptive Thinning

Our general proposal for simulating events in a PDMP is based on concave-convex adaptive thinning. As the dynamics between events are deterministic, conditional on the current state of the PDMP we can rewrite the rate of the next event as a function of time. With slight abuse of notation we will denote this rate as $\lambda(t)$, with the argument of the function making it clear whether we are viewing it as a function of the state or, as here, as a function of time. Suppose $\lambda(t) = \max\{0, f(t)\}$ where $f(t)$ can be decomposed as
$$f(t) = f_\cup(t) + f_\cap(t) \qquad (1)$$
on a finite interval $t \in [0, \tau_{\max})$, where $f_\cup(t)$ is a convex function and $f_\cap(t)$ is a concave function in time. We will later discuss general conditions under which such a decomposition is possible. The problem of upper-bounding $f(t)$ is recast as finding upper-bounding piecewise linear functions $l_\cup(t)$ and $l_\cap(t)$ such that $f_\cup(t) \leq l_\cup(t)$ and $f_\cap(t) \leq l_\cap(t)$. We may then apply the bound $f(t) \leq l(t)$, where $l(t) = l_\cup(t) + l_\cap(t)$. Since $l(t)$ is the sum of piecewise linear functions it is also piecewise linear, and direct simulation from a Poisson process with rate $l(t)$ is possible (see Appendix A, supplementary materials). Concave-convex adaptive thinning proceeds by constructing a piecewise linear function $l(t)$ over a set of $m$ abscissae $t_1 = 0 < t_2 < \cdots < t_m = \tau_{\max}$ at which the evaluations $f_\cup(t_i)$, $f_\cap(t_i)$ and the derivative $f_\cap'(t_i)$ are known. An event time $\tau$ is simulated from the Poisson process with rate $\bar\lambda(t) = \max\{0, l(t)\}$ and is accepted with probability
$$\frac{\max\{0, f_\cup(\tau) + f_\cap(\tau)\}}{\bar\lambda(\tau)}.$$

If the event is rejected, we can reuse the evaluations of $f_\cup(\tau)$ and $f_\cap(\tau)$, with one additional evaluation of $f_\cap'(\tau)$, to refine the simulation on the abscissae from $t_1 = \tau$ to $\tau_{\max}$, where all existing evaluations for $\tau < t_k \leq \tau_{\max}$ may be reused. If an event does not occur on the interval $[0, \tau_{\max})$, the PDMP is evolved by $\tau_{\max}$ and the thinning process is repeated. We now give the general construction of the piecewise linear function $l(t)$ for $t \in [t_1, t_2)$, and note that this construction may be applied iteratively over the abscissae. The construction is depicted visually in Figure 1. A convex function $f_\cup(t)$ is, by definition, upper-bounded by the line segment connecting any two function evaluations. So for any $t \in [t_1, t_2)$ we have the upper bound $f_\cup(t) \leq l_\cup(t)$, where
$$l_\cup(t) = f_\cup(t_1) + \frac{f_\cup(t_2) - f_\cup(t_1)}{t_2 - t_1}(t - t_1).$$

Fig. 1 Upper bounds for a rate function based on concave (left) and convex (right) information on three abscissae ($t_i = 0, 1, 2$). Concave bounds are formed using linear segments from the gradient of $f_\cap(t_i)$. Convex bounds are constructed using linear segments connecting evaluations of $f_\cup(t_i)$.

For a concave function $f_\cap(t)$ with derivative $f_\cap'(t)$, the tangent line at any point upper-bounds the function. Thus, for $t \in [t_1, t_2)$, $f_\cap(t)$ is upper-bounded by
$$l_\cap(t) = \min\left\{f_\cap(t_1) + f_\cap'(t_1)(t - t_1),\; f_\cap(t_2) + f_\cap'(t_2)(t - t_2)\right\}.$$

This minimum switches at the point of intersection of the lines $f_\cap(t_1) + f_\cap'(t_1)(t - t_1)$ and $f_\cap(t_2) + f_\cap'(t_2)(t - t_2)$. It is simple to find this intersection point (Görür and Teh 2011),
$$t^* = \frac{f_\cap(t_2) - f_\cap'(t_2)t_2 - f_\cap(t_1) + f_\cap'(t_1)t_1}{f_\cap'(t_1) - f_\cap'(t_2)},$$
which will be a point in the interval $[t_1, t_2]$ provided the derivatives $f_\cap'(t_1)$ and $f_\cap'(t_2)$ are not equal. If $f_\cap'(t_1) = f_\cap'(t_2)$, the linear segment does not change over the interval and we take $t^* = t_2$. So we take
$$l_\cap(t) = \begin{cases} f_\cap(t_1) + f_\cap'(t_1)(t - t_1) & t \in [t_1, t^*)\\ f_\cap(t_2) + f_\cap'(t_2)(t - t_2) & t \in [t^*, t_2), \end{cases}$$
and combining these bounds we have $l(t) = l_\cup(t) + l_\cap(t)$, a piecewise linear function satisfying $f(t) \leq l(t)$ for $t \in [t_1, t_2)$.
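This construction translates directly into R. In the sketch below, `f_cup`, `f_cap` and `d_cap` are length-two vectors holding the evaluations of $f_\cup$, $f_\cap$ and $f_\cap'$ at the abscissae $t_1$ and $t_2$; the names and interface are ours, not the ccpdmp package's:

```r
# Piecewise linear upper bound l(t) = l_cup(t) + l_cap(t) on [t1, t2).
cc_upper_bound <- function(t1, t2, f_cup, f_cap, d_cap) {
  # chord over the convex part
  l_cup <- function(t) f_cup[1] + (f_cup[2] - f_cup[1]) / (t2 - t1) * (t - t1)
  # intersection of the two tangents to the concave part
  t_star <- if (d_cap[1] == d_cap[2]) t2 else
    (f_cap[2] - d_cap[2] * t2 - f_cap[1] + d_cap[1] * t1) / (d_cap[1] - d_cap[2])
  l_cap <- function(t) ifelse(t < t_star,
                              f_cap[1] + d_cap[1] * (t - t1),
                              f_cap[2] + d_cap[2] * (t - t2))
  function(t) l_cup(t) + l_cap(t)
}

# Example: f_cup(t) = 3t^2 + 3 and f_cap(t) = -t^3 - 3t on [0, 1)
l <- cc_upper_bound(0, 1,
                    f_cup = c(3, 6),   # f_cup(0), f_cup(1)
                    f_cap = c(0, -4),  # f_cap(0), f_cap(1)
                    d_cap = c(-3, -6)) # f_cap'(0), f_cap'(1)
l(0.5)  # 3.0, which upper-bounds f(0.5) = 2.125
```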

4.1 Concave-Convex Decompositions

The proposition below gives simple conditions for the class of nonhomogeneous Poisson processes that admit thinning using the concave-convex approach.

Proposition 1.

If a Poisson process with rate function $\lambda(t)$ can be written as $\lambda(t) = \max\{0, f(t)\}$ where $f(t)$ is continuous, then the process admits concave-convex adaptive thinning.

Proof.

Consider $f(t)$ on the interval $[0, \tau_{\max}]$, where $\tau_{\max} > 0$ is arbitrary. On this compact interval $f(t)$ is continuous, so by the Stone-Weierstrass theorem (Stone 1948), for any $\epsilon > 0$ there exists a polynomial $g(t)$ on the interval with $\|f - g\|_\infty < \epsilon$. The function $g$ is a polynomial, say $g(t) = \sum_i a_i t^i$, so it admits the concave-convex decomposition
$$g(t) = \sum_{\{i:\, a_i > 0\}} a_i t^i + \sum_{\{i:\, a_i < 0\}} a_i t^i$$
with convex function $g_\cup(t) = \sum_{\{i:\, a_i > 0\}} a_i t^i$ and concave function $g_\cap(t) = \sum_{\{i:\, a_i < 0\}} a_i t^i$ on $t \geq 0$. The process with rate $\lambda(t)$ admits thinning on the interval $[0, \tau_{\max})$ using the Poisson process with rate $\hat\lambda(t) = \max\{0, g(t) + \epsilon\}$, which has a concave-convex decomposition. If an event does not occur on the interval $[0, \tau_{\max})$ then the process is evolved to $\tau_{\max}$ and the thinning process repeated. □
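For a polynomial rate the decomposition in the proof is immediate: split the coefficients by sign. A minimal R sketch, with coefficients ordered $a_0, a_1, \ldots$:

```r
# Concave-convex split of g(t) = sum_i a_i t^i on t >= 0: nonnegative
# coefficients form the convex part, negative ones the concave part.
cc_split_poly <- function(a) {
  list(convex = pmax(a, 0), concave = pmin(a, 0))
}

# Example: g(t) = -t^3 + 3t^2 - 3t + 3, coefficients (a_0, ..., a_3)
parts <- cc_split_poly(c(3, -3, 3, -1))
parts$convex   # 3  0 3  0  ->  g_cup(t) = 3 + 3t^2
parts$concave  # 0 -3 0 -1  ->  g_cap(t) = -3t - t^3
```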

Remark 1.

While continuity of $f(t)$ is required for the Stone-Weierstrass theorem, the concave-convex process may be applied more generally. If $f(t)$ is continuous for all $0 < t < t^*$ and $\lim_{t \uparrow t^*} f(t) = \infty$, then concave-convex thinning can still be applied. Consider $f(t)$ using the abscissae $t_0 = 0$, $t_1 = t^*/2$, $t_{\max} = t^*$. Proposition 1 ensures a concave-convex decomposition on the interval $[0, t^*/2)$. On $[t^*/2, t^*)$ we upper-bound the rate by infinity. So if a time is not simulated on $[0, t^*/2)$, we propose an event at $t_1$, and this event will be rejected. The concave-convex bound is then updated based on $f(t^*/2)$ and used to simulate an event on $[t^*/2, t^*)$; the abscissae on this increment will be $t^*/2$, $3t^*/4$, and $t^*$, and the process is repeated. Importantly, as $\lim_{t \uparrow t^*} f(t) = \infty$, there will be an actual event before $t^*$ with probability one, so repeating this process will yield an (accepted) event time.

The proof of Proposition 1 relies on finding a polynomial that upper-bounds the rate over a finite interval. If a bound is available for higher-order derivatives of $f(t)$, there are two main approaches for finding an upper-bounding polynomial. Using Taylor's expansion of $f(t)$ about zero, we get
$$f(t) = f(0) + f'(0)t + \frac{f''(0)}{2!}t^2 + \cdots + \frac{f^{(k)}(0)}{k!}t^k + \int_0^t f^{(k+1)}(s)\,\frac{(t - s)^k}{k!}\,ds.$$

If there is a known constant $M$ such that $|f^{(k+1)}(t)| \leq M$, we can employ thinning based on the polynomial bound
$$g(t) = f(0) + f'(0)t + \frac{f''(0)}{2!}t^2 + \cdots + \frac{f^{(k)}(0)}{k!}t^k + \frac{M}{(k+1)!}t^{k+1}.$$
Alternatively, polynomial upper bounds can be constructed via interpolation with controlled error. For example, Lagrange polynomial interpolation on $[0, \tau_{\max}]$ with $k + 1$ evaluations of the function has error bounded by $\tau_{\max}^{k+1}M/(k+1)!$. Adding this bound as a constant yields an upper-bounding polynomial without requiring the evaluations of the derivatives of $f(t)$ needed for the Taylor expansion.

4.2 Concave-Convex Adaptive Thinning and PDMP-based Samplers

A key advantage of CC-PDMP is that it can easily and efficiently simulate from rates with an additive structure: if $f_1(t)$ and $f_2(t)$ both have concave-convex decompositions, then $f(t) = f_1(t) + f_2(t)$ also has one. The rates associated with a PDMP-based sampler targeting $\pi(\theta) \propto \exp(-U(\theta))$ have this additive structure. In particular, a local PDMP sampler as described in Section 2.5 has rates
$$\lambda_f(z_t) = \max\Big\{0, \sum_{j \in S_f} f_j(t)\Big\},$$
where $f_j(t) = v_j\,\partial_{\theta_j} U(\theta + tv)$. This means that the concave-convex decompositions for the partial derivatives $f_j(t)$ can be trivially combined to simulate from new local or global implementations of the PDMP. There is a subtle tradeoff between the computational and statistical efficiency offered by global and local PDMP-based samplers, which we explore in Section 6.

Additive rates are also present when using a PDMP to simulate from a Bayesian posterior distribution. For a model with parameter $\theta \in \mathbb{R}^p$, the target has the form
$$\pi(\theta) \propto p(\theta)\,p(y_{1:n} \mid \theta),$$
where $p(y_{1:n} \mid \theta)$ is the likelihood of the observed data $y_{1:n} = (y_1, \ldots, y_n)$ and $p(\theta)$ is the prior for the parameters of the model. The function associated with the $j$th partial derivative of $U(\theta)$ can be written as $f_j(t) = f_j^p(t) + f_j^l(t)$, where $f_j^p(t) = -v_j\,\partial_{\theta_j}\log p(\theta + tv)$ depends on the log prior and $f_j^l(t) = -v_j\,\partial_{\theta_j}\log p(y_{1:n} \mid \theta + tv)$ depends on the log-likelihood. This allows the rate function to be split into more manageable pieces. Once a decomposition has been found for a choice of prior or likelihood, it can be reused to facilitate simulation from that prior or likelihood in future models. Specific examples are given in Section 6.

5 Tuning Parameters

5.1 Choice of Abscissae

A key user choice in CC-PDMP sampling is the position and number of abscissae on the interval $[0, \tau_{\max})$. Suppose that CC-PDMP simulation is implemented with abscissae $t_0 = 0 < t_1 < \cdots < t_n = \tau_{\max}$. If an event does not occur on the interval $[0, \tau_{\max})$, the PDMP is evolved by $\tau_{\max}$ and the thinning process repeats, where the function evaluations at $\tau_{\max}$ can be reused at $t_0 = 0$. Thus, one simulation with $n$ abscissae is computationally equivalent to $n$ evolutions of the CC-PDMP approach using abscissae with only two points, $t_0 = 0$ and $t_1 = \tau_{\max}$. This encourages choosing a minimal number of abscissae and instead choosing the length of the interval carefully. There are two main issues that can occur when tuning the parameter $\tau_{\max}$. If $\tau_{\max}$ is too small, then proposals frequently lie outside the interval $[0, \tau_{\max})$ and simulating an event requires many iterations of bounding the event rate. If $\tau_{\max}$ is too large, then the bound on the rate may be poor, leading to many rejected proposals. In this article, we refer to an iteration of the CC-PDMP that does not generate an event time (i.e., a rejected proposal, or no proposal within the interval) as a shadow event. The total number of iterations is the sum of the number of events and shadow events, and efficiency is measured as the proportion of iterations that are events. Adapting the value of $\tau_{\max}$ does not change the sampling dynamics; consequently, unlike in adaptive MCMC, this parameter may be adapted throughout the course of simulating the PDMP. Ideally this parameter should be a little larger than the average event time, so that events are proposed on the interval. A simple automatic approach is to set $\tau_{\max}$ equal to the $q$th percentile of previously simulated event times, updating $\tau_{\max}$ every 100 iterations of the sampler. We found that $q = 80$ worked well for automatically selecting $\tau_{\max}$. We investigate the dependency on this tuning parameter by implementing the Zig-Zag sampler on the two-dimensional Banana distribution, with
$$U(\theta) = (\theta_1 - 1)^2 + \kappa(\theta_2 - \theta_1^2)^2,$$
where $\kappa$ controls how much mass concentrates around the region $\theta_2 \approx \theta_1^2$. Since the potential is polynomial in both parameters, the Zig-Zag rate functions are also polynomial. Further implementation details may be found in the supplementary material and associated GitHub repository.
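The adaptive rule described above is simple to implement: keep a window of recent inter-event times and reset $\tau_{\max}$ to their $q$th percentile every 100 iterations. A sketch (function names ours):

```r
# Adaptive tau_max: the q-th percentile of recent simulated event times.
update_tau_max <- function(recent_gaps, q = 0.8) {
  unname(quantile(recent_gaps, probs = q))
}

# Inside the sampler loop, e.g. every 100 events:
# if (n_events %% 100 == 0) tau_max <- update_tau_max(tail(gaps, 100))
```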

Figure 2 shows the efficiency of the Zig-Zag sampler on the Banana distribution with $\kappa = 1$ when the interval length $\tau_{\max}$ varies from 0.1 to 4 with two abscissae. The Zig-Zag algorithm was simulated for 10,000 events at each value of $\tau_{\max}$. The effect of selecting $\tau_{\max}$ too small or too large is clearly seen, and the red line shows the performance of the adaptive approach. The main computational cost is an artifact of the otherwise fast C++ code having to call an R function each time the concave-convex decomposition is evaluated; for simple event rates such as polynomials this computation can be implemented in C++ as well. Appendix D, supplementary materials, illustrates the effect of $\tau_{\max}$ on computation time when the concave-convex decomposition is evaluated in C++.

Fig. 2 Efficiency, defined as the proportion of iterations (events + shadow events) that are events, of Zig-Zag using CC-PDMP thinning with varying choices of $\tau_{\max}$ on the Banana target. The plot on the right shows the actual computation time for running the sampler with different choices of $\tau_{\max}$. The red line shows the performance of the automatic choice of this parameter.

5.2 Choice of Decomposition

The choice of concave-convex decomposition is not unique (Görür and Teh 2011). Arbitrary convex functions can be added to $f_\cup(t)$ and subtracted from $f_\cap(t)$ to give a new valid decomposition. A concave-convex decomposition is minimal if there is no non-affine convex function that can be subtracted from $f_\cup(t)$ and added to $f_\cap(t)$ while preserving the convexity of the decomposition (Hartman 1959). For example, consider the function
$$f(t) = -t^3 + 3t^2 - 3t + 3,$$
for which $f''(t) = -6t + 6$. It is simple to see that the function is convex for $t < 1$ and concave for $t > 1$. A minimal decomposition is given by the piecewise functions
$$f_\cup^1(t) = \begin{cases} -t^3 + 3t^2 - 3t + 3 & t \leq 1\\ 2 & t > 1 \end{cases} \quad\text{and}\quad f_\cap^1(t) = \begin{cases} 0 & t \leq 1\\ -t^3 + 3t^2 - 3t + 1 & t > 1. \end{cases}$$

Görür and Teh (2011) give a general construction for a minimal concave-convex decomposition. However, this construction relies on finding all points of inflection, that is, points where the function changes convexity. Alternatively, Proposition 1 gives the simple decomposition $f_\cup^2(t) = 3t^2 + 3$ and $f_\cap^2(t) = -t^3 - 3t$. While the decomposition from Proposition 1 is not minimal, it does not require finding points of inflection. These two decompositions are shown in Figure 3, using abscissae at 0 and 1.

Fig. 3 The upper bounds resulting from two different concave-convex decompositions of the function $f(t) = -t^3 + 3t^2 - 3t + 3$, based on abscissae at 0 and 1. The top row corresponds to the decomposition from Proposition 1 and the bottom row to an optimal decomposition which recognizes $f(t)$ as a convex function on $[0, 1)$. The columns show the functions $f_\cup$, $f_\cap$ and $f$, as well as their piecewise linear upper bounds.

The optimal decomposition gives a tighter bound, here reducing the bounding rate by approximately 0.33 on average across the interval. However, using this bound comes with the additional computational cost of finding inflection points, which has the potential to reduce the overall efficiency of the method. Our experience is that the efficiency gains from using the optimal decomposition are generally not sufficient for it to be beneficial. See Appendix D, supplementary materials, for further discussion and simulations.

6 Experiments

We now present empirical evaluation of CC-PDMP and comparison with other approaches to simulate PDMPs. Our experiments were implemented using the R package ccpdmp available at https://github.com/matt-sutton/ccpdmp.

The package enables the user to simulate a PDMP provided they specify the concave and convex decomposition of the rate function. If the rate function is polynomial (or bounded by a polynomial), the practitioner may pass the polynomial instead and the concave-convex decomposition will be handled internally. The package contains some basic example use cases and code to reproduce the experiments.

6.1 Application in Generalized Linear Models

Generalized linear models (GLMs) provide a rich class of models that are frequently used in Bayesian inference. Let $\{(y_i, x_i)\}_{i=1}^n$ be the observed data, where $y_i$ is an observed response and $x_i \in \mathbb{R}^p$ is a vector of associated covariates, for $i = 1, \ldots, n$. The expected value of $y_i$ is modeled by $g^{-1}(x_i^T\theta)$, where $g^{-1}: \mathbb{R} \to \mathbb{R}$ is the inverse link function. The potential of the likelihood has the form
$$U^{(l)}(\theta) = \sum_{i=1}^n \phi(x_i^T\theta, y_i),$$
where $\phi: \mathbb{R}^2 \to \mathbb{R}$ is the GLM mapping function $\phi(a, y) = -\log p(y \mid g^{-1}(a))$, returning the negative log-likelihood for observation $y$ given $a$. The partial derivative with respect to $\theta_k$ has the form
$$\partial_{\theta_k} U^{(l)}(\theta) = \sum_{i=1}^n \phi'(x_i^T\theta, y_i)\,x_{ik},$$
where $\phi'(a, y) = \partial_a\phi(a, y)$, and higher-order derivatives are defined similarly. Over the time interval $t \in [0, \tau]$, let $f_k(t) = v_k\,\partial_{\theta_k} U^{(l)}(\theta + tv)$, which is
$$f_k(t) = v_k\sum_{i=1}^n \phi'(a_i(t), y_i)\,x_{ik}, \quad \text{where } a_i(t) = x_i^T(\theta + tv).$$
We can use this to define local rates $\lambda_k(t) = \max\{0, f_k(t)\}$ for $k = 1, \ldots, p$. For GLMs, repeated application of the chain rule yields
$$f_k^{(j)}(t) = v_k\sum_{i=1}^n \phi^{(j+1)}(a_i(t), y_i)\,(a_i')^j\,x_{ik}, \quad \text{where } a_i' = x_i^Tv$$
and $f_k^{(j)}$ denotes the $j$th derivative of $f_k(t)$. When there are bounds on $\phi^{(j)}$ we can use the upper-bounding Taylor polynomial; in Appendix B, supplementary materials, we provide bounds for several modeling choices. Here we consider logistic regression, which has mapping function $\phi(a, y) = \log(1 + \exp(a)) - ya$ with $y \in \{0, 1\}$. We examine the efficiency of CC-PDMP thinning for a five-dimensional logistic regression problem with $n = 200$ observations. The covariates were generated from a multivariate normal, $x_i \sim N(0, V^{-1})$ for $i = 1, \ldots, 200$, with mean zero and precision matrix
$$V = \begin{pmatrix} 1 & \rho & 0 & 0 & 0\\ \rho & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 0 & 1 \end{pmatrix},$$
and data were generated using $\theta = (1.25, 0.5, 0.4, 0.4, 0.4)^T$. We take a Gaussian prior $\theta_j \sim N(0, 1)$ for $j = 1, \ldots, 5$. As the correlation $\rho$ increases the thinning becomes more challenging, so we investigate the performance of CC-PDMP with increasing polynomial bounds and increasing correlation $\rho$.
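For logistic regression the quantities above are simple to evaluate. The R sketch below computes the coefficients of an order-2 Taylor upper bound for the rate of coordinate $k$, with the remainder controlled by the derivative bound $|\phi^{(4)}| \leq 1/8$; it is a sketch of the construction in Section 4.1, with names of our own choosing:

```r
# Coefficients (c0, c1, c2, c3) of a cubic upper bound
# f_k(t) <= c0 + c1*t + c2*t^2 + c3*t^3 for logistic regression,
# using phi'(a, y) = sigma(a) - y and the bound |phi''''| <= 1/8.
logistic_poly_bound <- function(X, y, theta, v, k) {
  a    <- as.vector(X %*% theta)   # a_i(0) = x_i^T theta
  da   <- as.vector(X %*% v)       # a_i' = x_i^T v
  sig  <- 1 / (1 + exp(-a))
  phi1 <- sig - y                  # phi'
  phi2 <- sig * (1 - sig)          # phi''
  phi3 <- phi2 * (1 - 2 * sig)     # phi'''
  c0 <- v[k] * sum(phi1 * X[, k])
  c1 <- v[k] * sum(phi2 * da * X[, k])
  c2 <- v[k] * sum(phi3 * da^2 * X[, k]) / 2
  M  <- abs(v[k]) * sum(abs(da)^3 * abs(X[, k])) / 8   # bound on |f_k'''|
  c(c0, c1, c2, M / 6)   # Taylor polynomial plus remainder term
}
```

The resulting cubic can then be split into concave and convex parts by the sign of its coefficients, as in Proposition 1.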

To date, only linear bounds (polynomial order 1) have been used for thinning in logistic regression (Bierkens et al. 2018; Bouchard-Côté, Vollmer, and Doucet 2018). These are based on the bound $|\phi''(a, y)| \leq 1/4$. Table 1 shows the efficiency of thinning using polynomials of order 1-3 (using the bounds $|\phi''(a, y)| \leq 1/4$, $|\phi'''(a, y)| \leq 1/(6\sqrt{3})$ and $|\phi^{(4)}(a, y)| \leq 1/8$). The linear first-order bound matches the bound used in Bierkens et al. (2018) for logistic regression with the Zig-Zag sampler. The table shows that the higher-order polynomial thinning facilitated by CC-PDMP allows more efficient thinning. However, these bounds can incur higher computational cost, as they require evaluation of higher-order derivatives. For this example, an order-2 polynomial appears to strike the best tradeoff between computation and thinning efficiency, giving the best performance. For the linear bounds we used a large fixed $\tau_{\max}$ to ensure the linear rate proposal is simulated on the interval $[0, \tau_{\max})$; for the other polynomial orders we used the adaptive procedure described in Section 5.1 to select $\tau_{\max}$.

Table 1 Efficiency of thinning in Zig-Zag sampling of logistic regression for increasing order of Taylor series polynomial thinning bound.

6.2 Comparison to Thinning via Super-Positioning

In this example we demonstrate the advantages of CC-PDMP when the event rate is additive. In particular, we compare thinning using the CC-PDMP approach with thinning using super-positioning, as outlined in Section 3. Let $\{y_i\}_{i=1}^n$ be observed data with the model
$$y_i \mid \theta_i \sim \text{Poisson}(\exp(\theta_i)), \qquad \theta_i \sim N(0, 1),$$
independently for $i = 1, \ldots, n$. The derivative of the potential is $\partial_{\theta_k} U(\theta) = \theta_k - y_k + \exp(\theta_k)$, and
$$f_k(t) = v_k\,\partial_{\theta_k} U(\theta + tv) = v_k(\theta_k + v_kt) - y_kv_k + v_k\exp(\theta_k + v_kt).$$

This has the concave-convex decomposition $f_\cup(t) = f_k(t)$ when $v_k > 0$, and $f_\cup(t) = v_k(\theta_k + v_kt) - y_kv_k$, $f_\cap(t) = v_k\exp(\theta_k + v_kt)$ when $v_k < 0$. The global BPS rate can be defined as
$$\lambda(t) = \max\Big\{0, \sum_{k=1}^n v_k\,\partial_{\theta_k} U(\theta + tv)\Big\} = \max\Big\{0, \sum_{k=1}^n f_k(t)\Big\},$$
which has a concave-convex decomposition defined by the decompositions of the individual $f_k$ functions. In contrast, thinning via super-positioning involves simulating an event time for the linear and exponential terms of the rate individually; the proposed event time is the minimum of all simulated times. This approach to simulating event times for exponential likelihoods and Poisson-Gaussian Markov random fields has been used previously in implementing PDMPs (Vanetti et al. 2017; Bouchard-Côté, Vollmer, and Doucet 2018).
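In code, the decomposition of each $f_k$ amounts to checking the sign of $v_k$; a sketch, with the linear part always assigned to the convex component:

```r
# Concave-convex split of
#   f_k(t) = v_k*(theta_k + v_k*t) - y_k*v_k + v_k*exp(theta_k + v_k*t):
# the exponential term is convex when v_k > 0 and concave when v_k < 0.
cc_split_poisson <- function(theta_k, v_k, y_k) {
  lin <- function(t) v_k * (theta_k + v_k * t) - y_k * v_k
  ex  <- function(t) v_k * exp(theta_k + v_k * t)
  if (v_k > 0) {
    list(convex = function(t) lin(t) + ex(t),
         concave = function(t) 0 * t)
  } else {
    list(convex = lin, concave = ex)
  }
}
```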

Figure 4 compares the thinning and overall computational efficiency of the BPS applied to the Poisson likelihood problem. In the CC-PDMP approach we used the adaptive update for $\tau_{\max}$ described in Section 5.1. As the dimension increases, the proportion of iterations that are events using super-positioning drops quickly, and consequently the overall computation time for this approach scales poorly. The poor performance of super-positioning with increasing dimension is the result of the large number of exponential terms, $v_k\exp(\theta_k + v_kt)$, in the event rate. The thinning acceptance probability for this proposal is
$$\frac{\max\Big\{0, \sum_{k=1}^n v_k\big(\theta_k + v_kt - y_k + \exp(\theta_k + v_kt)\big)\Big\}}{\max\Big\{0, \sum_{k=1}^n \big(v_k(\theta_k + v_kt) - y_kv_k\big)\Big\} + \sum_{k=1}^n \max\{0, v_k\exp(\theta_k + v_kt)\}}.$$

Fig. 4 Efficiency of thinning for the global BPS using thinning via super-positioning and CC-PDMP thinning with increasing dimension. Efficiency of the thinning (left) and total computation time (right) are averaged over 20 repeated runs. The BPS samplers were simulated for 1000 event times on each run.

When $v_k < 0$ these exponential terms contribute zero to the denominator of the super-positioning acceptance probability. In comparison, the CC-PDMP approach uses a linear upper bound on these exponential terms, which can be negative and reduce the denominator, leading to more efficient thinning proposals. This is seen in Figure 4, where the CC-PDMP thinning efficiency remains roughly constant with increasing dimension.

6.3 Local Methods

Local PDMP methods take advantage of conditional independence between parameters. Consider the following extension of the previous Poisson example:
$$y_i \mid \theta_i \sim \text{Poisson}(\exp(\theta_i)), \text{ independently for } i = 1, \ldots, n,$$
$$\theta_1 \sim N(0, 1/(1 - \rho^2)), \qquad \theta_i \mid \theta_{i-1} \sim N(\rho\theta_{i-1}, 1), \text{ for } i = 2, \ldots, n,$$
where we fix $\rho = 0.5$. The prior on $\theta$ corresponds to an AR(1) process. The partial derivative of the potential is
$$\partial_{\theta_k} U(\theta) = \begin{cases} \theta_k - \rho\theta_{k+1} - y_k + \exp(\theta_k) & k = 1\\ (1 + \rho^2)\theta_k - \rho\theta_{k-1} - \rho\theta_{k+1} - y_k + \exp(\theta_k) & 1 < k < n\\ \theta_k - \rho\theta_{k-1} - y_k + \exp(\theta_k) & k = n, \end{cases}$$
which has the same linear and exponential form as the rate in Section 6.2, so we can use an analogous concave-convex decomposition. In this section we consider local PDMP implementations. The CC-PDMP implementation facilitates simple construction of thinning bounds for local PDMP factorizations. For the partition $\mathcal{S} = \{S_1, \ldots, S_F\}$, the local rate for factor $f$ is
$$\lambda_f(z_t) = \max\Big\{0, \sum_{k \in S_f} v_k\,\partial_{\theta_k} U(\theta_t)\Big\} = \max\Big\{0, \sum_{k \in S_f} f_k(t)\Big\},$$
for $f = 1, \ldots, F$. We consider local decompositions of the form $\mathcal{S}(j) = \{S_1, \ldots, S_F\}$ where $S_1 = \{1, 2, \ldots, j\}$, $S_2 = \{j + 1, j + 2, \ldots, 2j\}, \ldots, S_F = \{n - j + 1, \ldots, n - 1, n\}$. We assume $n$ is divisible by $F$ so each $S_f$ contains $j$ elements. For these decompositions, the conditional independence between parameters gives the neighbor sets $\mathcal{N} = \{N_1, \ldots, N_F\}$ where
$$N_f = \begin{cases} \{f, f + 1\} & f = 1\\ \{f - 1, f, f + 1\} & 1 < f < F\\ \{f - 1, f\} & f = F. \end{cases}$$
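Constructing the partition $\mathcal{S}(j)$ and its neighbor sets in code is mechanical; a sketch, assuming $n$ is divisible by $j$:

```r
# Blocks S_1 = {1..j}, S_2 = {j+1..2j}, ... and their AR(1) neighbour
# sets: an event in block f only invalidates blocks f-1, f, f+1.
make_local_factors <- function(n, j) {
  Fn <- n / j
  S <- lapply(seq_len(Fn), function(f) ((f - 1) * j + 1):(f * j))
  N <- lapply(seq_len(Fn), function(f)
    intersect(c(f - 1, f, f + 1), seq_len(Fn)))
  list(S = S, N = N)
}

fac <- make_local_factors(n = 8, j = 2)
fac$S[[2]]  # 3 4
fac$N[[1]]  # 1 2
```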

This local decomposition offers computational advantages since updating a single rate requires recalculating at most three rates, regardless of the dimension of the problem. However, this computational efficiency comes at a loss of statistical efficiency. In particular, a global PDMP will be able to move further in stochastic time than local methods before requiring a new event simulation, as the global rate function lower-bounds the total rate of the local method. The overall efficiency of a PDMP method should therefore be considered in terms of both its computational and statistical efficiency.

In Figure 5 we measure overall efficiency as the effective sample size (ESS) per second, computational efficiency as the number of events or MCMC samples per second, and statistical efficiency as the ESS per MCMC sample or PDMP event. This allows a clear description of the scaling performance of local methods in comparison with alternative state-of-the-art MCMC methods. Linear trends on the log-log scale are stated in the figure legend and used to give approximate asymptotic scaling rates with dimension. We compare local PDMP methods with well-tuned implementations of the Metropolis-adjusted Langevin algorithm (MALA) and Hamiltonian Monte Carlo (HMC). MALA is tuned to an acceptance rate of approximately 0.5 and HMC to an acceptance probability of approximately 0.6; these are close to the optimal values from the MCMC optimal scaling literature (Roberts and Rosenthal 1998; Beskos et al. 2013). For MALA this involves scaling the variance of the proposal, and for HMC adjusting the number of leapfrog steps per iteration. These implementations are in line with theoretical scaling results when the parameters of the distribution are independent.

Fig. 5 Log-log breakdown of computational, statistical and overall empirical efficiency scaling with dimension. The ESS values are calculated with respect to the first coordinate $\theta_1$ using the coda R package on a discretized trajectory of the PDMPs. Plotted are the average rates and error bars of all methods calculated from 50 repeated runs. The legend shows the slope and intercept fitted for each method, giving empirical evidence for the scaling rates.

The computational efficiency of the local PDMP methods remains approximately constant with increasing dimension, while for MALA and the global BPS it scales approximately as $O(d^{-1})$. The computational efficiency of HMC scales at a rate worse than $O(d^{-1})$ due to the increasing number of leapfrog iterations. The statistical efficiency of the local methods improves with larger local factors and is highest for the global BPS. The statistical efficiency of MALA drops at a rate roughly equal to $O(d^{-1/3})$. The Zig-Zag and local BPS methods attain the best overall efficiency, with scaling rates around $O(d^{-1})$ or better. However, there is clearly a tradeoff between the statistical efficiency of larger local factors and their increased computational cost. This decomposition can be thought of as an additional choice in PDMP implementation that can easily be tuned using the CC-PDMP approach for thinning. Despite poorer scaling with dimension, HMC remains competitive with the local methods up to very high dimensions.

7 Conclusion

PDMP-based samplers have shown advantages over traditional MCMC samplers, but their use has been limited by the perceived difficulty of simulating the event times. We have introduced CC-PDMP as a general approach that can simulate a PDMP provided we can specify a concave-convex decomposition of the event rate. This method has broad applicability, enables simple implementation of local PDMP methods, and empirically outperforms alternative simulation approaches. Additional generalizations of the CC-PDMP approach are possible by making use of other ideas from adaptive rejection sampling. For example, to bound the concave term, $f_\cap(t)$, of the rate function we require that the derivative $f_\cap'(t)$ is known on the interval $t \in [0, \tau_{\max})$; we could instead use a looser bound that does not require this derivative information, based on the adaptive rejection bounds proposed in Gilks (1992). If the rate function is particularly computationally intensive, a lower bound based on the abscissae where $f_\cup$, $f_\cap$ and $f_\cap'$ are evaluated may be used to perform early rejection in the thinning. Further generalizations may also be found in the difference-of-convex-functions programming literature (Le Thi and Pham Dinh 2018).

A general approach to facilitate thinning using a concave-convex decomposition was described in Section 4. This approach advocates finding an upper-bounding polynomial approximation to the rate function, which may then be thinned using concave-convex adaptive thinning as described in Proposition 1. It is natural to consider an approximate version of our algorithm where this polynomial is estimated via interpolation; however, without controlling the potential error in the polynomial, the overall algorithm would be biased. Recently, Corbella, Spencer, and Roberts (2022) proposed an automatic Zig-Zag implementation which uses optimization methods to obtain a valid linear thinning bound. Future research may consider how to automate the concave-convex thinning procedure for higher-order polynomial thinning based on this work, which may enable more efficient automatic PDMP-based samplers. Finally, while our CC-PDMP approach has been illustrated only on PDMPs with linear dynamics, the approach can be used more generally on PDMPs with nonlinear dynamics, such as the Boomerang sampler of Bierkens et al. (2020). The only requirement is that the rate function can be bounded by a function with a concave-convex decomposition.


Supplementary Materials

Appendix: Appendix containing additional coding and mathematical details for the work (appendix.pdf)

R-package: R-package “ccpdmp” containing code to perform methods described in the article. The package also contains all data generating code to generate the examples in the article. (ccpdmp.zip, zipped file)

Readme: A markdown file giving instructions for installing the package and replicating the results (README.md)

Additional information

Funding

This research was supported by EPSRC grants EP/R018561 and EP/R034710/1.

References

  • Beskos, A., Pillai, N., Roberts, G., Sanz-Serna, J. M., and Stuart, A. (2013), "Optimal Tuning of the Hybrid Monte Carlo Algorithm," Bernoulli, 19, 1501–1534. DOI: 10.3150/12-BEJ414.
  • Bierkens, J., and Roberts, G. (2017), “A Piecewise Deterministic Scaling Limit of Lifted Metropolis–Hastings in the Curie–Weiss Model,” The Annals of Applied Probability, 27, 846–882. DOI: 10.1214/16-AAP1217.
  • Bierkens, J., Bouchard-Côté, A., Doucet, A., Duncan, A. B., Fearnhead, P., Lienart, T., Roberts, G., and Vollmer, S. J. (2018), “Piecewise Deterministic Markov Processes for Scalable Monte Carlo on Restricted Domains,” Statistics & Probability Letters, 136, 148–154. DOI: 10.1016/j.spl.2018.02.021.
  • Bierkens, J., Fearnhead, P., and Roberts, G. (2019), "The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data," The Annals of Statistics, 47, 1288–1320. DOI: 10.1214/18-AOS1715.
  • Bierkens, J., Grazzi, S., Kamatani, K., and Roberts, G. (2020), “The Boomerang Sampler,” in: International Conference on Machine Learning, pp. 908–918, PMLR.
  • Bierkens, J., Grazzi, S., van der Meulen, F., and Schauer, M. (2021), “A Piecewise Deterministic Monte Carlo Method for Diffusion Bridges,” Statistics and Computing, 31, 37. DOI: 10.1007/s11222-021-10008-8.
  • Bouchard-Côté, A., Vollmer, S. J., and Doucet, A. (2018), “The Bouncy Particle Sampler: A Nonreversible Rejection-Free Markov Chain Monte Carlo Method,” Journal of the American Statistical Association, 113, 855–867. DOI: 10.1080/01621459.2017.1294075.
  • Corbella, A., Spencer, S. E. F., and Roberts, G. O. (2022), "Automatic Zig-Zag Sampling in Practice," Statistics and Computing, 32, 107. DOI: 10.1007/s11222-022-10142-x.
  • Cotter, S., House, T., and Pagani, F. (2020), "The NuZZ: Numerical Zigzag Sampling for General Models," arXiv:2003.03636.
  • Davis, M. (1993), Markov Models and Optimization, New York: Chapman Hall.
  • Diaconis, P., Holmes, S., and Neal, R. M. (2000), “Analysis of a Nonreversible Markov Chain Sampler,” Annals of Applied Probability, 10, 726–752.
  • Fearnhead, P., Bierkens, J., Pollock, M., and Roberts, G. O. (2018), “Piecewise Deterministic Markov Processes for Continuous-Time Monte Carlo,” Statistical Science, 33, 386–412. DOI: 10.1214/18-STS648.
  • Gilks, W. R. (1992), “Derivative-Free Adaptive Rejection Sampling for Gibbs Sampling,” Bayesian Statistics, 4, 641–649.
  • Goan, E., Perrin, D., Mengersen, K., and Fookes, C. (2023), "Piecewise Deterministic Markov Processes for Bayesian Neural Networks," arXiv:2302.08724.
  • Goldman, J. V., and Singh, S. S. (2021), “Spatiotemporal Blocking of the Bouncy Particle Sampler for Efficient Inference in State-Space Models,” Statistics and Computing, 31, 68. DOI: 10.1007/s11222-021-10034-6.
  • Görür, D., and Teh, Y. W. (2011), “Concave-Convex Adaptive Rejection Sampling,” Journal of Computational and Graphical Statistics, 20, 670–691. DOI: 10.1198/jcgs.2011.09058.
  • Hartman, P. (1959), “On Functions Representable as a Difference of Convex Functions,” Pacific Journal of Mathematics, 9, 707–713. DOI: 10.2140/pjm.1959.9.707.
  • Kingman, J. F. C. (1992), Poisson Processes, Oxford: Clarendon Press.
  • Le Thi, H. A., and Pham Dinh, T. (2018), "DC Programming and DCA: Thirty Years of Developments," Mathematical Programming, 169, 5–68. DOI: 10.1007/s10107-018-1235-y.
  • Lewis, P. A. W., and Shedler, G. S. (1979), “Simulation of Nonhomogeneous Poisson Processes by Thinning,” Naval Research Logistics Quarterly, 26, 403–413. DOI: 10.1002/nav.3800260304.
  • Michel, M., Kapfer, S. C., and Krauth, W. (2014), “Generalized Event-Chain Monte Carlo: Constructing Rejection-Free Global-Balance Algorithms from Infinitesimal Steps,” The Journal of Chemical Physics, 140, 054116. DOI: 10.1063/1.4863991.
  • Michel, M., Durmus, A., and Sénécal, S. (2020), “Forward Event-Chain Monte Carlo: Fast Sampling by Randomness Control in Irreversible Markov Chains,” Journal of Computational and Graphical Statistics, 29, 689–702. DOI: 10.1080/10618600.2020.1750417.
  • Pakman, A., Gilboa, D., Carlson, D., and Paninski, L. (2017), “Stochastic Bouncy Particle Sampler,” in Proceedings of the 34th International Conference on Machine Learning (Vol. 70), pp. 2741–2750, PMLR.
  • Peters, E. A., and de With, G. (2012), “Rejection-Free Monte Carlo Sampling for General Potentials,” Physical Review E, 85, 026703. DOI: 10.1103/PhysRevE.85.026703.
  • Roberts, G. O., and Rosenthal, J. S. (1998), “Optimal Scaling of Discrete Approximations to Langevin Diffusions,” Journal of the Royal Statistical Society, Series B, 60, 255–268. DOI: 10.1111/1467-9868.00123.
  • Sen, D., Sachs, M., Lu, J., and Dunson, D. (2020), “Efficient Posterior Sampling for High-Dimensional Imbalanced Logistic Regression,” Biometrika, 107, 1005–1012. DOI: 10.1093/biomet/asaa035.
  • Stone, M. H. (1948), “The Generalized Weierstrass Approximation Theorem,” Mathematics Magazine, 21, 237–254. DOI: 10.2307/3029337.
  • Vanetti, P., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. (2017), “Piecewise-Deterministic Markov Chain Monte Carlo,” arXiv:1707.05296.
  • Wu, C., and Robert, C. P. (2020), “Coordinate Sampler: A Non-Reversible Gibbs-like MCMC Sampler,” Statistics and Computing, 30, 721–730. DOI: 10.1007/s11222-019-09913-w.