Bayesian and Monte Carlo Methods

A Quantum Parallel Markov Chain Monte Carlo

Pages 1402-1415 | Received 13 Sep 2022, Accepted 19 Mar 2023, Published online: 21 Apr 2023

Abstract

We propose a novel hybrid quantum computing strategy for parallel MCMC algorithms that generate multiple proposals at each step. This strategy makes the rate-limiting step within parallel MCMC amenable to quantum parallelization by using the Gumbel-max trick to turn the generalized accept-reject step into a discrete optimization problem. When combined with new insights from the parallel MCMC literature, such an approach allows us to embed target density evaluations within a well-known extension of Grover's quantum search algorithm. Letting P denote the number of proposals in a single MCMC iteration, the combined strategy reduces the number of target evaluations required from $O(P)$ to $O(\sqrt{P})$. In the following, we review the rudiments of quantum computing, quantum search and the Gumbel-max trick in order to elucidate their combination for as wide a readership as possible.

1 Introduction

Parallel MCMC techniques use multiple proposals to obtain efficiency gains over MCMC algorithms such as Metropolis-Hastings (Metropolis et al. 1953; Hastings 1970) and its progeny that use only a single proposal. Neal (2003) first develops efficient MCMC transitions for inferring the states of hidden Markov models by proposing a "pool" of candidate states and using dynamic programming to select among them. Next, Tjelmeland (2004) considers inference in the general setting and shows how to maintain detailed balance for an arbitrary number P of proposals. Consider a probability distribution $\pi(d\theta)$ defined on $\mathbb{R}^D$ that admits a probability density $\pi(\theta)$ with respect to the Lebesgue measure, that is, $\pi(d\theta) =: \pi(\theta)\,d\theta$. To generate samples from the target distribution $\pi$, we craft a kernel $P(\theta_0, d\theta)$ that satisfies
(1) $\pi(A) = \int \pi(d\theta_0)\, P(\theta_0, A)$
for all $A \subset \mathbb{R}^D$. Letting $\theta_0$ denote the current state of the Markov chain, Tjelmeland (2004) proposes sampling from such a kernel $P(\theta_0, \cdot)$ by drawing P proposals $\Theta_{-0} = (\theta_1, \ldots, \theta_P)$ from a joint distribution $Q(\theta_0, d\Theta_{-0}) =: q(\theta_0, \Theta_{-0})\,d\Theta_{-0}$ and selecting the next Markov chain state from among the current and proposed states with probabilities
(2) $\pi_p = \frac{\pi(\theta_p)\, q(\theta_p, \Theta_{-p})}{\sum_{p'=0}^{P} \pi(\theta_{p'})\, q(\theta_{p'}, \Theta_{-p'})}\,, \quad p = 0, \ldots, P\,.$

Here, if $\Theta = (\theta_0, \ldots, \theta_P)$ is the $D \times (P+1)$ matrix having the current state and all P proposals as columns, then $\Theta_{-p}$ is the $D \times P$ matrix obtained by removing the pth column. Tjelmeland (2004) shows that the kernel $P(\theta_0, \cdot)$ constructed in such a manner maintains detailed balance and hence satisfies (1). Others have since built on this work, developing parallel MCMC methods that generate or recycle multiple proposals (Frenkel 2004; Delmas and Jourdain 2009; Calderhead 2014; Yang et al. 2018; Luo and Tjelmeland 2019; Schwedes and Calderhead 2021; Holbrook 2021). Most recently, Glatt-Holtz et al. (2022) place all these developments in their natural measure theoretical context, allowing one to trivially apply (1) and (2) to distributions over discrete-valued random variables.

Taken together, these developments demonstrate the ability of parallel MCMC methods to alleviate inferential challenges such as multimodality and to deliver performance gains over single-proposal competitors as measured by reduced autocorrelation between samples. In the following, we focus on the specification presented in Algorithm 1 but note that the techniques we present may also be effective for generalizations of this basic algorithm.

One need not use parallel computing to implement Algorithm 1, but the real promise and power of parallel MCMC comes from its natural parallelizability (Calderhead 2014). Contemporary hardware design emphasizes architectures that enable execution of multiple mathematical operations simultaneously. Parallel MCMC techniques stand to leverage technological developments that keep modern computation on track with Moore's Law, which predicts that processing power doubles every two years. For example, the algorithm of Tjelmeland (2004) generates P conditionally independent proposals and then evaluates the probabilities of (2). One may parallelize the proposal generation step using parallel pseudorandom number generators (PRNG) such as those advanced in Salmon et al. (2011). The computational complexity of the target evaluations $\pi(\theta_p)$ is linear in the number of proposals. This presents a significant burden when π(·) is computationally expensive, for example, in big data settings or for Bayesian inversion, but evaluation of the target density over the P proposals is again a naturally parallelizable task. Moreover, widely available machine learning software such as TensorFlow allows users to easily parallelize both random number generation and target evaluations on general purpose graphics processing units (GPU) (Lao et al. 2020). Finally, when generating independent proposals using a proposal distribution of the form $q(\theta_0, \Theta_{-0}) = \prod_{p=1}^{P} q(\theta_0, \theta_p)$, the acceptance probabilities (2) require the $O(P^2)$ evaluation of the $\binom{P+1}{2}$ terms $q(\theta_p, \theta_{p'})$, but Holbrook et al. (2021a, 2021b) demonstrate the natural parallelizability of such pairwise operations, obtaining orders-of-magnitude speedups with contemporary GPUs. The proposed method directly addresses the acceptance step of Algorithm 1, while leaving the choice of parallelizing (or not parallelizing) the proposal step to the practitioner.

While parallel MCMC algorithms are increasingly well-suited for developing many-core computational architectures, there are trade-offs that need to be taken into account when choosing how to allocate computational resources. On one end of the spectrum, Gelman and Rubin (1992a, 1992b) demonstrate the usefulness of generating, combining and comparing multiple independent Markov chains that target the same distribution, and one may perform this embarrassingly parallel task by assigning the operations for each individual chain to a separate central processing unit (CPU) core or GPU work-group. In this multi-chain context, simultaneously running multiple parallel MCMC chains could limit resources available for the within-chain parallelization described in the previous paragraph. On the other end of the spectrum, one may find it useful to allocate resources to overcome computational bottlenecks within a single Markov chain that uses a traditional accept-reject step. In big data contexts, Holbrook et al. (2021a, 2021b) and Holbrook, Ji, and Suchard (2022a, 2022b) use multi- and many-core processing to accelerate log-likelihood and log-likelihood gradient calculations within single Metropolis-Hastings and Hamiltonian Monte Carlo (Neal 2011) generated chains. This big data strategy might again limit resources available for parallelization across proposals and target evaluations described above.

Algorithm 1:

Parallel (multiproposal) MCMC (Tjelmeland 2004)

Data: Initial Markov chain state $\theta^{(0)}$; total length of Markov chain S; total number of proposals per iteration P; routine for evaluating the target density π(·); routines for drawing random samples from the proposal distribution $Q(\theta^{(s)}, \cdot)$ and from a (P + 1)-dimensional discrete distribution $\text{Discrete}(\boldsymbol{\pi})$ given some probability vector $\boldsymbol{\pi} = (\pi_0, \ldots, \pi_P)$.

Result: A Markov chain $\theta^{(1)}, \ldots, \theta^{(S)}$.
for s ∈ {1, …, S} do

  $\theta_0 \leftarrow \theta^{(s-1)}$; $\Theta_{-0} \sim Q(\theta_0, \cdot)$;
  for p ∈ {0, …, P} do $\pi_p \leftarrow \pi(\theta_p)\, q(\theta_p, \Theta_{-p})$; end
  $\hat{p} \sim \text{Discrete}(\boldsymbol{\pi}/\boldsymbol{\pi}^{\mathsf{T}}\mathbf{1})$; $\theta^{(s)} \leftarrow \theta_{\hat{p}}$;

end
return $\theta^{(1)}, \ldots, \theta^{(S)}$.
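To make the selection weights in (2) concrete, the following is a minimal classical sketch of Algorithm 1 in Python/NumPy under one convenient special case: an independence proposal $q(\theta_0, \Theta_{-0}) = \prod_{p=1}^{P} q(\theta_p)$, for which $q(\theta_p, \Theta_{-p}) = \prod_{p' \neq p} q(\theta_{p'})$. The target density, proposal scale, chain length and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def log_target(theta):
    # Illustrative target: standard bivariate normal.
    return -0.5 * np.sum(theta**2, axis=-1)

def parallel_mcmc(theta0, S=5000, P=20, prop_scale=2.0):
    """Classical sketch of Algorithm 1 with the independence proposal
    q(theta_0, Theta_{-0}) = prod_p Normal(theta_p; 0, prop_scale^2 I)."""
    D = theta0.shape[0]
    chain = np.empty((S, D))
    theta_curr = theta0.copy()
    for s in range(S):
        # Rows 0, 1, ..., P hold the current state and the P proposals.
        proposals = rng.normal(0.0, prop_scale, size=(P, D))
        Theta = np.vstack([theta_curr, proposals])
        # log q(theta_p, Theta_{-p}) = sum over p' != p of log q(theta_{p'}).
        log_q = stats.norm.logpdf(Theta, scale=prop_scale).sum(axis=1)
        log_q_minus_p = log_q.sum() - log_q
        # Unnormalized log selection weights from Equation (2).
        log_w = log_target(Theta) + log_q_minus_p
        probs = np.exp(log_w - log_w.max())
        probs /= probs.sum()
        theta_curr = Theta[rng.choice(P + 1, p=probs)]
        chain[s] = theta_curr
    return chain

chain = parallel_mcmc(np.array([5.0, -5.0]))
print(chain.mean(axis=0), chain.std(axis=0))  # roughly (0, 0) and (1, 1)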

In the presence of these tradeoffs, it is worthwhile to develop additional parallel computing tools for parallel MCMC so that future scientists may be better able to flexibly assign sub-routines to different computing resources in a way that is tailored to their specific inference task. In particular, we show that quantum parallelization can be a useful tool for parallel MCMC when evaluation of the target density represents the computational bottleneck.

Quantum computers use quantum mechanics to store information and manipulate data. By leveraging the idiosyncrasies of quantum physics, these computers are able to deliver speedups over conventional computing for a relatively small number of computational problems, and some of these speedups are very large. Quantum computers can achieve exponential speedups over conventional computers for certain tasks: Shor's quantum algorithm for factoring an integer N (Shor 1994) is polynomial in log N compared to a super-polynomial classical optimum, and the HHL algorithm for solving sparse N-dimensional linear systems (Harrow, Hassidim, and Lloyd 2009) is $O(\log N)$ compared to a classical $O(N)$. Other algorithms deliver a still impressive polynomial speedup. For example, the algorithms considered in the following (Section 2.2) achieve quadratic speedups over conventional computing, turning $O(N)$ runtimes into $O(\sqrt{N})$. Both the magnitude of quantum computing's successes and the relative rarity of those successes mean that there is a general interest in leveraging quantum algorithms for previously unconsidered computational problems. We will show that—with the help of the Gumbel-max trick (Section B)—established quantum optimization techniques are actually useful for sampling from computationally expensive discrete distributions and apply this insight to achieve quadratic speedups for parallel MCMC.

We provide a short introduction to the most basic elements of quantum computing in Appendix A. The interested reader may look to Nielsen and Chuang (2010) for a canonical introduction to quantum algorithms or to Lopatnikova and Tran (2021) and Wang and Liu (2022) for surveys geared toward statisticians. In Section 2.1, we review and compare three different quantum search algorithms that enjoy quadratic speedups over conventional algorithms. In Section 2.2, we review the quantum minimization algorithm of Durr and Hoyer (1996) and investigate the use of warm-starting therein. We then combine the Gumbel-max trick (Section B) and recent insights in parallel MCMC with the quantum minimization algorithm to create the quantum parallel MCMC algorithm for both continuously-valued (Section 3.2) and discrete-valued (Section 3.3) targets.

2 Quantum Search and Quantum Minimization

Grover (1996) demonstrates how to use a quantum computer to find a single marked element within a finite set of N items with only $O(\sqrt{N})$ queries, and a result from Bennett et al. (1997) shows that Grover's algorithm is optimal up to a multiplicative constant. Boyer et al. (1998) extend Grover's algorithm to multiple marked items or solutions, further extend it to the case when the number of solutions is unknown and provide rigorous bounds on the algorithms' error rates. Finally, Durr and Hoyer (1996) use the results of Boyer et al. (1998) to obtain the minimum value within a discrete ordered set. Here, we briefly review these advances and provide illustrative simulations. In Section B, the Gumbel-max trick allows us to extend these results to the problem of sampling from a discrete distribution.

2.1 Quantum Search

Algorithm 2:

Quantum search algorithm (Grover 1996)

Data: An oracle gate $U_f$ taking $|x\rangle|y\rangle$ to $|x\rangle|y \oplus f(x)\rangle$ for a function $f: \{0, \ldots, N-1\} \to \{0, 1\}$ that satisfies $f(x_0) = 1$ for a single $x_0$; $n + 1 = \log_2(N) + 1$ qubits initialized to $|0\rangle^{\otimes n}|1\rangle$; an integer $R = \lceil \pi\sqrt{N}/4 \rceil$ denoting the number of Grover iterations.

Result: An n-bit binary string $x_0$ satisfying $f(x_0) = 1$, with error less than 1/N.

$|0\rangle^{\otimes n}|1\rangle \mapsto H^{\otimes(n+1)}|0\rangle^{\otimes n}|1\rangle = |h\rangle|-\rangle$; /* $|h\rangle = \frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}|x\rangle$. */

$|h\rangle|-\rangle \mapsto G^{R}|h\rangle|-\rangle \approx |x_0\rangle|-\rangle$; /* $G = (2|h\rangle\langle h| - I)(I - 2|x_0\rangle\langle x_0|)$ */

$|x_0\rangle \mapsto x_0$; return $x_0$.

Given a set of N items and a function $f: \{0, 1, \ldots, N-1\} \to \{0, 1\}$ that evaluates to 1 for a single element, Grover (1996) develops an algorithm that uses quantum parallelism to score quadratic speedups over its classical counterparts. After only $O(\sqrt{N})$ evaluations of f(·), Grover's algorithm returns the $x \in \{0, 1, \ldots, N-1\}$ satisfying f(x) = 1 with high probability. Compare this to the $O(N)$ requirement for the randomized classical algorithm that must evaluate f(·) over at least N/2 items to obtain the same probability of detecting x. The algorithm takes the state $|0\rangle^{\otimes n}|1\rangle$ as input and applies Hadamard gates to each of the n + 1 input qubits. The resulting state is
$$|h\rangle|-\rangle = \left(\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}|x\rangle\right)\frac{1}{\sqrt{2}}\big(|0\rangle - |1\rangle\big) = \frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}|x\rangle|-\rangle\,.$$

Next, we apply the oracle gate $U_f: |x\rangle|y\rangle \mapsto |x\rangle|y \oplus f(x)\rangle$ and note that
$$U_f|x\rangle|-\rangle = U_f|x\rangle\frac{1}{\sqrt{2}}\big(|0\rangle - |1\rangle\big) = \frac{1}{\sqrt{2}}\big(|x\rangle|0 \oplus f(x)\rangle - |x\rangle|1 \oplus f(x)\rangle\big) = (-1)^{f(x)}|x\rangle|-\rangle\,.$$

Thus, $U_f$ flips the sign of the state $|x_0\rangle$ for which $f(x_0) = 1$ but leaves the other states unchanged. If we suppress the ancillary qubit $|-\rangle$, then $U_f$ is equivalent to the gate $U_{x_0}$ defined as $U_{x_0}|x\rangle = (-1)^{\delta_{x_0}(x)}|x\rangle$. We may succinctly write this gate as the Householder matrix that reflects vectors about the unique hyperplane through the origin that has $|x_0\rangle$ for a normal vector:
$$U_{x_0} = I - 2|x_0\rangle\langle x_0|\,.$$

The action of this gate on the state $|h\rangle$ takes the form
$$\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}|x\rangle \;\mapsto\; \frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}(-1)^{\delta_{x_0}(x)}|x\rangle\,.$$

Next, the algorithm reflects the current state about the hyperplane that has $|h\rangle$ as a normal vector and negates the resulting state:
$$\big(2|h\rangle\langle h| - I\big)\left(\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}(-1)^{\delta_{x_0}(x)}|x\rangle\right) = \frac{1}{\sqrt{N}}\sum_{x}\left((-1)^{1+\delta_{x_0}(x)} + \frac{2(N-2)}{N}\right)|x\rangle = \frac{3N-4}{N^{3/2}}|x_0\rangle + \sum_{x \neq x_0}\frac{N-4}{N^{3/2}}|x\rangle\,.$$

The scientist who measures the state at this moment would obtain the desired state $|x_0\rangle$ with a slightly higher probability of $(3N-4)^2/N^3$ than the individual probabilities of $(N-4)^2/N^3$ for the other states. Each additional application of the Grover iteration
(3) $G := U_h U_{x_0} := \big(2|h\rangle\langle h| - I\big)\big(I - 2|x_0\rangle\langle x_0|\big)$
increases the probability of obtaining $|x_0\rangle$ at the time of measurement in the computational basis, and $R = \lceil \pi\sqrt{N}/4 \rceil$ iterations guarantee a probability of success greater than $1 - 1/N$.

In general, we may use Grover's search algorithm to find a solution when the number of solutions M is greater than 1. While the algorithmic details change little, the number of required Grover iterations
(4) $R = \left\lceil \frac{\pi}{4}\sqrt{\frac{N}{M}} \right\rceil$
and the probability of success after those R iterations do change (Boyer et al. 1998). When M is much smaller than N, the success rate is greater than $1 - M/N$, and even for large M the success rate is more than 1/2. Lower bounds are useful for establishing mathematical guarantees, but it is also helpful to understand the quality of algorithmic performance as a function of M and R. Figure 1 shows success curves as a function of the number of Grover iterations (or oracle calls) applied. The search field contains $2^{14} \approx 16$k elements, and the number of solutions M varies. Each curve represents an individual search task. The algorithm requires more iterations for smaller numbers of solutions and fewer iterations for larger numbers of solutions. The upper bound on error M/N is only an upper bound: the final probability of success for, for example, M = 256 is 0.997 compared to the bound of 0.984.
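These curves follow the standard closed form for Grover's algorithm: with M solutions among N items and $\theta = \arcsin(\sqrt{M/N})$, the probability of measuring a solution after j iterations is $\sin^2((2j+1)\theta)$ (Boyer et al. 1998). A minimal sketch, with N and M chosen purely for illustration:

```python
import numpy as np

def grover_success(j, M, N):
    """Probability of measuring a solution after j Grover iterations
    when M of N items are solutions (Boyer et al. 1998)."""
    theta = np.arcsin(np.sqrt(M / N))
    return np.sin((2 * j + 1) * theta) ** 2

N = 2**14                                         # search set size as in Figure 1
for M in (1, 16, 256):
    R = int(np.ceil(np.pi / 4 * np.sqrt(N / M)))  # Equation (4)
    probs = grover_success(np.arange(3 * R + 1), M, N)
    print(f"M={M:4d}: R={R:4d}, peak success {probs.max():.3f} at j={probs.argmax()}, "
          f"success after 2R iterations {grover_success(2 * R, M, N):.3f}")
```

The last column previews the non-monotonicity discussed below: iterating well past R can leave the success probability far below its peak.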

Fig. 1 Curves depict the probability that Grover's quantum search algorithm returns a solution as a function of the number of oracle evaluations. For the ranges depicted, the number of iterations required and the number of solutions inversely relate, but increasing the number of iterations past R (Equation (4)) can backfire and lead to extremely small probabilities of success.

Algorithm 3:

Quantum exponential searching algorithm (Boyer et al. 1998)

Data: An oracle gate $U_f$ taking $|x\rangle|y\rangle$ to $|x\rangle|y \oplus f(x)\rangle$ for a function $f: \{0, \ldots, N-1\} \to \{0, 1\}$ with an unknown number of solutions; $n = \log_2(N)$.

Result: If a solution exists, an n-bit binary string x0 satisfying f(x0)=1; if no solution exists, the algorithm runs forever.

m ← 1; γ ← 6/5; success ← FALSE;
while success ≠ TRUE do

  $j \sim \text{Uniform}\{0, \ldots, m-1\}$; $|0\rangle^{\otimes n}|1\rangle \mapsto H^{\otimes(n+1)}|0\rangle^{\otimes n}|1\rangle = |h\rangle|-\rangle$; $|h\rangle|-\rangle \mapsto G^{j}|h\rangle|-\rangle = |x\rangle|-\rangle$; /* j Grover iterations from Algorithm 2. */

  $|x\rangle \mapsto x$; /* Measure and check. */
  if f(x) = 1 then $x_0 \leftarrow x$; success ← TRUE;
  else $m \leftarrow \min(\gamma m, \sqrt{N})$; /* Increase m in case of failure. */ end

end
return $x_0$.

While Grover's algorithm delivers impressive speedups over classical search algorithms, it has a major weakness. Figure 1 hides the fact that the probability of success for Grover's algorithm is not monotonic in the number of iterations. Running the algorithm for more than R iterations can backfire. For example, running the algorithm for $\sqrt{2N}$ iterations when M = 1 results in a probability of success less than 0.095 (Boyer et al. 1998). The non-monotonicity of Grover's algorithm becomes particularly problematic when we do not know the number of solutions M. Taking for example $N = 2^{20}$, Boyer et al. (1998) point out that 804 iterations provide an extremely high probability of success when M = 1 but a one-in-a-million chance of success when M = 4. To solve this problem and develop an effective search algorithm when M is unknown, those authors adopt the strategy of focusing on the expected number of iterations before success. In particular, they propose the quantum exponential search algorithm (Algorithm 3). When a solution exists, the algorithm returns a solution with an expected total number of Grover iterations bounded above by $\frac{9}{2}\sqrt{N/M}$. Still better, this upper bound reduces to $\frac{9}{4}\sqrt{N/M}$ for the special case $M \ll N$, and simulations presented in Figure 2 show that even this bound is conservative. Such results come in handy when deciding whether to halt the algorithm's progress if one believes it possible that no solutions exist. Indeed, this turns out to be useful in the context of quantum minimization (Section 2.2).
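Because the post-measurement success probability after j Grover iterations has the closed form $\sin^2((2j+1)\theta)$ with $\sin^2\theta = M/N$, one can emulate Algorithm 3 classically and tally its oracle calls without simulating any quantum state. The sketch below is such an emulation (the growth factor 6/5 comes from Algorithm 3; the sample sizes and the handling of non-integer m are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def qesa_oracle_calls(M, N):
    """Classical emulation of the quantum exponential searching algorithm
    (Algorithm 3): returns the total number of Grover iterations used."""
    theta = np.arcsin(np.sqrt(M / N))
    m, gamma, total = 1.0, 6 / 5, 0
    while True:
        j = int(rng.integers(0, int(np.ceil(m))))   # j ~ Uniform{0, ..., ceil(m) - 1}
        total += j
        if rng.random() < np.sin((2 * j + 1) * theta) ** 2:
            return total                            # measurement produced a solution
        m = min(gamma * m, np.sqrt(N))              # expand the search range and retry

N = 2**14
for M in (1, 4, 16, 64):
    calls = [qesa_oracle_calls(M, N) for _ in range(500)]
    print(f"M={M:3d}: mean oracle calls {np.mean(calls):7.1f} "
          f"vs bound (9/4)sqrt(N/M) = {9 / 4 * np.sqrt(N / M):7.1f}")
```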

Fig. 2 [Left] Total number of oracle evaluations required by the quantum exponential searching algorithm (Algorithm 3) for different numbers of solutions M and search set sizes N from 500 independent simulations each. Horizontal lines at $\frac{9}{4}\sqrt{N}$ represent upper bounds on the expected total number of evaluations to obtain a solution for the M = 1 problem. [Right] Probabilities of success for the fixed-point quantum search algorithm (Yoder, Low, and Chuang 2014) for different proportions of solutions $\lambda = M/N$ and different lower bounds w on M/N, with error tolerance $\delta^2$.

Algorithm 3 is not the only quantum search algorithm that is useful when the number of solutions M is unknown. Yoder, Low, and Chuang (2014) use generalized Grover iterations that extend (3) to include general phases $(\alpha_j, \beta_j)$:
(5) $G(\alpha_j, \beta_j) := U_h(\alpha_j)\, U_{x_0}(\beta_j) := \big((1 - e^{i\alpha_j})|h\rangle\langle h| - I\big)\big(I - (1 - e^{i\beta_j})|x_0\rangle\langle x_0|\big)\,.$

This form reduces to the original Grover iteration (3) when $\alpha_j = \pm\pi$ and $\beta_j = \pm\pi$. To select general phases, the method first fixes an error tolerance $\delta^2 > 0$ and a lower bound $w \le \lambda = M/N$ and then selects an odd upper bound L on the total number of oracle queries (L − 1) such that
$$L \ge \frac{\log(2/\delta)}{\sqrt{w}}\,.$$

Finally, letting $l = (L-1)/2$, one obtains phases $(\alpha_j, \beta_j)$ for $j = 1, \ldots, l$ that satisfy
$$\alpha_j = -\beta_{l-j+1} = 2\cot^{-1}\left(\tan(2\pi j/L)\sqrt{1 - \gamma^2}\right)\,,$$
where $\gamma^{-1} = \cos\left(\frac{1}{L}\cos^{-1}\left(\frac{1}{\delta}\right)\right)$ is the reciprocal of the inverse Lth Chebyshev polynomial of the first kind. Such phases mark the only algorithmic difference between the fixed-point quantum search and the original Grover search (Algorithm 2). The upshot is a search procedure that obtains a guaranteed probability of success $1 - \delta^2$ for any M for which there exists an $M_0 < M$ that also obtains the same success threshold (Figure 2). On the one hand, this algorithm avoids the exponential quantum search algorithm's need to perform multiple measurements of the quantum state. On the other hand, the exponential quantum search algorithm requires significantly fewer oracle evaluations when M is small (Figure 3), and it turns out that this is precisely the scenario that interests us.

Fig. 3 Total number of oracle evaluations required by fixed-point quantum search (FPQS) (Yoder, Low, and Chuang 2014) minus the total required for the quantum exponential searching algorithm (QESA) (Algorithm 3) for different numbers of solutions M and search set sizes N from 500 independent simulations each. FPQS underperforms partially due to a minuscule lower bound $w = 1/N$ on $\lambda = M/N$, a tuning decision motivated by the potential application within quantum minimization with warm-starting (Section 2.2).

2.2 Quantum Minimization

Given a function f(·) that maps the discrete set {0, …, N−1} to the integers, we wish to find the minimizer
$$x_0 = \underset{x \in \{0, \ldots, N-1\}}{\arg\min}\, f(x)\,.$$

Durr and Hoyer (1996) propose a quantum algorithm for finding $x_0$ that iterates between updating a running minimum $F \in f(\{0, \ldots, N-1\})$ and applying the quantum exponential search algorithm (Algorithm 3) to find an element x such that f(x) < F. Having run these iterations a set number of times, the algorithm returns the minimizer $x_0$ with high probability. To this end, Durr and Hoyer (1996) show that their algorithm takes an expected total time less than or equal to $m_0 := \frac{45}{4}\sqrt{N} + \frac{7}{10}\log_2^2(N)$ to find the minimizer, where marking items with values less than the threshold value (Algorithm 4) requires $\log_2(N)$ time steps and each Grover iteration within the quantum exponential search algorithm requires one time step. From there, Markov's inequality says
$$\Pr\big(\text{total time to success} \ge m_0/\epsilon\big) \le \frac{E\big(\text{total time to success}\big)}{m_0/\epsilon} \le \epsilon\,,$$
or that we must scale the minimization procedure's time steps by a factor of $1/\epsilon$ to reach a failure rate less than ϵ. Due to the iterative nature of the algorithm, one might suppose that it is beneficial to start at a value $x_0$ for which $f(x_0)$ is low relative to other values. This turns out to be the case in theory and in practice, and the benefit of warm-starting is particularly useful in the context of parallel MCMC.

Proposition 1

(Warm-starting). Suppose that the quantum minimization algorithm begins with a threshold $F_0$ such that $f(x) < F_0$ for only $K - 1 > 0$ items. Then the expected total number of time steps to find the minimizer is bounded above by
$$m_0^K = \left(\frac{5}{4} - \frac{1}{\sqrt{K-1}}\right) 9\sqrt{N} + \frac{7}{10}\log_2(K)\log_2(N)\,,$$
and so the following rule relates the warm-started upper bound to the generic upper bound:
$$m_0^K = m_0 - \frac{9\sqrt{N}}{\sqrt{K-1}} + \frac{7}{10}\log_2(K/N)\log_2(N)\,.$$

Proof.

The proof follows the exact same form as Lemma 2 of Durr and Hoyer (1996). It relies on a theoretical algorithm called the infinite algorithm that runs until the minimum is found. In this case, Lemma 1 of that paper says that the probability that the rth lowest value is ever selected when searching among K items is $p(K, r) = 1/r$ for $r \le K$ and 0 otherwise. For a warm-start at element K, the expected total time spent in the exponential search algorithm is
$$\sum_{r=2}^{N} p(K,r)\,\frac{9}{2}\sqrt{\frac{N}{r-1}} = \sum_{r=2}^{K} p(K,r)\,\frac{9}{2}\sqrt{\frac{N}{r-1}} = \frac{9}{2}\sqrt{N}\sum_{r=1}^{K-1}\frac{1}{r+1}\frac{1}{\sqrt{r}} \le \frac{9}{2}\sqrt{N}\left(\frac{1}{2} + \sum_{r=2}^{K-1}\frac{1}{r^{3/2}}\right) \le \frac{9}{2}\sqrt{N}\left(\frac{1}{2} + \int_{1}^{K-1}\frac{1}{r^{3/2}}\,dr\right) = \left(\frac{5}{4} - \frac{1}{\sqrt{K-1}}\right) 9\sqrt{N}\,.$$

An upper bound for the expected total number of time steps preparing the initial state and marking items f(x)<f(x0) follows in a similar manner. □

Proposition 1 shows that, for example, if Algorithm 4 begins at the item with the second lowest value, then the expected total time to success is bounded above by $m_0^2 = \frac{9}{4}\sqrt{N} + \frac{7}{10}\log_2(N)$, reducing the generic upper bound by $9\sqrt{N} - \frac{7}{10}\log_2(2/N)\log_2(N)$ time steps. When N equals 1000, say, the expected total time steps is $m_0^2 < 78.2$. Keeping N = 1000 but letting the algorithm begin at the third lowest value (K = 3), the expected total time steps before success is $m_0^3 < 165.6$. Raising N to 10,000, the two numbers increase to $m_0^2 < 234.4$ and $m_0^3 < 503.4$.
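These worked numbers follow directly from Proposition 1; a few lines of Python (with base-2 logarithms, as in the proposition) reproduce them:

```python
import numpy as np

def m0(N):
    """Generic Durr-Hoyer upper bound on the expected total time steps."""
    return 45 / 4 * np.sqrt(N) + 7 / 10 * np.log2(N) ** 2

def m0_warm(N, K):
    """Warm-started bound of Proposition 1: start at the item of rank K."""
    return (5 / 4 - 1 / np.sqrt(K - 1)) * 9 * np.sqrt(N) + 7 / 10 * np.log2(K) * np.log2(N)

for N in (1000, 10_000):
    print(f"N={N:6d}: m0={m0(N):6.1f}, m0^2={m0_warm(N, 2):6.1f}, m0^3={m0_warm(N, 3):6.1f}")
# N=  1000: m0^2 ~  78.1, m0^3 ~ 165.6
# N= 10000: m0^2 ~ 234.3, m0^3 ~ 503.3
```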

Algorithm 4:

Quantum minimum searching algorithm (Durr and Hoyer 1996)

Data: A quantum sub-routine capable of evaluating a function f(·) over {0, …, N−1} with unique integer values; a maximum error tolerance $\epsilon \in (0, 1)$; expected total time to success $m_0 = \frac{45}{4}\sqrt{N} + \frac{7}{10}\log_2^2(N)$.

Result: A $\log_2(N)$-bit binary string $x_0$ satisfying $f(x_0) = \min f$ with probability greater than $1 - \epsilon$.
s ← 0; $x_0 \sim \text{Uniform}\{0, \ldots, N-1\}$;
while $s < m_0/\epsilon$ do

  Prepare the initial state $\frac{1}{\sqrt{N}}\sum_x |x\rangle|x_0\rangle$; Mark all items x satisfying $f(x) < f(x_0)$; $s \leftarrow s + \log_2(N)$; Apply the quantum exponential searching algorithm (Algorithm 3); /* I time steps */

  $s \leftarrow s + I$; Obtain x by measuring the first register; if $f(x) < f(x_0)$ then $x_0 \leftarrow x$; end

end
return $x_0$.

In practice, the number of time steps before finding the minimum is surprisingly small. Figure 4 corroborates the intuition that it is beneficial to start at a lower ranked element. To generate these results, we use an early stopping rule for the quantum minimization algorithm's exponential searching sub-routine, halting it after $\frac{9}{4}\sqrt{N}$ iterations. Even with this stopping rule in place, the error rate of the quantum minimization algorithm is less than 1%. We can increase the early stopping threshold to obtain even lower error rates, but this strategy would seem to be unnecessary in the context of parallel MCMC. The benefits of warm-starting are useful in the same context, when the current state $\theta^{(s)}$ usually inhabits the high-density region of the target distribution but the majority of proposals do not.

Fig. 4 Total number of oracle evaluations required by the quantum minimization algorithm (Durr and Hoyer 1996) for different starting ranks and search set sizes N from 500 independent simulations each. Here, less than 1% of the 12,500 instances fail to recover the true minimum on account of early stopping.

3 Quantum Parallel MCMC

With the rudiments of quantum minimization in hand, we present a quantum parallel MCMC (QPMCMC). The general structure of the algorithm is the same as that of Algorithm 1: it constructs a Markov chain by iterating between generating many proposals and randomly selecting new Markov chain states from among these proposals. Unlike classical single-proposal MCMC, parallel MCMC proposes many states and chooses one according to the generalized acceptance probabilities of (2). In general MCMC, evaluation of the target density function π(·) is the rate-limiting computational step, becoming particularly onerous in big data scenarios (Holbrook et al. 2021a). While parallel MCMC benefits from improved mixing as a function of MCMC iterations, the increased computational burden of P target evaluations at each step can lead to less favorable comparisons when accounting for elapsed wall-clock time.

Having successfully generated proposals $\theta_p$, $p = 1, \ldots, P$, and evaluated the corresponding proposal densities q(·,·), we would like to use quantum parallelism and an appropriate oracle gate to efficiently compute the $\pi(\theta_p)$'s but immediately encounter a Catch-22 when we seek to draw a sample $\theta^{(s+1)} \sim \text{Discrete}(\boldsymbol{\pi})$ for $\boldsymbol{\pi}$ the vector of probabilities $\pi_0, \pi_1, \ldots, \pi_P$ defined in (2). We can use neither a quantum nor a classical circuit to draw the sample! On the one hand, drawing the sample within a quantum circuit would somehow require that all superposed states have access to all the $\pi(\theta_p)$'s at once to perform the required normalization. On the other hand, drawing the sample within the classical circuit would require access to all the $\pi(\theta_p)$'s, but measurement in the computational basis can only return one.

In light of this dilemma, we propose to use the Gumbel-max trick (Papandreou and Yuille 2011) to transform the generalized accept-reject step into a discrete optimization procedure (Section B). From there, we use quantum minimization to efficiently sample from Discrete(π). Crucially, we get away with quantum parallel evaluation of the target π(·) over all superposed states because the Gumbel-max trick does not rely on a normalization step: each superposed state requires no knowledge of any target evaluation other than its own.

Once one has effectively parallelized the target evaluations $\pi(\theta_p)$, the new computational bottleneck is the calculation required to obtain the P + 1 proposal density terms $q(\theta_p, \Theta_{-p})$, for $p = 0, \ldots, P$. Under an independent proposal mechanism with a proposal distribution of the form $q(\theta_0, \Theta_{-0}) = \prod_{p=1}^{P} q(\theta_0, \theta_p)$, the acceptance probabilities (2) require the $O(P^2)$ evaluation of the $\binom{P+1}{2}$ terms $q(\theta_p, \theta_{p'})$. To avoid such calculations, we follow Holbrook (2021) and use symmetric proposal distributions for which $q(\theta_p, \Theta_{-p}) = q(\theta_{p'}, \Theta_{-p'})$ for $p, p' = 0, \ldots, P$.

Algorithm 5:

The Gumbel-max trick

Data: A vector of unnormalized log-probabilities $\lambda^* = \log \boldsymbol{\pi} + \log c$, for $\boldsymbol{\pi}$ a discrete probability vector with P + 1 elements.

Result: A single sample $\hat{p} \sim \text{Discrete}(\boldsymbol{\pi})$ satisfying $\hat{p} \in \{0, 1, \ldots, P\}$.
for p ∈ {0, 1, …, P} do $z_p \sim \text{Gumbel}(0, 1)$; $\alpha_p^* \leftarrow \lambda_p^* + z_p$; end

$\hat{p} \leftarrow \underset{p = 0, \ldots, P}{\arg\max}\, \alpha_p^*$; return $\hat{p}$.
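As a quick check of Algorithm 5, the following Python sketch (with an arbitrary example probability vector and constant c) confirms that perturbing unnormalized log-probabilities with standard Gumbel noise and taking the argmax is equivalent to sampling from Discrete(π):

```python
import numpy as np

rng = np.random.default_rng(3)

def gumbel_max_sample(log_weights):
    """Algorithm 5: one draw from Discrete(pi) given unnormalized log-probabilities."""
    z = rng.gumbel(loc=0.0, scale=1.0, size=log_weights.shape)
    return int(np.argmax(log_weights + z))

pi = np.array([0.05, 0.10, 0.15, 0.30, 0.40])   # example probability vector
log_weights = np.log(pi) + np.log(7.3)          # the arbitrary constant c cancels in the argmax

draws = np.array([gumbel_max_sample(log_weights) for _ in range(100_000)])
print(np.bincount(draws) / draws.size)          # approximately (0.05, 0.10, 0.15, 0.30, 0.40)
```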

3.1 Simplified Acceptance Probabilities

Evaluation of the P + 1 proposal density terms $q(\theta_p, \Theta_{-p})$ can be computationally burdensome as P grows large. Holbrook (2021) proves that two specific proposal mechanisms enforce equality across all P + 1 terms and thus enable the simplified acceptance probabilities
(6) $\pi_p = \frac{\pi(\theta_p)}{\sum_{p'=0}^{P}\pi(\theta_{p'})}\,, \quad p = 0, \ldots, P\,.$

In Section 3.2, we opt for one of these proposal mechanisms, the centered Gaussian proposal of Tjelmeland (2004). This strategy first generates a Gaussian random variable $\bar{\theta}$ centered at the current state $\theta_0$ and then generates P proposals centered at $\bar{\theta}$:
(7) $\theta_1, \ldots, \theta_P \sim \text{Normal}_D(\bar{\theta}, \Sigma)\,, \quad \bar{\theta} \sim \text{Normal}_D(\theta_0, \Sigma)\,.$

Holbrook (2021) shows that this strategy leads to the higher-order proposal symmetry
(8) $q(\theta_0, \Theta_{-0}) = q(\theta_1, \Theta_{-1}) = \cdots = q(\theta_P, \Theta_{-P})$
and that the simplified acceptance probabilities (6) maintain detailed balance. In fact, other location-scale families also suffice in this manner, but here we focus on Gaussian proposals without loss of generality. Finally, the simplicial sampler (Holbrook 2021) accomplishes the same simplified acceptance probabilities but incurs an $O(D^3)$ computational cost.

One may also use a more limited strategy to enforce (8). Glatt-Holtz et al. (2022) show that the sampler using the independence proposal $q(\theta_p, \Theta_{-p}) = \prod_{p' \neq p} q(\theta_{p'})$, where q(·) is not a function of $\theta_p$, results in acceptance probabilities
(9) $\pi_p = \frac{\pi(\theta_p)/q(\theta_p)}{\sum_{p'=0}^{P}\pi(\theta_{p'})/q(\theta_{p'})}\,, \quad p = 0, \ldots, P\,.$

When π(·) and q(·) take continuous values on a topologically compact domain or discrete values on a finite domain, one may specify q(·) to be uniform and let (9) simplify to (6). This strategy proves useful in Section 3.3, in which we consider discrete-valued targets over Ising-type lattice models.

3.2 Continuously Valued Targets

Algorithm 6:

Quantum parallel MCMC for a continuously-valued target

Data: Initial Markov chain state $\theta^{(0)}$; total length of Markov chain S; total number of proposals per iteration P; routine for evaluating the target density π(·); routines for drawing random samples from a D-dimensional multivariate Gaussian $\text{Normal}_D(\mu, \Sigma)$ and the standard Gumbel distribution Gumbel(0, 1).

Result: A Markov chain $\theta^{(1)}, \ldots, \theta^{(S)}$.
for s ∈ {1, …, S} do
  $\theta_0 \leftarrow \theta^{(s-1)}$; $z_0 \sim \text{Gumbel}(0, 1)$; $\bar{\theta} \sim \text{Normal}_D(\theta_0, \Sigma)$; /* Proposal (7) */

  for p ∈ {1, …, P} do $\theta_p \sim \text{Normal}_D(\bar{\theta}, \Sigma)$; $z_p \sim \text{Gumbel}(0, 1)$; $|\theta_p\rangle \leftarrow \theta_p$; /* Load proposal onto quantum computer. */ end

  $\hat{p} \leftarrow \underset{p = 0, \ldots, P}{\arg\min}\big(f(p) := -(z_p + \log \pi(\theta_p))\big)$; /* $O(\sqrt{P})$: Algorithm 4 with warm-start at p = 0. */

  $\theta^{(s)} \leftarrow \theta_{\hat{p}}$;
end
return $\theta^{(1)}, \ldots, \theta^{(S)}$.

Algorithm 6 presents the details of QPMCMC for a continuously-valued target. The algorithm uses conventional (perhaps parallel) computing almost entirely, excluding two lines: the first loads proposals onto the quantum machine, and the second calls the quantum minimization algorithm presented in Algorithm 4. This sub-routine seeks to find the minimizer of the function
(10) $f(p) = -\big(z_p + \log \pi(\theta_p)\big) \quad \text{over} \quad p = 0, 1, \ldots, P\,,$
which combines the numerators of (2), the Gumbel-max trick and a trivial negation. As discussed in Section 2.2, the sub-routine requires only $O(\sqrt{P})$ oracle evaluations and, therefore, only $O(\sqrt{P})$ evaluations of the target density π(·) using a quantum circuit. In theory, a quantum computer can perform the same computations as a classical computer, but the efficiency of the target evaluations would depend on a number of factors related to the structure of the density itself, the availability of fast quantum random access memory (Giovannetti, Lloyd, and Maccone 2008) and the ability of large numbers of quantum gates to act in concert with negligible noise (Kielpinski, Monroe, and Wineland 2002; Erhard et al. 2019).
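For intuition, the selection step of Algorithm 6 can be emulated on a conventional computer by letting an ordinary argmin stand in for the quantum minimization sub-routine; the sketch below otherwise follows Equations (7) and (10). The target, proposal scale, proposal count and starting point are illustrative choices rather than the paper's tuned settings.

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(theta):
    return -0.5 * np.sum(theta**2, axis=-1)       # standard normal target (illustrative)

def qpmcmc_step(theta_curr, P=200, scale=0.5):
    """One iteration of Algorithm 6 with np.argmin standing in for the
    O(sqrt(P)) quantum minimization sub-routine (Algorithm 4)."""
    D = theta_curr.shape[0]
    theta_bar = theta_curr + scale * rng.standard_normal(D)       # Proposal (7), first stage
    proposals = theta_bar + scale * rng.standard_normal((P, D))   # Proposal (7), second stage
    Theta = np.vstack([theta_curr, proposals])                    # rows 0, 1, ..., P
    z = rng.gumbel(size=P + 1)                                    # Gumbel-max perturbations
    f = -(z + log_target(Theta))                                  # Equation (10)
    return Theta[np.argmin(f)]

theta = np.full(2, 100.0)                         # start far from the high-density region
for _ in range(5000):
    theta = qpmcmc_step(theta)
print(theta)                                      # ends near the origin, in the high-density region
```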

In general MCMC, one often calibrates the scaling of a proposal kernel to balance between exploring the target's domain and remaining within high-density regions. Optimal scaling strategies may lead to a large number of rejected proposals (Roberts and Rosenthal 2001). Indeed, Holbrook (2021) shows that parallel MCMC algorithms are no exception. When using the Gumbel-max trick to sample from proposals, this means that the current state is often quite near optimality, representing a warm-start. Figure 5 shows how this warm-starting affects the number of oracle calls (and hence target evaluations) within the quantum minimization sub-routine over the course of an MCMC trajectory. We target multi-dimensional standard normal distributions of differing dimensions using the vector (100, …, 100) as starting position. We fix the number of Gaussian proposals to be 2000 and use standard adaptation procedures (Rosenthal 2011) to target a 50% acceptance rate while guaranteeing convergence. Although no theoretical results validate this target acceptance rate, the rate is close to empirically optimal rates from Holbrook (2021). Across iterations, QPMCMC requires relatively few oracle calls compared to the 2000 target evaluations of parallel MCMC. We also witness the expected drop in the number of oracle evaluations as the chain approaches the target's high-density region. Remarkably, the algorithm succeeds in sampling from the true discrete distribution in 99.5% of the MCMC iterations. An independent simulation obtains similar results, requiring roughly 7% of the usual number of target evaluations. Figure 6 shows a quantile-quantile plot for all 100 dimensions of a multi-dimensional standard normal distribution. We see no appreciable impact from the rare failure of the quantum minimization sub-routine.

Fig. 5 Total number of oracle evaluations required for each of 2000 quantum parallel MCMC (QPMCMC) iterations for standard multivariate normal targets of five different dimensionalities. Regardless of target dimension, the individual QPMCMC runs require roughly 7% of the usual 4 million target evaluations. Over 99.4% of the 10,000 MCMC iterations across all dimensions successfully sample from the discrete distribution with probabilities of (2). Burn-in iterations require moderately more evaluations because the current state occupies a lower density region and represents a "less good" warm-start.

Fig. 6 Empirical accuracy of quantum parallel MCMC (QPMCMC) for a 100-dimensional spherical Gaussian target. We generate 100,000 samples using 2000 proposals each iteration and remove the first 2000 samples as burn-in. The QQ (quantile-quantile) plot shows sample quantiles adhering closely to the theoretical quantiles. Similar to the independent simulation shown in Figure 5, here QPMCMC requires less than 7.2% of the usual number of target evaluations.

Finally, we compare convergence speed for QPMCMC using 1000, 5000, and 10,000 proposals to target a massively multimodal distribution: a mixture of 1000 two-dimensional, isotropic Gaussians with standard deviations equal to 1 and means equal to $10 \times (0, 1, \ldots, 999)$. This target is particularly challenging because the distance between modes is significantly larger than the standard deviations. In five independent simulations, we run each algorithm until it achieves an effective sample size of 100. Table 1 contains the MCMC iterations and target evaluations required to achieve this effective sample size as well as relative speedups over sequential implementations and relative improvements over the 1000-proposal sampler.

Table 1 Racing to a minimum (between the two dimensions) of 100 effective samples for a target with 1000 disjoint modes.

3.3 Discrete-Valued Targets

We wish to sample from a target distribution defined over a discrete set $\{\theta_\alpha\}_{\alpha \in A}$, for A some finite or countably infinite set of indices. Glatt-Holtz et al. (2022) establish the broader measure theoretical foundations for parallel MCMC procedures that use selection probabilities (2) to maintain detailed balance. In particular, when the target distribution has probability mass function π(·) defined with respect to the counting measure on the power set $2^A$, then detailed balance results in the kernel P(·,·) satisfying
$$\pi(\alpha) = \sum_{\alpha'} \pi(\alpha')\, P(\alpha', \alpha)\,, \quad \alpha, \alpha' \in A\,.$$

Sections 3.3.1 and 3.3.2 consider targets defined over finite sets A and make use of the uniform independence proposal scheme described in Section 3.1, thereby facilitating simplified acceptance probabilities (6). This scheme proposes single-flip updates to the current Markov chain state θ0, but it is worth noting that other multiple-flip schemes would also be amenable to QPMCMC.

3.3.1 Ising Model on a Two-Dimensional Lattice

Here, we are interested in an Ising-type lattice model over configurations $\theta = (\theta_1, \ldots, \theta_D)$ consisting of D individual spins $\theta_d \in \{-1, 1\}$. In terms of the preceding section, we have $A = \{-1, 1\}^D$ and $|A| = 2^D$. In particular, we consider targets of the form
(11) $\pi(\theta \,|\, \rho) \propto \exp\Big(\rho \sum_{(d, d') \in E} \theta_d \theta_{d'}\Big)\,,$
where $\rho > 0$ is the interaction term and E is the lattice edge set. We specify a two-dimensional 500-by-500 lattice with nearest-neighbor connections and fix ρ = 1. Since this latter setting encourages neighboring spins to equal each other, we begin our QPMCMC trajectories at the lowest-probability "checkerboard" state for an initial configuration. At each iteration, we generate a collection of proposal states by uniformly selecting P individual spins to flip and obtaining each proposal state $\theta_p$ by updating the current state $\theta_0$ at the single corresponding location (a classical sketch of one such iteration appears below). Figure 7 shows results from 10 independent runs using $P \in \{4, 8, 16, \ldots, 2048\}$ proposals. Trace plots of unnormalized log-probabilities indicate that higher proposal counts enable discovery of higher probability configurations. Interestingly, QPMCMC appears to be particularly beneficial in this large P context, requiring less than one-tenth the target evaluations required using conventional computing when P = 2048.

Fig. 7 Convergence of the log-posterior for a parallel MCMC sampler targeting an Ising model with uniform interaction ρ = 1 and no external field. All chains begin at the minimal probability "checkerboard" state. Larger proposal counts allow discovery of higher probability configurations.
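A minimal classical sketch of one such single-flip iteration follows; it uses the fact that flipping spin d changes the Ising log-probability (11) by $-2\rho\,\theta_d \sum_{d': (d, d') \in E} \theta_{d'}$, so each candidate's unnormalized log-density is available from a local computation, and it uses the Gumbel-max trick for the selection that quantum minimization performs within QPMCMC. The lattice size, ρ and proposal count are illustrative, not the settings used for Figure 7.

```python
import numpy as np

rng = np.random.default_rng(11)
L, rho, P = 64, 1.0, 256                    # illustrative lattice size, interaction and proposal count
theta = np.where(np.indices((L, L)).sum(0) % 2 == 0, 1, -1)   # "checkerboard" start

def neighbor_sum(spins, i, j):
    """Sum of the (up to four) nearest-neighbor spins on a non-periodic lattice."""
    total = 0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < L and 0 <= nj < L:
            total += spins[ni, nj]
    return total

def parallel_flip_step(spins):
    """One multiproposal iteration: P uniform single-flip proposals plus the current
    state, selected with the simplified probabilities (6) via the Gumbel-max trick."""
    rows = rng.integers(0, L, size=P)
    cols = rng.integers(0, L, size=P)
    # Change in log pi from flipping each proposed site; 0 for keeping the current state.
    delta = np.array([0.0] + [-2.0 * rho * spins[i, j] * neighbor_sum(spins, i, j)
                              for i, j in zip(rows, cols)])
    winner = int(np.argmax(delta + rng.gumbel(size=P + 1)))
    if winner > 0:                          # winner == 0 retains the current configuration
        i, j = rows[winner - 1], cols[winner - 1]
        spins[i, j] *= -1
    return spins

for _ in range(1000):
    theta = parallel_flip_step(theta)
log_prob = rho * (np.sum(theta[1:, :] * theta[:-1, :]) + np.sum(theta[:, 1:] * theta[:, :-1]))
print(log_prob)   # unnormalized log-probability climbs from its minimal checkerboard value
```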

3.3.2 Bayesian Image Analysis

We apply a Bayesian image classification model to an intensity map (Figure 8) of the newly imaged supermassive black hole, Sagittarius A*, located at the Galactic Center of the Milky Way (Akiyama et al. 2022). Whereas one cannot see the black hole itself, one may see the shadow of the black hole cast by the hot, swirling cloud of gas surrounding it. We extract the black hole from the surrounding cloud by modeling the intensity at each of the $D = 4076^2 = 16{,}613{,}776$ pixels as belonging to a mixture of two truncated normals with values $y_d$ restricted to the intensity range [0, 255]. Namely, we follow Hurn (1997) and specify the latent variable model
$$y_d \,|\, (\mu_l, \sigma^2, \theta_d) \overset{\text{ind}}{\sim} \text{Normal}(\mu_l, \sigma^2)\,, \quad y_d \in [0, 255]\,, \quad \theta_d = l\,, \quad d \in \{1, \ldots, D\}\,,$$
$$\mu_l \overset{\text{iid}}{\sim} \text{Uniform}(0, 255)\,, \quad l \in \{-1, 1\}\,, \qquad \frac{1}{\sigma^2} \sim \text{Gamma}\left(\frac{1}{2}, \frac{1}{2}\right)\,,$$
where $\theta = (\theta_1, \ldots, \theta_D)$ share for a prior the Ising model (11) with edges joining neighboring pixels and interaction ρ = 1.2.

Fig. 8 On the left is a 4076-by-4076 intensity map of the shadow of supermassive black hole Sagittarius A* (Akiyama et al. 2022). On the right is the pixelwise posterior mode of a Bayesian image classification model fit to the intensity data. Within a Metropolis-in-Gibbs routine, quantum parallel MCMC using 1024 proposals requires less than one-tenth the posterior evaluations required by conventional parallel MCMC.

We use a QPMCMC-within-Gibbs scheme to infer the joint posterior of θ and the three mixture model parameters. For the former, we use the same QPMCMC scheme as in Section 3.3.1 with 1024 proposals at each iteration. For the latter, we use the closed-form updates presented in Hurn (1997). We run this scheme for 20 million iterations, discarding the first 10 million as burn-in. We thin the remaining sample at a ratio of 1 to 40,000 for the latent memberships θ and 1 to 4000 for the three parameters $\mu_1$, $\mu_{-1}$ and $\sigma^2$. Using the R package coda (Plummer et al. 2006), we calculate effective sample sizes of the log-posterior (257.1), $\mu_1$ (1,578.7), $\mu_{-1}$ (257.6) and $\sigma^2$ (2500.0), suggesting adequate convergence. Figure 8 shows both the intensity data and the pixelwise posterior mode of the latent membership vector θ. QPMCMC requires only 1,977,553,608 target evaluations compared to the $1024 \times 20{,}000{,}000 = 2.048 \times 10^{10}$ evaluations required for the analogous parallel MCMC scheme implemented on a conventional computer, representing a 10.36-fold speedup.

4 Discussion

We have shown that parallel MCMC enjoys quadratic speedups by combining quantum minimization with the Gumbel-max trick. Within a QPMCMC iteration, the current state represents a warm-start for the sampling-turned-optimization problem, leading to increased efficiencies for the quantum minimization algorithm. Moreover, combining this approach with the Tjelmeland correction (Holbrook 2021) results in a total complexity reduction from $O(P^2)$ to $O(\sqrt{P})$. Preliminary evidence suggests that our strategy may find use when target distributions exhibit extreme multimodality. While the algorithm still must construct long Markov chains to reach equilibrium, generating each individual Markov chain state requires significantly fewer target evaluations.

There are major technical barriers to the practical implementation of QPMCMC. The framework, like other quantum machine learning (QML) schemes, requires on-loading and off-loading classical data to and from a quantum machine. In their seminal review of QML, Biamonte et al. (2017) discuss what they call the "input problem." Cortese and Braje (2018) present a general purpose quantum circuit for loading B classical bits into a quantum data structure comprising $\log_2(B)$ qubits with circuit depth $O(\log(B))$. For continuous targets, QPMCMC requires reading O(P) real vectors $\theta_p \in \mathbb{R}^D$ onto the quantum machine at each MCMC iteration. If b is the number of bits used to represent a single real value, then reading $B = O(PDb)$ classical bits into the quantum machine requires time $O(\log(PDb))$. This is true whether one opts for fixed-point (Jordan 2005) or floating-point (Haener et al. 2018) representations for real values. The outlook can be better for discrete distributions. For example, the QPMCMC scheme presented in Section 3.3 only requires loading the entire configuration state once prior to sampling. A D-spin Ising model requires D classical bits to encode, and these bits load onto a quantum machine in time $O(\log(D))$. Once one has encoded the entire system state, each QPMCMC iteration only requires loading the addresses of the O(P) proposal states. If one uses b bits to encode each address, then the total time required to load data onto the quantum machine is $O(\log(Pb))$ for each QPMCMC iteration. On the other hand, the speedup for discrete targets assumes the ability to hold an entire configuration in QRAM. Conveniently, the "output problem" is less of an issue for QPMCMC, as only a single integer $\hat{p} \in \{0, \ldots, P\}$ need be extracted within Algorithm 4.

This work leads to three interesting questions. First, what is the status of inference obtained by QPMCMC? The QPMCMC selection step relies on quantum minimization, an algorithm that only achieves success with probability $1 - \epsilon$. While empirical studies suggest that this error induces little bias, it would be helpful to use this ϵ to bound the distance between the target distribution and the distribution generated by QPMCMC. Such theoretical efforts would need to extend recent formalizations of the multiproposal-based parallel MCMC paradigm (Glatt-Holtz et al. 2022) to account for biased MCMC kernels.

Second, can QPMCMC be combined with established quantum algorithms that make use of "quantum walks" on graphs to sample from discrete target distributions? Szegedy (2004) presents a quantum analog to classical ergodic reversible Markov chains and shows that such quantum walks sometimes provide quadratic speedups over classical Markov chains. Szegedy (2004) also points out that Grover's search, a key component within QPMCMC, may be interpreted as performing just such a quantum walk on a complete graph. Wocjan and Abeyesinghe (2008) accelerate the quantum walk by using ancillary Markov chains to improve mixing and apply their method to simulated annealing. Given a quantum algorithm A for producing a discrete random sample with variance $\sigma^2$, Montanaro (2015) develops a quantum algorithm for estimating the mean of algorithm A's output with error ϵ by running algorithm A a mere $\tilde{O}(\sigma/\epsilon)$ times, where $\tilde{O}$ hides polylogarithmic terms. Importantly, this quadratic quantum speedup over classical Monte Carlo applies for quantum algorithms A that only feature a single measurement, such as certain simple quantum walk algorithms. Unfortunately, this assumption fails for quantizations of Metropolis-Hastings in general and QPMCMC in particular. More promising for QPMCMC, Lemieux et al. (2020) develop a quantum circuit that applies Metropolis-Hastings to the Ising model without the need for oracle calls. An interesting question is whether similar non-oracular quantum circuits exist for the basic parallel MCMC backbone of QPMCMC. In general, however, comparison between QPMCMC and other quantum Monte Carlo techniques is challenging because the foregoing literature (a) largely focuses on MCMC as a tool for discrete optimization, with algorithms that only ever return a single Monte Carlo sample or a function thereof, and (b) restricts itself to a handful of simple, stylized and discrete target distributions. On the other hand, QPMCMC is a general inferential framework for sampling from general discrete and continuous distributions alike.

Third, there are other models and algorithms in statistics, machine learning and statistical mechanics that require sampling from potentially costly discrete distributions. Can one adapt our quantum Gumbel-max trick to these? Approximate Bayesian computation (Csilléry et al. 2010) extends probabilistic inference to contexts within which one only has access to a complex, perhaps computationally intensive, forward model. Having sampled model parameters from the prior, one accepts or rejects these parameter values based on the discrepancy between the observed and simulated data. If one succeeds in embedding forward model dynamics within a quantum circuit, then one may plausibly select from many parameter values using our quantum Gumbel-max trick. The trick may also find use within sequential Monte Carlo (Doucet, De Freitas, and Gordon 2001). For example, Berzuini and Gilks (2001) present a sequential importance resampling algorithm that uses MCMC-type moves to encourage particle diversity and avoid the need for bootstrap resampling. Multiproposals accelerated by the quantum Gumbel-max trick could add speed and robustness to such MCMC-resampling.

Large-scale, practical quantum computing is still a long way off, but quantum algorithms are ripe for mainstream computational statistics.

Disclosure Statement

There are no competing interests to declare.

Additional information

Funding

This work was supported by grants NIH K25 AI153816, NSF DMS 2152774 and NSF DMS 2236854.

References

  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016), "TensorFlow: A System for Large-Scale Machine Learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
  • Akiyama, K., Alberdi, A., Alef, W., Algaba, J. C., Anantua, R., Asada, K., Azulay, R., Bach, U., Baczko, A.-K., Ball, D., et al. (2022), “First Sagittarius a* Event Horizon Telescope Results. I. The Shadow of the Supermassive Black Hole in the Center of the Milky Way,” The Astrophysical Journal Letters, 930, L12.
  • Bennett, C. H., Bernstein, E., Brassard, G., and Vazirani, U. (1997), “Strengths and Weaknesses of Quantum Computing,” SIAM Journal on Computing, 26, 1510–1523. DOI: 10.1137/S0097539796300933.
  • Berzuini, C., and Gilks, W. (2001), “Resample-Move Filtering with Cross-Model Jumps,” in Sequential Monte Carlo Methods in Practice, eds. A. Doucet, N. Freitas, and N. Gordon, pp. 117–138, New York: Springer.
  • Biamonte, J., Wittek, P., Pancotti, N., Rebentrost, P., Wiebe, N., and Lloyd, S. (2017), “Quantum Machine Learning,” Nature, 549, 195–202. DOI: 10.1038/nature23474.
  • Boyer, M., Brassard, G., Høyer, P., and Tapp, A. (1998), “Tight Bounds on Quantum Searching,” Fortschritte der Physik: Progress of Physics, 46, 493–505. DOI: 10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P.
  • Calderhead, B. (2014), “A General Construction for Parallelizing Metropolis- Hastings Algorithms,” Proceedings of the National Academy of Sciences, 111, 17408–17413. DOI: 10.1073/pnas.1408184111.
  • Cortese, J. A., and Braje, T. M. (2018), “Loading Classical Data into a Quantum Computer,” arXiv preprint arXiv:1803.01958.
  • Csilléry, K., Blum, M. G., Gaggiotti, O. E., and François, O. (2010), “Approximate Bayesian Computation (ABC) in Practice,” Trends in Ecology & Evolution, 25, 410–418. DOI: 10.1016/j.tree.2010.04.001.
  • Delmas, J.-F., and Jourdain, B. (2009), “Does Waste Recycling Really Improve the Multi-Proposal Metropolis–Hastings Algorithm? An Analysis based on Control Variates,” Journal of Applied Probability, 46, 938–959. DOI: 10.1239/jap/1261670681.
  • Doucet, A., De Freitas, N., and Gordon, N. J. (2001), Sequential Monte Carlo Methods in Practice (Vol. 1), New York: Springer.
  • Durr, C., and Hoyer, P. (1996), “A Quantum Algorithm for Finding the Minimum,” arXiv preprint quant-ph/9607014 .
  • Erhard, A., Wallman, J. J., Postler, L., Meth, M., Stricker, R., Martinez, E. A., Schindler, P., Monz, T., Emerson, J., and Blatt, R. (2019), “Characterizing Large-Scale Quantum Computers via Cycle Benchmarking,” Nature Communications, 10, 1–7. DOI: 10.1038/s41467-019-13068-7.
  • Frenkel, D. (2004), “Speed-up of Monte Carlo Simulations by Sampling of Rejected States,” Proceedings of the National Academy of Sciences 101, 17571–17575. DOI: 10.1073/pnas.0407950101.
  • Gelman, A., and Rubin, D. B. (1992a), “Inference from Iterative Simulation Using Multiple Sequences,” Statistical Science, 7, 457–472. DOI: 10.1214/ss/1177011136.
  • Gelman, A., and Rubin, D. B. (1992b), “A Single Series from the Gibbs Sampler Provides a False Sense of Security,” Bayesian Statistics, 4, 625–631.
  • Giovannetti, V., Lloyd, S., and Maccone, L. (2008), “Quantum Random Access Memory,” Physical Review Letters, 100, 160501. DOI: 10.1103/PhysRevLett.100.160501.
  • Glatt-Holtz, N. E., Holbrook, A. J., Krometis, J. A., and Mondaini, C. F. (2022), “Parallel MCMC Algorithms: Theoretical Foundations, Algorithm Design, Case Studies,” arXiv preprint arXiv:2209.04750.
  • Grover, L. K. (1996), “A Fast Quantum Mechanical Algorithm for Database Search,” in Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pp. 212–219.
  • Haener, T., Soeken, M., Roetteler, M., and Svore, K. M. (2018), “Quantum Circuits for Floating-Point Arithmetic,” in International Conference on Reversible Computation, pp. 162–174, Springer.
  • Harrow, A. W., Hassidim, A., and Lloyd, S. (2009), “Quantum Algorithm for Linear Systems of Equations,” Physical Review Letters, 103, 150502. DOI: 10.1103/PhysRevLett.103.150502.
  • Hastings, W. K. (1970), “Monte Carlo Sampling Methods Using Markov Chains and their Applications,” Biometrika, 57, 97–109. DOI: 10.1093/biomet/57.1.97.
  • Holbrook, A. J. (2021), “Generating MCMC Proposals by Randomly Rotating the Regular Simplex,” arXiv preprint arXiv:2110.06445 .
  • Holbrook, A. J., Ji, X., and Suchard, M. A. (2022a), “Bayesian Mitigation of Spatial Coarsening for a Hawkes Model Applied to Gunfire, Wildfire and Viral Contagion,” The Annals of Applied Statistics 16, 573–595. DOI: 10.1214/21-aoas1517.
  • Holbrook, A. J., Ji, X., and Suchard, M. A. (2022b), “From Viral Evolution to Spatial Contagion: A Biologically Modulated Hawkes Model,” Bioinformatics, 38, 1846–1856.
  • Holbrook, A. J., Lemey, P., Baele, G., Dellicour, S., Brockmann, D., Rambaut, A., and Suchard, M. A. (2021a), “Massive Parallelization Boosts Big Bayesian Multidimensional Scaling,” Journal of Computational and Graphical Statistics, 30, 11–24. DOI: 10.1080/10618600.2020.1754226.
  • Holbrook, A. J., Loeffler, C. E., Flaxman, S. R., and Suchard, M. A. (2021b), “Scalable Bayesian Inference for Self-Excitatory Stochastic Processes Applied to Big American Gunfire Data,” Statistics and Computing, 31, 1–15. DOI: 10.1007/s11222-020-09980-4.
  • Hurn, M. (1997), “Difficulties in the Use of Auxiliary Variables in Markov Chain Monte Carlo Methods,” Statistics and Computing, 7, 35–44.
  • Jordan, S. P. (2005), “Fast Quantum Algorithm for Numerical Gradient Estimation,” Physical Review Letters, 95, 050501. DOI: 10.1103/PhysRevLett.95.050501.
  • Kielpinski, D., Monroe, C., and Wineland, D. J. (2002), “Architecture for a Large-Scale Ion-Trap Quantum Computer,” Nature, 417, 709–711. DOI: 10.1038/nature00784.
  • Lao, J., Suter, C., Langmore, I., Chimisov, C., Saxena, A., Sountsov, P., Moore, D., Saurous, R. A., Hoffman, M. D., and Dillon, J. V. (2020), “tfp. MCMC: Modern Markov Chain Monte Carlo Tools Built for Modern Hardware,” arXiv preprint arXiv:2002.01184 .
  • Lemieux, J., Heim, B., Poulin, D., Svore, K., and Troyer, M. (2020), “Efficient Quantum Walk Circuits for Metropolis-Hastings Algorithm,” Quantum, 4, 287. DOI: 10.22331/q-2020-06-29-287.
  • Lopatnikova, A., and Tran, M.-N. (2021), “An Introduction to Quantum Computing for Statisticians,” arXiv preprint arXiv:2112.06587.
  • Luo, X., and Tjelmeland, H. (2019), “A Multiple-Try Metropolis–Hastings Algorithm with Tailored Proposals,” Computational Statistics, 34, 1109–1133. DOI: 10.1007/s00180-019-00878-y.
  • Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), “Equation of State Calculations by Fast Computing Machines,” The Journal of Chemical Physics, 21, 1087–1092. DOI: 10.1063/1.1699114.
  • Montanaro, A. (2015), “Quantum Speedup of Monte Carlo Methods,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471, 20150301. DOI: 10.1098/rspa.2015.0301.
  • Neal, R. (2011), “MCMC Using Hamiltonian Dynamics,” in Handbook of Markov Chain Monte Carlo, Part 1, eds. S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, New York: Chapman & Hall/CRC.
  • Neal, R. M. (2003), “Markov Chain Sampling for Non-linear State Space Models using Embedded Hidden Markov Models,” arXiv preprint math/0305039.
  • Nielsen, M. A., and Chuang, I. (2010), Quantum Computation and Quantum Information, Cambridge University Press.
  • Papandreou, G., and Yuille, A. L. (2011), “Perturb-and-Map Random Fields: Using Discrete Optimization to Learn and Sample from Energy Models,” in 2011 International Conference on Computer Vision, pp. 193–200, IEEE.
  • Plummer, M., Best, N., Cowles, K., and Vines, K. (2006), “Coda: Convergence Diagnosis and Output Analysis for MCMC,” R News, 6, 7–11.
  • R Core Team. (2021), R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
  • Ram, K., and Wickham, H. (2018), wesanderson: A Wes Anderson Palette Generator. R package version 0.3.6.
  • Roberts, G. O., and Rosenthal, J. S. (2001), “Optimal Scaling for Various Metropolis-Hastings Algorithms,” Statistical Science, 16, 351–367. DOI: 10.1214/ss/1015346320.
  • Rosenthal, J. S. (2011), “Optimal Proposal Distributions and Adaptive MCMC,” in Handbook of Markov Chain Monte Carlo 4, eds. S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, New York: Chapman & Hall/CRC.
  • Salmon, J. K., Moraes, M. A., Dror, R. O., and Shaw, D. E. (2011), “Parallel Random Numbers: As Easy as 1, 2, 3,” in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12.
  • Schwedes, T., and Calderhead, B. (2021), “Rao-Blackwellised Parallel MCMC,” in International Conference on Artificial Intelligence and Statistics, pp. 3448–3456, PMLR.
  • Shor, P. (1994), “Algorithms for Quantum Computation: Discrete Logarithms and Factoring,” in Proceedings 35th Annual Symposium on Foundations of Computer Science, pp. 124–134.
  • Szegedy, M. (2004), “Quantum Speed-Up of Markov Chain Based Algorithms,” in 45th Annual IEEE Symposium on Foundations of Computer Science, pp. 32–41, IEEE.
  • Tjelmeland, H. (2004), “Using all Metropolis–Hastings Proposals to Estimate Mean Values,” Technical Report. Citeseer.
  • van Rossum, G. (1995), “Python Reference Manual,” Department of Computer Science [CS].
  • Wang, Y., and Liu, H. (2022), “Quantum Computing in a Statistical Context,” Annual Review of Statistics and Its Application, 9, 479–504. DOI: 10.1146/annurev-statistics-042720-024040.
  • Werner, K., Jansson, M., and Stoica, P. (2008), “On Estimation of Covariance Matrices with Kronecker Product Structure,” IEEE Transactions on Signal Processing, 56, 478–491. DOI: 10.1109/TSP.2007.907834.
  • Wickham, H. (2016), ggplot2: Elegant Graphics for Data Analysis, New York: Springer-Verlag.
  • Wocjan, P., and Abeyesinghe, A. (2008), “Speedup via Quantum Sampling,” Physical Review A, 78, 042336. DOI: 10.1103/PhysRevA.78.042336.
  • Yang, S., Chen, Y., Bernton, E., and Liu, J. S. (2018), “On Parallelizable Markov Chain Monte Carlo Algorithms with Waste-Recycling,” Statistics and Computing, 28, 1073–1081. DOI: 10.1007/s11222-017-9780-4.
  • Yoder, T. J., Low, G. H., and Chuang, I. L. (2014), “Fixed-Point Quantum Search with an Optimal Number of Queries,” Physical Review Letters, 113, 210501. DOI: 10.1103/PhysRevLett.113.210501.

Appendix A.

Limited Introduction to Quantum Computing

Quantum computers perform operations on unit-length vectors of complex data called quantum bits or qubits. One may write any qubit $|\psi\rangle$ as a linear combination of the computational basis states $|0\rangle$ and $|1\rangle$. In symbols,
$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle \quad \text{for } \alpha, \beta \in \mathbb{C} \ \text{ and } \ |\alpha|^2 + |\beta|^2 = 1 \, .$$

This formula uses Dirac or bra-ket notation and obscures ideas that are commonplace in the realm of applied statistics. We make the unit-vector specification of $|\psi\rangle$ clear by writing
$$|0\rangle = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \, , \quad |1\rangle = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \quad \text{or} \quad |\psi\rangle = \alpha \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \beta \begin{bmatrix} 0 \\ 1 \end{bmatrix} \, .$$

The full machinery of linear algebra is also available within this notation. The conjugate transpose of $|\psi\rangle$ is $\langle\psi|$. The inner product of $|\psi\rangle$ and $|\phi\rangle$ is $\langle\phi|\psi\rangle$. The outer product is $|\psi\rangle\langle\phi|$. Naturally, we can write $|\psi\rangle$ as a linear combination of any other such basis elements. Consider instead the vectors
$$|+\rangle = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle \quad \text{and} \quad |-\rangle = \frac{1}{\sqrt{2}}|0\rangle - \frac{1}{\sqrt{2}}|1\rangle \, .$$

A little algebra shows that $|+\rangle$ and $|-\rangle$ are indeed unit-length and orthogonal to each other. A little more algebra reveals that, with respect to this basis, $|\psi\rangle$ has coefficients $(\alpha+\beta)/\sqrt{2}$ and $(\alpha-\beta)/\sqrt{2}$ on $|+\rangle$ and $|-\rangle$, respectively. A last bit of algebra shows that this representation is consistent with $|\psi\rangle$'s unit length.

But linear algebra is not the only aspect of quantum computing that should come easily to applied statisticians. In addition to thinking of $|\psi\rangle$ as a vector, it is also useful to think of $|\psi\rangle$ as a (discrete) probability distribution over the computational basis states $|0\rangle$ and $|1\rangle$. The constraints on the coefficients $\alpha$ and $\beta$ mean that we can think of $|\alpha|^2$ and $|\beta|^2$ as the probabilities that $|\psi\rangle$ is in state $|0\rangle$ or $|1\rangle$, respectively. Accordingly, $|+\rangle$ and $|-\rangle$ encode uniform distributions over the computational basis states. In the parlance of quantum mechanics, $\alpha$, $\beta$, and $\pm 1/\sqrt{2}$ are all probability amplitudes, and $|\psi\rangle$, $|+\rangle$, and $|-\rangle$ are all superpositions of the computational basis states. Quantum measurement of $|\psi\rangle$ results in $|0\rangle$ with probability $|\alpha|^2$, but, in the following, we can safely think of this physical procedure as drawing a single discrete sample from $|\psi\rangle$'s implied probability distribution.
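For concreteness, the following minimal R sketch (with arbitrary illustrative amplitudes, not values taken from the paper) represents a qubit as a complex unit 2-vector, changes basis, and treats measurement as a single draw from the implied distribution over the computational basis states.

# A qubit as a complex unit 2-vector; the amplitudes below are illustrative choices
alpha <- complex(real = 1, imaginary = 1) / 2        # |alpha|^2 = 1/2
beta  <- complex(real = 0, imaginary = 1) / sqrt(2)  # |beta|^2  = 1/2
psi   <- c(alpha, beta)
sum(Mod(psi)^2)                                      # unit length: equals 1

# Coefficients of |psi> in the |+>, |-> basis
plusMinusCoefs <- c(alpha + beta, alpha - beta) / sqrt(2)
sum(Mod(plusMinusCoefs)^2)                           # change of basis preserves unit length

# Measurement as a single draw from the implied distribution over {|0>, |1>}
measurement <- sample(0:1, size = 1, prob = Mod(psi)^2)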

Quantum logic gates perform linear operations on qubits like $|\psi\rangle$ and take the form of unitary matrices $U$ satisfying $U^{\dagger}U = I$, for $U^{\dagger}$ the conjugate transpose of $U$. One terrifically important single-qubit quantum gate is the Hadamard gate
$$H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \, .$$

One may verify that $H$ is indeed unitary and that its action maps $|0\rangle$ to $|+\rangle$ and $|1\rangle$ to $|-\rangle$. In fact, the reverse is also true on account of the symmetry of $H$. The Hadamard gate thus takes the computational basis states in and out of superposition, facilitating a phenomenon called quantum parallelism. Given a function $f: \{0,1\} \to \{0,1\}$, consider the two-qubit quantum oracle gate $U_f$, which takes $|x\rangle|y\rangle$ as input and returns $|x\rangle|y \oplus f(x)\rangle$ as output, where $\oplus$ denotes addition modulo 2. Importantly, the output simplifies to $|x\rangle|f(x)\rangle$ for $y = 0$. We now consider a quantum circuit that acts on two qubits by first applying the Hadamard transform $H$ to the first qubit and then applying the oracle gate $U_f$ to both. Using the state $|0\rangle|0\rangle$ as input, we have
$$|0\rangle|0\rangle \;\mapsto\; \frac{1}{\sqrt{2}}|0\rangle|0\rangle + \frac{1}{\sqrt{2}}|1\rangle|0\rangle \;\mapsto\; \frac{1}{\sqrt{2}}|0\rangle|f(0)\rangle + \frac{1}{\sqrt{2}}|1\rangle|f(1)\rangle \, . \quad (\text{A.1})$$

The quantum circuit evaluates $f(\cdot)$ simultaneously over both inputs! Unfortunately, the scientist who implements this circuit cannot directly access $U_f$'s output, and measurement will provide only $|0\rangle|f(0)\rangle$ or $|1\rangle|f(1)\rangle$, with probability 1/2 each. Unlocking the potential of quantum parallelism requires more ingenuity.
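The action of $H$ is easy to check numerically. A short R sketch follows; a multi-qubit sketch using the Kronecker product appears after the $n$-qubit formula below.

# The Hadamard gate and its action on the computational basis
H <- matrix(c(1, 1, 1, -1), nrow = 2, byrow = TRUE) / sqrt(2)
ket0 <- c(1, 0); ket1 <- c(0, 1)
all.equal(t(Conj(H)) %*% H, diag(2))    # unitary: H^dagger H = I
H %*% ket0                              # |+> = (1/sqrt(2),  1/sqrt(2))
H %*% ket1                              # |-> = (1/sqrt(2), -1/sqrt(2))
H %*% (H %*% ket0)                      # back to |0>: H is its own inverse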

Uncountably many single- and two-qubit quantum gates exist, but the real power of quantum computing stems from the development of complex quantum gates that act on multiple qubits simultaneously. To access this power, we need one more tool that also appears in the statistics literature. The Kronecker or tensor product between an $L$-by-$M$ matrix $A$ and an $N$-by-$O$ matrix $B$ is the $LN$-by-$MO$ matrix
$$A \otimes B = \begin{bmatrix} A_{11}B & \cdots & A_{1M}B \\ \vdots & \ddots & \vdots \\ A_{L1}B & \cdots & A_{LM}B \end{bmatrix} \, . \quad (\text{A.2})$$

In statistics, the Kronecker product features in the definition of the matrix normal distribution and is sometimes helpful when specifying the kernel function of a multivariate Gaussian process (Werner, Jansson, and Stoica 2008). Here, the product is equivalent to the parallel action of individual quantum gates on individual qubits. Simple application of Formula (A.2) shows that
$$H^{\otimes 2} = H \otimes H = \frac{1}{2} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix} \quad \text{and} \quad |0\rangle|0\rangle = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \otimes \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \, . \quad (\text{A.3})$$

We may also denote the left product by $H^{\otimes 2}$ and the right product by $|0\rangle^{\otimes 2}$, $|00\rangle$, or $|0\rangle|0\rangle$. One may therefore express (A.1) as a series of transformations applied to the 4-vector on the far right of (A.3). Letting $|10\rangle$, $|01\rangle$, and $|11\rangle$ take on meanings analogous to $|00\rangle$, an immediate result of (A.3) is that
$$H^{\otimes 2}|00\rangle = \frac{1}{2}\big(|00\rangle + |01\rangle + |10\rangle + |11\rangle\big) \, . \quad (\text{A.4})$$

The action of $H^{\otimes 2}$ transforms $|00\rangle$ into a superposition of the states $|00\rangle$, $|01\rangle$, $|10\rangle$, and $|11\rangle$, where the probability of selecting each is a uniform $(1/2)^2 = 1/4$. Writing so many 0s and 1s is tedious, so an alternative notation becomes preferable. Exchanging $|0\rangle$ for $|00\rangle$, $|1\rangle$ for $|01\rangle$, $|2\rangle$ for $|10\rangle$, and $|3\rangle$ for $|11\rangle$, (A.4) becomes the more succinct
$$H^{\otimes 2}|00\rangle = \frac{1}{2}\big(|0\rangle + |1\rangle + |2\rangle + |3\rangle\big) \, .$$

This formula extends generally to operations over $n$ qubits. Now letting $N = 2^n$, we have
$$H^{\otimes n}|0\rangle^{\otimes n} = \frac{1}{\sqrt{N}}\big(|0\rangle + |1\rangle + \cdots + |N-1\rangle\big) = \frac{1}{\sqrt{N}} \sum_{x=0}^{N-1} |x\rangle =: |h\rangle \, ,$$
and we call $|h\rangle$ a uniform superposition over the states $|0\rangle, \ldots, |N-1\rangle$. The many-qubit analogue for the quantum parallelism of (A.1) is then
$$|0\rangle^{\otimes n}|0\rangle \;\mapsto\; \left(\frac{1}{\sqrt{N}} \sum_{x=0}^{N-1} |x\rangle\right)|0\rangle \;\mapsto\; \frac{1}{\sqrt{N}} \sum_{x=0}^{N-1} |x\rangle|f(x)\rangle \, .$$
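A minimal R sketch (for an arbitrary small choice of $n$) verifies (A.3) and (A.4) and constructs the uniform superposition $|h\rangle$ using the built-in kronecker() function.

# Verify (A.3)-(A.4) and build uniform superpositions with kronecker()
H <- matrix(c(1, 1, 1, -1), nrow = 2, byrow = TRUE) / sqrt(2)
ket0 <- c(1, 0)
kronecker(H, H)                                   # H^{(x)2}: 4-by-4 matrix with entries +/- 1/2
kronecker(H, H) %*% kronecker(ket0, ket0)         # (A.4): amplitudes (1/2, 1/2, 1/2, 1/2)

n <- 3; N <- 2^n                                  # illustrative choice of n
Hn    <- Reduce(kronecker, rep(list(H), n))       # H^{(x)n}
ket0n <- Reduce(kronecker, rep(list(ket0), n))    # |0>^{(x)n}
h <- as.vector(Hn %*% ket0n)                      # uniform superposition |h>
all(abs(h - 1 / sqrt(N)) < 1e-12)                 # every amplitude equals 1/sqrt(N)
sample(0:(N - 1), size = 1, prob = h^2)           # measuring |h>: uniform draw from {0, ..., N-1}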

Appendix B.

Gumbel-max

We wish to randomly select a single element from the set $\{0, 1, \ldots, P\}$ with probability proportional to the unnormalized probabilities $\boldsymbol{\pi}^* = (\pi_0^*, \pi_1^*, \ldots, \pi_P^*)$. That is, there exists a $c > 0$ such that $\boldsymbol{\pi}^* = c\,\boldsymbol{\pi}$, for $\boldsymbol{\pi}$ a valid probability vector, but we only have access to $\boldsymbol{\pi}^*$. Define $\boldsymbol{\lambda} := \log \boldsymbol{\pi}$ and $\boldsymbol{\lambda}^* := \log \boldsymbol{\pi}^* = \log \boldsymbol{\pi} + \log c$, and assume that $z_0, \ldots, z_P \overset{\text{iid}}{\sim} \text{Gumbel}(0,1)$. Then, the probability density function $g(\cdot)$ and cumulative distribution function $G(\cdot)$ for each individual $z_p$ are
$$g(z_p) = \exp\!\big(-z_p - \exp(-z_p)\big) \quad \text{and} \quad G(z_p) = \exp\!\big(-\exp(-z_p)\big) \, .$$

Now, defining the random variables
$$\alpha_p^* := \lambda_p^* + z_p \, , \quad \alpha_p := \lambda_p + z_p \quad \text{and} \quad \hat{p} = \underset{p = 0, \ldots, P}{\arg\max}\ \alpha_p^* \, ,$$
we have the result $\Pr(\hat{p} = p) = \pi_p$.

In words, the procedure of adding Gumbel noise to unnormalized log-probabilities and taking the index of the maximum produces a random variable that follows the discrete distribution given by $\boldsymbol{\pi}$. Moving from left to right:
$$\begin{aligned}
\Pr(\hat{p} = p) &= \Pr\big(\alpha_p^* > \alpha_{p'}^* \, , \ \forall\, p' \neq p\big) = \Pr\big(\alpha_p + \log c > \alpha_{p'} + \log c \, , \ \forall\, p' \neq p\big) = \Pr\big(\alpha_p > \alpha_{p'} \, , \ \forall\, p' \neq p\big) \\
&= \int \prod_{p' \neq p} \Pr\big(\alpha_p > \alpha_{p'} \,\big|\, \alpha_p\big)\, g(\alpha_p - \lambda_p)\ d\alpha_p = \int \prod_{p' \neq p} G(\alpha_p - \lambda_{p'})\, g(\alpha_p - \lambda_p)\ d\alpha_p \\
&= \int \prod_{p' \neq p} \exp\!\big(-\exp(\lambda_{p'} - \alpha_p)\big)\, \exp\!\big(-\alpha_p + \lambda_p - \exp(-\alpha_p + \lambda_p)\big)\ d\alpha_p \, .
\end{aligned}$$

Recalling that $\lambda_p = \log \pi_p$, we exponentiate the logarithms, and the integral becomes
$$\begin{aligned}
\pi_p \int \prod_{p' \neq p} \exp\!\big(-\pi_{p'} \exp(-\alpha_p)\big)\, \exp\!\big(-\alpha_p - \pi_p \exp(-\alpha_p)\big)\ d\alpha_p
&= \pi_p \int \exp(-\alpha_p)\, \exp\!\Big(-\sum_{p'=0}^{P} \pi_{p'} \exp(-\alpha_p)\Big)\ d\alpha_p \\
&= \pi_p \int \exp(-\alpha_p)\, \exp\!\big(-\exp(-\alpha_p)\big)\ d\alpha_p = \pi_p \, ,
\end{aligned}$$
where the second equality uses $\sum_{p'=0}^{P} \pi_{p'} = 1$ and the final equality follows easily from the change of variables $u = \exp(-\alpha_p)$.
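A short R sketch (with an arbitrary illustrative unnormalized probability vector, not values from the paper) checks the Gumbel-max result empirically.

# Gumbel-max selection from unnormalized probabilities (illustrative values)
set.seed(1)
piStar  <- c(2, 1, 4, 3)                           # unnormalized probabilities, P + 1 = 4
lamStar <- log(piStar)                             # unnormalized log-probabilities
nReps   <- 100000
draws <- replicate(nReps, {
  z <- -log(-log(runif(length(piStar))))           # iid Gumbel(0, 1) noise via inverse CDF
  which.max(lamStar + z) - 1                       # index of the maximum, in {0, ..., P}
})
rbind(empirical = as.vector(table(draws)) / nReps, # relative frequencies approximate...
      target    = piStar / sum(piStar))            # ...the normalized probabilities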

Appendix C.

Mixing of Parallel MCMC and QPMCMC

To ascertain whether QPMCMC mixes differently compared to conventional parallel MCMC, we run both algorithms for a range of proposal counts to sample from a 10-dimensional standard normal distribution. For each algorithm and proposal setting, we run 100 independent chains for 10,000 iterations and obtain effective sample sizes $\text{ESS}_d$ for $d \in \{1, \ldots, 10\}$. We then calculate the relative differences between the means and the minima (across dimensions) of one chain generated using parallel MCMC and one chain generated using QPMCMC; for example,
$$\text{Relative difference between mean ESS} = \frac{\big|\, \overline{\text{ESS}}_{\text{pMCMC}} - \overline{\text{ESS}}_{\text{QPMCMC}} \,\big|}{\overline{\text{ESS}}_{\text{pMCMC}}} \, .$$

Figure B.1 shows the results. In general, the majority of relative differences are small. For both statistics, mean relative differences are less than 0.05, regardless of proposal count. Again for both statistics, more than 75% of the independent runs result in relative differences below 0.1. We note that relative differences between means (blue) appear to decrease with the number of proposals, but the same does not hold for relative differences between minima (red).
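A minimal sketch of the relative-difference calculation in R follows, using coda::effectiveSize (the effective-sample-size routine named in Appendix D). The stand-in chains below are placeholders so that the snippet runs; in the actual comparison, pmcmcChain and qpmcmcChain would hold the parallel MCMC and QPMCMC output.

library(coda)                                      # effective sample sizes (Plummer et al. 2006)
set.seed(1)
# Stand-in chains so the snippet runs; replace with pMCMC and QPMCMC samples
pmcmcChain  <- matrix(rnorm(10000 * 10), ncol = 10)
qpmcmcChain <- matrix(rnorm(10000 * 10), ncol = 10)
relDiff <- function(a, b) abs(a - b) / a
essPMCMC  <- effectiveSize(as.mcmc(pmcmcChain))    # ESS_d for d = 1, ..., 10
essQPMCMC <- effectiveSize(as.mcmc(qpmcmcChain))
relDiff(mean(essPMCMC), mean(essQPMCMC))           # relative difference between mean ESS
relDiff(min(essPMCMC),  min(essQPMCMC))            # relative difference between minimum ESS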

Fig. B.1 Relative differences between effective sample sizes (ESS) for parallel MCMC (pMCMC) and quantum parallel MCMC (QPMCMC) across a range of proposal counts. We target a 10-dimensional standard normal distribution. For each algorithm and each proposal setting, we generate 100 independent chains of length 10,000 and calculate the mean and minimum ESS across dimensions.

Appendix D.

Note on Simulations

We use R (R Core Team 2021), Python (van Rossum 1995), TensorFlow (Abadi et al. 2016), and TensorFlow Probability MCMC (Lao et al. 2020) in our simulations and the ggplot2 package to generate all figures (Wickham 2016). In R, we use the package coda (Plummer et al. 2006) to calculate effective sample sizes. In Python, we use a built-in function from TensorFlow Probability MCMC. The primary color palette comes from Ram and Wickham (2018).

All simulations rely on the fact that the Grover iterations of (3) manipulate the probability amplitudes in a deterministic manner. For example, the following R code specifies N uniform probability amplitudes (sqrtProbs), performs I Grover iterations and measures a single state according to the resulting probabilities.

sqrtProbs <- rep(1 / sqrt(N), N)                 # N uniform probability amplitudes
i <- 1
while (i <= I) {                                 # I Grover iterations
  sqrtProbs <- (1 - 2 * marks) * sqrtProbs       # oracle: negate amplitudes of marked states
  sqrtProbs <- -sqrtProbs + 2 * mean(sqrtProbs)  # diffusion: reflect amplitudes about their mean
  i <- i + 1
}
measurement <- sample(size = 1, x = 1:N, prob = sqrtProbs^2)

Of course, simulating these iterations on a classical computer requires precomputing the values of marks beforehand. All code is available online at https://github.com/andrewjholbrook/qpMCMC.
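As a toy illustration of the snippet above (with values of N, marks, and I chosen purely for illustration, not taken from our experiments), marking a single state and applying roughly $(\pi/4)\sqrt{N}$ Grover iterations concentrates nearly all probability on the marked state.

# Toy usage: N = 64 states, one marked state, roughly (pi/4) * sqrt(N) iterations
set.seed(1)
N <- 64
marks <- rep(0, N); marks[17] <- 1        # mark state 17 (arbitrary choice)
I <- floor(pi / 4 * sqrt(N))              # 6 iterations
sqrtProbs <- rep(1 / sqrt(N), N)
for (i in 1:I) {
  sqrtProbs <- (1 - 2 * marks) * sqrtProbs
  sqrtProbs <- -sqrtProbs + 2 * mean(sqrtProbs)
}
sum(sqrtProbs^2)                          # amplitudes remain normalized (equals 1)
sqrtProbs[17]^2                           # probability of measuring the marked state (near 1)
sample(size = 1, x = 1:N, prob = sqrtProbs^2)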