Estimating Posterior Sensitivities with Application to Structural Analysis of Bayesian Vector Autoregressions

Abstract

An inherent feature of Bayesian empirical analysis is the dependence of posterior inference on prior parameters, which researchers typically specify. However, quantifying the magnitude of this dependence remains difficult. This article extends Infinitesimal Perturbation Analysis, widely used in classical simulation, to compute asymptotically unbiased and consistent sensitivities of posterior statistics with respect to prior parameters from Markov chain Monte Carlo inference via Gibbs sampling. The approach allows the complete set of prior sensitivities to be computed efficiently for a wide range of posterior statistics, alongside the original estimation algorithm, using Automatic Differentiation. We illustrate the method in a Bayesian Vector Autoregression analysis of fiscal policy in U.S. macroeconomic time series data. The analysis assesses the sensitivities of posterior estimates, including impulse response functions and forecast error variance decompositions, to prior parameters under common Minnesota shrinkage priors. The findings illuminate the significant and intricate influence of prior specification on the posterior distribution. This effect is particularly notable in crucial posterior statistics, such as the largest absolute eigenvalue of the companion matrix, ultimately shaping the structural analysis.

1 Introduction

Empirical analysis through Bayesian inference relies on posterior distributions that combine information from the data using the model likelihood and prior information on model parameters. The latter is characterized by prior distributions that depend on a set of hyperparameters (prior parameters). As prior parametric families are usually determined based on computational considerations, researchers are mainly required to specify the prior parameters.

In practice, researchers adopt various strategies to set these prior parameters, depending on the available data, modeling context, identification issues, or existing prior information. These strategies range from default priors perceived as uninformative to more informative priors established through different methods of prior parameter elicitation (Giannone, Lenza, and Primiceri Citation2015; Baumeister and Hamilton Citation2019; Jarociński and Marcet Citation2019; Jacobi et al. Citation2021). One domain with an active debate regarding prior parameter specification is Bayesian vector autoregressive (BVAR) analysis with shrinkage priors for macroeconomic forecasting and structural analysis (Amir-Ahmadi, Matthes, and Wang Citation2018; Huber and Feldkircher Citation2019; Baumeister, Korobilis, and Lee Citation2020).

In formal terms, the local dependence of posterior inference on the prior parameters can be captured by observing changes in posterior inference resulting from modifications of any of the prior parameters, as formalized by the sensitivity of a statistic (Berger, Insua, and Ruggeri Citation2000; Gustafson Citation2000). However, calculating the exact derivative of a posterior statistic with respect to even a single hyperparameter is typically analytically intractable, given that posterior distributions manifest as high-dimensional integrals requiring numerical solutions. Efficient Markov chain Monte Carlo (MCMC) simulation methods, especially Gibbs sampling, offer popular tools for constructing transition kernels to simulate posterior distributions and estimate posterior statistics across various modeling contexts. Constrained by the computational burden, empirical studies often resort to a less formal sensitivity analysis by re-executing the Gibbs algorithm under alternative prior assumptions (Del Negro and Schorfheide Citation2008; Jarociński and Marcet Citation2019; Jacobi et al. Citation2021).

In this article, we present an approach for sensitivity analysis of MCMC inference, based on extending Infinitesimal Perturbation Analysis (IPA) of the simulation path, traditionally employed with independent samples, to dependent MCMC samples. In particular, we introduce IPA for MCMC inference via Gibbs sampling and leverage its reliance on smooth transition kernels. Notably, Gibbs sampling is widely favored in empirical research due to its advantageous computational properties (Greenberg Citation2012) and its broad applicability. When combined with suitable conjugate priors, data augmentation techniques, and smoothing methods (Albert and Chib Citation1993; Frühwirth-Schnatter Citation1994), Gibbs-based inference extends to complex model settings, encompassing state-space models (Durbin and Koopman Citation2002), treatment effects models (Jacobi, Wagner, and Frühwirth-Schnatter Citation2016), multivariate demand analysis (Briggs et al. Citation2017), and time series models like Bayesian Vector Autoregressions (BVARs) (Koop and Korobilis Citation2010; Camacho, Gadea, and Gómez-Loscos Citation2020).

To establish the theoretical properties of our proposed IPA derivative estimator, we break the Gibbs algorithm down into three distinct categories of mappings: (a) the computation of hyperparameters of the transition kernel, (b) the generation of random variables using the transition kernel, and (c) the conversion of parameter samples into posterior statistics, encompassing measures like the mean and covariance as well as more intricate outcomes like forecasts and impulse response functions. We establish the IPA derivative by employing a chain-rule argument and extend the theoretical IPA results from the classical simulation context to MCMC under a set of regularity conditions. In particular, we prove that the resulting estimator exhibits the desirable asymptotic properties of unbiasedness and consistency. For precise implementation, we employ tools from Automatic Differentiation (AD) (Griewank and Walther Citation2008) to facilitate derivative computation in parallel with the original MCMC estimation algorithm.

The Likelihood Ratio (LR) method is frequently employed as an unbiased alternative to IPA derivative estimation in classical simulation. It involves shifting the parameters of interest into the likelihood function to compute derivative outcomes. Within the MCMC framework, Pérez, Martin, and Rufo (Citation2006) introduced an LR approach for inference within the exponential parametric family that relies on the analytical partial derivative of the score function. As a result, the application of the LR approach to assess posterior-to-prior sensitivities has so far been confined to computing posterior mean sensitivities with respect to prior coefficient means (Müller Citation2012). Another alternative to the infinitesimal perturbation in the IPA approach is finite perturbation via the finite-difference (FD) method. While widely applicable, the FD method lacks the unbiasedness property of IPA and LR, and its effectiveness depends on the chosen bump size. In principle, a smaller bump size should lead to reduced bias under a set of regularity conditions, but this is often unattainable in practice due to numerical instability when dealing with extremely small numbers in computer arithmetic (Glasserman Citation2013).

We demonstrate the unbiasedness and precision of the IPA-based derivative computation through simulation experiments in a standard small Bayesian Vector Autoregression (BVAR) setup. Our benchmarking study concentrates on the sensitivities of the posterior means and variances of the regression coefficients. To maintain consistency with the simulation literature, we incorporate the LR method whenever feasible as an alternative for unbiased derivative estimation; for the remaining comparisons, we rely on FD estimates. Our simulation results align with our theoretical conclusions and substantiate findings from the classical simulation literature (Glasserman Citation2013). They reveal that IPA estimates are not only unbiased when compared to LR estimates but also exhibit higher precision. Furthermore, in contrast to the biased yet general FD approach, IPA sensitivity estimates are computationally efficient and stable, whereas the FD estimates show considerable dependence on the chosen bump size. Within the same BVAR framework, we also expand the application of IPA sensitivities to the impulse response function (IRF), the forecast error variance decomposition (FEVD), and the eigenvalues of the companion matrix for model stability assessment. Through this extension, we showcase the notable influence that shrinkage priors exert on these posterior statistics compared with weakly informative priors.

In our application, we consider a standard fiscal policy Bayesian Vector Autoregression (BVAR) model encompassing government spending, GDP, and tax revenue. This framework follows the spirit of Ramey (Citation2019) and employs a widely adopted Minnesota-style shrinkage prior, rooted in the popular Litterman prior (Doan, Litterman, and Sims Citation1984; Litterman Citation1986). Our findings suggest that the sensitivities of posterior inference primarily revolve around the prior means and variances pertaining to the GDP equation. This is evident from the prior parameter sensitivities observed in the posterior parameter estimates, and these sensitivities subsequently carry through to the IRFs and FEVDs. Focusing on the response of log GDP to a government spending shock, we find that the prior means have a smaller impact on the IRF than the prior variance specifications across the complete forecast horizon. Results obtained by rerunning the MCMC (at a higher computational cost) confirm the first-order approximation of the IRFs constructed from our IPA derivatives. Overall, the application showcases that our method provides an additional tool for practitioners using the BVAR approach.

The remainder of the article is organized as follows. Section 2 introduces our proposed framework for IPA derivative analysis of MCMC output and provides its asymptotic convergence properties for Gibbs-based MCMC. In Section 3, we discuss implementation aspects and present a simulation study that compares the proposed IPA derivatives via AD against two alternative derivative estimation methods. Section 4 illustrates IPA derivative analysis in the context of a fiscal policy VAR. Section 5 concludes.

2 IPA-Based Sensitivity Analysis for Gibbs Inference

Bayesian inference through Gibbs samplers stands as a fundamental tool in statistical and econometric inference. Over time, Gibbs simulation algorithms have retained their popularity as a preferred method for Bayesian statistical analysis due to their favorable computational attributes and wide-ranging applicability. In this section, we present a novel approach designed to tackle the methodological and computational hurdles inherent in evaluating the robustness of posterior statistics. Our approach focuses on quantifying sensitivities with respect to the prior hyperparameters, as introduced in the Bayesian robustness literature by Berger, Insua, and Ruggeri (Citation2000) and Gustafson (Citation2000). Rather than rerunning the MCMC estimation (numerical finite differencing) or relying on analytical expressions (symbolic differentiation, likelihood ratio method) to evaluate prior parameter sensitivities, we leverage the concept of IPA: monitoring the undisturbed trajectory of the MCMC simulation as it evolves, which allows us to predict the effects of small changes in input parameters (as they tend toward zero) on both the parameter draws and the associated statistics.

2.1 Setting

Consider the setting where some data $y$ are generated via a data generating process that depends on a vector of parameters $\theta \in \Omega \subseteq \mathbb{R}^{\dim(\theta)}$. Bayesian inference about the model parameters $\theta$ is based on the posterior distribution $\pi(\theta\,|\,\eta_0, y)$ that combines the model likelihood $f(y\,|\,\theta)$ given the model parameters and their prior distribution $\pi(\theta\,|\,\eta_0)$, where the latter depends on a set of hyperparameters $\eta_0 \in S_\eta \subseteq \mathbb{R}^{\dim(\eta)}$. In practice, MCMC analysis provides an efficient and popular tool to simulate the posterior distribution, which is typically a high-dimensional integral and not analytically tractable.

MCMC-based posterior inference relies on the construction of an MCMC chain of parameter draws, $\{\theta^g\}$, with a Markov transition kernel $\pi(\cdot\,|\,\theta^{g-1},\eta_0,y)$ at each iteration $g$, such that
$$\theta^g \,|\, \theta^{g-1} \sim \pi(\cdot\,|\,\theta^{g-1},\eta_0,y) \quad \text{for } g = 1, 2, \ldots \tag{1}$$
Given chain starting values $\theta^0$, the distribution of $\theta^g\,|\,\theta^0$ converges to the posterior distribution of interest $\pi(\theta\,|\,\eta_0,y)$. Bayesian inference aims to obtain estimates of a set of posterior statistics of interest, $S \in S_s \subseteq \mathbb{R}^{\dim(S)}$,
$$\alpha(\eta_0,y) = E_\pi[S(\theta)\,|\,\eta_0,y] \tag{2}$$
that include estimates of first and second order moments, as well as interval estimates, all of which depend on the prior parameters specified for the inference. Our approach encompasses more complex posterior statistics, including IRF estimates, as illustrated in the real data analysis.

2.2 Prior Parameter Dependence via Local Sensitivities of Posterior Statistics

We focus on prior parameter dependence of inference as defined and used in the literature (Berger, Insua, and Ruggeri Citation2000; Giordano, Broderick, and Jordan Citation2017) in terms of the local derivatives (sensitivities) of a posterior statistic with respect to the prior parameters,
$$\frac{\partial E_\pi[S(\theta)\,|\,Y,\eta_0]}{\partial \eta_0}. \tag{3}$$

Note that both $\eta_0$ and $E_\pi[S(\theta)\,|\,Y,\eta_0]$ are high-dimensional vectors, which in turn implies that the resulting expression for (3) is a high-dimensional matrix of dimension $\dim(S) \times \dim(\eta)$. An example is the matrix of sensitivities of all posterior means with respect to all prior mean parameters of the regression coefficients in a linear model with Gaussian errors.

While conceptually appealing, these sensitivities are usually unavailable in closed form given most models’ complex stochastic dependence structure. Computational assessment faces additional problems due to the dimensionality of the problem in terms of outputs (the set of posterior statistics) and inputs (the set of prior parameters), in combination with the computational intensity of the original MCMC estimation. This poses challenges for existing methods such as finite differencing, symbolic differentiation, and likelihood ratio methods to evaluate a broad set of prior parameters and a wide range of posterior output statistics. As a result, a common practice has been a less formal finite-difference approach based on rerunning the model under a small set of different prior input specifications (Chib and Ergashev Citation2009; Roos et al. Citation2015).

On the other hand, the proposed IPA approach uses the concept of perturbation analysis, that is, monitoring the path of the simulation as it evolves to estimate the impacts of input parameter changes. From an algorithmic point of view relevant for such a perturbation analysis, an MCMC algorithm is simply a function or mapping that takes as inputs the model and data, a vector of hyperparameters $\eta_0$, and a starting vector $\theta^0$, and returns estimates of a range of statistics based on the simulated posterior distribution. We show in Section 2.3 that by applying IPA we can express the local derivative (3) in terms of derivatives of the simulation draws.

The approach requires computing derivatives of simulated output. Since simulation is, by definition, a way of evaluating integrals, derivative estimation via simulation inevitably requires an interchange of the order of differentiation and integration. In the context of IPA, we show in Section 2.4 that it is possible to obtain unbiased and consistent estimates of (3) via IPA-based sensitivity analysis as
$$\frac{\partial E_\pi[S(\theta)\,|\,Y,\eta_0]}{\partial \eta_0} = E_\pi\!\left[\frac{\partial S(\theta)}{\partial \eta_0}\,\Big|\,Y,\eta_0\right],$$
that is, it is valid to interchange the order of differentiation and integration in this context.

2.3 IPA Derivatives in Dependent Sampling

IPA constitutes one form of perturbation analysis that examines the behavior of a simulation path as the size of perturbations applied to the input parameters approaches zero. The method has found broad application in classical simulation studies (Giles and Glasserman Citation2006; Glasserman and Liu Citation2010). Here, we extend the scope of IPA to dependent sampling via Gibbs sampling algorithms, where the transition kernel (1) is established on the basis of analytically known conditional posterior distributions of the model parameters. We begin by defining the central IPA output of interest, the Jacobian matrix $J_{\theta^g}(\eta_0)$, which collects the complete set of derivatives of the $g$th parameter draw $\theta^g$ with respect to the hyperparameters $\eta_0$. It is important to note that, strictly speaking, the data also constitute an input of the algorithm. However, for the sake of simplicity, we omit the impact of the data in the current analysis, leaving its exploration for future research; hence, we drop $Y$ from the notation.

MCMC algorithms trace out intricate simulation paths. Nevertheless, they can be broken down into a series of mappings: two types of mappings manage the transition kernel $\pi(\cdot\,|\,\theta^{g-1},\eta_0)$ that generates the $\theta^g$ samples, and a third mapping computes the posterior statistics of interest:

  1. A type one mapping updates the state parameters $s$ of the transition kernel, that is, $f_1: S_\eta \times \Omega \rightarrow S_s$ such that $s^g = f_1(\eta_0, \theta^{g-1})$, with $S_s \subseteq \mathbb{R}^{\dim(s)}$.

  2. A type two mapping draws the model parameters given the updated state parameters via $f_2: S_s \times (0,1)^{\dim(\omega)} \rightarrow \Omega$ such that $\theta^g = f_2(s^g, \omega^g)$, given the state parameters $s^g$ and a sequence of independent $U(0,1)$ variates $\omega^g$.

For example, suppose the transition kernel is multivariate Gaussian. At each iteration, the mean and covariance are matrix-valued functions (updated via $f_1$) of the draw from the last iteration and the prior hyperparameters; a draw from this particular multivariate Gaussian is then obtained given the updated mean and covariance using $f_2$. This structure generalizes readily to a multiple-block setting. In a standard multivariate linear regression model with Normal errors, a Normal prior on the regression coefficients, and an Inverse Wishart prior on the error covariance matrix, we would have a multivariate Gaussian update of the regression coefficients and an Inverse Wishart update of the covariance matrix. Again, type one mappings update the parameters in the two components of the transition kernel and type two mappings generate each set of parameter draws, iteratively; a minimal sketch of the two mappings for the Gaussian block is given below.
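To make the decomposition concrete, the following minimal sketch spells out $f_1$ and $f_2$ for a single Gaussian block drawing $\beta\,|\,\Sigma,y$ in a linear regression under a $N(\beta_0, B_0)$ prior. The function names, the fixed-$\Sigma$ setup, and all signatures are illustrative assumptions of this sketch, not the implementation used in the article; in a full two-block sampler, $\Sigma$ would come from the previous iteration's Inverse Wishart draw.

```python
# A minimal sketch of the two mappings for one Gaussian Gibbs block:
# drawing beta | Sigma, y in a linear regression with Normal prior
# N(beta_0, B_0). Names and the known-Sigma setup are illustrative.
import numpy as np
from scipy.stats import norm

def f1(eta0, X, y, Sigma_inv):
    """Type one mapping: map prior parameters (and data) to the state
    parameters of the kernel, here the conditional posterior mean m
    and covariance V of beta."""
    beta0, B0 = eta0
    B0_inv = np.linalg.inv(B0)
    V = np.linalg.inv(B0_inv + X.T @ Sigma_inv @ X)
    m = V @ (B0_inv @ beta0 + X.T @ Sigma_inv @ y)
    return m, V

def f2(state, omega):
    """Type two mapping: transform independent U(0,1) variates into a
    draw from N(m, V); smooth in (m, V) for fixed omega."""
    m, V = state
    z = norm.ppf(omega)                    # standard normals from uniforms
    return m + np.linalg.cholesky(V) @ z
```

Because the draw is an explicit, smooth function of the prior parameters for fixed $\omega$, derivative operators can be passed through both mappings, which is exactly what the IPA construction below exploits.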

Considering a Gibbs algorithm expressed in terms of the two mappings, we can apply the chain rule to differentiate the Gibbs chain that generates the model parameter draws. For the main IPA output defined in terms of the Jacobian $J_{\theta^g}(\eta_0)$, we have
$$J_{\theta^g}(\eta_0) = J_{f_2}(s^g)\, J_{f_1}(\eta_0) + J_{f_2}(s^g)\, J_{f_1}(\theta^{g-1})\, J_{\theta^{g-1}}(\eta_0). \tag{4}$$

Hence, the IPA analysis implies that the Jacobian matrix of sensitivities of the parameter draws with respect to the prior parameters can be expressed in terms of the Jacobian of the second mapping with respect to the state parameters, and the Jacobians of the first mapping with respect to the prior parameters and the previous draw. Note that expression (4) contains a path-wise component and a chain-rule component. The path-wise component arises from the direct influence of the hyperparameters on the transition kernel, that is, the random number generation algorithm is a function of the hyperparameters. In addition, the hyperparameters also affect the $(g-1)$th draw, which determines the state parameters of the transition kernel at the current iteration $g$.

Finally, in order to compute posterior statistics, a type three mapping in Gibbs inference computes the sample statistics $S(\theta^g)$ from the draws $\theta^g$. Bayesian inference typically involves the analysis of the posterior distribution, with particular attention given to the posterior mean; in this case, the third mapping is the identity. Hence, the sensitivity of posterior statistics with respect to the complete set of prior inputs can be computed by passing the derivative operator through $S$, given the already available $J_{\theta^g}$, and averaging over the sample.

We can now introduce the following IPA-based measure of the local input sensitivity of the posterior statistics considered in the literature (IPA derivatives), where the sensitivities of posterior statistics of the model parameters with respect to the prior hyperparameters are given by
$$J_{\hat S(\theta)}(\eta_0) = \frac{\partial E_{\hat\pi}[S(\theta)\,|\,y,\eta_0]}{\partial \eta_0} = \frac{1}{G}\sum_{g=B+1}^{G+B} J_{S}(\theta^g)\, J_{\theta^g}(\eta_0), \tag{5}$$
where $E_{\hat\pi}$ reflects that the expectation of $S$ is taken with respect to the empirical distribution formed by the posterior draws, and $G$ is the total number of iterations after the burn-in period $B$. A schematic implementation of the recursion (4) and the average (5) is sketched below.
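The following schematic shows how the recursion (4) and the average (5) can be accumulated alongside the chain. The callables `step` and `stat_jac` are assumptions of this sketch: `step` is assumed to return the new draw together with the three Jacobians appearing in (4), e.g., obtained by automatic differentiation of $f_1$ and $f_2$, and `stat_jac` returns $J_S(\theta^g)$.

```python
# Schematic forward accumulation of the IPA recursion (4) and the
# sample average (5). step(eta0, theta, omega) is assumed to return
# the new draw and the Jacobians J_f2(s_g), J_f1 w.r.t. eta0, and
# J_f1 w.r.t. theta_{g-1}; stat_jac(theta) returns J_S(theta).
import numpy as np

def ipa_chain(eta0, theta0, step, stat_jac, G, B, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0
    J_theta = np.zeros((theta0.size, eta0.size))   # J_{theta^0}(eta0) = 0
    J_stat = 0.0
    for g in range(1, G + B + 1):
        omega = rng.uniform(size=theta0.size)
        theta, J2, J1_eta, J1_theta = step(eta0, theta, omega)
        J_theta = J2 @ (J1_eta + J1_theta @ J_theta)        # recursion (4)
        if g > B:
            J_stat = J_stat + stat_jac(theta) @ J_theta / G # average (5)
    return J_stat
```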

2.4 Convergence and Asymptotics

In the previous section, we introduced the idea of IPA in Bayesian MCMC-Gibbs simulation settings with input parameters $\eta_0$ and $\theta^0$. We are interested in the derivatives of posterior statistics $\alpha(\eta_0, \theta^0)$ written as expectations of real-valued objective functions. When a sample is drawn independently, as in the classical simulation setting, convergence properties of IPA derivatives of the form (5) are implied by the basic Law of Large Numbers. In this section, we extend these convergence properties to dependent sampling via Gibbs MCMC, where the key focus is to estimate the derivative of $\alpha$ as defined in (2), with the expectation taken over the true posterior distribution. Note that the assumptions we make in the following section concern the Markov transition kernel, so that they naturally carry over to the impact of $\theta^0$.

2.4.1 Asymptotic Unbiasedness

Sampling from the joint posterior distribution using MCMC techniques starts after discarding the first $B$ draws of the algorithm, because we generally assume that from draw $B+1$ onwards we are sampling from the exact posterior distribution. In fact, for an ergodic Markov chain, as $B \rightarrow \infty$ we reach the stationary distribution and $\theta^{B+1}$ becomes unbiased, that is, $\lim_{B\rightarrow\infty} E[\theta^{B+1}] = E[\theta\,|\,Y]$. Hence, it is critical for the analysis to collect sensitivity estimates that are unbiased from $B+1$ onwards. The basic idea behind the Gibbs sampler is that a posterior statistic $\alpha$ can be viewed as $\alpha(\eta_0) = \int_\Omega S(\theta)\, d\pi_{\eta_0}(\theta)$, the expectation of an integrand $S$ of the sample path $\theta$ with respect to a posterior probability measure $\pi_{\eta_0}$ over some measurable space $(\Omega, \mathcal{F})$. In general, one cannot simply differentiate the integrand in Gibbs settings, as $\pi_{\eta_0}$ depends on $\eta_0$. To simplify notation, we suppress the conditioning on the data.

The choice of the sample space $\Omega$ and its corresponding sample path is crucial in IPA (L’Ecuyer Citation1990). In fact, in all simulation settings, all random variables are generated from $\omega \sim U(0,1)$ variates. Let $(\tilde\Omega, \tilde{\mathcal{F}}, \tilde\pi)$ denote a probability space in which the random element is a sequence of independent $U(0,1)$ variates. Because Gibbs samples are drawn dependently via a transition kernel $\pi(\cdot\,|\,\cdot,\eta_0)$, that is, $\theta^{g+1}$ is drawn from $\pi(\cdot\,|\,\theta^g,\eta_0)$, we use $(\Omega^g, \mathcal{F}^g, P^g_{\eta_0})$ to denote the conditional probability triplet.

Assumption 1.

For $(\Omega^g, \mathcal{F}^g, P^g_{\eta_0})$, assume that

  1. the measurable space $(\Omega^g, \mathcal{F}^g) = (\Omega, \mathcal{F})$;

  2. for every measurable set $A$ of $(\Omega, \mathcal{F})$, the transformation $\phi_{\eta_0,\theta} = f_2 \circ f_1$ is such that $P^g_{\eta_0}(A) = \tilde\pi(\phi_{\eta_0,\theta}^{-1}(A))$.

To apply IPA, the standard regularity conditions are needed for the algorithm. Since the algorithm is constructed via a stochastic representation of the transition kernel in Gibbs, we have the following conditions similar to those in the classical simulation setting.

Assumption 2.

The following statements are considered under $(\Omega^g, \mathcal{F}^g, P^g_{\eta_0})$.

  • (i) $S_\eta$ is compact.

  • (ii) For all $\eta_0 \in S_\eta$ and $\theta \in \Omega$, the transformation $\phi_{\eta_0,\theta}$ is differentiable in both $\eta_0$ and $\theta$ with probability one. For all $\eta_0, \eta_0^* \in S_\eta$ and $\theta, \theta^* \in \Omega$, there exist random variables $C_1$ and $C_2$ such that
$$\|\phi_{\eta_0,\theta}(\omega) - \phi_{\eta_0^*,\theta}(\omega)\| \leq C_1 \|\eta_0 - \eta_0^*\|, \qquad \|\phi_{\eta_0,\theta}(\omega) - \phi_{\eta_0,\theta^*}(\omega)\| \leq C_2 \|\theta - \theta^*\|,$$
with $E[C_1] < \infty$ and $E[C_2] < 1$.

  • (iii) Let $D_s \subseteq \Omega$ denote the compact set on which $S$ is differentiable, with $\Pr(\theta \in D_s) = 1$, and let $S$ be Lipschitz for all $\theta \in \Omega$.

Compactness is a standard assumption here for sensitivity analysis and estimation. Assumptions (ii) and (iii) concern the smoothness of the algorithm and its capacity to allow derivative operators to pass through. Assumption (ii) is also related to convergence and the boundedness of the derivative as $B \rightarrow \infty$: it ensures that $\frac{\partial \theta^B}{\partial \theta^j} = \prod_{i=j}^{B-1} \frac{\partial \theta^{i+1}}{\partial \theta^i}$, pre-multiplied with $\frac{\partial \theta^{j+1}}{\partial \theta^j}$, is essentially a product of matrix-valued random variables (see Raghunathan (Citation1979) for a theoretical discussion). Note that this is not a very strong condition, since under a geometrically convergent chain these partial derivatives ought to converge to zero at a geometric rate almost surely as $B \rightarrow \infty$.

Theorem 1

(Asymptotically Unbiased IPA Derivatives). Under Assumptions 1 and 2, and for all $\eta_0 \in S_\eta$ with $\|\pi^B(\cdot\,|\,\theta^0,\eta_0) - \pi(\cdot\,|\,\eta_0)\| \rightarrow 0$, the limit of Equation (5) as $B \rightarrow \infty$ is unbiased for $\nabla_{\eta_0}\,\alpha(\eta_0)$, that is, $E[\lim_{B\rightarrow\infty} J_{G,B}(\eta_0)] = \nabla_{\eta_0}\,\alpha(\eta_0)$.

See the proof in Appendix A.

2.4.2 Asymptotic Properties as $G \rightarrow \infty$ and $B \rightarrow \infty$

We have shown the estimator to be unbiased as $B \rightarrow \infty$. In this section, we discuss its properties as $G \rightarrow \infty$. In the classical simulation setting, consistency and asymptotic normality of the IPA derivatives are straightforward to show, given their unbiasedness and diminishing finite variance. In the Gibbs context, however, dependent samples are drawn and their derivatives are calculated along the Markov chain, so the variance of the sample mean also includes covariance terms. Our proof of the following limit theorem therefore relies on the construction of mixingale-type stochastic processes, whose limiting behavior is discussed in detail in the time series literature (see De Jong Citation1995).

Theorem 2

(Consistency for the IPA Derivatives). Under Assumptions 1 and 2, we have $J_{G,B}(\eta_0) \overset{P}{\rightarrow} \nabla_{\eta_0}\,\alpha(\eta_0)$ for all $\eta_0 \in S_\eta$ as $G \rightarrow \infty$ and $B \rightarrow \infty$.

The proof is provided in Appendix A.

The established asymptotic properties of IPA-based derivatives hold true across algorithms applied to diverse models. We will showcase the effectiveness of this method by using it within linear regression frameworks, both in real-world applications and simulation studies. Furthermore, this method can be extended to a broad array of state space models, encompassing models like the time-varying coefficient Vector Autoregression (VAR) (Primiceri Citation2005) and dynamic factor models (Diebold and Li Citation2006), each with varying prior specifications. Notably, the standard algorithm for these model types, typically founded on the Kalman Filter (Carter and Kohn Citation1994), adheres to the necessary regularity conditions, thereby making the IPA approach applicable.

An inherent limitation of the IPA approach lies in its requirement for algorithm smoothness. Algorithms of the rejection type, including the widely used Metropolis-Hastings (MH) algorithm, generate nonsmooth simulation paths. This conflicts with the Lipschitz assumptions established in our theoretical framework. Nevertheless, in situations involving low dimensions and assuming the density itself is continuously differentiable, we can resort to numerical integration techniques to derive distributional derivatives. In Appendix D, we present an illustration of a possible expansion of the standard IPA setup to MCMC settings where the full conditional distribution remains unknown for a specific model parameter. We have also implemented an IPA based approach within Metropolis-within-Gibbs and Slice-within-Gibbs algorithms and demonstrated the performance in a simulation study (see Appendix B). Note that algorithms such as unadjusted Hamiltonian Monte Carlo (Bou-Rabee, Eberle, and Zimmer Citation2020) and Langevin sampling (Durmus and Moulines Citation2017) fulfill the required smoothness assumption.

2.5 Implementation

To address the reliance of IPA on the exact evaluation of complex derivative operations along the simulation algorithm, we employ automatic differentiation (AD), introduced to the classical simulation community by Griewank and Walther (Citation2008) and now a standard computational tool for computing derivatives of smooth functions (Homescu Citation2011). While AD has been included in common statistical libraries, such as PYMANOPT for optimization over manifolds implemented in Python (Townsend, Koep, and Weichwald Citation2016) and Stan for approximate inference (Giordano, Broderick, and Jordan Citation2017), these are not designed to handle some of the specific computational challenges of Gibbs sampling. Since it fails to exploit the substantial gains in computational efficiency from matrix operations, a naive application of such standard element-by-element AD implementations to an IPA-based sensitivity analysis in Gibbs settings can be computationally prohibitive. Our implementation therefore exploits the vector-based AD approach discussed in Kwok, Zhu, and Jacobi (Citation2022) to facilitate the efficient computation of the complete set of prior parameter sensitivities of the MCMC draws alongside the Gibbs sampling iterations. We refer interested readers to the R package (Kwok, Zhu, and Jacobi Citation2020) for an implementation of vectorized AD in a general multivariate regression context. The code is also available as part of the supplementary materials. A toy illustration of differentiating a Gibbs chain with AD is given below.
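As a toy illustration of the idea, the following sketch differentiates an entire (short, unrolled) Gibbs chain with JAX for a conjugate normal model. The model, chain lengths, and all names are assumptions of this sketch, not the vectorized implementation used in the article; the sketch relies on the fact that `jax.random.gamma` is differentiable in its shape argument, so both blocks are smooth in the prior parameters for a fixed random key.

```python
# Toy sketch: prior sensitivities of a posterior mean obtained by
# differentiating the simulation path of a two-block Gibbs sampler
# (normal data, Normal prior on the mean mu, Gamma prior on the
# precision h). Illustrative assumptions throughout; a practical
# implementation would use lax.scan and vectorized AD.
import jax
import jax.numpy as jnp

y = jnp.array([0.3, -0.1, 0.5, 0.2, 0.0])
n = y.shape[0]

def gibbs_posterior_mean(eta0, key, G=200, B=50):
    mu0, tau0sq, a0, b0 = eta0            # prior parameters
    mu, h = 0.0, 1.0                      # starting values
    total = 0.0
    for g in range(G + B):                # short chain, unrolled
        key, k1, k2 = jax.random.split(key, 3)
        # block 1: mu | h ~ N(m, v)
        v = 1.0 / (1.0 / tau0sq + n * h)
        m = v * (mu0 / tau0sq + h * jnp.sum(y))
        mu = m + jnp.sqrt(v) * jax.random.normal(k1)
        # block 2: h | mu ~ Gamma(a0 + n/2, rate = b0 + SS/2)
        ss = jnp.sum((y - mu) ** 2)
        h = jax.random.gamma(k2, a0 + n / 2.0) / (b0 + ss / 2.0)
        if g >= B:
            total = total + mu
    return total / G

eta0 = jnp.array([0.0, 10.0, 2.0, 1.0])   # (mu0, tau0^2, a0, b0)
sens = jax.grad(gibbs_posterior_mean)(eta0, jax.random.PRNGKey(0))
# sens approximates d E[mu | y] / d eta0 along the fixed simulation path
```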

3 Simulation Study

We illustrate the performance and use of the introduced IPA approach for prior parameter sensitivities of posterior statistics from Gibbs inference in a multivariate regression framework. Our experimental design is based on a standard BVAR model and explores prior parameter sensitivities in the context of the common use of shrinkage priors to address the over-parameterization arising from the moderate size of many macroeconomic datasets (Karlsson Citation2013). Within this setting, we first benchmark IPA derivative estimates against two alternative computational methods to demonstrate the unbiasedness and high precision of the IPA estimates. This part of the simulation focuses on sensitivities of posterior mean and variance statistics to aid benchmarking, given the limitations of existing methods. In the second part of the simulation study, we exploit the flexibility and efficiency of the proposed IPA tool by contrasting patterns of prior parameter dependence under shrinkage with more uninformative prior parameter settings, and by expanding the set of considered sensitivities to assess sensitivity patterns of key features in VAR-based policy analysis, such as impulse responses to policy shocks.

3.1 Experiment Design

Consider the VAR($q$) model, which can be expressed as
$$y_t = b_0 + \sum_{l=1}^{q} y_{t-l} B_l + \epsilon_t = X_t \beta + \epsilon_t, \qquad \epsilon_t \sim N(0, \Sigma), \tag{6}$$
where $t = 1, 2, \ldots, T$, $X_t = I_n \otimes (1, y_{t-1}, \ldots, y_{t-q})$ and $\beta = \mathrm{vec}([b_0, B_1, \ldots, B_q])$. The row-stacked vector of intercepts and VAR (lag) coefficients is denoted $\beta \in \mathbb{R}^{k_\beta}$, where $k_\beta = n(nq+1)$. The VAR (lag) coefficients capture the dynamic interdependence among the time series. We revisit the BVAR setting in the empirical analysis. For the simulation study, we consider two data generating processes:

  • DGP1: Independent normal VAR(2) with $b_0 = 1_n$, $B_l = \frac{1}{l+1} I_n$ and $\Sigma = 0.01 I_n$;

  • DGP2: General normal VAR(2) with $b_0 = 0.01 \times 1_n$, and general $B_l$ and $\Sigma$. The diagonal elements of the first VAR coefficient matrix $B_1$ are iid uniform $U(0, 0.5)$ and the off-diagonal elements are $U(-0.2, 0.2)$. All elements of the higher VAR coefficient matrices $B_l$ are iid $N(0, 0.05^2/l^2)$, where $l$ is the lag length. The error covariance matrix $\Sigma$ is generated from the Inverse-Wishart distribution $IW(n+10,\ 0.07 I_n + 0.03\, 1_n 1_n')$.

Under both data designs, the number of endogenous variables $n$ is set to 3 and the time length $T$ to 600. For the simulation study, 100 samples are generated under each DGP, with the initial value $y_0$ set to the stationary mean of the corresponding process; a minimal simulation sketch for DGP1 follows.
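For concreteness, a minimal simulation of DGP1 might look as follows. This is a sketch in Python; the function name and seeding are illustrative, and DGP2 is obtained analogously by randomizing $B_l$ and $\Sigma$ as described above.

```python
# Sketch of simulating DGP1: independent normal VAR(2) with b_0 = 1_n,
# B_l = I_n/(l+1), Sigma = 0.01 I_n, initialized at the stationary mean.
import numpy as np

def simulate_dgp1(n=3, q=2, T=600, seed=0):
    rng = np.random.default_rng(seed)
    b0 = np.ones(n)
    B = [np.eye(n) / (l + 1) for l in range(1, q + 1)]
    sigma_chol = np.sqrt(0.01) * np.eye(n)
    # stationary mean solves mu = b0 + sum_l mu B_l (row form)
    mu = np.linalg.solve(np.eye(n) - sum(B).T, b0)
    y = np.tile(mu, (T + q, 1))
    for t in range(q, T + q):
        y[t] = (b0
                + sum(y[t - l] @ B[l - 1] for l in range(1, q + 1))
                + sigma_chol @ rng.standard_normal(n))
    return y[q:]
```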

For each dataset simulated from the true DGP, we fit a VAR($p$) model, with the lag order $p$ of the fitted model equal to 1, 2, or 4. We proceed under standard independent (conditionally conjugate) prior assumptions with a Normal prior on $\beta$, $N(\beta_0, B_0)$, and an Inverse Wishart prior on $\Sigma$, $IW(\nu_0, S_0)$, with the exact implementation given in Appendix B. Within this framework, we set the prior mean $\beta_0$ to zero and proceed under different prior specifications for the prior variances of the lag coefficients ($B_0$), encompassing both uninformative and shrinkage settings. For the uninformative prior, we set $B_0 = 100 I_{k_\beta}$ ($k_\beta = n(np+1)$), a standard choice in many practical applications. Our main prior specification is a Minnesota-type shrinkage prior that shrinks higher lag coefficients to zero (Koop and Korobilis Citation2010; Karlsson Citation2013), thus favoring more parsimonious models and avoiding overfitting the data (Morley and Wong Citation2017). In particular, we set $B_0$ as a diagonal matrix with different formulations for lag and intercept coefficient variances, such that $B_{0,ii} = \kappa_0/l^2$ if $i$ corresponds to the own $l$th lag, $\kappa_1 s_i^2/(l^2 s_j^2)$ if $i$ corresponds to the $l$th lag coefficient of variable $j$ in the $i$th equation, and $\kappa_2 s_i^2$ if it is an intercept. We set $S_0 = \kappa_3 I_n$, with $\kappa_0 = 0.1^2$, $\kappa_1 = 0.1^2$, $\kappa_2 = 100$, and $\kappa_3 = 1$; a sketch of this construction is given below.
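The following sketch assembles this Minnesota-type prior covariance. The per-equation coefficient ordering (intercept first, then lags) and the residual variances `s2` (e.g., from univariate AR fits) are assumptions of this illustration.

```python
# Illustrative construction of the diagonal Minnesota-type B_0 described
# above: own-lag variance kappa0/l^2, cross-lag variance
# kappa1 * s_i^2 / (l^2 * s_j^2), intercept variance kappa2 * s_i^2.
import numpy as np

def minnesota_B0(s2, p, kappa0=0.1**2, kappa1=0.1**2, kappa2=100.0):
    n = len(s2)
    diag = []
    for i in range(n):                        # equation i
        diag.append(kappa2 * s2[i])           # intercept
        for l in range(1, p + 1):             # lag l
            for j in range(n):                # variable j
                if j == i:
                    diag.append(kappa0 / l**2)                    # own lag
                else:
                    diag.append(kappa1 * s2[i] / (l**2 * s2[j]))  # cross lag
    return np.diag(np.array(diag))            # k_beta x k_beta, k_beta = n(np+1)
```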

3.2 Benchmarking IPA Derivatives

Following the classical simulation literature, we compare the IPA derivative estimates against the two popular alternative numerical derivative estimation methods, the likelihood ratio (LR) and the numerical finite-difference (FD) methods (Glasserman Citation2013). Like IPA, the LR method is unbiased, but it relies on analytically evaluating and computing derivatives with respect to the log-likelihood. Within the MCMC context, its implementations so far are restricted to the derivative of the posterior mean with respect to the prior mean within the exponential family of distributions (Pérez, Martin, and Rufo Citation2006; Müller Citation2012). Appendix B provides more details on the LR and FD implementations; a minimal FD sketch is shown below.
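For reference, a central-difference FD baseline of the kind used here can be sketched as follows, where `posterior_stat(eta0, seed)` is an assumed wrapper that runs the full Gibbs chain under common random numbers and returns the statistic of interest.

```python
# Sketch of the FD baseline: bump one prior parameter, re-run the
# sampler with the same seed (common random numbers), and difference.
import numpy as np

def fd_sensitivity(posterior_stat, eta0, k, h=1e-5, seed=0):
    e = np.zeros_like(eta0)
    e[k] = h                                   # bump the k-th prior parameter
    return (posterior_stat(eta0 + e, seed)
            - posterior_stat(eta0 - e, seed)) / (2.0 * h)
```

Each prior parameter requires additional full MCMC runs, and the bump size h drives the variance-bias tradeoff discussed below.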

All presented benchmarking results refer to VAR inference under the Minnesota shrinkage prior for data simulated under DGP 2. We begin with the comparison of the derivative estimates for $\partial E[\beta\,|\,y,\eta_0]/\partial \beta_0$ via the IPA and LR methods. Figure 1 compares the posterior mean derivative estimates under AD and LR in the top panels (a)–(c) and the precision of the corresponding derivative estimates, in terms of their standard deviations, in the bottom panels (d)–(f). First, both IPA and LR produce unbiased derivative estimates, with the points reflecting the derivative estimates under both methods closely tracing the 45-degree line. Second, the bottom set of graphs shows that the IPA estimates are more precise for VARs with more than one lag, with most points well above the 45-degree line, reflecting a higher standard deviation of the LR estimates. This pattern is in accordance with the large body of evidence from the classical simulation literature showing that the LR method generally produces derivative estimates with larger standard deviations than the IPA estimator (Cui et al. Citation2020).

Fig. 1 Comparing estimates of posterior mean derivatives (top panel) and their corresponding precision in terms of estimator standard deviations (bottom panel) obtained through the IPA on the x-axis and the LR on the y-axis. The findings are derived from 100 samples originating from DGP 2, using VAR(p) (p = 1, 2, 4) fitting under Minnesota shrinkage priors.

Next, we turn to the derivatives of the posterior variances, $\partial\, \mathrm{var}(\beta\,|\,y,\eta_0)/\partial\, \mathrm{diag}(B_0)$, which can be benchmarked against the more general FD method. Estimates from the fully numerical FD approach are biased and depend on the bump size chosen in the implementation. Hence, FD provides an approximation subject to the widely documented variance-bias tradeoff, also evident in our simulation results, where we benchmark the IPA estimates against FD estimates obtained under various bump sizes. Note that since FD assesses derivatives with respect to one prior input parameter at a time, it requires many reruns of the MCMC under some chosen bump size with fixed random number generators. While different bump sizes could be chosen for different dimensions, the complexity of the problem makes such choices difficult in practice (Glasserman Citation2013). Figure 2 compares IPA derivatives with FD derivatives under three bump sizes (1e-1, 1e-5, and 1e-17). The variation in the patterns across the columns reflects the dependence of the FD performance, in terms of both point estimates (graphs (a)–(c)) and their precisions (graphs (d)–(f)), on the chosen bump size and the inherent variance-bias tradeoff. If the bump size is very small, the FD point estimates of the sensitivities are reasonable (c), although with considerably more variation around the 45-degree line than the LR estimates, and they suffer from low precision (f) due to the large variance and floating point error. On the other hand, if the bump size is larger, we observe a substantial bias in the FD estimates (2a) combined with a very high precision (2d). There is always a variance/bias tradeoff in the application of FD, with smaller bump values leading to smaller bias but larger variances, and vice versa for larger bump values; in practice, the appropriate choice is hence difficult to determine. In our particular setting, a moderate bump size of 1e-5 appears close to the optimal choice, as it results in FD estimates very close to the AD estimates. Note that we can only conclude this in hindsight, after comparison with the unbiased IPA estimates and after some fine-tuning by trying many bump sizes.

Fig. 2 Comparing estimates of posterior variance derivatives and their corresponding precision from IPA and FD under various bump sizes (1e-1, 1e-5, and 1e-17). Results are from fitting a VAR(2) model under Minnesota shrinkage priors, using 100 samples drawn from DGP2.

Overall, the benchmark results demonstrate that the proposed IPA approach yields estimates consistent with the unbiased LR estimates while being more stable, in line with findings from the classical simulation context (Glasserman Citation2013). In terms of computational time, IPA takes less than a second for one MCMC parameter inference and gradient computation, while FD takes roughly nine seconds on average. This computational gain is achieved in our context using advances in AD, which execute all derivative computations alongside the parameter inference from one MCMC chain. Due to the relatively simple structure of the posterior mean with respect to the prior mean, derivative estimation under the LR method only requires the prior and posterior covariance matrices and takes less than half a second. In more complicated settings, particularly those involving intractable likelihoods, the LR approach requires much more effort, as it relies on taking derivatives of the density function and generally does not lead to a simple analytical expression as in the presented example.

3.3 Sensitivity Analysis with IPA Derivatives

We now turn to exploiting the generality and computational efficiency of the proposed IPA approach to present patterns of prior parameter dependence for an expanded set of posterior measures, beyond the lag coefficients, that relate directly to relevant features of a VAR-based policy analysis. We begin, however, with the estimates of the IPA sensitivities of the regression coefficients also considered in the benchmarking exercise.

Figure 3 depicts heatmaps illustrating the sensitivities of posterior means and variances of the VAR coefficient vector $\beta$ in the context of inference using VAR(2) and VAR(4) models. These models incorporate the Minnesota prior and are applied to datasets from DGP1 and DGP2. For clarity of exposition, we only present the sensitivities of posterior means and variances of the VAR coefficients in the first equation with respect to the corresponding prior means and variances; please refer to Appendix E for the heatmap in its entirety. Upon examining both rows of Figure 3, it becomes evident that common sensitivity patterns emerge across the VAR(2) and VAR(4). The results clearly indicate that as the model increases in size, that is, as the number of VAR coefficients increases from 21 to 39 in the transition from VAR(2) to VAR(4), the influence of the prior parameters becomes more pronounced while preserving a similar pattern.

Fig. 3 IPA estimates of posterior mean and variance derivatives under Minnesota shrinkage priors for data simulated under DGP1 and DGP2. Presented estimates are for a VAR(2) or VAR(4) model for the sensitivities of posterior means and variances of VAR coefficients in the first equation with respect to the corresponding prior means and variances.

Figure 4 showcases the corresponding sensitivities when conducting inference under an uninformative prior (for more details, please refer to Appendix E). Notably, these sensitivities are markedly reduced compared to the outcomes of the previous analysis, thereby confirming the substantial impact of prior shrinkage on posterior statistics.

Fig. 4 IPA estimates of posterior mean and variance derivatives under the Uninformative prior for data simulated under DGP1 and DGP2. Presented estimates are sensitivities of posterior means and variances of VAR coefficients in the first equation with respect to the corresponding prior means and variances for inference under a VAR(2) model.

Next, we present IPA derivative estimates for IRFs and FEVDs. To compute the IRF and FEVD, we focus on the Cholesky decomposition of the structural VAR. This allows us to demonstrate that IPA serves as a tool for examining the influence of prior parameters on policy inference through IRFs derived from orthogonal shocks, as implied by the Cholesky decomposition. Note that the algorithm for computing the Cholesky decomposition is inherently smooth, that is, it satisfies the regularity conditions of our theoretical results.

Let $IRF_{i,j,t}(\beta,\Sigma)$ denote the algorithm for calculating the IRF of the $i$th variable with respect to a shock to the $j$th variable at time $t$, where $(\beta,\Sigma) \sim \pi(\cdot\,|\,\text{data},\eta_0)$. We introduce the prior sensitivities in terms of the mean IPA derivatives with respect to the relevant set of prior parameters $\eta_0$, that is, $\nabla_{\eta_0} E_\pi[IRF_{i,j,t}(\beta,\Sigma)]$, obtained by averaging over the posterior samples, $\frac{1}{G}\sum_{g=B+1}^{G+B} \nabla_{\eta_0} IRF_{i,j,t}(\beta^g,\Sigma^g)$. We derive the sensitivities of the FEVD in a similar manner. The interchange of the integration (expectation) and the derivative computation is justified by the smoothness of the Gibbs transition kernel and of the IRF/FEVD computations; a sketch of such a smooth IRF map is given below.
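A sketch of such a smooth IRF map, written with JAX primitives so that its Jacobian can be obtained by AD, is given below; the row-form parameter layout and function names are assumptions of this illustration rather than the article's implementation.

```python
# Sketch of a smooth map (B_1,...,B_q, Sigma) -> IRF_{i,j,t} under a
# Cholesky identification: build the companion matrix, take its t-th
# power for the reduced-form MA coefficient, and orthogonalize.
import jax.numpy as jnp

def irf(B_list, Sigma, i, j, t):
    n, q = B_list[0].shape[0], len(B_list)
    top = jnp.hstack([Bl.T for Bl in B_list])        # column-form A_l = B_l'
    bottom = jnp.hstack([jnp.eye(n * (q - 1)),
                         jnp.zeros((n * (q - 1), n))])
    F = jnp.vstack([top, bottom])                    # companion matrix
    Phi_t = jnp.linalg.matrix_power(F, t)[:n, :n]    # MA coefficient at horizon t
    P = jnp.linalg.cholesky(Sigma)                   # orthogonalized shocks
    return (Phi_t @ P)[i, j]
```

Composing this map with the draws $(\beta^g, \Sigma^g)$ and applying, for example, `jax.jacfwd` before averaging yields the mean IPA derivatives described above; the FEVD map can be treated analogously.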

For ease of understanding, Figure 5 displays sensitivity estimates of $FEVD_{2,1,t}$ for $t = 0, 1, \ldots, 16$ along the y-axis with respect to the priors of the VAR parameters in the first equation, representing the FEVD of the second variable under the influence of the first shock. We further present an entire heatmap illustrating sensitivity estimates of $FEVD_{2,1,t}$ for $t = 0, 1, \ldots, 24$ with respect to the prior means and variances of all VAR coefficients $\beta$ in Appendix E. As the FEVD is independent of the intrinsic magnitudes of the DGPs, direct comparisons between the DGP 1 and DGP 2 results are feasible. In the context of DGP 2, when compared to DGP 1, $FEVD_{2,1,t}$ demonstrates heightened sensitivities across a wider array of prior parameters, likely attributable to the interdependencies among variables inherent to DGP 2.

Fig. 5 IPA estimates of posterior forecast error variance decomposition $FEVD_{2,1,t}$ ($t = 0, 1, \ldots, 16$) with respect to both prior means and prior variances of VAR coefficients in the first equation under the Minnesota shrinkage prior for data simulated from DGP1 and DGP2.

Figure 6 showcases sensitivity estimates for the IRF under the same conditions as in Figure 5 (for full details, please refer to Appendix E). The lower panels (e)–(h) present “relative” IPA estimates, wherein each sensitivity estimate is scaled by a factor corresponding to the value of $IRF_{2,1,t}$ itself. These relative IPA estimates illustrate that while the IPA derivatives themselves may appear relatively small, they hold significant importance when considered in relation to the magnitudes of the original posterior statistics.

Fig. 6 IPA and “Relative” IPA estimates of posterior $IRF_{2,1,t}$ ($t = 0, 1, \ldots, 16$) with respect to prior means and prior variances of VAR coefficients in the first equation under the Minnesota shrinkage prior for data simulated from DGP1 and DGP2.

4 Application

In this section, we employ the IPA-based derivative approach to investigate the sensitivity of estimated IRFs to prior parameter choices in the context of an orthogonal government spending shock. We focus on a fiscal policy Bayesian VAR (BVAR) model applied to U.S. data, a methodology akin to the frameworks used in Jarociński and Marcet (Citation2019) and Ramey (Citation2019). While VARs and their associated IRFs have a longstanding history in macroeconomic policy analysis (Blanchard and Perotti Citation2002), they have garnered renewed attention among researchers due to the heightened use of fiscal stimulus in response to the Covid pandemic. Bayesian VAR analysis remains a favored approach for policy evaluation and forecasting, offering a variety of prior specification options within the literature. However, the incorporation of new data points from the Covid era sparks renewed robustness concerns regarding both modeling and prior choices (Hauzenberger et al. Citation2021).

Establishing the suitability of prior assumptions on the autoregressive parameters in a specific application remains a challenge with existing tools (Amir-Ahmadi, Matthes, and Wang Citation2018) and restricts the assessment of impacts of such prior parameter choices on IRF inference. Recent work by Jarociński and Marcet (Citation2019) compares IRF estimates by repeated estimation under alternative sets of priors and finds that magnitude and dynamics of output responses from an orthogonal spending shock can vary considerably under different prior assumptions on the mean coefficients. The IPA-based derivative analysis introduces the option of a comprehensive and precise local sensitivity analysis, by identifying both key prior parameters driving prior sensitivity and the overall local sensitivity. Our analysis focuses on the impacts of prior mean and variance parameter changes to implied GDP responses from an orthogonal fiscal policy shock that are estimated from a BVAR under a common Minnesota shrinkage prior.

4.1 Policy BVAR with Minnesota Prior

We consider a basic fiscal policy VAR for government spending, GDP, and federal tax revenue based on a subset of the data used by Ramey (Citation2019). The variables are defined in terms of nominal government purchases, the natural logarithm of real GDP over the total population, and tax revenues for the period from 1939 to 2015. The analysis is based on a 3-variable, 4-lag Bayesian VAR as specified in (6). A similar model has also been considered in Jarociński and Marcet (Citation2019).

We proceed under the independent Normal-Wishart prior setup also considered in the simulation exercise. For the Normal prior on the regression coefficients $\beta$, we specify a common Minnesota shrinkage prior (Koop and Korobilis Citation2010; Karlsson Citation2013) to avoid overfitting the data by setting the prior means to zero. The prior covariance matrix $B_0$ is set as a diagonal matrix with a specific structure that shrinks coefficients of higher lag orders more strongly to zero. In particular, we consider $B_{0,ii} = \kappa_3 s_i^2/(l^2 s_j^2)$ for the $l$th lag coefficient of variable $j$ in the $i$th equation, $B_{0,ii} = \kappa_1 \frac{1}{l^2}$ for the own lag coefficients, $\kappa_2$ if it is the intercept, and $S_0 = \kappa_0 I_n$. We set $\kappa_0 = 1$, $\kappa_1 = 0.2^2$, $\kappa_2 = 100$, and $\kappa_3 = 0.1^2$.

4.2 Sensitivity Results

Inference on the VAR model parameters is based on the output of 10,000 iterations (after 1,000 burn-in iterations) of the 2-step Gibbs algorithm provided in Appendix B, with posterior estimates of the model parameters provided in Appendix C. The IPA-based sensitivity analysis is again implemented via the vector-based AD approach also used in the simulation study (Kwok, Zhu, and Jacobi Citation2020). It investigates the sensitivities of the estimated model parameters and of the GDP responses to government spending shocks, in terms of FEVDs and IRFs, with respect to the key prior parameters in the Minnesota context: the prior locations ($\beta_0$) and variances ($B_0$) of the intercept and lag coefficients. In addition, we provide sensitivities of the companion matrix as a measure of the stability of the VAR system. In the final section, we show how the introduced IPA estimates can be used to assess the impacts of prior parameter changes on IRFs across the complete forecast horizon.

4.2.1 Model Parameters and Model Stability

To begin with, we provide results on the sensitivities of the posterior mean model parameter estimates across the three equations. Figure 7 presents heatmaps of the mean derivatives of the model coefficients $\beta$ for each equation (Government spending, GDP, and Federal tax) with respect to the prior means (top panels) and prior variances (bottom panels) of the VAR parameters in the GDP equation. In each heatmap, each row corresponds to the derivatives of a posterior coefficient mean with respect to all 13 prior mean/variance parameters in the GDP equation. For a more comprehensive illustration, please refer to Appendix E. Common across all panels, we see pronounced sensitivity patterns with respect to some prior settings for parameters in the GDP equation, where the heatmaps contain the darkest red (largest positive) or blue (most negative) values.

Fig. 7 Mean derivatives of model coefficients from all three equations with respect to the prior means $\partial\hat\beta/\partial\beta_0$ (top panels) and prior variances $\partial\hat\beta/\partial\,\mathrm{diag}(B_0)$ (bottom panels) of prior parameters in the GDP equation.

Next, we examine the sensitivities related to the largest absolute eigenvalue of the companion matrix, denoted $\lambda_1$. The companion matrix holds a pivotal role in ascertaining the stability characteristics of the VAR system under consideration: the stability of a VAR model is contingent upon all eigenvalues having absolute value less than 1. This property ensures that the system’s behavior remains non-explosive and permits the derivation of a unique, stationary representation. Furthermore, $\lambda_1$ governs the speed at which the impacts of exogenous shocks within the system attenuate over time. Exploring the sensitivity of $\lambda_1$ to prior specifications yields valuable insights into the influence of these priors on the long-term stability of the system, which becomes especially pertinent when $\lambda_1$ converges toward a value of 1. In the preceding analysis, we observed that the model parameters are most sensitive to the prior means and variances related to the GDP equation. This pattern is further illustrated in Figure 8, which displays the absolute values of the IPA estimates for the largest eigenvalue of the companion matrix, $\lambda_1$. It is noteworthy that the absolute values of the posterior means for $\lambda_1$ are slightly below 1, indicating proximity to this threshold. Given the existing prior parameter configurations, the estimated VAR model aligns closely with a unit root process; should the prior variances for the lag coefficients associated with Federal Tax in the GDP equation increase, $\lambda_1$ would likely surpass the value of 1. A small sketch of the $\lambda_1$ computation follows the figure below.

Fig. 8 Heatmaps of derivative estimates (in absolute values) of largest eigenvalue $\lambda_1$ with respect to the prior mean and variance parameters. The three rows are derivatives of $\lambda_1$ with respect to prior parameters associated with the Government spending, GDP, and Tax equations, respectively.
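For completeness, the stability statistic itself is a simple function of the VAR coefficients; a minimal sketch (with illustrative names, and the same row-form layout assumed earlier) is:

```python
# Sketch: largest absolute eigenvalue lambda_1 of the VAR companion
# matrix; the fitted VAR is stable when lambda_1 < 1.
import numpy as np

def largest_companion_eigenvalue(B_list):
    n, q = B_list[0].shape[0], len(B_list)
    top = np.hstack([Bl.T for Bl in B_list])
    bottom = np.hstack([np.eye(n * (q - 1)), np.zeros((n * (q - 1), n))])
    F = np.vstack([top, bottom])
    return np.max(np.abs(np.linalg.eigvals(F)))
```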

4.2.2 FEVD and IRF of GDP to Spending Shock

In the remainder of the section, we focus on the GDP responses to an orthogonal spending shock as implied by the Cholesky decomposition. Figure 9 presents the sensitivity estimates for the 20-quarter-ahead IRF and FEVD of log GDP in response to a government spending shock, denoted $IRF_{20}^{GDP,Gov}$ and $FEVD_{20}^{GDP,Gov}$, respectively. Each heatmap in these graphs encompasses 39 derivative values relative to the prior mean or variance parameters. We note that the majority of the sensitivities are strongly linked to the prior means and variances in the GDP equation, which is consistent with the results presented in the previous section. Moreover, it is worth noting that the intercept term in the GDP equation holds significant importance for $FEVD_{20}^{GDP,Gov}$.

Fig. 9 Heatmaps of derivative estimates of the IRF and FEVD of log GDP at 20 quarters after the spending shock with respect to the prior mean parameters and variance parameters. The three rows are derivatives with respect to prior parameters associated with the Government spending, GDP, and Tax equations, respectively.

4.2.3 Prior Parameter Dependence across the IRF Forecast Horizon

Consider a forecast horizon of $T_2 = 24$. We use the notation $\widehat{IRF}{}_t^{GDP,Gov}(\eta_0)$ to represent the estimated IRF of log GDP under a government spending shock at time periods $t = 1, \ldots, T_2$, given a vector of hyperparameters $\eta_0$. Correspondingly, $\nabla \widehat{IRF}{}_t^{GDP,Gov}(\eta_0)$ denotes the corresponding estimate of derivatives with respect to $\eta_0 = (\beta_0, \mathrm{diag}(B_0))$. It is important to recognize that it is feasible to calculate the derivatives of $\widehat{IRF}{}_t^{i,j}(\eta_0)$ for all $i, j \in \{1, \ldots, n\}$ and $t \in \{1, \ldots, T_2\}$ simultaneously, that is, to obtain the matrix $\nabla \widehat{IRF}(\eta_0)$ of dimension $n^2 T_2 \times \dim(\eta_0)$.

By making small perturbations to the prior parameters and observing the resulting changes in the IRFs, analysts can acquire valuable insights into the robustness of their chosen prior specifications. In this section, our objective is to showcase the use of IPA derivatives in efficiently conducting such analyses, which obviates the need for repeated runs of MCMC inference at various perturbed values of the prior hyperparameters. To do so, we consider a first-order Taylor approximation at each point in time,
$$\widehat{IRF}{}_t^{GDP,Gov}(\eta_0 + \delta) \approx \widehat{IRF}{}_t^{GDP,Gov}(\eta_0) + \nabla \widehat{IRF}{}_t^{GDP,Gov}(\eta_0)\, \delta, \tag{7}$$
for a basic perturbation analysis on a (sub)set of prior parameters, with perturbation direction $\delta$. While a single run of MCMC provides estimates of $\widehat{IRF}{}_t^{GDP,Gov}(\eta_0)$ and $\nabla \widehat{IRF}{}_t^{GDP,Gov}(\eta_0)$, we can derive various $\widehat{IRF}{}_t^{GDP,Gov}(\eta_0 + \delta)$ estimates by evaluating the outlined approximation at different values of $\delta$, as in the sketch below.
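Operationally, the approximation amounts to a single matrix-vector product per candidate perturbation; a minimal sketch (array names are assumptions of this illustration) is:

```python
# Sketch of the first-order approximation (7): re-use the IRF path and
# its IPA Jacobian from one MCMC run for many perturbations delta.
import numpy as np

def irf_under_perturbation(irf_hat, irf_jac, delta):
    """irf_hat: (T2,) estimated IRF path; irf_jac: (T2, dim_eta) IPA
    derivatives; delta: (dim_eta,) perturbation of eta0."""
    return irf_hat + irf_jac @ delta
```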

To assess the overall prior sensitivity of the IRF of log GDP with respect to a government spending shock, we consider the effect of a very small change in $\eta_0$ on the mean estimates of the IRF. Figure 10 presents the IRF under the initial specification of the prior location and variance parameters, alongside the implied IRFs for 24 periods following the shock when either the prior means, the prior variances, or both are changed by 1e−03 (and also 3e−03 for the latter). Two key patterns emerge. First, the highest sensitivities are observed for responses further out in the time horizon, particularly between 15 and 20 quarters after the spending shock. Second, the responses are more sensitive to changes in the prior variance parameters, which is expected given their shrinkage effect and their lower magnitudes.

Fig. 10 IPA-based GDP response changes $\widehat{\mathrm{IRF}}^{\mathrm{GDP,Gov}}(\eta_0 + \delta)$ under various perturbations ($\delta$ = 1e−03, $\delta$ = 3e−03) of prior parameter settings (prior mean and variance parameters, prior mean parameters, prior variance parameters) under the RW prior.

The accuracy of such first-order approximations depends on the curvature of the function being approximated, that is, on how its rate of change varies across directions. This curvature is captured by the Hessian matrix of second-order partial derivatives of the function with respect to its variables. However, computing the full Hessian matrix can be computationally demanding, particularly for complex functions or high-dimensional problems.
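
One inexpensive alternative, assuming IPA gradients can be evaluated at any hyperparameter value, is a Hessian-vector product along the chosen perturbation direction, approximated by a finite difference of two gradients; the wrapper grad_at below is a hypothetical stand-in for the IPA machinery (each evaluation would correspond to one MCMC run):

```python
import numpy as np

def directional_curvature(grad_at, eta0, delta, h=1e-4):
    """Approximate delta' H delta / ||delta||^2 for a scalar posterior statistic,
    using two gradient evaluations instead of the full Hessian."""
    g0 = grad_at(eta0)                  # gradient at the baseline prior
    g1 = grad_at(eta0 + h * delta)      # gradient at a slightly nudged prior
    hvp = (g1 - g0) / h                 # finite-difference Hessian-vector product
    return delta @ hvp / (delta @ delta)

# Toy check with f(eta) = 0.5 * eta' H eta, whose gradient is H @ eta:
H = np.array([[2.0, 0.3], [0.3, 1.0]])
grad_at = lambda eta: H @ eta
delta = np.ones(2)
print(directional_curvature(grad_at, np.zeros(2), delta))  # = 3.6 / 2 = 1.8
```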

Here, we compare the first-order approximation of the IRF based on IPA derivatives with the IRF obtained from an additional run of the Gibbs sampler at the perturbed prior hyperparameters. Figure 11 shows that, across all considered modifications, the IRF estimates from the derivative-based approach are either nearly indistinguishable from or extremely close to the IRF acquired through re-estimation under the adjusted prior values, and they consistently fall well within the 68% credibility intervals. As anticipated, the approximation is most accurate for changes in the prior means, owing to the linearity of the IRF in the means. Conversely, the non-linearity in the variance parameters leads to a slightly larger approximation error, highlighting that second-order terms cannot be disregarded when dealing with non-linearities stemming from variance parameters.

Fig. 11 Approximated change of the GDP response to a spending shock, $\widehat{\mathrm{IRF}}^{\mathrm{GDP,Gov}}$, under relative changes in the prior means and variances based on AD derivatives (first-order Taylor series expansion), compared to the 68% IRF interval under the initial prior setting.

Finally, in Figure 12, we further analyse how the prior specification of the intercepts influences the IRFs. While the intercepts do not enter the IRF calculations directly, the effects of their priors are inherently embedded in the IRFs through the posterior interdependence with the other model parameters. Panels (a) and (b) of the figure examine increases and decreases of the prior mean or variance by their posterior estimates, respectively, while the concluding panel adjusts both the prior mean and variance parameters associated with the intercept terms. This exploration shows how modifications of the prior assumptions for the intercepts propagate through the IRF estimates via the intricate interplay between these parameters and the other components of the model. The results suggest that the IRF estimates are quite robust to changes in the prior parameters, at least in this particular context: under all considered changes, the IRF estimates remain well within the IRF interval estimated under the original settings.

Fig. 12 Approximated change of the GDP response to a spending shock, $\widehat{\mathrm{IRF}}^{\mathrm{GDP,Gov}}$, under relative changes in the prior means and variances based on AD derivatives (first-order Taylor series expansion), compared to the 68% IRF interval under the initial prior setting. For changes in the intercept prior, both means and variances are changed.

5 Discussion

The dependence of posterior inference on an (often large) set of prior parameters, typically specified by the researcher, is an inherent feature of Bayesian empirical analysis. Local sensitivity analysis can provide important insights into the patterns of prior parameter dependence for key posterior statistics. However, the computation of such prior parameter sensitivities is complicated by statistical and computational challenges, and the available methods are limited in scope and computational feasibility. This article introduces a new approach based on Infinitesimal Perturbation Analysis to expand the scope of sensitivity analysis.

With IPA widely used in classical simulation, we show that it can be extended to compute asymptotically unbiased and consistent prior parameter sensitivities of posterior statistics obtained from Markov chain Monte Carlo inference via Gibbs sampling, under a set of regularity conditions. By decomposing the Gibbs sampler into a sequence of mappings, we can compute the derivative estimates alongside the MCMC algorithm via automatic differentiation, which avoids repeated reruns of the MCMC. The new method also provides a more general approach to unbiased derivative estimation than the LR method, as it applies to a wider range of statistics computed from the model parameters, and it does not suffer from the computational and statistical drawbacks of FD methods. However, the IPA approach has a potential limitation due to the smoothness required of the underlying algorithm. Some widely used sampling methods beyond Gibbs, such as unadjusted Hamiltonian Monte Carlo and Langevin sampling, satisfy these smoothness prerequisites; accept-reject algorithms such as Metropolis-Hastings fail this condition, and an extension of IPA estimates to them requires further work. We illustrate how IPA might be applied in this type of setting in the context of Metropolis-within-Gibbs and Slice-within-Gibbs algorithms. A more comprehensive examination of sensitivity analysis for MCMC output under Metropolis-Hastings-type algorithms is a matter for subsequent research.
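
To make the mechanics concrete, the self-contained toy sketch below (our construction, not the article's BVAR implementation) runs IPA alongside a two-block Gibbs sampler for a normal model, propagating the pathwise derivative with respect to the prior mean through each conditional draw; forward-mode derivatives are hand-coded here in place of AD:

```python
import numpy as np

# Model: y_i ~ N(mu, sig2), priors mu ~ N(mu0, tau0sq), sig2 ~ IG(a0, b0).
rng = np.random.default_rng(1)
y = rng.normal(1.0, 2.0, size=50)
n, S = y.size, y.sum()
mu0, tau0sq, a0, b0 = 0.0, 1.0, 2.0, 2.0   # we differentiate with respect to mu0

sig2, dsig2 = y.var(), 0.0                 # chain state and its derivative wrt mu0
draws, ddraws = [], []
for g in range(6000):
    # mu | sig2, y ~ N(m, 1/P), drawn as m + z / sqrt(P) with z ~ N(0, 1)
    z = rng.standard_normal()
    P = 1.0 / tau0sq + n / sig2
    dP = -(n / sig2**2) * dsig2
    m = (mu0 / tau0sq + S / sig2) / P
    dm = (1.0 / tau0sq - (S / sig2**2) * dsig2) / P - (m / P) * dP
    mu = m + z / np.sqrt(P)
    dmu = dm - 0.5 * z * P**-1.5 * dP      # pathwise derivative of the mu draw
    # sig2 | mu, y ~ IG(a0 + n/2, b), drawn as b / G with G ~ Gamma(a0 + n/2, 1)
    G = rng.gamma(a0 + n / 2)
    b = b0 + 0.5 * np.sum((y - mu) ** 2)
    db = -np.sum(y - mu) * dmu
    sig2, dsig2 = b / G, db / G
    if g >= 1000:                          # discard burn-in
        draws.append(mu)
        ddraws.append(dmu)

# Posterior mean of mu and its IPA sensitivity to mu0 (roughly the prior
# weight 1 / (1 + n * tau0sq / sig2) in this semi-conjugate model).
print(np.mean(draws), np.mean(ddraws))
```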

We illustrate the performance and relevance of the proposed approach in the context of Bayesian VAR analysis, where we assess the dependence of VAR inference on the prior location and variance specifications under a common Minnesota-type prior. The simulation study confirms the unbiasedness and precision of the IPA estimates in a benchmarking exercise against the LR and FD methods. It also introduces IPA-based derivatives for posterior statistics beyond posterior parameter estimates, including the Impulse Response Functions and Forecast Error Variance Decompositions of orthogonal exogenous shocks. Our application to a fiscal policy VAR for U.S. data (Government Spending, GDP, Tax) provides the first formal local sensitivity analysis of IRFs and finds the highest sensitivities with respect to the prior location and variance parameters related to the GDP equation, a finding that is consistent across all posterior measures.

Supplementary Materials

The supplementary material comprises proofs of the theoretical results, together with additional simulation results and empirical findings.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Notes

1 Throughout this article, the term “posterior statistic” (or “posterior output”) pertains to the posterior mean associated with a certain function of parameters.

2 This includes numerous implementations in various software packages, such as the R packages Bayesm (by Peter Rossi, for Marketing and Microeconometrics) and MCMCpack, as well as the BUGS packages.

3 An exception is the sensitivity of the posterior mean with respect to prior mean parameters, as computed in Pérez, Martin, and Rufo (2006) and Müller (2012).

4 In practice, we do not have to work in this microscopic fashion, as long as we can find a representation of the sample path whose probability measure does not depend on the prior hyperparameters.

5 Note that here we drop y from the input vector, as it is not the main focus of this article.

6 Here, we consider the Euclidean norm for vectors.

7 Since the derivative estimate computed using AD is identical to the ordinary pathwise estimates, its potential advantage is, therefore, computational rather than statistical.

8 Compared to the more restrictive natural conjugate joint Normal-Wishart prior formulations, favored in large Bayesian VAR analysis to circumvent the computational challenges of Gibbs inference in such settings, it allows for asymmetry in the prior and posterior (Carriero, Clark, and Marcellino 2019).

9 We acknowledge that there is a recent generalization of the LR method within the classical simulation context (Peng et al. 2018) which has not yet been applied in the MCMC setting.

10 More generally, a relative IPA derivative is defined as the ratio of the sensitivity to the corresponding IRF, that is, $\nabla_{\eta_0} E_\pi[\mathrm{IRF}_{i,j,t}(\beta, \Sigma)] / E_\pi[\mathrm{IRF}_{i,j,t}(\beta, \Sigma)]$.

11 These include real GDP (GDPC1), nominal government purchases, the implicit price deflator (GDPDEF), and nominal Federal current receipts on a NIPA accrual basis (FGRECPT).

References

  • Albert, J. H., and Chib, S. (1993), “Bayesian Analysis of Binary and Polychotomous Response Data,” Journal of the American Statistical Association, 88, 669–679. DOI: 10.1080/01621459.1993.10476321.
  • Amir-Ahmadi, P., Matthes, C., and Wang, M.-C. (2018), “Choosing Prior Hyperparameters: With Applications to Time-Varying Parameter Models,” Journal of Business & Economic Statistics, 38, 124–136. DOI: 10.1080/07350015.2018.1459302.
  • Baumeister, C., and Hamilton, J. D. (2019), “Structural Interpretation of Vector Autoregressions with Incomplete Identification: Revisiting the Role of Oil Supply and Demand Shocks,” American Economic Review, 109, 1873–1910. DOI: 10.1257/aer.20151569.
  • Baumeister, C., Korobilis, D., and Lee, T. K. (2020), “Energy Markets and Global Economic Conditions,” The Review of Economics and Statistics, 104, 1–45.
  • Berger, J. O., Insua, D. R., and Ruggeri, F. (2000), “Bayesian Robustness,” in Robust Bayesian Analysis, eds. D. R. Insua and F. Ruggeri, pp. 1–32, New York: Springer.
  • Blanchard, O., and Perotti, R. (2002), “An Empirical Characterization of the Dynamic Effects of Changes in Government Spending and Taxes on Output,” The Quarterly Journal of Economics, 117, 1329–1368. DOI: 10.1162/003355302320935043.
  • Bou-Rabee, N., Eberle, A., and Zimmer, R. (2020), “Coupling and Convergence for Hamiltonian Monte Carlo,” The Annals of Applied Probability, 30, 1209–1250. DOI: 10.1214/19-AAP1528.
  • Briggs, A. D., Mytton, O. T., Kehlbacher, A., Tiffin, R., Elhussein, A., Rayner, M., Jebb, S. A., Blakely, T., and Scarborough, P. (2017), “Health Impact Assessment of the UK Soft Drinks Industry Levy: A Comparative Risk Assessment Modelling Study,” The Lancet Public Health, 2, e15–e22. DOI: 10.1016/S2468-2667(16)30037-8.
  • Camacho, M., Gadea, M. D., and Gómez-Loscos, A. (2020), “A New Approach to Dating the Reference Cycle,” Journal of Business & Economic Statistics, 40, 66–81. DOI: 10.1080/07350015.2020.1773834.
  • Carriero, A., Clark, T. E., and Marcellino, M. (2019), “Large Bayesian Vector Autoregressions with Stochastic Volatility and Non-Conjugate Priors,” Journal of Econometrics, 212, 137–154. DOI: 10.1016/j.jeconom.2019.04.024.
  • Carter, C. K., and Kohn, R. (1994), “On Gibbs Sampling for State Space Models,” Biometrika, 81, 541–553. DOI: 10.1093/biomet/81.3.541.
  • Chib, S., and Ergashev, B. (2009), “Analysis of Multifactor Affine Yield Curve Models,” Journal of the American Statistical Association, 104, 1324–1337. DOI: 10.1198/jasa.2009.ap08029.
  • Cui, Z., Fu, M. C., Hu, J.-Q., Liu, Y., Peng, Y. and Zhu, L. (2020), “On the Variance of Single-Run Unbiased Stochastic Derivative Estimators,” INFORMS Journal on Computing, 32, 390–407.
  • De Jong, R. M. (1995), “Laws of Large Numbers for Dependent Heterogeneous Processes,” Econometric Theory, 11, 347–358. DOI: 10.1017/S0266466600009208.
  • Del Negro, M., and Schorfheide, F. (2008), “Forming Priors for DSGE Models (and How it Affects the Assessment of Nominal Rigidities),” Journal of Monetary Economics, 55, 1191–1208. DOI: 10.1016/j.jmoneco.2008.09.006.
  • Diebold, F. X., and Li, C. (2006), “Forecasting the Term Structure of Government Bond Yields,” Journal of Econometrics, 130, 337–364. DOI: 10.1016/j.jeconom.2005.03.005.
  • Doan, T., Litterman, R., and Sims, C. (1984), “Forecasting and Conditional Projection Using Realistic Prior Distributions,” Econometric Reviews, 3, 1–100. DOI: 10.1080/07474938408800053.
  • Durbin, J., and Koopman, S. J. (2002), “A Simple and Efficient Simulation Smoother for State Space Time Series Analysis,” Biometrika, 89, 603–616. DOI: 10.1093/biomet/89.3.603.
  • Durmus, A., and Moulines, E. (2017), “Nonasymptotic Convergence Analysis for the Unadjusted Langevin Algorithm,” The Annals of Applied Probability, 27, 1551–1587. DOI: 10.1214/16-AAP1238.
  • Frühwirth-Schnatter, S. (1994), “Data Augmentation and Dynamic Linear Models,” Journal of Time Series Analysis, 15, 183–202. DOI: 10.1111/j.1467-9892.1994.tb00184.x.
  • Giannone, D., Lenza, M., and Primiceri, G. E. (2015), “Prior Selection for Vector Autoregressions,” Review of Economics and Statistics, 97, 436–451. DOI: 10.1162/REST_a_00483.
  • Giles, M., and Glasserman, P. (2006), “Smoking Adjoints: Fast Monte Carlo Greeks,” Risk, 19, 88–92.
  • Giordano, R., Broderick, T., and Jordan, M. I. (2017), “Covariances, Robustness, and Variational Bayes,” arXiv preprint arXiv:1709.02536.
  • Glasserman, P. (2013), Monte Carlo Methods in Financial Engineering (Vol. 53), New York: Springer.
  • Glasserman, P., and Liu, Z. (2010), “Estimating Greeks in Simulating Lévy-Driven Models,” Journal of Computational Finance, 14, 3. DOI: 10.21314/JCF.2010.210.
  • Greenberg, E. (2012), Introduction to Bayesian Econometrics, Cambridge: Cambridge University Press.
  • Griewank, A., and Walther, A. (2008), Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (Vol. 105), Philadelphia: SIAM.
  • Gustafson, P. (2000), “Local Robustness in Bayesian Analysis,” in Robust Bayesian Analysis, Springer, pp. 71–88.
  • Hauzenberger, N., Huber, F., Marcellino, M., and Petz, N. (2021), “Gaussian Process Vector Autoregressions and Macroeconomic Uncertainty,” arXiv preprint arXiv:2112.01995.
  • Homescu, C. (2011), “Adjoints and Automatic (Algorithmic) Differentiation in Computational Finance,” Available at SSRN: DOI: 10.2139/ssrn.1828503.
  • Huber, F., and Feldkircher, M. (2019), “Adaptive Shrinkage in Bayesian Vector Autoregressive Models,” Journal of Business & Economic Statistics, 37, 27–39. DOI: 10.1080/07350015.2016.1256217.
  • Jacobi, L., Nghiem, N., Ramirez Hassan, A., and Blakely, T. (2021), “Food Price Elasticities for Policy Interventions: Estimates from a Virtual Experiment in a Multistage Demand Analysis with (Expert) Prior Information,” Economic Record (forthcoming). DOI: 10.1111/1475-4932.12640.
  • Jacobi, L., Wagner, H., and Frühwirth-Schnatter, S. (2016), “Bayesian Treatment Effects Models with Variable Selection for Panel Outcomes with an Application to Earnings Effects of Maternity Leave,” Journal of Econometrics, 193, 234–250. DOI: 10.1016/j.jeconom.2016.01.005.
  • Jarociński, M., and Marcet, A. (2019), “Priors about Observables in Vector Autoregressions,” Journal of Econometrics, 209, 238–255. DOI: 10.1016/j.jeconom.2018.12.023.
  • Karlsson, S. (2013), “Forecasting with Bayesian Vector Autoregression,” in Handbook of Economic Forecasting (Vol. 2), Elsevier, pp. 791–897.
  • Koop, G., and Korobilis, D. (2010), Bayesian Multivariate Time Series Methods for Empirical Macroeconomics, Delft: Now Publishers Inc.
  • Kwok, C. F., Zhu, D., and Jacobi, L. (2020), ADtools: Automatic Differentiation Toolbox. R package version 0.5.4, https://cran.r-project.org/package=ADtools.
  • Kwok, C. F., Zhu, D., and Jacobi, L. (2022), “An Analysis of Vectorised Automatic Differentiation for Statistical Applications,” Available at SSRN.
  • L’Ecuyer, P. (1990), “A Unified View of the IPA, SF, and LR Gradient Estimation Techniques,” Management Science, 36, 1364–1383.
  • Litterman, R. B. (1986), “Forecasting with Bayesian Vector Autoregressions—Five Years of Experience,” Journal of Business & Economic Statistics, 4, 25–38. DOI: 10.2307/1391384.
  • Morley, J., and Wong, B. (2017), “Estimating and Accounting for the Output Gap with Large Bayesian Vector Autoregressions,” CAMA working paper, 46.
  • Müller, U. K. (2012), “Measuring Prior Sensitivity and Prior Informativeness in Large Bayesian Models,” Journal of Monetary Economics, 59, 581–597. DOI: 10.1016/j.jmoneco.2012.09.003.
  • Peng, Y., Fu, M. C., Hu, J.-Q., and Heidergott, B. (2018), “A New Unbiased Stochastic Derivative Estimator for Discontinuous Sample Performances with Structural Parameters,” Operations Research, 66, 487–499. DOI: 10.1287/opre.2017.1674.
  • Pérez, C., Martin, J., and Rufo, M. (2006), “MCMC-based Local Parametric Sensitivity Estimations,” Computational Statistics & Data Analysis, 51, 823–835. DOI: 10.1016/j.csda.2005.09.005.
  • Primiceri, G. E. (2005), “Time Varying Structural Vector Autoregressions and Monetary Policy,” The Review of Economic Studies, 72, 821–852. DOI: 10.1111/j.1467-937X.2005.00353.x.
  • Raghunathan, M. S. (1979), “A Proof of Oseledec’s Multiplicative Ergodic Theorem,” Israel Journal of Mathematics, 32, 356–362. DOI: 10.1007/BF02760464.
  • Ramey, V. A. (2019), “Ten Years After the Financial Crisis: What Have We Learned from the Renaissance in Fiscal Research?” Journal of Economic Perspectives, 33, 89–114. DOI: 10.1257/jep.33.2.89.
  • Roos, M., Martins, T. G., Held, L., and Rue, H. (2015), “Sensitivity Analysis for Bayesian Hierarchical Models,” Bayesian Analysis, 10, 321–349. DOI: 10.1214/14-BA909.
  • Townsend, J., Koep, N., and Weichwald, S. (2016), “Pymanopt: A Python Toolbox for Optimization on Manifolds Using Automatic Differentiation,” The Journal of Machine Learning Research, 17, 4755–4759.