Bayesian Methods

Bayesian Model Choice for Directional Data

Pages 25-34 | Received 26 May 2022, Accepted 16 Apr 2023, Published online: 13 Jun 2023

Abstract

This article is concerned with the problem of choosing between competing models for directional data. In particular, we consider the question of whether or not two independent samples of axial data come from the same Bingham distribution. This is not a straightforward question to answer, due to the intractable nature of the parameter-dependent normalizing constant of the Bingham distribution. We propose three different methods to perform this task within a Bayesian framework, and apply the methodology to a real dataset on earthquakes in New Zealand. R code to run our methods is available in online supplementary materials.

1 Introduction

The Bingham distribution, introduced by Bingham (Citation1974), is one of the most important and useful models for data which arise as unsigned directions (axes). Bingham (Citation1974) derived its main properties, and since then it has been studied further; see for example Kent (Citation1987), Mardia and Jupp (Citation2000), Kume and Walker (Citation2014), and Dryden (Citation2005), who studied the properties of the Bingham distribution in very high dimensions.

The Bingham distribution can be constructed by conditioning a zero-mean multivariate Normal (MVN) distribution in $\mathbb{R}^q$ to lie on the unit sphere $S^{q-1}$. For $x \in S^{q-1}$, the density of the Bingham distribution with respect to the uniform measure on $S^{q-1}$ is given by
$$f(x; A) = \frac{\exp(x^T A x)}{c(A)}, \quad x^T x = 1,\ x \in \mathbb{R}^q, \qquad (1)$$
where the parameter $A$ is a real symmetric $q \times q$ matrix and $c(A)$ is the corresponding normalizing constant.

Let $A = V\Lambda V^T$ be the Singular Value Decomposition of $A$. Then maximum likelihood estimates of $V$ (whose columns are the eigenvectors of $A$) are easily obtained. However, inferring the eigenvalues of $A$ (the elements of the diagonal matrix $\Lambda$) is a rather challenging task, since it requires evaluation of the parameter-dependent normalizing constant $c(A)$, which, in the general case, is not available in closed form. This poses significant challenges when fitting the Bingham distribution to data in either a frequentist or a Bayesian setting. Several approaches have been proposed in the literature to approximate or estimate the normalizing constant, and hence facilitate maximum likelihood estimation, based on asymptotic expansions (Kent Citation1987), saddlepoint approximations (Kume and Wood Citation2005) and, more recently, holonomic gradient methods (Sei and Kume Citation2015). However, these are nontrivial numerical methods to implement, and in any case they involve some level of approximation of $c(A)$. Likewise, standard approaches to model selection involving Bingham distributions would require evaluation of $c(A)$.
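Although $c(A)$ has no closed form, a crude Monte Carlo estimate illustrates why repeated evaluation inside a sampler is unattractive. The sketch below (in Python; the paper's supplementary code is in R, and the function name is ours) estimates $c(A)$ by averaging $\exp(x^T A x)$ over uniform draws on the sphere. It is far too noisy and slow for use at every MCMC iteration, which motivates the exchange-type algorithms discussed later.

```python
import numpy as np
from math import gamma, pi

def bingham_log_unnorm(x, A):
    # log of the unnormalized Bingham density, x^T A x
    return x @ A @ x

def mc_normalizing_constant(A, n_samples=100_000, rng=None):
    """Crude Monte Carlo estimate of c(A) = integral over S^{q-1} of
    exp(x^T A x) dS(x), using uniform samples on the sphere
    (normalized Gaussian vectors)."""
    rng = np.random.default_rng(rng)
    q = A.shape[0]
    z = rng.standard_normal((n_samples, q))
    x = z / np.linalg.norm(z, axis=1, keepdims=True)  # uniform on S^{q-1}
    vals = np.exp(np.einsum('ij,jk,ik->i', x, A, x))  # exp(x^T A x) per sample
    area = 2 * pi ** (q / 2) / gamma(q / 2)           # surface area of S^{q-1}
    return area * vals.mean()
```

As a sanity check, $A = 0$ gives $c(A)$ equal to the surface area of the sphere ($4\pi$ for $S^2$).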

In the Bayesian setting, MCMC is also complicated by c(A), since standard Metropolis-Hastings updates would require evaluation of this constant at current and proposed configurations. This is known as a doubly intractable problem, for which various algorithms have been proposed (e.g., Møller et al. Citation2006; Murray, Ghahramani, and MacKay Citation2006; Walker Citation2011). The specific case of the Bingham distribution was considered by Walker (Citation2014) and Fallaize and Kypraios (Citation2016).

The above discussion relates to inference given data from a single Bingham distribution. In this article, we consider, within a Bayesian framework, the problem of model selection in the context of Bingham distributions. That is, suppose we have two independent datasets consisting of some axial data, considered to be realizations from Bingham distributions. A natural question that often arises is the following: do these two datasets come from the same Bingham distribution? A common approach is to compute model probabilities/Bayes factors for the two competing scenarios. Again, standard techniques are complicated by the intractability of the constant c(A). Recall that, generally speaking, there are two approaches for calculating Bayes factors/model probabilities using MCMC:

  • within-model estimation. This involves estimating the model evidence (marginal likelihood) for each model separately, for example, using MCMC samples and Chib’s method. In the doubly intractable case, although samples can be obtained from each posterior distribution using one of the aforementioned algorithms, one would also have to plug in an approximation to the unknown normalizing constant (c(A) for the Bingham distribution). So in practice, the model evidence has to be approximated.

  • across-model estimation. Reversible jump can be used to run one sampler which jumps between competing models, thus allowing direct (Monte Carlo) estimation of the model probabilities. In the doubly intractable case, the Reversible Jump Exchange algorithm (Caimo and Friel Citation2013) is an across-model approach that enables exact (in a Monte Carlo sense) calculation of posterior model probabilities without the need to approximate the intractable constant, but this is more computationally intensive and requires careful design to construct an efficient sampler.

Although these general approaches are available, implementations and performance are context-specific. The main purpose of this article is to show how methods in each of the above categories can be successfully applied to perform model selection for Bingham distributions, and to investigate performance in terms of ability to discriminate between competing models depending on sample size and parameter configurations. Specifically, we investigate three different approaches: the Reversible Jump Exchange algorithm of Caimo and Friel (Citation2013); a within-model approach in which the model probabilities are approximated; and an alternative to the calculation of Bayes factors, based on calculating (multivariate) highest posterior density regions for a certain parameter of interest, one value of which corresponds to distributional equality. The latter two approaches are computationally fast and easy to implement. They only require that the practitioner be able to fit a Bingham distribution to directional data within a Bayesian framework, something that can be done efficiently using the algorithm of Fallaize and Kypraios (Citation2016). They also provide a useful cross-check for the Reversible Jump Exchange algorithm, which is appealing in the sense that it provides exact (Monte Carlo) calculation of posterior model probabilities, but requires more careful sampler design and is more computationally intensive. Hence, all three approaches are useful in practice.

The article is structured as follows. In Section 2 we introduce a motivating example of data from earthquakes in New Zealand. In Section 3, we review how Bayesian inference can proceed for the Bingham distribution, and in Section 4 we describe our three strategies for performing model choice for the Bingham distribution in the Bayesian framework. In Section 5, we illustrate the methods on simulated data, and perform a simulation study to investigate the extent to which it is possible to discriminate between models as parameters and sample sizes vary. In Section 6, we apply our methods to the New Zealand earthquake data, before concluding with a discussion.

2 Motivating Application

Here, we introduce a motivating example, which relates to earthquake data in New Zealand. An earthquake gives rise to three orthogonal axes, and geophysicists are interested in analyzing such data in order to compare earthquakes at different locations and/or at different times. Specifically, an earthquake gives rise to a pair of orthogonal axes, known as the compressional (P) and tensional (T) axes, from which a third axis, known as the null (B) axis, is obtained via $B = P \times T$. Each of these quantities is determined only up to sign, and so models for axial data are appropriate. Arnold and Jupp (Citation2013) treated these data as orthogonal axial 3-frames in $\mathbb{R}^3$, but, following Fallaize and Kypraios (Citation2016), we will illustrate our methods using the B axes. In general, an orthogonal axial $r$-frame in $\mathbb{R}^q$, $r \le q$, is an ordered set of $r$ axes, $\{\pm u_1, \ldots, \pm u_r\}$, where $u_1, \ldots, u_r$ are orthonormal vectors in $\mathbb{R}^q$ (Arnold and Jupp Citation2013). The familiar case of data on the sphere $S^2$ is the special case corresponding to $q = 3$, $r = 1$, which is the case we consider here.

The data consist of three clusters of observations derived from earthquakes in New Zealand, where each observation represents a B axis obtained from a particular earthquake. The first two clusters each consist of 50 observations from earthquakes near Christchurch which took place before and after a large earthquake on the 22nd of February 2011. For these two clusters, the P and T axes are quite highly concentrated in the horizontal plane, and as a result the majority of the B axes are concentrated about the vertical axis (see Figure 1). From an applied viewpoint, it is of interest to geophysicists to establish whether there is a difference between the pattern of earthquakes before and after the large earthquake. The third cluster is a more diverse set of 32 observations obtained from earthquakes in the north of New Zealand's South Island. Visually, this cluster appears to exhibit more dispersion than the other two samples. Hence, one approach is to model the B axes using the Bingham distribution, and the principal task is to assess the evidence that the different samples are from populations with the same Bingham distribution or not.

Fig. 1 The B axes for the earthquake data. Top left: Christchurch before the large earthquake. Top right: Christchurch after the large earthquake. Bottom: South Island. The data have been rotated to principal axes, such that the principal axis is the North–South axis. Relatively, the two Christchurch clusters are more concentrated about the vertical axis, and the South Island cluster more dispersed. All points have been plotted in the same hemisphere to enable easier comparison of the variability in each sample. In some cases, the antipodal point was the observation—as these are axial data, both are treated as the same observation and it does not matter which is observed.


3 Bayesian Inference

Consider the probability density function of the Bingham distribution as given in (1). If $A = V\Lambda V^T$ is the Singular Value Decomposition of $A$, where $V$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_q)$, then it can be shown that if $x$ is drawn from a distribution with probability density function $f(\cdot\,; A)$, the random vector $V^T x$ is drawn from a distribution with density $f(\cdot\,; \Lambda)$ (see, e.g., Kume and Walker Citation2006; Kume and Wood Citation2007). Therefore, for ease of exposition and without loss of generality, we assume that $A = \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_q)$. Moreover, to ensure identifiability, we assume that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_q = 0$ (Kent Citation1987) and denote by $\lambda := (\lambda_1, \ldots, \lambda_{q-1})$ the ordered vector of parameters to be inferred. Hence, the probability density function becomes
$$f(x; \lambda) = \frac{\exp\left\{\sum_{i=1}^{q-1} \lambda_i x_i^2\right\}}{c(\lambda)} \qquad (2)$$

with respect to the uniform measure on the sphere, where
$$c(\lambda) = \int_{x \in S^{q-1}} \exp\left\{\sum_{i=1}^{q-1} \lambda_i x_i^2\right\} \, dS^{q-1}(x).$$

(Throughout, we write $\lambda$ in place of $A$ or $\Lambda$, since it is to be understood that $A$ is of the form $A = \Lambda = \mathrm{diag}(\lambda, 0)$.) This integral can be expressed as a hypergeometric function of matrix argument (see, e.g., Mardia and Jupp Citation2000), which is an infinite series involving zonal polynomials and hence not a tractable function to work with in practice. We note that in the case when there are only two distinct eigenvalues of $A$, the distribution reduces to a Watson distribution, in which case the normalizing constant can be easily evaluated. The Watson distribution is appropriate when there is rotational symmetry about some axis, which is not the case for the Bingham distribution in general. Here, we assume that this possibility has been discounted, and that the general Bingham distribution is appropriate.

Let $x_j^T = (x_{j1}, x_{j2}, \ldots, x_{jq}) \in S^{q-1}$ be a unit vector, and suppose $\tilde{x} = (x_1, x_2, \ldots, x_n)$ is a random sample of $n$ unit vectors from the Bingham distribution with density (2). Then the likelihood function is given by
$$\pi(\tilde{x} \mid \lambda) = \frac{1}{c(\lambda)^n} \exp\left\{\sum_{i=1}^{q-1} \lambda_i \sum_{j=1}^{n} x_{ji}^2\right\} = \frac{1}{c(\lambda)^n} \exp\left\{n \sum_{i=1}^{q-1} \lambda_i \tau_i\right\} = \frac{1}{c(\lambda)^n} \pi_u(\tilde{x} \mid \lambda), \qquad (3)$$
where $\tau_i = \frac{1}{n} \sum_{j=1}^{n} x_{ji}^2$ and $\pi_u(\tilde{x} \mid \lambda) = \exp\{n \sum_{i=1}^{q-1} \lambda_i \tau_i\}$. The data can therefore be summarized by $(n, \tau_1, \ldots, \tau_{q-1})$, and $(\tau_1, \ldots, \tau_{q-1})$ are sufficient statistics for $(\lambda_1, \ldots, \lambda_{q-1})$.

The likelihood contains the normalizing constant $c(\lambda)$, and the fact that no closed-form expression exists for it makes Bayesian inference for $\lambda$ challenging. Denote by $\pi(\lambda)$ a prior distribution placed on $\lambda_1, \lambda_2, \ldots, \lambda_{q-1}$, with support $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{q-1} \ge 0$. Hence, the posterior distribution of $\lambda$ is given by
$$\pi(\lambda \mid \tilde{x}) \propto \frac{\pi_u(\tilde{x} \mid \lambda)}{c(\lambda)^n} \, \pi(\lambda), \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{q-1} \ge 0, \qquad (4)$$
which also involves the intractable normalizing constant $c(\lambda)$. As explained in Fallaize and Kypraios (Citation2016), in principle one can employ one of the methods proposed in the literature to approximate the normalizing constant (see, e.g., Kent Citation1987; Kume and Wood Citation2005; Sei and Kume Citation2015) and then use, for example, a Metropolis-Hastings algorithm to sample from $\pi(\lambda \mid \tilde{x})$. However, despite such an approach being feasible, it is not entirely satisfactory. First, the approximation would have to be computed at each iteration of the Markov chain; second, and most importantly, the stationary distribution of such an MCMC algorithm will not be the distribution of interest, but rather an approximation to it (however good this might be). Fallaize and Kypraios (Citation2016) showed how one can bypass the need to calculate the normalizing constant by making use of recent developments in computational methods for doubly intractable distributions.

Of course, in general we will not have $V = I$. In practice, we do the following. Let $X$ be the $n \times q$ data matrix, that is, row $j$ of $X$ is $x_j^T$, and let $\bar{X} = \frac{1}{n}(X^T X)$. Then let $Q \Phi Q^T$ be the Singular Value Decomposition of $\bar{X}$, whereby the diagonal elements of $\Phi$ are the $\tau_i$ above. These are unchanged under any rotation of $X$. Hence, in practice, we take our data $\tau_i$, $i = 1, \ldots, q$, to be the eigenvalues of $\bar{X}$. Due to the identifiability constraints on $\lambda$, we work with the ordered eigenvalues (with $\tau_1$ being the smallest and so on), and $\tau_q$ is redundant since the constraint $\lambda_q = 0$ is imposed. This procedure is equivalent to performing inference for $\lambda$ as above, with $V = I$.

This can be viewed as (and is equivalent to) the following hybrid algorithm for performing inference for $V$ and $\lambda$. $Q$ is in fact the maximum likelihood estimate (MLE) of $V$. Given the data $X$ in the original coordinates, compute $Q$ and set $Y = XQ$. Then the eigenvalues of $\frac{1}{n}(Y^T Y)$ are the same as those of $\bar{X}$, and will also be equal to $\frac{1}{n} \sum_{j=1}^{n} y_{ji}^2$. So working with the eigenvalues of $\bar{X}$ computed using the original data is equivalent to taking the MLE of $V$, using this to rotate the data, and performing inference for $\lambda$ using the posterior distribution (4) with the rotated data.
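The data reduction described above is straightforward to compute. The following Python sketch (the function name is ours; the paper's supplementary code is in R) computes the sufficient statistics $\tau_i$ as the ascending eigenvalues of $\bar{X} = \frac{1}{n} X^T X$, and returns $Q$, the MLE of $V$, which rotates the data to principal axes via $Y = XQ$.

```python
import numpy as np

def sufficient_stats(X):
    """Sufficient statistics for the Bingham model from an n x q matrix X of
    unit row vectors. Returns the ascending eigenvalues tau of
    Xbar = (1/n) X^T X (tau_1 smallest, as in the text; they sum to 1, so
    tau_q is redundant) and the orthogonal matrix Q, the MLE of V."""
    n = X.shape[0]
    Xbar = X.T @ X / n
    tau, Q = np.linalg.eigh(Xbar)  # eigh: ascending eigenvalues, orthonormal Q
    return tau, Q
```

Because the $\tau_i$ are rotation-invariant, applying the function to the rotated data $Y = XQ$ returns the same eigenvalues, which is the equivalence noted in the text.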

4 Bayesian Model Choice

4.1 Preliminaries

Given a set of K competing models indexed by a model indicator k{1,,K}, the principal interest in Bayesian model selection lies in quantifying the extent to which the observed data support each model k. The Bayesian treatment for model choice involves joint inference for k and a parameter vector λk where the model indicator determines the dimension of the parameter, and this may vary from model to model.

Suppose data $\tilde{x}$ is assumed to have been generated by model $k$; the posterior distribution is
$$\pi(\lambda_k \mid \tilde{x}, k) = \frac{\pi(\tilde{x} \mid \lambda_k, k) \, \pi(\lambda_k \mid k)}{\pi(\tilde{x} \mid k)},$$
where $\pi(\tilde{x} \mid \lambda_k, k)$ represents the likelihood function and $\pi(\lambda_k \mid k)$ the prior distribution of the parameters of model $k$. The model evidence for model $k$ is the integral
$$\pi(\tilde{x} \mid k) = \int_{\lambda_k} \pi(\tilde{x} \mid \lambda_k, k) \, \pi(\lambda_k \mid k) \, d\lambda_k.$$

It represents the probability of the data $\tilde{x}$ given a particular model $k$ and is vital in Bayesian model choice. The reason is that it allows us to make statements about posterior model probabilities, which can be written as
$$\pi(k \mid \tilde{x}) = \frac{\pi(\tilde{x} \mid k) \, \pi(k)}{\sum_{k'=1}^{K} \pi(\tilde{x} \mid k') \, \pi(k')},$$
where $\pi(k)$ is the prior probability of model $k$. (In our examples we have two competing models and take $\pi(k) = 0.5$ as a default choice in the absence of any prior information to the contrary.) These posterior model probabilities give rise to Bayes factors, which give the relative posterior likelihoods of pairs of models, for example
$$\frac{\pi(k_1 \mid \tilde{x})}{\pi(k_2 \mid \tilde{x})} = \frac{\pi(k_1)}{\pi(k_2)} \times \frac{\pi(\tilde{x} \mid k_1)}{\pi(\tilde{x} \mid k_2)} = \frac{\pi(k_1)}{\pi(k_2)} \times B_{12},$$
where $B_{12}$ is the Bayes factor of model $k_1$ relative to $k_2$, given the observed data $\tilde{x}$, assuming that the prior model probabilities $\pi(k_1)$ and $\pi(k_2)$ are independent of $\tilde{x}$. The larger the Bayes factor, the greater the evidence in favor of model $k_1$ compared to model $k_2$. Jeffreys (Citation1998) presents a scale on which one can interpret the strength of evidence of one model against another.
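Given (log) model evidences, the conversion to posterior model probabilities is simple arithmetic; working on the log scale avoids underflow when the evidences are tiny. A minimal Python sketch (a helper of our own, not part of the paper's code):

```python
import numpy as np

def posterior_model_probs(log_evidence, prior=None):
    """Posterior model probabilities pi(k|x) from log evidences log pi(x|k),
    computed stably on the log scale (log-sum-exp trick). Equal prior model
    probabilities are used by default, as in the paper's examples."""
    log_ev = np.asarray(log_evidence, float)
    K = len(log_ev)
    prior = np.full(K, 1.0 / K) if prior is None else np.asarray(prior, float)
    w = log_ev + np.log(prior)
    w -= w.max()                 # guard against underflow/overflow
    p = np.exp(w)
    return p / p.sum()
```

For two models with equal prior probabilities, a Bayes factor of $B_{12} = 3$ gives $\pi(k_1 \mid \tilde{x}) = 0.75$.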

4.2 Computing Bayes Factors

In practice, the posterior model probabilities, which are essential to conduct Bayesian model choice, are very rarely analytically available. Therefore, very often, we rely on computational methods. One of the most favored computational methods is Markov chain Monte Carlo (MCMC). The main idea behind MCMC is to construct a Markov chain which has the posterior distribution of interest as its invariant distribution.

In general, there are two main approaches for computing Bayes factors: (i) across-model and (ii) within-model estimation. The within-model approaches aim to approximate/estimate the model evidence $\pi(\tilde{x} \mid k)$ for each model $k = 1, \ldots, K$. There are a number of approaches, ranging from Laplace approximations to path sampling; see Friel and Wyse (Citation2012) for a comprehensive review. The across-model strategy relies on constructing a Markov chain that traverses the joint model and parameter space and samples from
$$\pi(\lambda_k, k \mid \tilde{x}) \propto \pi(\tilde{x} \mid \lambda_k, k) \, \pi(\lambda_k \mid k) \, \pi(k),$$
and one of the most popular approaches used in this context is the reversible jump MCMC (RJMCMC) algorithm of Green (Citation1995).

In the context of the Bingham distribution, neither within-model nor across-model strategies (such as RJMCMC) can be used straightforwardly, because the normalizing constant $c(\lambda)$ in the Bingham likelihood is not available analytically. Here, we present an implementation of the Reversible Jump (RJ) exchange algorithm of Caimo and Friel (Citation2013), which is a modification of the reversible jump algorithm of Green (Citation1995) based on the exchange algorithm of Murray, Ghahramani, and MacKay (Citation2006). Essentially, the RJ exchange algorithm enables us to implement an RJ algorithm without having to compute or approximate the unknown normalizing constant $c(\lambda)$. We will also present a within-model approach for estimating the model evidence in the context of the Bingham distribution. The motivation for this is to provide a useful method for low-dimensional models to use as an alternative reference against which to compare the RJ exchange algorithm.

4.3 Reversible Jump Exchange Algorithm

As described in Section 4.2, the evidence in favor of competing models can be obtained in an across-model fashion. In this case, the parameter space is $(\lambda_k, k)$, and a Markov chain which samples from the joint posterior distribution $\pi(\lambda_k, k \mid \tilde{x})$ can in principle be constructed using RJMCMC. Given such a chain, inference for the model index $k$ is readily available, since it is directly incorporated as a parameter. Given a current state $(\lambda_k, k)$, a RJMCMC algorithm essentially consists of two steps: (i) within-model moves, which perform updates of the form $\lambda_k \to \lambda_k'$ for the current model $k$, and (ii) across-model moves, which perform updates of the form $(\lambda_k, k) \to (\lambda_{k'}, k')$, whereby a move from model $k$ to model $k'$ is proposed, along with a parameter vector $\lambda_{k'}$ for that model. In our setting, both types of move are complicated by the fact that the normalizing constant from the Bingham likelihood would need to be evaluated when computing the acceptance probabilities.

For the within-model updates we use the exchange algorithm of Fallaize and Kypraios (Citation2016); see Algorithm 1. In the algorithm, $q(\cdot \mid \lambda)$ is a user-defined proposal distribution and $f(\cdot \mid \lambda)$ is the Bingham distribution (2) with parameter $\lambda$. In our applications, $\lambda = (\lambda_1, \lambda_2)$, and we find that setting $q(\cdot \mid \lambda) = N_2(\lambda, I)$ works perfectly well.

Algorithm 1

Within-model move. Input: current state $(\lambda_k, k)$. Output: new state $(\lambda_k, k)$.

1. Draw $\lambda_k' \sim q(\cdot \mid \lambda_k)$

2. Draw $\tilde{y} \sim f(\cdot \mid \lambda_k')$

3. Accept $\lambda_k'$ with probability
$$\min\left(1, \frac{\pi_u(\tilde{x} \mid \lambda_k') \, \pi(\lambda_k') \, q(\lambda_k \mid \lambda_k') \, \pi_u(\tilde{y} \mid \lambda_k)}{\pi_u(\tilde{x} \mid \lambda_k) \, \pi(\lambda_k) \, q(\lambda_k' \mid \lambda_k) \, \pi_u(\tilde{y} \mid \lambda_k')}\right).$$
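A minimal Python sketch of the within-model exchange move follows. The unnormalized log-likelihood, log-prior, and an exact Bingham sampler (e.g., that of Kent, Ganeiber, and Mardia 2018) are assumed to be supplied by the user; with the symmetric random-walk proposal $N(\lambda, I)$, the proposal densities cancel in the acceptance ratio.

```python
import numpy as np

def exchange_within_move(lam, x_stats, log_pi_u, log_prior, sample_stats, rng):
    """One within-model exchange update (a sketch of Algorithm 1).
    log_pi_u(stats, lam): log unnormalized likelihood log pi_u(.|lam);
    log_prior(lam): log prior density;
    sample_stats(lam): draws auxiliary data y ~ f(.|lam) and returns its
    sufficient statistics. All three are user-supplied. The proposal is the
    symmetric random walk N(lam, I), so q cancels in the ratio."""
    lam_prop = lam + rng.standard_normal(lam.shape)
    # enforce the identifiability constraints lam_1 >= ... >= lam_{q-1} >= 0
    if np.any(np.diff(lam_prop) > 0) or lam_prop[-1] < 0:
        return lam
    y_stats = sample_stats(lam_prop)  # auxiliary data drawn at the proposal
    log_r = (log_pi_u(x_stats, lam_prop) + log_prior(lam_prop)
             + log_pi_u(y_stats, lam)
             - log_pi_u(x_stats, lam) - log_prior(lam)
             - log_pi_u(y_stats, lam_prop))
    return lam_prop if np.log(rng.uniform()) < log_r else lam
```

Proposals violating the ordering constraint are rejected immediately, mirroring the automatic rejection described for the across-model sampler below.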

We now show how across-model updates can be performed in our setting using an exchange version of RJMCMC proposed by Caimo and Friel (Citation2013), in order to provide model evidence in favor of competing Bingham models (Algorithm 2). We denote model $k$ by an indicator $m_k$. Within the algorithm, a move from model $k$ to $k'$ is proposed from a proposal distribution $h(k' \mid k)$, and given the proposed value $k'$, a candidate value for the model-specific parameter $\lambda_{k'}$ is then drawn from a proposal density $w_{k'}(\lambda_{k'} \mid \phi_{k'}, m_{k'})$. The latter is a model-specific proposal density with parameters $\phi_{k'}$, from which $\lambda_{k'}$ is drawn independently of the current value of $\lambda_k$. Hence, this is a form of independence sampler, which obviates the need to tune parameters for the proposal densities for trans-dimensional moves in RJMCMC, often a very challenging task.

Algorithm 2

Across-model move. Input: current state $(\lambda_k, k)$. Output: new state $(\lambda_{k'}, k')$.

1. Draw $k' \sim h(\cdot \mid k)$;

2. Given $k'$, draw $\lambda_{k'} \sim w_{k'}(\cdot \mid \phi_{k'}, m_{k'})$

3. Given $\lambda_{k'}$, draw $\tilde{y} \sim f(\cdot\,; \lambda_{k'}, m_{k'})$

4. Accept the move $(\lambda_k, k) \to (\lambda_{k'}, k')$ with probability $\alpha = \min\{1, r\}$, where $r$ is given in (5).

Values for the parameters $\phi_k$ which parameterize the densities $w_k(\lambda_k \mid \phi_k, m_k)$ can be obtained by fitting the Bingham distribution to the individual models $k = 1, \ldots, K$ using pilot MCMC runs on each model separately (Fallaize and Kypraios Citation2016), prior to running the main RJMCMC sampler. In our examples, we find that a multivariate normal (MVN) distribution can approximate $\pi(\lambda_k \mid \tilde{x})$ sufficiently well, and hence $\phi_k$ consists of a mean vector and covariance matrix for the parameters $\{\lambda_1^{(k)}, \ldots, \lambda_{q-1}^{(k)}\}$ corresponding to model $k$, estimated from the pilot runs. (The conclusion that the MVN approximation is adequate for use as a proposal distribution was reached by monitoring the performance of the MCMC, where good mixing and acceptance rates were observed, and by visually comparing the fitted MVN contours with those from a kernel density estimate.) Proposals which do not respect the constraints on the parameters are automatically rejected.

Given a proposal $\lambda_{k'}$, auxiliary data $\tilde{y}$ are drawn from $f(\cdot\,; \lambda_{k'}, m_{k'})$ (Step 3), the sampling mechanism of the data under model $k'$ with parameter $\lambda_{k'}$. In our case, this will be a form of Bingham distribution, for which exact samples can be drawn using the algorithm of Kent, Ganeiber, and Mardia (Citation2018). The overall acceptance probability is given by $\alpha((\lambda_k, k); (\lambda_{k'}, k')) = \min\{1, r\}$, where
$$r = \frac{f(\tilde{x}; \lambda_{k'}, k') \, f(\tilde{y}; \lambda_k, k) \, \pi(\lambda_{k'} \mid k') \, p(k') \, w_k(\lambda_k \mid \phi_k, k) \, h(k \mid k')}{f(\tilde{y}; \lambda_{k'}, k') \, f(\tilde{x}; \lambda_k, k) \, \pi(\lambda_k \mid k) \, p(k) \, w_{k'}(\lambda_{k'} \mid \phi_{k'}, k') \, h(k' \mid k)} = \frac{\pi_u(\tilde{x}; \lambda_{k'}, k') \, \pi_u(\tilde{y}; \lambda_k, k) \, \pi(\lambda_{k'} \mid k') \, p(k') \, w_k(\lambda_k \mid \phi_k, k) \, h(k \mid k')}{\pi_u(\tilde{y}; \lambda_{k'}, k') \, \pi_u(\tilde{x}; \lambda_k, k) \, \pi(\lambda_k \mid k) \, p(k) \, w_{k'}(\lambda_{k'} \mid \phi_{k'}, k') \, h(k' \mid k)}. \qquad (5)$$

This is the Metropolis-Hastings ratio of the move on the augmented state space $\{\tilde{y}, \lambda_k, k\}$, and the choice of proposal for the auxiliary data ($\tilde{y} \sim f(\cdot\,; \lambda_{k'}, k')$) ensures that the ratio $c(\lambda_k)/c(\lambda_{k'})$ appears in both numerator and denominator and hence cancels. This is an RJMCMC version of the exchange algorithm (Murray, Ghahramani, and MacKay Citation2006), as described in Caimo and Friel (Citation2013). The auxiliary data $\tilde{y}$ are drawn from the same sampling mechanism as the observed data, with parameters set at the candidate state $(\lambda_{k'}, k')$. Intuitively, the roles of $\tilde{x}$ and $\tilde{y}$ are reversed (or "exchanged"): the auxiliary data $\tilde{y}$ are offered to the current state $(\lambda_k, k)$ in exchange for the observed data $\tilde{x}$, in order to persuade a move to the candidate state $(\lambda_{k'}, k')$.
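In log form, the ratio (5) can be assembled directly from the unnormalized likelihoods and the various prior and proposal densities, with $c(\lambda)$ never appearing. A Python sketch, with all densities passed in as user-supplied callables (the interface is ours, not the paper's):

```python
import numpy as np

def log_accept_ratio(x_stats, y_stats, lam_cur, k_cur, lam_prop, k_prop,
                     log_pi_u, log_prior, log_p_model, log_w, log_h):
    """Log of r in (5) for the across-model exchange move. The auxiliary
    statistics y_stats must come from data drawn at (lam_prop, k_prop);
    that choice is what makes the intractable constants c(lam) cancel.
    Callables: log_pi_u(stats, lam, k), log_prior(lam, k), log_p_model(k),
    log_w(lam, k) (the fitted MVN proposal), log_h(k_to, k_from)."""
    log_num = (log_pi_u(x_stats, lam_prop, k_prop)
               + log_pi_u(y_stats, lam_cur, k_cur)
               + log_prior(lam_prop, k_prop) + log_p_model(k_prop)
               + log_w(lam_cur, k_cur) + log_h(k_cur, k_prop))
    log_den = (log_pi_u(y_stats, lam_prop, k_prop)
               + log_pi_u(x_stats, lam_cur, k_cur)
               + log_prior(lam_cur, k_cur) + log_p_model(k_cur)
               + log_w(lam_prop, k_prop) + log_h(k_prop, k_cur))
    return log_num - log_den
```

The move is then accepted when a uniform draw satisfies $\log u < \log r$, exactly as in the within-model case.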

Therefore, this is a form of reversible jump MCMC algorithm which avoids the need to evaluate the awkward normalizing constants in the same spirit as the exchange algorithm of Murray, Ghahramani, and MacKay (Citation2006), which Caimo and Friel (Citation2013) call the Auto-RJ exchange algorithm.

Algorithm 3

Reversible Jump exchange algorithm

1. for k=1,,K do

2. Draw samples {λk} from π(λk|x˜,k)

3. Set μk and Σk as sample mean and sample covariance of {λk}

4. Assign $w_k(\lambda_k \mid \phi_k, m_k) = \mathrm{MVN}(\mu_k, \Sigma_k)$ in Step 2 of Algorithm 2

5. end for

6. repeat

7. perform a within-model update of λk using Algorithm 1

8. perform an across-model move using Algorithm 2

9. until the desired number of posterior samples has been obtained

The process is summarized in Algorithm 3. In our applications, there are $K = 2$ competing models, and we set $h(k' \mid k) = 1$ for $k' \ne k$, so that a move to the other model is proposed with certainty.
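Steps 2-4 of Algorithm 3 amount to fitting an MVN proposal to pilot posterior samples for each model. A possible Python sketch (helper names are ours):

```python
import numpy as np

def fit_mvn_proposal(pilot_samples):
    """Fit the MVN proposal w_k of Algorithm 3 from pilot posterior samples
    of model k (rows = draws). Returns a log-density evaluator and a sampler,
    both based on the sample mean and covariance."""
    mu = pilot_samples.mean(axis=0)
    Sigma = np.cov(pilot_samples, rowvar=False)
    L = np.linalg.cholesky(Sigma)  # for stable density evaluation and sampling

    def log_w(lam):
        # MVN(mu, Sigma) log-density at lam
        d = len(mu)
        z = np.linalg.solve(L, lam - mu)
        return -0.5 * (z @ z) - np.log(np.diag(L)).sum() - 0.5 * d * np.log(2 * np.pi)

    def draw(rng):
        return mu + L @ rng.standard_normal(len(mu))

    return log_w, draw
```

In the main sampler, `draw` supplies candidate values for the trans-dimensional move and `log_w` enters the acceptance ratio (5); draws violating the ordering constraints on $\lambda$ are rejected.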

4.4 Estimating Model Evidence

In this section we present a within-model approach for estimating the model evidence based on the work of Chib (Citation1995). Suppose that data $\tilde{x} = (x_1, x_2, \ldots, x_n)$ are assumed to have been generated from a Bingham distribution with parameter $\lambda$. The approach follows from noticing that, for any $\lambda$, (4) implies that
$$\pi(\tilde{x}) = \frac{\pi(\tilde{x} \mid \lambda) \, \pi(\lambda)}{\pi(\lambda \mid \tilde{x})} = \frac{\pi_u(\tilde{x} \mid \lambda) \, \pi(\lambda)}{c(\lambda)^n \, \pi(\lambda \mid \tilde{x})}.$$

Hence, to estimate the model evidence $\pi(\tilde{x})$ we require both an estimate of the (normalized) posterior density at $\lambda$, $\pi(\lambda \mid \tilde{x})$, and an estimate of the normalizing constant $c(\lambda)$. For the latter, we used the method of Kent (Citation1987), based on asymptotic expansions. Another possibility is the holonomic gradient method of Sei and Kume (Citation2015), as implemented in the R package hgm. We also used this method, but occasionally encountered numerical instability in computing the normalizing constant for some of the datasets we simulated. Avoiding such potential problems is one argument in favor of methods which do not require computation of the constant. With regard to estimating the posterior density at $\lambda$, once we have samples from $\pi(\lambda \mid \tilde{x})$ using the exchange algorithm of Fallaize and Kypraios (Citation2016), we can produce a kernel density estimate of the posterior density at $\lambda$. Of course, in practice, because of the curse of dimensionality this approach can only be used when the number of parameters is fairly small. However, this is not uncommon in applications of the Bingham distribution, where observed data are typically points on $S^2$.
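A sketch of the resulting evidence estimate in Python (not the paper's R code): the user supplies the unnormalized log-likelihood, an approximation to $\log c(\lambda)$ (e.g., Kent's asymptotic expansion), and the log-prior, and the posterior density at $\lambda$ is estimated here with a simple product-Gaussian KDE with Scott's-rule bandwidths rather than any specific package.

```python
import numpy as np

def log_evidence_chib(lam, post_samples, loglik_u, log_c, log_prior, n):
    """Chib-style estimate of log pi(x) evaluated at a high-posterior point lam:
    log pi(x) = log pi_u(x|lam) - n*log c(lam) + log pi(lam) - log pihat(lam|x).
    loglik_u, log_c and log_prior are user-supplied callables; the posterior
    density at lam is a product-Gaussian KDE of the posterior samples."""
    S = np.atleast_2d(post_samples)
    m, d = S.shape
    h = S.std(axis=0, ddof=1) * m ** (-1.0 / (d + 4))  # Scott's rule
    z = (S - np.asarray(lam, float)) / h
    kern = np.exp(-0.5 * (z ** 2).sum(axis=1)) / np.prod(h * np.sqrt(2 * np.pi))
    log_post_at_lam = np.log(kern.mean())
    return loglik_u(lam) - n * log_c(lam) + log_prior(lam) - log_post_at_lam
```

A point of high posterior density (e.g., the posterior mean of the samples) is a sensible choice for $\lambda$, since the KDE is most accurate there.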

4.5 Model Selection via Contour Probabilities

We now consider an alternative approach, which uses posterior contour probabilities (Held Citation2004) of a suitable parameter in order to provide a measure of the evidence against the simpler model. The probability can be interpreted in a similar fashion to a p-value, and henceforth any reference to p-values will mean the posterior contour probability described below.

Suppose that we are given two independent samples of unit vectors in $S^{q-1}$, $\tilde{x} = (x_1, x_2, \ldots, x_n)$ and $\tilde{y} = (y_1, y_2, \ldots, y_m)$, respectively. Let us assume that $\tilde{x}$ and $\tilde{y}$ are independent random samples from Bingham distributions with probability density functions $f(x; \lambda_x)$ and $f(y; \lambda_y)$, respectively, where $\lambda_x = (\lambda_1^{(x)}, \ldots, \lambda_{q-1}^{(x)})$ and $\lambda_y = (\lambda_1^{(y)}, \ldots, \lambda_{q-1}^{(y)})$.

Assessing whether or not the samples $\tilde{x}$ and $\tilde{y}$ come from the same Bingham distribution is equivalent to testing the hypothesis $\lambda_x = \lambda_y$ versus $\lambda_x \ne \lambda_y$. We propose fitting a Bingham distribution to the combined dataset $\tilde{\omega} = (\tilde{x}, \tilde{y}) = (x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_m)$. By independence, the joint density of $\tilde{\omega}$ is $\prod_{i=1}^{n+m} f(\tilde{\omega}_i; \lambda^{(i)})$. Assuming a particular structure for the $\{\lambda^{(i)}\}$ allows us to assess the evidence against $\lambda_x = \lambda_y$. Specifically, let $z_i = I(\tilde{\omega}_i \text{ is a } y \text{ sample})$, $i = 1, \ldots, n+m$, where $I(\cdot)$ is the indicator function. Then, we endow the $\{\lambda^{(i)}\}$ with the structure
$$\lambda^{(i)} = \lambda_x + z_i \alpha, \qquad (6)$$
where $\alpha = (\alpha_1, \ldots, \alpha_{q-1})^T \in \mathbb{R}^{q-1}$. In other words, the combined data constitute $n + m$ independent observations from the Bingham distribution, where the $\tilde{x}$ samples take parameter value $\lambda_x = (\lambda_1, \lambda_2, \ldots, \lambda_{q-1})$ and the $\tilde{y}$ samples take parameter value $\lambda_y = (\lambda_1 + \alpha_1, \lambda_2 + \alpha_2, \ldots, \lambda_{q-1} + \alpha_{q-1})$, with $\alpha_1, \ldots, \alpha_{q-1} \in \mathbb{R}$.

Testing $\lambda_x = \lambda_y$ versus $\lambda_x \ne \lambda_y$ is equivalent to testing $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_{q-1} = 0$ versus $H_1:$ at least one $\alpha_i$ is not zero. Therefore, we can fit the Bingham distribution to the combined dataset $\tilde{\omega}$ within a Bayesian framework, where each observation $\tilde{\omega}_i$ has parameter $\lambda^{(i)}$ of the form (6), and then determine whether the posterior distribution of the coefficients $\alpha_i$, $i = 1, \ldots, q-1$, is "significantly" different from the zero vector. At first glance, one might investigate the corresponding univariate credible intervals of some predefined level separately. However, it is rather unclear how this would translate into a simultaneous probability statement, since the coefficients are not independent. One option is to calculate simultaneous credible bands at various levels and determine the smallest level such that the point of interest (i.e., the zero vector in our case) is contained in the simultaneous credible band. Held (Citation2004) proposed instead to use (multivariate) highest posterior density (HPD) regions to check the support of the posterior distribution for a certain parameter vector of interest. Held (Citation2004) defined the (posterior) contour probability $\Pr(\alpha_0 \mid \tilde{\omega})$ as 1 minus the content of the HPD region of the posterior distribution $\pi(\alpha \mid \tilde{\omega})$ which just covers $\alpha_0$:
$$\Pr(\alpha_0 \mid \tilde{\omega}) = \Pr\left(\pi(\alpha \mid \tilde{\omega}) \le \pi(\alpha_0 \mid \tilde{\omega}) \,\middle|\, \tilde{\omega}\right).$$

If the functional form of the posterior density of $\alpha$ were available analytically, then we could obtain an unbiased estimate of the contour probability by Monte Carlo; in other words, by computing the proportion of posterior samples with density value less than or equal to $\pi(\alpha_0 \mid \tilde{\omega})$. In our context, the functional form of the posterior density is unavailable, and therefore we use the modified Rao-Blackwell estimate of the simple Monte Carlo estimator proposed by Held (Citation2004) and implemented in the bayesSurv R package. Using the algorithm proposed in Fallaize and Kypraios (Citation2016), one can sample from the posterior distribution of the parameters $(\lambda_x, \alpha)$, and then determine the posterior simultaneous credible region of the coefficients $\alpha_i$, $i = 1, \ldots, q-1$.

4.6 Computational Intensity

In applications of practical interest, observations are unit vectors in $\mathbb{R}^3$, which is the situation for our numerical results in Sections 5 and 6. The most intensive case we have considered in our simulations is the reversible jump exchange algorithm with a total of 1000 observations (which is larger than typical axial datasets). Running the sampler for 100,000 sweeps took 4 min on a 3.4-GHz CPU using a single core. The other two approaches require only runs of the within-model sampler of Fallaize and Kypraios (Citation2016), which is less intensive. The additional computation of the HPD region for the contour method is minimal, and ran in a matter of seconds using the simult.pvalue function from the bayesSurv R package.

For higher dimensions, that is, data in $\mathbb{R}^q$ for $q$ larger than 3, there will of course be increased computational cost. However, our focus in this article is on investigating the use of the computational methods considered in the cases of most practical importance for real applications of axial data.

5 Simulation Study

The aims of this section are to (i) investigate the performance of the proposed methods and the extent to which they agree and (ii) assess the extent to which it is possible to discriminate between models as parameters and sample sizes vary. We address these aims by performing simulation studies in which datasets of varying size using different parameter combinations are generated, to which we apply our model selection methods.

In this section, and also in the real data application in Section 6, the observations are unit vectors in R3, so the Bingham distribution has two (eigenvalue) parameters λ1 ≥ λ2 ≥ 0 to be inferred. For a prior distribution, we use a product of exponential distributions subject to this constraint, that is, π(λ1, λ2) ∝ exp{−λ1μ1 − λ2μ2}, λ1 ≥ λ2 ≥ 0, where the μi are rate parameters. In all of our numerical results we have used μi = 0.01.
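As a concrete illustration, the unnormalized log-density of this constrained prior can be sketched as follows (a Python sketch rather than the authors' R code; the function name is ours):

```python
import numpy as np

def log_prior(lam1, lam2, mu1=0.01, mu2=0.01):
    """Unnormalized log-density of the constrained exponential prior
    pi(lam1, lam2) proportional to exp(-lam1*mu1 - lam2*mu2),
    supported on lam1 >= lam2 >= 0."""
    if not (lam1 >= lam2 >= 0.0):
        return float("-inf")   # outside the support of the prior
    return -lam1 * mu1 - lam2 * mu2
```

Returning −∞ outside the ordering constraint is the usual way to enforce such a support restriction inside a Metropolis-Hastings acceptance ratio: any proposal violating λ1 ≥ λ2 ≥ 0 is rejected automatically.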

5.1 Comparison

In order to test the performance of the different methods we have proposed, we performed simulations for two different parameter cases. In the first case, we set λx = λy = (20, 10), and in the second case we set λx = (20, 10) and λy = (20, 6). In both cases we simulated data for two different sample sizes, n = m = 20 and n = m = 200. For the four combinations of parameters and sample size, we simulated 200 sets of data and performed inference using each of the three approaches we have introduced. The results are presented in Figure 2, where m1 denotes the event that both populations have the same Bingham distribution. The first row shows the results for λx = λy = (20, 10) when n = 20 (left) and n = 200 (right). For both sample sizes, there is very good agreement between RJMCMC and Chib's method. As expected, for the larger sample size the (true) simpler model is favored for most of the datasets, whereas there is more variability in p(m1) for the smaller sample size of 20. For the contour probabilities, with a sample size of 20, the smallest p-value was 0.046, and all other p-values were greater than 0.1 (75% were greater than 0.5). When the sample size was increased to 200, there were some smaller p-values, though all but 9 remained greater than 0.05, as we would expect. In summary, the methods agree well and correctly favor the simpler model.

Fig. 2 Results from the simulation experiments described in Section 5.1. The plots show the model evidence for the simpler model, p(m1), using Chib’s method and RJMCMC. The bottom row plots also encode information about the contour probabilities. Top row: λx=λy=(20,10). Bottom row: λx=(20,10) and λy=(20,6). Left column: n=m=20. Right column: n=m=200.


The second row shows the results when λx = (20, 10) and λy = (20, 6). Again, we see good agreement between the two approaches for calculating p(m1). When n = 20 (left), the majority of the points are concentrated in the upper right of the plot, with p(m1) > 0.8, despite the samples coming from different Bingham distributions. In fact, in the next section we investigate further the ability to discriminate between models for various parameter settings, and see that it is quite hard to distinguish between different Bingham distributions for the parameters used here when sample sizes are small. When n = 200 (right), there is a dense cluster of points in the bottom left of the plot (p(m1) < 0.2), but a fairly uniform spread of points across the remaining range, indicating that it is still reasonably challenging to discriminate between the two Bingham distributions when the sample size is 200 per group. We also note that, for the points corresponding to larger estimates of p(m1), the p-values from the contour method are larger too; for example, the points in the upper right of the plot have p-values greater than 0.01. So the three approaches are again in good agreement.

5.2 Model Discrimination

Now that we have a reliable method for determining model evidence, we attempt to answer the following question: how well can we discriminate between models in different parameter and sample size settings? Here, we present some further simulation experiments to investigate this. In particular, we wish to know how well we can discriminate between models when the parameters in each population are close together or far apart, for different sample sizes. Specifically, we set λx = (20, 10) and consider λy of the form (20, λ2(y)), with λ2(y) ∈ {2, 4, 6, 8, 10}. Hence, the cases range from equal parameters through to quite different parameters, in terms of the second eigenvalue. In each case, we also vary the sample size, with n = m and n ∈ {20, 100, 200, 500}. For each scenario, we simulated 200 datasets and ran the reversible jump method. Let p1 be the posterior probability of the simpler model, that is, the model where both populations have the same Bingham distribution. We estimate P(p1 > 0.5) by the proportion of the 200 repetitions for which our estimate of p1 was greater than 0.5; the results are plotted in Figure 3. We see that, as expected, it becomes easier to discriminate between models as n increases and when the parameters are either very similar or very different. More specifically, once we have n = 100, there is a high chance of choosing the correct model when λ2(y) = 2, 4, or 10. When n = 200 or 500, most cases are clear, apart from λ2(y) = 8, in which case there is still a good chance that the simpler model will be selected even though the parameters are different. For the case n = 200, λ2(y) = 6, there is also a probability of around 0.4 that the simpler model will be preferred, which agrees with our observations in Section 5.1.
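The summary statistic used here is simply an empirical proportion over the simulated datasets; given a hypothetical vector of estimated p1 values (one per dataset), it could be computed as:

```python
import numpy as np

def prob_prefer_simpler(p1_estimates, threshold=0.5):
    """Estimate P(p1 > threshold) as the proportion of simulated
    datasets for which the posterior probability of the simpler
    model exceeded the threshold."""
    return np.mean(np.asarray(p1_estimates) > threshold)
```

With 200 repetitions per scenario, the Monte Carlo standard error of this proportion is at most sqrt(0.25/200) ≈ 0.035, so the patterns across scenarios are well resolved.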

Fig. 3 Results from the simulation experiments described in Section 5.2. For each sample size/parameter combination, we simulated 200 datasets and computed the proportion of times that the simpler model was favored, as determined by the rule p1>0.5.


6 Earthquake Data Results

We now turn our attention to the motivating application described in Section 2. Recall that the data consist of three orthogonal axes arising from earthquakes in three situations: 50 observations near Christchurch which took place before a large earthquake on the 22nd of February 2011; 50 observations near Christchurch which took place after this large earthquake; and 32 observations obtained from earthquakes in the north of New Zealand’s South Island. We will label the three samples as CCA (Christchurch before the large earthquake), CCB (Christchurch after the large earthquake) and SI (South Island). We will model the B axes from each earthquake, and assess the evidence that the samples come from the same Bingham distribution.

Fallaize and Kypraios (Citation2016) analyzed these data in a Bayesian framework by fitting Bingham models to the B axes from each of the individual clusters and considering the posterior distributions of the parameters of the diagonal component of the Bingham parameter matrix. Let us denote these parameters from the CCA, CCB, and SI models as λiA, λiB, and λiS, respectively, i = 1, 2. The authors investigated informally whether there was any evidence of a difference between the two Christchurch clusters by considering the bivariate quantity (λ1A − λ1B, λ2A − λ2B), approximating its posterior distribution with a bivariate Normal distribution, and visually assessing whether the origin (0, 0) was contained comfortably within a 95% probability region. Here, we perform formal model selection to assess the evidence that the samples of B axes come from the same, or different, Bingham distributions. We perform two pairwise comparisons: CCA-CCB and CCA-SI.

Let the data for the two samples under consideration be stored in matrices x˜ and y˜, where the columns of x˜ are the observed axes xi ∈ S2, i = 1, …, n, and similarly for y˜, which contains m observed axes. The combined data are contained in the 3 × (n + m) matrix ω˜. In practice, we take our data for x˜ to be the eigenvalues τ˜x = {τjx, j = 1, …, 3 : τ1x < τ2x < τ3x, Σj τjx = 1} of the sample moment of inertia matrix x˜x˜T/n, and similarly for y˜. This is equivalent to first rotating to principal axes and taking V to be the identity matrix. (Recall that A = VΛVT is the Singular Value Decomposition of the Bingham parameter matrix A.) For the combined data (i.e., when fitting one Bingham model to the combined samples) we can then take the data to be the weighted average of τ˜x and τ˜y.
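This data reduction can be sketched in code (a numpy sketch under our reading of the text, not the authors' R implementation; weighting the combined summary by the sample sizes n and m is an assumption):

```python
import numpy as np

def moment_of_inertia_eigenvalues(X):
    """Eigenvalues of the sample moment of inertia matrix X X^T / n,
    where the columns of the 3 x n matrix X are observed unit axes.
    Returned in increasing order; they sum to 1 because each column
    has unit norm, so trace(X X^T / n) = 1."""
    n = X.shape[1]
    T = X @ X.T / n
    return np.linalg.eigvalsh(T)   # eigvalsh returns ascending order

def combined_eigenvalues(tau_x, tau_y, n, m):
    """Sample-size-weighted average of the two samples' eigenvalue
    summaries (assumed form of the 'weighted average' in the text),
    used when fitting a single Bingham model to both samples."""
    return (n * np.asarray(tau_x) + m * np.asarray(tau_y)) / (n + m)
```

Because the eigenvalues of each moment of inertia matrix sum to 1, the weighted average inherits the same constraint, so the combined summary lies on the same simplex as the per-sample summaries.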

The results are summarized in Table 1. For the CCA-CCB comparison, the model evidence strongly favors the simpler model, whereas for the CCA-SI comparison, the model evidence supports the case that the data come from two different Bingham populations. The contour probabilities lead to the same conclusions. This suggests that there is no difference in the pattern of Christchurch earthquakes before and after the large event, but that Christchurch and South Island earthquakes follow different patterns. Regarding a test of equality of two populations based on full axial frame data, Arnold and Jupp (Citation2013) obtained a p-value of 0.890 for the CCA-CCB comparison and a p-value of less than 0.001 for the CCA-SI comparison, so our results agree with these conclusions.

Table 1 Results from earthquake data.

The plots from the contour method are shown in Figure 4. For the CCA-CCB comparison, the point (0, 0) is comfortably contained within the main point cloud, whereas for the CCA-SI comparison it lies outside it, agreeing with the conclusions from the numerical results in Table 1.

Fig. 4 Samples of (α1,α2) from the contour method for the CCA-CCB data (left) and CCA-SI data (right). The cross marks the point (0, 0) and the triangle denotes the maximum likelihood estimate.


7 Discussion

In this article, we have introduced three strategies for performing model choice for the Bingham distribution in the Bayesian setting, which is not a straightforward task due to the intractable nature of the normalizing constant. Two of the approaches avoid the need to approximate this constant, while the third uses an approximation. Application to simulated data shows that the methods can successfully discriminate between models, but if the parameters are not very different, relatively large sample sizes are needed to detect a true difference. Analysis of a real dataset shows how formal model choice can be performed in a situation of practical interest to geophysicists.

Although we have focused on the Bingham distribution, the methods could be useful for other distributions used in directional statistics that have intractable normalizing constants. Also, although the RJMCMC and Chib's methods are in theory applicable to any number of models, we have focused on choosing between two models in our applications. It would be interesting to investigate performance for other distributions and as the number of competing models increases.

Supplementary Materials

R code: The R code to run our methods is available in the online supplementary materials. This includes a "readme" file explaining how to run the code for each of our methods. (BinghamCode.tar.gz, GNU zipped tar file).


Disclosure Statement

The authors report there are no competing interests to declare.

References

  • Arnold, R., and Jupp, P. E. (2013), “Statistics of Orthogonal Axial Frames,” Biometrika, 100, 571–586. DOI: 10.1093/biomet/ast017.
  • Bingham, C. (1974), “An Antipodally Symmetric Distribution on the Sphere,” The Annals of Statistics, 2, 1201–1225. DOI: 10.1214/aos/1176342874.
  • Caimo, A., and Friel, N. (2013), “Bayesian Model Selection for Exponential Random Graph Models,” Social Networks, 35, 11–24. DOI: 10.1016/j.socnet.2012.10.003.
  • Chib, S. (1995), “Marginal Likelihood from the Gibbs Output,” Journal of the American Statistical Association, 90, 1313–1321. DOI: 10.1080/01621459.1995.10476635.
  • Dryden, I. L. (2005), “Statistical Analysis on High-Dimensional Spheres and Shape Spaces,” The Annals of Statistics, 33, 1643–1665. DOI: 10.1214/009053605000000264.
  • Fallaize, C. J., and Kypraios, T. (2016), “Exact Bayesian Inference for the Bingham Distribution,” Statistics and Computing, 26, 349–360. DOI: 10.1007/s11222-014-9508-7.
  • Friel, N., and Wyse, J. (2012), “Estimating the Evidence—A Review,” Statistica Neerlandica, 66, 288–308. DOI: 10.1111/j.1467-9574.2011.00515.x.
  • Green, P. J. (1995), “Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination,” Biometrika, 82, 711–732. DOI: 10.1093/biomet/82.4.711.
  • Held, L. (2004), “Simultaneous Posterior Probability Statements from Monte Carlo Output,” Journal of Computational and Graphical Statistics, 13, 20–35. DOI: 10.1198/1061860043083.
  • Jeffreys, H. (1998), The Theory of Probability, Oxford: Oxford University Press.
  • Kent, J. T. (1987), “Asymptotic Expansions for the Bingham Distribution,” Journal of the Royal Statistical Society, Series C, 36, 139–144. DOI: 10.2307/2347545.
  • Kent, J. T., Ganeiber, A. M., and Mardia, K. V. (2018), “A New Unified Approach for the Simulation of a Wide Class of Directional Distributions,” Journal of Computational and Graphical Statistics, 27, 291–301. DOI: 10.1080/10618600.2017.1390468.
  • Kume, A., and Walker, S. G. (2006), “Sampling from Compositional and Directional Distributions,” Statistics and Computing, 16, 261–265. DOI: 10.1007/s11222-006-8077-9.
  • Kume, A., and Walker, S. G. (2014), “On the Bingham Distribution with Large Dimension,” Journal of Multivariate Analysis, 124, 345–352.
  • Kume, A., and Wood, A. T. A. (2005), “Saddlepoint Approximations for the Bingham and Fisher-Bingham Normalising Constants,” Biometrika, 92, 465–476. DOI: 10.1093/biomet/92.2.465.
  • Kume, A., and Wood, A. T. A. (2007), “On the Derivatives of the Normalising Constant of the Bingham Distribution,” Statistics and Probability Letters, 77, 832–837.
  • Mardia, K. V., and Jupp, P. E. (2000), Directional Statistics, Chichester: Wiley.
  • Møller, J., Pettitt, A. N., Reeves, R., and Berthelsen, K. K. (2006), “An Efficient Markov Chain Monte Carlo Method for Distributions with Intractable Normalising Constants,” Biometrika, 93, 451–458. DOI: 10.1093/biomet/93.2.451.
  • Murray, I., Ghahramani, Z., and MacKay, D. J. C. (2006), “MCMC for Doubly-Intractable Distributions,” in Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), pp. 359–366, AUAI Press.
  • Sei, T., and Kume, A. (2015), “Calculating the Normalising Constant of the Bingham Distribution on the Sphere Using the Holonomic Gradient Method,” Statistics and Computing, 25, 321–332. DOI: 10.1007/s11222-013-9434-0.
  • Walker, S. G. (2011), “Posterior Sampling When the Normalizing Constant is Unknown,” Communications in Statistics—Simulation and Computation, 40, 784–792. DOI: 10.1080/03610918.2011.555042.
  • Walker, S. G. (2014), “Bayesian Estimation of the Bingham Distribution,” Brazilian Journal of Probability and Statistics, 28, 61–72.