
Likelihood-Free Parameter Estimation with Neural Bayes Estimators

Matthew Sainsbury-Dale, Andrew Zammit-Mangion & Raphaël Huser
Pages 1-14 | Received 05 Mar 2023, Accepted 30 Jul 2023, Published online: 03 Oct 2023

Abstract

Neural Bayes estimators are neural networks that approximate Bayes estimators. They are fast, likelihood-free, and amenable to rapid bootstrap-based uncertainty quantification. In this article, we aim to raise awareness of this relatively new inferential tool among statisticians, and to facilitate its adoption by providing user-friendly open-source software. We also give attention to the ubiquitous problem of estimating parameters from replicated data, which we address using permutation-invariant neural networks. Through extensive simulation studies we demonstrate that neural Bayes estimators can be used to quickly estimate parameters in weakly identified and highly parameterized models with relative ease. We illustrate their applicability through an analysis of extreme sea-surface temperature in the Red Sea where, after training, we obtain parameter estimates and bootstrap-based confidence intervals from hundreds of spatial fields in a fraction of a second.

1 Introduction

The most popular methods for estimating parameters in parametric statistical models are those based on the likelihood function. However, it is not always possible to formulate the likelihood function (Diggle and Gratton 1984; Lintusaari et al. 2017) and, even when it is available, it may be computationally intractable. For example, popular models for spatial extremes are max-stable processes, for which the number of terms in the likelihood function grows faster than exponentially with the number of observations (Padoan 2010; Huser et al. 2019). One common workaround is to replace the full likelihood with a composite likelihood (e.g., Cox and Reid 2004; Varin and Vidoni 2005; Varin, Reid, and Firth 2011), but this usually results in a loss of statistical efficiency (Huser and Davison 2013; Castruccio, Huser, and Genton 2016), and it is not always clear how one should construct the composite likelihood. The related Vecchia approximation (Vecchia 1988) has been applied successfully in both Gaussian (e.g., Stein, Chi, and Welty 2004) and max-stable (Huser, Stein, and Zhong 2022) settings, but it, too, trades statistical efficiency for computational efficiency.

To bypass these challenges, several model-fitting methods have been developed that preclude the need to evaluate the likelihood function. The most popular of these so-called likelihood-free methods is approximate Bayesian computation (ABC; Beaumont, Zhang, and Balding 2002; Sisson, Fan, and Beaumont 2018). In its simplest form, ABC involves sampling parameters from the prior, simulating from the model, and retaining parameters as an approximate sample from the posterior if the simulated data are "similar" to the observed data, with similarity typically assessed by comparing low-dimensional summary statistics. ABC methods are sensitive to the choice of summary statistics, and they are notoriously difficult to calibrate: for example, the number of summary statistics and the tolerance used when comparing statistics affect both the computational efficiency and the statistical efficiency of ABC.

Recently, neural networks have emerged as a promising approach to likelihood-free inference. In this work, we focus on neural networks that map data to parameter point estimates; we refer to such neural networks as neural point estimators. Neural point estimators date back to at least Chon and Cohen (1997), but they have only been adopted widely in recent years, for example in applications involving models for stock returns (Creel 2017); population genetics (Flagel, Brandvain, and Schrider 2018); time series (Rudi, Julie, and Lenzi 2021); spatial fields (Gerber and Nychka 2021; Banesh et al. 2021; Lenzi et al. 2023); spatio-temporal fields (Zammit-Mangion and Wikle 2020); and agent-based models (Gaskin, Pavliotis, and Girolami 2023). The computational bottleneck in neural point estimation is the training procedure, whereby the neural network "learns" a useful mapping between the sample space and the parameter space. Importantly though, this training cost is amortized: for a given statistical model and prior distribution over the parameters, a trained neural point estimator with a sufficiently flexible architecture can be re-used for new datasets at almost no computational cost, provided that each data replicate has the same format as those used to train the point estimator (e.g., images of a pre-specified width and height). Uncertainty quantification of the estimates proceeds naturally through the bootstrap distribution, which is essentially available "for free" with neural point estimators since they are so fast to evaluate. As we shall show, neural point estimators can be trained to approximate Bayes estimators and, in this case, we refer to them as neural Bayes estimators.

Parameter estimation from replicated data is commonly required in statistical applications, and we therefore give this topic particular attention. Neural point estimation with replicated data is not straightforward; for example, Gerber and Nychka (2021) considered two approaches to handling replicated data, both with drawbacks. In their first approach, they trained a neural point estimator for a single spatial field, applied it to each field independently, and averaged the resulting estimates; we call this the "one-at-a-time" approach. This approach does not reduce the bias commonly seen in small-sample estimators as the sample size increases, and it is futile when the parameters are unidentifiable from a single replicate. In their second approach, they adapted the size of the neural estimator's input layer to match the number of independent replicates. The resulting estimator is not invariant to the ordering of the replicates; it results in an explosion of neural-network parameters as the number of replicates increases; and it requires a different architecture for every possible sample size, which reduces its applicability and generality.

In this article, we first clarify the connection between neural point estimators and classical Bayes estimators, which is sometimes misconstrued or ignored in the literature. Second, we propose a novel way to perform neural Bayes estimation from independent replicates by leveraging permutation-invariant neural networks, constructed using the DeepSets framework (Zaheer et al. 2017). To the best of our knowledge, this article is the first to explore the use of DeepSets, and the related practical considerations, in a point-estimation context. We show that these architectures lead to a substantial improvement in estimation accuracy when compared to those that do not account for replication appropriately. Third, we discuss important practicalities for designing and training neural Bayes estimators and, in particular, we describe a way to construct an estimator that is approximately Bayes for any sample size. For illustration, we estimate parameters in Gaussian and max-stable processes, as well as in the highly parameterized spatial conditional extremes model (Wadsworth and Tawn 2022). We use the latter model in an analysis of sea-surface temperature extremes in the Red Sea where, using a neural Bayes estimator, we obtain estimates and bootstrap confidence intervals from hundreds of spatial fields in a fraction of a second. A primary motivation of this work is to facilitate the adoption of neural point estimation by statisticians and, to this end, we accompany the article with user-friendly open-source software in the Julia (Bezanson et al. 2017) and R (R Core Team 2023) programming languages: our software package, NeuralEstimators, is available at https://github.com/msainsburydale/NeuralEstimators, and can be used in a wide range of applied settings.

The remainder of this article is organized as follows. In Section 2, we outline the theory underlying neural Bayes estimators and discuss their implementation. In Section 3, we conduct extensive simulation studies that clearly demonstrate the utility of neural Bayes estimators. In Section 4, we use the spatial conditional extremes model to analyze sea-surface temperature in the Red Sea. In Section 5, we conclude and outline avenues for future research. A supplement is also available that contains more details and figures.

2 Methodology

In Section 2.1, we introduce neural Bayes estimators. In Section 2.2, we discuss the use of permutation-invariant neural networks for neural Bayes estimation from replicated data. In Section 2.3, we describe the general workflow for implementing neural Bayes estimators, and discuss some important practical considerations.

2.1 Neural Bayes Estimators

A parametric statistical model is a set of probability distributions on a sample space S, where the probability distributions are parameterized via some p-dimensional parameter vector θ on a parameter space Θ (McCullagh 2002). Suppose that we have data, which we denote as Z, from one such distribution. Then, the goal of parameter point estimation is to produce an estimate of the unknown θ from Z using an estimator, θ̂ : S → Θ, which is a mapping from the sample space to the parameter space.

Estimators can be constructed within a decision-theoretic framework. Assume that the sample space is S = ℝ^n, and consider a nonnegative loss function, L(θ, θ̂(Z)), which assesses an estimator θ̂(·) for a given θ and dataset Z ∼ f(z | θ), where f(z | θ) is the probability density function of the data conditional on θ. An estimator's risk function is its loss averaged over all possible data realizations,

$$R(\theta, \hat{\theta}(\cdot)) \equiv \int_{\mathcal{S}} L(\theta, \hat{\theta}(z))\, f(z \mid \theta)\, \mathrm{d}z. \quad (1)$$

So-called Bayes estimators minimize a (weighted) average of (1) known as the Bayes risk,

$$r_{\Omega}(\hat{\theta}(\cdot)) \equiv \int_{\Theta} R(\theta, \hat{\theta}(\cdot))\, \mathrm{d}\Omega(\theta), \quad (2)$$

where Ω(·) is a prior measure for θ. Note that in (1) and (2) we deviate slightly from classical notation and write the estimator as θ̂(·), rather than just θ̂, to stress that it is constructed as a function of the data.

Bayes estimators are theoretically attractive: for example, unique Bayes estimators are admissible and, under suitable regularity conditions and the squared-error loss, are consistent and asymptotically efficient (Lehmann and Casella 1998, chap. 5, thm. 2.4; chap. 6, thm. 8.3). Further, for a large class of prior distributions, every set of conditions that imply consistency of the maximum likelihood estimator also imply consistency of Bayes estimators (Strasser 1981). Unfortunately, however, Bayes estimators are typically unavailable in closed form for the complex models often encountered in practice. A way forward is to assume a flexible parametric model for θ̂(·), and to optimize the parameters within that model in order to approximate the Bayes estimator. Neural networks are ideal candidates, since they are universal function approximators (e.g., Hornik, Stinchcombe, and White 1989; Zhou 2018), and because they are fast to evaluate, usually involving only simple matrix-vector operations.

Let θ̂(Z; γ) denote a neural point estimator, that is, a neural network that returns a point estimate from data Z, where γ contains the neural-network parameters (i.e., the so-called "weights" and "biases"). Bayes estimators may be approximated with θ̂(·; γ*) by solving the optimization problem

$$\gamma^* \equiv \underset{\gamma}{\arg\min}\; r_{\Omega}(\hat{\theta}(\cdot\,; \gamma)). \quad (3)$$

Typically, r_Ω(·) defined in (2) cannot be evaluated directly, but it can be approximated using Monte Carlo methods. Specifically, given a set ϑ of K parameter vectors sampled from the prior Ω(·) and, for each θ ∈ ϑ, J samples from f(z | θ) collected in the set 𝒵_θ,

$$r_{\Omega}(\hat{\theta}(\cdot\,;\gamma)) \approx \frac{1}{K} \sum_{\theta \in \vartheta} \frac{1}{J} \sum_{Z \in \mathcal{Z}_{\theta}} L(\theta, \hat{\theta}(Z; \gamma)). \quad (4)$$

Note that (4) does not involve evaluation, or knowledge, of the likelihood function.
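To make (4) concrete, the following is a minimal sketch of the Monte Carlo approximation in Julia, the language of our accompanying software. The functions `sample_prior()` (which draws θ ∼ Ω) and `simulate(θ)` (which draws Z ∼ f(z | θ)), and the estimator `θ̂(Z)`, are placeholders to be supplied by the user; the default loss is the absolute-error loss.

```julia
# Monte Carlo approximation (4) of the Bayes risk; `sample_prior`, `simulate`,
# and `θ̂` are user-supplied placeholders, not functions of any package.
function empirical_bayes_risk(θ̂, sample_prior, simulate;
                              K = 10_000, J = 1,
                              L = (θ, t) -> sum(abs, t .- θ))  # absolute-error loss
    risk = 0.0
    for _ in 1:K
        θ = sample_prior()                 # θ ~ Ω(·)
        for _ in 1:J
            risk += L(θ, θ̂(simulate(θ)))  # no likelihood evaluation required
        end
    end
    return risk / (K * J)
end
```

Note that the loop touches the model only through `simulate`, which is precisely why the approach is likelihood-free.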

The surrogate objective function (4) can be straightforwardly minimized with respect to γ using back-propagation and stochastic gradient descent with deep-learning software packages such as Flux (Innes 2018). For sufficiently expressive architectures, the point estimator targets a Bayes estimator with respect to L(·,·) and Ω(·). We therefore call the fitted neural point estimator a neural Bayes estimator. Note that neural Bayes estimators, like all other neural networks, are function approximators that, at least in theory, can approximate continuous functions arbitrarily well. However, the discrepancy between θ̂(·; γ*) and the true Bayes estimator will depend on a number of factors, such as the neural-network architecture and the specific optimization procedure used to minimize (4). We provide more discussion on these practical considerations in Section 2.3.

Like Bayes estimators, neural Bayes estimators target a specific point summary of the posterior distribution. For instance, the 0–1, absolute-error, and squared-error loss functions lead to neural Bayes estimators that approximate the posterior mode, median, and mean, respectively. Further, posterior quantiles, which can be used to construct credible intervals, may be estimated by using the quantile loss function (e.g., Cressie 2023, eq. 7). This important link between neural point estimators and Bayes estimators is often overlooked in the literature.
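As a small illustration, here is one common form of the quantile (pinball) loss at level τ; training a neural point estimator under this loss targets the posterior τ-quantile, so that training at, say, τ = 0.025 and τ = 0.975 yields approximate credible-interval bounds.

```julia
# Quantile (pinball) loss at level τ, summed over parameter components.
quantile_loss(θ, θ̂, τ) = sum(max.(τ .* (θ .- θ̂), (τ - 1) .* (θ .- θ̂)))

quantile_loss([1.0, 2.0], [0.8, 2.5], 0.975)   # example evaluation
```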

2.2 Neural Bayes Estimators for Replicated Data

While neural Bayes estimators have been employed, sometimes inadvertently, in various applications, much less attention has been given to neural Bayes estimators for replicated data. In what follows, we denote a collection of m independent and identically distributed (iid) replicates Z1, …, Zm as Z^(m) ≡ (Z1, …, Zm). In Section 2.2.1 we show how the so-called DeepSets framework is ideally placed for constructing Bayes estimators from replicated data; in Section 2.2.2 we discuss approaches one can adopt to design neural point estimators that are approximately Bayes for any number of replicates; and in Section 2.2.3 we illustrate the neural Bayes estimator for replicated data on a relatively simple problem for which the Bayes estimator is known in closed form.

2.2.1 The DeepSets Framework

Parameter estimation from replicated data is commonly required in statistical applications and, as discussed in Section 1, naïve approaches to constructing neural Bayes estimators for replicated data can lead to bias, an explosion in the number of neural-network parameters, and an inability to generalize to different sample sizes. We overcome these issues by leveraging an important property of Bayes estimators for replicated data, namely that they are invariant to permutations of the replicates. In Section S1 of the supplementary material, we give a formal proof of this property.

The permutation-invariance property of Bayes estimators for replicated data has largely been ignored to date in the context of neural point estimation. We propose enforcing our neural Bayes estimators to be permutation invariant by couching them in the DeepSets framework (Zaheer et al. 2017), a computationally convenient special case of Janossy pooling (Murphy et al. 2019). In this framework, the Bayes estimator is represented by

$$\hat{\theta}(Z^{(m)}; \gamma) = \phi(T(Z^{(m)}; \gamma_{\psi}); \gamma_{\phi}), \qquad T(Z^{(m)}; \gamma_{\psi}) = a(\{\psi(Z_i; \gamma_{\psi}) : i = 1, \dots, m\}), \quad (5)$$

where ψ: ℝ^n → ℝ^q and ϕ: ℝ^q → ℝ^p are (deep) neural networks parameterized by γψ and γϕ, respectively, γ ≡ (γϕ, γψ), and a: (ℝ^q)^m → ℝ^q is a permutation-invariant (set) function. Figure 1 shows a schematic of this representation. Each element aj(·), j = 1, …, q, of a(·) can be chosen to be a simple aggregation function, such as the elementwise sum, average, or maximum, but may also be parameterized (Soelch et al. 2019); in this article, we use the elementwise average. Beyond its attractive parsimonious form, (5) has appealing theoretical properties. For example, Han et al. (2022) show that functions of the form (5) can approximate any continuously differentiable permutation-invariant real-valued function for sufficiently large q, the dimension of T(·); see also Wagstaff et al. (2019, 2022) for relevant discussion. The permutation-invariance property of Bayes estimators (Section S1), coupled with the representational capacity of (5), makes DeepSets a principled and theoretically motivated framework for constructing neural Bayes estimators for replicated data.
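To make the representation concrete, here is a minimal sketch of (5) in Flux. The layer widths, and the use of dense layers for ψ(·), are illustrative only (in our simulation studies, ψ(·) is a CNN); `n` is the dimension of each vectorized replicate, `q` the dimension of the summary statistic T, and `p` the number of parameters.

```julia
using Flux, Statistics

n, q, p = 256, 64, 3
ψ = Chain(Dense(n => 128, relu), Dense(128 => q, relu))  # replicate-level network
ϕ = Chain(Dense(q => 128, relu), Dense(128 => p))        # maps T to θ̂

# Z is a vector of m replicates; the elementwise mean over replicates plays
# the role of the permutation-invariant aggregation a(·).
function deepsets_estimator(Z::Vector{<:AbstractVector})
    T = mean(ψ.(Z))            # a({ψ(Zᵢ) : i = 1, …, m})
    return ϕ(T)
end

Z = [randn(Float32, n) for _ in 1:10]   # m = 10 synthetic replicates
θ̂ = deepsets_estimator(Z)
```

Because the mean is invariant to the ordering of its inputs, the resulting estimator is permutation invariant by construction, for any number of replicates m.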

Fig. 1 Schematic of the DeepSets representation. Each independent replicate Zi is transformed independently using the function ψ(·). The set of transformed inputs is then aggregated elementwise using a permutation-invariant function, a(·), yielding the summary statistic T. Finally, the summary statistic is mapped to parameter estimates θ̂ by the function ϕ(·). Many classical estimators take this form; in this work, we use neural networks to model ψ(·) and ϕ(·).


Furthermore, the representation (5) is similar in form to many classical estimators when viewed as a nonlinear mapping ϕ(·) of summary statistics T(·). For example, best (i.e., minimum variance) unbiased estimators for exponential family models are of the form (5), where T(·) is sufficient for θ (Casella and Berger 2001, chap. 7). This connection provides an additional rationale for adopting the structure in (5), and provides interpretability: the functions ψ(·) and a(·) together extract summary statistics from the data, while ϕ(·) maps these learned summary statistics to parameter estimates.

Some summary statistics are available in closed form, simple to compute, and highly informative (e.g., sample quantiles). Denote these statistics as S(·). One may choose to incorporate these statistics explicitly by making ϕ(·) in (5) a function of both T(·) (learned) and S(·) (user-defined). The estimator remains permutation invariant provided that S(·) is permutation invariant. Since T(·) can, in theory, approximate arbitrarily well any summary statistic that is a continuous function of the data, the choice to include S(·) in (5) is mainly a practical one that could be useful in certain applications.

2.2.2 Variable Sample Size

Estimators of the form (5) can be applied to datasets of arbitrary size. However, the Bayes estimator for replicated data is a function of the number of replicates (see, e.g., (8) in the illustrative example of Section 2.2.3). Therefore, a neural Bayes estimator of the form (5) trained on datasets with m replicates will generally not be Bayes for datasets containing m̃ ≠ m replicates.

There are at least two approaches that could be adopted if one wishes to use a neural Bayes estimator of the form (5) with datasets containing an arbitrary number of replicates, m. First, one could train l neural Bayes estimators for different sample sizes, or groups thereof (e.g., a small-sample estimator and a large-sample estimator). Specifically, for sample-size changepoints m1 < m2 < ⋯ < m_{l−1}, one could construct a piecewise neural Bayes estimator,

$$\hat{\theta}(Z^{(m)}; \gamma^*) = \begin{cases} \hat{\theta}(Z^{(m)}; \gamma^*_{\tilde{m}_1}) & m \le m_1, \\ \hat{\theta}(Z^{(m)}; \gamma^*_{\tilde{m}_2}) & m_1 < m \le m_2, \\ \quad\vdots & \\ \hat{\theta}(Z^{(m)}; \gamma^*_{\tilde{m}_l}) & m > m_{l-1}, \end{cases} \quad (6)$$

where, here, γ* ≡ (γ*_{m̃1}, …, γ*_{m̃l}), and where γ*_{m̃} denotes the neural-network parameters optimized for sample size m̃; the sub-estimator θ̂(·; γ*_{m̃}) is then assumed to be approximately Bayes over the range of sample sizes for which it is applied in (6). Typically, m̃1 ≤ m1 and m1 < m̃2 ≤ m2, and so on. We find that this approach works well in practice, and that it is less computationally burdensome than it first appears when used in conjunction with pre-training (see Section 2.3). Note that the relative influence of the prior distribution diminishes as the sample size increases, so that only a single "large-sample" estimator is needed. Alternatively, one could treat the sample size as a random variable, M, with support over a set of positive integers, ℳ, in which case (1) becomes

$$R(\theta, \hat{\theta}(\cdot\,;\gamma)) = \sum_{m \in \mathcal{M}} \Pr(M = m) \left( \int_{\mathcal{S}^m} L(\theta, \hat{\theta}(z^{(m)}; \gamma))\, f(z^{(m)} \mid \theta)\, \mathrm{d}z^{(m)} \right). \quad (7)$$

This does not materially alter the workflow, except that one must sample the number of replicates before simulating data for (4). These two approaches can also be combined, so that each sub-estimator in (6) is trained using (7); a code sketch of both devices is given below. We illustrate the importance of accounting for the dependence of the Bayes estimator on m in Section S2 of the supplementary material.
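The following sketch uses placeholder names: `estimators` holds sub-estimators trained for sample sizes m̃1 < … < m̃l, `changepoints` holds (m1, …, m_{l−1}), and the sample-size distribution Pr(M = m) is taken uniform over ℳ for simplicity.

```julia
# Piecewise estimator (6): dispatch on the observed number of replicates m.
function piecewise_estimate(estimators, changepoints, Z)
    m = length(Z)
    k = searchsortedfirst(changepoints, m)   # first k with m ≤ mₖ (k = l if none)
    return estimators[k](Z)
end

# Randomized sample size as in (7): draw m before simulating data for (4).
sample_m(ℳ) = rand(ℳ)                        # e.g., ℳ = 1:150
```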

2.2.3 Illustrative Example

We now present a relatively simple example that demonstrates that the DeepSets representation can approximate well Bayes estimators for replicated data. Consider the problem of estimating θ from data Z1, …, Zm that are iid according to a uniform distribution on [0, θ]. Suppose that we choose a conjugate Pareto(α, β) prior for θ with shape α = 4 and scale β = 1; that is, Pr(θ ≤ x) = 1 − (x/β)^{−α}, x ≥ β. Under the absolute-error loss, the unique Bayes estimator is the posterior median which, for this model, is available in closed form: for a Pareto(α, β) prior distribution, the Bayes estimator given m independent replicates Z^(m) ≡ (Z1, …, Zm) is

$$\hat{\theta}_{\mathrm{Bayes}}(Z^{(m)}) = 2^{1/(\alpha + m)} \max(Z_1, \dots, Z_m, \beta), \quad (8)$$

which is clearly invariant to permutations of the observations, and can be expressed as a scalar rendition of (5), with ψ(z) ≡ z, a({·}) ≡ max({·}), and ϕ(t) ≡ 2^{1/(α+m)} max(t, β). Note that when a({·}) is instead chosen to be the set average, alternative nonlinear functions ψ(z) and ϕ(t) exist such that the estimator (5) approximates (8) arbitrarily well, by approximating the maximum as a log-sum-exp.

For this model, the "one-at-a-time" estimator, which applies the single-replicate Bayes estimator to each replicate independently and averages the resulting estimates, is given by

$$\hat{\theta}_0(Z^{(m)}) = \frac{1}{m} \sum_{i=1}^{m} 2^{1/(\alpha + 1)} \max(Z_i, \beta). \quad (9)$$

Although (9) is permutation invariant, it is clearly very different from (8). In Section S3 of the supplementary material we show that, for this model, the one-at-a-time estimator is not consistent, while the Bayes estimator is. We compare our proposed estimator to the one-at-a-time estimator in this example and in the remainder of the article, to stress the importance of properly accounting for replication when constructing neural Bayes estimators. Both closed-form estimators are written out in the code sketch below.
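The closed-form estimators (8) and (9), together with the maximum likelihood estimator used later in this section, translate directly into a few lines of Julia:

```julia
α, β = 4.0, 1.0   # Pareto prior hyperparameters used in this example

bayes(Z)         = 2^(1 / (α + length(Z))) * max(maximum(Z), β)                # eq. (8)
one_at_a_time(Z) = sum(2^(1 / (α + 1)) * max(Zi, β) for Zi in Z) / length(Z)   # eq. (9)
mle(Z)           = maximum(Z)

Z = rand(10) .* 4/3                    # m = 10 replicates from Unif(0, 4/3)
bayes(Z), one_at_a_time(Z), mle(Z)     # compare the three estimates
```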

We now proceed with the design of the neural Bayes estimator as though we had no knowledge of its closed-form expression. We construct our neural Bayes estimator to have the form (5), where we use three-layer neural networks to model ϕ(·; γϕ) and ψ(·; γψ), and we let J = 1 and K = 10⁶ in (4) when training the network. For simplicity, we consider a single value for m and fix m = 10. We compare the distributions of the estimators at θ = 4/3 and m = 10 by simulating 30,000 datasets of size m = 10 from a uniform distribution on [0, 4/3] and then applying the estimators to each of these simulated datasets. Figure 2 shows the kernel-smoothed distribution of the true Bayes estimator (8), the neural Bayes estimator, the one-at-a-time estimator (9), and the maximum likelihood estimator which, for this model, is given by max(Z1, …, Zm). The distribution of our neural Bayes estimator is clearly very similar to that of the true Bayes estimator, while the one-at-a-time estimator is clearly biased. The kernel-smoothed distribution of the maximum likelihood estimator emphasizes that (neural) Bayes estimators are influenced by the prior distribution; in this case, the prior distribution leads to a Bayes estimator with a bimodal distribution, due to a point mass at the lower endpoint of the support. This example also shows clearly that neural point estimators trained using (4) target the Bayes estimator, and not necessarily the maximum likelihood estimator; this notion is sometimes misconstrued in the literature.

Fig. 2 Kernel density approximations to the distribution of the Bayes estimator (green line), our neural Bayes estimator (red line), the maximum likelihood estimator (purple line), and the one-at-a-time estimator (orange line), for θ from Unif(0,θ) data, where θ=4/3 (gray dashed line) and where the sample size is m = 10.


2.3 Implementation

Neural Bayes estimators are conceptually simple and can be used in a wide range of problems where other approaches are computationally infeasible. They also have marked practical appeal, as the general workflow for their construction is only loosely connected to the model being considered. The workflow for constructing a neural Bayes estimator is as follows:

  1. Define the prior, Ω(·).

  2. Choose a loss function, L(·,·), typically the absolute-error or squared-error loss.

  3. Design a suitable neural-network architecture for the neural point estimator θ̂(·;γ).

  4. Sample parameters from Ω(·) to form training/validation/test parameter sets.

  5. Given the above parameter sets, simulate data from the model, to form training/validation/test datasets.

  6. Train the neural network (i.e., estimate γ) by minimizing the loss function averaged over the training sets. That is, perform the optimization task (3), where the Bayes risk is approximated using (4) with the training sets. During training, monitor performance and convergence using the validation sets (a minimal code sketch of steps 4–6 is given after this list).

  7. Assess the fitted neural Bayes estimator, θ̂(·;γ*), using the test set.
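The following compressed sketch illustrates steps 4–6, with the training data refreshed each epoch ("on-the-fly" simulation; see Section 2.3.3). It assumes placeholder functions `sample_prior()` and `simulate(θ)`, uses a simple dense network acting on a single replicate of dimension n for clarity (the same loop applies to the DeepSets estimator of Section 2.2.1), follows the API of recent Flux versions (≥ 0.14), and uses full-batch gradient updates where minibatching would normally be employed.

```julia
using Flux

n, p = 256, 3
θ̂ = Chain(Dense(n => 128, relu), Dense(128 => p))   # illustrative architecture

# Bayes risk (4) under the absolute-error loss, averaged over a parameter set.
risk(model, θs, Zs) = sum(sum(abs, model(Z) .- θ) for (θ, Z) in zip(θs, Zs)) / length(θs)

θ_val   = [sample_prior() for _ in 1:2_000]
Z_val   = [simulate(θ) for θ in θ_val]               # fixed validation data
θ_train = [sample_prior() for _ in 1:10_000]         # fixed training parameters

opt_state = Flux.setup(Adam(5f-4), θ̂)
for epoch in 1:100
    Z_train = [simulate(θ) for θ in θ_train]         # on-the-fly data refresh
    loss, grads = Flux.withgradient(m -> risk(m, θ_train, Z_train), θ̂)
    Flux.update!(opt_state, θ̂, grads[1])
    @info "epoch $epoch" train = loss val = risk(θ̂, θ_val, Z_val)
end
```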

We elaborate on the steps of this workflow below. A crucial factor in the viability of any statistical method is the availability of software; to facilitate the adoption of neural Bayes estimators by statisticians, we accompany the article with user-friendly open-source software in the Julia (Bezanson et al. 2017) and R (R Core Team 2023) programming languages (available at https://github.com/msainsburydale/NeuralEstimators) that can be used in a wide range of settings.

2.3.1 Defining the Prior

Prior distributions are typically determined by the applied problem being tackled. However, the choice of prior has practical implications for the training phase of the neural network. For example, an informative prior with compact and narrow support reduces the volume of the parameter space that must be sampled from when evaluating (4). In this case, a good approximation of the Bayes estimator can typically be obtained with smaller values of K and J in (4) than those required under a diffuse prior. This consideration is particularly important when the number of parameters, p, is large, since the volume of the parameter space increases exponentially with p. On the other hand, if the neural Bayes estimator needs to be re-used across several applications, it might be preferable to employ prior distributions that are reasonably uninformative. Lenzi et al. (2023) suggest using likelihood-based estimates to elicit an informative prior from the data; however, this requires likelihood estimation to be feasible in the first place, which is often not the case in applications for which neural Bayes estimators are attractive.

2.3.2 Designing the Neural-Network Architecture

The main consideration when designing the neural-network architecture is the structure of the data. For example, if the data are gridded, a convolutional neural network (CNN) may be used, while a dense neural network (DNN; also known as a multilayer perceptron, MLP) is more appropriate for unstructured data (for further examples, see Goodfellow, Bengio, and Courville 2016). When estimating parameters from replicated data using the representation (5), the architecture of ψ(·) is dictated by the structure of the data (since it acts on the data directly), while ϕ(·) is typically a DNN (since T(·) is a fixed-dimensional vector). Note that the training cost of the neural Bayes estimator is amortized only if the structure of new datasets conforms with the chosen architecture (otherwise the neural Bayes estimator will need to be re-trained with a new architecture). Once the general neural-network class is identified, one must specify architectural hyperparameters, such as the number of layers and the number of neurons in each layer. We found that our results were not overly sensitive to the specific choice of hyperparameters. In practice, the network must be sufficiently large for the universal approximation theorem to apply, but not so large as to make training prohibitively expensive. We give details on the architectures used in our experiments in Section 3.

2.3.3 Simulating Parameters/Data and Training the Network

In standard applications of neural networks, the amount of training data is fixed. One of the biggest risks one faces when fitting a large neural network is that the amount of training data is too "small," which can result in the neural network overfitting the training set (Goodfellow, Bengio, and Courville 2016, sec. 5.2). Such a neural network is said to have a high generalization error. However, when constructing a neural Bayes estimator, we are able to simulate as much training data as needed; that is, we are able to set K and J in (4) as large as needed to avoid overfitting. The amount of training data needed depends on the model, the number of parameters, the neural-network architecture, and the optimization algorithm that is used. Providing general guidance is therefore difficult; however, our experience is that K needs to be at least 10⁴–10⁵ for the neural network to have low generalization error. Note that J can be kept small (on the order of 10⁰–10¹), since data are simulated for every sampled parameter vector. One also has the option to simulate training data "on-the-fly," in the sense that new training data are simulated continuously during training (Chan et al. 2018). This approach facilitates the use of large networks with a high representational capacity, since the data used in stochastic-gradient-descent updates are then always different from those in previous updates (see Section S4 of the supplementary material for more details and an illustration). On-the-fly simulation also allows the data to be simulated "just-in-time," in the sense that the data can be simulated from a small batch of parameters, used to train the estimator, and then removed from memory; this can reduce memory requirements when a large amount of training data is required to avoid overfitting. Chan et al. (2018) also continuously refresh the parameter vectors in the training set, which has similar benefits. Keeping these parameters fixed, however, allows computationally expensive terms, such as Cholesky factors, to be reused throughout training, which can substantially reduce the training time for some models (Gerber and Nychka 2021).

The parameters of a neural network trained for one task can be used as initial values for the parameters of another neural network intended for a slightly different task. This is known as pre-training (Goodfellow, Bengio, and Courville 2016, chap. 8). Pre-training is ideal when developing a piecewise neural Bayes estimator (6), whereby one may randomly initialize and train θ̂(·; γ_{m̃1}), use the optimized parameters γ*_{m̃1} as initial values when training θ̂(·; γ_{m̃2}), where m̃2 > m̃1, and so on. Since each estimator need only account for a larger sample size than that used by its predecessor, scant computational resources are needed to train subsequent estimators. This approach can be useful even when only a single large-sample estimator is needed: doing most of the learning with small, computationally cheap sample sizes can substantially reduce training time when compared to training a single large-sample estimator from scratch. We demonstrate the benefits of this strategy through an illustration in Section S5 of the supplementary material.
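A sketch of this warm-start strategy follows, under the assumption of a user-supplied `train(model, m̃)` routine (e.g., the training loop above, run with datasets of m̃ replicates) and a `build_estimator()` constructor; both names are placeholders.

```julia
m̃ = [1, 10, 35, 75, 150]              # training sample sizes (cf. Section 3.1)
estimators = Vector{Any}(undef, length(m̃))
estimators[1] = train(build_estimator(), m̃[1])             # trained from scratch
for k in 2:length(m̃)                                       # warm starts
    estimators[k] = train(deepcopy(estimators[k - 1]), m̃[k])
end
```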

2.3.4 Assessing the Estimator

Once a neural Bayes estimator is trained, its performance needs to be assessed on unseen test data. Simulation-based (empirical) methods are ideal, since simulation is already required for constructing the estimator. Neural Bayes estimators are very fast to evaluate, and can therefore be applied to thousands of simulated datasets at almost no computational cost. One can therefore quickly and accurately assess the estimator with respect to any property of its sampling distribution (e.g., bias, variance, etc.).

3 Simulation Studies

We now conduct several simulation studies that clearly demonstrate the utility of neural Bayes estimators in increasingly complex settings. In Section 3.1, we outline the general setting. In Section 3.2, we estimate the parameters of a Gaussian process model with three unknown parameters. Since the likelihood function is available for this model, here we compare the efficiency of our neural Bayes estimator to that of a gold-standard likelihood-based estimator. In Section 3.3, we consider a spatial extremes setting and estimate the two parameters of Schlather's max-stable model (Schlather 2002); the likelihood function is computationally intractable for this model, and we observe substantial improvements over the classical composite-likelihood approach. In Section 3.4, we consider the highly parameterized spatial conditional extremes model (Wadsworth and Tawn 2022). We provide simulation details and density functions for each model in Section S6 of the supplementary material. For ease of notation, we omit dependence on the neural-network parameters γ in subsequent references to the neural Bayes estimators, so that θ̂(·; γ) is written simply as θ̂(·).

3.1 General Setting

Across the simulation studies we assume, for ease of exposition, that our processes are spatial. Our spatial domain of interest, D, is [0, 16] × [0, 16], and we simulate data on a regular grid with unit-square cells, yielding 16² = 256 observations per spatial field. CNNs are a natural choice for regularly spaced gridded data, and we therefore use a CNN architecture, summarized in Table S3 of the supplementary material. To implement the neural point estimators, we use the accompanying package NeuralEstimators, which is written in Julia and leverages the package Flux (Innes 2018). We conduct our studies using a workstation with an AMD EPYC 7402 3.00GHz CPU with 52 cores and 128 GB of CPU RAM, and an NVIDIA Quadro RTX 6000 GPU with 24 GB of GPU RAM. All results presented in the remainder of this article can be generated using reproducible code at https://github.com/msainsburydale/NeuralBayesEstimators.

We consider two neural point estimators, both based on the architecture given in Table S3 of the supplementary material. The first estimator, θ̂0(·), is the permutation-invariant one-at-a-time neural point estimator considered by Gerber and Nychka (2021). The second estimator, θ̂DS(·), is the piecewise neural point estimator (6), where each sub-estimator employs the DeepSets representation (5), with ψ(·) and ϕ(·) constructed using the first four and last two rows of Table S3, respectively. Five sub-estimators are used, with training sample sizes m̃1 = 1, m̃2 = 10, m̃3 = 35, m̃4 = 75, and m̃5 = 150, and with sample-size changepoints m1 = 1, m2 = 20, m3 = 50, and m4 = 100.

We assume that the parameters are a priori independent and uniformly distributed on an interval that is parameter dependent. We train the neural point estimators under the absolute-error loss. We set K in (4) to 10,000 and 2000 for the training and validation parameter sets, respectively, and we keep these sets fixed during training. We construct the training and validation datasets by simulating J = 10 sets of m model realizations for each parameter vector in the training and validation parameter sets. During training, we fix the validation data, but simulate the training data on-the-fly; hence, in this article, we define an epoch as a pass through the training sets when doing stochastic gradient descent, after which the training data (i.e., the J = 10 datasets at each of the 10,000 parameter samples) are refreshed. We cease training when the risk evaluated using the validation set has not decreased in five consecutive epochs.

We compare the neural Bayes estimators to the maximum a posteriori (MAP) estimator,

$$\hat{\theta}_{\mathrm{MAP}}(Z^{(m)}) \equiv \underset{\theta \in \Theta}{\arg\max} \left\{ \sum_{i=1}^{m} \ell(\theta; Z_i) + \log p(\theta) \right\}, \quad (10)$$

where ℓ(θ; ·) is the log-likelihood function for a single replicate and p(θ) is the prior density. We solve (10) using the true parameters as initial values for the Nelder–Mead algorithm. We advantage the competitor MAP estimator, which minimizes (2) under the 0–1 loss, by assessing all estimators with respect to the 0–1 loss (we assign zero loss if an estimate is within 10% of the true value), with K = 500 test parameter vectors.
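For reference, a minimal sketch of the MAP estimator (10) via Nelder–Mead in Optim.jl follows; the single-replicate log-likelihood `ℓ(θ, Zᵢ)` and log-prior density `logp(θ)` are user-supplied placeholders.

```julia
using Optim

function map_estimate(ℓ, logp, Z, θ_init)
    negpost(θ) = -(sum(ℓ(θ, Zi) for Zi in Z) + logp(θ))   # negative log-posterior
    return Optim.minimizer(optimize(negpost, θ_init, NelderMead()))
end
```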

3.2 Gaussian Process Model

Spatial statistics is concerned with modeling data that are collected across space; reference texts include Cressie (1993), Banerjee, Carlin, and Gelfand (2004), and Diggle and Ribeiro (2007). Here, we consider a classical spatial model, the linear Gaussian-Gaussian model,

$$Z_{ij} = Y_i(s_j) + \epsilon_{ij}, \quad i = 1, \dots, m, \quad j = 1, \dots, n, \quad (11)$$

where Zi ≡ (Zi1, …, Zin)′ are data observed at locations {s1, …, sn} on a spatial domain D, {Yi(·)} are iid spatially correlated mean-zero Gaussian processes, and ϵij ∼ N(0, σϵ²) is Gaussian white noise. The covariance function, C(s, u) ≡ cov(Yi(s), Yi(u)), for s, u ∈ D and i = 1, …, m, is the primary mechanism for capturing spatial dependence. Note that, since {Yi(·)} are iid, we have cov(Yi(s), Yi′(u)) = 0 for all i ≠ i′ and s, u ∈ D. Here, we use the popular isotropic Matérn covariance function,

$$C(s, u) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\|s - u\|}{\rho} \right)^{\nu} K_{\nu}\!\left( \frac{\|s - u\|}{\rho} \right), \quad (12)$$

where σ² is the marginal variance, Γ(·) is the gamma function, K_ν(·) is the modified Bessel function of the second kind of order ν, and ρ > 0 and ν > 0 are the range and smoothness parameters, respectively. We follow Gerber and Nychka (2021) and fix σ² = 1. This leaves three unknown parameters to be estimated: θ ≡ (σϵ, ρ, ν)′.
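The Matérn covariance function (12) translates directly into Julia; `gamma` and `besselk` are from SpecialFunctions.jl, and the default parameter values below are illustrative only.

```julia
using LinearAlgebra, SpecialFunctions

function matern(s, u; ρ = 5.0, ν = 1.0, σ² = 1.0)
    h = norm(s - u)
    h == 0 && return σ²                               # C(s, s) = σ²
    σ² * 2^(1 - ν) / gamma(ν) * (h / ρ)^ν * besselk(ν, h / ρ)
end

matern([0.0, 0.0], [1.0, 1.0]; ρ = 5.0, ν = 1.0)      # example evaluation
```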

Fixing ν simplifies the estimation task, since the remaining parameters are well identified from a single replicate; we consider this model, which was also considered by Gerber and Nychka (2021), in Section S7 of the supplementary material. Here, we consider the case where all three parameters are unknown. Estimation in this case is more challenging, since ρ and ν are only weakly identifiable from a single replicate (Zhang 2004), but inference on ν is important since it controls the differentiability of the process (Stein 1999, chap. 2).

We use the priors σϵ ∼ Unif(0.1, 1), ρ ∼ Unif(2, 10), and ν ∼ Unif(0.5, 3). The total training time for θ̂0(·) is 15 min. The training time for θ̂DS(·), despite it consisting of several sub-estimators and being trained with large sample sizes, is only slightly more, at 33 min, due to the pre-training strategy discussed in Section 2.3. This modest increase in training time is a small cost when accounting for the improvement in statistical efficiency, as shown in Figure 3. Clearly, θ̂DS(·) substantially improves over θ̂0(·), and performs similarly well to the MAP estimator. The neural point estimators take only 0.1 and 2 sec to yield estimates from all the test data when m = 1 and m = 150, respectively, while the MAP estimator takes between 600 and 800 sec.

Fig. 3 The risk with respect to the 0–1 loss against the number of replicates, m, evaluated using the parameter vectors in the test parameter set, for the estimators considered in Section 3.2. The estimators θ̂0(·), θ̂DS(·), and the MAP estimator are shown in green, orange, and purple, respectively.


Next, we compare the empirical joint distributions of the estimators for a single parameter vector. Figure 4 shows the true parameters (red cross) and the corresponding estimates obtained by applying each estimator to 100 sets of m = 150 independent model realizations. The MAP estimator and θ̂DS(·) are both approximately unbiased, and both are able to capture the correlations expected when using the parameterization (12); both are clearly much better than θ̂0(·). Empirical joint distributions for additional parameter vectors are shown in Figure S11 of the supplementary material, and lead to similar conclusions to those drawn here. Overall, these results show that θ̂DS(·) is a substantial improvement over the prior art, θ̂0(·), and that θ̂DS(·) is competitive with the likelihood-based estimator for this model.

Fig. 4 The empirical joint distribution of the estimators considered in Section 3.2 for a single parameter vector. The true parameters are shown in red, while estimates from θ̂0(·), θ̂DS(·), and the MAP estimator are shown in green, orange, and purple, respectively. Each estimate was obtained from a simulated dataset of size m=150.


3.3 Schlather’s Max-Stable Model

We consider Schlather's max-stable model (Schlather 2002), given by

$$Z_{ij} = \bigvee_{k \in \mathbb{N}} \zeta_{ik} \max\{0, Y_{ik}(s_j)\}, \quad i = 1, \dots, m, \quad j = 1, \dots, n, \quad (13)$$

where ⋁ denotes the maximum over the indexed terms, Zi ≡ (Zi1, …, Zin)′ are observed at locations {s1, …, sn} ⊂ D, {ζik : k ∈ ℕ} for i = 1, …, m are iid Poisson point processes on (0, ∞) with intensity measure dΛ(ζ) = ζ⁻² dζ, and {Yik(·) : i = 1, …, m, k ∈ ℕ} are iid mean-zero Gaussian processes scaled so that E[max{0, Yik(·)}] = 1. Here, we model each Yik(·) using the Matérn covariance function (12), with σ² = 1. Hence, θ ≡ (ρ, ν)′.

We use the same uniform priors for ρ and ν as in Section 3.2. Realizations from the present model, here expressed on unit Fréchet margins, tend to have highly varying magnitudes, and we reduce this variability by log-transforming our data to the unit Gumbel scale. The total training time for θ̂0(·) and θ̂DS(·) is 14 and 66 min, respectively.

As in Section 3.2, we assess the neural point estimators by comparing them to a likelihood-based estimator. For Schlather's model (and other max-stable models in general), the full likelihood function is computationally intractable, since it involves a summation over the set of all possible partitions of the spatial locations (see, e.g., Padoan 2010; Huser et al. 2019, and the references therein). A popular substitute is the pairwise likelihood (PL) function, a composite likelihood formed by considering only pairs of observations; specifically, the pairwise log-likelihood function for the ith replicate is

$$\ell_{\mathrm{PL}}(\theta; z_i) \equiv \sum_{j=1}^{n-1} \sum_{j'=j+1}^{n} \log f(z_{ij}, z_{ij'} \mid \theta), \quad (14)$$

where f(·, · | θ) denotes the bivariate probability density function for pairs in zi. Hence, in this subsection, we compare the neural Bayes estimators to the pairwise MAP (PMAP) estimator, that is, (10) with the full log-likelihood function ℓ(θ; ·) replaced by ℓ_PL(θ; ·). Often, both computational and statistical efficiency can be drastically improved by using only the subset of pairs that are within a fixed cutoff distance, d (see, e.g., Bevilacqua et al. 2012; Sang and Genton 2012). A line-search for d (see Figure S12 of the supplementary material for details) shows that, here, d = 3 units (used hereafter) provides good results.
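A minimal sketch of the distance-tapered pairwise log-likelihood (14) follows; the bivariate log-density `logf2(z₁, z₂, h, θ)` for a pair of observations at distance h is a user-supplied placeholder (it is available in closed form for Schlather's model).

```julia
using LinearAlgebra

function pairwise_loglik(logf2, θ, z, S; d = 3.0)     # S[j] is the location of z[j]
    ll = 0.0
    for j in 1:(length(z) - 1), k in (j + 1):length(z)
        h = norm(S[j] - S[k])
        h ≤ d && (ll += logf2(z[j], z[k], h, θ))      # keep only pairs within d
    end
    return ll
end
```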

The left and center panels of Figure 5 show the estimators' risk against the number of independent replicates. For small samples, both neural point estimators improve over the PMAP estimator. For moderate-to-large samples, θ̂0(·) hits a performance plateau, while θ̂DS(·) continues to substantially outperform the PMAP estimator. The run time for the neural point estimators to estimate all test parameters scales roughly linearly with m, from 0.1 sec when m = 1 to 1.5 sec when m = 150, while the PMAP estimator takes between 750 and 1900 sec. The right panel of Figure 5 shows the empirical joint distribution of the estimators for a single parameter vector, where each estimate was obtained from m = 150 replicates. Again, θ̂0(·) is strongly biased, while the PMAP estimator is unbiased but less efficient than θ̂DS(·). Empirical joint distributions for additional parameter vectors are shown in Figure S13 of the supplementary material, and lead to similar conclusions to those drawn here. Overall, the proposed estimator, θ̂DS(·), is statistically and computationally superior to the likelihood-based technique for Schlather's max-stable model.

Fig. 5 Diagnostic plots for the simulation study of Section 3.3. (Left and center) The risk with respect to the 0–1 loss against the number of replicates, m, evaluated using the parameter vectors in the test parameter set. (Right) True parameters (red) and corresponding estimates, each of which was obtained using a simulated dataset of size m = 150. In all panels, θ̂0(·), θ̂DS(·), and the pairwise MAP estimator are shown in green, orange, and purple, respectively.


3.4 Spatial Conditional Extremes Model

While max-stable processes are asymptotically justified for modeling spatial extremes defined as block maxima, they have strong limitations in practice (Huser and Wadsworth 2022). Beyond the intractability of their likelihood function in high dimensions, max-stable models have an overly rigid dependence structure and, in particular, cannot capture a property known as asymptotic tail independence. Therefore, different model constructions, justified by alternative asymptotic conditions, have recently been proposed to circumvent the restrictions imposed by max-stability. In particular, the spatial conditional extremes model, first introduced by Wadsworth and Tawn (2022) as a spatial extension of the multivariate Heffernan and Tawn (2004) model, and subsequently studied and used in applications by, for example, Richards, Tawn, and Brown (2022) and Simpson, Opitz, and Wadsworth (2023), is especially appealing. This model has a flexible dependence structure capturing both asymptotic tail dependence and independence (Wadsworth and Tawn 2022), and leads to likelihood-based inference that is amenable to higher dimensions. Nevertheless, parameter estimation remains challenging because the model is complex and typically highly parameterized. Here, we consider a version of the spatial conditional extremes model that involves eight dependence parameters in total.

Our formulation of the model is similar to that originally proposed by Wadsworth and Tawn (2022). Specifically, we model the process, expressed on Laplace margins, conditional on it exceeding a threshold, u, at a conditioning site, s0 ∈ D, as

$$Z_{ij} \mid Z_{i0} > u \;\overset{d}{=}\; a(h_j, Z_{i0}) + b(h_j, Z_{i0})\, Y_i(s_j), \quad i = 1, \dots, m, \quad j = 1, \dots, n, \quad (15)$$

where "=ᵈ" denotes equality in distribution, Zi0 is the datum at s0, hj ≡ ‖sj − s0‖, and Zi0 − u | Zi0 > u is a unit exponential random variable that is independent of the residual process, Yi(·), which we describe below. We model a(·, ·) and b(·, ·) using parametric forms proposed by Wadsworth and Tawn (2022), namely a(h, z) = z exp{−(h/λ)^κ}, λ > 0, κ > 0, and b(h, z) = 1 + a(h, z)^β, β > 0, where h ≡ ‖s − s0‖ for s ∈ D and z ∈ ℝ. We construct the residual process, Yi(·), by first defining Ỹi⁽⁰⁾(·) ≡ Ỹi(·) − Ỹi(s0), where Ỹi(·) is a mean-zero Gaussian process with Matérn covariance function and unit marginal variance, and then marginally transforming it to the scale of a delta-Laplace (generalized Gaussian) distribution (Subbotin 1923) which, for y ∈ ℝ and parameters μ ∈ ℝ, τ > 0, δ > 0, has density function fS(y | μ, τ, δ) = δ / (2τΓ(1/δ)) exp{−|(y − μ)/τ|^δ}.

We model δ as decaying from 2 to 1 as the distance to s0 increases; specifically,

$$\delta(h) = 1 + \exp\{-(h/\delta_1)^2\}, \quad \delta_1 > 0. \quad (16)$$

We defer to Wadsworth and Tawn (2022) for model justification and interpretation of the model parameters. The threshold u is a modeling decision made to ensure that the data are sufficiently extreme and that, in a real-data setting, we have sufficiently many fields available for estimation: here, we set u to the 0.975 quantile of the unit-Laplace distribution. For simplicity, we consider s0 fixed and in the center of D.
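For concreteness, the parametric forms entering (15) and (16) are written out below in Julia; argument names follow the text (h is the distance ‖s − s0‖, z the datum at s0), and `gamma` is from SpecialFunctions.jl.

```julia
using SpecialFunctions

a(h, z; λ, κ)    = z * exp(-(h / λ)^κ)                 # λ > 0, κ > 0
b(h, z; β, λ, κ) = 1 + a(h, z; λ, κ)^β                 # β > 0
δ(h; δ₁)         = 1 + exp(-(h / δ₁)^2)                # eq. (16): decays from 2 to 1

# Delta-Laplace (generalized Gaussian) density for the margins of Yᵢ(·).
f_S(y; μ, τ, δ) = δ / (2τ * gamma(1 / δ)) * exp(-abs((y - μ) / τ)^δ)
```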

We again use uniform priors, with κ ∼ Unif(1, 2), λ ∼ Unif(2, 5), β ∼ Unif(0.05, 1), μ ∼ Unif(−0.5, 0.5), τ ∼ Unif(0.3, 0.9), and δ1 ∼ Unif(1.3, 3), and with the same priors for ρ and ν as used in the preceding sections. We use the cube-root function as a variance-stabilizing transformation. The training time for θ̂0(·) and θ̂DS(·) is 22 and 43 min, respectively.

Figure 6 shows the risk against the sample size. Figure 7 shows the empirical joint distribution of the estimators for a single parameter vector (panels in the lower triangle) and simulations from the corresponding model (panels in the upper triangle). The estimator θ̂DS(·) is approximately unbiased for all parameters, and captures the expected negative correlation between ρ and ν. Overall, the estimator θ̂DS(·) is clearly appropriate for this highly parameterized model (unlike θ̂0(·)), which is an important result for this framework.

Fig. 6 The risk with respect to the 0–1 loss against the number of replicates, m, evaluated using the parameter vectors in the test parameter set, for the estimators considered in Section 3.4. The estimators θ̂0(·) and θ̂DS(·) are shown in green and orange, respectively.


Fig. 7 (Lower triangle) The empirical joint distribution of the estimators considered in Section 3.4 for a single parameter vector. The true parameters are shown in red, while estimates from θ̂0(·) and θ̂DS(·) are shown in green and orange, respectively. Each estimate was obtained from a simulated dataset of size m = 150. (Upper triangle) Simulations from the model.


4 Application to Red Sea Surface Temperature

We now apply our methodology to the analysis of sea-surface temperature data in the Red Sea, which have also been analyzed by Hazra and Huser (2021), Simpson and Wadsworth (2021), and Simpson, Opitz, and Wadsworth (2023), among others, and have been the subject of a competition in the prediction of extreme events (Huser 2021). The dataset we analyze comprises daily observations from the years 1985 to 2015, for 16,703 regularly spaced locations across the Red Sea; see Donlon et al. (2012) for further details. Following Simpson, Opitz, and Wadsworth (2023), we focus on a southern portion of the Red Sea, consider only the summer months to approximately eliminate the effects of seasonality, and retain only every third longitude and latitude value; this yields a dataset with 678 unique spatial locations that are regularly spaced but contained within an irregularly shaped spatial domain, D. To account for D lying away from the equator, we follow Simpson, Opitz, and Wadsworth (2023) in scaling the longitude and latitude so that each unit distance in the spatial domain corresponds to approximately 100 km.

To model these data, we use the spatial conditional extremes model described in Section 3.4. We transform our data to the Laplace scale and set the threshold, u in (15), to the 95th percentile of the transformed data. This yields 141 spatial fields for which the transformed datum at the conditioning site, s0, here chosen to lie in the center of D, is greater than u; that is, we estimate parameters based on m = 141 replicates of the spatial process. A randomly selected sample of these extreme fields is shown in the upper panels of Figure S14 of the supplementary material.

The irregular shape of D means that CNNs are not directly applicable. We use the standard technique of padding empty regions with zeros, so that each field is a 29 × 37 rectangular array consisting of a data region and a padded region, as shown in Figure S15 of the supplementary material. We use the same prior distributions as in Section 3.4, but with those associated with range parameters scaled appropriately. Our architecture is the same as that given in Table S3, but with an additional convolutional layer that transforms the input array to dimension 16 × 16. We validate our neural Bayes estimator using the approach taken in Section 3 (figures omitted for brevity).

Once training is complete, we can compute parameter estimates for the observed data. Table 1 gives the estimates and their 95% confidence intervals. These confidence intervals are obtained using the nonparametric bootstrap procedure described in Section S8 of the supplementary material, which accounts for temporal dependence between the spatial fields. Estimation from a single set of 141 fields takes only 0.008 sec, meaning that bootstrap confidence intervals can be obtained very quickly. This is clearly an advantage of using neural point estimators, as opposed to likelihood-based techniques, for which uncertainty assessment in complex models is usually a computational burden.
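The mechanics of this uncertainty quantification are simple because the trained estimator is almost free to evaluate. The sketch below shows a plain iid resampling of replicates for illustration; the analysis in this section uses a variant (Section S8) that respects the temporal dependence between fields.

```julia
using Statistics

function bootstrap_ci(θ̂, Z; B = 1000, level = 0.95)
    m = length(Z)
    ests = [θ̂(Z[rand(1:m, m)]) for _ in 1:B]          # B resampled estimates
    α = (1 - level) / 2
    lo = [quantile(getindex.(ests, j), α)     for j in eachindex(first(ests))]
    hi = [quantile(getindex.(ests, j), 1 - α) for j in eachindex(first(ests))]
    return lo, hi                                      # percentile intervals
end
```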

Table 1 Parameter estimates and 95% bootstrap confidence intervals (provided via the 2.5 and 97.5 percentiles of the bootstrap distribution) for the Red Sea dataset of Section 4.

Next, following Simpson, Opitz, and Wadsworth (2023), we separate D into 17 nonoverlapping regions, which are shown in the left panel of Figure 8. Given that the transformed datum at s0 exceeds u, we estimate (and quantify the uncertainty in) the proportion of locations in each region that also exceed u, using both model-based and empirical methods. These proportions are shown in the right panel of Figure 8, and indicate overall agreement between the model and the observed data, with some minor lack of fit at very short distances. The empirical estimates are not monotonically decreasing as a function of distance; this could be due to complex spatial dynamics or nonstationarity in the data, or it could simply be an artifact of sampling variability. In either case, the fit is reasonable, which suggests that our neural Bayes estimator has provided reasonable parameter estimates for this dataset.

Fig. 8 (Left) The spatial domain of interest for the Red Sea study of Section 4, with the conditioning site, s0, shown in red, and the remaining locations separated into 17 regions; the region labels begin at 1 in the center of the domain and increase with distance from the center. (Right) The estimated proportion of locations for which the process exceeds u given that it is exceeded at s0 (points) and corresponding 95% confidence intervals (vertical segments) using model-simulated datasets (blue) and bootstrap samples of the observed dataset (red).


5 Conclusion

Neural Bayes estimators are a class of likelihood-free estimators that approximate Bayes estimators using neural networks. Their connection to classical estimation theory is often under-appreciated, and this article aims to increase the awareness and adoption of this powerful estimation tool among statisticians. This article also proposes a principled way to construct neural Bayes estimators for replicated data via the DeepSets architecture, which has the same structure as that of well-known conventional estimators, such as best unbiased estimators in exponential family models. These estimators automatically learn suitable summary statistics from the data; using them, we jointly estimate the range and smoothness parameters in a Gaussian process model and in Schlather's max-stable model, and estimate parameters in the highly parameterized spatial conditional extremes model. The estimators can be implemented with just a few lines of code, thanks to the package NeuralEstimators, which is released with this article and is available at https://github.com/msainsburydale/NeuralEstimators. A minimal sketch of the DeepSets construction follows.
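In this hedged Julia/Flux sketch, an inner network ψ summarizes each replicate, the summaries are pooled by a permutation-invariant aggregation (here the elementwise mean), and an outer network ϕ maps the pooled summary to the parameter estimates. The layer sizes are illustrative placeholders, not the architecture of Table S3; NeuralEstimators wraps this pattern in its DeepSet type.

```julia
using Flux, Statistics

p = 4                                      # number of parameters (placeholder)
ψ = Chain(Conv((3, 3), 1 => 16, relu),     # 29 × 37 × 1 input → 27 × 35 × 16
          Flux.flatten,
          Dense(27 * 35 * 16, 64, relu))   # one summary vector per replicate
ϕ = Chain(Dense(64, 64, relu), Dense(64, p))

# Z is a vector of m replicates, each a 29 × 37 × 1 × 1 array; the estimator
# is permutation-invariant because mean() ignores the order of the replicates.
θ̂(Z) = ϕ(mean(ψ.(Z)))
```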

As with all estimation methods, neural Bayes estimators come with advantages and disadvantages that will make them either highly applicable or impractical, depending on the application. At the time of writing, we see five main drawbacks of neural Bayes estimators. First, it is unclear how one could verify that a trained neural Bayes estimator is "close" to the true Bayes estimator, and more theory is needed to provide guarantees and guidelines on how to choose the number of training samples in practice. This limitation can be somewhat mitigated by running empirical checks (as we do in our examples); these checks are not computationally costly given the amortized nature of the neural Bayes estimator. Second, the training cost is amortized only if the estimator is used repeatedly for the same estimation problem; in applications where estimation needs to be done just once, and the likelihood function is available and computationally tractable, classical likelihood-based methods, supported by theoretical guarantees and extensive software availability, might be more attractive. Third, one needs to be able to simulate relatively easily from the data generating process; although this sampling phase is parallelizable, it could make the use of neural Bayes estimators with some models impractical. Fourth, the fast construction of neural Bayes estimators requires graphics processing units (GPUs); however, GPUs suitable for deep learning are now commonplace in high-end workstations, and this hardware is needed only for training, not for evaluating the estimator once trained. Fifth, the volume of the parameter space increases exponentially fast with the number of parameters; we therefore expect training a neural Bayes estimator to be more difficult for highly parameterized models, especially those with non-orthogonal parameterizations.

On the other hand, neural point estimators have substantial advantages that will, in our opinion, make them the preferred, de facto option in several applications in the near future. First, they are particularly useful when the likelihood function is unavailable or computationally intractable. Second, due to their amortized nature, they are ideal for settings in which the same statistical model must be fit repeatedly (e.g., when solving inverse problems with remote sensing data; see Cressie 2018), or in scenarios where accurate bootstrap-based uncertainty estimates are needed.

There are many avenues for future research. In this work, we have illustrated neural Bayes estimation with spatial data observed over a regular grid. Parameter estimation from irregular spatial data is an important problem, and one that we consider briefly with DNNs in Section S9 of the supplementary material. An attractive way forward in this regard is the use of graph neural networks (GNNs; Wu et al. 2021), which generalize the convolution operation to irregular data; their use in the context of parameter estimation is the subject of ongoing work. Ways to incorporate covariate information also need to be explored. A possible criticism of neural Bayes estimators is that they are not robust to model misspecification, since neural networks are, generally, poor extrapolators. However, we did not find this to be the case in our work: in Section S10 of the supplementary material, we provide some empirical evidence that neural Bayes estimators can be used on data that are very different from those used during training, though further research on this topic is needed. Neural Bayes estimators of the form (5) are ideally placed for online learning problems; investigating their potential in this context is the subject of future work. Finally, while this article focuses on neural point estimation, there is a growing literature on neural approaches that approximate the full posterior distribution (Radev et al. 2022; Pacchiardi and Dutta 2022); this too is a promising avenue for future research.

Supplementary Materials

In Section S1, we give a proof for the permutation-invariance property of Bayes estimators for replicated data. In Section S2, we illustrate how neural Bayes estimators depend on the sample size. In Section S3, we show that the Bayes estimator is consistent in the example of Section 2.2.3, but that the one-at-a-time estimator is not. In Sections S4 and S5, we conduct experiments to illustrate the benefits of “on-the-fly” simulation and pre-training, respectively. In Section S6, we describe how we simulate from the models considered in the main text. In Section S7, we describe our analysis for a Gaussian process model with known smoothness parameter. In Section S8, we detail the bootstrap techniques used in Section 4. In Section S9, we apply our methodology to the analysis of sea-surface temperature data in the Red Sea, where the data are measured irregularly in space. In Section S10, we provide some empirical evidence that our neural Bayes estimators can be used on data that are different to those used during training. Finally, in Section S11, we provide additional exploratory and diagnostic plots for the experiments given in the main text.

Acknowledgments

The authors would like to thank Yi Cao for technical support. Thanks also go to Emma Simpson, Jennifer Wadsworth, and Thomas Opitz for providing access to the preprocessed Red Sea dataset. The authors are grateful to Noel Cressie and Jonathan Rougier for providing comments on the manuscript. Finally, we thank the reviewers and the editors for their helpful comments and suggestions, which improved the quality of the manuscript.

Disclosure Statement

The authors report that there are no competing interests to declare.

Additional information

Funding

Matthew Sainsbury-Dale’s and Andrew Zammit-Mangion’s research was supported by an Australian Research Council (ARC) Discovery Early Career Research Award, DE180100203. Matthew Sainsbury-Dale’s research was also supported by an Australian Government Research Training Program Scholarship and a 2022 Statistical Society of Australia (SSA) top-up scholarship. Raphaël Huser was partially supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. OSR-CRG2020-4394. The authors would like to acknowledge Johan Barthelemy, NVIDIA, SMART Infrastructure Facility of the University of Wollongong, as well as a 2021 Major Equipment Grant from the University of Wollongong’s University Research Committee, that together have provided access to extensive GPU computing resources.

References

  • Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2004), Hierarchical Modeling and Analysis for Spatial Data, Boca Raton, FL: Chapman and Hall/CRC Press.
  • Banesh, D., Panda, N., Biswas, A., Roekel, L. V., Oyen, D., Urban, N., Grosskopf, M., Wolfe, J., and Lawrence, E. (2021). “Fast Gaussian Process Estimation for Large-Scale in Situ Inference Using Convolutional Neural Networks,” in IEEE International Conference on Big Data, pp. 3731–3739.
  • Beaumont, M. A., Zhang, W., and Balding, D. J. (2002), “Approximate Bayesian Computation in Population Genetics,” Genetics, 162, 2025–2035. DOI: 10.1093/genetics/162.4.2025.
  • Bevilacqua, M., Gaetan, C., Mateu, J., and Porcu, E. (2012), “Estimating Space and Space-Time Covariance Functions for Large Data Sets: A Weighted Composite Likelihood Approach,” Journal of the American Statistical Association, 107, 268–280. DOI: 10.1080/01621459.2011.646928.
  • Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017), “Julia: A Fresh Approach to Numerical Computing,” SIAM Review, 59, 65–98. DOI: 10.1137/141000671.
  • Casella, G., and Berger, R. (2001), Statistical Inference (2nd ed.), Belmont, CA: Duxbury.
  • Castruccio, S., Huser, R., and Genton, M. G. (2016), “High-Order Composite Likelihood Inference for Max-Stable Distributions and Processes,” Journal of Computational and Graphical Statistics, 25, 1212–1229. DOI: 10.1080/10618600.2015.1086656.
  • Chan, J., Perrone, V., Spence, J., Jenkins, P., Mathieson, S., and Song, Y. (2018), “A Likelihood-Free Inference Framework for Population Genetic Data Using Exchangeable Neural Networks,” in Advances in Neural Information Processing Systems (Vol. 31), eds. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Curran Associates, Inc.
  • Chon, K., and Cohen, R. (1997), “Linear and Nonlinear ARMA Model Parameter Estimation Using an Artificial Neural Network,” IEEE Transactions on Biomedical Engineering, 44, 168–174. DOI: 10.1109/10.554763.
  • Cox, D., and Reid, N. (2004), “A Note on Pseudolikelihood Constructed from Marginal Densities,” Biometrika, 91, 729–737. DOI: 10.1093/biomet/91.3.729.
  • Creel, M. (2017), “Neural Nets for Indirect Inference,” Econometrics and Statistics, 2, 36–49. DOI: 10.1016/j.ecosta.2016.11.008.
  • Cressie, N. (1993), Statistics for Spatial Data (rev. ed.), Hoboken, NJ: Wiley.
  • ——- (2018), “Mission co2ntrol: A Statistical Scientist’s Role in Remote Sensing of Atmospheric Carbon Dioxide,” Journal of the American Statistical Association, 113, 152–168.
  • ——- (2023), “Decisions, Decisions, Decisions in an Uncertain Environment,” Environmetrics, 34, e2767.
  • Davison, A. C., and Huser, R. (2015), “Statistics of Extremes,” Annual Review of Statistics and its Application, 2, 203–235. DOI: 10.1146/annurev-statistics-010814-020133.
  • Davison, A. C., Huser, R., and Thibaud, E. (2019), “Spatial Extremes,” in Handbook of Environmental and Ecological Statistics, eds. A. E. Gelfand, M. Fuentes, J. A. Hoeting, and R. L. Smith, pp. 711–744, Boca Raton, FL: Chapman & Hall/CRC Press.
  • Diggle, P. J., and Gratton, R. J. (1984), “Monte Carlo Methods of Inference for Implicit Statistical Models,” Journal of the Royal Statistical Society, Series B, 46, 193–227. DOI: 10.1111/j.2517-6161.1984.tb01290.x.
  • Diggle, P. J., and Ribeiro, P. J. (2007), Model-Based Geostatistics, New York, NY: Springer.
  • Donlon, C. J., Martin, M., Stark, J., Roberts-Jones, J., Fiedler, E., and Wimmer, W. (2012), “The Operational Sea Surface Temperature and Sea Ice Analysis (OSTIA) System,” Remote Sensing of Environment, 116, 140–158. DOI: 10.1016/j.rse.2010.10.017.
  • Flagel, L., Brandvain, Y., and Schrider, D. R. (2018), “The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference,” Molecular Biology and Evolution, 36, 220–238. DOI: 10.1093/molbev/msy224.
  • Gaskin, T., Pavliotis, G. A., and Girolami, M. (2023), “Neural Parameter Calibration for Large-Scale Multiagent Models,” Proceedings of the National Academy of Sciences, 120, e2216415120. DOI: 10.1073/pnas.2216415120.
  • Gerber, F., and Nychka, D. W. (2021), “Fast Covariance Parameter Estimation of Spatial Gaussian Process Models Using Neural Networks,” Stat, 10, e382. DOI: 10.1002/sta4.382.
  • Goodfellow, I., Bengio, Y., and Courville, A. (2016), Deep Learning, Cambridge, MA: MIT Press.
  • Han, J., Li, Y., Lin, L., Lu, J., Zhang, J., and Zhang, L. (2022), “Universal Approximation of Symmetric and Anti-Symmetric Functions,” Communications in Mathematical Sciences, 20, 1397–1408. DOI: 10.4310/CMS.2022.v20.n5.a8.
  • Hazra, A., and Huser, R. (2021), “Estimating High-Resolution Red Sea Surface Temperature Hotspots, Using a Low-Rank Semiparametric Spatial Model,” Annals of Applied Statistics, 15, 572–596.
  • Heffernan, J. E., and Tawn, J. A. (2004), “A Conditional Approach for Multivariate Extreme Values,” Journal of the Royal Statistical Society, Series B, 66, 497–546. DOI: 10.1111/j.1467-9868.2004.02050.x.
  • Hornik, K., Stinchcombe, M., and White, H. (1989), “Multilayer Feedforward Networks are Universal Approximators,” Neural Networks, 2, 359–366. DOI: 10.1016/0893-6080(89)90020-8.
  • Huser, R. (2021), “Editorial: EVA 2019 Data Competition on Spatio-Temporal Prediction of Red Sea Surface Temperature Extremes,” Extremes, 24, 91–104. DOI: 10.1007/s10687-019-00369-9.
  • Huser, R., and Davison, A. C. (2013), “Composite Likelihood Estimation for the Brown–Resnick Process,” Biometrika, 100, 511–518. DOI: 10.1093/biomet/ass089.
  • Huser, R., Dombry, C., Ribatet, M., and Genton, M. G. (2019), “Full Likelihood Inference for Max-Stable Data,” Stat, 8, e218. DOI: 10.1002/sta4.218.
  • Huser, R., Stein, M. L., and Zhong, P. (2022), “Vecchia Likelihood Approximation for Accurate and Fast Inference in Intractable Spatial Extremes Models,” arXiv:2203.05626v1.
  • Huser, R., and Wadsworth, J. (2022), “Advances in Statistical Modeling of Spatial Extremes,” Wiley Interdisciplinary Reviews: Computational Statistics, 14, e1537.
  • Innes, M. (2018), “Flux: Elegant Machine Learning with Julia,” Journal of Open Source Software, 3, 602. DOI: 10.21105/joss.00602.
  • Lehmann, E. L., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer.
  • Lenzi, A., Bessac, J., Rudi, J., and Stein, M. L. (2023), “Neural Networks for Parameter Estimation in Intractable Models,” Computational Statistics & Data Analysis, 185, 107762. DOI: 10.1016/j.csda.2023.107762.
  • Lintusaari, J., Gutmann, M., Dutta, R., Kaski, S., and Corander, J. (2017), “Fundamentals and Recent Developments in Approximate Bayesian Computation,” Systematic Biology, 66, 66–82. DOI: 10.1093/sysbio/syw077.
  • McCullagh, P. (2002), “What Is a Statistical Model?,” The Annals of Statistics, 30, 1225–1310. DOI: 10.1214/aos/1035844977.
  • Murphy, R. L., Srinivasan, B., Rao, V. A., and Ribeiro, B. (2019), “Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs,” in 7th International Conference on Learning Representations, ICLR.
  • Pacchiardi, L., and Dutta, R. (2022), “Likelihood-Free Inference with Generative Neural Networks via Scoring Rule Minimization,” arXiv:2205.15784.
  • Padoan, S. A., Ribatet, M., and Sisson, S. A. (2010), “Likelihood-based Inference for Max-Stable Processes,” Journal of the American Statistical Association, 105, 263–277. DOI: 10.1198/jasa.2009.tm08577.
  • Radev, S. T., Mertens, U. K., Voss, A., Ardizzone, L., and Köthe, U. (2022), “BayesFlow: Learning Complex Stochastic Models with Invertible Neural Networks,” IEEE Transactions on Neural Networks and Learning Systems, 33, 1452–1466. DOI: 10.1109/TNNLS.2020.3042395.
  • R Core Team (2023), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
  • Richards, J., Tawn, J. A., and Brown, S. (2022), “Modelling Extremes of Spatial Aggregates of Precipitation Using Conditional Methods,” Annals of Applied Statistics, 16, 2693–2713.
  • Rudi, J., Bessac, J., and Lenzi, A. (2021), “Parameter Estimation with Dense and Convolutional Neural Networks Applied to the FitzHugh-Nagumo ODE,” in Proceedings of the 2nd Annual Conference on Mathematical and Scientific Machine Learning, volume 145 of Proceedings of Machine Learning Research, eds. J. Bruna, J. Hesthaven, and L. Zdeborova, pp. 1–28, PMLR.
  • Sang, H., and Genton, M. G. (2014), “Tapered Composite Likelihood for Spatial Max-Stable Models,” Spatial Statistics, 8, 86–103. DOI: 10.1016/j.spasta.2013.07.003.
  • Schlather, M. (2002), “Models for Stationary Max-Stable Random Fields,” Extremes, 5, 33–44.
  • Simpson, E. S., Opitz, T., and Wadsworth, J. L. (2023), “High-Dimensional Modeling of Spatial and Spatio-Temporal Conditional Extremes Using INLA and Gaussian Markov Random Fields,” Extremes, to appear. DOI: 10.1007/s10687-023-00468-8.
  • Simpson, E. S., and Wadsworth, J. L. (2021), “Conditional Modelling of Spatio-Temporal Extremes for Red Sea Surface Temperatures,” Spatial Statistics, 41, 100482. DOI: 10.1016/j.spasta.2020.100482.
  • Sisson, S. A., Fan, Y., and Beaumont, M. (2018), Handbook of Approximate Bayesian Computation, Boca Raton, FL: Chapman & Hall/CRC Press.
  • Soelch, M., Akhundov, A., van der Smagt, P., and Bayer, J. (2019), “On Deep Set Learning and the Choice of Aggregations,” in Proceedings of the 28th International Conference on Artificial Neural Networks, ICANN, Lecture Notes in Computer Science, eds. I. V. Tetko, V. Kurková, P. Karpov, and F. J. Theis, pp. 444–457, Springer.
  • Stein, M. L. (1999), Interpolation of Spatial Data: Some Theory for Kriging, New York: Springer.
  • Stein, M. L., Chi, Z., and Welty, L. J. (2004), “Approximating Likelihoods for Large Spatial Data Sets,” Journal of the Royal Statistical Society, Series B, 66, 275–296. DOI: 10.1046/j.1369-7412.2003.05512.x.
  • Strasser, H. (1981), “Consistency of Maximum Likelihood and Bayes Estimates,” The Annals of Statistics, 9, 1107–1113. DOI: 10.1214/aos/1176345590.
  • Subbotin, M. T. (1923), “On the Law of Frequency of Errors,” Matematicheskii Sbornik, 31, 296–301.
  • Varin, C., Reid, N., and Firth, D. (2011), “An Overview of Composite Likelihood Methods,” Statistica Sinica, 21, 5–42.
  • Varin, C., and Vidoni, P. (2005), “A Note on Composite Likelihood Inference and Model Selection,” Biometrika, 92, 519–528. DOI: 10.1093/biomet/92.3.519.
  • Vecchia, A. V. (1988), “Estimation and Model Identification for Continuous Spatial Processes,” Journal of the Royal Statistical Society, Series B, 50, 297–312. DOI: 10.1111/j.2517-6161.1988.tb01729.x.
  • Wadsworth, J. L., and Tawn, J. A. (2022), “Higher-Dimensional Spatial Extremes via Single-Site Conditioning,” Spatial Statistics, 51, 100677. DOI: 10.1016/j.spasta.2022.100677.
  • Wagstaff, E., Fuchs, F. B., Engelcke, M., Osborne, M., and Posner, I. (2022), “Universal Approximation of Functions on Sets,” Journal of Machine Learning Research, 23, 1–56.
  • Wagstaff, E., Fuchs, F. B., Engelcke, M., Posner, I., and Osborne, M. (2019), “On the Limitations of Representing Functions on Sets,” in Proceedings of the 36th International Conference on Machine Learning (Vol. 97), eds. K. Chaudhuri and R. Salakhutdinov, pp. 6487–6494, PMLR.
  • Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. (2021), “A Comprehensive Survey on Graph Neural Networks,” IEEE Transactions on Neural Networks and Learning Systems, 32, 4–24. DOI: 10.1109/TNNLS.2020.2978386.
  • Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. (2017), “Deep Sets,” in Advances in Neural Information Processing Systems (Vol. 30), eds. I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Curran Associates, Inc.
  • Zammit-Mangion, A., and Wikle, C. K. (2020), “Deep Integro-Difference Equation Models for Spatio-Temporal Forecasting,” Spatial Statistics, 37, 100408. DOI: 10.1016/j.spasta.2020.100408.
  • Zhang, H. (2004), “Inconsistent Estimation and Asymptotically Equal Interpolations in Model-based Geostatistics,” Journal of the American Statistical Association, 99, 250–261. DOI: 10.1198/016214504000000241.
  • Zhou, D. (2018), “Universality of Deep Convolutional Neural Networks,” Applied and Computational Harmonic Analysis, 48, 787–794. DOI: 10.1016/j.acha.2019.06.004.