Bayesian and Monte Carlo Methods

Quasi-Monte Carlo Methods for Binary Event Models with Complex Family Data

Pages 1393-1401 | Received 29 Oct 2021, Accepted 16 Nov 2022, Published online: 09 Jan 2023

Abstract

The generalized linear mixed model for binary outcomes with the probit link function is used in many fields but has a computationally challenging likelihood when there are many random effects. We extend a previously used importance sampler, making it much faster in the context of estimating heritability and related effects from family data, by adding a gradient approximation and a Hessian approximation and by providing a faster implementation. Additionally, a graph-based method is suggested to simplify the likelihood when there are thousands of individuals in each family. Simulation studies show that the resulting method is orders of magnitude faster, has a negligible efficiency loss, and yields confidence intervals with nominal coverage. We also analyze data from a large study of obsessive-compulsive disorder based on Swedish multi-generational data. In this analysis, the proposed method yielded similar results to a previous analysis, but was much faster. Supplementary materials for this article are available online.

1 Introduction

Generalized linear mixed models (GLMMs) are commonly used for the analysis of grouped data, either to control for, or estimate, group-specific effects. This article focuses on GLMMs for binary outcomes with the probit link function, which are commonly referred to as mixed probit models or liability-threshold models. These models have been widely used in many disciplines, such as econometrics (Wilhelm and de Matos 2013), quantitative genetics (de Villemereuil et al. 2016), animal breeding (Thompson 1990), psychiatry (Lichtenstein et al. 2009; Sandin et al. 2017; Bai et al. 2019; Mahjani et al. 2020), and family-based medicine studies (Pawitan et al. 2004; Brown and Prescott 2014). However, estimation of GLMMs with many random effects per cluster is computationally challenging, as this requires approximations of high-dimensional integrals. We particularly focus on estimation of heritability and related effects, where the number of random effects in each cluster is equal to the cluster size. This is problematic, for example, with multi-generational data from national registers where clusters can have thousands of observations.

Fast and deterministic procedures have been shown to be imprecise for Bayesian inference for the models we consider (Holand et al. 2013). Alternatively, Markov chain Monte Carlo (MCMC) is attractive because of its asymptotic properties and because it can handle very large clusters. However, a common problem with MCMC is slow convergence or, in the worst case, lack of practical convergence.

In frequentist inference, the likelihood of these GLMMs does not have a closed-form expression. Fast deterministic approximations exist (Breslow and Clayton 1993; Pinheiro and Bates 1995), but these can yield biased results, particularly with low event rates or high variability (Rodríguez and Goldman 2001; Pinheiro and Chao 2006; Joe 2008). Adaptive Gauss–Hermite quadrature is effective with few random effects per cluster but scales very poorly in the number of random effects per cluster (Pinheiro and Bates 1995).

The general expression of the likelihood of the GLMMs is in the form of Gaussian weighted integrals, although for the particular class of models used in this article, the likelihood can also be written as cumulative distribution functions (CDFs) of multivariate normal distributions (Christoffersen et al. 2021). Pawitan et al. (2004) make use of this alternative form of the likelihood using the importance sampling method proposed by Genz (1992) and Genz and Bretz (2002), which is based on randomized quasi-Monte Carlo (RQMC), to approximate the likelihood of each cluster. This method has been shown to provide a very efficient means of approximating the likelihood even in higher dimensions. However, a gradient approximation has not previously been used; previous implementations have used either derivative-free optimization methods or approximate derivatives from finite differences, resulting in slow estimation. This has limited how many random effects per cluster researchers have used. For example, previous applications have restricted analysis to smaller cluster sizes by dividing clusters and duplicating observations (Pawitan et al. 2004; Lichtenstein et al. 2009; Bai et al. 2019). This results in a pseudo-likelihood and makes it possible to estimate the model with previous integral approximations, but at the price of efficiency and invalid inference, unless corrections are made (Varin et al. 2011).

We develop a computationally efficient method to approximate the likelihood of GLMMs with binary outcomes and the probit link function, for large datasets with many random effects per cluster. We extend the importance sampler developed by Genz (1992) and Genz and Bretz (2002) to approximate the gradient of the log-likelihood, which can be used by gradient-based optimization methods, along with an extension that provides a Hessian approximation for fast inference. A highly efficient C++ implementation of our method is provided, which supports parallel processing. In addition, we propose an approach based on graph partitioning to split the very large clusters with thousands of random effects, to simplify the likelihood without losing much information in the context of random effect structures that are defined for heritability and related effects. Our efficient approximations and the graph partition method allow us to perform frequentist inference orders of magnitude faster than with a similar Bayesian model using Gibbs sampling.

The remainder of the article is organized as follows. Section 2 introduces the class of models for family-based studies that we use in this article. Section 3 describes the importance sampler we use to approximate the likelihood, the gradient of the log-likelihood, and the Hessian of the log-likelihood. We also describe our graph-based method to simplify the likelihood. Section 4 presents the results of the simulation studies, where we compare our method with two MCMC implementations. A study of possible efficiency loss, bias, and coverage of likelihood ratio-based and Wald-based confidence intervals when using our graph partitioning method and the pseudo-likelihood is provided. Section 5 presents our analysis of data from 897,210 individuals collected from the Swedish National Patient Register and the Swedish Multi-Generation Registry (SMGR) (Ekbom 2010). We estimate the heritability of obsessive-compulsive disorder (OCD) and compare the computation time with a Gibbs sampler. Section 6 provides the discussion and conclusion. The implementation of our methods is available as an R package called pedmod at https://github.com/boennecd/pedmod.

2 Models

Statistical inference methodologies have been used to understand the heredity and risk factors of phenotypes (an individual's observable trait). The fundamental step in understanding the inheritance of a phenotype is to partition the variability of that phenotype in a population into genetic and environmental components. For binary outcomes, the most flexible method is based on GLMMs, which can use data from families of varying sizes and structures (Tenesa and Haley 2013). GLMMs partition the variation of a phenotype into additive genetic effects (narrow-sense heritability) and environmental effects using random effects. Common examples of environmental effects are shared childhood environment effects and maternal effects. Maternal effects are the influences on the offspring phenotype that result from maternal phenotypes. Maternal effects can be partitioned into genetic maternal effects and environmental maternal effects.

In this section, we show how to use GLMMs to partition the variability of a phenotype. Assume that we want to estimate $K$ different random effects (e.g., direct additive genetic or maternal effects) and adjust for a number of fixed effects. Let $m$ be the number of independent families, with each family $i \in \{1,\dots,m\}$ consisting of $n_i$ members, on which we observe outcomes. Each outcome is a binary trait $Y_{ij} \in \{0,1\}$ for the $j$th observation in the $i$th family, with corresponding known fixed effect covariates $X_{ij}$. We model the $Y_{ij}$'s using a GLMM of the form $Y_{ij} = 1_{\{\beta^\top X_{ij} + \epsilon_{ij} > 0\}}$, where $1_{\{\cdot\}}$ is one if the condition in the subscript is true and zero otherwise, and

$$(\epsilon_{i1},\dots,\epsilon_{in_i})^\top \sim N^{(n_i)}\Big(\mathbf{0}_{n_i},\; I_{n_i} + \sum_{k=1}^{K} \sigma_k^2 C_{ik}\Big). \tag{1}$$

Here $\mathbf{0}_{l}$ is an $l$-dimensional vector of zeros, $I_l$ is the $l$-dimensional identity matrix, $N^{(v)}(\theta,\Gamma)$ is the $v$-dimensional multivariate normal distribution with mean $\theta$ and covariance matrix $\Gamma$, $\beta$ is a vector of unknown fixed effect coefficients, and $\sigma^2=(\sigma_1^2,\dots,\sigma_K^2)^\top$ are unknown scale parameters. $C_{ik}$ is the known relationship matrix for family $i$ and effect $k$. Another way to write the model is $Y_{ij}\sim \mathrm{Bin}(1,\Phi(\beta^\top X_{ij}+\zeta_{ij}))$ with $(\zeta_{i1},\dots,\zeta_{in_i})^\top\sim N^{(n_i)}(\mathbf{0}_{n_i},\sum_{k=1}^{K}\sigma_k^2 C_{ik})$, where $\mathrm{Bin}(m,p)$ is the binomial distribution with $m$ trials and probability $p$, and $\Phi(x)$ is the standard normal distribution's CDF evaluated at $x$, with the corresponding density function denoted by $\phi(x)$. In this form, it may be easier to recognize the model as a GLMM.

After estimating the parameters, we can calculate the proportion of phenotypic variation explained by each variance component after controlling for the fixed effects. That is,

$$h_k = \Big(1+\sum_{j=1}^{K}\sigma_j^2\Big)^{-1}\sigma_k^2, \tag{2}$$

fixing the residual variance to 1 as in Equation (1) (i.e., $\sigma_0^2=1$) and assuming that all relationship matrices have ones in the diagonal. An alternative and equivalent model can be obtained by replacing the covariance matrix in Equation (1) with $\bar\sigma_0^2 I_{n_i}+\sum_{k=1}^{K}\bar\sigma_k^2 C_{ik}$ and imposing the constraint $\sum_{k=0}^{K}\bar\sigma_k^2=1$. The corresponding fixed effect coefficients are $\bar\beta_j=\beta_j/\sqrt{1+\sum_{k=1}^{K}\sigma_k^2}$ and it follows that $\bar\sigma_k^2=h_k$. This parameterization may be preferable for reporting the results, as the scale parameters are equal to the proportion of variance attributable to each effect and the fixed effect coefficients are easier to compare in magnitude with models without random effects.
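To make the relationship between the two parameterizations concrete, the following is a minimal R sketch (not part of the pedmod package; the parameter values are made up for illustration) that converts estimates on the residual-variance-one scale to proportions of variance and rescaled fixed effects as in Equation (2).

## Minimal R sketch converting estimates on the residual-variance-one scale to
## proportions of variance and rescaled fixed effects, as in Equation (2).
## The numeric values are made up for illustration only.
beta   <- c(-3, 1, 2)           # fixed effects with residual variance fixed to 1
sigma2 <- c(Add = 2, Env = 1)   # scale parameters sigma_k^2

total_var <- 1 + sum(sigma2)    # residual variance is 1
h         <- sigma2 / total_var           # proportion of variance per effect, Equation (2)
beta_bar  <- beta / sqrt(total_var)       # fixed effects on the total-variance-one scale

h          # h["Add"] is the narrow-sense heritability
beta_bar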

A primary concern with estimating parameters for these GLMMs is that the log marginal likelihood is intractable. The log marginal likelihood term of the $i$th family is given by

$$l_i(\beta,\sigma) = \log\int_{C_i}\phi^{(n_i)}\Big(u;\,X_i\beta,\; I_{n_i}+\sum_{k=1}^{K}\sigma_k^2 C_{ik}\Big)\,du \tag{3}$$

with $C_i=\{u\in\mathbb{R}^{n_i}:\ a_i<u<b_i\}$ and

$$(a_{ij},b_{ij}) = \begin{cases} (-\infty,0) & Y_{ij}=0\\ (0,\infty) & Y_{ij}=1\end{cases}$$

where the inequality is elementwise, $X_i=(X_{i1},\dots,X_{in_i})^\top$, and $\phi^{(v)}(x;\theta,\Gamma)$ is the density of $N^{(v)}(\theta,\Gamma)$ evaluated at $x$. In Section 3, we propose to use RQMC to approximate the log marginal likelihood in Equation (3) along with the gradient with respect to $\beta$ and $\sigma^2$. We focus on modeling a single binary phenotype; however, as we discuss in Section 6, our approach can be easily extended for multiple phenotypes and survival outcomes.
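For a small family, the term in Equation (3) is a rectangle probability of a multivariate normal distribution and can be computed directly with mvtnorm::pmvnorm (Genz et al. 2020). The following is a minimal R sketch for a made-up three-member family; the relationship matrix, covariates, outcomes, and parameter values are illustrative only.

## Sketch of the log marginal likelihood term in Equation (3) for one small,
## made-up family, using mvtnorm::pmvnorm (Genz et al. 2020).
library(mvtnorm)

C_add <- matrix(c(1,  .5, .5,
                  .5, 1,  .5,
                  .5, .5, 1), 3)         # kinship-type relationship matrix
X     <- cbind(1, c(-.5, 0, 1))          # intercept and one covariate
y     <- c(0L, 1L, 1L)                   # binary outcomes
beta  <- c(-1, .5); sigma2_add <- 2

mu    <- drop(X %*% beta)
Sigma <- diag(3) + sigma2_add * C_add
lower <- ifelse(y == 1, 0, -Inf)         # (a_ij, b_ij) from Equation (3)
upper <- ifelse(y == 1, Inf, 0)

l_i <- log(pmvnorm(lower = lower, upper = upper, mean = mu, sigma = Sigma))
l_i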

3 Methods

In this section we describe how we fit GLMMs for complex family data by extending and improving the method by Genz and Bretz (Citation2002), making it faster and adding a gradient approximation and Hessian approximation (Section 3.1). In Section 3.2, we introduce a novel graph-based method to simplify the likelihood.

3.1 Estimations of GLMMs using Importance Sampling

The primary issue with evaluating the log marginal likelihood in Equation (3) is to approximate

$$L_i(\mu,\Psi) = \int_{C_i}\phi^{(n_i)}(u;\mu,\Psi)\,du \tag{4}$$

where $\mu=X_i\beta$ and $\Psi=I_{n_i}+\sum_{k=1}^{K}\sigma_k^2 C_{ik}$. Here, we provide details of the RQMC method we use to approximate this quantity, along with the gradient of the log-likelihood. We use importance sampling to approximate $L_i$ for a general $n_i$. In particular, we follow the approach of Genz (1992), Genz and Bretz (2002), and the Geweke, Hajivassiliou, and Keane (GHK) simulator used by Hajivassiliou et al. (1996), and use a series of conditional truncated normal distributions as the importance distribution. Although special and fast algorithms are available for $n_i=2,3$ (Genz and Bretz 2009), our algorithm is for the general case.

First, we rewrite $L_i$ as $L_i(\mu,\Psi)=\int_{\tilde C_i}\prod_{j=1}^{n_i}\phi(u_j)\,du$ where $\tilde C_i=\{u\in\mathbb{R}^{n_i}:\ a_i-\mu<Su<b_i-\mu\}$ and $SS^\top=\Psi$ is the Cholesky decomposition of $\Psi$ with $S$ lower triangular. Next, let $h(\cdot)$ be the importance distribution's density function, given by

$$h(u) = 1\{u\in\tilde C_i\}\prod_{j=1}^{n_i}\phi(u_j)/w_j(u), \tag{5}$$

$$w_j(u) = \Phi\big(\hat b_{ij}(u_{1:(j-1)})\big)-\Phi\big(\hat a_{ij}(u_{1:(j-1)})\big),$$

$$\hat b_{ij}(x) = \begin{cases} s_{11}^{-1}(b_{i1}-\mu_1) & j=1\\ s_{jj}^{-1}\big(b_{ij}-\mu_j-s_{j,1:(j-1)}^\top x\big) & j>1 \end{cases} \qquad \hat a_{ij}(x) = \begin{cases} s_{11}^{-1}(a_{i1}-\mu_1) & j=1\\ s_{jj}^{-1}\big(a_{ij}-\mu_j-s_{j,1:(j-1)}^\top x\big) & j>1 \end{cases}$$

where $x_{l:g}=(x_l,x_{l+1},\dots,x_g)$ is the $l$th to the $g$th element of a vector $x$ and $s_{j,1:(j-1)}$ is the corresponding part of the $j$th row of $S$. Then

$$L_i(\mu,\Psi)=\int h(u)\prod_{j=1}^{n_i}w_j(u)\,du. \tag{6}$$

It is computationally cheap to recursively sample $u_j\in\big(\hat a_{ij}(u_{1:(j-1)}),\hat b_{ij}(u_{1:(j-1)})\big)$ conditionally from $h(u)$, and it is cheap to evaluate $\prod_{j=1}^{n_i}w_j(u)$. Thus, each integrand evaluation in the importance sampling method is computationally cheap.
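The following is a minimal R sketch of the recursion in Equations (5) and (6), using plain Monte Carlo uniforms rather than the randomized lattice rule and variable reordering of the actual implementation; the function name and arguments are ours, for illustration only.

## Plain Monte Carlo version of the importance sampler in Equations (5)-(6);
## the actual implementation uses randomized lattice rules and a variable
## reordering. All names are ours and the code is only a sketch.
ghk_prob <- function(a, b, mu, Psi, n_samp = 1e4) {
  n <- length(mu)
  S <- t(chol(Psi))                  # lower-triangular factor, S %*% t(S) = Psi
  est <- numeric(n_samp)
  for (s in seq_len(n_samp)) {
    u <- numeric(n); w <- numeric(n)
    for (j in seq_len(n)) {
      r  <- if (j > 1) sum(S[j, 1:(j - 1)] * u[1:(j - 1)]) else 0
      lo <- pnorm((a[j] - mu[j] - r) / S[j, j])   # Phi(a_hat)
      hi <- pnorm((b[j] - mu[j] - r) / S[j, j])   # Phi(b_hat)
      w[j] <- hi - lo                             # w_j(u)
      u[j] <- qnorm(lo + runif(1) * w[j])         # sample the truncated conditional
    }
    est[s] <- prod(w)
  }
  c(estimate = mean(est), std_error = sd(est) / sqrt(n_samp))
}

For the made-up three-member family in the sketch after Equation (3), ghk_prob(lower, upper, mu, Sigma) agrees with exp(l_i) from the pmvnorm call up to Monte Carlo error.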

The variance of the estimator can be reduced by using a heuristic variable reordering (Genz and Bretz 2009). Our implementation is based on the Fortran code from Genz et al. (2020), which uses such a heuristic variable reordering (see supplementary material S1 for further details on variable reordering). The original implementation also uses RQMC based on lattice rules. This yields an $O(s^{-1}(\log s)^k)$ convergence rate for some $k\le n_i$, where $s$ is the number of samples, compared with $O(s^{-1/2})$ for a Monte Carlo (MC) method (Caflisch 1998). We have added support for scrambled Sobol sequences (Joe and Kuo 2003) using the scrambling method of Owen (1998). However, the latter RQMC method had a higher variance at a fixed computation time in preliminary experiments compared with the lattice rules. Therefore, we use the lattice rules.
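As an illustration of the RQMC point sets involved, the following R sketch generates a randomly shifted rank-1 lattice rule; the generating vector z below is arbitrary and chosen only for illustration, not the one used by the Fortran code of Genz et al. (2020).

## Sketch of a randomly shifted rank-1 lattice rule in d dimensions; the
## generating vector z is arbitrary and only for illustration.
lattice_points <- function(n, d, z, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  shift <- runif(d)                               # Cranley-Patterson random shift
  i <- 0:(n - 1)
  pts <- outer(i, z[seq_len(d)]) / n              # i * z / n
  (pts + matrix(shift, n, d, byrow = TRUE)) %% 1  # wrap back to the unit cube
}

## Each row is one point in [0, 1)^d; these points can replace the runif draws
## in the sampler above to obtain one randomized lattice estimate.
pts <- lattice_points(n = 1021, d = 5, z = c(1, 387, 149, 451, 737), seed = 1)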

We have rewritten the code from Genz et al. (2020) in C++. Our implementation supports computation in parallel by using OpenMP with Boost's random header through the BH package (Eddelbuettel et al. 2020). This is a major advantage given the availability of multicore processors and that the method is compute-bound even for moderate to high $n_i$. A substantial amount of the CPU time is spent on evaluating $\Phi$ and $\Phi^{-1}$ even for moderately large dimensions (say $n_i=50$). For this reason, we allow the user to replace the 16-digit precision method for $\Phi^{-1}$ from Wichura (1988), used by the Fortran implementation through R (R Core Team 2020), with a seven-digit precision method. We have also implemented monotone cubic Hermite interpolation based on the Fritsch–Carlson method for $\Phi$. This yields an error no greater than approximately $\pm 10^{-7}$. Using both of these alternatives seems to give a substantial reduction in the computation time while not affecting the estimates.
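Base R's splinefun with method "monoH.FC" implements the same Fritsch–Carlson monotone cubic Hermite interpolation; the following sketch builds an interpolant of Φ on a grid of our own choosing (which differs from the one in the C++ implementation) and checks its maximum absolute error on a set of test points.

## Sketch of a monotone cubic Hermite (Fritsch-Carlson) interpolant for the
## standard normal CDF using base R; the grid is our choice, for illustration.
grid    <- seq(-6, 6, length.out = 1000)
Phi_int <- splinefun(grid, pnorm(grid), method = "monoH.FC")

x <- seq(-5.5, 5.5, length.out = 1e5)
max(abs(Phi_int(x) - pnorm(x)))   # maximum absolute error on the test points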

Importantly, we have extended the algorithm by Genz et al. (2020) to also provide an approximation of the gradient with respect to $\beta$ and $\sigma^2$. Given a log-likelihood and gradient approximation, the model can be estimated using a gradient-based nonlinear optimizer. We use the BFGS implementation in the stats package in R, fixing the seed used in both approximations. Following Hajivassiliou et al. (1996) (see supplementary material S2 for an explicit link to our work), we find that the gradient is

$$\nabla_\mu L_i(\mu,\Psi) = \Psi^{-1}\int_{C_i}(u-\mu)\,\phi^{(n_i)}(u;\mu,\Psi)\,du \tag{7}$$

$$\nabla_\Psi L_i(\mu,\Psi) = \frac{1}{2}\Psi^{-1}\Big(\int_{C_i}\big((u-\mu)(u-\mu)^\top-\Psi\big)\,\phi^{(n_i)}(u;\mu,\Psi)\,du\Big)\Psi^{-1} \tag{8}$$

where $\nabla_\Psi L_i(\mu,\Psi)=\big(\partial L_i(\mu,\Psi)/\partial\psi_{lk}\big)_{l,k=1,\dots,n_i}$ and we define $\nabla_\mu L_i(\mu,\Psi)$ similarly. Thus, we can use an approximation of Equations (7) and (8), since the partial derivative of the log-likelihood is $\partial\log L_i/\partial x=(\partial L_i/\partial x)/L_i$, where we already have an approximation of the denominator. Applying steps similar to those used to get to Equation (6) to the gradient terms in Equations (7) and (8), together with an application of the chain rule, shows that

$$\nabla_\beta L_i(\mu,\Psi) = X_i^\top S^{-\top}\int u\,h(u)\prod_{j=1}^{n_i}w_j(u)\,du \tag{9}$$

$$\frac{\partial}{\partial\sigma_k^2} L_i(\mu,\Psi) = \frac{1}{2}\int\Big(u^\top S^{-1}C_{ik}S^{-\top}u-\mathrm{vec}(\Psi^{-1})^\top\mathrm{vec}(C_{ik})\Big)\,h(u)\prod_{j=1}^{n_i}w_j(u)\,du \tag{10}$$

where $\mathrm{vec}(\cdot)$ is the vectorization operator that transforms a matrix into a column vector by stacking its columns. We use the variable reordering and the RQMC method used by Genz and Bretz (2002) for both $L_i$ and the derivatives. This reordering seems to be beneficial also for the derivatives, although the reordering is designed to heuristically reduce the variance of the integrand in Equation (6). The approximation is unbiased for the likelihood when a finite number of samples is used, but it is biased for the log-likelihood and the gradient (Hajivassiliou et al. 1996). However, the bias can be expected to be negligible; we motivate this in supplementary material S3, where we show that the bias of the log-likelihood is approximately proportional to the variance of the estimator. Moreover, we show that the bias of the gradient of the log-likelihood from the division by the likelihood is approximately less than the variance of the estimate of the inverse of the likelihood. Both cases yield a negligible bias if the variances of the respective quantities are small. Further details of the implementation are given in Appendix A in the supplementary material, including formulas for the importance sampler for the Hessian approximation, which are omitted here due to space limitations.
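As an illustration of Equations (9) and (10), the following R sketch reuses the recursion from the ghk_prob sketch above, with plain Monte Carlo, to approximate the log-likelihood and its gradient for one family; all names are ours and this is a sketch of the estimators, not the pedmod implementation.

## Sketch of the gradient approximations in Equations (9)-(10), reusing the
## GHK-style recursion with plain Monte Carlo; names are ours and illustrative.
ghk_grad <- function(a, b, X, beta, C_list, sigma2, n_samp = 1e4) {
  mu  <- drop(X %*% beta)
  Psi <- diag(length(mu)) + Reduce(`+`, Map(`*`, sigma2, C_list))
  S   <- t(chol(Psi)); S_inv <- solve(S); Psi_inv <- chol2inv(chol(Psi))
  n   <- length(mu)
  lik <- 0; g_mu <- numeric(n); g_sig <- numeric(length(sigma2))
  for (s in seq_len(n_samp)) {
    u <- numeric(n); w <- numeric(n)
    for (j in seq_len(n)) {
      r  <- if (j > 1) sum(S[j, 1:(j - 1)] * u[1:(j - 1)]) else 0
      lo <- pnorm((a[j] - mu[j] - r) / S[j, j])
      hi <- pnorm((b[j] - mu[j] - r) / S[j, j])
      w[j] <- hi - lo
      u[j] <- qnorm(lo + runif(1) * w[j])
    }
    W     <- prod(w)
    lik   <- lik + W
    g_mu  <- g_mu + W * u                                    # integrand in Equation (9)
    g_sig <- g_sig + W * vapply(C_list, function(C)          # integrand in Equation (10)
      0.5 * (drop(crossprod(u, S_inv %*% C %*% t(S_inv)) %*% u) -
               sum(Psi_inv * C)), numeric(1))
  }
  list(loglik      = log(lik / n_samp),
       grad_beta   = drop(crossprod(X, t(S_inv) %*% g_mu)) / lik,   # Equation (9) / L_i
       grad_sigma2 = g_sig / lik)                                   # Equation (10) / L_i
}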

3.2 Recursive Partitioning for Handling Large Families

In the absence of genotypic data, estimating genetic contributions to phenotypes/diseases requires large datasets with detailed knowledge of family relationships. Such datasets are good examples of complex structured data. Here we analyze data from a sample of 897,210 individuals born in Sweden between 1982 and 1990, for which information on diagnosis of OCD has previously been obtained from the Swedish National Patient Register (Mahjani et al. 2020). Information on the family relationships for a maximum of three (prior) generations has also previously been obtained for these individuals from the SMGR, which holds information on family relatedness for more than nine million individuals. In the dataset we analyze, most individuals belong to marginally independent families that are quite small, but there are also a few very large families, with the largest cluster containing 167,279 individuals, which makes it computationally challenging to fit the GLMM.

Importance sampling is often limited in terms of scaling to very high-dimensional integrals, as the variance of the estimate often depends on the dimension of the integral. This is true for the importance sampler we use, and thus we cannot work with a 167,279-dimensional integral (Botev 2017). Consequently, we have to make some clusters smaller in order to apply importance sampling. Previously, authors have tended to use a pseudo-likelihood, possibly without addressing that standard asymptotics are no longer applicable (Varin et al. 2011). The result is a loss of efficiency and invalid confidence intervals and p-values unless adjustments are made. However, the approach simplifies the likelihood substantially and has been necessary with previously used estimation methods. Examples are Pawitan et al. (2004), Lichtenstein et al. (2009), and Bai et al. (2019), in which the datasets are similar to the three-generation dataset that we use. Typically, clusters are defined in terms of grandparents, where children and grandchildren belong to the same cluster. This leads to duplication of outcomes, as grandchildren will have grandparents on both their mother's and father's side, as illustrated in the supplementary material of Bai et al. (2019) (Footnote 1).

In this article, we instead suggest simplifying the pedigree without duplicating observations. The individuals in the larger families are usually connected by long parent–child paths in the three-generation data. This is illustrated with a family with 48 members in Figure 1. This suggests that it may be possible to limit the family sizes in the data by removing a small number of parent–child relationships and thereby splitting the pedigree. To formalize this, we formulate a family as an undirected graph $G=(V,E)$ with the vertices $V$ corresponding to individuals and the edges $E$ corresponding to parent–child relations. Partitioning $V$ into two disjoint and connected sets $V_1$ and $V_2$ implies that, if we ignore the edges between $V_1$ and $V_2$, we only have to do $|V_1|$- and $|V_2|$-dimensional integration rather than $|V|=|V_1|+|V_2|$-dimensional integration.

Fig. 1 Example of a family with 48 members and its graph representation. The left plot shows the pedigree of the family. Circles are females and squares are males. Dashed lines point to the same individual: the individual is shown with the siblings at one end and with the partner at the other end. The right plot shows the family illustrated as a graph with edges between parents and children as in Section 3.2.


The above suggests the following procedure to simplify the clusters. Let $w$ be a vertex weight function and $h$ be an edge weight function. We set $w$ to be one if we observe an outcome and $10^{-5}$ otherwise, and $h$ to be 10 if the child has an observed outcome and one otherwise. We will discuss other strategies for selecting $h$ later, in Section 6. We then find a partition with connected sets $V_1,\dots,V_k$, for some $k\ge 1$, such that

$$\min \sum_{e\in\{d\in E:\ d\ \text{crosses two sets}\}} h(e)\quad \text{s.t.}\quad \bigcup_{l=1}^{k}V_l=V,\quad V_i\cap V_j=\emptyset\ \ \forall\, i\neq j,\quad w(V_i)\le N\ \ i=1,\dots,k$$

where $w(V_i)$ is the sum of the vertex weights in $V_i$ and $N$ is a user-specified maximum dimension of the integrals (i.e., the maximum number of individuals with observed outcomes in a cluster). Thus, the objective is to find the minimum weighted set of parent–child relationships to remove, such that we end up with some $k\ge 1$ families within which individuals' weights sum to at most $N$. With our vertex weight function, this amounts to requiring that no family has more than $N$ observed outcomes; a toy example of splitting a pedigree in this way is sketched below. The above problem is similar to the $p$-way partitioning problem from graph theory (Buluç et al. 2016), except that what matters in our application is connected sets, not possibly unconnected sets. The $p$-way partitioning problem is NP-complete, but many heuristics have been suggested (Buluç et al. 2016).
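To illustrate the effect of removing edges, the following base-R sketch splits a small, chain-like toy pedigree into connected components after cutting one parent–child edge; the toy graph and the choice of which edge to cut are ours, and the sketch does not implement the weighted greedy algorithm described in the next paragraph.

## Base-R sketch: removing a parent-child edge splits a pedigree graph into
## connected components, bounding the dimension of each likelihood term.
components_of <- function(n_vertices, edges) {
  adj <- vector("list", n_vertices)
  for (i in seq_len(nrow(edges))) {
    adj[[edges[i, 1]]] <- c(adj[[edges[i, 1]]], edges[i, 2])
    adj[[edges[i, 2]]] <- c(adj[[edges[i, 2]]], edges[i, 1])
  }
  comp <- integer(n_vertices); id <- 0
  for (v in seq_len(n_vertices)) {
    if (comp[v]) next
    id <- id + 1; queue <- v
    while (length(queue)) {                    # breadth-first search
      x <- queue[1]; queue <- queue[-1]
      if (comp[x]) next
      comp[x] <- id
      queue <- c(queue, adj[[x]][comp[adj[[x]]] == 0])
    }
  }
  comp
}

## A chain-like toy pedigree of 9 individuals; cutting the edge 5-6 gives two
## components, so two integrals of dimension at most 5 instead of one of 9.
edges   <- cbind(1:8, 2:9)
reduced <- edges[!(edges[, 1] == 5 & edges[, 2] == 6), ]
table(components_of(9, reduced))   # component sizes after the cut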

We have implemented a recursive partitioning method similar to the method suggested by Simon and Teng (1997). This requires a graph partitioning method, for which we have implemented a greedy algorithm similar to those suggested by Kernighan and Lin (1970) and Fiduccia and Mattheyses (1982). However, we found quite large improvements by starting the greedy partitioning method from an approximation of the maximally balanced connected partition. For the latter, we implemented the algorithm suggested by Chlebíková (1996), using the algorithm suggested by Hopcroft and Tarjan (1973) to find the biconnected components. More details are provided in supplementary material S4. It is likely that one of the recent multilevel methods (Buluç et al. 2016) would perform better if adapted to our problem. However, our implementation seems to work well on the OCD data, as we show in Section 5. We have focused on simplifying the computational problem when the relationship matrices can be defined in terms of parent–child relationships. The motivation is that many other effects have dependencies which are also defined in terms of parent–child relationships (e.g., maternal effects, paternal effects, and shared childhood environment).

4 Simulations

We conducted multiple simulation studies to compare the computation time and the coverage of Wald-based confidence intervals, and to examine any potential biases, with our RQMC method and with MCMC methods. We made comparisons with MCMC methods since these methods can handle large families. The two MCMC implementations we compare with are MCMCglmm (Hadfield 2010) and a Gibbs sampler implemented in thrgibbs1f90b (part of the BLUPF90 family of programs; Misztal et al. 2018). In another set of simulation studies, we investigated potential biases introduced by using our graph-based method and the pseudo-likelihood method for simplifying large clusters, described in Section 3.2.

For the first set of simulation studies, we sampled 250 families of size $n_i=10$. All families had the same pedigree, which is shown in Figure 2. We used the following model to generate the outcomes: $Y_{ij}=1_{\{\beta_0+\beta_1 X_{ij}+\beta_2 B_{ij}+\epsilon_{ij}>0\}}$ with $(\epsilon_{i1},\dots,\epsilon_{i10})^\top\sim N^{(10)}(\mathbf{0}_{10},\,\sigma_{\mathrm{Add}}^2 C_{\mathrm{Add}}+\sigma_{\mathrm{Env}}^2 C_{\mathrm{Env}}+I_{10})$, $X_{ij}\sim N(0,1)$, $B_{ij}\sim\mathrm{Bin}(1,0.5)$, and $\beta=(-3,1,2)^\top$. $C_{\mathrm{Add}}$ is the relationship matrix for the additive genetic effect, and $C_{\mathrm{Env}}$ is the relationship matrix for the childhood environment effect. Both matrices are shown in supplementary material S5.

Fig. 2 Pedigree used for all families in the simulation studies. Circles indicate females and squares indicate males.


In the first study, we simulated data using only an additive genetic effect with $\sigma_{\mathrm{Add}}^2=3$, corresponding to a heritability of $h_{\mathrm{Add}}=0.75$. In the second study, we simulated data using an additive genetic effect and a childhood environment effect with $(\sigma_{\mathrm{Add}}^2,\sigma_{\mathrm{Env}}^2)=(2,1)$. The marginal probability of observing an event was 0.208 in both simulation studies. Four threads were used with our RQMC method, using a fast heuristic to get the starting values, followed by optimizing the log marginal likelihood using 5000 to 25,000 samples per term (Footnote 2). 25,000 samples were used for the Hessian approximations. For the first study (additive effects only), 1000 datasets were analyzed with our RQMC method, but only the first ten datasets were analyzed using the MCMC approximations due to their long computation times. We used 1,000,000 samples with MCMCglmm, with 100,000 used as burn-in. We used a multivariate normal prior for the fixed effect coefficients, $\beta$, with $100 I_3$ as the covariance matrix and an F-distribution with 1 and 10 degrees of freedom as the prior for $\sigma_{\mathrm{Add}}^2$. The latter yielded a somewhat flat prior on the heritability (Villemereuil et al. 2013) and solved some of the mixing problems. The average effective sample size of $\sigma_{\mathrm{Add}}^2$ was 335.6 with a standard deviation of 283.6. An improper prior was used with BLUPF90, with 200,000 samples of which 10,000 were used as burn-in. The average effective sample size for $\sigma_{\mathrm{Add}}^2$ was 661.2 with a standard deviation of 48.6.

The MCMCglmm and RQMC implementations were run on a laptop with an Intel® Core™ i7-8750H CPU, and the code was compiled with GCC 10.1. The BLUPF90 program was run on a server with an Intel® Xeon® Processor E5-2630 v3. The results are shown in Table 1 (Footnote 3). There is no, or only minor, evidence of bias in the simulation studies with our RQMC method. Similar conclusions apply to the MCMC methods based on the first ten datasets. Importantly, our RQMC method is much faster than the MCMC implementations. This makes it possible to compute, for example, profile likelihood-based confidence intervals, as we do later. For the second study, with additive genetic and childhood environment effects, there was similarly no, or only minor, evidence of bias. The coverage of the Wald-based confidence intervals was in both cases close to the nominal level. The time taken to compute the Hessian is included in the estimation time. We also estimated the models for the first 100 datasets without using the gradient. On average, it took 3.71 and 4.13 times longer to fit the models without and with the environment effect, respectively.

Table 1 Bias estimates, coverage estimates and average computation times plus and minus one standard error for the estimates in the simulation studies.

Next, we studied potential biases introduced by simplifying pedigrees using the approach we suggest in Section 3.2. It is necessary to simplify the likelihood for families with more than a few hundred observations when using the RQMC method. For each dataset and family, we simulated a random pedigree for a three-generation dataset similar to the observational data we use later in Section 5. Details of the method used to simulate the pedigrees are provided in supplementary material S5. We retained only the outcomes of the last generation, to mimic the observational data. The families we simulate each have about 100 observed outcomes. We then used the graph-based method of Section 3.2 to split the families such that the integrals had a maximum dimension of 33. The pseudo-likelihood method mentioned in Section 3.2 was also used. We used 100,000 samples per log-likelihood term because the dimensions of the integrals were larger than before. 250,000 samples were used for the Hessian approximations. One RQMC sequence was used for each term, as the number of samples was fixed and the error decreases at a faster rate in the length of each RQMC sequence than in the number of sequences. For this simulation, we used a maternal effect, as in Mahjani et al. (2020), instead of the childhood environment effect which we used earlier. We set $(\sigma_{\mathrm{Add}}^2,\sigma_{\mathrm{Mat}}^2)=(2,1)$ and $\beta_0=-0.5$ to get a prevalence closer to 50%. There were 25 families in each dataset, and 100 datasets were sampled.

Bias estimates, coverage of Wald-based confidence intervals, and average computation times are shown in Table 2. There was no evidence of any bias when our RQMC method was applied to the full datasets, the reduced datasets, or the datasets from the pseudo-likelihood. The estimation times with the reduced datasets were shorter than for the full datasets, as expected. They were also shorter than for the pseudo-likelihood. An explanation is that the algorithm behaves like $O(n_i)$ when $n_i$ is not too large and the number of samples is fixed, as we mention in Appendix A in the supplementary material. It is, though, conceivable that we could use fewer samples with the pseudo-likelihood to get the same precision of the likelihood estimate as with the two other likelihoods. The standard errors show that neither the reduced datasets nor the pseudo-likelihood have a large efficiency loss. However, the Wald-based confidence intervals for the pseudo-likelihood-based method do not have the right coverage and are anti-conservative, as expected. 90% profile likelihood-based confidence intervals for $h_{\mathrm{Add}}$ were also constructed. The coverage was 89% with our suggested graph-based method and 84% with the pseudo-likelihood-based method, consistent with the Wald-based confidence intervals (Footnote 4). It is possible to correct the latter, but this is not straightforward (Varin et al. 2011). A note on the construction of the confidence intervals is provided in supplementary material S6.

Table 2 Bias estimates and average computation times plus and minus one standard error for the estimates for the simulation study with an additive genetic and maternal effect.

5 Analysis of Data from Swedish National Registries for Estimating the Heritability of OCD

In this section, we analyze a dataset using a GLMM to estimate the direct additive genetic effect for the risk of OCD. We chose a sample of 897,210 individuals as described in Section 3.2. Family relationships were constructed using information from three generations. A total of 6888 individuals (in the last generation, that is, born between 1982 and 1990) had an OCD diagnosis. This dataset is similar to the one used by Mahjani et al. (2020), but not identical, since the two datasets were extracted from the registry at different times and maintained by different institutes.

We applied the recursive partitioning method described in Section 3.2. In the largest family, which had 167,279 outcomes, we removed 3076 parent–child relationships by requiring that we perform no more than $N=200$-dimensional integration. Note that with this many outcomes and a requirement of dimension 200, the smallest possible number of removed relationships would be $167{,}279/200-1\approx 836$, in the (unrealistic) case where the graph is, for example, a chain. In total, we removed 3131 parent–child relations from the 3,944,737 relations in the entire dataset. This yielded a reduction of 1.32% in the unique cousin pairs and 0.038% in the unique sibling pairs with observed outcomes. These are relevant metrics for the amount of removed information, as the individuals with observed outcomes are either siblings, half-siblings, or cousins. Less information was removed in these data than in the second simulation study in Section 4. In that (simulation) study, 11.7% of the unique cousin pairs and 0.0159% of the unique sibling pairs were removed in the reduced datasets, on average. We also computed the Euclidean norm of the difference between the reduced and full relationship matrices divided by the Euclidean norm of the full relationship matrix, ignoring the diagonal entries, which are always identical. This ratio was 0.0440 in the applied data, and the average ratio was 0.1521 in the simulation study. We therefore expected the impact of reducing the OCD data on the analysis to be minimal.
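For concreteness, the relative norm used above can be computed as in the following R sketch; the matrices C_full and C_reduced are small, made-up examples and the function name is ours.

## Sketch of the metric used above: relative Frobenius norm of the difference
## between the reduced and full relationship matrices, ignoring the diagonals.
rel_offdiag_norm <- function(C_reduced, C_full) {
  D_red  <- C_reduced; diag(D_red)  <- 0   # drop the (identical) diagonals
  D_full <- C_full;    diag(D_full) <- 0
  norm(D_red - D_full, type = "F") / norm(D_full, type = "F")
}

C_full    <- matrix(c(1,   .5,  .25,
                      .5,  1,   .125,
                      .25, .125, 1), 3)
C_reduced <- C_full; C_reduced[1, 3] <- C_reduced[3, 1] <- 0  # removed relation
rel_offdiag_norm(C_reduced, C_full)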

In the analysis of the data in Mahjani et al. (2020), the fitted models included fixed effects for sex and age of the mother (AOM). AOM was included as a binary covariate ($\le 34$, $>34$). We used the same covariates, which allowed us to reduce the number of log marginal likelihood terms from a total of 237,332 to 37,865 (unique) terms; that is, some of the terms were identical in terms of the relationship matrix, covariates, and outcomes, and these could be weighted according to their frequency. This approach, of using weights, is only possible for discrete covariates and has been used by Pawitan et al. (2004) to speed up computation. However, in our study this approach was not so useful because of the large cluster sizes, as the larger clusters take substantially more time to approximate and because most of the large families are unique. We used 100,000 samples per log marginal likelihood term. One RQMC sequence was used for each log marginal likelihood term, as in the second simulation study in Section 4.

We estimated the model on a server with two Intel® Xeon® processors E5-2660 v2, using 20 logical cores for everything but the Hessian, where we used 10 logical cores as the current implementation did not run faster with more threads on the CPU of the server. It took 2 hr and 52 min to fit the model, and an additional 3 hr and 18 min was spent on approximating the Hessian with 250,000 samples per log marginal likelihood term. In contrast, it takes about a month to generate a chain of length 100,000 using the BLUPF90 software. The estimates obtained using our approach are shown in Table 3. They are similar to the posterior means obtained by Mahjani et al. (2020) using BLUPF90. The Bayesian analysis of Mahjani et al. (2020), based on improper priors, obtained posterior means of $\log 1.27=0.239$ for $\beta_{\mathrm{sex}}$, $\log 1.17=0.157$ for $\beta_{\mathrm{AOM}}$, and 0.482 for $h_{\mathrm{Add}}$. All values are consistent with our results. We also computed a 95% profile likelihood-based confidence interval for $h_{\mathrm{Add}}$. The interval was $(0.387, 0.545)$, which is very close to the Wald-based confidence interval of $(0.394, 0.543)$. However, the interval was wider than the credible interval of $(0.431, 0.526)$ reported by Mahjani et al. (2020). We are unable to explain this discrepancy but suspect that it is related to us having fewer cases (Mahjani et al. 2020 have 7184). It took 7 hr and 49 min to compute the confidence interval.

Table 3 Estimation of the direct additive genetic effect for OCD using our proposed method.

6 Discussion

In this article, we have explored approaches for fitting generalized linear mixed models with binary outcomes, using the probit link function, which are used in a number of fields. We have in particular focused on analyzing complex family data and have presented an analysis of data from a large study of OCD in Swedish families. The suggested graph-based partitioning method has allowed us to use almost the full information from the pedigrees even though some clusters have thousands of members. Using an extension of the importance sampler from Genz and Bretz (2002) to compute gradient approximations of the log-likelihood, fast estimation was possible. We used simulation studies to show that the graph-based partitioning approach to simplifying the likelihood resulted in little, if any, loss of efficiency and in nominal coverage of likelihood ratio-based confidence intervals. The importance sampler was also extended to provide a Hessian approximation, which can be used for fast frequentist inference. The simulation studies equally showed that the Wald-based confidence intervals had close to nominal coverage.

The method suggested by Genz (1992) and Genz and Bretz (2002) has been shown to be a very competitive means of approximating the likelihood in Equation (4). However, Botev (2017) has recently extended their work using exponential tilting of the importance distribution in Equation (5). Botev (2017) shows that his minimax tilting method may yield an improvement over the RQMC method in general, particularly for certain rare events.

We have used one particular graph-based method to simplify the pedigree structures. Other procedures have of course been suggested to simplify pedigrees, but these have primarily focused on much fewer individuals who are related through much more detailed pedigrees, and these procedures have been tailored for specific contexts (Liu et al. 2008; Kirichenko et al. 2009). Falchi et al. (2004) propose a method which, like ours, is graph-based. They, however, build the graph by including edges between individuals with nonzero kinship coefficients and look for cliques. The method is not so useful in our application, as we only have small cliques. Modifications of our graph partitioning approach are also possible. For example, the edge weight function, $h$, in Section 3.2 could be set to the sum of the kinship coefficients for the observed outcomes that will be set to zero if the parent–child relation is removed. This would be computationally easy to update in our three-generation data, as relatively few edge weights need to be updated after removal (or addition) of a parent–child relationship. Our graph-based method may also be used to improve inference for Bayesian methods which factorize the covariance matrix. It will be faster to work with the many smaller covariance matrices rather than the full covariance matrix, even when exploiting that the latter is sparse.

In one of the simulation studies in Section 4 we used shared childhood environment effects. In general, for observational data analysis, strong assumptions have to be made to specify the correlation structure of the environmental effects for each unique pedigree. An important issue is what assumptions to make, for example, between the shared environment of children and their parents and between cousins. Unfortunately, the assumptions may affect the heritability estimate and whether one can distinguish between heritability and environment effects in a given study design.

We have focused on univariate outcomes. The approach described in this article can however easily be extended to handle multiple traits, that is, for studying genetic and environmental associations between traits. Importantly, our method and software can handle this extension as outlined in supplementary material S7.

Finally, we note also that it is easy to extend our approach to handle time-to-event data with censored outcomes. Christoffersen et al. (2021) have previously shown how the approximation method suggested in this article can be used for a class of mixed generalized survival models (GSMs) (Liu et al. 2017) and related mixed transformation models (Hothorn et al. 2018). An advantage of these models is that the proportion of variance attributable to each effect can be quantified on an estimated scale in a similar manner to the model presented for binary traits. This is unlike other mixed survival models (Scurrah et al. 2000). Further details are given in supplementary material S8.

Appendix

Contains computational details for the importance sampler and details for the Hessian approximation. (.pdf)

Supplements

Contains information about the variable reordering, the link to Hajivassiliou et al. (1996), a discussion of finite sample bias, details for the graph partitioning implementation, details for the simulation studies, and a discussion on possible extensions. (.pdf)

Supplemental material

Supplemental Material

Download Zip (1.4 MB)

Acknowledgments

We thank the reviewers and editor for very constructive and useful feedback which has greatly improved the article.

Supplementary Materials

Code: Contains the code to reproduce the simulations. (.zip file)

Disclosure Statement

The authors report there are no competing interests to declare.

Additional information

Funding

This work was partially supported by the Swedish e-Science Research Center and the Swedish Research Council under grant 2019-00227.

Notes

1 Authors have also tended to limit the number of children of each pair of parents and to discretize the covariates because of computational constraints. Neither is necessary with our new methods, as we show in Section 4.

2 We have implemented a stopping criterion based on the error estimate from RQMC, like Genz and Bretz (2002); see Appendix A in the supplementary material. This is possible when multiple RQMC sequences are used. However, the error of the estimate decreases faster in the length of each sequence than in the number of sequences (Hickernell et al. 2005).

3 For technical reasons, the reported means for BLUPF90 are the posterior means with the parameterization where the residual variance is one, which are then transformed to the parameterization where the total variance is one. The exception is $h_{\mathrm{Add}}$, which is reported directly as the posterior mean of $h_{\mathrm{Add}}$.

4 The Hessian approximation with the full data (but not the two other datasets) failed to be positive definite in seven cases where there were close-to-boundary solutions. This is due to almost singular Hessians combined with the errors from the RQMC estimator. In these cases, we set the smallest eigenvalue to be the smallest positive eigenvalue divided by 10. This likely caused the low coverage for the Wald-based confidence intervals of the proportions of variance.

References

  • Bai, D., Yip, B. H. K., Windham, G. C., Sourander, A., Francis, R., Yoffe, R., Glasson, E., Mahjani, B., Suominen, A., Leonard, H., Gissler, M., Buxbaum, J. D., Wong, K., Schendel, D., Kodesh, A., Breshnahan, M., Levine, S. Z., Parner, E. T., Hansen, S. N., Hultman, C., Reichenberg, A., and Sandin, S. (2019), “Association of Genetic and Environmental Factors With Autism in a 5-Country Cohort,” JAMA Psychiatry, 76, 1035–1043. DOI: 10.1001/jamapsychiatry.2019.1411.
  • Botev, Z. I. (2017), “The Normal Law under Linear Restrictions: Simulation and Estimation via Minimax Tilting,” Journal of the Royal Statistical Society, Series B, 79, 125–148. DOI: 10.1111/rssb.12162.
  • Breslow, N. E., and Clayton, D. G. (1993), “Approximate Inference in Generalized Linear Mixed Models,” Journal of the American Statistical Association, 88, 9–25.
  • Brown, H., and Prescott, R. (2014), Applied Mixed Models in Medicine, Chichester: Wiley.
  • Buluç, A., Meyerhenke, H., Safro, I., Sanders, P., and Schulz, C. (2016), Recent Advances in Graph Partitioning, pp. 117–158, Cham: Springer.
  • Caflisch, R. E. (1998), “Monte Carlo and quasi-Monte Carlo Methods,” Acta Numerica, 7, 1–49. DOI: 10.1017/S0962492900002804.
  • Chlebíková, J. (1996), “Approximating the Maximally Balanced Connected Partition Problem in Graphs,” Information Processing Letters, 60, 225–230. DOI: 10.1016/S0020-0190(96)00175-5.
  • Christoffersen, B., Clements, M., Kjellström, H., and Humphreys, K. (2021), “Approximation Methods for Mixed Models with Probit Link Functions.” DOI: 10.48550/ARXIV.2110.14412.
  • de Villemereuil, P., Schielzeth, H., Nakagawa, S., and Morrissey, M. (2016), “General Methods for Evolutionary Quantitative Genetic Inference from Generalized Mixed Models,” Genetics, 204, 1281–1294. DOI: 10.1534/genetics.115.186536.
  • Eddelbuettel, D., Emerson, J. W., and Kane, M. J. (2020), BH: Boost C++ Header Files. R package version 1.72.0-3.
  • Ekbom, A. (2010), “The Swedish Multi-Generation Register,” in Methods in Molecular Biology, ed. J. Dillner, pp. 215–220, Totowa, NJ: Humana Press.
  • Falchi, M., Forabosco, P., Mocci, E., Borlino, C. C., Picciau, A., Virdis, E., Persico, I., Parracciani, D., Angius, A., and Pirastu, M. (2004), “A Genomewide Search Using an Original Pairwise Sampling Approach for Large Genealogies Identifies a New Locus for Total and Low-Density Lipoprotein Cholesterol in Two Genetically Differentiated Isolates of Sardinia,” The American Journal of Human Genetics, 75, 1015–1031. DOI: 10.1086/426155.
  • Fiduccia, C., and Mattheyses, R. (1982), “A Linear-Time Heuristic for Improving Network Partitions,” in 19th Design Automation Conference, pp. 175–181.
  • Genz, A. (1992), “Numerical Computation of Multivariate Normal Probabilities,” Journal of Computational and Graphical Statistics, 1, 141–149.
  • Genz, A., and Bretz, F. (2002), “Comparison of Methods for the Computation of Multivariate t Probabilities,” Journal of Computational and Graphical Statistics, 11, 950–971. DOI: 10.1198/106186002394.
  • Genz, A., and Bretz, F. (2009), Computation of Multivariate Normal and t Probabilities, Lecture Notes in Statistics, Heidelberg: Springer-Verlag.
  • Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., and Hothorn, T. (2020), mvtnorm: Multivariate Normal and t Distributions. R package version 1.1-1.
  • Hadfield, J. D. (2010), “MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package,” Journal of Statistical Software, 33, 1–22. DOI: 10.18637/jss.v033.i02.
  • Hajivassiliou, V., McFadden, D., and Ruud, P. (1996), “Simulation of Multivariate Normal Rectangle Probabilities and their Derivatives Theoretical and Computational Results,” Journal of Econometrics, 72, 85–134. DOI: 10.1016/0304-4076(94)01716-6.
  • Hickernell, F. J., Lemieux, C., and Owen, A. B. (2005), “Control Variates for Quasi-Monte Carlo,” Statistical Science, 20, 1–31. DOI: 10.1214/088342304000000468.
  • Holand, A. M., Steinsland, I., Martino, S., and Jensen, H. (2013), “Animal Models and Integrated Nested Laplace Approximations,” G3 Genes—Genomes—Genetics, 3, 1241–1251. DOI: 10.1534/g3.113.006700.
  • Hopcroft, J., and Tarjan, R. (1973), “Algorithm 447: Efficient Algorithms for Graph Manipulation,” Communications of the ACM, 16, 372–378. DOI: 10.1145/362248.362272.
  • Hothorn, T., Möst, L., and Bühlmann, P. (2018), “Most Likely Transformations,” Scandinavian Journal of Statistics, 45, 110–134. DOI: 10.1111/sjos.12291.
  • Joe, H. (2008), “Accuracy of Laplace Approximation for Discrete Response Mixed Models,” Computational Statistics & Data Analysis, 52, 5066–5074.
  • Joe, S., and Kuo, F. Y. (2003), “Remark on Algorithm 659: Implementing Sobol’s Quasirandom Sequence Generator,” ACM Transactions on Mathematical Software, 29, 49–57. DOI: 10.1145/641876.641879.
  • Kernighan, B. W., and Lin, S. (1970), “An Efficient Heuristic Procedure for Partitioning Graphs,” The Bell System Technical Journal, 49, 291–307. DOI: 10.1002/j.1538-7305.1970.tb01770.x.
  • Kirichenko, A. V., Belonogova, N. M., Aulchenko, Y. S., and Axenovich, T. I. (2009), “Pedstr Software for Cutting Large Pedigrees for Haplotyping, IBD Computation and Multipoint Linkage Analysis,” Annals of Human Genetics, 73, 527–531. DOI: 10.1111/j.1469-1809.2009.00531.x.
  • Lichtenstein, P., Yip, B. H., Björk, C., Pawitan, Y., Cannon, T. D., Sullivan, P. F., and Hultman, C. M. (2009), “Common Genetic Determinants of Schizophrenia and Bipolar Disorder in Swedish Families: A Population-based Study,” The Lancet, 373, 234–239. DOI: 10.1016/S0140-6736(09)60072-6.
  • Liu, F., Kirichenko, A., Axenovich, T. I., van Duijn, C. M., and Aulchenko, Y. S. (2008), “An Approach for Cutting Large and Complex Pedigrees for Linkage Analysis,” European Journal of Human Genetics, 16, 854–860. DOI: 10.1038/ejhg.2008.24.
  • Liu, X.-R., Pawitan, Y., and Clements, M. S. (2017), “Generalized Survival Models for Correlated Time-to-Event Data,” Statistics in Medicine, 36, 4743–4762. DOI: 10.1002/sim.7451.
  • Mahjani, B., Klei, L., Hultman, C. M., Larsson, H., Devlin, B., Buxbaum, J. D., Sandin, S., and Grice, D. E. (2020), “Maternal Effects as Causes of Risk for Obsessive-Compulsive Disorder,” Biological Psychiatry, 87, 1045–1051. Obsessive-Compulsive Disorder and Developmental Disorders. DOI: 10.1016/j.biopsych.2020.01.006.
  • Misztal, I., Tsuruta, S., Lourenco, D., Masuda, Y., Aguilar, I., Legarra, A., and Vitezica, Z. (2018), Manual for BLUPF90 Family Programs, Athens, GA: University of Georgia.
  • Owen, A. B. (1998), “Scrambling Sobol’ and Niederreiter–Xing Points,” Journal of Complexity, 14, 466–489. DOI: 10.1006/jcom.1998.0487.
  • Pawitan, Y., Reilly, M., Nilsson, E., Cnattingius, S., and Lichtenstein, P. (2004), “Estimation of Genetic and Environmental Factors for Binary Traits Using Family Data,” Statistics in Medicine, 23, 449–465. DOI: 10.1002/sim.1603.
  • Pinheiro, J. C., and Bates, D. M. (1995), “Approximations to the Log-Likelihood Function in the Nonlinear Mixed-Effects Model,” Journal of Computational and Graphical Statistics, 4, 12–35.
  • Pinheiro, J. C., and Chao, E. C. (2006), “Efficient Laplacian and Adaptive Gaussian Quadrature Algorithms for Multilevel Generalized Linear Mixed Models,” Journal of Computational and Graphical Statistics, 15, 58–81. DOI: 10.1198/106186006X96962.
  • R Core Team (2020), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
  • Rodríguez, G., and Goldman, N. (2001), “Improved Estimation Procedures for Multilevel Models with Binary Response: A Case-Study,” Journal of the Royal Statistical Society, Series A, 164, 339–355. DOI: 10.1111/1467-985X.00206.
  • Sandin, S., Lichtenstein, P., Kuja-Halkola, R., Hultman, C., Larsson, H., and Reichenberg, A. (2017), “The Heritability of Autism Spectrum Disorder,” JAMA, 318, 1182–1184. DOI: 10.1001/jama.2017.12141.
  • Scurrah, K. J., Palmer, L. J., and Burton, P. R. (2000), “Variance Components Analysis for Pedigree-based Censored Survival Data using Generalized Linear Mixed Models (GLMMs) and Gibbs Sampling in BUGS,” Genetic Epidemiology, 19, 127–148. DOI: 10.1002/1098-2272(200009)19:2<127::AID-GEPI2>3.0.CO;2-S.
  • Simon, H. D., and Teng, S.-H. (1997), “How Good is Recursive Bisection?” SIAM Journal on Scientific Computing, 18, 1436–1445. DOI: 10.1137/S1064827593255135.
  • Tenesa, A., and Haley, C. S. (2013), “The Heritability of Human Disease: Estimation, Uses and Abuses,” Nature Reviews Genetics, 14, 139–149. DOI: 10.1038/nrg3377.
  • Thompson, R. (1990), Generalized Linear Models and Applications to Animal Breeding, pp. 312–328, Berlin, Heidelberg: Springer.
  • Varin, C., Reid, N., and Firth, D. (2011), “An Overview of Composite Likelihood Methods,” Statistica Sinica, 21, 5–42.
  • Villemereuil, P., Gimenez, O., and Doligez, B. (2013), “Comparing Parent–Offspring Regression with Frequentist and Bayesian Animal Models to Estimate Heritability in Wild Populations: A Simulation Study for Gaussian and Binary Traits,” Methods in Ecology and Evolution, 4, 260–275. DOI: 10.1111/2041-210X.12011.
  • Wichura, M. J. (1988), “The Percentage Points of the Normal Distribution,” Journal of the Royal Statistical Society, Series C, 37, 477–484.
  • Wilhelm, S., and de Matos, M. G. (2013), “Estimating Spatial Probit Models in R,” The R Journal, 5, 130–143. DOI: 10.32614/RJ-2013-013.