
Maximum Likelihood Estimation of Hierarchical Linear Models from Incomplete Data: Random Coefficients, Statistical Interactions, and Measurement Error

Pages 112-125 | Received 16 Oct 2022, Accepted 24 Jun 2023, Published online: 19 Sep 2023

Abstract

We consider two-level models where a continuous response R and continuous covariates C are assumed missing at random. Inferences based on maximum likelihood or Bayes are routinely made by estimating their joint normal distribution from observed data Robs and Cobs. However, if the model for R given C includes random coefficients, interactions, or polynomial terms, their joint distribution will be nonstandard. We propose a family of unique factorizations involving selected “provisionally known random effects” u such that h(Robs,Cobs|u) is normally distributed and u is a low-dimensional normal random vector; we approximate h(Robs,Cobs) = ∫ h(Robs,Cobs|u)g(u)du via adaptive Gauss-Hermite quadrature. For polynomial models, the approximation is exact but, in any case, can be made as accurate as required given sufficient computation time. The model incorporates random effects as explanatory variables, reducing bias due to measurement error. By construction, our factorizations solve problems of compatibility among fully conditional distributions that have arisen in Bayesian imputation based on the Gibbs Sampler. We spell out general rules for selecting u, and show that our factorizations can support fully compatible Bayesian methods of imputation using the Gibbs Sampler. Supplementary materials for this article are available online.

1 Introduction

Large-scale surveys and experiments within the social and health sciences frequently meet four conditions that supply the focus of this article. First, the data typically have a hierarchical structure, with respondents nested within local organizational units such as schools and hospitals or repeated measures nested within persons. Second, missing data are pervasive. Third, partially observed covariates may be measured with error. Finally, the covariates of interest may have random coefficients, statistical interactions or polynomial terms.

These characteristics have received some attention in recent methodological research. A popular approach conceives the response variables and partially observed covariates as outcomes within a multivariate, hierarchical linear model (HLM) under the assumption that the data are missing at random (MAR; Rubin Citation1976), an assumption often thought reasonable given a presumably rich set of covariates (Schafer and Yucel Citation2002; Goldstein, Carpenter, and Browne Citation2014). Missing values are then imputed from their posterior predictive distribution using what has been termed a “fully conditional specification” (FCS) via the Gibbs sampler. FCS requires the analyst to impute missing values for each variable subject to missingness, conditional on all other unknowns. This approach has been shown to function well when the process generating the joint distribution of unknowns is reasonably assumed multivariate normal. Under the four conditions just described, however, multivariate normality is not possible even if the separate conditional distributions are normal. A concern involves the bias that can arise from the incompatibility between the multiple normal conditional distributions that generate the imputations and the assumed joint distribution of the observed data (Erler et al. Citation2016; Enders, Mistler, and Keller Citation2016; Enders, Du, and Keller Citation2020).

In this article, we propose to address the problem of compatibility (Arnold and Press Citation1989; Liu et al. Citation2014; Bartlett et al. Citation2015) within the framework of maximum likelihood (ML) estimation under the ignorable missing data assumption that data are MAR and that the parameter spaces for the multivariate HLM and the missing data mechanism are distinct (Rubin Citation1976; Little and Rubin Citation2002).

We will first review currently popular methods of inference for normal-theory multilevel models given incomplete data and show how to modify such approaches so that random effects become explanatory variables, allowing us to model fallible measurement as a source of incomplete data. We will then describe our approach when incompletely observed covariates have random coefficients or polynomial terms, including statistical interactions. If a carefully selected subset of these random effects is conditionally assumed known, the model of interest can plausibly follow a normal-theory specification. The “provisionally known random effects” (PKREs) thus selected can be integrated out of the likelihood using numerical methods. Capitalizing on the invariance properties of ML estimates (MLE), we show that our reparameterized model can be translated back to the original model’s parameter space. Auxiliary variables are introduced to increase the robustness of the MAR assumption. We will illustrate application of the model by studying income inequality and mathematics achievement in U.S. elementary schools. Although we base our case study on MLE, the likelihood factorization we propose can readily be implemented within a Bayesian approach with assurance of compatibility between conditional distributions and the joint distribution of observed data elements. We elucidate general rules for selecting low-dimensional “provisionally constant” random effects for a general class of models involving random coefficients or polynomial terms, including interactions.

Section 2 reviews estimation of normal-theory hierarchical models from data MAR. Section 3 explains how to estimate an analytic hierarchical model with random coefficients and cross-level interaction effects given data MAR via a PKRE. Section 4 describes estimation of the analytic model using auxiliary covariates to strengthen the MAR assumption. Section 5 extends the model with polynomial terms including within-level interactions and spells out rules for selecting provisionally constant random effects. Section 6 illustrates analysis of income inequality in math achievement. Section 7 evaluates our estimators by simulation. Finally, Section 8 discusses the limitations and extensions.

2 Inference for Multivariate Normal HLMs from Incomplete Data Using Random Effects as Predictors

We begin with a review of the normal theory case. Following Schafer and Yucel (Citation2002), our scientific interest focuses on the regression of a response variable R* on covariates C*. The elements of R* and C*, which are continuously measured, are partially observed. Ours is a two-level HLM (Lindley and Smith Citation1972; Dempster, Rubin, and Tsutakawa Citation1981) in which the response variable R* is a characteristic of “level-1 units” (e.g., students) who are clustered within level-2 units (e.g., schools). In contrast, the covariates may be characteristics of either level-1 units or level-2 units. In longitudinal studies, the level-1 units might be repeated occasions of measurements clustered within persons at level 2. Our scientist is primarily interested in the conditional distribution f1(R*|C*) assumed normally distributed. However, to account for the missing values in C*, we propose a normal linear model f2(C*). Because of missing data, we cannot separately estimate the parameters of f1 and f2 without discarding the cases with missing values on any element of R* or C*. It is well known that a procedure that analyzes only the cases with complete data is prone to bias and/or loss of efficiency (Little and Rubin Citation2002). The cost can be particularly high when a missing item of C* varies at level 2, in which case all of the level-1 units in the level-2 unit having missing values are discarded along with that level-2 unit itself.

To make efficient inference possible, and following Schafer and Yucel (Citation2002), we compose each outcome vector Y* = [R* C*T]T and write a multivariate HLM h(Y*)
(1) Y* = X*α + Z*b + r* ∼ N(X*α, V* = Z*ΩZ*T + Σ*)
where b ∼ N(0,Ω) and r* ∼ N(0,Σ*). Here, X* and Z* are composed of completely observed covariates; α is a vector of fixed regression coefficients while b and r* are independent random effects that vary at levels 2 and 1, respectively. We partition the complete data (CD) Y* into components Y* = (Yobs, Ymis). In particular, if Y* is N by 1 but we observe only M (≤ N) elements Yobs = Y of Y*, we construct an M-by-N matrix O in which every row contains a single entry equal to unity, indicating which element of Y* is observed. All other entries in the same row are 0. Our model for the observed Y is therefore
(2) Y = Xα + Zb + r ∼ N(Xα, V = ZΩZT + Σ)
where Y = OY*, X = OX*, Z = OZ*, r = Or* and Σ = OΣ*OT. Assuming data MAR, we can make efficient estimates of the parameters θ = (α, Ω, Σ*) using ML or Bayes inference from the observed data according to (2). The most common method in recent literature is a Bayesian approach based on multiple imputation (MI) that we will consider in the final section of this article (see Section 8). However, Schafer and Yucel (Citation2002) showed how one can obtain MLE using the EM algorithm and use the estimates for MI. Shin and Raudenbush (Citation2007) showed how to recover MLE for the analytic model f1(R*|C*) by constructing model (1) carefully and estimating the model without a need for imputations. This approach allows some components of Y* to vary at level 2.
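To fix ideas, the selection matrix O can be built directly from the missingness pattern. The sketch below is a minimal illustration in R under assumptions (the objects ystar and Xstar are hypothetical), not the authors' code:

    # Build the M-by-N selection matrix O from the observed positions of Y*.
    ystar <- c(2.1, NA, 3.5, NA, 1.7)                  # hypothetical complete-data vector, N = 5
    obs   <- which(!is.na(ystar))                      # positions of the M observed elements
    O     <- diag(length(ystar))[obs, , drop = FALSE]  # one unity per row, all other entries 0
    Xstar <- cbind(1, 1:5)                             # hypothetical complete-data design matrix
    Y <- ystar[obs]                                    # observed outcomes, elementwise equal to O Y*
    X <- O %*% Xstar                                   # observed-data design matrix X = O X*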

2.1 Estimation via the EM Algorithm

The EM algorithm (Dempster, Laird, and Rubin Citation1977) requires evaluation at each iteration m + 1 of the conditional expected CD score given the observed data and parameter estimates at iteration m. To find the CD score, our model for the CD Y* is the multivariate HLM (1). Write model (1) at the level of cluster j and let φ be an arbitrary scalar element of (Ω,Σ*). The CD score equations are well known (see Raudenbush and Bryk Citation2002, chap. 14):
(3) Sα,CD = Σj Xj*T Vj*−1 (Yj* − Xj*α),
Sφ,CD = (1/2) Σj (d vec(Vj*)/dφ)T (Vj*−1 ⊗ Vj*−1) vec[(Yj* − Xj*α)(Yj* − Xj*α)T − Vj*].

The conditional expected score equations given the observed Yj thus depend on the conditional mean and variance of Yj*, which we readily derive from the fact that Yj* | Yj, θ̂(m) ∼ N(Ŷj*(m+1), V̂j*(m+1)) given iteration-m estimates θ̂(m) = (α̂(m), Ω̂(m), Σ̂*(m)) (Shin and Raudenbush Citation2007), where
(4) Ŷj*(m+1) = Xj*α̂(m) + V̂j*(m) OjT V̂j(m)−1 (Yj − Xjα̂(m)),
V̂j*(m+1) = V̂j*(m) − V̂j*(m) OjT V̂j(m)−1 Oj V̂j*(m).
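The E-step formulas in (4) are ordinary conditional-normal calculations. A minimal sketch in R (the function and argument names are hypothetical; Vstar denotes Vj* evaluated at the current θ̂(m)):

    # Conditional mean and variance of the complete-data vector given the observed data, as in (4).
    cond_moments <- function(Y, Xstar, alpha, Vstar, O) {
      V    <- O %*% Vstar %*% t(O)                 # observed-data covariance V_j
      Vinv <- solve(V)
      mu   <- Xstar %*% alpha +
        Vstar %*% t(O) %*% Vinv %*% (Y - O %*% Xstar %*% alpha)
      Sig  <- Vstar - Vstar %*% t(O) %*% Vinv %*% O %*% Vstar
      list(mean = mu, var = Sig)                   # E(Y*|Y) and var(Y*|Y)
    }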

2.2 Example 1: Contextual Effects Model

We apply this framework to the “contextual effects model” (Willms Citation1986) for the study of income inequality in the mathematics achievement of U.S. elementary school children. This model decomposes the association between family income and educational achievement into a within-school component and a between-school component. In its simplest form, the model is typically written as
(5) Rij = γ00 + γ10(Cij − C̄j) + γ01(C̄j − C̄) + u0j + eij
where Rij is a measure of math achievement for student i in school j, Cij is a measure of the income of that child’s parents, C̄j is the sample mean of Cij within school j, and C̄ is the overall sample mean income for i = 1,…,nj and j = 1,…,J. Here γ10 is known as the “within-school” coefficient while γ01 is the “between-school” coefficient; and u0j and eij are independent, normally distributed random effects that vary at levels 2 and 1, respectively. Of interest is the “contextual component” γc = γ01 − γ10 which, if positive, suggests that attending a school with high-income peers predicts elevated achievement net of the contribution of a student’s family income.
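As a point of reference, the conventional complete-data analysis of (5) can be carried out with standard mixed-model software. The sketch below is an illustration under assumptions (a hypothetical data frame dat with columns math, income, and school), not the authors' code; it uses lme4 in R:

    # Conventional (complete-data) fit of the contextual effects model (5).
    library(lme4)
    dat$income_sch   <- ave(dat$income, dat$school)        # school sample mean income, C-bar_j
    dat$income_dev   <- dat$income - dat$income_sch        # within-school deviation, C_ij - C-bar_j
    dat$income_sch_c <- dat$income_sch - mean(dat$income)  # centered school mean, C-bar_j - C-bar
    fit <- lmer(math ~ income_dev + income_sch_c + (1 | school), data = dat)
    gamma_c <- fixef(fit)["income_sch_c"] - fixef(fit)["income_dev"]   # contextual effect

As discussed next, this conventional analysis treats income as fully observed and uses the noisy sample mean C̄j in place of the latent school mean income, which is precisely what the proposed approach avoids.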

Two problems arise in the conventional analysis (Shin and Raudenbush Citation2010). First, past analyses have treated parental income as completely observed when, in fact, most surveys report substantial fractions of missing data on income. Second, the sample mean C̄j will be a noisy proxy for the actual mean income of parents in a school if the sample size per school is modest, as is the case in most U.S. national surveys. Using random effects as explanatory variables within the MAR framework addresses both issues. Reflecting that income is partially observed, we propose the CD model
(6) Rij* = γ00 + γ10ϵij* + γ01νj + u0j + eij*, Cij* = δ + νj + ϵij*
where νj and ϵij* are independent, normally distributed random effects at levels 2 and 1, respectively. The joint distribution of Rij* and Cij* may be written
(7) [Rij*, Cij*]T = [1 0; 0 1][γ00, δ]T + [1 0; 0 1][γ01νj + u0j, νj]T + [γ10ϵij* + eij*, ϵij*]T.

Stacking the equations within level-2 unit j, we have the general form of the model
(8) Yj* = Xj*α + Zj*bj + rj* ∼ N(Xj*α, Vj* = Zj*ΩZj*T + Inj ⊗ Σ*)
for Xj* = Zj* = 1nj ⊗ I2, bj ∼ N(0,Ω) and rj* ∼ N(0, Inj ⊗ Σ*), where we denote by 1m a vector of m unities and by Im an m-by-m identity matrix for a positive integer m. The HLM score equations for θ are familiar (Raudenbush and Bryk Citation2002, chap. 14; Shin and Raudenbush Citation2010) and will therefore not be elaborated here.

2.3 Compatibility

A set of conditional distributions f1(R*|C*,θ1) and f2(C*|R*,θ2) is said to be compatible if there exist a joint distribution {h(Y*|θ): θ ∈ Θ} and surjective maps {tj: Θ → Θj, j = 1, 2} such that for each j, θj ∈ Θj and θ ∈ tj−1(θj) = {θ: tj(θ) = θj}, we have f1(R*|C*,θ1) = h(R*|C*,θ) and f2(C*|R*,θ2) = h(C*|R*,θ) (Liu et al. Citation2014). Assuming a prior p(θ), Schafer and Yucel (Citation2002) developed the Gibbs sampler based on a multivariate normal (MN) h(Y*|θ) that is compatible with an analytic HLM f1(R*|C*,θ1) when C* is linearly associated with R*. The MN h(Y*|θ), however, cannot be compatible with f1(R*|C*,θ1) when C* has nonlinear effects (Kim, Sugar, and Belin Citation2015; Enders, Du, and Keller Citation2020). Goldstein, Carpenter, and Browne (Citation2014) and Enders, Du, and Keller (Citation2020) factored h(Y*|θ) = f1(R*|C*,θ1)f2(C*|θ2) and estimated a compatible HLM f1(R*|C*,θ1) with the nonlinearities by the Gibbs sampler via a Metropolis algorithm. We extend the HLM further to nonlinear effects of C* that include random effects as latent covariates. We estimate h(Y*|θ) efficiently by ML via a PKRE and translate the estimates to the ML estimates of the compatible HLM, as explained in the next section.

3 Coping with Random Coefficients and Interactions

Our scientific focus is again on f1(R*|C*,ν) assumed normal in distribution as in (6). However, the model now includes elements of C* that have nonlinearities including random coefficients, polynomial terms, or interactions. In this case, even if we can reasonably assume that f2(C*) is normal, the joint distribution h(Y*) cannot be normal.

Recent research on the problem of nonlinearities has focused primarily on two widely used methods of analysis of multilevel incomplete data. Perhaps the most popular method of imputation for HLMs under the MAR assumption imputes missing values through a series of sequential univariate regression models (Raghunathan et al. Citation2001), also known as MI by fully conditional specification or FCS (van Buuren et al. Citation2006). However, these conditionals will not be compatible with the joint distribution of interest in the presence of the nonlinearities of interest in this article, as shown by Enders, Mistler, and Keller (Citation2016) and Enders, Du, and Keller (Citation2020). The main alternative approach estimates the joint distribution of the outcome and covariates subject to missingness. Missing values may be imputed from their fully conditional distributions (Liu, Taylor, and Belin Citation2000; Schafer and Yucel Citation2002; Goldstein et al. Citation2009), or the joint model may be estimated directly by maximum likelihood (Shin and Raudenbush Citation2007, Citation2010; Ren and Shin Citation2016). These normal-theory models were not designed to handle the nonlinearities of interest here. By means of a Gibbs sampler via the Metropolis-Hastings algorithm, Goldstein, Carpenter, and Browne (Citation2014) imputed missing values of a response and covariates including interaction and polynomial terms having fixed effects in a multilevel model where covariates and response may be continuous or categorical. Similarly, Erler et al. (Citation2016) took the sequential fully Bayesian approach (Ibrahim, Chen, and Lipsitz Citation2002), which factors the joint distribution of variables MAR, including the outcome, into a series of univariate conditional models, to handle missing values of cluster-level continuous and discrete covariates having fixed effects. Erler et al. (Citation2019) extended the approach to imputing missing values of level-1 covariates having fixed effects. Enders, Du, and Keller (Citation2020) showed, however, that these approaches do not guarantee compatibility when the partially observed covariates have random coefficients; lacking a formal model for the joint distribution of interest, they fall short of ensuring compatibility.

3.1 Factorization of the Likelihood Based on Provisionally Known Random Effects

To cope with nonlinearities induced by partially observed covariates having random coefficients, polynomials, or interaction terms, we model the joint distribution h(Y*|ν)p(ν) induced by the scientific model of interest f1(R*|C*,ν) and the model for the covariates, f2(C*|ν)p(ν). The problem is similar to the problem of estimation of generalized linear mixed models (Hedeker and Gibbons Citation1994; Raudenbush, Yang, and Yosef Citation2000). Using the notation of (2), we must evaluate
(9) h(Y) = ∫ h(Y|b) p(b) db.

The integral just defined does not have closed form in the presence of nonlinearities and must be approximated numerically. To facilitate the approximation, we choose a PKRE, call it u, such that
(10) h(Y|u) = ∫ h(Y|b,u) p(b|u) db

is a normal HLM as discussed in Section 2. The problem of approximation is then to evaluate
(11) h(Y) = ∫ h(Y|u) g(u) du.

The computational challenge is to select u such that the dimension of the analytic integral (10) is maximized while the dimension of numerical approximation (11) is minimized.

3.2 Example 2: A Partially Observed Covariate Having a Random Coefficient

We return to our example of income inequality in U.S. elementary schools. A common finding in multilevel studies of educational achievement is that the relationship between student socioeconomic background and achievement varies from school to school (Raudenbush and Bryk Citation1986). This variation could reflect variation in school organization, composition, and resources. To assess the variation in income inequality within schools, we expand the contextual model (6) to allow for a random coefficient
(12) Rij* = (γ00 + u0j*) + (γ10 + u1j*)ϵij* + eij*, Cij* = δ + νj + ϵij*
where u0j* ∼ N(0,τ00), u1j* ∼ N(0,τ11), νj ∼ N(0,τνν), cov(u0j*,u1j*) = τ01, cov(u0j*,νj) = τ0ν and cov(u1j*,νj) = τ1ν. This is model (6) for u0j* = γ01νj + u0j if τ11 = 0. Let var(uj*) = τ for uj* = [u0j*, u1j*, νj]T. Level-1 random effects eij* ∼ N(0,σ2) and ϵij* ∼ N(0,σcc) are assumed independent of each other and of the level-2 random effects. We denote the parameters of the CD model (12) as θ(12)* = (γ00, γ10, τ, σ2, δ, σcc).

Clearly Rij* cannot be marginally normal in distribution because of the multiplication of the two normal random effects u1j* and ϵij*. Our strategy is to select one of these two random effects to be considered “provisionally known”; we choose u1j* for this purpose because it has lower dimension, varying across schools, than does ϵij*, which varies across students within each school. Therefore, we write
(13) [u0j*, νj]T | u1j* ∼ N([α0|1, αν|1]T u1j*, Ω = [τ00|1 τ0ν|1; τν0|1 τνν|1])
where αk|1 = τk1 τ11−1 and τkk′|1 = τkk′ − αk|1 τ11 αk′|1 for k, k′ = 0, ν. We can therefore write the “provisional” joint model h(Yj*|u1j*) for Yj* = [Y1j*T, …, Ynjj*T]T as
(14) Yj* = Xj*α + Zj*bj + rj* ∼ N(Xj*α, Vj* = Zj*ΩZj*T + Ψj*)

where Xj* = 1nj ⊗ [I2, I2u1j*], α = [γ00, δ, α0|1, αν|1]T, Zj* = 1nj ⊗ I2, bj = [u0j* − α0|1u1j*, νj − αν|1u1j*]T ∼ N(0,Ω) and rj* = [r1j*T, …, rnjj*T]T ∼ N(0, Ψj* = Inj ⊗ Σj*) for
(15) rij* = [1 γ10+u1j*; 0 1][eij*, ϵij*]T,
Σj* = [(γ10+u1j*)2σcc + σ2 (γ10+u1j*)σcc; (γ10+u1j*)σcc σcc].

The CD score is therefore familiar; see (3). To complete the EM algorithm to estimate h(Yj*|u1j*)ϕ(u1j*; 0, τ11), we maximize the likelihood L(θ) = ∏j=1,…,J h(Yj) for
(16) h(Yj) = ∫ h(Yj|u1j*) ϕ(u1j*; 0, τ11) du1j*, h(Yj|u1j*) ∼ N(Xjα, Vj = ZjΩZjT + Ψj)
where h(Yj|u1j*) is from (14), Xj = OjXj*, Zj = OjZj*, and Ψj = OjΨj*OjT is block diagonal with blocks Σij = OijΣj*OijT, Oj being block diagonal in the Oij. We use adaptive Gauss-Hermite Quadrature (AGHQ) to numerically approximate integral (16) (Naylor and Smith Citation1982; Pinheiro and Bates Citation1995; Rabe-Hesketh, Skrondal, and Pickles Citation2002).
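To make the quadrature step concrete, the sketch below approximates the one-dimensional integral in (16) in R. It is a simplified illustration under assumptions: it uses ordinary (non-adaptive) Gauss-Hermite quadrature rather than AGHQ, and loglik_given_u is a hypothetical user-supplied function returning the log of the normal-theory density h(Yj|u1j*) at a given value of the PKRE:

    # Approximate h(Y_j) = integral of h(Y_j | u) * phi(u; 0, tau11) du by Gauss-Hermite quadrature.
    library(statmod)
    gh <- gauss.quad(20, kind = "hermite")          # 20 abscissas for weight function exp(-x^2)
    approx_hYj <- function(loglik_given_u, tau11) {
      u <- sqrt(2 * tau11) * gh$nodes               # change of variable u = sqrt(2 * tau11) * x
      sum(gh$weights / sqrt(pi) * exp(sapply(u, loglik_given_u)))
    }

AGHQ additionally centers and rescales the abscissas at the mode and curvature of h(Yj|u1j*)ϕ(u1j*; 0, τ11) for each cluster, which is what allows a small number of quadrature points to be accurate.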

An additional AGHQ step is needed to complete the E step by evaluating the expectation of a CD score component SCDj of cluster j from (3)
(17) E(SCDj|Yj) = ∫∫ SCDj f(Yj*|Yj, u1j*) g(u1j*|Yj) dYmis,j du1j*
where f(Yj*|Yj, u1j*) has the familiar form of the empirical Bayes posterior normal density with the means and covariance matrix in (4). We use AGHQ to approximate the outer integral with respect to the univariate random effect u1j*; see Appendix C for detail. Because g(u1j*|Yj) is nonstandard, we use Bayes' theorem to find
(18) g(u1j*|Yj) = h(Yj|u1j*) ϕ(u1j*; 0, τ11) / h(Yj).

By the invariance property of MLE, we translate the MLE of the “provisional” parameters θ = (α, Ω, γ10, σcc, σ2, τ11) of model (14) into the MLE θ̂(12)* of model (12) via the one-to-one transformation implied by (13).
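To spell out one direction of that transformation (a sketch implied by (13), in the notation above), the level-2 covariance components of model (12) are recovered from the provisional estimates as
τ̂01 = α̂0|1 τ̂11, τ̂1ν = α̂ν|1 τ̂11,
τ̂00 = τ̂00|1 + α̂0|1^2 τ̂11, τ̂νν = τ̂νν|1 + α̂ν|1^2 τ̂11, τ̂0ν = τ̂0ν|1 + α̂0|1 α̂ν|1 τ̂11,
with τ̂11 and the remaining components (γ̂00, γ̂10, δ̂, σ̂2, σ̂cc) carried over directly.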

3.3 Example 3: Cross-Level Interaction Effects Involving Partially Observed Covariates

Following Lee and Bryk (Citation1989), we wish to extend the contextual effects model in two ways to allow the level-1 covariate: (i) to have random coefficients as in the previous section; and (ii) to interact with the level-2 covariate. We therefore write the model
(19) Rij* = (γ00 + γ01νj + u0j) + (γ10 + γ11νj + u1j)ϵij* + eij*, Cij* = δ + νj + ϵij*
where u0j and u1j are, as before, bivariate normal, but conditional on νj with variances τ00|ν and τ11|ν, respectively, and covariance τ01|ν. Other random effects are as defined in model (12). The parameters are θ(19)* = (γ00, γ01, γ10, γ11, τ00|ν, τ01|ν, τ11|ν, σ2, δ, τνν, σcc).

This model involves two products of normal-theory random effects: νjϵij* and u1jϵij*. Using the logic of the last section, we wish to choose a PKRE such that, conditional on that effect, our CD joint model will be a normal-theory HLM. The question is how to choose this random effect. We might provisionally hold both νj and u1j constant, but we would prefer to minimize the dimension of the provisionally constant random effects so that the computational burden of numerical approximation is minimized. Alternatively, we could provisionally hold ϵij* constant. But while ϵij* is a scalar, it varies across all level-1 units, which could be a very large set. Instead, we choose to provisionally hold constant the scalar level-2 random effect
(20) u1j* = γ11νj + u1j.

In addition, we define u0j* = γ01νj + u0j to represent the model parsimoniously as Rij* = (γ00 + u0j*) + (γ10 + u1j*)ϵij* + eij*. This model is therefore equivalent to model (12), implying (13)–(18) and a one-to-one correspondence between θ(12)* and θ(19)*.

Consequently, the CD model (12) is a one-to-one transformation of the provisional model h(Yij*|u1j*)ϕ(u1j*; 0, τ11) in (14) by distribution (13) and, also, of h(Yij*|νj)ϕ(νj; 0, τνν) = f1(Rij*|Cij*,νj)f2(Cij*|νj)ϕ(νj; 0, τνν) in (19) by (20). We choose to estimate joint model (14) via the PKRE for efficient computation; by the one-to-one correspondence, it is guaranteed to be compatible with the scientific model f1(Rij*|Cij*,νj). Whereas scientific interest focuses on θ(19)*, we will be estimating the parameters θ = (α, Ω, γ10, σcc, σ2, τ11) of the provisional joint model (14). We then exploit the invariance property of MLE again, translating the MLE of θ back to those of θ(19)*.

The standard approach of replacing νj and ϵij* with C̄j* − C̄* and Cij* − C̄j*, respectively, in model (19) produces biased estimation of (γ01, γ11, τ00|ν, τ01|ν, τ11|ν) even if Rij* and Cij* were fully observed; see Appendix A. Model (19) can readily incorporate multiple covariates having random effects and also having multiple cross-level interactions. Moreover, it is straightforward to include covariates having fixed coefficients. This is important given the need to add auxiliary information to strengthen the robustness of the MAR assumption.

4 Auxiliary Covariates

Our focus is on estimation of income inequality in achievement via model (19). Because certain variables such as family income have high missing rates, including auxiliary covariates (e.g., parent occupation and pretest score) that are correlated with the missing values or missingness patterns strengthens the MAR assumption (Collins, Schafer, and Kam Citation2003). We consider two approaches to augmenting the CD model with auxiliary covariates that are themselves partially observed or measured with error. One approach is to assume such covariates to be linearly associated with the outcome and income. Violation of this linearity, however, may produce biased estimation. The other approach is to add them as responses alongside Rij*, thereby allowing them to be nonlinearly associated with the outcome and income. We then transform the MLE of the CD model to those of the nested model (19).

4.1 Linearly Associated Auxiliary Covariates

To augment auxiliary covariates that are linearly associated with the outcome and income, we extend the scalar Cij* of model (12) to a vector Cij* = [Cij*, A1ij*T, A2j*T]T consisting of income Cij* and auxiliary covariates A1ij* at level 1 and A2j* at level 2. We then write the CD model
(21) Rij* = (γ00 + u0j*) + (γ10 + u1j*)ϵij* + γ20Tϵ1ij* + eij*, Cij* = δ + νj + εij*
where δ = [δ, δ1T, δ2T]T, νj = [νj, ν1jT, ν2jT]T and εij* = [ϵij*, ϵ1ij*T, 0T]T for the means [δ1T, δ2T]T and school-specific random effects [ν1jT, ν2jT]T of [A1ij*T, A2j*T]T and the child-specific random effects ϵ1ij* of A1ij*. Other components γ00, γ10, u0j* = γ01Tνj + u0j, u1j* = γ11Tνj + u1j and eij* are as defined in model (19); γ20 contains the linear effects of ϵ1ij*. Again eij* and the stacked level-1 residuals [ϵij*, ϵ1ij*T]T ∼ N(0,Σϵ) are independent of each other and uj* ∼ N(0,τ) for Σϵ = [σcc Σc1; Σ1c Σ11] and uj* = [u0j*, u1j*, νjT]T; let cov(u0j*,νj) = τ0ν, cov(u1j*,νj) = τ1ν and νj ∼ N(0,τνν). Auxiliary predictors (ν1j, ν2j) and ϵ1ij* are linearly associated with the outcome and income at levels 2 and 1, respectively. The parameters are θ(21)* = (γ00, γ10, γ20, τ, σ2, δ, Σϵ).

We again select u1j* as provisionally known. Let Yij* = [Rij*, Cij*, A1ij*T]T and A2j* be of respective lengths p1 and p2. Using model (13) now for f(u0j*, νj|u1j*) and αν|1 = [αν|1, αν1|1T, αν2|1T]T, we find the provisional joint model (14), this time for Yj* = [Y1j*T, …, Ynjj*T, A2j*T]T, where Xj* = diag{[X11j*T, …, X1njj*T]T, X2j*}, α = [α1T, α2T]T, Zj* = diag{1nj ⊗ Ip1, Ip2}, bj = [u0j* − α0|1u1j*, νjT − αν|1Tu1j*]T, rj* = [r1j*T, …, rnjj*T, 0T]T, Ψj* = diag{Inj ⊗ Σj*, 0} for X1ij* = [Ip1, Ip1u1j*], X2j* = [Ip2, Ip2u1j*], α1 = [γ00, δ, δ1T, α0|1, αν|1, αν1|1T]T, α2 = [δ2T, αν2|1T]T, and
(22) rij* = [1 B1jT; 0 Ip1−1][eij*, ϵij*, ϵ1ij*T]T,
Σj* = [B1jTΣϵB1j + σ2 B1jTΣϵ; ΣϵB1j Σϵ]
denoting B1jT = [γ10 + u1j*, γ20T].

We estimate θ(21)* using its one-to-one transformation θ=(α,Ω,γ10,γ20,Σϵ,σ2,τ11) of h(Yj*|u1j*)ϕ(u1j*;0,τ11) by the EM algorithm, computing h(Yj) and E(SCDj|Yj) by AGHQ as before. See Appendix B for the E step. The scalar PKRE u1j* yields efficient computation.

To translate θ(21)* to θ(19)*, we use model (13) to let βkj = γk0 + ukj* and find E(βkj|νj) = γk0 + γk1*νj and cov(βkj, βk′j|νj) = τkk′* = τkk′ − γk1*γk′1*τνν for γk1* = τkν/τνν and k, k′ = 0, 1. We then marginalize the auxiliary ϵ1ij* out to obtain
E(Rij*|ϵij*, νj) = γ00 + γ01*νj + (γ10 + γ11*νj)ϵij* + γ20TE(ϵ1ij*|ϵij*),
var(Rij*|ϵij*, νj) = τ00* + 2τ01*ϵij* + τ11*ϵij*2 + γ20Tvar(ϵ1ij*|ϵij*)γ20 + σ2,
which should be of the form (γ00 + γ01νj) + (γ10 + γ11νj)ϵij* and τ00|ν + 2τ01|νϵij* + τ11|νϵij*2 + σ2, respectively. See Appendix D for detail. We illustrate this approach in Sections 6 and 7.

4.2 Nonlinearly Associated Auxiliary Covariates

The linearity assumption between A1ij* and (Rij*, Cij*) in model (21) may be violated, producing biased estimation of θ(19)*. In that case, we augment A1ij* into the multivariate response Rij* = [Rij*, A1ij*T]T of length r, allowing the auxiliary covariates to be nonlinearly associated with the outcome and income in
(23) Rij* = (γ00 + u0j*) + (γ10 + u1j*)ϵij* + eij*, eij* ∼ N(0,Σe)
and Cij* = [Cij*, A2j*T]T in (21) for conformable vectors γ00 and γ10 of fixed effects and random vectors u0j* and u1j* independent of eij* as before. It is straightforward to find the provisional joint model (14) for Yij* = (Rij*, Cij*) and A2j*, selecting the r-by-1 u1j* to be provisionally known. We estimate the joint model efficiently and then translate the bivariate distribution of Rij* and Cij* to θ(19)*; see Appendix D again for the translation. Numerical approximation is now intensive with respect to the vector u1j*. Finally, some of A1ij* may be linearly associated with the outcome and income while others may not. We illustrate this case in Data Analysis.

With additional known auxiliary covariates, we marginalize them out first, given their expectation and covariance matrix estimated from the sample, before the translation above.

5 Within-Level Interactions and Polynomial Terms

We write a CD model including the level-2 interaction effects γ04 of νjν1j,
(24) Rij* = (γ00 + γ01νj + γ02Tν1j + γ03Tν2j + γ04Tνjν1j + u0j) + γ10Tϵij* + eij*,
and Cij* as in model (21), where u0j ∼ N(0,τ00|ν) and ϵij* has fixed effects γ10 for simplicity. We select the interactive term νj, smaller than the other in dimension, to be provisionally known and find
(25) [u0j, ν1j, ν2j] | νj ∼ N([0, αν1|ννj, αν2|ννj], Ω = [τ00|ν 0 0; 0 τν1ν1|c τν1ν2|c; 0 τν2ν1|c τν2ν2|c]).

Let bj = [u0j, ν1jT − αν1|νTνj, ν2jT − αν2|νTνj]T = [u0j, b1jT, b2jT]T to find the model given νj,
(26) Rij* = XRjTαR + ZRjTbj + γ10Tϵij* + eij* ∼ N(XRjTαR, ZRjTΩZRj + γ10TΣϵγ10 + σ2),
Cij* = δ + νj + ϵij*, A1ij* = δ1 + αν1|ννj + b1j + ϵ1ij*, A2j* = δ2 + αν2|ννj + b2j,
for XRjT = [1, νj, νj2], αR = [γ00, γ01 + γ02Tαν1|ν + γ03Tαν2|ν, γ04Tαν1|ν]T and ZRjT = [1, γ02T + γ04Tνj, γ03T], where Cij* varies within but not between clusters. As in Section 4.1, we stack these equations to find the implied provisional model (14) for Yj* = [Y1j*T, …, Ynjj*T, A2j*T]T, and compute h(Yj), setting SCDj = 1 below, and E(SCDj|Yj) by
(27) h(Yj)E(SCDj|Yj) = ∫ E(SCDj|νj, Yj) h(Yj|νj) ϕ(νj; 0, τcc) dνj
numerically for h(Yj|νj) from the provisional model.

We now consider another CD model including level-1 interaction effects,
(28) Rij* = (γ00 + γ01Tνj + u0j) + γ10ϵij* + (γ20T + γ30Tϵij*)ϵ1ij* + eij*,
and Cij* in model (21), where ϵij* and ϵ1ij* have main (γ10 and γ20) and interaction (γ30) effects. We select the interactive term ϵij*, smaller than ϵ1ij* in dimension, to be provisionally known again, and find ϵ1ij* | ϵij* ∼ N(αϵ1|cϵij*, Σ1|c)

for αϵ1|c = Σ1cσcc−1 and Σ1|c = Σ11 − αϵ1|cσccαϵ1|cT. Given ϵij* provisionally constant, let u0j* = γ01Tνj + u0j and a1ij* = ϵ1ij* − αϵ1|cϵij* to express
(29) Rij* = XRijTαR + u0j* + B1ijTa1ij* + eij* ∼ N(XRijTαR, τ00 + B1ijTΣ1|cB1ij + σ2),
Cij* | ϵij* ∼ N(δ + ϵij*, τcc), A1ij* = δ1 + αϵ1|cϵij* + ν1j + a1ij*, A2j* = δ2 + b2j,
where XRijT = [1, ϵij*, ϵij*2], αR = [γ00, γ10 + γ20Tαϵ1|c, γ30Tαϵ1|c], B1ijT = γ20T + γ30Tϵij*, cov(Rij*, A1ij*|bj, ϵij*) = B1ijTΣ1|c and Cij* now varies between but not within clusters, implying the provisional model (14) given ϵj* = (ϵ1j*, …, ϵnjj*) for bj = [u0j*, νjT]T. We compute h(Yj) and E(SCDj|Yj) by
(30) h(Yj)E(SCDj|Yj) = ∫ E(SCDj|ϵj*, Yj) h(Yj|ϵj*) ϕ(ϵj*; 0, Injσcc) dϵj*
numerically for h(Yj|ϵj*) from the provisional model as before. The numerical integral can be computationally intensive, in particular given large cluster sizes; multivariate Laplace approximation (Pinheiro and Bates Citation1995; Raudenbush, Yang, and Yosef Citation2000) and parallel computation of each cluster may result in efficient computation.

5.1 Rules for Choosing Provisionally Known Random Effects

We now provide general rules for selecting PKREs:

  1. For a level-1 interaction ϵij*ϵ1ij* as in (28), we hold ϵij*, the smaller of the two in dimension, constant. The resulting model, quadratic in ϵij*, minimizes the dimension of AGHQ;

  2. For a level-2 interaction νjν1j, we again hold the term with the smaller dimension constant;

  3. For a three-way interaction ϵij*ϵ1ij*ϵ2ij* at level 1, hold constant the two terms that are smaller than the third in dimension; the same applies to a three-way interaction at level 2;

  4. For cross-level interactions νjϵij* as in model (21), hold u1j*=γ11Tνj+u1j constant;

  5. For the cluster-specific effects u1j* of ϵij*, we hold u1j* constant;

  6. Finally, for a model involving several of these effects, hold constant the union of the PKREs chosen for each.

Our CD model in each case includes a scientific model of interest and induces a provisional joint model (14). Because the models are one-to-one transformations of each other, the scientific model is guaranteed to be compatible with the joint model we estimate.

6 Data Analysis

Rising income inequality in the United States and other nations has recently attracted substantial attention (Piketty Citation2014). A key question involves the consequence of such inequality for equality of opportunity among children. Following past research, we decompose the association between family income and educational achievement into a contextual component and a child-specific component (Firebaugh Citation1978; Willms Citation1986; Lee and Bryk Citation1989). The contextual component reflects the fact that elementary schools in the United States are quite segregated based on family income. Such segregation reflects and may reinforce residential segregation as a function of family income. Two children having the same family income might differ in educational achievement as a result of their experience in low-income versus high-income schools. The individual component reflects socioeconomic inequality within schools. Children attending the same school who differ with respect to family income may tend to differ with respect to their achievement. However, the magnitude of this within-school disparity may vary from school to school (Raudenbush and Bryk Citation1986; Lee and Bryk Citation1989). The contextual effects model (Willms Citation1986) supports the composition of inequality in achievement that we seek.

First, we decompose family income for child i in school j into between-school and within-school components as in model (19):
ln(incomeij) = Cij* = δ + νj + ϵij*,
mathSij = Rij* = (γ00 + γ01νj + u0j) + (γ10 + γ11νj + u1j)ϵij* + eij*
for the mean of log-income δ, the school-specific deviation from the mean νj ∼ N(0,τνν), and the child-specific component ϵij* ∼ N(0,σcc). A child’s mathematics achievement in spring 1999 (mathS) depends on these components via the model (19) for mathSij = Rij*. The parameters are θ(19)* = (γ00, γ01, γ10, γ11, τ00|ν, τ01|ν, τ11|ν, σ2, δ, τνν, σcc).

We choose math achievement as our outcome because of its importance in predicting educational attainment and adult earnings (Nomi and Raudenbush Citation2016; Rivera-Batiz Citation1982). In model (19), γ01 is the between-school gradient, reflecting the expected difference in Rij* associated with a unit difference in school mean income; γ10 reflects the average within-school gradient. However, the within-school gradient may depend on school mean income, an interaction effect represented by γ11, and this gradient may also vary randomly over schools as represented by u1j ∼ N(0,τ11|ν). School-mean achievement is γ00 and, conditional on income, varies randomly over schools, u0j ∼ N(0,τ00|ν). If the within-school gradients were constant (γ11 = τ11 = 0), the overall linear coefficient for income would be
E(Rij*|Cij* = c+1) − E(Rij*|Cij* = c) = ργ01 + (1−ρ)γ10,
where ρ = τcc/(τcc + σcc) can be regarded as an index of school segregation as a function of income. Define γc = γ01 − γ10 as the “contextual effect” (Willms Citation1986), the expected difference in math achievement between two students with the same family income who attend two schools that differ by one unit in school mean income. A nation’s income gradient would then be ργc + γ10, which increases with the within-school segregation based on income ρ, the contextual coefficient γc, and the within-school gradient γ10. This simple relationship will not hold if the within-school gradients vary over schools, and one purpose of our analysis is to test that proposition.
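As a toy numerical illustration of this decomposition (hypothetical values, not the ECLS estimates), suppose the between- and within-school gradients and the income variance components were as below; the overall income gradient then combines them through the segregation index ρ:

    # Hypothetical illustration of the overall income gradient rho * gamma_c + gamma_10.
    tau_cc  <- 0.30; sigma_cc <- 0.70        # hypothetical between- and within-school income variances
    gamma01 <- 0.50; gamma10  <- 0.20        # hypothetical between- and within-school gradients
    rho     <- tau_cc / (tau_cc + sigma_cc)  # income segregation index = 0.30
    gamma_c <- gamma01 - gamma10             # contextual effect = 0.30
    rho * gamma_c + gamma10                  # overall gradient = 0.29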

To do so, we use data from 21,211 children attending kindergarten in 1,018 schools as of fall 1998, a nationally representative sample known as the Early Childhood Longitudinal Study of 1998 (“ECLS”) that is publicly available at https://nces.ed.gov/ecls; see Table 1. Only 8% of the math achievement data are missing. However, family income data are missing for 32% of the sample, a finding that is quite typical in surveys of educational achievement. Fortunately, ECLS (Tourangeau et al. Citation2009) provides data on auxiliary variables, including the maximum occupational status score of parents (occupation), missing for only 5% of the cases, as well as math achievement in fall 1998 (mathF), which is strongly predictive of math achievement in spring 1999 (mathS). In all, we have four auxiliary variables, correlated with income, the outcome, or missing patterns.

Table 1 Each variable for analysis with mean (standard deviation (SD), missing %).

In this article, we take convergence to ML to mean that the square root of the summed squared differences between θ̂ at two consecutive iterations is less than 10^−4. We estimate the model for covariates Cij* using all observed values by ML via the EM algorithm (Shin and Raudenbush Citation2007) and a scientific model for Rij* given the covariates and their sample cluster and overall means by complete case analysis, and transform the estimates to the initial values θ̂ of the joint model. We carry out complete case analysis in R (R Core Team 2017), estimate θ on a Dell XPS laptop with the 11th generation Intel(R) Core(TM) i9-11900H processor at 2.50GHz and 64 GB RAM, and test hypotheses at level α = 0.05.
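In code, that stopping rule is a one-liner; the sketch below is an illustration (the function name is hypothetical), not the authors' implementation:

    # Declare convergence when the parameter vector changes by less than 1e-4 in Euclidean norm.
    converged <- function(theta_new, theta_old, tol = 1e-4) {
      sqrt(sum((theta_new - theta_old)^2)) < tol
    }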

6.1 Linearly Associated Auxiliary Covariates

Recall from Section 4 that we have two strategies. Following Section 4.1, we model the auxiliary covariates linearly associated with the outcome and income in model (21) where A1ij* is a vector of mathF, occupation, and age in months at assessment of spring 1999 (age) and A2j* is the square root of kindergarten enrollment (enrollment) by the Box-Cox transformation. Consequently, θ(21)*=(γ00,γ10,γ20,τ,σ2,δ,Σϵ) consists of 10 fixed effects (3-by-1 γ20 and 5-by-1 δ) and 39 variances and covariances (7-by-7 τ and 4-by-4 Σϵ).

We standardized each variable, except income, to have mean 0 and variance 1 for interpretation. Estimation of θ(21)* with a provisionally known u1j* and 20 abscissas converged quickly to ML, in 9 iterations and 13 sec. The transformed estimates θ̂(19)* and standard errors (SEs), multiplied by 100, are listed under “EM-AGHQ I” in Table 2. The school-mean and within-school components of family income are positively associated with math achievement while their interaction effect is insignificant. The within-school income effects appear to vary at most modestly across schools, with the variance estimate τ̂11|ν = 0.19 less than the associated SE 0.25. Based on ln(τ̂11|ν) ∼ N[ln(τ11|ν), var(τ̂11|ν)/τ11|ν2], we find a large-sample 95% confidence interval (CI) for τ11|ν of (0.01, 2.62), which is near zero.
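That interval follows from the delta method on the log scale; a minimal sketch in R using the reported (rounded, x100-scale) estimate and SE:

    # Large-sample 95% CI for a variance component via ln(tau-hat) ~ N(ln(tau), var(tau-hat)/tau^2).
    tau_hat <- 0.19; se_tau <- 0.25
    exp(log(tau_hat) + c(-1, 1) * qnorm(0.975) * se_tau / tau_hat)
    # roughly (0.01, 2.5); differences from the reported (0.01, 2.62) reflect rounding of the inputs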

Table 2 Estimates × 100 (standard errors × 100) of model (19) by EM-AGHQ I and II.

To test this model against the null hypothesis that γ11 = 0 and τ11|ν = 0, we estimated the null model, the multivariate normal distribution of the linearly associated (Rij*, Cij*, A1ij*, A2j*), efficiently by the EM algorithm (Shin and Raudenbush Citation2007, Citation2010). The null model consisted of 42 parameters comprising 6 fixed intercepts and 6-by-6 level-2 and 5-by-5 level-1 variance-covariance matrices, and converged to log ML −107370.60. Compared to the log ML of θ(21)* displayed in the last row of Table 2, the likelihood ratio test statistic comparing the two joint models is 19.20 with 7 degrees of freedom, giving a conservative p-value < 0.01 (Stram and Lee Citation1994). Therefore, we infer that the outcome is nonlinearly associated with income. Lastly, estimation using 10 abscissas produced the same estimates under EM-AGHQ I.
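The nominal chi-square p-value for that statistic can be checked directly (the boundary null makes the test conservative, so the true p-value is smaller still):

    # Naive chi-square reference for the likelihood ratio statistic reported above.
    pchisq(19.20, df = 7, lower.tail = FALSE)   # about 0.008, below 0.01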

6.2 Nonlinearly Associated Auxiliary Covariates

Preliminary analysis indicates that mathF and occupation may be nonlinearly associated with the outcome and income. To test this hypothesis, following Section 4.2, we model the multivariate responses Rij* = (mathS, mathF, occupation) and A1ij* = age in model (21):
(31) Rij* = (γ00 + u0j*) + (γ10 + u1j*)ϵij* + γ20ϵ1ij* + eij*
and Cij* = [Cij*, A1ij*, A2j*]T as before, for 3-by-1 vectors γ00, γ10 and γ20 of fixed effects and random vectors u0j*, u1j* and eij* ∼ N(0,Σe). Therefore, τ is 9-by-9, Σe 3-by-3 and Σϵ 2-by-2; the CD model comprises 12 fixed effects and 54 variances and covariances. With the 3-by-1 u1j* provisionally known, we estimated the joint model using 10 abscissas per dimension, which converged to ML in 734 iterations and 928 min. The log ML of θ(31)* is shown in the bottom row of Table 2 under EM-AGHQ II. The LRT statistic to test H0: model (21) versus H1: model (31) is 367.93. The conservative LRT with 17 degrees of freedom (Stram and Lee Citation1994) produces a p-value near 0, rejecting the null in favor of mathF or occupation being nonlinearly associated with the outcome and income.

The translated estimates θ̂(19)* and SEs are listed under EM-AGHQ II in Table 2. Compared to those under EM-AGHQ I, the main effect of within-school income is larger; furthermore, the interaction effect is significant, and so is the random effect of income by the Wald test, which produces a 95% CI for τ11 now distant from zero. We conclude that the linearity assumption associated with (21) is violated, attenuating the main, interaction, and random effects of within-school income ϵij*, which are confounded with the auxiliary covariates. As a result, EM-AGHQ II produces a smaller σ̂2 = 73.74 and thus explains more outcome variability within schools than does EM-AGHQ I.

6.3 Known Auxiliary Covariates

Either model (21) or (31) may be extended to control for known auxiliary covariates such as race ethnicity and gender at level 1 and school location and sector at level 2. The joint model may be estimated given the provisionally known u1j* again, and translated to θ̂(19)* and SEs. See Appendices B and D for detail.

7 Simulation Study

We focus on ML estimation of the scientific HLM (19) after simulating outcome Rij* and income Cij* from a joint model conditional on auxiliary covariates within which the HLM is nested. The goal is to compare our estimators (EM-AGHQ) with those by four methods: (a) the benchmark method (BM) given νj and ϵij*; (b) complete-case analysis (CC) given C¯j*C¯* and Cij*C¯j* instead; (c) MLE on MI (Shin and Raudenbush Citation2007, Citation2010); and (d) the Gibbs sampler (GS) of Enders, Du, and Keller (Citation2020) implemented in software Blimp (Keller and Enders Citation2021). BM is based on complete data while others are based on data MAR. Therefore, a good method will produce estimates near the BM counterparts. BM and CC estimate the scientific model by the lme4 package (Bates et al. Citation2015) in R. MLE on MI uses C programs to estimate incompatible MHLM (8) by ML and impute missing values including latent school mean incomes 20 times, more than did past multilevel missing data analyses (Schafer and Yucel Citation2002; Shin and Raudenbush Citation2007, Citation2013; Enders, Du, and Keller Citation2020), from their predictive distribution given observed data implied by the MHLM at ML (Shin and Raudenbush Citation2007, Citation2010); and estimates the HLM given the MI by lme4. GS estimates the joint model, simultaneously generating 20 imputations of missing values excluding latent school mean incomes, by Blimp and, then, the HLM (19) given the MI by Blimp again. EM-AGHQ estimates the joint model by our C program and translates the estimates to the desired MLE in R.

We simulate the ECLS Rij* and Cij* closely in terms of sample sizes, correlations and missing rates. Specifically, for n = 20 children in each of J = 1000 schools, we simulate: (i) known auxiliary covariates X21j ∼ Bernoulli(0.3) and X1ij ∼ Bernoulli(0.45) equal to 1 (0) for a private (public) school and a minority (white) child, respectively, where X1ij varies within, but not between, schools for simplicity; (ii) random intercepts and slope β0j = γ00TX2j + u0j*, β1j = γ10TX2j + u1j* and βCj = δ00TX2j + νj from N(1 + X21j, 1) with covariances 0.8 for X2j = [1, X21j]T; (iii) independent eij*, ϵij* ∼ N(0,10) to simulate the CD joint model
(32) Rij* = β0j + β1j(Cij* − βCj) + γ20X1ij + eij*, Cij* = βCj + δ10X1ij + ϵij*.

The simulated parameters in θ consist of γ00T = γ10T = δ00T = [1, 1], γ20 = δ10 = 1, variances τ00 = τ11 = τνν = 1 and σcc = σ2 = 10, and covariances τ01 = τ0ν = τ1ν = 0.8. We marginalize X1ij and X21j out given their simulated expectations and variances, and translate θ to the scientific model in column two of Table 3, as explained in Appendix D.
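For concreteness, one replicate of the complete-data generating process can be sketched in R as follows (an illustration with hypothetical object names, not the authors' C program):

    # Simulate one complete dataset from the CD joint model (32).
    set.seed(1)
    J <- 1000; n <- 20
    school <- rep(1:J, each = n)
    X21 <- rbinom(J, 1, 0.30)                     # school-level covariate: private school
    X1  <- rbinom(J * n, 1, 0.45)                 # child-level covariate: minority child
    Tau <- matrix(0.8, 3, 3); diag(Tau) <- 1      # covariance of (u0*, u1*, nu)
    u   <- MASS::mvrnorm(J, mu = rep(0, 3), Sigma = Tau)
    beta0 <- 1 + X21 + u[, 1]                     # random intercept beta_0j
    beta1 <- 1 + X21 + u[, 2]                     # random income slope beta_1j
    betaC <- 1 + X21 + u[, 3]                     # latent school mean income beta_Cj
    eps <- rnorm(J * n, 0, sqrt(10)); e <- rnorm(J * n, 0, sqrt(10))
    C <- betaC[school] + 1 * X1 + eps                                      # income, delta_10 = 1
    R <- beta0[school] + beta1[school] * (C - betaC[school]) + 1 * X1 + e  # outcome, gamma_20 = 1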

Table 3 Scientific model (19) estimated by BM, CC, MLE on MI, GS and EM-AGHQ.

Next, we simulate the ECLS missing rates closely by
(33) logit(pij) = ϕ1X1ij + ϕ2(1 − X1ij) + zj, zj ∼ N(0,1)
given the known level-1 covariate X1ij: missing values drawn from Bernoulli(pij) are MAR. Because the mechanism does not provide information about the simulated model (32), the parameter spaces of the missing data mechanism and joint model are also distinct. We simulate higher missing rates for minority than white students by ϕ1 > ϕ2: ϕ1 = −0.2 > ϕ2 = −1.2 for income Cij*, with a 35% missing rate (46% for X1ij = 1, 27% for X1ij = 0); and ϕ1 = −2 > ϕ2 = −3 for the response Rij*, with an 11% missing rate (16% for X1ij = 1, 7% for X1ij = 0) on average.
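Continuing the simulation sketch above, the MAR mechanism (33) for income can be generated as follows (again an illustration, not the authors' code):

    # Impose MAR missingness on income via (33), with higher rates for minority children.
    phi1 <- -0.2; phi2 <- -1.2
    z <- rnorm(J)                                              # school-level random effect z_j
    p_mis <- plogis(phi1 * X1 + phi2 * (1 - X1) + z[school])   # P(income missing)
    C_obs <- ifelse(rbinom(J * n, 1, p_mis) == 1, NA, C)       # observed income under MAR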

We repeated simulating data and estimating the scientific model by the approaches 500 times to compute the % bias, average estimated SE (ASE), empirical estimate of the true SE (ESE) over samples and coverage probability (coverage) of each estimator in the next five columns. Each cell or estimate occupies two rows: % bias (ASE) in the first, and ESE and coverage in the next row. The lme4 package is unable to produce ESE and coverage of a variance or covariance estimate. The BM estimates are of course very accurate and precise with 0.13% bias, small ASE close to ESE, and good coverages near the nominal 0.95 in column three.

The CC estimates in column four, however, are biased despite the large sample sizes. The standard deviations (SDs) τ00|ν and τ11|ν are 60% and 39% biased upward while the intercept γ00, interaction effect γ11 and covariance τ01|ν are 8%, 42%, and 29% biased downward, respectively. Only the estimates of γ01, γ10, and σ2 are comparable in accuracy to those by BM as (A1) reveals in Appendix A. The coverages are low with a zero coverage for γ11. Finally, the uncertainty associated with the estimator of a cluster-level effect γ01 seems underestimated by ASE smaller than ESE.

MLE on MI generates incompatible MI based on the MHLM (8), without consideration of the interaction effect and PKRE, thereby producing the biased estimates in column five, which do not seem better than the CC estimates. Fixed effects and the covariance are biased downward, and SDs upward. SEs are close to, but coverages lower than, the CC counterparts.

In contrast, GS, which does not use a PKRE, produces estimates in column six nearly as accurate as the BM counterparts, except the SD of the random slope, which is biased upward by 2.69%, while EM-AGHQ yields all estimates in the last column as accurate or almost as accurate as the BM estimates. Overall, both approaches produce estimates slightly less precise than the BM estimates; ASE and ESE appear slightly larger than those by BM, reflecting extra uncertainty due to latent covariates and missing values.

Computation

We used 20 abscissas to estimate the joint model by EM-AGHQ. Given data MAR, the estimation converged 450 times, taking 101.6 iterations on average and 189 iterations at maximum, but in the remaining 50 replications (10%) had not converged by the 300th iteration, at which point it was stopped and the current estimates were retained. This does not appear to be a weakness of our approach, as the convergence issue also occurred for the BM and CC estimations, each producing 50 or more warnings of a model failing to converge. The convergence issue seems partly due to high missing rates but few covariates to explain missing values and patterns. In our experience thus far, the convergence rates seem positively associated with more covariates or abscissas. For example, this simulation using 10 abscissas resulted in practically identical estimates, but lowered the convergence rate given data MAR.

Blimp estimates 21 models per simulated dataset: joint model (32) and the HLM given each of 20 imputations. We set 20,000 burn-in and 10,000 post burn-in iterations to estimate the joint model and impute MI, and 5000 burn-in and 5000 post burn-in iterations to estimate the HLM given the MI. These settings are based on preliminary analysis of five simulated datasets that produced the potential scale reduction statistics of all estimates of each model lower than or near 1.1 to imply a reasonable convergence to posterior distributions (Gelman and Rubin Citation1992).

8 Discussion

In this article, we have considered how to estimate a two-level hierarchical linear model (HLM) efficiently where a continuous response R* and continuous covariates C* may be MAR and C* may have interactive, polynomial or randomly varying effects. Nonlinearities of C* imply a nonstandard joint model h(R*,C*) = h(Y*) where Y* = (Y, Ymis) for observed Y and missing Ymis. The key idea is to introduce a unique factorization of the joint model involving “provisionally known” random effects (PKREs) u such that the observed joint model h(Y|u) = ∫ h(Y|ν,u) g(ν|u) dν is an analytically tractable multivariate normal (MN) theory HLM with respect to a high-dimensional random vector ν. We computed the likelihood h(Y) = ∫ h(Y|u) g(u) du numerically with respect to a low-dimensional u by means of adaptive Gauss-Hermite quadrature (AGHQ). The HLM involved random effects as predictors, reducing bias due to measurement error. The joint model h(Y*|u)g(u) induced by the HLM is guaranteed to be compatible with the HLM. We suggested general rules for selecting the PKREs in a way that minimizes the dimension of AGHQ. Although useful for the HLMs considered in this article, they are yet to be extended to other models, for example, for discrete outcomes. We hope that our work will spur research on estimation via PKREs.

The nonlinearities of multiple covariates, multiple outcomes and/or the presence of partially observed discrete variables will increase the dimension of PKREs and, thus, the expense of numerical integration by AGHQ. In that case, integration via multivariate Laplace approximation may contribute to efficient computation (Pinheiro and Bates Citation1995; Raudenbush, Yang, and Yosef Citation2000).

Further research may address the problem of highly correlated random effects at the cluster level. One strategy would introduce shared random effects to cope with the “curse of dimensionality” by AGHQ as well as the multicollinearity (Miyazaki and Frank Citation2006; Sun et al. Citation2023). In addition, parallel computation of numerical integrals for groups of or single clusters will reduce per-iteration computation time while application of the parameter-extended EM algorithm (Liu, Rubin, and Wu Citation1998) may reduce the number of iterations to converge.

In related research, Rockwood (Citation2020) estimated a multilevel structural equations model by ML, integrating linear random effects conditional on nonlinear random effects analytically and, then, nonlinear effects numerically by Gaussian quadrature. In our analysis, both the outcome and predictors were quite severely missing. To simulate the analysis closely, we simulated a MAR mechanism due to a known auxiliary predictor. We leave to the near future the important extension to a MAR mechanism that depends on the fully observed values of the outcome (Grund, Lüdtke, and Robitzsch Citation2021), or to other mechanisms that depend on its missing values.

It is possible to extend and automate our program so that a user specifies an analytic HLM and the program determines the PKREs from a set of rules given the HLM. To that end, we need to develop a more general set of rules, for example, involving discrete covariates MAR.

Often, MI of a binary predictor MAR is carried out efficiently under multivariate normality (Schafer Citation1997; Grund, Lüdtke, and Robitzsch Citation2018). Such MI will, however, be incompatible with an HLM having nonlinear effects of the predictor and thus cannot always guarantee unbiased estimation of the HLM. We are currently extending our ML approach via the PKRE idea to ensure compatibility with, and thus unbiased estimation of, an HLM having nonlinear effects of categorical predictors.

Extension of our approach to MI via Bayesian methods may increase the robustness of findings and is straightforward. In particular, our MN joint model h(Y*,ν|u;θ)=h(Ymis,Y,ν|u;θ) given the PKRE u implies estimation of θ by the Gibbs sampler. The sampler will impute (Ymis,ν,u,θ) from their posteriors compatible with the joint model h(Y*,ν|u,θ)g(u|θ)p(θ) for a reasonably assumed prior p(θ) by drawing: (i) Ymis and ν from MN h(Ymis,ν|Y,u,θ); (ii) u from nonstandard g(u|Y*,ν,θ)=h(Y*,ν|u,θ)g(u|θ)/h(Y*,ν|θ), for example, by importance sampling via Markov chain Monte Carlo integration of h(Y*,ν|θ)=E[h(Y*,ν|u,θ)] that samples u from a normal prior g(u|θ); and (iii) θ from a standard posterior p(θ|Y*,u,ν) (Schafer and Yucel Citation2002). A potential virtue of a PKRE is to minimize the dimension of sampling the PKRE from a nonstandard posterior by importance sampling. We find it important to solve the measurement error problem by including level-2 random effects as latent covariates. To that end, we also explain how the Gibbs sampler without consideration of a PKRE (Goldstein et al. Citation2014; Enders, Du, and Keller Citation2020) may be modified to be compatible with our scientific model conditional on a latent covariate ν in Appendix E. A valuable future study is to compare the proposed Gibbs sampler estimators with existing estimators of a more sophisticated HLM, for example, involving multiple nonlinear effects or outcomes.

Supplementary Materials

The supplementary files contain R and C executable code, data sets, and initial values needed to reproduce the results, and may be downloaded. See README_supplementarymaterials.txt for instructions.

Supplemental material

README_supplementarymaterials.txt

Download Text (5.3 KB)

ShinRaudenbushComputation.zip

Download Zip (1.9 MB)

Acknowledgments

We thank two anonymous reviewers and an associate editor for their helpful comments, Craig Enders and Brian Keller for providing Blimp simulation codes, and Dongho Shin for helping Blimp simulation in R environment.

Disclosure Statement

The authors report there are no competing interests to declare.

Additional information

Funding

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D210022. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Department of Education.

References

  • Arnold, B. C., and Press, S. J. (1989), “Compatible Conditional Distributions,” Journal of the American Statistical Association, 84, 152–156. DOI: 10.1080/01621459.1989.10478750.
  • Bartlett, J. W., Seaman, S. R., White, I. R., and Carpenter, J. R. (2015), “Multiple Imputation of Covariates by Fully Conditional Specification: Accommodating the Substantive Model,” Statistical Methods in Medical Research, 24, 462–487. DOI: 10.1177/0962280214521348.
  • Bates, D., Maechler, M., Bolker, B., and Walker, S. (2015), “Fitting Linear Mixed-Effects Models Using lme4,” Journal of Statistical Software, 67, 1–48. DOI: 10.18637/jss.v067.i01.
  • Carlin, B.P. and Louis, T.A. (2009), Bayesian Methods for Data Analysis. 3rd ed. Boca Raton, FL: CRC Press.
  • Collins, L. M., Schafer, J. L., and Kam, C. (2003), “A Comparison of Inclusive and Restrictive Strategies in Modern Missing Data Procedures,” Psychological Methods, 6, 330–351. DOI: 10.1037/1082-989X.6.4.330.
  • Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood From Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, 39, 1–38. DOI: 10.1111/j.2517-6161.1977.tb01600.x.
  • Dempster, A. P., Rubin, D. B., and Tsutakawa, R. K. (1981), “Estimation in Covariance Components Models,” Journal of the American Statistical Association, 76, 341–353. DOI: 10.1080/01621459.1981.10477653.
  • Enders, C. K., Mistler, S. A., and Keller, B. T. (2016), “Multilevel Multiple Imputation: A Review and Evaluation of Joint Modeling and Chained Equations Imputation,” Psychological Methods, 21, 222–240. DOI: 10.1037/met0000063.
  • Enders, C. K., Du, H., and Keller, B. T. (2020), “A Model-based Imputation Procedure for Multilevel Regression Models with Random Coefficients, Interaction Effects, and Nonlinear Terms,” Psychological Methods, 25, 88–112. DOI: 10.1037/met0000228.
  • Erler, N. S., Rizopoulos, D., Rosmalen, J., Jaddoe, V. W., Franco, O. H., and Lesaffre, E. M. (2016), “Dealing with Missing Covariates in Epidemiologic Studies: A Comparison between Multiple Imputation and a Full Bayesian Approach,” Statistics in Medicine, 35, 2955–2974. DOI: 10.1002/sim.6944.
  • Erler, N. S., Rizopoulos, D., Jaddoe, V. W., Franco, O. H., and Lesaffre, E. M. (2019), “Bayesian Imputation of Time-Varying Covariates in Linear Mixed Models,” Statistical Methods in Medical Research, 28, 555–568. DOI: 10.1177/0962280217730851.
  • Firebaugh, G. (1978), “A Rule for Inferring Individual-Level Relationships from Aggregate Data,” American Sociological Review, 43, 557–572. DOI: 10.2307/2094779.
  • Gelman, A., and Rubin, D. B. (1992), “Inference from Iterative Simulation Using Multiple Sequences,” Statistical Science, 7, 457–472. DOI: 10.1214/ss/1177011136.
  • Goldstein, H., Carpenter, J., Kenward, M., and Levin, K. (2009), “Multilevel Models with Multivariate Mixed Response Types,” Statistical Modelling, 9, 173–197. DOI: 10.1177/1471082X0800900301.
  • Goldstein, H., Carpenter, J. R., and Browne, W. J. (2014), “Fitting Multilevel Multivariate Models with Missing Data in Responses and Covariates that may Include Interactions and Non-linear Terms,” Journal of the Royal Statistical Society, Series A, 177, 553–564. DOI: 10.1111/rssa.12022.
  • Grund, S., Lüdtke, O., and Robitzsch, A. (2018), “Multiple Imputation of Missing Data for Multilevel Models: Simulations and Recommendations,” Organizational Research Methods, 21, 111–149. DOI: 10.1177/1094428117703686.
  • Grund, S., Lüdtke, O., and Robitzsch, A. (2021), “Multiple Imputation of Missing Data in Multilevel Models with the R Package mdmb: A Flexible Sequential Modeling Approach,” Behavior Research Methods, 53, 2631–2649.
  • Hedeker, D., and Gibbons, R. D. (1994), “A Random-Effects Ordinal Regression Model for Multilevel Analysis,” Biometrics, 50, 933–944.
  • Ibrahim, J. G., Chen, M. H., and Lipsitz, S. R. (2002), “Bayesian Methods for Generalized Linear Models with Covariates Missing at Random,” Canadian Journal of Statistics/Revue Canadienne De Statistique, 30, 55–78. DOI: 10.2307/3315865.
  • Keller, B. T., and Enders, C. K. (2021), Blimp user’s guide (Version 3). Available at www.appliedmissingdata.com/multilevel-imputation.html
  • Kim, S., Sugar, C. A., and Belin, T. R. (2015), “Evaluating Model-based Imputation Methods for Missing Covariates in Regression Models with Interactions,” Statistics in Medicine, 34, 1876–1888. DOI: 10.1002/sim.6435.
  • Lee, V. E., and Bryk, A. S. (1989), “A Multilevel Model of the Social Distribution of High School Achievement,” Sociology of Education, 62, 172–192. DOI: 10.2307/2112866.
  • Lindley, D. V., and Smith, A. F. M. (1972), “Bayes Estimates for the Linear Model,” Journal of the Royal Statistical Society, Series B, 34, 1–41. DOI: 10.1111/j.2517-6161.1972.tb00885.x.
  • Little, R. J. A., and Rubin, D. B. (2002), Statistical Analysis with Missing Data, New York: Wiley.
  • Liu, C., Rubin, D. B., and Wu, Y. (1998), “Parameter Expansion to Accelerate EM: The PX-EM Algorithm,” Biometrika, 85, 755–770. DOI: 10.1093/biomet/85.4.755.
  • Liu, M., Taylor, J. M. G., and Belin, T. R. (2000), “Multiple Imputation and Posterior Simulation for Multivariate Missing Data in Longitudinal Studies,” Biometrics, 56, 1157–1163. DOI: 10.1111/j.0006-341x.2000.01157.x.
  • Liu, J., Gelman, A., Hill, J., Su, Y., and Kropko, J. (2014), “On the Stationary Distribution of Iterative Imputations,” Biometrika, 101, 155–173. DOI: 10.1093/biomet/ast044.
  • Miyazaki, Y., and Frank, K. A. (2006), “A Hierarchical Linear Model with Factor Analysis Structure at Level 2,” Journal of Educational and Behavioral Statistics, 31, 125–156. DOI: 10.3102/10769986031002125.
  • Naylor, J. C., and Smith, A. F. M. (1982), “Applications of a Method for the Efficient Computation of Posterior Distributions,” Applied Statistics, 31, 214–225. DOI: 10.2307/2347995.
  • Nomi, T., and Raudenbush, S. W. (2016), “Making a Success of ‘Algebra for All’: The Impact of Extended Instructional Time and Classroom Peer Skill in Chicago,” Educational Evaluation and Policy Analysis, 38, 431–451. DOI: 10.3102/0162373716643756.
  • Olsen, M. K., and Schafer, J. L. (2001), “A Two-Part Random-Effects Model for Semicontinuous Longitudinal Data,” Journal of the American Statistical Association, 96, 730–745. DOI: 10.1198/016214501753168389.
  • Piketty, T. (2014), Capital in the 21st Century, Cambridge, MA: Harvard University Press.
  • Pinheiro, J. C., and Bates, D. M. (1995), “Approximations to the Log-Likelihood Function in the Nonlinear Mixed-Effects Model,” Journal of Computational and Graphical Statistics, 4, 12–35. DOI: 10.2307/1390625.
  • R Core Team (2017), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
  • Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2002), “Reliable Estimation of Generalized Linear Mixed Models using Adaptive Quadrature,” The Stata Journal, 2, 1–21. DOI: 10.1177/1536867X0200200101.
  • Raghunathan, T., Lepkowski, J., Van Hoewyk, J., and Solenberger, P. (2001), “A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models,” Survey Methodology, 27, 85–95.
  • Raudenbush, S. W., and Bryk A. S. (1986), “Hierarchical Model for Studying School Effects,” Sociology of Education, 59, 1–17. DOI: 10.2307/2112482.
  • Raudenbush, S. W., Yang, M., and Yosef, M. (2000), “Maximum Likelihood for Generalized Linear Models with Nested Random Effects via High-Order, Multivariate Laplace Approximation,” Journal of Computational and Graphical Statistics, 9, 141–157. DOI: 10.2307/1390617.
  • Raudenbush, S. W., and Bryk, A. S. (2002), Hierarchical Linear Models, Newbury Park, CA: Sage.
  • Ren, C., and Shin, Y. (2016), “Longitudinal Latent Variable Models Given Incompletely Observed Biomarkers and Covariates,” Statistics in Medicine, 35, 4729–4745. DOI: 10.1002/sim.7022.
  • Rivera-Batiz, F. L. (1982), “International Migration, Non-traded Goods and Economic Welfare in the Source Country,” Journal of Development Economics, 11, 81–90. DOI: 10.1016/0304-3878(82)90043-8.
  • Rockwood, N. J. (2020), “Maximum Likelihood Estimation of Multilevel Structural Equation Models with Random Slopes for Latent Covariates,” Psychometrika, 85, 275–300. DOI: 10.1007/s11336-020-09702-9.
  • Rubin, D. B. (1976), “Inference and Missing Data,” Biometrika, 63, 581–592. DOI: 10.1093/biomet/63.3.581.
  • Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, New York: Wiley.
  • Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman & Hall.
  • Schafer, J. L., and Yucel, R. M. (2002), “Computational Strategies for Multivariate Linear Mixed-Effects Models with Missing Values,” Journal of Computational and Graphical Statistics, 11, 437–457. DOI: 10.1198/106186002760180608.
  • Shin, Y., and Raudenbush, S. W. (2007), “Just-Identified Versus Over-Identified Two-Level Hierarchical Linear Models with Missing Data,” Biometrics, 63, 1262–1268. DOI: 10.1111/j.1541-0420.2007.00818.x.
  • Shin, Y., and Raudenbush, S. W. (2010), “A Latent Cluster Mean Approach to The Contextual Effects Model with Missing Data,” Journal of Educational and Behavioral Statistics, 35, 26–53.
  • Shin, Y., and Raudenbush, S. W. (2013), “Efficient Analysis of Q-Level Nested Hierarchical General Linear Models Given Ignorable Missing Data,” The International Journal of Biostatistics, 9(1), 109–133. DOI: 10.1515/ijb-2012-0048.
  • Stram, D. O., and Lee, J. (1994), “Variance Components Testing in the Longitudinal Mixed Effects Model,” Biometrics, 50, 1171–1177. DOI: 10.2307/2533455.
  • Sun, X., Shin, Y., Lafata, J. E., and Raudenbush, S. W. (2023), “Variability in Causal Effects and Noncompliance in a Multisite Trial: Estimation of a Bivariate Hierarchical Generalized Linear Model,” submitted.
  • Tourangeau, K., Nord, C., Lê, T., Sorongon, A. G., and Najarian, M. (2009), Early Childhood Longitudinal Study, Kindergarten Class of 1998–99 (ECLS-K), Combined User’s Manual for the ECLS-K Eighth-Grade and K-8 Full Sample Data Files and Electronic Codebooks (NCES 2009-004), Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.
  • van Buuren, S., Brand, J., Groothuis-Oudshoorn, C., and Rubin, D. (2006), “Fully Conditional Specification in Multivariate Imputation,” Journal of Statistical Computation and Simulation, 76, 1049–1064. DOI: 10.1080/10629360600810434.
  • Willms, J. D. (1986), “Social Class Segregation and its Relationship to Pupils’ Examination Results in Scotland,” American Sociological Review, 51, 224–241. DOI: 10.2307/2095518.

Appendix A

Problem of Bias in Estimating Model (19)

Suppose $R_{ij}^*=R_{ij}$ and $C_{ij}^*=C_{ij}$ are fully observed and set $\delta=0$ to simplify notation; let $\beta_{kj}=\gamma_{k0}+u_{kj}^*\sim N(\gamma_{k0},\tau_{kk})$ with $\mathrm{cov}(u_{0j}^*,u_{1j}^*)=\tau_{01}$ for $k=0,1$ in model (19). Because $\bar{C}_{\cdot j}$ and $\beta_j=(\beta_{0j},\beta_{1j},\nu_j)$ are independent of $C_{ij}-\bar{C}_{\cdot j}$, and $\epsilon_{ij}^*\mid C_{ij}-\bar{C}_{\cdot j}\sim N(C_{ij}-\bar{C}_{\cdot j},\sigma_{cc}/n_j)$,
\[
\begin{bmatrix} R_{ij}\\ \bar{C}_{\cdot j}\end{bmatrix}\Big|\,\beta_j,\,C_{ij}-\bar{C}_{\cdot j}\;\sim\; N\!\left(\begin{bmatrix}\beta_{0j}+\beta_{1j}(C_{ij}-\bar{C}_{\cdot j})\\ \nu_j\end{bmatrix},\;\begin{bmatrix}\beta_{1j}^2\sigma_{cc}/n_j+\sigma^2 & \beta_{1j}\sigma_{cc}/n_j\\ \beta_{1j}\sigma_{cc}/n_j & \sigma_{cc}/n_j\end{bmatrix}\right),
\]
implying a mixed model $R_{ij}\mid C_{ij}-\bar{C}_{\cdot j}\sim N[\gamma_{00}+\gamma_{10}(C_{ij}-\bar{C}_{\cdot j}),\,\mathrm{var}(R_{ij}\mid C_{ij}-\bar{C}_{\cdot j})]$ with
\[
\begin{aligned}
\mathrm{var}(R_{ij}\mid C_{ij}-\bar{C}_{\cdot j})&=\tau_{00}+2\tau_{01}(C_{ij}-\bar{C}_{\cdot j})+\tau_{11}(C_{ij}-\bar{C}_{\cdot j})^2+(\tau_{11}+\gamma_{10}^2)\sigma_{cc}/n_j+\sigma^2,\\
\mathrm{cov}(R_{ij},\bar{C}_{\cdot j}\mid C_{ij}-\bar{C}_{\cdot j})&=\gamma_{01}\tau_{\nu\nu}+\gamma_{11}\tau_{\nu\nu}(C_{ij}-\bar{C}_{\cdot j})+\gamma_{10}\sigma_{cc}/n_j.
\end{aligned}
\]
Let $\lambda_j=\tau_{\nu\nu}/(\tau_{\nu\nu}+\sigma_{cc}/n_j)$ be the reliability of $\bar{C}_{\cdot j}$ as an error-prone measure of $\nu_j$ (Raudenbush and Bryk 2002). The implied $R_{ij}\mid C_{ij}-\bar{C}_{\cdot j},\bar{C}_{\cdot j}\sim N(\mu_{ij},V_{ij})$ has
\[
\begin{aligned}
\mu_{ij}&=\gamma_{00}+\bigl[\gamma_{01}-(1-\lambda_j)(\gamma_{01}-\gamma_{10})\bigr]\bar{C}_{\cdot j}+\gamma_{10}(C_{ij}-\bar{C}_{\cdot j})+\lambda_j\gamma_{11}\bar{C}_{\cdot j}(C_{ij}-\bar{C}_{\cdot j}),\\
V_{ij}&=\sigma^2+\bigl[\tau_{00|\nu}+(1-\lambda_j)(\gamma_{01}-\gamma_{10})^2\tau_{\nu\nu}+(\tau_{11|\nu}+\gamma_{11}^2\tau_{\nu\nu})\sigma_{cc}/n_j\bigr]\\
&\quad+2\bigl[\tau_{01|\nu}+(1-\lambda_j)(\gamma_{01}-\gamma_{10})\gamma_{11}\tau_{\nu\nu}\bigr](C_{ij}-\bar{C}_{\cdot j})+\bigl[\tau_{11|\nu}+(1-\lambda_j)\gamma_{11}^2\tau_{\nu\nu}\bigr](C_{ij}-\bar{C}_{\cdot j})^2.
\end{aligned}\tag{A1}
\]

The bias terms are complicated functions of the cluster sizes $n_j$ and the parameters, but they are revealing in the balanced case $n_j=n$, where $\lambda_j=\lambda$. The coefficient $\lambda\gamma_{11}$ of the interaction $\bar{C}_{\cdot j}(C_{ij}-\bar{C}_{\cdot j})$ is biased downward by $(1-\lambda)\gamma_{11}$, which in turn introduces biases $(1-\lambda)(\gamma_{01}-\gamma_{10})\gamma_{11}\tau_{\nu\nu}$ and $(1-\lambda)\gamma_{11}^2\tau_{\nu\nu}$ in the estimation of $\tau_{01|\nu}$ and $\tau_{11|\nu}$, respectively. Likewise, the main effect of $\bar{C}_{\cdot j}$ has bias term $(1-\lambda)(\gamma_{01}-\gamma_{10})$, which propagates biases $(1-\lambda)(\gamma_{01}-\gamma_{10})^2\tau_{\nu\nu}$ and $(1-\lambda)(\gamma_{01}-\gamma_{10})\gamma_{11}\tau_{\nu\nu}$ into the estimation of $\tau_{00|\nu}$ and $\tau_{01|\nu}$, respectively. Estimation of $\tau_{00|\nu}$ incurs an additional upward bias $(\tau_{11|\nu}+\gamma_{11}^2\tau_{\nu\nu})\sigma_{cc}/n$ from the error-prone measure $\bar{C}_{\cdot j}$ of $\nu_j$. Consequently, this approach yields biased estimates of $(\gamma_{01},\gamma_{11},\tau_{00|\nu},\tau_{01|\nu},\tau_{11|\nu})$; in particular, the estimate of $\gamma_{11}$ is biased downward, whereas those of $\tau_{00|\nu}$ and $\tau_{11|\nu}$ are biased upward.

Two special cases are of interest. When $\gamma_{11}=0$, the estimators of $\mathrm{cov}(\beta_{0j},\beta_{1j}\mid\bar{C}_{\cdot j})=\tau_{01|\nu}$ and $\mathrm{var}(\beta_{1j}\mid\bar{C}_{\cdot j})=\tau_{11|\nu}$ become unbiased, and the estimator of $\tau_{00|\nu}$ becomes less biased. As the cluster sizes $n_j\to\infty$, $\lambda_j\to 1$ and $\bar{C}_{\cdot j}\to\nu_j$ by the law of large numbers, and all bias terms tend to zero.
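For intuition, the following small numerical illustration (with arbitrary parameter values) computes the reliability λ and the resulting attenuation of the interaction coefficient in the balanced case n_j = n.

# Attenuation of the cross-level interaction in the balanced case (arbitrary values).
tau_nu  <- 0.50                                  # tau_nu_nu = var(nu_j)
sig_cc  <- 1.00                                  # sigma_cc, level-1 variance of C
n       <- 10                                    # common cluster size
gamma11 <- 0.30                                  # true interaction coefficient
lambda  <- tau_nu / (tau_nu + sig_cc / n)        # reliability of the cluster mean
c(lambda            = lambda,
  true_gamma11      = gamma11,
  expected_estimate = lambda * gamma11,          # interaction coefficient implied by (A1)
  downward_bias     = (1 - lambda) * gamma11)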

Appendix B

The E Step for Estimation of Model (14)

Let $A_{1ij}^*=[C_{ij}^*\ A_{1ij}^{*T}]^T$, $Y_{ij}^*=[R_{ij}^*\ A_{1ij}^{*T}]^T$ be $p_1\times 1$, $A_{2j}^*$ be $p_2\times 1$, and $\nu_{1j}=[\nu_j\ \nu_{1j}^T]^T$. A reasonably general CD model given known covariates $X_{1ij}$ at level 1 and $X_{2j}$ at level 2 is
\[
\begin{aligned}
R_{ij}^*&=\gamma_{00}^TX_{2j}+u_{0j}^*+B_{1j}^T(A_{1ij}^*-\Delta_{00}X_{2j}-\nu_{1j})+\gamma_{30}^TX_{1ij}+e_{ij}^*,\\
A_{1ij}^*&=\Delta_{00}X_{2j}+\Delta_{10}X_{1ij}+\nu_{1j}+\epsilon_{ij}^*,\qquad A_{2j}^*=\Delta_{2}X_{2j}+\nu_{2j},
\end{aligned}\tag{B1}
\]
for $B_{1j}^T=[\gamma_{10}^TX_{2j}+u_{1j}^*\ \ \gamma_{20}^T]$, $\Delta_{00}=[\delta_{001}\ \Delta_{002}^T]^T$, and $\Delta_{10}=[\delta_{101}\ \Delta_{102}^T]^T$. Denote $\nu_{0j}=[u_{0j}^*\ \nu_{1j}^T]^T$ to separate all other level-2 random effects $\nu_j^*=[\nu_{0j}^T\ \nu_{2j}^T]^T$ from $u_{1j}^*$, and let
\[
\mathrm{var}(\nu_j^*)=\begin{bmatrix}T_{00}&T_{02}\\T_{20}&T_{22}\end{bmatrix},\qquad \mathrm{cov}(\nu_j^*,u_{1j}^*)=\begin{bmatrix}T_{01}\\T_{21}\end{bmatrix},\qquad \mathrm{var}(\nu_j^*\mid u_{1j}^*)=\begin{bmatrix}T_{00|1}&T_{02|1}\\T_{20|1}&T_{22|1}\end{bmatrix}.
\]
Define the matrix $O_{2j}$ that selects the observed values $A_{2j}=O_{2j}A_{2j}^*$ from $A_{2j}^*$, so that $\mathrm{var}(A_{2j}\mid u_{1j}^*)=O_{2j}T_{22|1}O_{2j}^T=T_{22|1j}$ and $\mathrm{cov}(\nu_{aj},\nu_{bj}\mid u_{1j}^*,A_{2j})=\Omega_{abj}=T_{ab|1}-T_{a2|1}O_{2j}^TT_{22|1j}^{-1}O_{2j}T_{2b|1}$ for $a,b=0,2$. The likelihood $L(\theta)=\prod_j\int h(Y_j\mid u_{1j}^*)\,\phi(u_{1j}^*\mid 0,\tau_{11})\,du_{1j}^*$ has the key component
\[
\begin{aligned}
h(Y_j\mid u_{1j}^*)\;\propto\;&\Bigl(|\Omega_{00j}|^{-1}|\Delta_j|^{-1}|T_{22|1j}|^{-1}\textstyle\prod_i|\Sigma_{ij}|^{-1}\Bigr)^{1/2}\\
&\times\exp\Bigl\{-\tfrac12\Bigl[\textstyle\sum_i e_{o1ij}^T\Sigma_{ij}^{-1}e_{o1ij}
-\sum_i e_{o1ij}^T\Sigma_{ij}^{-1}O_{ij}\Delta_j^{-1}\bigl(\sum_i O_{ij}^T\Sigma_{ij}^{-1}e_{o1ij}+2\Omega_{00j}^{-1}T_{02|1}O_{2j}^TT_{22|1j}^{-1}e_{o2j}\bigr)\\
&\qquad\qquad+e_{o2j}^T\bigl(T_{22|1j}^{-1}O_{2j}T_{20|1}\bigl(\Omega_{00j}^{-1}-\Omega_{00j}^{-1}\Delta_j^{-1}\Omega_{00j}^{-1}\bigr)T_{02|1}O_{2j}^TT_{22|1j}^{-1}+T_{22|1j}^{-1}\bigr)e_{o2j}\Bigr]\Bigr\}
\end{aligned}\tag{B2}
\]
for $e_{o1ij}=O_{ij}(d_{ij}^*-T_{01}\tau_{11}^{-1}u_{1j}^*)$, $e_{o2j}=O_{2j}(A_{2j}^*-\Delta_2X_{2j}-T_{21}\tau_{11}^{-1}u_{1j}^*)$, and $\Delta_j=\sum_{i=1}^{n_j}A_{OOij}+\Omega_{00j}^{-1}$, where $A_{OOij}=O_{ij}^T\Sigma_{ij}^{-1}O_{ij}$ and
\[
d_{ij}^*=\begin{bmatrix}R_{ij}^*\\A_{1ij}^*\end{bmatrix}-\begin{bmatrix}\gamma_{00}^T\\ \Delta_{00}\end{bmatrix}X_{2j}-\begin{bmatrix}1&B_{1j}^T\\0&I_{p_1-1}\end{bmatrix}\begin{bmatrix}\gamma_{30}^T\\ \Delta_{10}\end{bmatrix}X_{1ij}.
\]

Define $E(A)=E(A\mid u_{1j}^*,Y_j)$, $V(A)=\mathrm{var}(A\mid u_{1j}^*,Y_j)$, and $C(A,B)=\mathrm{cov}(A,B\mid u_{1j}^*,Y_j)$. The density $f(\nu_{0j},\nu_{2j},r_{ij}^*\mid u_{1j}^*,Y_j)$ is multivariate normal with
\[
\begin{aligned}
E(\nu_{0j})&=T_{01}\tau_{11}^{-1}u_{1j}^*+\Delta_j^{-1}\Bigl(\textstyle\sum_i O_{ij}^T\Sigma_{ij}^{-1}e_{o1ij}+\Omega_{00j}^{-1}T_{02|1}O_{2j}^TT_{22|1j}^{-1}e_{o2j}\Bigr),\\
E(\nu_{2j})&=T_{21}\tau_{11}^{-1}u_{1j}^*+T_{22|1}O_{2j}^TT_{22|1j}^{-1}e_{o2j}+\Omega_{20j}\Omega_{00j}^{-1}\bigl[E(\nu_{0j})-T_{01}\tau_{11}^{-1}u_{1j}^*-T_{02|1}O_{2j}^TT_{22|1j}^{-1}e_{o2j}\bigr],\\
E(r_{ij}^*)&=\Sigma_j^*A_{OOij}\bigl[d_{ij}^*-E(\nu_{0j})\bigr],\qquad V(r_{ij}^*)=\Sigma_j^*-\Sigma_j^*\bigl(A_{OOij}-A_{OOij}\Delta_j^{-1}A_{OOij}\bigr)\Sigma_j^*,\\
V(\nu_{0j})&=\Delta_j^{-1},\qquad C(\nu_{0j},\nu_{2j})=\Delta_j^{-1}\Omega_{00j}^{-1}\Omega_{02j},\qquad C(\nu_{0j},r_{ij}^*)=-\Delta_j^{-1}A_{OOij}\Sigma_j^*,\\
V(\nu_{2j})&=\Omega_{22j}-\Omega_{20j}\bigl(\Omega_{00j}^{-1}-\Omega_{00j}^{-1}\Delta_j^{-1}\Omega_{00j}^{-1}\bigr)\Omega_{02j},\qquad C(\nu_{2j},r_{ij}^*)=\Omega_{20j}\Omega_{00j}^{-1}C(\nu_{0j},r_{ij}^*).
\end{aligned}
\]

Let $\tilde{A}_{1ij}^*=A_{1ij}^*-\Delta_{002}X_{2j}-\nu_{1j}$, $\delta_{10}=\mathrm{vec}(\Delta_{10}^T)$, and $\beta_j=(I_{p_1+1+p_2}\otimes X_{2j}^T)\gamma_\beta+u_j^*$ for $\gamma_\beta=\mathrm{vec}[\gamma_{00}\ \gamma_{10}\ \Delta_{00}^T\ \Delta_2^T]$ and $u_j^*=[u_{0j}^*\ u_{1j}^*\ \nu_j^{T}]^T$. The expected CD MLEs are
\[
\begin{aligned}
\hat\gamma_{20}&=\gamma_{20}+\Bigl(\textstyle\sum_j E\bigl[\sum_i E(\tilde{A}_{1ij}^*\tilde{A}_{1ij}^{*T})\mid Y_j\bigr]\Bigr)^{-1}\sum_j E\bigl[\sum_i E(e_{ij}^*\tilde{A}_{1ij}^*)\mid Y_j\bigr],\\
\hat\gamma_{30}&=\gamma_{30}+\Bigl(\textstyle\sum_j\sum_i X_{1ij}X_{1ij}^T\Bigr)^{-1}\sum_j\sum_i X_{1ij}E\bigl[E(e_{ij}^*)\mid Y_j\bigr],\\
\hat\delta_{10}&=\delta_{10}+\mathrm{vec}\Bigl[\Bigl(\textstyle\sum_j\sum_i E\bigl[E(\epsilon_{ij}^*)\mid Y_j\bigr]X_{1ij}^T\Bigr)\Bigl(\textstyle\sum_j\sum_i X_{1ij}X_{1ij}^T\Bigr)^{-1}\Bigr],\\
\hat\gamma_\beta&=\gamma_\beta+\mathrm{vec}\Bigl[\Bigl(\textstyle\sum_j E\bigl[E(u_j^*)\mid Y_j\bigr]X_{2j}^T\Bigr)\Bigl(\textstyle\sum_j X_{2j}X_{2j}^T\Bigr)^{-1}\Bigr],\\
\hat\sigma^2&=\textstyle\sum_j E\bigl[\sum_i E(e_{ij}^{*2})\mid Y_j\bigr]/N,\qquad \hat\Sigma_\epsilon=\sum_j E\bigl[\sum_i E(\epsilon_{ij}^*\epsilon_{ij}^{*T})\mid Y_j\bigr]/N,\qquad \hat\tau=\sum_j E\bigl[E(u_j^*u_j^{*T})\mid Y_j\bigr]/J,
\end{aligned}
\]
given $\theta$, where $E(e_{ij}^*)=B_{1j}^{*T}E(r_{ij}^*)$ and $E(e_{ij}^{*2})=B_{1j}^{*T}E(r_{ij}^*r_{ij}^{*T})B_{1j}^*$ for $B_{1j}^{*T}=[1\ \ -B_{1j}^T]$.

Appendix C

Numerical Integration by AGHQ

Let $f(u_{1j}^*)=h(Y_j\mid u_{1j}^*)\,\phi(u_{1j}^*;0,\tau_{11})$ be a function of $u_{1j}^*$. Given $\tilde u_{1j}^*=E(u_{1j}^*\mid Y_j)$, $V_{u1j}=\mathrm{var}(u_{1j}^*\mid Y_j)=L_{u1j}^2/2$, $Q$-point weights $(w_1,\dots,w_Q)$, and abscissas $(a_1,\dots,a_Q)$,
\[
h(Y_j)=\int\frac{f(u_{1j}^*)}{\phi(u_{1j}^*;\tilde u_{1j}^*,V_{u1j})}\,\phi(u_{1j}^*;\tilde u_{1j}^*,V_{u1j})\,du_{1j}^*\;\approx\;L_{u1j}\sum_{k=1}^{Q}w_k\,e^{a_k^2}f(z_{kj}),\tag{C1}
\]
for $z_{kj}=L_{u1j}a_k+\tilde u_{1j}^*$. The posterior $g(u_{1j}^*\mid Y_j)=f(u_{1j}^*)/h(Y_j)$ is approximately $\phi(u_{1j}^*;\tilde u_{1j}^*,V_{u1j})$ for large cluster sizes $n_j$ by the Bayesian central limit theorem, so that the ratio $f(u_{1j}^*)/\phi(u_{1j}^*;\tilde u_{1j}^*,V_{u1j})$ is well approximated by a low-degree polynomial and $h(Y_j)$ is therefore well approximated. The approximation is exact if $f(u_{1j}^*)$ is a polynomial of degree $2Q-1$ in $u_{1j}^*$ (Pinheiro and Bates 1995; Rabe-Hesketh, Skrondal, and Pickles 2002; Carlin and Louis 2009). Likewise, for closed-form $E(S_{CDj}\mid u_{1j}^*,Y_j)$,
\[
E(S_{CDj}\mid Y_j)=\int E(S_{CDj}\mid u_{1j}^*,Y_j)\,g(u_{1j}^*\mid Y_j)\,du_{1j}^*\;\approx\;\frac{L_{u1j}}{h(Y_j)}\sum_{k=1}^{Q}E(S_{CDj}\mid z_{kj},Y_j)\,w_k\,e^{a_k^2}f(z_{kj}).\tag{C2}
\]
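To make (C1) concrete, the R sketch below approximates h(Y_j) for a single cluster, assuming a user-supplied integrand f(u) = h(Y_j | u) φ(u; 0, τ11) and the posterior summaries ũ_1j* and V_u1j; statmod::gauss.quad is used only to obtain the Gauss-Hermite abscissas and weights, and all names are illustrative.

# Adaptive Gauss-Hermite approximation of h(Y_j) as in (C1) (illustrative sketch).
aghq_hYj <- function(f, u_tilde, V_u, Q = 9) {
  gh <- statmod::gauss.quad(Q, kind = "hermite")   # abscissas a_k and weights w_k
  L  <- sqrt(2 * V_u)                              # V_u = L^2 / 2
  z  <- L * gh$nodes + u_tilde                     # z_kj = L * a_k + u_tilde
  L * sum(gh$weights * exp(gh$nodes^2) * vapply(z, f, numeric(1)))
}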

Let $Y_j^*=(Y_j,Y_{misj})$, and let $\phi_\tau$ and $\phi_\Sigma$ be vectors of the distinct elements of $\tau$ and $\Sigma_\epsilon$, respectively. The log-likelihood $l=\sum_j l_j$ and score $S=\sum_j S_j$ have summands
\[
l_j=\ln h(Y_j)=\ln\int g_j\,dY_{misj}\,d\nu_{0j}\,du_{1j}^*,\qquad S_j=\frac{\partial l_j}{\partial\theta}=E\Bigl[E\Bigl(\frac{\partial\ln g_j}{\partial\theta}\Bigr)\Bigm| Y_j\Bigr]
\]

for $g_j=\prod_i f(R_{ij}^*\mid A_{1ij}^*,u_j^*;\gamma_{20},\gamma_{30},\sigma^2)\,f(A_{1ij}^*\mid u_j^*;\delta_{10},\phi_\Sigma)\,\phi(u_j^*;\gamma_\beta,\tau)$ from (B1). Let $E=\partial\,\mathrm{vec}\,\tau/\partial\phi_\tau^T$ and $F=\partial\,\mathrm{vec}\,\Sigma_\epsilon/\partial\phi_\Sigma^T$. The vector $E(\partial\ln g_j/\partial\theta)$ stacks
\[
\begin{aligned}
E\Bigl(\frac{\partial\ln g_j}{\partial\gamma_{20}}\Bigr)&=\sigma^{-2}\sum_i E(e_{ij}^*\tilde A_{1ij}^*),\qquad
E\Bigl(\frac{\partial\ln g_j}{\partial\gamma_{30}}\Bigr)=\sigma^{-2}\sum_i E(e_{ij}^*)X_{1ij},\qquad
E\Bigl(\frac{\partial\ln g_j}{\partial\delta_{10}}\Bigr)=\mathrm{vec}\Bigl(\sum_i X_{1ij}E(\epsilon_{ij}^{*T})\Sigma_\epsilon^{-1}\Bigr),\\
E\Bigl(\frac{\partial\ln g_j}{\partial\sigma^2}\Bigr)&=\tfrac12 A_{\sigma j},\qquad
E\Bigl(\frac{\partial\ln g_j}{\partial\phi_\Sigma}\Bigr)=\tfrac12 F^TA_{\Sigma j},\qquad
E\Bigl(\frac{\partial\ln g_j}{\partial\gamma_\beta}\Bigr)=\mathrm{vec}\bigl(X_{2j}E(u_j^{*T})\tau^{-1}\bigr),\\
E\Bigl(\frac{\partial\ln g_j}{\partial\phi_\tau}\Bigr)&=\tfrac12 E^T\mathrm{vec}\bigl[\tau^{-1}E(u_j^*u_j^{*T})\tau^{-1}-\tau^{-1}\bigr]
\end{aligned}
\]
for $A_{\sigma j}=\sigma^{-4}\sum_i E(e_{ij}^{*2})-n_j\sigma^{-2}$ and $A_{\Sigma j}=\mathrm{vec}\bigl(\Sigma_\epsilon^{-1}\sum_i E(\epsilon_{ij}^*\epsilon_{ij}^{*T})\Sigma_\epsilon^{-1}-n_j\Sigma_\epsilon^{-1}\bigr)$. We also compute $S_j$ by AGHQ to obtain $\mathrm{var}(\hat\theta)\approx\bigl(\sum_j S_jS_j^T\bigr)^{-1}$ (Hedeker and Gibbons 1994; Raudenbush, Yang, and Yosef 2000; Olsen and Schafer 2001). Section 7 shows that the approximation is good for the sample sizes analyzed in this article.
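Assuming the per-cluster scores S_j have been evaluated by AGHQ and stacked as the rows of a J x p matrix (the name S_mat is illustrative), the variance approximation above is a one-liner:

# Approximate var(theta_hat) = (sum_j S_j S_j^T)^{-1} from stacked scores (illustrative sketch).
var_from_scores <- function(S_mat) solve(crossprod(S_mat))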

Appendix D

Translating Model (B1) to $\hat\theta_{(19)}^*$

Define $\beta_{kj}=\gamma_{k0}^TX_{2j}+u_{kj}^*$, $\beta_{Cj}=\delta_{001}^TX_{2j}+\nu_j$, $\mathrm{cov}(u_{kj}^*,u_{k'j}^*\mid X_{2j})=\tau_{kk'}$, $\mathrm{cov}(u_{kj}^*,\nu_j\mid X_{2j})=\tau_{k\nu}$, and $\mathrm{var}(\nu_j\mid X_{2j})=\tau_{\nu\nu}$ for $k,k'=0,1$. With $X_{2j}$ marginalized out,
\[
\beta_j=\begin{bmatrix}\beta_{0j}\\\beta_{1j}\\\beta_{Cj}\end{bmatrix}\sim N\!\left(\begin{bmatrix}\gamma_{00}^T\\\gamma_{10}^T\\\delta_{001}^T\end{bmatrix}E_{X2},\;
\begin{bmatrix}t_{00}&t_{01}&t_{0\nu}\\\cdot&t_{11}&t_{1\nu}\\\cdot&\cdot&t_{\nu\nu}\end{bmatrix}=\begin{bmatrix}\tau_{00}&\tau_{01}&\tau_{0\nu}\\\cdot&\tau_{11}&\tau_{1\nu}\\\cdot&\cdot&\tau_{\nu\nu}\end{bmatrix}+\begin{bmatrix}\gamma_{00}^T\\\gamma_{10}^T\\\delta_{001}^T\end{bmatrix}V_{X2}\begin{bmatrix}\gamma_{00}&\gamma_{10}&\delta_{001}\end{bmatrix}\right)
\]
for $E(X_{2j})=E_{X2}$ and $\mathrm{var}(X_{2j})=V_{X2}$. Let $\tilde\beta_{Cj}=\beta_{Cj}-\delta_{001}^TE_{X2}\sim N(0,t_{\nu\nu})$ to find
\[
\beta_{kj}\mid\tilde\beta_{Cj}\;\sim\;N\bigl(\gamma_{k0}^*+\gamma_{k1}^*\tilde\beta_{Cj},\;\tau_{kk}^*\bigr)\tag{D1}
\]
for $\gamma_{k0}^*=\gamma_{k0}^TE_{X2}$, $\gamma_{k1}^*=t_{k\nu}/t_{\nu\nu}$, and $\mathrm{cov}(\beta_{kj},\beta_{k'j}\mid\tilde\beta_{Cj})=\tau_{kk'}^*=t_{kk'}-\gamma_{k1}^*t_{\nu\nu}\gamma_{k'1}^*$.

Within cluster $j$ given $u_j^*$, denote $\tilde A_{1ij}^*=A_{1ij}^*-\Delta_{00}X_{2j}-\nu_{1j}=\Delta_{10}X_{1ij}+\epsilon_{ij}$. Marginalizing $X_{1ij}$ out using $E_{X1}=E(X_{1ij})$ and $V_{X1}=\mathrm{var}(X_{1ij})$, we find $f(Y_{ij}^*\mid u_j^*)$ from
\[
\tilde A_{1ij}^*=\Delta_{10}E_{X1}+\tilde\epsilon_{ij}^*,\qquad R_{ij}=\beta_{0j}+(B_{1j}^T\Delta_{10}+\gamma_{30}^T)E_{X1}+B_{1j}^T\tilde\epsilon_{ij}^*+\tilde e_{ij}^*
\]
for $[\tilde e_{ij}^*\ \tilde\epsilon_{ij}^{*T}]^T\sim N\!\left(0,\begin{bmatrix}\sigma_{ee}^*&\Sigma_{e\epsilon}^*\\\Sigma_{\epsilon e}^*&\Sigma_{\epsilon\epsilon}^*\end{bmatrix}\right)$, where $\sigma_{ee}^*=\gamma_{30}^TV_{X1}\gamma_{30}+\sigma^2$, $\Sigma_{\epsilon e}^*=\begin{bmatrix}\sigma_{ce}^*\\\Sigma_{1e}^*\end{bmatrix}=\begin{bmatrix}\delta_{101}^T\\\Delta_{102}\end{bmatrix}V_{X1}\gamma_{30}$, and
\[
\Sigma_{\epsilon\epsilon}^*=\begin{bmatrix}\sigma_{cc}^*&\Sigma_{c1}^*\\\Sigma_{1c}^*&\Sigma_{11}^*\end{bmatrix}=\begin{bmatrix}\delta_{101}^TV_{X1}\delta_{101}+\sigma_{cc}&\delta_{101}^TV_{X1}\Delta_{102}^T+\Sigma_{c1}\\\Delta_{102}V_{X1}\delta_{101}+\Sigma_{1c}&\Delta_{102}V_{X1}\Delta_{102}^T+\Sigma_{11}\end{bmatrix}.
\]
Consequently, $R_{ij}\mid\beta_j,\epsilon_{ij}\sim N[E(R_{ij}^*\mid\beta_j,\epsilon_{ij}),\sigma^{2*}]$ for
\[
\begin{aligned}
E(R_{ij}^*\mid\beta_j,\epsilon_{ij})&=\beta_{0j}+(B_{1j}^T\Delta_{10}+\gamma_{30}^T)E_{X1}+\bigl(\beta_{1j}+\gamma_{20}^T\Sigma_{1c}^*/\sigma_{cc}^*+\sigma_{ec}^*/\sigma_{cc}^*\bigr)\epsilon_{ij},\\
\sigma^{2*}&=\gamma_{20}^T\bigl(\Sigma_{11}^*-\Sigma_{1c}^*\Sigma_{c1}^*/\sigma_{cc}^*\bigr)\gamma_{20}+2\gamma_{20}^T\bigl(\Sigma_{1e}^*-\Sigma_{1c}^*\sigma_{ce}^*/\sigma_{cc}^*\bigr)+\bigl(\sigma_{ee}^*-\sigma_{ce}^{*2}/\sigma_{cc}^*\bigr).
\end{aligned}\tag{D2}
\]

As explained in Section 4.1, Equations (D1)–(D2) result in the mixed model (19) with
\[
\begin{aligned}
\gamma_{00}&=\gamma_{00}^*+(\gamma_{10}^*\delta_{101}^T+\gamma_{20}^T\Delta_{102}+\gamma_{30}^T)E_{X1},\qquad \gamma_{01}=\gamma_{01}^*+\gamma_{11}^*\delta_{101}^TE_{X1},\\
\gamma_{10}&=\gamma_{10}^*+\gamma_{20}^T\Sigma_{1c}^*/\sigma_{cc}^*+\sigma_{ec}^*/\sigma_{cc}^*,\qquad \gamma_{11}=\gamma_{11}^*,\qquad \sigma^2=\sigma^{2*},\\
\tau_{00|\nu}&=\tau_{00}^*+2\tau_{01}^*\delta_{101}^TE_{X1}+\tau_{11}^*(\delta_{101}^TE_{X1})^2,\qquad \tau_{01|\nu}=\tau_{01}^*+\tau_{11}^*\delta_{101}^TE_{X1},\qquad \tau_{11|\nu}=\tau_{11}^*.
\end{aligned}\tag{D3}
\]

Estimating $(E_{X1},V_{X1},E_{X2},V_{X2})$ from the sample, we obtain $\hat\theta_{(19)}^*$ above and compute $\mathrm{var}(\hat\theta_{(19)}^*)$ by the delta method. The translations (D3) simplify if $E_{X1}=0$ (e.g., $E_{X1}=E(X_{1ij}-\bar X_{1j})$). In Section 6, we used the translations with $X_{2j}=1$ and $X_{1ij}=0$.
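The delta-method step can be sketched as follows, assuming a function translate() that implements the mapping (D3) from the estimated parameters of model (B1) to θ̂(19)*, with estimate theta_hat and covariance matrix V_theta; numDeriv::jacobian supplies a numerical Jacobian, and all names are illustrative.

# Delta-method covariance of the translated estimates (illustrative sketch).
delta_var <- function(translate, theta_hat, V_theta) {
  J <- numDeriv::jacobian(translate, theta_hat)   # Jacobian of the translation (D3)
  J %*% V_theta %*% t(J)
}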

Appendix E

Compatible Gibbs Sampler without a PKRE

Without the benefit of a PKRE, a Bayesian joint distribution based on (19) is $f(R_{ij}^*\mid C_{ij}^*,\nu_j,u_{0j},u_{1j},\theta)\,f(u_{0j},u_{1j}\mid\theta)\,f(C_{ij}^*\mid\nu_j,\theta)\,\phi(\nu_j;0,\tau_{\nu\nu})\,p(\theta)$ for a prior $p(\theta)$ and $\theta=\theta_{(19)}^*$. Our scientific model is the analytic integral $f_1(R_{ij}^*\mid C_{ij}^*,\nu_j,\theta)=\int f(R_{ij}^*\mid C_{ij}^*,\nu_j,u_{0j},u_{1j},\theta)\,f(u_{0j},u_{1j}\mid\theta)\,du_{0j}\,du_{1j}$. The Gibbs sampler of Enders, Du, and Keller (2020) may be modified to be compatible with our scientific model conditional on the latent covariate $\nu_j$ by sampling (i) $\nu_j$ from the compatible posterior
\[
p(\nu_j\mid\cdot)=\frac{\prod_i f(R_{ij}^*\mid C_{ij}^*,\nu_j,u_{0j},u_{1j},\theta)\,f(C_{ij}^*\mid\nu_j,\theta)\;\phi(\nu_j;0,\tau_{\nu\nu})}{\int\prod_i f(R_{ij}^*\mid C_{ij}^*,\nu_j,u_{0j},u_{1j},\theta)\,f(C_{ij}^*\mid\nu_j,\theta)\;\phi(\nu_j;0,\tau_{\nu\nu})\,d\nu_j},
\]
with the denominator approximated by MCMC integration, and (ii) a missing $C_{ij}^*$ from the compatible normal posterior $p(C_{ij}^*\mid\cdot)\propto f(R_{ij}^*\mid C_{ij}^*,\nu_j,u_{0j},u_{1j},\theta)\,f(C_{ij}^*\mid\nu_j,\theta)$ with
\[
E(C_{ij}^*\mid\cdot)=\delta+\nu_j+\frac{\beta_{1j}\sigma_{cc}}{\beta_{1j}^2\sigma_{cc}+\sigma^2}\,(R_{ij}^*-\beta_{0j}),\qquad \mathrm{var}(C_{ij}^*\mid\cdot)=\frac{\sigma_{cc}\,\sigma^2}{\beta_{1j}^2\sigma_{cc}+\sigma^2}
\]
by (15), for $\beta_{0j}=\gamma_{00}+\gamma_{01}\nu_j+u_{0j}$ and $\beta_{1j}=\gamma_{10}+\gamma_{11}\nu_j+u_{1j}$.
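As an illustration of step (ii), the sketch below makes one draw of a missing C_ij* from its compatible normal posterior, assuming all inputs are the current values of the Gibbs cycle; the function name and arguments are illustrative.

# One draw of a missing C_ij* from its compatible normal posterior (illustrative sketch).
draw_Cij <- function(R_ij, delta, nu_j, u0_j, u1_j,
                     gamma00, gamma01, gamma10, gamma11, sigma_cc, sigma2) {
  beta0 <- gamma00 + gamma01 * nu_j + u0_j
  beta1 <- gamma10 + gamma11 * nu_j + u1_j
  m <- delta + nu_j + beta1 * sigma_cc / (beta1^2 * sigma_cc + sigma2) * (R_ij - beta0)
  v <- sigma_cc * sigma2 / (beta1^2 * sigma_cc + sigma2)
  rnorm(1, mean = m, sd = sqrt(v))
}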