Abstract
We consider two-level models where a continuous response R and continuous covariates C are assumed missing at random. Inferences based on maximum likelihood or Bayes are routinely made by estimating their joint normal distribution from observed data and . However, if the model for R given C includes random coefficients, interactions, or polynomial terms, their joint distribution will be nonstandard. We propose a family of unique factorizations involving selected “provisionally known random effects” u such that is normally distributed and u is a low-dimensional normal random vector; we approximate via adaptive Gauss-Hermite quadrature. For polynomial models, the approximation is exact but, in any case, can be made as accurate as required given sufficient computation time. The model incorporates random effects as explanatory variables, reducing bias due to measurement error. By construction, our factorizations solve problems of compatibility among fully conditional distributions that have arisen in Bayesian imputation based on the Gibbs Sampler. We spell out general rules for selecting u, and show that our factorizations can support fully compatible Bayesian methods of imputation using the Gibbs Sampler. Supplementary materials for this article are available online.
1 Introduction
Large-scale surveys and experiments within the social and health sciences frequently meet four conditions that supply the focus of this article. First, the data typically have a hierarchical structure, with respondents nested within local organizational units such as schools and hospitals or repeated measures nested within persons. Second, missing data are pervasive. Third, partially observed covariates may be measured with error. Finally, the covariates of interest may have random coefficients, statistical interactions or polynomial terms.
These characteristics have received some attention in recent methodological research. A popular approach conceives the response variables and partially observed covariates as outcomes within a multivariate, hierarchical linear model (HLM) under the assumption that the data are missing at random (MAR; Rubin Citation1976), an assumption often thought reasonable given a presumably rich set of covariates (Schafer and Yucel Citation2002; Goldstein, Carpenter, and Browne Citation2014). Missing values are imputed from their posterior predictive distribution of the missing values using what has been termed a “fully conditional specification” (FCS) using the Gibbs sampler. FCS requires the analyst to impute missing values for each variable subject to missingness, conditional on all other unknowns. This approach has been shown to function well when the process generating the joint distribution of unknowns is reasonably assumed multivariate normal. Under the four conditions just described, however, multivariate normality is not possible even if the separate conditional distributions are normal. A concern involves the bias that can arise from the incompatibility between the multiple normal conditional distributions that generate the imputations and the assumed joint distribution of the observed data (Erler Citation2016; Enders, Mistler, and Keller Citation2016; Enders, Du, and Keller Citation2020).
In this article, we propose to address the problem of compatibility (Arnold and Press Citation1989; Liu et al. Citation2014; Bartlett et al. Citation2015) within the framework of maximum likelihood (ML) estimation under the ignorable missing data assumption that data are MAR and that the parameter spaces for the multivariate HLM and the missing data mechanism are distinct (Rubin Citation1976; Little and Rubin Citation2002).
We will first review currently popular methods of inference for normal-theory multilevel models given incomplete data and show how to modify such approaches in which random effects often become explanatory variables, allowing us to model fallible measurement as a source of incomplete data. We’ll then describe our approach when incompletely observed covariates have random coefficients or polynomial terms, including statistical interactions. If a carefully selected subset of these random effects is conditionally assumed known, the model of interest can plausibly follow a normal-theory specification. The “provisionally known random effects (PKREs)” thus, selected can be integrated out of the likelihood using numerical methods. Capitalizing on the invariance properties of ML estimates (MLE), we show that our reparameterized model can be translated back to the original model’s parameter space. Auxiliary variables are introduced to increase the robustness of the MAR assumption. We will illustrate application of the model by studying income inequality and mathematics achievement in U.S. elementary schools. Although we base our case study on MLE, the likelihood factorization we propose can readily be implemented within a Bayesian approach with assurance of compatibility between conditional distributions and the joint distribution of observed data elements. We elucidate general rules for selecting low-dimensional “provisionally constant” random effects for a general class of models involving random coefficients or polynomial terms, including interactions.
Section 2 reviews estimation of normal-theory hierarchical models from data MAR. Section 3 explains how to estimate an analytic hierarchical model with random coefficients and cross-level interaction effects given data MAR via a PKRE. Section 4 describes estimation of the analytic model using auxiliary covariates to strengthen the MAR assumption. Section 5 extends the model with polynomial terms including within-level interactions and spells out rules for selecting provisionally constant random effects. Section 6 illustrates analysis of income inequality in math achievement. Section 7 evaluates our estimators by simulation. Finally, Section 8 discusses the limitations and extensions.
2 Inference for Multivariate Normal HLMs from Incomplete Data Using Random Effects as Predictors
We begin with a review of the normal theory case. Following Schafer and Yucel (Citation2002), our scientific interest focuses on the regression of a response variable on covariates . The elements of and , which are continuously measured, are partially observed. Ours is a two-level HLM (Lindley and Smith Citation1972; Dempster, Rubin, and Tsutakawa Citation1981) in which the response variable is a characteristic of “level-1 units” (e.g., students) who are clustered within level-2 units (e.g., schools). In contrast, the covariates may be characteristics of either level-1 units or level-2 units. In longitudinal studies, the level-1 units might be repeated occasions of measurements clustered within persons at level 2. Our scientist is primarily interested in the conditional distribution assumed normally distributed. However, to account for the missing values in , we propose a normal linear model . Because of missing data, we cannot separately estimate the parameters of f1 and f2 without discarding the cases with missing values on any element of or . It is well known that a procedure that analyzes only the cases with complete data is prone to bias and/or loss of efficiency (Little and Rubin Citation2002). The cost can be particularly high when a missing item of varies at level 2, in which case all of the level-1 units in the level-2 unit having missing values are discarded along with that level-2 unit itself.
To make efficient inference possible, and following Schafer and Yucel (Citation2002), we compose each outcome vector and write a multivariate HLM (1) (1) where and . Here, and are composed of completely observed covariates; α is a vector of fixed regression coefficients while b and are independent random effects that vary at levels 2 and 1, respectively. We partition the complete data (CD) into components . In particular, if is N by 1 but we observe only elements of , we construct an M-by-N matrix O in which every row contains a single entry equal to unity indicating which value of that is observed. All other entries in the same row are 0. Our model for the observed Y is therefore (2) (2) where and . Assuming data MAR, we can make efficient estimates of the parameters using ML or Bayes inference from the observed data according to (1). The most common method in recent literature is a Bayesian approach based on multiple imputation (MI) that we will consider in the final section of this article (see Section 8). However, Schafer and Yucel (Citation2002) showed how one can obtain MLE using the EM algorithm and use the estimates for MI. Shin and Raudenbush (Citation2007), showed how to recover MLE for the analytic model by constructing model (1) carefully and estimating the model without a need for imputations. This approach allows some components of to vary at level 2.
2.1 Estimation via the EM Algorithm
The EM algorithm (Dempster, Laird, and Rubin Citation1977) requires evaluation at each iteration m + 1 of the conditional expected CD score given the observed data and parameter estimates at iteration m. To find the CD score, our model for the CD is a multivariate HLM (1). Write model (1) at the level of cluster j and let be an arbitrary scalar element of . The CD score equations are well known (see, Raudenbush and Bryk Citation2002, chap. 14): (3) (3)
The conditional expected score equations given the observed Yj thus, clearly depend on the conditional mean and variance of which we readily derive from the fact that given iteration-m estimates (Shin and Raudenbush Citation2007) where (4) (4)
2.2 Example 1: Contextual Effects Model
We apply this framework to the “contextual effects model” (Willms Citation1986) for the study of income inequality in the mathematics achievement of U.S. elementary school children. This model decomposes the association between family income and educational achievement into a within-school component and a between-school component. In its simplest form, the model is typically written as (5) (5) where Rij is a measure of math achievement for student i in school j, Cij is a measure of the income of that child’s parents, is the sample mean of Cij within school j, and is the overall sample mean income for and . Here γ10 is known as the “within-school coefficient” while γ01 is the “between school” coefficient; and and eij are independent, normally distributed random effects that vary at levels 2 and 1, respectively. Of interest is the “contextual component” which, if positive, suggests that attending a school with high-income peers predicts elevated achievement net of the contribution of a student’s family income.
Two problems arise in the conventional analysis (Shin and Raudenbush Citation2010). First, past analyses have treated parental income as completely observed when, in fact, most surveys report substantial fractions of missing data on income. Second, the sample mean will be a noisy proxy for the actual mean income of parents in a school if the sample size per school is modest, as is the case in most U.S. national surveys. Using random effects as explanatory variables within the MAR framework addresses both issues. Reflecting that income is partially observed, we propose the CD model (6) (6) where νj and are independent, normally distributed random effects at levels 2 and 1, respectively. The joint distribution of and may be written (7) (7)
Stacking the equations within level-2 unit j, we have the general form of the model (8) (8) for and where we denote as a vector of m unities and Im an m-by-m identity matrix for a positive integer m. The HLM score equations for θ are familiar (Raudenbush and Bryk Citation2002, chap. 14; Shin and Raudenbush Citation2010) and will therefore not be elaborated here.
2.3 Compatibility
A set of conditional distributions and is said to be compatible if there exist a joint distribution and surjective maps such that for each j, and , we have and (Liu et al. Citation2014). Assuming a prior , Schafer and Yucel (Citation2002) developed the Gibbs sampler based on multivariate normal (MN) that is compatible with an analytic HLM when is linearly associated with . The MN , however, cannot be compatible with when has nonlinear effects (Kim, Sugar, and Belin Citation2015; Enders, Du, and Keller Citation2020). Goldstein, Carpenter, and Browne (Citation2014) and Enders, Du, and Keller (Citation2020) factored and estimated a compatible HLM with the nonlinearities by the Gibbs sampler via a Metropolis algorithm. We extend the HLM further with the nonlinear effects of that includes random effects as latent covariates. We estimate efficiently by ML via a PKRE and translate the estimates to the ML estimates of the compatible HLM as explained in the next section.
3 Coping with Random Coefficients and Interactions
Our scientific focus is again on assumed normal in distribution as in (6). However, the model now includes elements of that have nonlinearities including random coefficients, polynomial terms, or interactions. In this case, even if we can reasonably assume that is normal, the joint distribution cannot be normal.
Recent research on the problem of nonlinearities has focused primarily on two widely used methods of analysis of multilevel incomplete data. Perhaps the most popular method of imputation for HLMs under the MAR assumption imputed missing values by a series of sequential univariate regression models (Raghunathan et al. Citation2001), which is also known as MI by fully conditional specification or FCS (van Buuren et al. Citation2006). However, these conditionals will not be compatible with the joint distribution of interest in the presence of the nonlinearities of interest in this article, as shown by Enders, Mistler, and Keller (Citation2016) and Enders, Du, and Keller (Citation2020). The main alternative approach estimates the joint distribution of the outcome and covariates subject to missingness. Missing values may be imputed based on their fully conditional distributions (Liu, Taylor, and Belin Citation2000; Schafer and Yucel Citation2002; Goldstein et al. Citation2009) and by maximum likelihood (Shin and Raudenbush Citation2007, Citation2010; Ren and Shin Citation2016). These normal-theory models were not designed to handle the nonlinearities of interest here. By means of a Gibbs sampler via the Metropolis Hastings algorithm, Goldstein, Carpenter, and Browne (Citation2014) imputed missing values of a response and covariates including interaction and polynomial terms having fixed effects in a multilevel model where covariates and response may be continuous or categorical. Similarly, Erler (Citation2016) took the sequential fully Bayesian approach (Ibrahim, Chen, and Lipsitz Citation2002) that expresses the joint distribution of variables MAR, including the outcome, into a series of univariate conditional models to handle missing values of cluster-level continuous and discrete covariates having fixed effects. Erler et al. (Citation2019) extended the approach to imputing missing values of level-1 covariates having fixed effects. Enders, Du, and Keller (Citation2020) showed, however, that these approaches do not guarantee compatibility when the partially observed covariates have random coefficients. Lacking a formal model for the joint distribution of interest, these approaches appear to fall short of ensuring compatibility.
3.1 Factorization of the Likelihood Based on Provisionally Known Random Effects
To cope with nonlinearities induced by partially observed covariates having random coefficients, polynomials, or interaction terms, we model the joint distribution induced by the scientific model of interest and the model for the covariates, . The problem is similar to the problem of estimation of generalized linear mixed models (Hedeker and Gibbons Citation1994; Raudenbush, Yang, and Yosef Citation2000). Using the notation of (2), we must evaluate (9) (9)
The integral just defined does not have closed form in the presence of nonlinearities and must be approximated numerically. To facilitate the approximation, we choose a PKRE, call it u such that (10) (10)
is a normal HLM discussed in Section 2. The problem of approximation is then to evaluate (11) (11)
The computational challenge is to select u such that the dimension of the analytic integral (10) is maximized while the dimension of numerical approximation (11) is minimized.
3.2 Example 2: A Partially Observed Covariate Having a Random Coefficient
We return to our example of income inequality in U.S. elementary schools. A common finding in multilevel studies of educational achievement is that relationship between student socioeconomic background and achievement varies from school to school (Raudenbush and Bryk Citation1986). This variation could reflect variation in school organization, composition, and resources. To assess the variation in income inequality within schools, we expand the contextual model (6) to allow for a random coefficient (12) (12) where and . This is model (6) for if . Let for . Level-1 random effects and are assumed independent of each other and of the level-2 random effects. We denote the parameters of the CD model (12) as .
Clearly cannot be marginally normal in distribution because of the multiplication of the two normal random effects and . Our strategy is to select one of these two random effects to be considered “provisionally known”; we choose for this purpose because it has lower dimension varying across schools than does which varies across students within each school. Therefore, we write (13) (13) where and for . We can therefore write the “provisional” joint model for as (14) (14)
where and for (15) (15)
The CD score is therefore familiar; see (3). To complete the EM algorithm to estimate , we will maximize the likelihood for (16) (16) where is from (14), and for and . We use adaptive Gauss-Hermite Quadrature (AGHQ) to numerically approximate integral (16) (Naylor and Smith Citation1982; Pinheiro and Bates Citation1995; Rabe-Hesketh, Skrondal, and Pickles Citation2002).
An additional AGHQ step is needed to complete the E step by evaluating the expectation of a CD score component of cluster j from (3) (17) (17) where has the familiar form of the empirical Bayes posterior normal density with the means and covariance matrix in (4). We use AGHQ to approximate the outer integral with respect to the univariate random effect, ; see Appendix C for detail. Because is nonstandard, we use the Bayes theorem to find (18) (18)
By the invariance property of MLE, we translate the MLE of the “provisional” parameters of model (14) to a one-to-one transformation via model (13).
3.3 Example 3: Cross-Level Interaction Effects Involving Partially Observed Covariates
Following Lee and Bryk (Citation1989), we wish to extend the contextual effects model in two ways to allow the level-1 covariate: (i) to have random coefficients as in the previous section; and (ii) to interact with the level-2 covariate. We therefore write the model (19) (19) where and are, as before, bivariate normal, but conditional on νj with variances and , respectively, and covariance . Other random effects are as defined in model (12). The parameters are .
This model involves two products of normal theory random effects: and . Using the logic of the last section, we wish to choose a PKRE such that, conditional on that effect, our CD joint model will be a normal-theory HLM. The question is how to choose this random effect. We might provisionally hold both νj and constant, but we would prefer to minimize the dimension of the provisionally constant random effects such that the computational burden of numerical approximation is a minimum. Alternatively, we could provisionally hold constant. But while is a scalar, it varies across all level-1 units, which could be a very large set. Instead, we choose to provisionally hold constant the scalar level-2 random effect (20) (20)
In addition, we define to represent the model parsimoniously as . This model is therefore equivalent to model (12) to imply (13)–(18) and a one-to-one correspondence between and .
Consequently, the CD model (12) is a one-to-one transformation of the provisional model in (14) by distribution (13) and, also, of in (19) by (20). We choose to estimate joint model (14) via the PKRE for efficient computation that is guaranteed to be compatible with the scientific model by the one-to-one correspondence. Whereas scientific interest focuses on , we will be estimating the parameters of the provisional joint model (14). We then exploit the invariance property of MLE again, translating the MLE of θ back to those of .
The standard approach of replacing νj and with and , respectively, in model (19) produces biased estimation of even if and were fully observed; see Appendix A. Model (19) can readily incorporate multiple covariates having random effects and also having multiple cross-level interactions. Moreover, it is straightforward to include covariates having fixed coefficients. This is important given the need to add auxiliary information to strengthen the robustness of the MAR assumption.
4 Auxiliary Covariates
Our focus is on estimation of income inequality in achievement of model (19). Because certain variables such as family income have severe missing rates, it robustifies the MAR assumption to involve auxiliary covariates (e.g., parent occupation and pretest score) correlated with missing values or patterns (Collins, Schafer, and Kam Citation2003). We consider two approaches to augment auxiliary covariates, partially observed or measured with error, to the CD model. One approach is to assume such covariates to be linearly associated with the outcome and income. Violation of the linearity, however, may produce biased estimation. The other approach is to augment them as responses to , thereby allowing them to be nonlinearly associated with the outcome and income. We then transform the MLE of the CD model to those of a nested model (19).
4.1 Linearly Associated Auxiliary Covariates
To augment auxiliary covariates that are linearly associated with the outcome and income, we extend the scalar of model (12) to a vector consisting of income and auxiliary covariates at level 1 and at level 2. We then write the CD model (21) (21) where and for the means and school-specific random effects of and the child-specific random effects of . Other components γ00, γ10, and are defined in model (19) except the linear effects γ20 of . Again and are independent of each other and for and ; let and . Auxiliary predictors and are linearly associated with the outcome and income at levels 2 and 1, respectively. The parameters are .
We select provisionally known again. Let and be of respective lengths p1 and p2. Using model (13) now for and , we find the provisional joint model (14) this time for where for , (22) (22) denoting .
We estimate using its one-to-one transformation of by the EM algorithm, computing and by AGHQ as before. See Appendix B for the E step. The scalar PKRE yields efficient computation.
To translate to , we use model (13) to let and find and for and . We then marginalize auxiliary out to obtain that should be of form and , respectively. See Appendix D for detail. We illustrate this approach in Sections 6 and 7.
4.2 Nonlinearly Associated Auxiliary Covariates
The linearity assumption between and in model (21) may be violated to produce biased estimation of . In that case, we augment to multivariate of length r, allowing them to be nonlinearly associated in (23) (23) and in (21) for conformable vectors γ00 and γ10 of fixed effects and random vectors and independent of as before. It is straightforward to find the provisional joint model (14) for and , selecting r-by-1 to be provisionally known. We estimate the joint model efficiently and, then, translate the bivariate distribution of and to ; see Appendix D again for the translation. Numerical approximation is now intensive with respect to vector . Finally, some of may be linearly associated with the outcome and income while others may not. We illustrate this case in Data Analysis.
With additional known auxiliary covariates, we marginalize them out first given their expectation and covariance matrix estimated from sample before the translation above.
5 Within-Level Interactions and Polynomial Terms
We write a CD model including the level-2 interaction effects γ04 of (24) (24) and as in model (21) where and has fixed effects γ10 for simplicity. We select one interactive term νj, the other in dimension, provisionally known to find (25) (25)
Let to find the model given νj (26) (26) for and where varies within but not between clusters. As in Section 4.1, we stack these equations to find the implied provisional model (14) for , and compute , setting below, and by (27) (27) numerically for from the provisional model.
We now consider another CD model including level-1 interaction effects (28) (28) and in model (21) where and have main (γ10 and γ20) and interaction effects (γ30). We select an interactive term in dimension, provisionally known again, and find
for and . Given provisionally constant, let and to express (29) (29) where and now varies between but not within clusters to imply the provisional model (14) given for . We compute and by (30) (30) numerically for from the provisional model as before. The numerical integral can be computationally intensive, in particular, given large cluster sizes; multivariate Laplace approximation (Pinheiro and Bates Citation1995; Raudenbush, Yang, and Yosef Citation2000) and parallel computation of each cluster may result in efficient computation.
5.1 Rules for Choosing Provisionally Known Random Effects
We now provide general rules for selecting PKREs:
For an interaction as in (28), we hold in dimension, constant. The resulting model quadratic in will minimize the dimension of AGHQ;
For a level-2 interaction , we hold one with a smaller dimension constant again;
For a three-way interaction at level 1, hold two terms the third one in dimension constant and this applies to a three-way interaction at level 2, too;
For cross-level interactions as in model (21), hold constant;
For the cluster-specific effects of , we hold constant;
Finally, for a subset of these effects, hold constant the union of the PKREs of the subset.
Our CD model in each case includes a scientific model of interest and induces a provisional joint model (14). Because the models are one-to-one transformations of each other, the scientific model is guaranteed to be compatible with the joint model we estimate.
6 Data Analysis
Rising income inequality in the United States and other nations has recently attracted substantial attention (Piketty Citation2014). A key question involves the consequence of such inequality for equality of opportunity among children. Following past research, we decompose the association between family income and educational achievement into a contextual component and a child-specific component (Firebaugh Citation1978; Willms Citation1986; Lee and Bryk Citation1989). The contextual component reflects the fact that elementary schools in the United States are quite segregated based on family income. Such segregation reflects and may reinforce residential segregation as a function of family income. Two children having the same family income might differ in educational achievement as a result of their experience in low-income versus high-income schools. The individual component reflects socioeconomic inequality within schools. Children attending the same school who differ with respect to family income may tend to differ with respect to their achievement. However, the magnitude of this within-school disparity may vary from school to school (Raudenbush and Bryk Citation1986; Lee and Bryk Citation1989). The contextual effects model (Willms Citation1986) supports the composition of inequality in achievement that we seek.
First, we decompose family income for child i in school j into between-school and within-school components as in model (19) for the mean of log-income δ, the school-specific deviation from the mean , and the child-specific component . A child’s mathematics achievement in spring 1999 (mathS) depends on these components via the model (19) for . The parameters are .
We choose math achievement as our outcome because of its importance in predicting educational attainment and adult earnings (Nomi and Raudenbush Citation2016; Rivera-Batiz Citation1982). In model (19), γ01 is the between-school gradient, reflecting the expected difference in associated with a unit difference in school mean income; γ10 reflects the average within-school gradient. However, the within-school gradient may depend on school mean income, an interaction effect represented by γ11, and this gradient may also vary randomly over schools as represented by . School-mean achievement is γ00 and, conditional on income, varies randomly over schools, . If the within-school gradients were constant (), the overall linear coefficient for income will be where can be regarded as an index of school segregation as a function of income. Define as the “contextual effect” (Willms Citation1986), the expected difference in math achievement between two students with the same family income who attend two schools that differ by one unit in school mean income. A nation’s income gradient would be , which increases with the within school segregation based on income, ρ, the contextual coefficient, γc and within-school gradient γ10. This simple relationship will not hold if the within-school gradients vary over schools, and one purpose of our analysis is to test that proposition.
To do so, we use data from 21,211 children attending kindergarten in 1,018 schools as of fall 1998, a nationally representative sample known as the Early Childhood Longitudinal Study of 1998 (“ECLS”) that is publicly available at https://nces.ed.gov/ecls; see . Only 8% of the math achievement data are missing. However, family income data are missing for 32% of the sample, a finding that is quite typical in surveys of educational achievement. Fortunately, ECLS (Tourangeau et al. Citation2009) provides data on auxiliary variables, including the maximum occupational status score of parents (occupation), missing for only 5% of the cases, as well as math achievement in fall 1998 (mathF), which is strongly predictive of math achievement in spring 1999 (mathS). In all, we have four auxiliary variables, correlated with income, outcome or missing patterns.
In this article, we take convergence to ML to be less than in the square root of the summed squared differences between of two consecutive iterations. We estimate the model for covariates using all observed values by ML via the EM algorithm (Shin and Raudenbush Citation2007) and a scientific model for given the covariates and their sample cluster and overall means by complete case analysis, and transform the estimates to the initial values of the joint model. We carry out complete case analysis by R (RCoreTeam 2017), estimate θ on a Dell XPS laptop with the 11th generation Intel(R) Core(TM) i9-11900H processor at 2.50GHz and 64 GB RAM, and test a hypothesis at a level .
6.1 Linearly Associated Auxiliary Covariates
Recall from Section 4 that we have two strategies. Following Section 4.1, we model the auxiliary covariates linearly associated with the outcome and income in model (21) where is a vector of mathF, occupation, and age in months at assessment of spring 1999 (age) and is the square root of kindergarten enrollment (enrollment) by the Box-Cox transformation. Consequently, consists of 10 fixed effects (3-by-1 γ20 and 5-by-1 ) and 39 variances and covariances (7-by-7 τ and 4-by-4 ).
We standardized each variable to have mean 0 and variance 1, except income for interpretation. Estimation of with a provisionally known and 20 abscissas converged fast to ML in 9 iterations and 13 sec. The transformed estimates and standard errors (SEs), multiplied by 100, are listed under “EM-AGHQ I” in . The school-mean and within-school components of family income are positively associated with math achievement while their interaction effect is insignificant. The within-school income effects appear to vary at most modestly across schools with the variance estimate less than the associated SE 0.25. Based on , we find a large-sample 95% confidence interval (CI) for : near zero.
To test this model against the null hypothesis that and , we estimated the null model, the multivariate normal distribution of linear associated , efficiently by the EM algorithm (Shin and Raudenbush Citation2007, Citation2010). The model consisted of 42 parameters comprising 6 fixed intercepts and 6-by-6 level-2 and 5-by-5 level-1 variance covariance matrices, and converged to log ML -107370.60. Compared to the log ML of displayed at the last row of , the likelihood ratio test statistic to test the two joint models is 19.20 with 7 degrees of freedom to give a conservative p-value < 0.01 (Stram and Lee Citation1994). Therefore, we infer that the outcome is nonlinearly associated with income. Lastly, estimation using 10 abscissas also produced the same estimates under EM-AGHQ I.
6.2 Nonlinearly Associated Auxiliary Covariates
Preliminary analysis indicates that mathF and occupation may be nonlinearly associated with the outcome and income. To test the hypothesis, following Section 4.2, we model multivariate responses (mathS, mathF, occupation) and age in model (21): (31) (31) and as before for 3-by-1 vectors γ00, γ10 and γ20 of fixed effects and random vectors and . Therefore, τ is 9-by-9, Σe 3-by-3 and 2-by-2; the CD model comprises 12 fixed effects and 54 variances and covariances. With 3-by-1 provisionally known, we estimated the joint model using 10 abscissae per dimension which converged to ML in 734th iterations and 928 min. The log ML of is shown at the bottom row under EM-AGHQ II. The LRT statistic to test model (21) versus model (31) is 367.93. The conservative LRT with 17 degrees of freedom (Stram and Lee Citation1994) produces a p-value 0 to reject the null in favor of mathF or occupation nonlinearly associated with the outcome and income.
The translated estimates and SEs are listed under EM-AGHQ II in . Compared to those under EM-AGHQ I, the main effect of within-school income is larger; furthermore, the interaction effect is significant, and so is the random effects of income by the Wald test to produce a 95% CI for τ11 now distant from zero. We conclude that the linearity assumption associated with (21) is violated to attenuate the main, interaction and random effects of within-school income , confounded with the auxiliary covariates. As a result, EM-AGHQ II produces a smaller and, thus, explains more outcome variability within schools than does EM-AGHQ I.
6.3 Known Auxiliary Covariates
Either model (21) or (31) may be extended to control for known auxiliary covariates such as race ethnicity and gender at level 1 and school location and sector at level 2. The joint model may be estimated given the provisionally known again, and translated to and SEs. See Appendices B and D for detail.
7 Simulation Study
We focus on ML estimation of the scientific HLM (19) after simulating outcome and income from a joint model conditional on auxiliary covariates within which the HLM is nested. The goal is to compare our estimators (EM-AGHQ) with those by four methods: (a) the benchmark method (BM) given νj and ; (b) complete-case analysis (CC) given and instead; (c) MLE on MI (Shin and Raudenbush Citation2007, Citation2010); and (d) the Gibbs sampler (GS) of Enders, Du, and Keller (Citation2020) implemented in software Blimp (Keller and Enders Citation2021). BM is based on complete data while others are based on data MAR. Therefore, a good method will produce estimates near the BM counterparts. BM and CC estimate the scientific model by the lme4 package (Bates et al. Citation2015) in R. MLE on MI uses C programs to estimate incompatible MHLM (8) by ML and impute missing values including latent school mean incomes 20 times, more than did past multilevel missing data analyses (Schafer and Yucel Citation2002; Shin and Raudenbush Citation2007, Citation2013; Enders, Du, and Keller Citation2020), from their predictive distribution given observed data implied by the MHLM at ML (Shin and Raudenbush Citation2007, Citation2010); and estimates the HLM given the MI by lme4. GS estimates the joint model, simultaneously generating 20 imputations of missing values excluding latent school mean incomes, by Blimp and, then, the HLM (19) given the MI by Blimp again. EM-AGHQ estimates the joint model by our C program and translates the estimates to the desired MLE in R.
We simulate the ECLS and closely in terms of sample sizes, correlations and missing rates. Specifically, for n = 20 children in each of J = 1000 schools, we simulate: (i) known auxiliary covariates and equal to 1 (0) for a private (public) school and a minority (white) child, respectively, where varies within, but not between, schools for simplicity; (ii) random intercepts and slope and from with covariances 0.8 for ; (iii) independent to simulate the CD joint model (32) (32)
The simulated parameters in θ consist of , variances and , and covariances . We marginalize and out given their simulated expectations and variances, and translate θ to the scientific model in column two of as explained in Appendix D.
Next, we simulate the ECLS missing rates closely by (33) (33) given the known level-1 covariate : missing values drawn from are MAR. Because the mechanism does not provide information about the simulated model (32), the parameter spaces of the missing data mechanism and joint model are also distinct. We simulate higher missing rates for minority than white students by : for income with a 35% missing rate (46% for , 27% for ); and for response with an 11% missing rate (16% for , 7% for ) on average.
We repeated simulating data and estimating the scientific model by the approaches 500 times to compute the % bias, average estimated SE (ASE), empirical estimate of the true SE (ESE) over samples and coverage probability (coverage) of each estimator in the next five columns. Each cell or estimate occupies two rows: % bias (ASE) in the first, and ESE and coverage in the next row. The lme4 package is unable to produce ESE and coverage of a variance or covariance estimate. The BM estimates are of course very accurate and precise with % bias, small ASE close to ESE, and good coverages near the nominal 0.95 in column three.
The CC estimates in column four, however, are biased despite the large sample sizes. The standard deviations (SDs) and are 60% and 39% biased upward while the intercept γ00, interaction effect γ11 and covariance are 8%, 42%, and 29% biased downward, respectively. Only the estimates of γ01, γ10, and are comparable in accuracy to those by BM as (A1) reveals in Appendix A. The coverages are low with a zero coverage for γ11. Finally, the uncertainty associated with the estimator of a cluster-level effect γ01 seems underestimated by ASE smaller than ESE.
MLE on MI generates incompatible MI based on MHLM (8) without consideration of the interaction effect and PKRE, thereby producing all estimates biased in column five that do not seem better than the CC estimates. Fixed effects and covariance are biased downward, and SDs upward. SEs are close to but coverages lower than CC counterparts.
On the contrary, GS without consideration of a PKRE produces estimates in column six nearly as accurate as BM counterparts except the SD of the random slope that is biased upward by 2.69% while EM-AGHQ yields all estimates in the last column as accurate or almost as accurate as BM estimates. Overall, both approaches produce estimates slightly less precise than BM estimates; ASE and ESE appear slightly larger than those by BM to reflect extra uncertainty due to latent covariates and missing values.
Computation
We used 20 abscissas to estimate the joint model by EM-AGHQ. Given data MAR, the estimation converged 450 times taking 101.6 iterations on average and 189 iterations at maximum, but did not converge until and was stopped to produce the estimates at the 300th iteration 50 times (10%). This does not appear to be the weakness of our approach as the convergence issue also occurred to each of BM and CC estimations producing 50 or more warnings of a model failing to converge. The convergence issue seems partly due to high missing rates but few covariates to explain missing values and patterns. In our experience thus far, the convergence rates seem positively associated with more covariates or abscissas. For example, this simulation using 10 abscissas resulted in practically identical estimates, but lowered the convergence rate given data MAR.
Blimp estimates 21 models per simulated dataset: joint model (32) and the HLM given each of 20 imputations. We set 20,000 burn-in and 10,000 post burn-in iterations to estimate the joint model and impute MI, and 5000 burn-in and 5000 post burn-in iterations to estimate the HLM given the MI. These settings are based on preliminary analysis of five simulated datasets that produced the potential scale reduction statistics of all estimates of each model lower than or near 1.1 to imply a reasonable convergence to posterior distributions (Gelman and Rubin Citation1992).
8 Discussion
In this article, we have considered how to estimate a two-level hierarchical linear model (HLM) efficiently where a continuous response and continuous covariates may be MAR and may have interactive, polynomial or randomly varying effects. Nonlinearities of imply a nonstandard joint model where for observed Y and missing . The key idea is to introduce a unique factorization of the joint model involving “provisionally known” random effects (PKREs) u such that the observed joint model is an analytically tractable multivariate normal (MN) theory HLM with respect to a high-dimensional random vector ν. We computed the likelihood numerically with respect to a low dimensional u by means of adaptive Gauss-Hermite quadrature (AGHQ). The HLM involved random effects as predictors, reducing bias due to measurement error. The joint model induced by the HLM is guaranteed to be compatible with the HLM. We suggested general rules for selecting the PKREs in a way that minimized the dimension of AGHQ. Although useful for the HLMs considered in this article, they are yet to be extended to other models, for example, for discrete outcomes. We hope that our work will spur research on estimation via PKREs.
The nonlinearities of multiple covariates, multiple outcomes and/or the presence of partially observed discrete variables will increase the dimension of PKREs and, thus, the expense of numerical integration by AGHQ. In that case, integration via multivariate Laplace approximation may contribute to efficient computation (Pinheiro and Bates Citation1995; Raudenbush, Yang, and Yosef Citation2000).
Further research may address the problem of highly correlated random effects at the cluster level. One strategy would introduce shared random effects to cope with the “curse of dimensionality” by AGHQ as well as the multicollinearity (Miyazaki and Frank Citation2006; Sun et al. Citation2023). In addition, parallel computation of numerical integrals for groups of or single clusters will reduce per-iteration computation time while application of the parameter-extended EM algorithm (Liu, Rubin, and Wu Citation1998) may reduce the number of iterations to converge.
In related research, Rockwood (Citation2020) estimated a multilevel structural equations model by ML, integrating linear random effects conditional on nonlinear random effects analytically and, then, nonlinear effects numerically by Gaussian quadrature. Our analysis encountered both outcome and predictors quite severely missing. To simulate the analysis closely, we simulated a MAR mechanism due to a known auxiliary predictor. We leave the important extension to a MAR (Grund, Lüdtke, and Robitzsch Citation2021) or other mechanism due to the fully observed or missing values of the outcome to near future.
It is possible to extend and automate our program that enables a user to specify an analytic HLM and determines PKREs based on a set of rules given the HLM. To that end, we need to develop a more general set of rules, for example, involving discrete covariates MAR.
Often, MI of a binary predictor MAR under multivariate normality is efficiently analyzed (Schafer Citation1997; Grund, Lüdtke, and Robitzsch Citation2018). The MI will be, however, incompatible with a HLM having the nonlinear effect of the predictor and, thus, unable to always guarantee unbiased estimation of the HLM. We are currently extending our ML approach via the PKRE idea that will ensure compatibility with and, thus, produce unbiased estimation of a HLM having the nonlinear effects of categorical predictors.
Extension of our approach to MI via Bayesian methods may increase the robustness of findings and is straightforward. In particular, our MN joint model given the PKRE u implies estimation of θ by the Gibbs sampler. The sampler will impute from their posteriors compatible with the joint model for a reasonably assumed prior by drawing: (i) and ν from MN ; (ii) u from nonstandard , for example, by importance sampling via Markov chain Monte Carlo integration of that samples u from a normal prior ; and (iii) θ from a standard posterior (Schafer and Yucel Citation2002). A potential virtue of a PKRE is to minimize the dimension of sampling the PKRE from a nonstandard posterior by importance sampling. We find it important to solve the measurement error problem by including level-2 random effects as latent covariates. To that end, we also explain how the Gibbs sampler without consideration of a PKRE (Goldstein et al. Citation2014; Enders, Du, and Keller Citation2020) may be modified to be compatible with our scientific model conditional on a latent covariate ν in Appendix E. A valuable future study is to compare the proposed Gibbs sampler estimators with existing estimators of a more sophisticated HLM, for example, involving multiple nonlinear effects or outcomes.
Supplementary Materials
The supplementary files that contain R and C executable codes, data sets and initial values to reproduce and may be downloaded. See README_supplementarymaterials.txt for instruction.
README_supplementarymaterials.txt
Download Text (5.3 KB)ShinRaudenbushComputation.zip
Download Zip (1.9 MB)Acknowledgments
We thank two anonymous reviewers and an associate editor for their helpful comments, Craig Enders and Brian Keller for providing Blimp simulation codes, and Dongho Shin for helping Blimp simulation in R environment.
Disclosure Statement
The authors report there are no competing interests to declare.
Additional information
Funding
References
- Arnold, B. C., and Press, S. J. (1989), “Compatible Conditional Distributions,” Journal of the American Statistical Association, 84, 152–156. DOI: 10.1080/01621459.1989.10478750.
- Bartlett, J. W., Seaman, S. R., White, I. R., and Carpenter, J. R. (2015), “Multiple Imputation of Covariates by Fully Conditional Specification: Accommodating the Substantive Model,” Statistical Methods in Medical Research, 24, 462–487. DOI: 10.1177/0962280214521348.
- Bates, D., Maechler, M., Bolker, B., and Walker, S. (2015), “Fitting Linear Mixed-Effects Models Using lme4,” Journal of Statistical Software, 67, 1–48. DOI: 10.18637/jss.v067.i01.
- Carlin, B.P. and Louis, T.A. (2009), Bayesian Methods for Data Analysis. 3rd ed. Boca Raton, FL: CRC Press.
- Collins, L. M., Schafer, J. L., and Kam, C. (2003), “A Comparison of Inclusive and Restrictive Strategies in Modern Missing Data Procedures,” Psychological Methods, 6, 330–351. DOI: 10.1037/1082-989X.6.4.330.
- Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum Likelihood From Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, 76, 1–38. DOI: 10.1111/j.2517-6161.1977.tb01600.x.
- Dempster, A. P., Rubin, D. B., and Tsutakawa, R. K. (1981), “Estimation in Covariance Components Models,” Journal of the American Statistical Association, 76, 341–353. DOI: 10.1080/01621459.1981.10477653.
- Enders, C. K., Mistler, S. A., and Keller, B. T. (2016), “Multilevel Multiple Imputation: A Review and Evaluation of Joint Modeling and Chained Equations Imputation,” Psychological Methods, 21, 222–240. DOI: 10.1037/met0000063.
- Enders, C. K., Du, H., and Keller, B. T. (2020), “A Model-based Imputation Procedure for Multilevel Regression Models with Random Coefficients, Interaction Effects, and Nonlinear Terms,” Psychological Methods, 25, 88–112. DOI: 10.1037/met0000228.
- Erler, N. S., Rizopoulos, D., Rosmalen, J., Jaddoe, V. W., Franco, O. H., and Lesaffre, E. M. (2016), “Dealing with Missing Covariates in Epidemiologic Studies: A Comparison between Multiple Imputation and a Full Bayesian Approach,” Statistics in Medicine, 35, 2955–2974. DOI: 10.1002/sim.6944.
- Erler, N. S., Rizopoulos, D., Jaddoe, V. W., Franco, O. H., and Lesaffre, E. M. (2019), “Bayesian Imputation of Time-Varying Covariates in Linear Mixed Models,” Statistical Methods in Medical Research, 28, 555–568. DOI: 10.1177/0962280217730851.
- Firebaugh, G. (1978), “A Rule for Inferring Individual-Level Relationships from Aggregate Data,” American Sociological Review, 43, 557–572. DOI: 10.2307/2094779.
- Gelman, A. and Rubin, D. B. (1992). “Inference from iterative simulation using multiple sequences,” Statistical Science, 7, 457–472. DOI: 10.1214/ss/1177011136.
- Goldstein, H., Carpenter, J., Kenward, M., and Levin, K. (2009), “Multilevel Models with Multivariate Mixed Response Types,” Statistical Modellng, 9, 173–197. DOI: 10.1177/1471082X0800900301.
- Goldstein, H., Carpenter, J. R., and Browne, W. J. (2014), “Fitting Multilevel Multivariate Models with Missing Data in Responses and Covariates that may Include Interactions and Non-linear Terms,” Journal of the Royal Statistical Society, Series A, 177, 553–564. DOI: 10.1111/rssa.12022.
- Grund, S., Lüdtke, O., and Robitzsch, A. (2018), “Multiple Imputation of Missing Data for Multilevel Models: Simulations and Recommendations,” Organizational Research Methods, 21, 111–149. DOI: 10.1177/1094428117703686.
- Grund, S., Lüdtke, O., and Robitzsch, A. (2021), “Multiple Imputation of Missing Data in Multilevel Models with the R Package mdmb: A Flexible Sequential Modeling Approach,” Behavior Research Methods, 53, 2631–2649.
- Hedeker, D., and Gibbons, R. D. (1994), “A Random-Effects Ordinal Regression Model for Multilevel Analysis,” Biometrics, 50, 933–944.
- Ibrahim, J. G., Chen, M. H., and Lipsitz, S. R. (2002), “Bayesian Methods for Generalized Linear Models with Covariates Missing at Random,” Canadian Journal of Statistics/Revue Canadienne De Statistique, 30, 55–78. DOI: 10.2307/3315865.
- Keller, B. T., and Enders, C. K. (2021), Blimp user’s guide (Version 3). Available at www.appliedmissingdata.com/multilevel-imputation.html
- Kim, S., Sugar, C. A., and Belin, T. R. (2015). “Evaluating model-based imputation methods for missing covariates in regression models with interactions,” Statistics in Medicine, 34, 1876–1888. DOI: 10.1002/sim.6435.
- Lee, V. E., and Bryk, A. S. (1989), “A Multilevel Model of the Social Distribution of High School Achievement,” Sociology of Education, 62, 172–192. DOI: 10.2307/2112866.
- Lindley, D. V., and Smith, A. F. M. (1972), “Bayes Estimates for the Linear Model,” Journal of the Royal Statistical Society, Series B, 34, 1–41. DOI: 10.1111/j.2517-6161.1972.tb00885.x.
- Little, R. J. A., and Rubin, D. B. (2002), Statistical Analysis with Missing Data, New York: Wiley.
- Liu, C., Rubin, B. D., and Wu, Y. (1998), “Parameter Expansion to Accelerate EM: The PX-EM Algorithm,” Biometrika, 85, 755–770. DOI: 10.1093/biomet/85.4.755.
- Liu, M., Taylor, J. M. G., and Belin, T. R. (2000), “Multiple Imputation and Posterior Simulation for Multivariate Missing Data in Longitudinal Studies,” Biometrics, 56, 1157–1163. DOI: 10.1111/j.0006-341x.2000.01157.x.
- Liu, J., Gelman, A., Hill, J., Su, Y., and Kropko, J. (2014), “On the Stationary Distribution of Iterative Imputations,” Biometrika, 101, 155–173. DOI: 10.1093/biomet/ast044.
- Miyazaki, Y. and Frank, K. A. (2006). “A hierarchical linear model with factor analysis structure at level 2,” Journal of Educational and Behavioral Statistics, 31, 125–156. DOI: 10.3102/10769986031002125.
- Naylor, J. C., and Smith, A. F. M. (1982), “Applications of a Method for the Efficient Computation of Posterior Distributions,” Applied Statistics, 31, 214–225. DOI: 10.2307/2347995.
- Nomi, T., and Raudenbush, S. W. (2016), “Making a Success of “Algebra for All:” the Impact of Extended Instructional Time and Classroom Peer Skill in Chicago,” Educational Evaluation and Policy Analysis, 38, 431–451. DOI: 10.3102/0162373716643756.
- Olsen, M. K., and Schafer, J. L. (2001), “A Two-Part Random-Effects Model for Semicontinuous Longitudinal Data,” Journal of the American Statistical Association, 96, 730–745. DOI: 10.1198/016214501753168389.
- Piketty, T. (2014), Capital in the 21st Century, Cambridge, MA: Harvard University Press.
- Pinheiro, J. C., and Bates, D. M. (1995), “Approximations to the Log-Likelihood Function in the Nonlinear Mixed-Effects Model,” Journal of Computational and Graphical Statistics, 4, 12–35. DOI: 10.2307/1390625.
- R Core Team (2017), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
- Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2002), “Reliable Estimation of Generalized Linear Mixed Models using Adaptive Quadrature,” The Stata Journal, 2, 1–21. DOI: 10.1177/1536867X0200200101.
- Raghunathan, T., Lepkowski, J., Van Hoewyk, J., and Solenberger, P. (2001), “A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models,” Survey Methodology, 27, 85–95.
- Raudenbush, S. W., and Bryk A. S. (1986), “Hierarchical Model for Studying School Effects,” Sociology of Education, 59, 1–17. DOI: 10.2307/2112482.
- Raudenbush, S. W., Yang, M., Yosef, M. (2000), “Maximum Likelihood for Generalized Linear Models with Nested Random Effects via High-Order, Multivariate Laplace Approximation,” Journal of Computational and Graphical Statistics, 9, 141–157. DOI: 10.2307/1390617.
- Raudenbush, S. W., and Bryk, A. S. (2002), Hierarchical Linear Models, Newbury Park, CA: Sage.
- Ren, C., and Shin, Y. (2016), “Longitudinal Latent Variable Models Given Incompletely Observed Biomarkers and Covariates,” Statistics in Medicine, 35, 4729–4745. DOI: 10.1002/sim.7022.
- Rivera-Batiz, F. L. (1982), “International Migration, Non-traded Goods and Economic Welfare in the Source Country,” Journal of Development Economics, 11, 81–90. DOI: 10.1016/0304-3878(82)90043-8.
- Rockwood, N. J. (2020), “Maximum Likelihood Estimation of Multilevel Structural Equation Models with Random Slopes for Latent Covariates,” Psychometrika, 85, 275–300 DOI: 10.1007/s11336-020-09702-9.
- Rubin, D. B. (1976), “Inference and Missing Data,” Biometrika, 63, 581–592. DOI: 10.1093/biomet/63.3.581.
- Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, New York: Wiley.
- Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman & Hall.
- Schafer, J. L., and Yucel, R. M. (2002), “Computational Strategies for Multivariate Linear Mixed-Effects Models with Missing Values,” Journal of Computational and Graphical Statistics, 11, 437–457. DOI: 10.1198/106186002760180608.
- Shin, Y., and Raudenbush, S. W. (2007), “Just-Identified Versus Over-Identified Two-Level Hierarchical Linear Models with Missing Data,” Biometrics, 63, 1262–1268. DOI: 10.1111/j.1541-0420.2007.00818.x.
- Shin, Y., and Raudenbush, S. W. (2010), “A Latent Cluster Mean Approach to The Contextual Effects Model with Missing Data,” Journal of Educational and Behavioral Statistics, 35, 26–53.
- Shin, Y. and Raudenbush, SW. (2013), “Efficient Analysis of Q-Level Nested Hierarchical General Linear Models Given Ignorable Missing Data,” The International Journal of Biostatistics, 9(1), 109–133. DOI: 10.1515/ijb-2012-0048.
- Stram, D. O., and Lee, J. (1994), “Variance Components Testing in the Longitudinal Mixed Effects Model,” Biometrics, 50, 1171–1177. DOI: 10.2307/2533455.
- Sun, X., Shin, Y., Lafata, J. E., and Raudenbush, S. W. (2023), “Variability in Causal Effects and Noncompliance in a Multisite Trial: Estimation of a Bivariate Hierarchical Generalized Linear Model,” submitted.
- Tourangeau, K., Nord, C.L.ê, T., Sorongon, A. G., and Najarian, M. (2009). Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K), Combined User’s Manual for the ECLS-K Eighth-Grade and K-8 Full Sample Data Files and Electronic Codebooks (NCES 2009-004). National Center for Education Statistics, Institute of Education Sciences, US Department of Education. Washington, D.C.
- van Buuren, S., Brand, J., Groothuis-Oudshoorn, C., and Rubin, D. (2006), “Fully Conditional Specification in Multivariate Imputation,” Journal of Statistical Computation and Simulation, 76, 1049–1064. DOI: 10.1080/10629360600810434.
- Willms, J. D. (1986), “Social Class Segregation and its Relationship to Pupils’ Examination Results in Scotland,” American Sociological Review, 51, 224–241. DOI: 10.2307/2095518.
Appendix A
Problem of Bias in Estimating Model (19)
Let and fully observed, δ = 0 to simplify notation, and for k = 0, 1 in model (19). Because and are independent of , and , implying a mixed model for , . Let be the reliability of as an error-prone measure of νj (Raudenbush and Bryk Citation2002). The implied has (A1) (A1)
The bias terms are complicated functions of cluster sizes nj and parameters, but revealing in the balanced case of nj = n where . The interaction effect of has a downward bias term that introduces bias and in estimation of and , respectively. Likewise, the main effect of has a bias term which propagates bias and in estimation of and , respectively. Estimation of results in an additional upward bias term from the error-prone measure of νj. Consequently, this approach results in biased estimation of . In particular, the estimate of γ11 is biased downward, but those of and upward.
Two special cases are of interest. When and become unbiased, and the estimator of becomes less biased. As cluster sizes by the laws of large numbers, and all bias terms tend to zero.
Appendix B
The E Step for estimation of model (14)
Let -by-1, -by-1, and . A reasonably general CD model given known covariates at level 1 and at level 2 is (B1) (B1) for and . Denote to separate all other random effects from at level 2, and let and . Define matrix that selects observed values in such that , and for . The likelihood has a key component (B2) (B2) for and where and .
Define and . We have multivariate normal for
Let and for and . The expected CD MLEs are given θ where and for .
Appendix C
Numerical Integration by AGHQ
Let be a function of . Given , Q-point weights and abscissas , (C1) (C1) for . The is approximately for large cluster sizes nj by the Bayesian central limit theorem such that produces well approximated by a low degree polynomial. The approximation is exact if is a degree polynomial in (Pinheiro and Bates Citation1995; Rabe-Hesketh, Skrondal, and Pickles Citation2002; Carlin and Louis Citation2009). Likewise, for closed-form, (C2) (C2)
Let , and and be vectors of distinct elements of τ and , respectively. The log-likelihood and score have summands
for from (B1). Let and . The stacks for and . We compute Sj also by AGHQ for (Hedeker and Gibbons Citation1994; Raudenbush, Yang, and Yosef Citation2000; Olsen and Schafer Citation2001). Section 7 shows good approximation for the sample sizes analyzed in this article.
Appendix D
Translating model (B1) to
Define and for . With marginalized out, for and . Let to find (D1) (D1) for and .
Within cluster j given , denote . Marginalizing out using and , we find in for where and . Consequently, for (D2) (D2)
As explained by Section 4.1, EquationEquations (D1)–(D2) result in a mixed model (19) for (D3) (D3)
Estimating from sample, we find above and compute by the delta method. Translations (D3) simplify if (e.g., ). In Section 6, we used the translations for and .
Appendix E
Compatible Gibbs Sampler without a PKRE
Without the merit of a PKRE, a Bayesian joint distribution based on (19) is for a prior and . Our scientific model is an analytic integral . The Gibbs sampler of Enders, Du, and Keller (Citation2020) may be modified to be compatible with our scientific model conditional on a latent covariate νj by sampling (i) νj from a compatible posterior for the denominator approximated by the MCMC integration, and (ii) a missing from a compatible normal posterior with by (15) for and .