
Verifiable identification condition for nonignorable nonresponse data with categorical instrumental variables

Pages 40-50 | Received 01 Apr 2023, Accepted 25 Dec 2023, Published online: 04 Jan 2024

Abstract

We consider a model identification problem in which an outcome variable contains nonignorable missing values. Statistical inference requires a guarantee of the model identifiability to obtain estimators enjoying theoretically reasonable properties such as consistency and asymptotic normality. Recently, instrumental or shadow variables, combined with the completeness condition in the outcome model, have been highlighted to make a model identifiable. In this paper, we elucidate the relationship between the completeness condition and model identifiability when the instrumental variable is categorical. We first show that when both the outcome and instrumental variables are categorical, the two conditions are equivalent. However, when one of the outcome and instrumental variables is continuous, the completeness condition may not necessarily hold, even for simple models. Consequently, we provide a sufficient condition that guarantees the identifiability of models exhibiting a monotone-likelihood property, a condition particularly useful in instances where establishing the completeness condition poses significant challenges. Using observed data, we demonstrate that the proposed conditions are easy to check for many practical models and outline their usefulness in numerical experiments and real data analysis.

1. Introduction

There has been a rapidly growing movement to utilize all available data that may explicitly, or even implicitly, contain missing values, in areas such as causal inference (Imbens & Rubin, 2015) and data integration (Hu et al., 2022; Yang & Kim, 2020). For such datasets, appropriate analysis of missing data is indispensable for correcting the selection bias caused by the missingness. In recent years, the analysis of missing data under the missing at random (MAR) assumption (Little & Rubin, 2019) has gradually matured (Kim & Shao, 2021; Robins et al., 1994). Although model identifiability is one of the most fundamental conditions in constructing asymptotic theory, removing the MAR assumption makes statistical inference drastically more difficult, especially with respect to model identification (Miao et al., 2016). Estimation with unidentifiable models may yield multiple solutions that fit the observed data equally well. Several researchers have therefore proposed sufficient conditions for model identification under missing not at random (MNAR).

The observed likelihood is constructed from two distributions: (R) the response mechanism and (O) the outcome distribution (Kim & Shao, 2021). Miao et al. (2016) considered identification conditions with Logistic, Probit, and Robit (cumulative distribution function of the t-distribution) models for (R) and normal and t (mixture) distributions for (O). Cui et al. (2017) assumed Logistic, Probit, and cLog-log models for (R) and generalized linear models for (O). These studies depend heavily on the model specification of both (R) and (O). Wang et al. (2014) introduced a covariate called an instrument or shadow variable and demonstrated that the use of the instrument could considerably relax the conditions on (R) and (O). For example, (O) requires only the monotone-likelihood property, which covers a variety of models, such as the generalized linear model. Tang et al. (2003) and Miao and Tchetgen (2018) derived conditions for model identifiability without postulating any assumptions on (R) with the help of the instrument. Miao et al. (2019) further relaxed the assumptions by imposing the completeness condition on (O) (D'Haultfœuille, 2010, 2011). For example, the generalized linear model with continuous covariates satisfies the completeness condition. To the best of our knowledge, this combination of an instrument on (R) and completeness on (O) is the most general condition for model identification and has been accepted in numerous studies (Yang et al., 2019; J. Zhao & Ma, 2022).

Generally, assumptions on (O) concern the distribution of the complete data and are therefore untestable from the observed data. Recently, modelling (O'), the observed or respondents' outcome model, instead of (O) has been used to relax this subjective assumption (Miao et al., 2019; Riddles et al., 2016). However, the observed likelihood with (R) and (O') involves an integration that makes the identification problem intractable. Morikawa and Kim (2021) and Beppu et al. (2021) established that the integration can be computed explicitly with Logistic models for (R) and generalized linear models for (O'), and derived identification conditions. For general response mechanisms and respondents' outcome distributions, model identification remains an open question. Furthermore, when the instrument is categorical, such as smoking history or sex, the completeness condition may not hold. For example, Ibrahim et al. (2001) considered a study on the mental health of children in Connecticut and used the parents' report of the psychopathology of the child as a binary instrument.

In this paper, we consider an identification problem with an instrument for (R) and a respondents' outcome model (O') that satisfies the monotone-likelihood ratio property. Note that although our model setup is similar to that of Wang et al. (2014), we can check the validity of (O') with observed data, for example, by using information criteria such as AIC and BIC. Furthermore, we can use semiparametric or nonparametric methods for modelling both (O') and (R).

The rest of this paper is organized as follows. Section 2 introduces the notation and defines model identifiability. Section 3 derives the proposed identification condition. We demonstrate the effects of identifiability via a limited numerical study in Section 4. Moreover, application to real data is presented in Section 5. Finally, concluding remarks are summarized in Section 6. All the technical proofs are relegated to the Appendix.

2. Basic setup

2.1. Observed likelihood

Let $\{x_i, y_i, \delta_i\}_{i=1}^{n}$ be independent and identically distributed samples from a distribution of $(x, y, \delta)$, where $x$ is a fully observed covariate vector, $y$ is an outcome variable subject to missingness, and $\delta$ is a response indicator of $y$ that equals 1 (0) if $y$ is observed (missing). We use the generic notation $p(\cdot)$ and $p(\cdot \mid \cdot)$ for the marginal density and conditional density, respectively. For example, $p(x)$ is the marginal density of $x$, and $p(y \mid x)$ is the conditional density of $y$ given $x$. We model the MNAR response mechanism $P(\delta = 1 \mid x, y)$ and consider its identification. The observed likelihood is defined as

$$\prod_{i:\delta_i=1} P(\delta_i = 1 \mid y_i, x_i)\, p(y_i \mid x_i) \prod_{i:\delta_i=0} \int \{1 - P(\delta_i = 1 \mid y, x_i)\}\, p(y \mid x_i)\, dy. \tag{1}$$

We say that this model is identifiable if the parameters in (1) are identified, which is equivalent to the parameters in $P(\delta = 1 \mid y, x)\, p(y \mid x)$ being identified. This identification condition is essential even for semiparametric models such as an estimator defined by moment conditions (Morikawa & Kim, 2021). However, even simple models can easily be unidentifiable. For example, Example 1 in Wang et al. (2014) presented an unidentifiable model in which the outcome model is normal and the response mechanism is a Logistic model.

There is an alternative way to express the relationship between $y$ and $x$. A disadvantage of modelling $p(y \mid x)$ is that it places a subjective assumption on the distribution of the complete data, not of the observed data. In other words, if we made assumptions about $p(y \mid x)$ and ensured its identifiability, we could not verify the assumptions using the observed data. By contrast, this issue can be overcome by modelling $p(y \mid x, \delta = 1)$, because $p(y \mid x, \delta = 1)$ is the outcome model for the observed data, and we can check its validity using ordinary information criteria such as AIC and BIC. Therefore, we model $p(y \mid x, \delta = 1)$ and consider the identification condition in Section 3. Hereafter, we assume two parametric models $p(y \mid x, \delta = 1; \gamma)$ and $P(\delta = 1 \mid x, y; \phi)$, where $\gamma$ and $\phi$ are the parameters of the outcome and response models, respectively. Although our method requires two parametric models, the class of identifiable models is very large. For example, it can include semiparametric outcome models for $p(y \mid x, \delta = 1; \gamma)$ and general response models $P(\delta = 1 \mid x, y; \phi)$ other than Logistic models, as discussed in Example 3.7.

2.2. Estimation

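We present a procedure for parameter estimation based on the parametric models $p(y \mid x, \delta = 1; \gamma)$ and $P(\delta = 1 \mid x, y; \phi)$. Let $\hat{\gamma}$ be the maximum likelihood estimator of $\gamma$. The observed likelihood (1) yields the mean score equation for $\phi$ (Kim & Shao, 2021):

$$\sum_{i=1}^{n} \left[ \delta_i \frac{\partial \log \pi(x_i, y_i; \phi)}{\partial \phi} - (1 - \delta_i) \frac{\int \{\partial \pi(x_i, y; \phi)/\partial \phi\}\, p(y \mid x_i)\, dy}{\int \{1 - \pi(x_i, y; \phi)\}\, p(y \mid x_i)\, dy} \right] = 0,$$

where $\pi(x, y; \phi) = P(\delta = 1 \mid x, y; \phi)$. By using Bayes' formula $p(y \mid x) \propto p(y \mid x, \delta = 1)/\pi(x, y; \phi)$, the mean score equation can be written as

$$\sum_{i=1}^{n} \{\delta_i s_1(x_i, y_i; \phi) + (1 - \delta_i) s_0(x_i; \phi)\} = 0,$$

where

$$s_1(x, y; \phi) = \frac{\partial \log \pi(x, y; \phi)}{\partial \phi}, \qquad s_0(x; \phi) = -\frac{\int s_1(x, y; \phi)\, p(y \mid x, \delta = 1)\, dy}{\int \{1/\pi(x, y; \phi) - 1\}\, p(y \mid x, \delta = 1)\, dy}.$$

To compute the two integrals in $s_0(\cdot)$, we can use fractional imputation (Kim, 2011). As described in Riddles et al. (2016), the EM algorithm is also applicable.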
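As an illustration of the quantity $s_0(x; \phi)$, the following minimal sketch approximates its two integrals with a plain trapezoidal rule instead of fractional imputation. It assumes a Logistic response model and a normal respondents' outcome model with a scalar covariate; the function names, the parameter values, and the grid settings are hypothetical choices made for this example only.

```python
import math

def expit(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def s0_logistic_normal(x, phi, gamma, half_width=10.0, n_grid=4001):
    """Trapezoidal approximation of s0(x; phi) and of its denominator.

    Assumed models (illustration only):
      pi(x, y; phi) = expit(a0 + a1 * x + b * y),  phi = (a0, a1, b)
      y | x, delta = 1 ~ N(g0 + g1 * x, sd^2),     gamma = (g0, g1, sd)
    """
    a0, a1, b = phi
    g0, g1, sd = gamma
    mean = g0 + g1 * x
    lo = mean - half_width * sd
    step = 2.0 * half_width * sd / (n_grid - 1)
    num = [0.0, 0.0, 0.0]
    den = 0.0
    for k in range(n_grid):
        y = lo + k * step
        w = step * (0.5 if k in (0, n_grid - 1) else 1.0)  # trapezoid weights
        dens = normal_pdf(y, mean, sd)
        pi = expit(a0 + a1 * x + b * y)
        # For the logistic model, s1 = d log(pi) / d phi = (1 - pi) * (1, x, y).
        for j, d in enumerate((1.0, x, y)):
            num[j] += w * (1.0 - pi) * d * dens
        den += w * (1.0 / pi - 1.0) * dens
    return [-v / den for v in num], den

s0, den = s0_logistic_normal(x=0.4, phi=(0.5, 0.2, 0.3), gamma=(0.0, 1.0, 1.0))
print(s0, den)
```

For this particular pair of models the denominator has the closed form $\exp(-\alpha_0 - \alpha_1 x - \beta\mu + \beta^2\sigma^2/2)$ with $\mu = \gamma_0 + \gamma_1 x$, which gives a convenient check on the numerical integration.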

3. Identifiability

3.1. Definition of identification

Recall that the identification condition in (1) concerns the parameters in $P(\delta = 1 \mid y, x)\, p(y \mid x)$. As seen in Section 2.2, the conditional density $p(y \mid x)$ is represented by $p(y \mid x, \delta = 1; \gamma)$ and $P(\delta = 1 \mid x, y; \phi)$ through Bayes' formula. Thus, using the formula, identification with these models reduces to identification of the parameters in $\varphi(y, x; \phi, \gamma)$, where

$$\varphi(y, x; \phi, \gamma) = \frac{p(y \mid x, \delta = 1; \gamma)}{\int p(y \mid x, \delta = 1; \gamma)/\pi(x, y; \phi)\, dy}. \tag{2}$$

Strictly speaking, the identification condition is that $\varphi(y, x; \phi, \gamma) = \varphi(y, x; \phi', \gamma')$ with probability 1 implies $(\phi, \gamma) = (\phi', \gamma')$. Generally, the integral in the denominator of (2) does not have a closed form, which makes deriving a sufficient condition for identifiability quite challenging. Morikawa and Kim (2021) showed that the integral has a closed form for a combination of Logistic response models and normal outcome models, and derived a sufficient condition for model identifiability. Beppu et al. (2021) extended the result to the case where the outcome model belongs to the exponential family while the response model is still a Logistic model. However, when the response mechanism is general, even simple outcome models such as the normal distribution can be unidentifiable.
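The reduction to $\varphi$ can be written out step by step. The following short derivation uses only Bayes' formula and the notation above; it is included as a reading aid rather than as part of the original argument.

```latex
\begin{align*}
p(y \mid x)
  &= \frac{p(y \mid x, \delta = 1; \gamma)\, P(\delta = 1 \mid x)}{\pi(x, y; \phi)}, \\
1 = \int p(y \mid x)\, dy
  \;&\Longrightarrow\;
  P(\delta = 1 \mid x)
  = \left\{ \int \frac{p(y \mid x, \delta = 1; \gamma)}{\pi(x, y; \phi)}\, dy \right\}^{-1}, \\
\pi(x, y; \phi)\, p(y \mid x)
  &= p(y \mid x, \delta = 1; \gamma)\, P(\delta = 1 \mid x)
   = \frac{p(y \mid x, \delta = 1; \gamma)}
          {\int p(y \mid x, \delta = 1; \gamma) / \pi(x, y; \phi)\, dy}
   = \varphi(y, x; \phi, \gamma).
\end{align*}
```

The nonrespondent factor of the observed likelihood is then $1 - \int \varphi(y, x; \phi, \gamma)\, dy$, so the whole likelihood depends on the parameters only through $\varphi$.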

Example 3.1

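Suppose that the respondents' outcome model is $y \mid (\delta = 1, x) \sim N(\gamma_0 + \gamma_1 x, 1)$, and the response model is $P(\delta = 1 \mid x, y) = \Psi(\alpha_0 + \alpha_1 x + \beta y)$, where $\Psi$ is a known distribution function such that the integral in (2) exists; then, this model is unidentifiable. For example, the two different parametrizations $(\alpha_0, \alpha_1, \beta, \gamma_0, \gamma_1) = (0, 1, 1, 0, 1)$ and $(\alpha_0, \alpha_1, \beta, \gamma_0, \gamma_1) = (0, 3, -1, 0, 1)$ yield the same value of the observed likelihood: writing $y = x + e$ with $e \sim N(0, 1)$, the linear predictors are $2x + e$ and $2x - e$, which have the same distribution because $e$ is symmetric about zero.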
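The non-identifiability in Example 3.1 can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: it evaluates the denominator of $\varphi$ by a trapezoidal rule for the parametrizations $(0, 1, 1, 0, 1)$ and $(0, 3, -1, 0, 1)$, using the Logistic and Cauchy distribution functions as two example choices of $\Psi$. Because $\gamma$ is the same in both parametrizations, equal denominators imply equal $\varphi$. All function names and grid settings are hypothetical.

```python
import math

def normal_pdf(y, mean):
    """Density of N(mean, 1) at y."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def logistic_cdf(t):
    return 1.0 / (1.0 + math.exp(-t))

def cauchy_cdf(t):
    return 0.5 + math.atan(t) / math.pi

def denominator(x, params, link, half_width=10.0, n_grid=4001):
    """Trapezoidal approximation of int p(y | x, delta=1) / Psi(a0 + a1*x + b*y) dy."""
    a0, a1, b, g0, g1 = params
    mean = g0 + g1 * x
    step = 2.0 * half_width / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        y = mean - half_width + k * step
        w = step * (0.5 if k in (0, n_grid - 1) else 1.0)  # trapezoid weights
        total += w * normal_pdf(y, mean) / link(a0 + a1 * x + b * y)
    return total

theta_a = (0.0, 1.0, 1.0, 0.0, 1.0)   # (alpha0, alpha1, beta, gamma0, gamma1)
theta_b = (0.0, 3.0, -1.0, 0.0, 1.0)

for link in (logistic_cdf, cauchy_cdf):
    for x in (-1.0, 0.0, 0.5, 2.0):
        print(link.__name__, x, denominator(x, theta_a, link), denominator(x, theta_b, link))
```

For the Logistic choice the denominator equals $1 + \exp(-2x + 1/2)$ under both parametrizations, which can be used to sanity-check the quadrature.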

Recently, widely applicable sufficient conditions have been proposed. Assume that a covariate x has two components, x=(u,z), such that

(C1)

$z \perp \delta \mid (u, y)$ and $z \not\perp y \mid (\delta = 1, u)$.

The covariate $z$ is called an instrument (D'Haultfœuille, 2010) or a shadow variable (Miao & Tchetgen Tchetgen, 2016). Miao et al. (2019) derived sufficient conditions for model identifiability by combining the instrument and the completeness condition.

(C2)

For all square-integrable functions $h(u, y)$, $E[h(u, y) \mid \delta = 1, u, z] = 0$ almost surely implies $h(u, y) = 0$ almost surely.

Lemma 3.2

Identification condition by Miao et al. (2019)

Under the conditions (C1) and (C2), the joint distribution p(y,u,z,δ) is identifiable.

Although the completeness condition is useful and applicable to general models, even a simple model with a categorical instrument can fail to satisfy it.

Example 3.3

Violating completeness with categorical instrument

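Suppose $y \mid (\delta = 1, u, z)$ follows the normal distribution $N(u + z, 1)$, and the instrument $z$ is binary, taking the value 0 or 1. This distribution does not satisfy the completeness condition because the conditional expectation $E[h(u, y) \mid \delta = 1, u, z] = 0$ when $h(u, y) = 1 + (y - u) - (y - u)^2$. Indeed, the conditional expectation equals $1 + z - (1 + z^2) = z(1 - z)$, which vanishes for $z \in \{0, 1\}$.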
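The failure of completeness in Example 3.3 can also be confirmed numerically. The following minimal sketch approximates $E[h(u, y) \mid \delta = 1, u, z]$ by a trapezoidal rule with $h(u, y) = 1 + (y - u) - (y - u)^2$; the function names, the value of $u$, and the grid settings are assumptions made for this illustration. The value $z = 2$ is included only to show that the expectation is not identically zero outside the support $\{0, 1\}$ of the binary instrument.

```python
import math

def normal_pdf(y, mean):
    """Density of N(mean, 1) at y."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def h(u, y):
    """Nonzero function annihilated by the conditional expectation in Example 3.3."""
    return 1.0 + (y - u) - (y - u) ** 2

def conditional_mean_h(u, z, half_width=12.0, n_grid=6001):
    """Trapezoidal approximation of E[h(u, y) | delta=1, u, z] with y ~ N(u + z, 1)."""
    mean = u + z
    step = 2.0 * half_width / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        y = mean - half_width + k * step
        w = step * (0.5 if k in (0, n_grid - 1) else 1.0)  # trapezoid weights
        total += w * h(u, y) * normal_pdf(y, mean)
    return total

for z in (0, 1, 2):
    # Analytically the expectation equals z * (1 - z).
    print(z, conditional_mean_h(u=0.7, z=z))
```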

A vital implication of Example 3.3 is that the existence of an instrument no longer guarantees model identification through the completeness condition when the instrument is categorical. Developing identification conditions for models with discrete instruments is important in applications (Ibrahim et al., 2001). We separately discuss two cases: (i) both $y$ and $z$ are categorical; (ii) the respondents' outcome model has the monotone-likelihood ratio property.

When both $y$ and $z$ are categorical, the model can be fully nonparametric. Theorem 3.4 demonstrates that, in this case, the completeness and identifiability conditions are equivalent. See Appendix 2 in Riddles et al. (2016) for the estimation of such fully nonparametric models.

Theorem 3.4

When both y and z are categorical, under condition (C1), the joint distribution p(y,u,z,δ) is identifiable if and only if condition (C2) holds.

As evidenced in Lemma 3.2, condition (C2) is generally sufficient for model identifiability, but Theorem 3.4 also reveals that it is necessary when y and z are categorical.
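In the categorical case, condition (C2) says that the only vector $(h_1, \ldots, h_p)$ with $\sum_y h_y\, p(y \mid \delta = 1, z) = 0$ for every level of $z$ is the zero vector. The sketch below illustrates why a binary instrument cannot achieve this for a three-category outcome: the $2 \times 3$ matrix of respondent probabilities always has a nonzero null vector, given here by the cross product of its two rows. The probabilities are hypothetical values chosen for the example.

```python
def cross(r0, r1):
    """Cross product of two vectors in R^3; it is orthogonal to both inputs."""
    return (
        r0[1] * r1[2] - r0[2] * r1[1],
        r0[2] * r1[0] - r0[0] * r1[2],
        r0[0] * r1[1] - r0[1] * r1[0],
    )

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical respondent distributions p(y | delta=1, z) for y in {1, 2, 3}.
p_given_z0 = (0.2, 0.3, 0.5)
p_given_z1 = (0.5, 0.3, 0.2)

# A nonzero h with sum_y h_y p(y | delta=1, z) = 0 for both z: completeness fails.
h_vec = cross(p_given_z0, p_given_z1)
print(h_vec, dot(h_vec, p_given_z0), dot(h_vec, p_given_z1))
```

Because both rows have positive entries, any such null vector must contain entries of both signs, which is the property exploited in the proof of Theorem 3.4.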

Next, we consider the identification condition for the other case (ii). Let Sy be the support of the random variable y. We assume the following four conditions.

(C3)

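The response mechanism is

$$P(\delta = 1 \mid y, x; \phi) = P(\delta = 1 \mid y, u; \phi) = \Psi\{h(u; \alpha) + g(u; \beta)\, m(y)\}, \tag{3}$$

where $\phi = (\alpha, \beta)$, $m: S_y \to \mathbb{R}$ and $\Psi: \mathbb{R} \to (0, 1]$ are known continuous strictly monotone functions, and $h(u; \alpha)$ and $g(u; \beta)$ are known injective functions of $\alpha$ and $\beta$, respectively.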

(C4)

The density or mass function $p(y \mid x, \delta = 1; \gamma)$ is identifiable, and its support does not depend on $x$.

(C5)

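For all $u \in S_u$, there exist $z_1$ and $z_2$ such that $p(y \mid u, z_1, \delta = 1) \neq p(y \mid u, z_2, \delta = 1)$, and $p(y \mid u, z_1, \delta = 1)/p(y \mid u, z_2, \delta = 1)$ is monotone as a function of $y$.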

(C6)

$$\int \frac{p(y \mid x, \delta = 1; \gamma)}{\Psi\{h(u; \alpha) + g(u; \beta)\, m(y)\}}\, dy < \infty \quad \text{a.s.}$$

Condition (C3) means that the random variable $z$ plays the role of an instrument, because the response mechanism does not depend on $z$. Condition (C4) is the identifiability of $p(y \mid x, \delta = 1; \gamma)$, which is testable from the observed data. Condition (C5) assumes a monotone-likelihood property on the outcome model, which was also used in Wang et al. (2014) for the complete data. Condition (C6) is necessary for (1) to be well-defined. It is essentially the same condition as (I1) in Theorem 3.1 of Morikawa and Kim (2021). This condition always holds when the support of $y$ is finite. However, it must be carefully verified when $y$ is continuous. See Proposition 3.8 below for a useful sufficient condition when the respondents' outcome model is a normal distribution.

Under conditions (C3)–(C6), we obtain the desired identification condition.

Theorem 3.5

The parameter (ϕ,γ) is identifiable if the conditions (C1) and (C3)–(C6) hold.

We provide an example of outcome models satisfying the condition (C5).

Example 3.6

Model satisfying (C5)

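Let the density functions in the exponential family be

$$p(y \mid x, \delta = 1; \gamma) = \exp\left\{\frac{y\theta - b(\theta)}{\tau} + c(y; \tau)\right\},$$

where $\theta = \theta(\eta)$, $\eta = \sum_{l=1}^{L} \eta_l(x)\kappa_l$, $\kappa = (\kappa_1, \ldots, \kappa_L)$, and $\gamma = (\tau, \kappa)$. Then the density ratio becomes

$$\frac{p(y \mid u, z_1, \delta = 1)}{p(y \mid u, z_2, \delta = 1)} \propto \exp\left(\frac{\theta_1 - \theta_2}{\tau}\, y\right),$$

where $x_i = (u, z_i)$ and $\theta_i = \theta\{\sum_{l=1}^{L} \eta_l(x_i)\kappa_l\}$, $i = 1, 2$. Therefore, the density ratio is monotone in $y$.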
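Condition (C5) can be inspected on a grid of outcome values. The sketch below is a finite-grid diagnostic under assumed, hypothetical models rather than a proof: it compares a normal respondents' model, a member of the exponential family with a log density ratio linear in $y$, against a two-component normal mixture, for which the ratio is not monotone.

```python
import math

def normal_pdf(y, mean, sd=1.0):
    """Density of N(mean, sd^2) at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def is_monotone(values, tol=1e-12):
    """True if the sequence is nondecreasing or nonincreasing up to tol."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= -tol for d in diffs) or all(d <= tol for d in diffs)

grid = [-4.0 + 0.1 * k for k in range(81)]  # y values from -4 to 4

# Normal respondents' model with mean u + z, evaluated at u = 0.3, z1 = 1, z2 = 0.
log_ratio_normal = [
    math.log(normal_pdf(y, 1.3)) - math.log(normal_pdf(y, 0.3)) for y in grid
]

def mixture_pdf(y):
    """Hypothetical two-component normal mixture for the second instrument level."""
    return 0.5 * normal_pdf(y, -2.0) + 0.5 * normal_pdf(y, 2.0)

log_ratio_mixture = [
    math.log(normal_pdf(y, 0.0)) - math.log(mixture_pdf(y)) for y in grid
]

print(is_monotone(log_ratio_normal), is_monotone(log_ratio_mixture))
```

In the normal case the slope of the log ratio is $(\theta_1 - \theta_2)/\tau = 1$ for these values, while the mixture ratio is proportional to $1/\cosh(2y)$, which rises and then falls.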

Example 3.7

Model satisfying (C6)

In application, it is often reasonable to assume a normal distribution on the respondents' outcome model. Focusing on the tail of the outcome model, we provide a sufficient condition to check (C6) for models with general response mechanisms.

Proposition 3.8

Suppose that the observed distribution $p(y \mid x, \delta = 1)$ is the normal distribution $N(\mu(x; \kappa), \sigma^2)$, the response mechanism is (3) with $m(y) = y$ and $g(u; \beta) = \beta$, and the strictly monotone increasing function $\Psi$ meets the following condition:

$$\exists\, s \in (0, 2) \quad \text{s.t.} \quad \liminf_{z \to -\infty} \Psi(z) \exp(|z|^s) > 0. \tag{4}$$

Then, this model satisfies (C6).

Condition (4) is easy to check. For example, it holds for the Logistic and Robit functions but not for the Probit function. According to Proposition 3.8, it is possible to estimate $\mu(x; \kappa)$ with observed data using splines and other nonparametric methods, which allows us to use very flexible models. Furthermore, we can also estimate the response mechanism using nonparametric methods because the proposition does not impose any restrictions on the functional form of $h(u; \alpha)$.
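The tail behaviour behind condition (4) can be explored numerically. The sketch below evaluates $\log\{\Psi(z)\exp(|z|^s)\} = \log\Psi(z) + |z|^s$ at a few increasingly negative $z$ for $s = 1.5$. It is a finite-grid diagnostic, not a proof of the limit, and the grid, the value of $s$, and the function names are assumptions for this illustration; the Cauchy distribution function stands in for a heavy-tailed link.

```python
import math

def log_logistic_cdf(z):
    """log of 1 / (1 + exp(-z)), written in a form that is stable for negative z."""
    return z - math.log1p(math.exp(z))

def log_cauchy_cdf(z):
    return math.log(0.5 + math.atan(z) / math.pi)

def log_probit_cdf(z):
    """log of the standard normal distribution function via erfc."""
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))

def tail_criterion(log_cdf, z, s):
    """log of Psi(z) * exp(|z|^s); it must stay bounded below as z -> -infinity."""
    return log_cdf(z) + abs(z) ** s

z_grid = (-5.0, -10.0, -20.0, -30.0)
s = 1.5
for name, log_cdf in (("logistic", log_logistic_cdf),
                      ("cauchy", log_cauchy_cdf),
                      ("probit", log_probit_cdf)):
    print(name, [round(tail_criterion(log_cdf, z, s), 2) for z in z_grid])
```

For the Logistic and Cauchy links the criterion stays positive and grows, consistent with (4), whereas for the Probit link it decreases without apparent bound because the normal tail decays like $\exp(-z^2/2)$.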

4. Numerical experiment

We present the effects of identifiability in numerical experiments by comparing weakly and strongly identifiable models. We prepared four Scenarios S1–S4:

S1:

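(Outcome: Normal, Response: Logistic) $[y \mid u, z, \delta = 1] \sim N(\kappa_0 + \kappa_1 u + \kappa_2 z, \sigma^2)$, $\mathrm{logit}\{P(\delta = 1 \mid u, y; \alpha, \beta)\} = \alpha_0 + \alpha_1 u + \beta y$, $u \sim N(0, 1^2)$, and $z \sim B(1, 0.5)$, where $(\kappa_0, \kappa_1, \sigma^2) = (0.3, 0.4, 1/2^2)$ and $(\alpha_0, \alpha_1, \beta) = (0.7, 0.2, 0.29)$.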

S2:

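(Outcome: Normal, Response: Cauchy) $[y \mid u, z, \delta = 1] \sim N(\kappa_0 + \kappa_1 u + \kappa_2 z, \sigma^2)$, $P(\delta = 1 \mid u, y; \alpha, \beta) = \Psi(\alpha_0 + \alpha_1 u + \beta y)$, $u \sim \mathrm{Unif}(-1, 1)$, and $z \sim B(1, 0.7)$, where $(\kappa_0, \kappa_1, \sigma^2) = (0.36, 0.59, 1/2^2)$, $(\alpha_0, \alpha_1, \beta) = (0.24, 0.1, 0.42)$, and $\Psi$ is the cumulative distribution function of the Cauchy distribution.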

S3:

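(Outcome: Bernoulli, Response: Probit) $[y \mid u, z, \delta = 1] \sim B(1, p(u, z; \kappa))$, $P(\delta = 1 \mid u, y; \alpha, \beta) = \Psi(\alpha_0 + \alpha_1 u + \beta y)$, $u \sim N(0, 1^2)$, and $z \sim N(0, 1^2)$, where $p(u, z; \kappa) = 1/\{1 + \exp(-\kappa_0 - \kappa_1 u - \kappa_2 z)\}$, $(\kappa_0, \kappa_1, \kappa_2) = (0.21, 3.8, 1.0)$, $(\alpha_0, \alpha_1, \beta) = (0.4, 0.39, 0.3)$, and $\Psi$ is the cumulative distribution function of the standard normal distribution.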

S4:

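(Outcome: Normal + nonlinear mean structure, Response: Cauchy or Logistic) $[y \mid u, z, \delta = 1] \sim N(\mu(x), 0.5^2)$, $P(\delta = 1 \mid u, y; \alpha, \beta) = \Psi(\alpha_0 + \alpha_1 u + \beta y)$, $u \sim \mathrm{Unif}(-1, 1)$, and $z \sim B(1, 0.5)$, where $\mu(x) = z + \cos(2\pi u) + \exp(z + u)$, $(\alpha_0, \alpha_1, \beta) = (0.1, 0.2, 0.3)$, and $\Psi$ is the cumulative distribution function of the Cauchy or Logistic distribution.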

In S1 and S2, the strength of identification can be adjusted by changing the parameter $\kappa_2$, because $\kappa_2 = 0$ makes the model unidentifiable by Example 3.1. On the other hand, we can verify that the models in S3 and S4 are identifiable by Theorem 3.5. For example, in S4, checking (C3) and (C4) is straightforward in this setting, while (C5) and (C6) hold by Example 3.6 and Proposition 3.8, respectively. S3 and S4 allow us to confirm that inference is successful even with a discrete outcome and with a complex mean structure, respectively.

We generated 1000 independent Monte Carlo samples and computed estimators for $E[y]$ and $\beta$ with two methods: the fractional imputation (FI) estimator and the complete case (CC) estimator, which uses only completely observed data. The estimator for $E[y]$ is computed by the standard inverse probability weighting method with estimated response models (Riddles et al., 2016). We used correctly specified models for Scenarios S1–S3 but used nonparametric models for Scenario S4 because it is unrealistic to assume that the complicated mean structure is known. The R package ‘crs’, which specializes in nonparametric spline regression on a mixture of categorical and continuous covariates (Nie & Racine, 2012), is used to estimate the respondents' outcome model. Response models are estimated by using the method discussed in Section 2.2.

Bias, root mean squared error (RMSE), and coverage rate for 95% confidence intervals in S1–S4 are reported in Table 1. In all the Scenarios, CC estimators have a significant bias, and their coverage rates are far from 95%, while FI estimators work well when the model is strongly identifiable. When $\kappa_2$ is small in S1 and S2, the performance of variance estimation with FI is poor, as expected, although that of the point estimates is acceptable. The results in S4 indicate that the model is identifiable even if we use a nonparametric mean structure, and the estimates are almost the same between the two response models.

Table 1. Results of S1–S4: Bias, root mean square error (RMSE), and coverage rate (CR,%) with 95% confidence interval are reported.

5. Real data analysis

We analyzed a dataset of 2139 HIV-positive patients enrolled in AIDS Clinical Trials Group Study 175 (ACTG175; Hammer et al., 1996). In this analysis, we focus on the 532 patients who received zidovudine (ZDV) monotherapy. Let $y$, $x_1$, and $x_2$ be the CD4 cell counts at 96±5 weeks, at baseline, and at 20±5 weeks, respectively, let $x_3$ be the CD8 cell count at baseline, and let $z$ be sex. The outcome was subject to missingness with a 60.34% observation rate, while all covariates were fully observed. To make estimation stable and easy, we standardized all the data. We expect that $z$ (sex) is a reasonable choice for an instrumental variable because it is a biological characteristic that affects the value of CD4 but has little effect on the response probability.

Patients with a mild case of HIV tend to have a higher CD4 cell count; thus, one may consider that missingness of the outcome relates to serious conditions and may expect the missing outcomes to have lower CD4 cell counts than those of the respondents. We therefore considered five different MNAR response models: $P(\delta = 1 \mid x_1, x_2, x_3, y) = \Psi(\alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \beta y)$, where $\Psi$ represents either the Logistic function or the distribution function of the Cauchy distribution or of the t distribution with degrees of freedom $\nu$ (= 2, 5, 10). Theorem 3.5 and Proposition 3.8 ensure that the models with these five response models are all identifiable, even though the instrumental variable $z$ is discrete. From the above conjecture on missing values, the sign of $\beta$ is expected to be negative. We assumed that the respondents' outcome follows a normal distribution with a nonparametric mean structure, estimated by the ‘crs’ R package as in Scenario S4 in Section 4. The residual plots shown in Figure 1 and the computed $R^2$ value (= 0.453) suggest that the assumed distribution for the respondents' outcome fits well. Table 2 reports the estimated parameters and their estimated standard errors calculated from 1000 bootstrap samples. The results for the five response models were similar, which suggests that the estimates are robust to the choice of response model. Although we cannot determine whether the mechanism is MNAR or MAR because the estimated standard error for $\beta$ is large, the point estimate is negative, as we expected. This result is consistent with the result in P. Zhao et al. (2021).

Figure 1. Residual plots of respondents' outcome in ACTG175 data.


Table 2. Estimated parameters: Estimates and standard errors for the target parameters are reported.

6. Conclusion

In this paper, we proposed a new identification condition for models specified through the respondents' outcome model and the response model. Although our method requires the specification of these two models, they can be very general with the help of an instrument. As considered in Scenario S4 in Section 4, the mean function in the respondents' outcome model can be nonparametric, and the response model can be based on any strictly monotone function, not only the Logistic function. Our condition guarantees model identifiability even when instruments are categorical, a case that is not covered by previous conditions. Another advantage of our method is that the identification condition is easy to verify with observed data.

However, our method has some limitations. First, the respondents' outcome model needs to have the monotone-likelihood property required by condition (C5). For example, we cannot deal with mixture models in our framework. Second, the instruments must be specified in advance. To date, some methods for finding instruments have been proposed (P. Zhao et al., 2021), but there is still no gold standard method.


Disclosure statement

No potential conflict of interest was reported by the author(s).


Funding

Research by the second author was supported by MEXT Project for Seismology toward Research Innovation with Data of Earthquake (STAR-E) [Grant Number JPJ010217].

References

  • Beppu, K., Morikawa, K., & Im, J. (2021). Imputation with verifiable identification condition for nonignorable missing outcomes. arXiv:2204.10508.
  • Cui, X., Guo, J., & Yang, G. (2017). On the identifiability and estimation of generalized linear models with parametric nonignorable missing data mechanism. Computational Statistics & Data Analysis, 107, 64–80. https://doi.org/10.1016/j.csda.2016.10.017
  • D'Haultfœuille, X. (2010). A new instrumental method for dealing with endogenous selection. Journal of Econometrics, 154(1), 1–15. https://doi.org/10.1016/j.jeconom.2009.06.005
  • D'Haultfœuille, X. (2011). On the completeness condition in nonparametric instrumental problems. Econometric Theory, 27(3), 460–471. https://doi.org/10.1017/S0266466610000368
  • Hammer, S. M., Katzenstein, D. A., Hughes, M. D., Gundacker, H., Schooley, R. T., Haubrich, R. H., Henry, W. K., Lederman, M. M., Phair, J. P., Niu, M., Hirsch, M. S., & Merigan, T. C. (1996). A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine, 335(15), 1081–1090. https://doi.org/10.1056/NEJM199610103351501
  • Hu, W., Wang, R., Li, W., & Miao, W. (2022). Paradoxes and resolutions for semiparametric fusion of individual and summary data. arXiv:2210.00200.
  • Ibrahim, J. G., Lipsitz, S. R., & Horton, N. (2001). Using auxiliary data for parameter estimation with non-ignorably missing outcomes. Journal of the Royal Statistical Society: Series C (Applied Statistics), 50(3), 361–373.
  • Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
  • Kim, J. K. (2011). Parametric fractional imputation for missing data analysis. Biometrika, 98(1), 119–132. https://doi.org/10.1093/biomet/asq073
  • Kim, J. K., & Shao, J. (2021). Statistical methods for handling incomplete data. CRC Press.
  • Little, R. J., & Rubin, D. B. (2019). Statistical analysis with missing data. John Wiley & Sons.
  • Miao, W., Ding, P., & Geng, Z. (2016). Identifiability of normal and normal mixture models with nonignorable missing data. Journal of the American Statistical Association, 111(516), 1673–1683. https://doi.org/10.1080/01621459.2015.1105808
  • Miao, W., Liu, L., Tchetgen, E. T., & Geng, Z. (2019). Identification, doubly robust estimation, and semiparametric efficiency theory of nonignorable missing data with a shadow variable. arXiv:1509.02556.
  • Miao, W., & Tchetgen, E. T. (2018). Identification and inference with nonignorable missing covariate data. Statistica Sinica, 28(4), 2049.
  • Miao, W., & Tchetgen Tchetgen, E. J. (2016). On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika, 103(2), 475–482. https://doi.org/10.1093/biomet/asw016
  • Morikawa, K., & Kim, J. K. (2021). Semiparametric optimal estimation with nonignorable nonresponse data. The Annals of Statistics, 49(5), 2991–3014. https://doi.org/10.1214/21-AOS2070
  • Nie, Z., & Racine, J. S. (2012). The crs package: Nonparametric regression splines for continuous and categorical predictors. R Journal, 4(2), 48. https://doi.org/10.32614/RJ-2012-012
  • Riddles, M. K., Kim, J. K., & Im, J. (2016). A propensity-score-adjustment method for nonignorable nonresponse. Journal of Survey Statistics and Methodology, 4(2), 215–245. https://doi.org/10.1093/jssam/smv047
  • Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427), 846–866. https://doi.org/10.1080/01621459.1994.10476818
  • Tang, G., Little, R. J., & Raghunathan, T. E. (2003). Analysis of multivariate missing data with nonignorable nonresponse. Biometrika, 90(4), 747–764. https://doi.org/10.1093/biomet/90.4.747
  • Wang, S., Shao, J., & Kim, J. K. (2014). An instrumental variable approach for identification and estimation with nonignorable nonresponse. Statistica Sinica, 24(3), 1097–1116.
  • Yang, S., & Kim, J. K. (2020). Statistical data integration in survey sampling: A review. Japanese Journal of Statistics and Data Science, 3(2), 625–650. https://doi.org/10.1007/s42081-020-00093-w
  • Yang, S., Wang, L., & Ding, P. (2019). Causal inference with confounders missing not at random. Biometrika, 106(4), 875–888. https://doi.org/10.1093/biomet/asz048
  • Zhao, J., & Ma, Y. (2022). A versatile estimation procedure without estimating the nonignorable missingness mechanism. Journal of the American Statistical Association, 117(540), 1916–1930. https://doi.org/10.1080/01621459.2021.1893176
  • Zhao, P., Wang, L., & Shao, J. (2021). Sufficient dimension reduction and instrument search for data with nonignorable nonresponse. Bernoulli, 27(2), 930–945. https://doi.org/10.3150/20-BEJ1260

Appendix. Technical proofs

We first provide a technical result to prove Theorem 3.4.

Lemma A.1

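Let $a$, $b$, and $c$ be any positive real numbers. Assume that $r_1$ and $r_2$ are positive real numbers satisfying

$$-\frac{ab}{a+b} < \frac{r_1^2 - r_2^2}{r_1^2 r_2^2} < c. \tag{A1}$$

Then, there exist $0 < \pi_j^{(k)} < 1$ $(j = 1, 2, 3;\ k = 1, 2)$ such that

$$\sum_{j=1}^{3} \pi_j^{(1)} = r_1^2, \qquad \sum_{j=1}^{3} \pi_j^{(2)} = r_2^2, \tag{A2}$$

and

$$\frac{1}{\pi_1^{(1)}} - \frac{1}{\pi_1^{(2)}} = a, \qquad \frac{1}{\pi_2^{(1)}} - \frac{1}{\pi_2^{(2)}} = b, \qquad \frac{1}{\pi_3^{(1)}} - \frac{1}{\pi_3^{(2)}} = -c. \tag{A3}$$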

Proof

Proof of Lemma A.1

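By using a polar coordinate system, we transform $\pi_j^{(k)}$ $(j = 1, 2, 3;\ k = 1, 2)$ into

$$\left(\sqrt{\pi_1^{(1)}}, \sqrt{\pi_2^{(1)}}, \sqrt{\pi_3^{(1)}}\right) = r_1(\sin\phi_1\cos\phi_2,\ \sin\phi_1\sin\phi_2,\ \cos\phi_1),$$

$$\left(\sqrt{\pi_1^{(2)}}, \sqrt{\pi_2^{(2)}}, \sqrt{\pi_3^{(2)}}\right) = r_2(\sin\psi_1\cos\psi_2,\ \sin\psi_1\sin\psi_2,\ \cos\psi_1),$$

where $0 < \phi_1, \phi_2, \psi_1, \psi_2 < \pi/2$, which ensures that $\pi_j^{(k)}$ $(j = 1, 2, 3;\ k = 1, 2)$ satisfy (A2). It follows from (A3) and the double-angle formulas that

$$r_1^2(1 - \omega_1)(1 + \omega_2) - r_2^2(1 - \omega_3)(1 + \omega_4) = -\frac{a r_1^2 r_2^2}{4}(1 - \omega_1)(1 + \omega_2)(1 - \omega_3)(1 + \omega_4), \tag{A4}$$

$$r_1^2(1 - \omega_1)(1 - \omega_2) - r_2^2(1 - \omega_3)(1 - \omega_4) = -\frac{b r_1^2 r_2^2}{4}(1 - \omega_1)(1 - \omega_2)(1 - \omega_3)(1 - \omega_4), \tag{A5}$$

$$r_1^2(1 + \omega_1) - r_2^2(1 + \omega_3) = \frac{c r_1^2 r_2^2}{2}(1 + \omega_1)(1 + \omega_3), \tag{A6}$$

where $\omega_1 = \cos 2\phi_1$, $\omega_2 = \cos 2\phi_2$, $\omega_3 = \cos 2\psi_1$, and $\omega_4 = \cos 2\psi_2$. Setting $\omega_2 = \omega_4$ in Equations (A4) and (A5) yields

$$r_1^2(1 - \omega_1) - r_2^2(1 - \omega_3) = -\frac{a r_1^2 r_2^2}{4}(1 - \omega_1)(1 + \omega_2)(1 - \omega_3),$$

$$r_1^2(1 - \omega_1) - r_2^2(1 - \omega_3) = -\frac{b r_1^2 r_2^2}{4}(1 - \omega_1)(1 - \omega_2)(1 - \omega_3).$$

Fixing $\omega_2 = 1 - 2a/(a + b)$ reduces the above equations to the one common equation

$$r_1^2(1 - \omega_1) - r_2^2(1 - \omega_3) = -\frac{r_1^2 r_2^2\, ab}{2(a + b)}(1 - \omega_1)(1 - \omega_3), \tag{A7}$$

while maintaining the condition $-1 < \omega_2 < 1$. It remains to show that there exists $-1 < \omega_3 < 1$ satisfying (A6) and (A7). Solving Equation (A7) with respect to $\omega_1$, we have

$$\omega_1 = 1 - \frac{r_2^2(1 - \omega_3)}{r_1^2 + r_1^2 r_2^2\, ab(1 - \omega_3)/\{2(a + b)\}}. \tag{A8}$$

Substituting (A8) into (A6) leads to the following quadratic equation with respect to $\omega_3$:

$$f(\omega_3) = \left(\frac{r_1^2 r_2^4 ab + c r_1^4 r_2^4 ab}{2(a + b)} - \frac{c r_1^2 r_2^4}{2}\right)\omega_3^2 - \left(\frac{r_1^4 r_2^2 ab}{a + b} + c r_1^4 r_2^2\right)\omega_3 + \left(\frac{r_1^2 r_2^2 ab(2r_1^2 - r_2^2 - c r_1^2 r_2^2)}{2(a + b)} + \frac{c r_1^2 r_2^4}{2} + 2r_1^4 - 2r_1^2 r_2^2 - c r_1^4 r_2^2\right) = 0.$$

It follows from (A1) that

$$f(1) = r_1^2(2r_1^2 - 2r_2^2 - 2c r_1^2 r_2^2) < 0, \qquad f(-1) = 2r_1^2\left(r_1^2 - r_2^2 + \frac{r_1^2 r_2^2\, ab}{a + b}\right) > 0,$$

which implies that there is at least one solution $\omega_3$ of the equation $f(\omega_3) = 0$ in the open interval $(-1, 1)$.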
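The construction in the proof of Lemma A.1 can be traced numerically. The sketch below is an illustration under stated assumptions: it fixes $\omega_2 = \omega_4 = 1 - 2a/(a+b)$, locates a root $\omega_3 \in (-1, 1)$ by bisection, recovers $\omega_1$ from (A8), and maps back to the probabilities via $\pi_1 = r^2(1 - \omega)(1 + \omega_2)/4$, $\pi_2 = r^2(1 - \omega)(1 - \omega_2)/4$, and $\pi_3 = r^2(1 + \omega)/2$. The function `f` in the code is (A6) with (A8) substituted and multiplied through by the positive denominator of (A8), which is algebraically the same quadratic as $f(\omega_3)$. The sign convention assumed here is $1/\pi_3^{(1)} - 1/\pi_3^{(2)} = -c$ with $c > 0$; the function name and the input values are hypothetical.

```python
def construct_probabilities(a, b, c, r1_sq, r2_sq, n_iter=200):
    """Return (pi^(1), pi^(2)) built as in the proof of Lemma A.1.

    Assumes a, b, c > 0 and -ab/(a+b) < (r1_sq - r2_sq)/(r1_sq*r2_sq) < c.
    """
    A, B = r1_sq, r2_sq
    k = a * b / (2.0 * (a + b))
    w2 = 1.0 - 2.0 * a / (a + b)          # common value of omega_2 and omega_4

    def D(t):
        # Denominator of (A8); positive for t in (-1, 1).
        return A + k * A * B * (1.0 - t)

    def f(t):
        # (A6) with omega_1 from (A8), multiplied through by D(t).
        top = 2.0 * D(t) - B * (1.0 - t)  # equals (1 + omega_1) * D(t)
        return A * top - B * (1.0 + t) * D(t) - 0.5 * c * A * B * top * (1.0 + t)

    lo, hi = -1.0, 1.0                    # f(lo) > 0 > f(hi) under (A1)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    w3 = 0.5 * (lo + hi)
    w1 = 1.0 - B * (1.0 - w3) / D(w3)     # (A8)

    pi1 = (A * (1 - w1) * (1 + w2) / 4, A * (1 - w1) * (1 - w2) / 4, A * (1 + w1) / 2)
    pi2 = (B * (1 - w3) * (1 + w2) / 4, B * (1 - w3) * (1 - w2) / 4, B * (1 + w3) / 2)
    return pi1, pi2

pi1, pi2 = construct_probabilities(a=1.0, b=1.0, c=1.0, r1_sq=1.0, r2_sq=1.0)
print(pi1, pi2)
print([1.0 / p - 1.0 / q for p, q in zip(pi1, pi2)])  # approximately (a, b, -c)
```

For $a = b = c = 1$ and $r_1 = r_2 = 1$ the quadratic degenerates to $f(\omega_3) = -1.5\,\omega_3 - 0.5$, so $\omega_3 = -1/3$ and the two probability vectors are $(1/4, 1/4, 1/2)$ and $(1/3, 1/3, 1/3)$.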

Finally, we prove Theorem 3.4 with the help of Lemma A.1.

Proof

Proof of Theorem 3.4

Without loss of generality, we set the value of $u$ to be a fixed vector because the following proof holds for each $u$. Let the categorical variables $y$ and $z$ take values in $\{1, 2, \ldots, p\}$ and $\{1, 2, \ldots, q\}$, respectively. We show that model identifiability implies the completeness condition (C2) by individually addressing three cases: (i) $p = 2$, (ii) $p = 3$, and (iii) $p \geq 4$, because the ‘if’ part has already been established by Lemma 3.2.

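When $p = 2$, condition (C1) implies that the $q \times 2$ matrix whose $(i, j)$th element is $p(y = j \mid \delta = 1, z = i)$ $(i = 1, \ldots, q;\ j = 1, 2)$ has rank 2, because $z \not\perp y \mid (\delta = 1, u)$ gives at least two distinct rows, and two distinct probability vectors of length 2 are linearly independent. Hence, identifiable models always satisfy the completeness condition (C2).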

For cases where $p \geq 3$, we must show that the model becomes unidentifiable when the completeness condition is violated. The breach of the completeness condition indicates the existence of a non-zero vector $(h_1, \ldots, h_p)$ such that for $z = 1, \ldots, q$, we have

$$E[h_y \mid \delta = 1, z] = \sum_{y=1}^{p} h_y\, p(y \mid \delta = 1, z) = 0. \tag{A9}$$

The elements of $(h_1, \ldots, h_p)$ do not all share the same sign, and multiplying this vector by any constant does not affect the above equation. Recall that the model's unidentifiability means that there exist $\pi_y^{(1)} \neq \pi_y^{(2)}$ for some $y \in \{1, \ldots, p\}$ satisfying $\sum_{y=1}^{p} p(y \mid \delta = 1, z)/\pi_y^{(1)} = \sum_{y=1}^{p} p(y \mid \delta = 1, z)/\pi_y^{(2)}$. We now construct an unidentifiable model when the completeness condition is violated.

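When $p = 3$, without loss of generality, we assume $h_1 > 0$, $h_2 > 0$, and $h_3 < 0$ satisfying the condition $\sum_{y=1}^{3} h_y\, p(y \mid \delta = 1, z) = 0$ for all $z \in \{1, \ldots, q\}$. Employing Lemma A.1 with $a = h_1$, $b = h_2$, $c = -h_3$, and $r_1 = r_2 = 1$, we derive

$$\frac{1}{\pi_1^{(1)}} - \frac{1}{\pi_1^{(2)}} = h_1, \qquad \frac{1}{\pi_2^{(1)}} - \frac{1}{\pi_2^{(2)}} = h_2, \qquad \frac{1}{\pi_3^{(1)}} - \frac{1}{\pi_3^{(2)}} = h_3,$$

where $\sum_{j=1}^{3} \pi_j^{(1)} = \sum_{j=1}^{3} \pi_j^{(2)} = 1$. Substituting $h_1$, $h_2$, and $h_3$ into $\sum_{y=1}^{3} h_y\, p(y \mid \delta = 1, z) = 0$ shows that the model is unidentifiable.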

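Lastly, we consider the case of $p\ge 4$. Suppose $h_y$ $(y=1,\dots,p)$ satisfies (A9). Within $(h_1,\dots,h_p)$, we select three elements whose signs are positive, positive, and negative, respectively, and denote them by $a$, $b$, and $-c$, where $a,b,c>0$. We set $\lambda$ sufficiently large to ensure that
$$\lambda>2\max\left\{\frac{a+b}{ab},\ \frac{1}{c}\right\}. \tag{A10}$$
For ease of notation, we write $(h_1,\dots,h_p)=(h_1,\dots,h_{p-3},a,b,-c)$. The remaining part of the proof is similar when the combination of the signs is negative, negative, and positive. With the selected $\lambda$, the values $0<\pi_y^{(k)}<1$ $(y=1,\dots,p-3;\ k=1,2)$ are chosen sufficiently small to satisfy
$$\left(1-\sum_{y=1}^{p-3}\pi_y^{(1)}\right)\left(1-\sum_{y=1}^{p-3}\pi_y^{(2)}\right)\ge\frac{1}{2},\qquad \sum_{y=1}^{p-3}\pi_y^{(1)}<1,\qquad \sum_{y=1}^{p-3}\pi_y^{(2)}<1,\qquad \frac{1}{\pi_y^{(1)}}-\frac{1}{\pi_y^{(2)}}=\lambda h_y\ \text{ for } y=1,\dots,p-3. \tag{A11}$$
Furthermore, we define $r_1$ and $r_2$ as
$$r_1^2=1-\sum_{y=1}^{p-3}\pi_y^{(1)},\qquad r_2^2=1-\sum_{y=1}^{p-3}\pi_y^{(2)}. \tag{A12}$$
By determining the variables through these steps, it follows from (A10)–(A12) that condition (A1) with $a=\lambda a$, $b=\lambda b$, and $c=\lambda c$ is fulfilled:
$$\frac{r_1^2-r_2^2}{r_1^2 r_2^2}\le 2(r_1^2-r_2^2)\le\frac{2}{c}\,c<(\lambda c),$$
$$-\frac{(\lambda a)(\lambda b)}{(\lambda a)+(\lambda b)}<-\frac{ab}{a+b}\cdot\frac{2(a+b)}{ab}=-2r_1^2 r_2^2\,\frac{1}{r_1^2 r_2^2}\le-\frac{1}{r_1^2 r_2^2}<\frac{r_1^2-r_2^2}{r_1^2 r_2^2}.$$
Therefore, by applying Lemma A.1, there exist $\pi_{p-2}^{(k)}$, $\pi_{p-1}^{(k)}$, and $\pi_p^{(k)}$ $(k=1,2)$ such that
$$\sum_{y=p-2}^{p}\pi_y^{(1)}=r_1^2,\qquad \sum_{y=p-2}^{p}\pi_y^{(2)}=r_2^2,$$
$$\frac{1}{\pi_{p-2}^{(1)}}-\frac{1}{\pi_{p-2}^{(2)}}=\lambda a,\qquad \frac{1}{\pi_{p-1}^{(1)}}-\frac{1}{\pi_{p-1}^{(2)}}=\lambda b,\qquad \frac{1}{\pi_p^{(1)}}-\frac{1}{\pi_p^{(2)}}=-\lambda c.$$
It follows from (A9) and the construction above that the resulting $\pi_y^{(k)}$ $(y=1,\dots,p;\ k=1,2)$ satisfy $\sum_{y=1}^{p}\pi_y^{(k)}=1$ for $k=1,2$ and, for any $z\in\{1,\dots,q\}$,
$$\sum_{y=1}^{p}\left(\frac{1}{\pi_y^{(1)}}-\frac{1}{\pi_y^{(2)}}\right)p(y\mid\delta=1,z)=\lambda\sum_{y=1}^{p}h_y\,p(y\mid\delta=1,z)=0.$$
Therefore, the model is unidentifiable.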
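For categorical $y$ and $z$, the breach of completeness described by (A9) is a null-space statement: a non-zero $h$ exists exactly when the $q\times p$ matrix with entries $p(y=j\mid\delta=1,z=i)$ has rank smaller than $p$. The following Python sketch illustrates this rank check with exact rational arithmetic. The helper names `rank` and `is_complete` are hypothetical, and the probability tables are made-up toy inputs, not data from the paper.

```python
from fractions import Fraction as F


def rank(rows):
    """Rank of a matrix, computed by exact Gaussian elimination."""
    m = [[F(x) for x in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        # Find a row at or below position r with a non-zero entry in this column.
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def is_complete(cond_probs):
    """cond_probs[i][j] = p(y = j+1 | delta = 1, z = i+1); q rows, p columns.

    Completeness holds when only h = 0 solves (A9), i.e. when rank equals p.
    """
    return rank(cond_probs) == len(cond_probs[0])


# Toy table with p = 2 and q = 2: the rows differ, so the rank is 2.
print(is_complete([[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]))
# Toy table with p = 3 and q = 2: the rank is at most 2 < 3.
print(is_complete([[F(1, 2), F(1, 4), F(1, 4)], [F(1, 5), F(2, 5), F(2, 5)]]))
```

The second table shows why an instrument with fewer categories than the outcome cannot satisfy the condition: the rank of a $q\times p$ matrix never exceeds $q$.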

Proof of Theorem 3.5

We consider the case where $y$ is continuous; when $y$ is discrete, we only need to replace the integrals with summations. To simplify the discussion, we consider the case where $S_y=\mathbb{R}$. Let $u$ be a fixed value. Because $h$ and $g$ are injective functions, it is sufficient to prove the case where $\alpha:=h(u;\alpha)$ and $\beta:=g(u;\beta)$. Therefore, our goal is to prove that
$$\frac{p(y\mid x,\delta=1;\gamma)}{\int p(y\mid x,\delta=1;\gamma)\,\Psi\{\alpha+\beta m(y)\}^{-1}\,dy}=\frac{p(y\mid x,\delta=1;\gamma')}{\int p(y\mid x,\delta=1;\gamma')\,\Psi\{\alpha'+\beta' m(y)\}^{-1}\,dy}$$
implies $\alpha=\alpha'$, $\beta=\beta'$, and $\gamma=\gamma'$. Integrating both sides of the above equation with respect to $y$ yields the equality of the denominators. Thus, we have $p(y\mid x,\delta=1;\gamma)=p(y\mid x,\delta=1;\gamma')$; this implies $\gamma=\gamma'$ by (C4).

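Next, we consider the identification of $\beta$. Taking $z_1$ and $z_2$ such that they satisfy (C5), we show that
$$\int\frac{p(y\mid u,z_1,\delta=1;\gamma)}{\Psi\{\alpha+\beta m(y)\}}\,dy=\int\frac{p(y\mid u,z_1,\delta=1;\gamma)}{\Psi\{\alpha'+\beta' m(y)\}}\,dy, \tag{A13}$$
$$\int\frac{p(y\mid u,z_2,\delta=1;\gamma)}{\Psi\{\alpha+\beta m(y)\}}\,dy=\int\frac{p(y\mid u,z_2,\delta=1;\gamma)}{\Psi\{\alpha'+\beta' m(y)\}}\,dy \tag{A14}$$
imply $\beta=\beta'$. It follows from (A13) and (A14) that
$$\int K(y;\alpha,\alpha',\beta,\beta')\,p(y\mid u,z_1,\delta=1;\gamma)\,dy=\int K(y;\alpha,\alpha',\beta,\beta')\,p(y\mid u,z_2,\delta=1;\gamma)\,dy=0, \tag{A15}$$
where $K(y;\alpha,\alpha',\beta,\beta')=\Psi^{-1}\{\alpha+\beta m(y)\}-\Psi^{-1}\{\alpha'+\beta' m(y)\}$. It remains to show that (A15) implies $\beta=\beta'$ in the following two steps.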

Step I. We prove that the function $K(y;\alpha,\alpha',\beta,\beta')$ has a single change of sign when $\beta\neq\beta'$. Assume that $\beta\neq\beta'$. The equation $K(y;\alpha,\alpha',\beta,\beta')=0$ has only one solution $y^*\in S_y$, satisfying $m(y^*)=(\alpha-\alpha')/(\beta'-\beta)$, because of the injectivity of the functions $m(\cdot)$ and $\Psi(\cdot)$. This implies that $K(y)$ has a single change of sign.

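Step II. We prove that Equation (A15) does not hold when $\beta\neq\beta'$. Without loss of generality, by Step I, we consider the case where $K(y;\alpha,\alpha',\beta,\beta')<0$ $(y<y^*)$ and $K(y;\alpha,\alpha',\beta,\beta')>0$ $(y>y^*)$, and $p(y\mid u,z_2,\delta=1)/p(y\mid u,z_1,\delta=1)$ is monotone increasing. Let $c$ be the upper bound of the density ratio on $y<y^*$:
$$c:=\sup_{y<y^*}\frac{p(y\mid u,z_2,\delta=1)}{p(y\mid u,z_1,\delta=1)}.$$
By the property of $K(y;\alpha,\alpha',\beta,\beta')$ shown in (A15), we have
$$\begin{aligned}
0&=\int K(y;\alpha,\alpha',\beta,\beta')\,p(y\mid u,z_2,\delta=1)\,dy\\
&=\int_{-\infty}^{y^*}K(y;\alpha,\alpha',\beta,\beta')\,\frac{p(y\mid u,z_2,\delta=1)}{p(y\mid u,z_1,\delta=1)}\,p(y\mid u,z_1,\delta=1)\,dy+\int_{y^*}^{\infty}K(y;\alpha,\alpha',\beta,\beta')\,\frac{p(y\mid u,z_2,\delta=1)}{p(y\mid u,z_1,\delta=1)}\,p(y\mid u,z_1,\delta=1)\,dy\\
&\ge\int_{-\infty}^{y^*}c\,K(y;\alpha,\alpha',\beta,\beta')\,p(y\mid u,z_1,\delta=1)\,dy+\int_{y^*}^{\infty}c\,K(y;\alpha,\alpha',\beta,\beta')\,p(y\mid u,z_1,\delta=1)\,dy\\
&=c\int K(y;\alpha,\alpha',\beta,\beta')\,p(y\mid u,z_1,\delta=1)\,dy=0,
\end{aligned}$$
where the inequality follows from the definition of $c$ and the monotonicity of the density ratio. Because both ends of this chain are zero, the inequality must hold with equality, which forces the density ratio $p(y\mid u,z_2,\delta=1)/p(y\mid u,z_1,\delta=1)$ to be constant on $S_y$; hence, $p(y\mid u,z_2,\delta=1)=p(y\mid u,z_1,\delta=1)$ on $S_y$. This contradicts (C5), and thus $\beta=\beta'$.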

Finally, from the strict monotonicity of $\Psi$, it follows that the integral
$$\int\frac{p(y\mid u,z_1,\delta=1;\gamma)}{\Psi\{\alpha+\beta m(y)\}}\,dy$$
is injective with respect to $\alpha$. Therefore, Equation (A13) implies that $\alpha=\alpha'$.
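Step II can be illustrated numerically in a toy setting. The Python sketch below assumes a logistic $\Psi$, $m(y)=y$, and unit-variance normal densities with means 0 and 1 standing in for $p(y\mid u,z_1,\delta=1)$ and $p(y\mid u,z_2,\delta=1)$; these choices are assumptions made for illustration and are not taken from the paper. The parameters are chosen with $\beta\neq\beta'$ so that the $z_1$ integral in (A15) vanishes, and the $z_2$ integral then fails to vanish. The function names are hypothetical.

```python
import math


def inv_psi(t):
    """1 / Psi(t) for the logistic Psi(t) = 1 / (1 + exp(-t))."""
    return 1.0 + math.exp(-t)


def normal_pdf(y, mu):
    """Density of N(mu, 1) at y."""
    return math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2.0 * math.pi)


def integral_K(alpha, beta, alpha2, beta2, mu, lo=-15.0, hi=15.0, n=6000):
    """Trapezoidal approximation of the integral of K(y) * N(y; mu, 1)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * h
        k = inv_psi(alpha + beta * y) - inv_psi(alpha2 + beta2 * y)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * k * normal_pdf(y, mu)
    return total * h


# beta != beta'; alpha' is picked so that the z1 integral (mean 0) is zero.
alpha, beta, beta2 = 0.0, 1.0, 2.0
alpha2 = alpha - beta**2 / 2 + beta2**2 / 2

print(integral_K(alpha, beta, alpha2, beta2, mu=0.0))  # z1 integral
print(integral_K(alpha, beta, alpha2, beta2, mu=1.0))  # z2 integral
```

In this toy setting the integrals have the closed form $e^{-\alpha-\beta\mu+\beta^2/2}-e^{-\alpha'-\beta'\mu+\beta'^2/2}$ for mean $\mu$, so the two equalities in (A15) can hold together only if $\beta=\beta'$, in line with the argument above.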

Proof of Proposition 3.8

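It follows from assumption (4) that there exist $M, C>0$ such that
$$\begin{aligned}
\int\frac{p(y\mid x,\delta=1;\gamma)}{\Psi\{h(u;\alpha)+g(u;\beta)m(y)\}}\,dy
&\propto\int\exp\left\{-\frac{1}{2}\frac{(y-h(u;\alpha)-\beta\mu(x,\kappa))^2}{\beta^2\sigma^2}\right\}\frac{1}{\Psi(y)}\exp(-|y|^s)\exp(|y|^s)\,dy\\
&\le\int_{-\infty}^{-M}\exp\left\{-\frac{1}{2}\frac{(y-h(u;\alpha)-\beta\mu(x,\kappa))^2}{\beta^2\sigma^2}\right\}C\exp(|y|^s)\,dy+C<\infty,
\end{aligned}$$
where $0<s<2$. The bounds on the first and second terms of the last expression follow from condition (4) and the assumption that $\Psi$ is increasing, respectively.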