Relaxed doubly robust estimation in causal inference

Pages 69-79 | Received 25 Apr 2023, Accepted 27 Jan 2024, Published online: 08 Feb 2024

Abstract

Causal inference plays a crucial role in biomedical studies and social sciences. Over the years, researchers have devised various methods to facilitate causal inference, particularly in observational studies. Among these methods, the doubly robust estimator distinguishes itself through a remarkable feature: it retains its consistency even when only one of the two components – either the propensity score model or the outcome mean model – is correctly specified, rather than demanding correctness in both simultaneously. In this paper, we focus on scenarios where semiparametric models are employed for both the propensity score and the outcome mean. Semiparametric models offer a valuable blend of interpretability akin to parametric models and the adaptability characteristic of nonparametric models. In this context, achieving correct model specification involves both accurately specifying the unknown function and consistently estimating the unknown parameter. We introduce a novel concept: the relaxed doubly robust estimator. It operates in a manner reminiscent of the traditional doubly robust estimator but with a reduced requirement for double robustness. In essence, it only mandates the consistent estimate of the unknown parameter, without requiring the correct specification of the unknown function. This means that it only necessitates a partially correct model specification. We conduct a thorough analysis to establish the double robustness and semiparametric efficiency of our proposed estimator. Furthermore, we bolster our findings with comprehensive simulation studies to illustrate the practical implications of our approach.

1. Introduction

Causal inference plays a pivotal role across a variety of scientific disciplines, encompassing both randomized experiments and observational studies. While randomized experiments are the gold standard for establishing causal relationships regarding treatment effects, their implementation is often infeasible or challenging due to ethical or logistical constraints in many real-world situations.

Alternatively, one can utilize observational data to draw causal inferences, provided that appropriate corrections are made to mitigate the bias arising from confounding factors due to nonrandomized treatment assignment. Given that all confounders affecting the relationship between treatment and outcome are observed, researchers have devised a wide array of techniques to effectively control for confounding, such as the weighting and matching methods based on the propensity score (Rosenbaum & Rubin, 1983), outcome regression methods, and doubly robust (DR) methods based on the idea of augmenting the inverse probability weighting estimator (Bang & Robins, 2005; Lunceford & Davidian, 2004; Robins et al., 1994, 1995; Rotnitzky et al., 1998; Rotnitzky & Vansteelandt, 2014; Scharfstein et al., 1999). These methods are well covered in textbooks such as Imbens and Rubin (2015).

Among all these methods, the doubly robust method has been popular because its consistency relies on the correct specification of either the propensity score model or the outcome mean model, not necessarily both. Further, if both the propensity score model and the outcome mean model are correctly specified, the doubly robust estimator becomes semiparametrically efficient (Bickel et al., 1993; Tsiatis, 2006). Various semiparametric models, such as the single-index model, have been widely used to model the propensity score or the outcome mean, because semiparametric models typically possess the interpretability of parametric models and the flexibility of nonparametric models. Nonetheless, the advantageous property of double robustness may be compromised when semiparametric models are employed for either the propensity score or the outcome mean. In such cases, correct model specification necessitates the correct specification of the unknown function as well as the consistent estimation of the unknown parameter, the two critical elements within the semiparametric model. Notably, consistent estimation of the unknown parameter might not require correct specification of the unknown function.

In this paper, we introduce a novel approach to doubly robust estimation specifically tailored for semiparametric models used in both the propensity score and the outcome mean. As mentioned earlier, the traditional doubly robust estimation hinges on the correct specification of the unknown function and the consistent estimation of the unknown parameter – two essential components that need to be accurately identified within either the propensity score or the outcome mean. In contrast, our newly proposed doubly robust estimation only demands the consistent estimation of the unknown parameter, reducing the requirement to a single piece of unknown information within the semiparametric model. Hence, we refer to our innovative method as the ‘relaxed doubly robust’ (RDR) estimation. Furthermore, when both parameters (one within the propensity score model and the other within the outcome mean model) are consistently estimated, our RDR estimator also attains the semiparametric efficiency bound.

The remainder of the paper is structured as follows. In Section 2, we review the basics of traditional DR estimation and point out its limitation when a semiparametric propensity score model or a semiparametric outcome mean model needs to be correctly specified. We then present the proposed RDR estimation in Section 3, which includes the algorithm in Section 3.1 and the theory in Section 3.2. In Section 4, we conduct comprehensive simulation studies to illustrate our proposed estimator and compare it with traditional DR estimation. We conclude the paper with some discussion in Section 5.

2. Review of traditional DR estimation

We adopt the potential outcome framework (Neyman, 1923; Rubin, 1974) to briefly present the traditional DR estimation. We denote by $X$ a $p$-dimensional pre-treatment covariate. Suppose the treatment variable $T$ is binary, $T \in \{0, 1\}$, where 1 stands for treatment and 0 for control. For each level of the treatment $T = t$, under the standard stable unit treatment value assumption (SUTVA) (Rubin, 1980), we assume that there exists the potential outcome $Y_t$, representing the outcome had the subject, possibly contrary to the fact, been given treatment $t$. We denote the observed outcome as $Y$, which has the decomposition $Y = TY_1 + (1 - T)Y_0$ based on the consistency assumption. We have independent and identically distributed data $\{(t_i, y_i, x_i)\}$, $i = 1, \ldots, n$, that are realizations of the triplet $(T, Y, X)$.

Throughout, we focus on estimating the average treatment effect (ATE), defined as $\Delta = E(Y_1) - E(Y_0)$. The ATE is one of the most commonly used causal estimands in a variety of scientific applications and can generate important policy implications. The fundamental difficulty of estimating the ATE in causal inference is that one may observe at most one of $Y_1$ and $Y_0$ for each subject. We adopt the following two standard assumptions that are widely used in the causal inference literature. The first assumption concerns ignorability.

Assumption 2.1

Ignorability

We assume $(Y_1, Y_0) \perp T \mid X$, where $\perp$ stands for conditional independence.

Assumption 2.1 requires $X$ to include all potential confounders relevant to both treatment and outcome, and requires these confounders to be completely observed. It states that the treatment assignment is independent of the potential outcomes $Y_1$ and $Y_0$ after conditioning on all pre-treatment confounders $X$. Accordingly, we denote the propensity score model as $\pi(X) \equiv \mathrm{pr}(T = 1 \mid Y_1, Y_0, X) = \mathrm{pr}(T = 1 \mid X)$. We also denote the outcome mean models as $\mu_1(X) \equiv E(Y_1 \mid X) = E(Y_1 \mid X, T = 1)$ and $\mu_0(X) \equiv E(Y_0 \mid X) = E(Y_0 \mid X, T = 0)$. The second assumption pertains to the overlap of the covariate distributions between the treated group and the controls.

Assumption 2.2

Overlap

There exist constants $c_1$ and $c_2$ such that $0 < c_1 \le \pi(X) \le c_2 < 1$ almost surely.

Assumption 2.2 means that the covariate distributions of the two groups share common support. If this assumption is not satisfied at some value of $X$, subjects with this value can appear only in the treatment group or only in the control group, which would require extrapolation at this value and make inference about the ATE unreliable.

Under Assumptions 2.1 and 2.2, the ATE can be identified and then estimated via a variety of well-known methods (Imbens, 2004; Rosenbaum, 2002). These methods need either an estimate of the propensity score model, say $\hat\pi(x)$, that correctly specifies $\pi(x)$, or estimates of the outcome mean models, say $\hat\mu_1(x)$ and $\hat\mu_0(x)$, that correctly specify $\mu_1(x)$ and $\mu_0(x)$ respectively, or both. These methods also induce different estimation efficiencies. In particular, the efficient influence function (EIF) (Tsiatis, 2006) for estimating the ATE $\Delta$ is $$\phi = \frac{t}{\pi(x)}\{y_1 - \mu_1(x)\} - \frac{1 - t}{1 - \pi(x)}\{y_0 - \mu_0(x)\} + \mu_1(x) - \mu_0(x) - \Delta.$$ Accordingly, the traditional DR estimator is $$\hat\Delta_{dr} = \frac{1}{n}\sum_{i=1}^n \left[\frac{t_i}{\hat\pi(x_i)}\{y_{1i} - \hat\mu_1(x_i)\} - \frac{1 - t_i}{1 - \hat\pi(x_i)}\{y_{0i} - \hat\mu_0(x_i)\} + \hat\mu_1(x_i) - \hat\mu_0(x_i)\right].$$ It is called DR since it only needs $\hat\pi(x)$ to correctly specify $\pi(x)$, or $\hat\mu_t(x)$ to correctly specify $\mu_t(x)$ for $t = 1, 0$, but not necessarily both. If all of these nuisance functions are correctly specified, the asymptotic variance of this estimator achieves the semiparametric efficiency bound $E(\phi^2)$, and the traditional DR estimator is then called semiparametrically efficient. Among all regular and asymptotically linear estimators, $E(\phi^2)$ is the minimum possible asymptotic variance that one can attain.
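For concreteness, the augmented IPW form of $\hat\Delta_{dr}$ is straightforward to compute once the fitted nuisance values are available. The sketch below is our own minimal Python illustration (the function name and inputs are hypothetical), not the authors' implementation:

```python
import numpy as np

def dr_estimate(t, y, pi_hat, mu1_hat, mu0_hat):
    """Traditional doubly robust (augmented IPW) estimate of the ATE.

    t: 0/1 treatment indicators; y: observed outcomes;
    pi_hat: fitted propensity scores pi(x_i);
    mu1_hat, mu0_hat: fitted outcome means mu_1(x_i), mu_0(x_i).
    """
    # Augmented terms for E(Y1) and E(Y0); each is unbiased if either
    # the propensity score or the corresponding outcome mean is correct.
    aug1 = t / pi_hat * (y - mu1_hat) + mu1_hat
    aug0 = (1 - t) / (1 - pi_hat) * (y - mu0_hat) + mu0_hat
    return np.mean(aug1 - aug0)
```

The double robustness is visible in the formula: if `pi_hat` is correct, the weighted residual terms have mean zero regardless of `mu1_hat` and `mu0_hat`, and vice versa.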

2.1. Limitation of traditional DR estimation

In reality, practitioners may choose to model the nuisance functions $\pi(x)$, $\mu_1(x)$ and $\mu_0(x)$ differently based on their own preferences. For example, one may impose parametric restrictions, say $\pi(x, \alpha)$, $\mu_1(x, \beta)$ and $\mu_0(x, \gamma)$, where only the parameters $\alpha$, $\beta$ and $\gamma$ are unknown. Alternatively, one may impose no parametric restrictions and regard $\pi(x)$, $\mu_1(x)$ and $\mu_0(x)$ all as nonparametric functions.

As another alternative, semiparametric models such as single-index models generally possess the interpretability of parametric models as well as the flexibility of nonparametric models, and thus have been widely used in practice to model the propensity score, the outcome mean, or both. Throughout, we consider the single-index models $\pi(\alpha^\top x)$, $\mu_1(\beta^\top x)$ and $\mu_0(\gamma^\top x)$, where $\pi(\cdot)$, $\mu_1(\cdot)$ and $\mu_0(\cdot)$ are unknown functions, and $\alpha$, $\beta$ and $\gamma$ are unknown $p$-dimensional vectors whose first elements are all set to one. Under such a scenario, the traditional DR estimator becomes $$\hat\Delta_{dr} = \frac{1}{n}\sum_{i=1}^n \left[\frac{t_i}{\hat\pi(\hat\alpha^\top x_i)}\{y_{1i} - \hat\mu_1(\hat\beta^\top x_i)\} - \frac{1 - t_i}{1 - \hat\pi(\hat\alpha^\top x_i)}\{y_{0i} - \hat\mu_0(\hat\gamma^\top x_i)\} + \hat\mu_1(\hat\beta^\top x_i) - \hat\mu_0(\hat\gamma^\top x_i)\right].$$ Clearly, double robustness then refers to either the correct specification of $\pi(\cdot)$ together with a consistent estimate of $\alpha$, or the correct specifications of $\mu_1(\cdot)$ and $\mu_0(\cdot)$ together with consistent estimates of $\beta$ and $\gamma$. This may not be appealing for practical use, since correct model specification now requires both correctly specified functional forms and consistently estimated parameters.

3. Proposed RDR estimation

We propose a relaxed doubly robust (RDR) estimator for estimating the ATE under the scenario that both the propensity score and the outcome means are single-index models. We term it RDR since its double robustness refers only to the consistent estimation of the parameter $\alpha$, or of the parameters $\beta$ and $\gamma$. Further, when all of the parameters $\alpha$, $\beta$ and $\gamma$ are consistently estimated, the RDR estimator is also semiparametrically efficient. The correct specification of the functional forms $\pi(\cdot)$, $\mu_1(\cdot)$ and $\mu_0(\cdot)$ is never required.

Our key idea is to rewrite the definition of $\Delta$ as $$\Delta = E\{E(Y_1 \mid \alpha^\top X, \beta^\top X)\} - E\{E(Y_0 \mid \alpha^\top X, \gamma^\top X)\},$$ and then propose the RDR estimator $$\hat\Delta_{rdr} \equiv \hat\Delta_{rdr}(\hat\alpha, \hat\beta, \hat\gamma) = \frac{1}{n}\sum_{i=1}^n \{\hat E(Y_1 \mid \hat\alpha^\top x_i, \hat\beta^\top x_i) - \hat E(Y_0 \mid \hat\alpha^\top x_i, \hat\gamma^\top x_i)\},$$ where the estimates of the parameters $\alpha$, $\beta$ and $\gamma$ can be obtained through some off-the-shelf dimension reduction methods, and the estimate of the conditional expectation $\hat E(\cdot)$ can be obtained from standard kernel regression (Cheng, 1994; Hu et al., 2012), as detailed below.

3.1. Algorithm

The algorithm entails the following six steps.

  1. Estimate the coefficients $\alpha$, $\beta$ and $\gamma$ from the regressions of $T$ on $X$, of $Y_1$ on $X$ given $T = 1$, and of $Y_0$ on $X$ given $T = 0$, using some dimension reduction technique such as sliced inverse regression (SIR). Denote these estimates by $\hat\alpha$, $\hat\beta$ and $\hat\gamma$, respectively.

  2. Compute $\hat S_i = (\hat\alpha^\top x_i, \hat\beta^\top x_i)$ and $\hat M_i = (\hat\alpha^\top x_i, \hat\gamma^\top x_i)$ for $i = 1, 2, \ldots, n$. The original data $(X, T, Y)$ have now been transformed to $(\hat S, T, Y)$ and $(\hat M, T, Y)$.

  3. For $i = 1, 2, \ldots, n$, estimate $E(Y_1 \mid \hat S_i)$ by a Nadaraya-Watson kernel regression of $Y_1$ on $\hat S$: $$\hat E(Y_1 \mid \hat S_i) = \hat E(Y_1 \mid \hat\alpha^\top x_i, \hat\beta^\top x_i) = \frac{\sum_{j=1}^n T_j K_H(\hat S_i - \hat S_j) Y_{1j}}{\sum_{k=1}^n T_k K_H(\hat S_i - \hat S_k)},$$ where $K(\cdot)$ is a bivariate kernel function, $K_H(u) = K(H^{-1}u/h_n)/\{\det(H) h_n^2\}$ with $u = (u_1, u_2)$, and $H = \hat\Sigma_1^{1/2}$ is the square root of the 2-by-2 sample covariance matrix of $\hat S$. Here $h_n$ is the bandwidth, and we follow Hu et al. (2012) in choosing the optimal bandwidth $h_n \propto n^{-1/3}$.

  4. Similarly, for $i = 1, 2, \ldots, n$, estimate $E(Y_0 \mid \hat M_i)$ by a Nadaraya-Watson kernel regression of $Y_0$ on $\hat M$: $$\hat E(Y_0 \mid \hat M_i) = \hat E(Y_0 \mid \hat\alpha^\top x_i, \hat\gamma^\top x_i) = \frac{\sum_{j=1}^n (1 - T_j) K_H(\hat M_i - \hat M_j) Y_{0j}}{\sum_{k=1}^n (1 - T_k) K_H(\hat M_i - \hat M_k)},$$ where $K(\cdot)$ is a bivariate kernel function, $K_H(u) = K(H^{-1}u/h_n)/\{\det(H) h_n^2\}$ with $u = (u_1, u_2)$, and $H = \hat\Sigma_0^{1/2}$ is the square root of the 2-by-2 sample covariance matrix of $\hat M$.

  5. Estimate $E(Y_1)$ and $E(Y_0)$ by the sample averages of $\hat E(Y_1 \mid \hat\alpha^\top x_i, \hat\beta^\top x_i)$ and $\hat E(Y_0 \mid \hat\alpha^\top x_i, \hat\gamma^\top x_i)$; that is, $\hat E_{\hat S}(Y_1) = \frac{1}{n}\sum_{i=1}^n \hat E(Y_1 \mid \hat\alpha^\top x_i, \hat\beta^\top x_i)$ and $\hat E_{\hat M}(Y_0) = \frac{1}{n}\sum_{i=1}^n \hat E(Y_0 \mid \hat\alpha^\top x_i, \hat\gamma^\top x_i)$.

  6. The proposed RDR estimator is $\hat\Delta_{rdr} = \hat E_{\hat S}(Y_1) - \hat E_{\hat M}(Y_0)$.
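The six steps above can be sketched compactly in code. The following Python sketch is our own illustration, not the authors' implementation: it carries out steps 2-6, substituting a standardized isotropic Gaussian kernel for the paper's general bivariate kernel $K_H$, and takes the SIR coefficient estimates of step 1 as given inputs:

```python
import numpy as np

def _nw_mean(idx, group, y, h):
    """Nadaraya-Watson regression of y on the 2-D index `idx`, using only
    the units with group == 1, evaluated at every unit's index value."""
    z = (idx - idx.mean(axis=0)) / idx.std(axis=0)       # standardize the index
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * h ** 2)) * group[None, :]    # Gaussian kernel weights
    den = np.maximum(w.sum(axis=1), 1e-300)              # guard against underflow
    return (w * y[None, :]).sum(axis=1) / den

def rdr_estimate(x, t, y, alpha_hat, beta_hat, gamma_hat):
    """Relaxed doubly robust ATE estimate (steps 2-6 of the algorithm);
    alpha_hat, beta_hat, gamma_hat are the index coefficients from step 1,
    e.g. obtained by sliced inverse regression."""
    n = len(t)
    h = n ** (-1.0 / 3.0)                                # bandwidth rate n^(-1/3)
    s = np.column_stack([x @ alpha_hat, x @ beta_hat])   # index pair for Y1
    m = np.column_stack([x @ alpha_hat, x @ gamma_hat])  # index pair for Y0
    e1 = _nw_mean(s, t.astype(float), y, h)              # step 3: E(Y1 | S_i)
    e0 = _nw_mean(m, 1.0 - t.astype(float), y, h)        # step 4: E(Y0 | M_i)
    return e1.mean() - e0.mean()                         # steps 5-6
```

Here the standardization inside `_nw_mean` plays the role of the normalization matrix $H = \hat\Sigma^{1/2}$ in steps 3-4, and only the observed outcomes of the relevant treatment arm enter each kernel regression.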

3.2. Theory

To facilitate the presentation of the theory, we define the following conditions.

Condition (i): $\alpha$ is consistently estimated and its estimate $\hat\alpha$ satisfies $\hat\alpha - \alpha = O_p(n^{-1/2})$.

Condition (ii): $\beta$ and $\gamma$ are consistently estimated and their estimates $\hat\beta$ and $\hat\gamma$ satisfy $\hat\beta - \beta = O_p(n^{-1/2})$ and $\hat\gamma - \gamma = O_p(n^{-1/2})$.

We defer discussion of how to consistently estimate the parameters $\alpha$, $\beta$ and $\gamma$ to Section 5. In general, interested readers may refer to the literature on semiparametric modelling techniques, including single-index models and dimension reduction techniques.

Other than Assumptions 2.1 and 2.2, our theory also requires the following regularity conditions.

  1. The kernel function $K(\cdot)$ in $K_H(\cdot)$ satisfies $\int u K(u)\,du = 0$, $\int u u^\top K(u)\,du = \gamma_K I$ for some constant $\gamma_K$, and $\int K^2(u)\,du = \tau_K$ for some constant $\tau_K$.

  2. The density of X is bounded away from 0.

  3. As $n \to \infty$, the bandwidth matrix $H \to 0$ such that $\mathrm{tr}(H^\top H) \to 0$ and $n \det(H) \to \infty$, where $\mathrm{tr}$ denotes the trace of a matrix.

  4. $E\{|X - E(X)|^k\} < \infty$ for some $k > 6$.

Theorem 3.1

Relaxed Double Robustness

Given Assumptions 2.1 and 2.2 and the above regularity conditions, if either Condition (i) or Condition (ii) is satisfied, then the proposed estimator $\hat\Delta_{rdr}$ is consistent; that is, $\hat\Delta_{rdr} \to_p \Delta$ as $n \to \infty$.

Theorem 3.2

Semiparametric Efficiency

Given Assumptions 2.1 and 2.2 and the above regularity conditions, if both Condition (i) and Condition (ii) are satisfied, then the proposed estimator $\hat\Delta_{rdr}$ achieves the semiparametric efficiency bound; that is, $$\sqrt{n}(\hat\Delta_{rdr} - \Delta) \to_d N\{0, E(\phi^2)\},$$ where $$\phi = \frac{t}{\pi(\alpha^\top x)}\{y_1 - \mu_1(\beta^\top x)\} - \frac{1 - t}{1 - \pi(\alpha^\top x)}\{y_0 - \mu_0(\gamma^\top x)\} + \mu_1(\beta^\top x) - \mu_0(\gamma^\top x) - \Delta$$ is the EIF for estimating $\Delta$ under the scenario that both the propensity score and the outcome means are single-index models.

The proofs of Theorems 3.1 and 3.2 are contained in the Appendix.

4. Simulation studies

In this section, we conduct comprehensive simulation studies to illustrate the proposed RDR estimator and to compare it with the oracle estimator (Oracle, which uses all simulated $Y_1$ and $Y_0$ data), the regression-based estimator (REG), the inverse probability weighting estimator (IPW) and the traditional DR estimator (DR).

We consider the following three settings pertaining to the misspecification of the nuisance functional forms.

  • Setting 1: neither $\pi(\cdot)$ nor $\mu_t(\cdot)$, $t = 1, 0$, is correctly specified;

  • Setting 2: only $\pi(\cdot)$ is correctly specified;

  • Setting 3: only $\mu_t(\cdot)$, $t = 1, 0$, is correctly specified.

We expect that, in every setting, RDR would perform similarly to the Oracle. Because of the requirements on correct model specification, we expect that all three methods REG, IPW and DR would be biased in Setting 1, that REG would be biased in Setting 2, and that IPW would be biased in Setting 3.

In each setting, we consider four cases with sample sizes $n = 1000$ and $n = 2000$, combined with independent and correlated covariates: $X \sim N(0, 0.5 I_4)$ and $X \sim N(0, \Sigma)$, where $\Sigma_{ij} = 0.5^{|i-j|+1}$ for $i, j = 1, \ldots, 4$. For Setting 1, we consider the nuisance models $\pi(x) = \mathrm{probit}\{2(\alpha^\top x)^2 - 1.3|\alpha^\top x|\}$, $\mu_1(x) = \beta^\top x/\{0.5 + (\beta^\top x + 1.5)^2\}$ and $\mu_0(x) = \gamma^\top x/\{0.5 + (\gamma^\top x + 1)^2\}$, where $\alpha = (1, 0.5, 1, 0.5)^\top$, $\beta = (1, 1, 0.5, 1)^\top$ and $\gamma = (0.5, 1, 1, 1)^\top$. For Setting 2, we consider $\pi(x) = \exp(\alpha^\top x)/\{1 + \exp(\alpha^\top x)\}$ and the same outcome mean models as in Setting 1. For Setting 3, we consider $\pi(x) = \mathrm{probit}\{(\alpha^\top x)^2 - |\alpha^\top x|\}$, $\mu_1(x) = \beta^\top x$ and $\mu_0(x) = \gamma^\top x$. For every case, we use the sliced inverse regression (SIR) method to obtain $\hat\alpha$, $\hat\beta$ and $\hat\gamma$ for the proposed RDR estimator. For the traditional REG, IPW and DR methods, we simply fit logistic regression for $\pi(x)$ and linear regression for $\mu_t(x)$, $t = 1, 0$. We conduct 500 Monte Carlo simulations and report the empirical bias and empirical mean squared error in Tables 1-3.
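As an illustration of the data-generating process, the snippet below draws one dataset from our reconstruction of Setting 2 (logistic propensity, nonlinear single-index outcome means). The function names are ours, the coefficient values are taken as printed in the text, and the functional forms are a reconstruction where the extraction garbled the formulas:

```python
import numpy as np

def ar_cov(p=4, rho=0.5):
    """Covariance matrix with (i, j) entry rho^(|i-j|+1): the correlated design."""
    idx = np.arange(p)
    return rho ** (np.abs(idx[:, None] - idx[None, :]) + 1)

def simulate_setting2(n, correlated, rng):
    """Draw one dataset (x, t, y) from a reconstruction of Setting 2."""
    alpha = np.array([1.0, 0.5, 1.0, 0.5])
    beta = np.array([1.0, 1.0, 0.5, 1.0])
    gamma = np.array([0.5, 1.0, 1.0, 1.0])
    cov = ar_cov() if correlated else 0.5 * np.eye(4)
    x = rng.multivariate_normal(np.zeros(4), cov, size=n)
    pi = 1.0 / (1.0 + np.exp(-(x @ alpha)))          # logistic propensity (correct)
    t = (rng.uniform(size=n) < pi).astype(int)
    mu1 = x @ beta / (0.5 + (x @ beta + 1.5) ** 2)   # nonlinear single-index means
    mu0 = x @ gamma / (0.5 + (x @ gamma + 1.0) ** 2)
    y = np.where(t == 1, mu1, mu0) + rng.normal(size=n)
    return x, t, y
```

Under this design a fitted logistic regression recovers $\pi(x)$ correctly, while a fitted linear regression misspecifies the outcome means, which is exactly the asymmetry Setting 2 is meant to probe.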

Table 1. Simulation results for setting 1.

Table 2. Simulation results for setting 2.

Table 3. Simulation results for setting 3.

The results of simulation Setting 1 are shown in Table 1. When both models are misspecified, the relaxed doubly robust estimator has the smallest biases for $E(Y_1)$, $E(Y_0)$ and $\Delta$ among all the estimators, whereas the regression, IPW and DR estimators are significantly biased for these estimands. Also, the mean squared errors of the RDR estimator are quite close to those of the oracle estimator across scenarios, while those of the other estimators are not.

Table 2 gives the numerical results for simulation Setting 2. When the regression models are misspecified, the regression estimator has significantly larger biases than the other estimators, while the IPW and DR estimators have small biases, which matches our expectation. Our RDR estimator achieves smaller biases and MSEs than the regression estimator, and its MSEs are quite close to those of the oracle estimator, indicating that the RDR estimator performs well under this condition.

Finally, Table 3 shows similar results for simulation Setting 3. When the regression models are correctly specified, the regression and DR estimators both achieve relatively small biases and mean squared errors. The IPW estimator has large biases and MSEs due to model misspecification. Our RDR estimator achieves small biases and mean squared errors close to those of the oracle estimator, demonstrating that it performs well when the propensity score model is misspecified.

5. Discussion

In this paper, we propose a relaxed doubly robust (RDR) estimator for estimating the average treatment effect under the scenario that both the propensity score and the outcome means are single-index models. In contrast to traditional doubly robust (DR) estimation, our new estimator only necessitates partial correctness in model specification: it requires only consistent estimation of the unknown parameter in either the propensity score model or the outcome mean model, or in both, to achieve double robustness and semiparametric efficiency, respectively. Correct specification of the unknown functions in these single-index models is no longer needed. This is why we term it 'relaxed double robustness'.

To implement the proposed RDR estimator, one needs to consistently estimate $\alpha$, $\beta$ and $\gamma$ in a semiparametric modelling framework. This is a stand-alone problem that has fortunately been well studied in the literature. One can either use standard semiparametric methods for single-index models (Ichimura, 1993; Klein & Spady, 1993) or estimate these unknown parameters in the context of dimension reduction (Cook, 2009; Ma & Zhu, 2012, 2013). In our numerical studies, we choose the standard sliced inverse regression (Li, 1991) from the dimension reduction literature to estimate the parameters $\alpha$, $\beta$ and $\gamma$.

Finally, as the Associate Editor suggested, we point out that other relaxed doubly robust estimators also exist. For instance, after estimating the unknown parameters $\alpha$, $\beta$ and $\gamma$ in the same way as we propose, one may opt to use a one-dimensional nonparametric regression method such as kernel smoothing to estimate $\hat\pi(\cdot)$, $\hat\mu_1(\cdot)$ and $\hat\mu_0(\cdot)$, respectively. This ultimately leads to the estimator $$\frac{1}{n}\sum_{i=1}^n \left[\frac{t_i}{\hat\pi(\hat\alpha^\top x_i)}\{y_{1i} - \hat\mu_1(\hat\beta^\top x_i)\} - \frac{1 - t_i}{1 - \hat\pi(\hat\alpha^\top x_i)}\{y_{0i} - \hat\mu_0(\hat\gamma^\top x_i)\} + \hat\mu_1(\hat\beta^\top x_i) - \hat\mu_0(\hat\gamma^\top x_i)\right],$$ which is also relaxed doubly robust. This estimator and the proposed estimator are built on different ideas. The proposed RDR estimator achieves double robustness by fitting the two augmented outcome mean models $E(Y_1 \mid \hat\alpha^\top x, \hat\beta^\top x)$ and $E(Y_0 \mid \hat\alpha^\top x, \hat\gamma^\top x)$; see steps 3-5 of the algorithm in Section 3.1. In contrast, the aforementioned estimator achieves double robustness because it directly implements the efficient influence function.

Acknowledgements

The authors would like to thank the Editor, an Associate Editor, and one reviewer for their insightful comments which have helped improve the manuscript substantially.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

The research is supported in part by US National Science Foundation [Grant Numbers DMS 1953526, 2122074, 2310942], US National Institutes of Health [Grant Number R01DC021431] and the American Family Funding Initiative of UW-Madison.

References

  • Bang, H., & Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4), 962–973. https://doi.org/10.1111/biom.2005.61.issue-4
  • Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models. Johns Hopkins University Press.
  • Cheng, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. Journal of the American Statistical Association, 89(425), 81–87. https://doi.org/10.1080/01621459.1994.10476448
  • Cook, R. D. (2009). Regression graphics: Ideas for studying regressions through graphics. John Wiley & Sons.
  • Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2), 315–331. https://doi.org/10.2307/2998560
  • Hu, Z., Follmann, D. A., & Qin, J. (2012). Semiparametric double balancing score estimation for incomplete data with ignorable missingness. Journal of the American Statistical Association, 107(497), 247–257. https://doi.org/10.1080/01621459.2012.656009
  • Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1–2), 71–120. https://doi.org/10.1016/0304-4076(93)90114-K
  • Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics, 86(1), 4–29. https://doi.org/10.1162/003465304323023651
  • Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
  • Klein, R. W., & Spady, R. H. (1993). An efficient semiparametric estimator for binary response models. Econometrica: Journal of the Econometric Society, 61(2), 387–421. https://doi.org/10.2307/2951556
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. https://doi.org/10.1080/01621459.1991.10475035
  • Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19), 2937–2960. https://doi.org/10.1002/sim.v23:19
  • Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497), 168–179. https://doi.org/10.1080/01621459.2011.646925
  • Ma, Y., & Zhu, L. (2013). A review on dimension reduction. International Statistical Review, 81(1), 134–150. https://doi.org/10.1111/insr.2013.81.issue-1
  • Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. English translation of excerpts by Dabrowska, D., & Speed, T. Statistical Science, 5(4), 465–472.
  • Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427), 846–866. https://doi.org/10.1080/01621459.1994.10476818
  • Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association, 90(429), 106–121. https://doi.org/10.1080/01621459.1995.10476493
  • Rosenbaum, P. R. (2002). Overt bias in observational studies. Springer.
  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.1093/biomet/70.1.41
  • Rotnitzky, A., Robins, J. M., & Scharfstein, D. O. (1998). Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the American Statistical Association, 93(444), 1321–1339. https://doi.org/10.1080/01621459.1998.10473795
  • Rotnitzky, A., & Vansteelandt, S. (2014). Double-robust methods. In Handbook of Missing Data Methodology (pp. 185–212). CRC Press.
  • Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. https://doi.org/10.1037/h0037350
  • Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371), 591–593.
  • Scharfstein, D. O., Rotnitzky, A., & Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448), 1096–1120. https://doi.org/10.1080/01621459.1999.10473862
  • Tsiatis, A. A. (2006). Semiparametric theory and missing data. Springer.

Appendix

Proof

Proof of Theorem 3.1

First, we can prove that $\hat E(Y_1 \mid S)$, i.e., $\hat E(Y_1 \mid \alpha^\top X, \beta^\top X)$, is consistent for $E(Y_1 \mid S)$ if either of the following conditions is satisfied.

  • Condition (a): $\alpha$ is correctly specified, where $\alpha$ is the true coefficient in the propensity score model;

  • Condition (b): $\beta$ is correctly specified, where $\beta$ is the true coefficient in the single-index model of $Y_1$ on $X$.

If $\alpha$ is correctly specified, the propensity score $\pi(X)$ is a function of $\alpha^\top X$ with probability 1, which means that $\pi(X)$ can also be written as a function of $S = (\alpha^\top X, \beta^\top X)$. Denote this function by $f$, so that $\pi(X) = f(S)$ with probability 1. If $\beta$ is correctly specified, $E(Y_1 \mid X)$ is a function of $\beta^\top X$ with probability 1, which means that $E(Y_1 \mid X)$ can also be written as a function of $S = (\alpha^\top X, \beta^\top X)$. Denote this function by $g$, so that $E(Y_1 \mid X) = g(S)$ with probability 1. Note that the following equation holds if $\beta$ is correctly specified: $$E(Y_1 \mid S) = E\{E(Y_1 \mid X) \mid S\} = g(S) = E(Y_1 \mid X). \tag{A1}$$ In the algorithm, we estimate $E(Y_1 \mid \alpha^\top x_i, \beta^\top x_i)$ by a Nadaraya-Watson kernel regression of $Y_1$ on $\hat S$: $$\hat E(Y_1 \mid \alpha^\top x_i, \beta^\top x_i) = \frac{\sum_{j=1}^n T_j K_H(S_i - S_j) Y_{1j}}{\sum_{k=1}^n T_k K_H(S_i - S_k)} = \frac{n^{-1}\sum_{j=1}^n T_j Y_{1j} K_H(S_i - S_j)}{n^{-1}\sum_{k=1}^n T_k K_H(S_i - S_k)}.$$ Under the regularity conditions, we can show that $\hat E(Y_1 \mid \alpha^\top x_i, \beta^\top x_i)$ converges in probability to $$\frac{E(TY_1 \mid S_i)}{E(T \mid S_i)} = \frac{E\{E(TY_1 \mid X_i) \mid S_i\}}{E\{E(T \mid X_i) \mid S_i\}} = \frac{E\{\pi(X_i) E(Y_1 \mid X_i) \mid S_i\}}{E\{\pi(X_i) \mid S_i\}},$$ which can be shown to equal $E(Y_1 \mid S_i)$ under either Condition (a) or Condition (b).

When Condition (a) is satisfied, i.e., $\alpha^\top X$ is correctly specified, we have shown that $\pi(X_i) = f(S_i)$ with probability 1, whereby we can further simplify the limit above: $$\frac{E\{\pi(X_i) E(Y_1 \mid X_i) \mid S_i\}}{E\{\pi(X_i) \mid S_i\}} = \frac{f(S_i) E\{E(Y_1 \mid X_i) \mid S_i\}}{f(S_i)} = E\{E(Y_1 \mid X_i) \mid S_i\} = E(Y_1 \mid S_i),$$ by the law of total expectation. When Condition (b) is satisfied, i.e., $\beta^\top X$ is correctly specified, we have shown that $E(Y_1 \mid X_i) = g(S_i)$ with probability 1, and the limit simplifies as $$\frac{E\{\pi(X_i) E(Y_1 \mid X_i) \mid S_i\}}{E\{\pi(X_i) \mid S_i\}} = \frac{g(S_i) E\{\pi(X_i) \mid S_i\}}{E\{\pi(X_i) \mid S_i\}} = g(S_i) = E(Y_1 \mid S_i),$$ by Equation (A1). Thus, $\hat E(Y_1 \mid \alpha^\top x_i, \beta^\top x_i)$ is consistent for $E(Y_1 \mid \alpha^\top x_i, \beta^\top x_i)$, i.e., $\hat E(Y_1 \mid S_i)$ is consistent for $E(Y_1 \mid S_i)$, when either $\alpha^\top X$ or $\beta^\top X$ is correctly specified.

Similarly, we can show that $\hat E(Y_0 \mid \alpha^\top x_i, \gamma^\top x_i)$ is consistent for $E(Y_0 \mid \alpha^\top x_i, \gamma^\top x_i)$, i.e., $\hat E(Y_0 \mid M_i)$ is consistent for $E(Y_0 \mid M_i)$, when either $\alpha^\top X$ or $\gamma^\top X$ is correctly specified.

Combining these two consistency results, we know that $\hat E_S(Y_1) - \hat E_M(Y_0)$ is a consistent estimator of $\Delta$ when either Condition (i) or Condition (ii) is satisfied.

Note that the proposed estimator $\hat\Delta_{rdr}$ is $\hat E_{\hat S}(Y_1) - \hat E_{\hat M}(Y_0)$ rather than $\hat E_S(Y_1) - \hat E_M(Y_0)$. Next, we show that $\hat E_{\hat S}(Y_1)$ and $\hat E_{\hat M}(Y_0)$ are asymptotically equivalent to $\hat E_S(Y_1)$ and $\hat E_M(Y_0)$, respectively, under certain conditions, whereby the proposed estimator $\hat\Delta_{rdr}$ is asymptotically equivalent to $\hat E_S(Y_1) - \hat E_M(Y_0)$ and enjoys the same double robustness.

Define $\zeta = (\alpha^\top, \beta^\top)^\top$ and $\hat\zeta = (\hat\alpha^\top, \hat\beta^\top)^\top$. Thus, $S_i = b(X_i; \zeta)$ and $\hat S_i = b(X_i; \hat\zeta)$, where $b$ is a continuous function of $X$.

Define $$\varphi(X; \zeta) = \frac{\partial b(X; \zeta)}{\partial \zeta} - E\left\{\frac{\partial b(X; \zeta)}{\partial \zeta}\right\} = \begin{pmatrix} X - E(X) & 0 \\ 0 & X - E(X) \end{pmatrix}.$$ We want to show that $\hat E_{\hat S}(Y_1)$ is asymptotically equivalent to $\hat E_S(Y_1)$ if $E\{|X - E(X)|^k\} < \infty$ for some $k > 6$ and $\hat\zeta - \zeta = O_p(n^{-1/2})$ are satisfied.

To prove the claim above, it suffices to show that $K_H(\hat S_i - \hat S_j)$ is asymptotically equivalent to $K_H(S_i - S_j)$ under the given conditions.

We can first simplify the bivariate kernel function $K_H(u)$ by normalization. If we let $S^* = \hat\Sigma_1^{-1/2} S$, then $S^*$ preserves the same property as $S$, since there exists a continuous function $g^*$ such that $f(S) = f(\hat\Sigma_1^{1/2} S^*) = g^*(S^*)$. After this linear transformation, $S^*$ has the identity covariance, which means that we can simplify the bivariate kernel function to $K_H(u) = h_n^{-2} K(u/h_n)$. Thus, we can rewrite the bivariate kernel functions as $K_H(S_i - S_j) = h_n^{-2} K\{(S_i - S_j)/h_n\}$ and $K_H(\hat S_i - \hat S_j) = h_n^{-2} K\{(\hat S_i - \hat S_j)/h_n\}$, respectively.

Notice that $K_H(\hat S_i - \hat S_j) = K_H\{b(X_i; \hat\zeta) - b(X_j; \hat\zeta)\}$ is a function of $\hat\zeta$. By a Taylor expansion, it can be written as $$K_H(\hat S_i - \hat S_j) = h_n^{-2} K\left(\frac{S_i - S_j}{h_n} + \frac{\{\varphi(X_i; \zeta^*) - \varphi(X_j; \zeta^*)\}(\hat\zeta - \zeta)}{h_n}\right),$$ where $\zeta^*$ lies on the line segment between $\hat\zeta$ and $\zeta$. Since $E\{|\varphi(X; \zeta)|^k\} < \infty$, we have $|\varphi(X; \zeta)| \le n^{1/k}$ almost surely for all large $n$ by the Borel-Cantelli lemma. Thus, $|\varphi(X_j; \zeta^*) - \varphi(X_i; \zeta^*)| \le 2n^{1/k}$ almost surely by the triangle inequality. Combined with the optimal bandwidth $h_n \propto n^{-1/3}$ and the condition that $\hat\zeta - \zeta = O_p(n^{-1/2})$, the perturbation term $\{\varphi(X_i; \zeta^*) - \varphi(X_j; \zeta^*)\}(\hat\zeta - \zeta)/h_n$ is $O_p(n^{1/k + 1/3 - 1/2})$. Since $k > 6$, this term goes to zero, which means that $K_H(\hat S_i - \hat S_j)$ is asymptotically equivalent to $K_H(S_i - S_j)$.

Thus, the kernel regression estimator $\hat E_{\hat S}(Y_1)$ is asymptotically equivalent to $\hat E_S(Y_1)$ if $E\{|X - E(X)|^k\} < \infty$ for some $k > 6$ and $\hat\zeta - \zeta = O_p(n^{-1/2})$ are satisfied.

Similarly, for $\hat E_{\hat M}(Y_0)$ we can define $\zeta = (\alpha^\top, \gamma^\top)^\top$ and $\hat\zeta = (\hat\alpha^\top, \hat\gamma^\top)^\top$, and use the same argument to show that $\hat E_{\hat M}(Y_0)$ is asymptotically equivalent to $\hat E_M(Y_0)$ under the same conditions.

Hence, $\hat\Delta_{rdr} = \hat E_{\hat S}(Y_1) - \hat E_{\hat M}(Y_0)$ is asymptotically equivalent to $\hat E_S(Y_1) - \hat E_M(Y_0)$ under the given conditions, which implies that $\hat\Delta_{rdr}$ is consistent if either Condition (i) or Condition (ii) is satisfied.

Proof

Proof of Theorem 3.2

To prove the optimal efficiency of $\hat\Delta_{rdr}$, we need to prove the optimal efficiency of $\hat E_{\hat S}(Y_1)$ and $\hat E_{\hat M}(Y_0)$.

We first show that if $\alpha$ is correctly specified, $\hat E_S(Y_1)$ has the asymptotic normal distribution $$n^{1/2}\{\hat E_S(Y_1) - E(Y_1)\} \to_d N\left(0, \mathrm{var}\{E(Y_1 \mid S)\} + E\left[\frac{\mathrm{var}(Y_1 \mid S)}{\pi(X)}\right]\right).$$ Then, if $\beta$ is additionally correctly specified, we can further prove that $\hat E_S(Y_1)$ achieves the optimal efficiency.

To derive this asymptotic distribution, we split $n^{1/2}\{\hat E_S(Y_1) - E(Y_1)\}$ into three parts: $$n^{1/2}\{\hat E_S(Y_1) - E(Y_1)\} = n^{1/2}\left[\frac{1}{n}\sum_{i=1}^n \{\hat E(Y_1 \mid S_i) - E(Y_1)\}\right] = n^{1/2}A_n + n^{1/2}B_n + n^{1/2}C_n,$$ where $$A_n = \frac{1}{n}\sum_{i=1}^n \{E(Y_1 \mid S_i) - E(Y_1)\}, \quad B_n = \frac{1}{n}\sum_{i=1}^n E\{\hat E(Y_1 \mid S_i) - E(Y_1 \mid S_i) \mid \Theta_i\},$$ $$C_n = \frac{1}{n}\sum_{i=1}^n \left[\hat E(Y_1 \mid S_i) - E(Y_1 \mid S_i) - E\{\hat E(Y_1 \mid S_i) - E(Y_1 \mid S_i) \mid \Theta_i\}\right], \quad \Theta_i = \{(X_j, Y_{1j}, T_j) : j \ne i\}.$$ For $A_n$, the central limit theorem gives that $n^{1/2}A_n$ converges to $N(0, \mathrm{var}\{E(Y_1 \mid S)\})$. For $B_n$, we first rewrite $\hat E(Y_1 \mid S_i)$ as $$\hat E(Y_1 \mid S_i) = \frac{\sum_{j=1}^n Y_{1j} T_j K_H(S_i - S_j)}{\sum_{k=1}^n T_k K_H(S_i - S_k)} = \frac{n^{-1}\sum_{j=1}^n T_j Y_{1j} K_H(S_i - S_j)}{n^{-1}\sum_{k=1}^n T_k K_H(S_i - S_k)}.$$ When $\alpha$ is correctly specified, $\pi(X) = f(S)$ with probability 1, and the denominator can be written as $n^{-1}\sum_{k=1}^n T_k K_H(S_i - S_k) = f(S_i) + o_p(1)$. Hence, $B_n$ can be rewritten as $$B_n = \frac{1}{n}\sum_{j=1}^n T_j E\left[\frac{K_H(S_i - S_j)\{Y_{1j} - E(Y_1 \mid S_i)\}}{f(S_i)} \,\Big|\, \Theta_i\right]\{1 + o_p(1)\}.$$ We can show that $\sqrt{n}(B_n - B_n^*) = o_p(1)$ with $$B_n^* = \frac{1}{n}\sum_{j=1}^n \frac{T_j\{Y_{1j} - E(Y_1 \mid S_j)\}}{f(S_j)}$$ by Theorem 2.1 of Cheng (1994). Thus, by the central limit theorem, $n^{1/2}B_n$ converges to $N(0, E[\mathrm{var}(Y_1 \mid S)/f(S)])$. For $C_n$, we have $E(C_n) = 0$ and $$nE(C_n^2) \le E\left[\{\hat E(Y_1 \mid S_i) - E(Y_1 \mid S_i)\}^2\right] = O\left(\{\mathrm{tr}(H^\top H)\}^2 + \{n\det(H)\}^{-1}\right).$$ Thus, $C_n = o_p(n^{-1/2})$.

Combining the three parts, we obtain the asymptotic distribution $$n^{1/2}\{\hat E_S(Y_1) - E(Y_1)\} \to_d N\left(0, \mathrm{var}\{E(Y_1 \mid S)\} + E\left[\frac{\mathrm{var}(Y_1 \mid S)}{\pi(X)}\right]\right).$$ Note that the asymptotic variance $n\,\mathrm{var}\{\hat E_S(Y_1)\}$ can be rewritten as $$n\,\mathrm{var}\{\hat E_S(Y_1)\} = \mathrm{var}\{E(Y_1 \mid S)\} + E\left[\frac{\mathrm{var}(Y_1 \mid S)}{\pi(X)}\right] = \mathrm{var}(Y_1) - E\{\mathrm{var}(Y_1 \mid S)\} + E\left[\frac{\mathrm{var}(Y_1 \mid S)}{\pi(X)}\right] = \mathrm{var}(Y_1) + E\left[\left\{\frac{1}{\pi(X)} - 1\right\}\mathrm{var}(Y_1 \mid S)\right].$$ Additionally, if $\beta$ is correctly specified, we have $E(Y_1 \mid X) = E(Y_1 \mid S)$ by the argument in the proof of Theorem 3.1. From this equation, $\mathrm{var}(Y_1 \mid S)$ can be further rewritten as $E\{\mathrm{var}(Y_1 \mid X) \mid S\}$: $$\mathrm{var}(Y_1 \mid S) = E[\{Y_1 - E(Y_1 \mid S)\}^2 \mid S] = E\big(E[\{Y_1 - E(Y_1 \mid S)\}^2 \mid X] \mid S\big) = E\{\mathrm{var}(Y_1 \mid X) \mid S\}.$$ Thus, the second term can be rewritten using $E(Y_1 \mid X) = E(Y_1 \mid S)$ and the fact that $\pi(X) = f(S)$ with probability 1: $$E\left[\left\{\frac{1}{\pi(X)} - 1\right\}\mathrm{var}(Y_1 \mid S)\right] = E\left[\left\{\frac{1}{f(S)} - 1\right\}E\{\mathrm{var}(Y_1 \mid X) \mid S\}\right] = E\left[E\left\{\left(\frac{1}{f(S)} - 1\right)\mathrm{var}(Y_1 \mid X) \,\Big|\, S\right\}\right] = E\left[\left\{\frac{1}{\pi(X)} - 1\right\}\mathrm{var}(Y_1 \mid X)\right].$$ Combining the equations above, the asymptotic variance becomes $$n\,\mathrm{var}\{\hat E_S(Y_1)\} = \mathrm{var}(Y_1) + E\left[\left\{\frac{1}{\pi(X)} - 1\right\}\mathrm{var}(Y_1 \mid X)\right].$$ According to Hahn (1998), $\mathrm{var}(Y_1) + E[\{\pi(X)^{-1} - 1\}\mathrm{var}(Y_1 \mid X)]$ is the semiparametric efficiency bound for estimating $E(Y_1)$. It means that when $\alpha$ and $\beta$ are both correctly specified, $\hat E_S(Y_1)$ achieves the optimal efficiency.

We can use a similar argument to prove that when $\alpha$ and $\gamma$ are both correctly specified, $\hat E_M(Y_0)$ achieves the optimal efficiency.

Note that the optimal efficiency of $\hat E_S(Y_1)$ and $\hat E_M(Y_0)$ implies that $\hat E_S(Y_1) - \hat E_M(Y_0)$ achieves the optimal efficiency for $\Delta$ when both Condition (i) and Condition (ii) are satisfied.

According to the proof of Theorem 3.1, the proposed estimator $\hat\Delta_{rdr} = \hat E_{\hat S}(Y_1) - \hat E_{\hat M}(Y_0)$ is asymptotically equivalent to $\hat E_S(Y_1) - \hat E_M(Y_0)$ under the given conditions. It follows that when both Condition (i) and Condition (ii) are satisfied, $\hat\Delta_{rdr}$ also achieves the optimal efficiency for $\Delta$.