Abstract
Data analysis in modern scientific research and practice has shifted from analysing a single dataset to coupling several datasets. We propose and study a kernel regression method that can handle the challenge of heterogeneous populations. It greatly extends the constrained kernel regression of Dai, C.-S., & Shao, J. [(2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press], which requires a homogeneous population across the different datasets. The asymptotic normality of the proposed estimators is established under some conditions, and simulation results are presented to confirm our theory and to quantify the improvements from datasets with heterogeneous populations.
1. Introduction
With advanced technologies in data collection and storage, modern statistical analyses often involve not only a primary random sample from a population of interest, which yields a dataset referred to as the internal dataset, but also some independent external datasets from sources such as past investigations and publicly available databases. In this paper, we consider nonparametric kernel regression (Bierens, 1987; Wand & Jones, 1994; Wasserman, 2006) between a univariate response Y and a covariate vector from a sampled subject, using the internal dataset with help from independent external datasets. Specifically, we consider kernel estimation of the conditional expectation (regression function) of Y given the covariates under the internal data population, (1) where D = 1 indicates the internal population and the point of evaluation is a fixed point in the range of the covariates. The indicator D can be either random or deterministic. The subscript 1 emphasizes that the quantity is for the internal data population (D = 1), which may differ from the corresponding unsubscripted quantity, a mixture over the internal and external data populations.
When the external datasets also have measurements of Y and the covariates, we may simply combine the internal and external datasets if the populations for the internal and external data are identical (homogeneous). However, heterogeneity typically exists among the populations for different datasets, especially when there are multiple external datasets collected in different ways and/or in different time periods. In Section 2, we propose a method to handle heterogeneity among different populations and derive a kernel regression estimator more efficient than the one using internal data alone. The result is also a crucial building block for the more complicated case in Section 3, where external datasets contain fewer measured covariates, as described next.
In applications, it often occurs that an external dataset has measured Y and only part of the covariate vector from each subject, i.e., some components of the covariates are not measured, due to high measurement cost, the progress of technology, and/or scientific relevance. With some covariate components unmeasured, the external dataset cannot be directly used to estimate the regression function in (1), since conditioning on the entire covariate vector is involved. To solve this problem, Dai and Shao (2023) propose a two-step kernel regression using external information as a constraint to improve kernel regression based on internal data alone, following the idea of using constraints in Chatterjee et al. (2016) and H. Zhang et al. (2020). However, these three cited papers mainly assume that the internal and external datasets share the same population, which may be unrealistic. The challenge in dealing with heterogeneity among different populations is similar to the difficulty in handling nonignorable missing data if the unmeasured covariate components are treated as missing data, although in missing data problems we usually want to estimate the regression function in (1).
In Section 3, we develop a methodology to handle population heterogeneity for internal and external datasets, which extends the procedure in Dai and Shao (2023) to heterogeneous populations and greatly widens its application scope.
Under each scenario, we derive asymptotic normality in Section 4 for the proposed kernel estimators and obtain the asymptotic variances explicitly, which is important for large sample inference. Some simulation results are presented in Section 5 to compare the finite sample performance of several estimators. Discussions on extensions and on handling high-dimensional covariates are given in Section 6. All technical details are in the Appendix.
Our research fits into a general framework of data integration (Kim et al., 2021; Lohr & Raghunathan, 2017; Merkouris, 2004; Rao, 2021; Yang & Kim, 2020; Y. Zhang et al., 2017).
2. Efficient kernel estimation by combining datasets
The internal dataset contains observations , , independent and identically distributed (iid) from the internal population, where Y is the response and the covariate is a p-dimensional vector associated with Y. We are interested in the estimation of the conditional expectation in (1). The standard kernel regression estimator based on the internal dataset alone is (2) where , is a given kernel function on the range of the covariates, and b>0 is a bandwidth depending on n. We assume that the covariates are standardized so that the same bandwidth b is used for every component in the kernel regression. Because of the well-known curse of dimensionality for kernel-type methods, we focus on a low dimension p not varying with n. A discussion of handling a high-dimensional covariate vector is given in Section 6.
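To fix ideas, here is a minimal sketch of the standard kernel regression (Nadaraya-Watson) estimator in (2). The function name, the use of a product Gaussian kernel, and the assumption of pre-standardized covariates are our illustrative choices, not prescribed by the paper; any second-order kernel would serve.

```python
import numpy as np

def nw_estimate(x0, X, Y, b):
    """Nadaraya-Watson estimate of m(x0) = E(Y | X = x0).

    X : (n, p) array of standardized covariates; Y : (n,) responses;
    b : common bandwidth (one bandwidth serves every component because
    the covariates are standardized).  A product Gaussian kernel is
    used here for concreteness.
    """
    u = (X - x0) / b                          # scaled differences (X_i - x0)/b
    w = np.exp(-0.5 * np.sum(u**2, axis=1))   # product Gaussian kernel weights
    return np.sum(w * Y) / np.sum(w)          # locally weighted average of Y
```

For example, with data generated from a linear regression function, the estimate at a point is close to the true regression value when n is moderate and b is chosen sensibly.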
We consider the case with one external dataset, independent of the internal dataset. Extension to multiple external datasets is straightforward and discussed in Section 6.
In this section we consider the situation where the external dataset contains iid observations , , from , the external population of .
2.1. Combining data from homogeneous populations
If we assume that the two populations are identical, then we can simply combine the two datasets to obtain the kernel estimator (3) which is obviously more efficient than the estimator in (2), as the sample size is increased to N>n. The estimator in (3), however, is not correct (i.e., it is biased) when the two populations are different, because the regression function for the external population may differ from that for the internal population.
2.2. Combining data from heterogeneous populations
We now derive a kernel estimator using the two datasets that is asymptotically correct regardless of whether the two populations are the same or not. Let be the conditional density of Y given the covariates and D = 1 or 0 (for the internal or external population). Then (4) The density ratio links the internal and external populations, so that we can overcome the difficulty in utilizing the external data under heterogeneous populations.
If we can construct an estimator of the density ratio for every y and every covariate value, for D = 0 or 1, then we can modify the estimator in (3) by replacing every external response (i>n) with a constructed response. The resulting kernel estimator is (5) Note that we use the internal data, , to obtain one estimator and the external data, , to construct the other. Applying kernel estimation, we obtain (6) where and are kernels with dimensions p + 1 and p and bandwidths and , respectively. The estimator in (5) is asymptotically valid under some regularity conditions on the kernels and bandwidths, summarized in Theorem 4.1 of Section 4.
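The construction above can be sketched as follows, assuming an estimated density ratio is available as a user-supplied function `r_hat` (the kernel-based construction in (6) is one way to obtain it). The function name, argument layout, and Gaussian kernel are our assumptions for illustration.

```python
import numpy as np

def combined_estimate(x0, X_int, Y_int, X_ext, Y_ext, r_hat, b):
    """Sketch of estimator (5): pool the internal data with external data
    whose responses are replaced by constructed responses
    Y_i * r_hat(Y_i, X_i), where r_hat estimates the conditional-density
    ratio p1(y|x)/p0(y|x) between internal and external populations."""
    Y_tilde = Y_ext * r_hat(Y_ext, X_ext)       # constructed external responses
    X_all = np.vstack([X_int, X_ext])           # pooled covariates
    Y_all = np.concatenate([Y_int, Y_tilde])    # pooled (constructed) responses
    u = (X_all - x0) / b
    w = np.exp(-0.5 * np.sum(u**2, axis=1))     # product Gaussian kernel
    return np.sum(w * Y_all) / np.sum(w)
```

When the two populations are homogeneous, the ratio is identically one and the sketch reduces to the pooled estimator (3).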
2.3. Combining data from heterogeneous populations with additional information
If additional information exists, then the approach in Section 2.2 can be improved. Assume that the internal and external datasets are formed according to a random binary indicator D such that , , are iid, where and are the observed internal data when , and are the observed external data when , and N is the known total sample size for the internal and external data. In this situation, the internal and external sample sizes are and N−n, respectively, both of which are random. In most applications, the assumption of a random D is not restrictive. From the identity (7) we just need to estimate and for every , constructed using, for example, the nonparametric estimators in Fan et al. (1998) for a binary response. For each estimator, both the internal and external data on the covariates and the indicator D are used.
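The identity (7) rests on Bayes' rule: the conditional-density ratio equals the odds of D = 1 given (y, x) times the inverse odds of D = 1 given x alone, so it suffices to kernel-regress the binary D on (Y, X) and on X. A hedged sketch, with function names and bandwidths as our assumptions:

```python
import numpy as np

def _nw(z0, Z, T, b):
    """Nadaraya-Watson regression of T on Z at z0 (Gaussian kernel)."""
    u = (Z - z0) / b
    w = np.exp(-0.5 * np.sum(u**2, axis=1))
    return np.sum(w * T) / np.sum(w)

def density_ratio(y0, x0, Y, X, D, b1, b0):
    """Estimate r(y0, x0) = p1(y0|x0)/p0(y0|x0) via Bayes' rule:
    r = [P(D=1|y,x)/P(D=0|y,x)] * [P(D=0|x)/P(D=1|x)],
    estimating both conditional probabilities by kernel regression of
    the binary indicator D (in the spirit of Fan et al., 1998)."""
    Z = np.column_stack([Y, X])                            # regress D on (Y, X)
    p_yx = _nw(np.concatenate([[y0], np.atleast_1d(x0)]), Z, D, b1)
    p_x = _nw(np.atleast_1d(x0), X, D, b0)                 # regress D on X alone
    return (p_yx / (1.0 - p_yx)) * ((1.0 - p_x) / p_x)
```

As a sanity check, when D is independent of (Y, X) the two probabilities coincide and the estimated ratio is close to one.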
A further improvement can be made if the following semi-parametric model holds, (8) where is an unspecified unknown function and γ is an unknown parameter. From (7)–(8), (9) If , then and the estimator in (3) is correct. Under (9) with , we just need to derive an estimator of γ and apply kernel estimation to estimate as a function of . Note that we do not need to estimate the unspecified function in (8), which is a nice feature of the semi-parametric model (8).
We now derive an estimator . Applying (7)–(8) to (4), we obtain that where the second and third equalities follow from (8) and the last equality follows from as . For every real number t, define Its estimator by kernel regression is (10) where is a kernel and is a bandwidth. Then, we estimate γ by (11) motivated by the fact that the objective function for the minimization in (11) approximates and, for any t, because .
Once is obtained, our estimator of is (12) with in view of (9).
In applications, we need to choose bandwidths given the sample sizes n and N−n. We can apply k-fold cross-validation as described in Györfi et al. (2002). Requirements on the rates of the bandwidths are given in the theorems in Section 4.
3. Constrained kernel regression with unmeasured covariates
We still consider the case with one external dataset, independent of the internal dataset. In this section, the external dataset contains iid observations , , from the external population , where the measured covariate is a q-dimensional sub-vector of the full covariate vector with q<p.
Since the external dataset has only the sub-vector, not the entire covariate vector, we cannot apply the method in Section 2 when q<p. Instead, we consider kernel regression using external information in a constraint. First, we consider the estimation of the n-dimensional vector , where denotes the transpose of a vector or matrix throughout. Note that the standard kernel regression (2) estimates as Taking partial derivatives with respect to the 's, we obtain (13) We improve this by the following constrained minimization, (14) (15) where , l in (14) is a bandwidth that may be different from b in (2) or (13), and is the kernel estimator of using the jth of the three methods described in Section 2, j = 1, 2, 3. Specifically, is given by (3), by (5), and by (12), with and replaced by and , respectively, and the kernels and bandwidths suitably adjusted as the dimensions of and are different. Note that can be computed since both the internal and external datasets have measured 's.
It turns out that the minimizer in (14) has an explicit form, where is the matrix whose ith row is and is the n-dimensional vector whose ith component is . Constraint (15) is an empirical analogue of the theoretical constraint (based on internal data), as . Thus, if is a good estimator of , then the constrained estimator in (14) is more accurate than the unconstrained one in (13).
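The explicit form is an instance of minimizing a quadratic subject to a linear constraint, which always admits a closed-form projection. The sketch below illustrates only this generic projection; the paper's specific weight matrix, constraint matrix, and right-hand side are built from the kernel weights and constraint (15), and the names here are ours.

```python
import numpy as np

def constrained_minimizer(theta_tilde, Q, A, c):
    """Minimize (theta - theta_tilde)' Q (theta - theta_tilde)
    subject to A theta = c, with Q positive definite and A full
    row rank: the Lagrangian yields
    theta = theta_tilde - Q^{-1} A' (A Q^{-1} A')^{-1} (A theta_tilde - c)."""
    QiAt = np.linalg.solve(Q, A.T)                         # Q^{-1} A'
    lam = np.linalg.solve(A @ QiAt, A @ theta_tilde - c)   # multipliers
    return theta_tilde - QiAt @ lam                        # projected solution
```

The output always satisfies the constraint exactly, and equals the unconstrained minimizer `theta_tilde` whenever `theta_tilde` already satisfies it.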
To obtain an improved estimator of the entire regression function in (1), not just at the internal data points, we apply the standard kernel regression with the response vector replaced by the constrained estimate in (14), which results in the following three estimators: (16) where is the ith component of the minimizer in (14) and b is the same bandwidth as in (2). The first estimator is simple, but can be incorrect when the two populations are different. The asymptotic validity of the other two estimators is established in the next section.
4. Asymptotic normality
We now establish the asymptotic normality of and for a fixed , as the sample size of the internal dataset increases to infinity. All technical proofs are given in the Appendix.
The first result concerns the estimator in (5). It is also applicable to the estimator in (3), with the added condition that .
Theorem 4.1
Assume the following conditions.
(B1) The densities and for , respectively under the internal and external populations, have continuous and bounded first- and second-order partial derivatives.
(B2) , , and the first- and second-order partial derivatives of are continuous and bounded, where , , and . Also, and are bounded for a constant s>2.
(B3) The kernel κ is of second order, i.e., and .
(B4) The bandwidth b satisfies and , where (assumed to exist without loss of generality).
(B5) The kernels and in (6) have bounded supports and orders and , respectively, as defined by Bierens (1987), and are th-order continuously differentiable with bounded partial derivatives, and and are th-order continuously differentiable with bounded partial derivatives. The functions and are bounded away from zero. The bandwidths and satisfy and .
Then, for any fixed with and , and in (5), (17) where denotes convergence in distribution as ,
Conditions (B1)–(B4) are typically assumed for kernel estimation (Bierens, 1987). Condition (B5) is a sufficient condition for (18) (Lemma 8.10 in Newey & McFadden, 1994), where denotes a term tending to 0 in probability. Result (18) implies that the estimation of the density ratio does not affect the asymptotic distribution of the estimator in (5).
Note that both the squared bias and the variance in (17) are decreasing in the limit , a quantity reflecting how much external data we have. In the extreme case of a = 0, i.e., when the size of the external dataset is negligible compared with the size of the internal dataset, result (17) reduces to the well-known asymptotic normality of the standard kernel estimator in (2) (Bierens, 1987). In the other extreme case of , on the other hand, and, hence, the proposed estimator has a convergence rate faster than , the convergence rate of the standard kernel estimator.
The next result concerns the estimator in (16) described in Section 3.
Theorem 4.2
Assume (B1)–(B5) with and p replaced by and q, respectively, and the following conditions, where and , k = 0, 1, are defined in (B1)–(B2).
(C1) The range of is a compact set in the p-dimensional Euclidean space and is bounded away from infinity and zero on ; and have continuous and bounded first- and second-order partial derivatives.
(C2) The functions and are Lipschitz continuous; has bounded third-order partial derivatives; has bounded first- and second-order partial derivatives; and is bounded with .
(C3) All kernel functions are positive, bounded, and Lipschitz continuous with mean zero and finite sixth moments.
(C4) and the bandwidths b in (2) and l in (14) satisfy , , , , and , as .
(C5) The densities and for , respectively under the internal and external populations, are bounded away from zero. There exists a constant s>4 such that and are finite, and are bounded, and the bandwidth for satisfies
Then, for any fixed and in (16), (19) where
and is assumed to be positive definite without loss of generality.
The next result concerns the estimator in (11).
Theorem 4.3
Suppose that (8) holds for the binary random D indicating internal and external data. Assume also the following conditions.
(D1) The kernel in (10) is Lipschitz continuous, satisfies , has a bounded support, and has order .
(D2) The bandwidth in (10) satisfies and as the total sample size of the internal and external datasets , where d is given in (D1).
(D3) γ in (8) is an interior point of a compact domain Γ and is the unique solution to , . For any , is second-order continuously differentiable in t, and h, , are bounded over t and . As , , , and converge uniformly.
(D4) and is bounded, where , , and is the density of . Furthermore, there is a function with such that .
(D5) The function is bounded away from zero, and it is dth-order continuously differentiable with bounded partial derivatives on an open set containing the support of . There is a functional linear in such that and, for small enough , , where is a function with , , is the jth component of , , , and is the range of . Also, there exists an almost everywhere continuous 8-dimensional function with and for some such that for all .
Then, as the total sample size of the internal and external datasets , (20) where .
Conditions (D1)–(D5) are technical assumptions discussed in Lemmas 8.11 and 8.12 of Newey and McFadden (1994). As discussed there, the condition that has a bounded support can be relaxed, as it is imposed only to simplify the proof.
Combining Theorems 4.1–4.3, we obtain the following result for the estimator in (12) or in (16).
Corollary 4.1
Suppose that (8) holds for the binary random D indicating internal and external data.
(i) Under (B1)–(B4) and (D1)–(D5), result (17) holds with replaced by .
(ii) Under (C1)–(C4) and (D1)–(D5) with and p replaced by and q, respectively, result (19) holds with replaced by .
5. Simulation results
5.1. The performance of given by (16)
We first present simulation results to examine and compare the performance of the standard kernel estimator in (2), which uses no external information, and our proposed estimator (16) with its three variations, , , and , as described at the end of Section 3. We consider univariate covariates X and Z, where Z is unmeasured in the external dataset (p = 2 and q = 1). The covariates are generated in two ways:
normal covariates: is bivariate normal with means 0, variances 1, and correlation 0.5;
bounded covariates: and , where , and are identically distributed as uniform on , B is uniform on , and , , and B are independent.
Conditioned on , the response Y is normal with mean and variance 1, where follows one of the following four models:
(M1) ;
(M2) ;
(M3) ;
(M4) .
Note that all four models are nonlinear in ; (M1)-(M2) are additive models, while (M3)-(M4) are non-additive.
A total of N = 1,200 data points are generated from the population described above. A data point is treated as internal or external according to a random binary D with conditional probability , where or 1/2, and or . Under these settings, the unconditional probability of D = 1 is around 13% or 50%.
The simulation studies the performance of kernel estimators in terms of the mean integrated squared error (MISE). The following measure is calculated by simulation with S replications: (21) where are the test data for simulation replication s, the simulation is repeated independently for , and is one of , , , and , independent of the test data. We consider two ways of generating the test data 's. The first is to use T = 121 fixed, equally spaced grid points on . The second is to take a random sample of size T = 121 without replacement from the covariate 's of the internal dataset, for each fixed and independently across s.
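The simulated MISE measure (21) averages the squared estimation error over the T test points and the S replications; a minimal sketch (array layout and function name are our choices):

```python
import numpy as np

def simulated_mise(fits, truths):
    """Simulated MISE measure (21): fits[s][t] is the fitted value and
    truths[s][t] the true regression value at test point t in
    replication s; the measure averages the squared error over the
    T test points and the S replications."""
    fits = np.asarray(fits, dtype=float)
    truths = np.asarray(truths, dtype=float)
    return float(np.mean((fits - truths) ** 2))
```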
To show the benefit of using external information, we calculate the improvement in efficiency defined as follows: (22) where the minimum is over one of , , , and .
In all cases, we use the Gaussian kernel. The bandwidths b and l affect the performance of kernel methods, and we consider two types of bandwidths in the simulation. The first is 'the best bandwidth': for each method, we evaluate the MISE over a pool of bandwidths and report the one with the minimal MISE. This shows the best we can achieve in terms of bandwidth, but it cannot be used in applications. The second is to select the bandwidth from a pool of bandwidths via 10-fold cross-validation (Györfi et al., 2002), which produces a decent bandwidth that can be applied to real data.
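The cross-validated bandwidth choice can be sketched as follows, assuming a Nadaraya-Watson fit and squared prediction error as the cross-validation criterion (the helper names and default fold count are our assumptions):

```python
import numpy as np

def cv_bandwidth(X, Y, bandwidths, k=10, seed=0):
    """Select a bandwidth by k-fold cross-validation (Gyorfi et al.,
    2002): for each candidate b, predict each fold by Nadaraya-Watson
    fitted on the remaining folds, and pick the b with the smallest
    total squared prediction error."""
    def nw(x0, Xtr, Ytr, b):
        u = (Xtr - x0) / b
        w = np.exp(-0.5 * np.sum(u**2, axis=1))
        return np.sum(w * Ytr) / np.sum(w)

    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), k)
    best_b, best_err = None, np.inf
    for b in bandwidths:
        err = 0.0
        for fold in folds:
            mask = np.ones(len(Y), dtype=bool)
            mask[fold] = False                     # hold out the current fold
            preds = np.array([nw(x, X[mask], Y[mask], b) for x in X[fold]])
            err += np.sum((preds - Y[fold]) ** 2)
        if err < best_err:
            best_b, best_err = b, err
    return best_b
```

The same loop applies to the second bandwidth l by cross-validating the corresponding fit.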
The simulated MISE values based on S = 200 replications are shown in Tables –.
Consider first the results in Tables –. Since , all three estimators, , , and , are correct and more efficient than the standard estimator in (2) that uses no external information. The estimator is the best, as it uses the correct information that the populations are homogeneous () and is simpler than and .
Next, the results in Tables – for indicate that the estimator or using a correct constraint is better than the estimator using an incorrect constraint or the estimator without using external information. Since uses more information, it is in general better than . Furthermore, with an incorrect constraint, can be much worse than without using external information.
5.2. The performance of given by (3), (5), or (12)
Under the same simulation setting as in Section 5.1, but with the covariate Z measured in both the internal and external datasets, we compare the performance of the three estimators, , , and , given by (3), (5), and (12), respectively, with the standard kernel estimator in (2) that uses no external information. The MISE and the improvement (IMP) are calculated using formulas (21) and (22), respectively, with one of , , , and .
Tables – present the simulation results. The relative performance of , , , and follows the same pattern as , , , and in Section 5.1.
The only difference between the results here and those in Section 5.1 is that using more external data (a smaller n/N) results in better performance of or (or , when it is correct). This is consistent with Theorem 4.1 in Section 4, which shows that both the squared bias and the variance in (17) are decreasing in the limit . On the other hand, the simulation results in Section 5.1 and Theorem 4.2 in Section 4 do not clearly indicate that using more external data produces better estimators. The main reason is that, when Z is not observed in the external dataset, the estimator relies more on the internal data to recover the loss of Z from the external dataset in a complicated way.
5.3. The performance of given by (16) with q = 2
We re-consider the simulation in Section 5.1 but with the dimension set to q = 2, i.e., . We only consider normally distributed covariates with means 0, variances 1, and correlations in , , and of 0.5, 0.5, and 0.25, respectively. Given , the response Y is normally distributed with mean and variance 1. Moreover, , while the remaining settings are the same as in Section 5.1. In calculating the MISE (21), we use only a random sample of T = 121 test points, not fixed grid points. Also, we evaluate only the estimators , since they are simpler.
The results are shown in Table . Compared with the results in Tables – for the case of q = 1, the MISEs here are larger due to having more covariates (q = 2). But the relative performance of the estimators is the same as in Tables –.
6. Discussion
The curse of dimensionality is a well-known problem for nonparametric methods. Thus, the proposed method in Section 2 is intended for a low-dimensional covariate vector, i.e., small p. If p is not small, then we should reduce the dimension of the covariates prior to applying the proposed constrained kernel method, or any kernel method. For example, consider a single-index model assumption (K.-C. Li, 1991), i.e., the regression function in (1) is assumed to be (23) where is an unknown p-dimensional vector. The well-known sliced inverse regression (SIR) technique (K.-C. Li, 1991) can be applied to obtain a consistent and asymptotically normal estimator of the index vector in (23). Once the index vector is estimated, the kernel method can be applied with the covariates replaced by the one-dimensional 'covariate' . We can also apply other dimension reduction techniques developed under assumptions weaker than (23) (Cook & Weisberg, 1991; B. Li & Wang, 2007; Ma & Zhu, 2012; Y. Shao et al., 2007; Xia et al., 2002).
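A compact sketch of the SIR step under (23), assuming iid covariates with a nonsingular covariance matrix; the slicing scheme, function name, and choice of ten slices are our illustrative assumptions (the direction is identified only up to sign and scale):

```python
import numpy as np

def sir_direction(X, Y, n_slices=10):
    """Sliced inverse regression (Li, 1991): estimate the index
    direction in the single-index model (23).  Standardize X, slice
    the data by the order of Y, form the weighted covariance of the
    slice means of the standardized X, and back-transform the top
    eigenvector.  Returns a unit vector."""
    n, p = X.shape
    mu = X.mean(axis=0)
    cov = np.cov(X.T)
    L = np.linalg.cholesky(np.linalg.inv(cov))   # cov^{-1} = L L'
    Z = (X - mu) @ L                             # standardized: cov(Z) = I
    slices = np.array_split(np.argsort(Y), n_slices)
    M = np.zeros((p, p))
    for s in slices:
        m = Z[s].mean(axis=0)                    # slice mean of standardized X
        M += (len(s) / n) * np.outer(m, m)       # weighted between-slice cov
    _, vecs = np.linalg.eigh(M)
    beta = L @ vecs[:, -1]                       # back to original coordinates
    return beta / np.linalg.norm(beta)
```

In a simple check with a linear index, the estimated direction aligns closely with the true one.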
We turn to the dimension of the covariates in the external dataset. When this dimension is high, we may consider the following approach. Instead of using constraint (15), we use the component-wise constraints (24) where is the kth component of , , and is an estimator of using the methods described in Section 2. More constraints are involved in (24), but each estimation involves only a one-dimensional covariate.
The kernel κ adopted in (2) and (16) is a second-order kernel, so the convergence rate of is . An mth-order kernel with m>2, as defined by Bierens (1987), may be used to achieve the convergence rate . Alternatively, we may apply other nonparametric smoothing techniques, such as local polynomials (Fan et al., 1997), to achieve the convergence rate with .
Our results can be extended to scenarios where several external datasets are available. Since each external source may provide different covariate variables, we may need to apply the component-wise constraints (24), estimating by combining all the external sources that collect the covariate . If the populations of the external datasets are different, then we may have to apply a combination of the methods described in Section 2.
Acknowledgments
The authors would like to thank two anonymous referees for helpful comments and suggestions.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- Bierens, H. J. (1987). Kernel estimators of regression functions. In Advances in Econometrics: Fifth World Congress (Vol. 1, pp. 99–144). Cambridge University Press.
- Chatterjee, N., Chen, Y. H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
- Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414), 328–332. https://doi.org/10.2307/2290564
- Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press. https://doi.org/10.5705/ss.202021.0446
- Fan, J., Farmen, M., & Gijbels, I. (1998). Local maximum likelihood estimation and inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(3), 591–608. https://doi.org/10.1111/1467-9868.00142
- Fan, J., Gasser, T., Gijbels, I., Brockmann, M., & Engel, J. (1997). Local polynomial regression: optimal kernels and asymptotic minimax efficiency. Annals of the Institute of Statistical Mathematics, 49(1), 79–99. https://doi.org/10.1023/A:1003162622169
- Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. Springer.
- Kim, H. J., Wang, Z., & Kim, J. K. (2021). Survey data integration for regression analysis using model calibration. arXiv:2107.06448.
- Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479), 997–1008. https://doi.org/10.1198/016214507000000536
- Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. https://doi.org/10.1080/01621459.1991.10475035
- Lohr, S. L., & Raghunathan, T. E. (2017). Combining survey data with other data sources. Statistical Science, 32(2), 293–312. https://doi.org/10.1214/16-STS584
- Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497), 168–179. https://doi.org/10.1080/01621459.2011.646925
- Merkouris, T. (2004). Combining independent regression estimators from multiple surveys. Journal of the American Statistical Association, 99(468), 1131–1139. https://doi.org/10.1198/016214504000000601
- Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142. https://doi.org/10.1137/1109020
- Newey, W. K. (1994). Kernel estimation of partial means and a general variance estimator. Econometric Theory, 10(2), 1–21. https://doi.org/10.1017/S0266466600008409.
- Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245. https://doi.org/10.1016/S1573-4412(05)80005-4
- Rao, J. (2021). On making valid inferences by integrating data from surveys and other sources. Sankhya B, 83(1), 242–272. https://doi.org/10.1007/s13571-020-00227-w
- Shao, J. (2003). Mathematical statistics. 2nd ed., Springer.
- Shao, Y., Cook, R. D., & Weisberg, S. (2007). Marginal tests with sliced average variance estimation. Biometrika, 94(2), 285–296. https://doi.org/10.1093/biomet/asm021
- Wand, M. P., & Jones, M. C. (1994). Kernel smoothing. Monographs on Statistics and Applied Probability No. 60. Chapman & Hall/CRC.
- Wasserman, L. (2006). All of nonparametric statistics. Springer.
- Xia, Y., Tong, H., Li, W. K., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 363–410. https://doi.org/10.1111/1467-9868.03411
- Yang, S., & Kim, J. K. (2020). Statistical data integration in survey sampling: a review. Japanese Journal of Statistics and Data Science, 3(2), 625–650. https://doi.org/10.1007/s42081-020-00093-w
- Zhang, H., Deng, L., Schiffman, M., Qin, J., & Yu, K. (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107(3), 689–703. https://doi.org/10.1093/biomet/asaa014
- Zhang, Y., Ouyang, Z., & Zhao, H. (2017). A statistical framework for data integration through graphical models with application to cancer genomics. The Annals of Applied Statistics, 11(1), 161–184. https://doi.org/10.1214/16-AOAS998
Appendix
Proof of Theorem 4.1.
Let where , , and Under (B3)–(B4), Theorem 2 in Nadaraya (1964) shows that converges to in probability. Under (B1)–(B4), , and , . Then (17) holds for , by Slutsky's theorem, the independence between and , and the definition of a. The desired result (17) follows from the fact that is bounded by (A1) which is by result (18) under condition (B5).
Proof of Theorem 4.2.
Write (A2) where , , , , , , , , is the identity matrix of order n, is the n-vector with all components being 1, is the diagonal matrix whose ith diagonal element is , is the matrix whose th entry is , with , is the n-dimensional vector whose ith component is , , and , , and are defined in Section 2.
We first show that in (A2) is asymptotically normal with mean 0 and variance defined in Theorem 4.2. Consider a further decomposition , where is a V-statistic with and Note that has variance where is given in condition (C2), and the second and third equalities follow from changing variables and , respectively. From the continuity of and , converges to Therefore, by the theory of asymptotic normality for V-statistics (e.g., Theorem 3.16 in J. Shao, 2003), .
Conditioned on , has mean 0 and variance This proves that . Note that and is bounded by Therefore, under the assumed condition that is bounded away from zero, Lemma 3 in Dai and Shao (2023) implies . Note that Conditioned on , has mean 0 and variance because, under the assumed condition that is bounded away from zero, Lemma 3 in Dai and Shao (2023) implies Thus, Consequently, has the same asymptotic distribution as , which is the claimed result.
From Lemma 4 in Dai and Shao (2023) and (C4), Note that where the second equality follows from (A4) and Lemmas 3–4 in Dai and Shao (2023), and the last equality follows from Lemma 2 in Dai and Shao (2023) and the continuity of . Also, where the first equality follows from Lemma 3 in Dai and Shao (2023) and the law of large numbers, the second equality follows from Lemma 4 in Dai and Shao (2023), and the last equality follows from the law of large numbers. Similarly, where the second equality follows from Lemma 3 in Dai and Shao (2023). Under (B1)–(B5) with and p replaced by and q, and (C5), Lemma 8.10 in Newey and McFadden (1994) implies that (A3) which is and, hence, From Lemma 3 in Dai and Shao (2023) and the central limit theorem, Combining these results, we obtain . This completes the proof.
Proof of Theorem 4.3.
Define Then, , , and Let , , and Taking derivatives with respect to t, we obtain and where ψ is given in (D5). Note that and . We establish the asymptotic normality of in the following four steps.
Step 1: Since γ is the unique minimizer of , by Theorem 2.1 in Newey and McFadden (1994), it suffices to prove that Note that From (D3), is bounded by for a constant c, and hence Lemma 2.4 in Newey and McFadden (1994) implies that Based on Lemma B.3 in Newey (1994), conditions (D1)–(D4) imply that for all . As a result, by an argument similar to the proof of Lemma B.3 in Newey (1994), we obtain that Since is bounded away from zero and and are Lipschitz continuous in , and These results together with the previous inequality imply that
Step 2: Conditions (D1)–(D5) ensure that Lemma 8.11 in Newey and McFadden (1994) holds and hence with .
Step 3: Note that and where , , and the last term The law of large numbers guarantees that . An argument similar to Step 1 shows that . For , we have Under (D3), , , and converge uniformly for all as and, thus, because . This shows that
Step 4: By Taylor's expansion, for some . From the results in Steps 1–3, This completes the proof of (20).
Proof of Corollary 4.1.
From Theorem 4.3, (20) shows that . Furthermore, Lemma 8.10 in Newey and McFadden (1994) shows that (A4) which is under the assumed conditions and . Since converges faster than (A4), (18) holds. As a result, (17) holds with replaced by under (B1)–(B4) and (D1)–(D5).
Under (D1)–(D5) with replaced by and p replaced by q, Lemma 8.10 in Newey and McFadden (1994) implies that From the asymptotic normality of , , which converges to 0 faster than . Hence (A3) holds when is estimated by . Then, the rest of the proof of the second claim follows the argument in the proof of Theorem 4.2.