
Kernel regression utilizing heterogeneous datasets

Pages 51-68 | Received 05 Dec 2022, Accepted 08 Apr 2023, Published online: 28 Apr 2023

Abstract

Data analysis in modern scientific research and practice has shifted from analysing a single dataset to coupling several datasets. We propose and study a kernel regression method that can handle the challenge of heterogeneous populations. It greatly extends the constrained kernel regression [Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press], which requires the different datasets to come from a homogeneous population. The asymptotic normality of the proposed estimators is established under some conditions, and simulation results are presented to confirm our theory and to quantify the improvement gained from datasets with heterogeneous populations.

1. Introduction

With advanced technologies in data collection and storage, in modern statistical analyses we have not only a primary random sample from a population of interest, which results in a dataset referred to as the internal dataset, but also some independent external datasets from sources such as past investigations and publicly available datasets. In this paper, we consider nonparametric kernel regression (Bierens, Citation1987; Wand & Jones, Citation1994, December; Wasserman, Citation2006) between a univariate response Y and a covariate vector U from a sampled subject, using the internal dataset with help from independent external datasets. Specifically, we consider kernel estimation of the conditional expectation (regression function) of Y given U = u under the internal data population, (1) μ1(u) = E(Y | U = u, D = 1), where D = 1 indicates the internal population and u is a fixed point in U, the range of U. The indicator D can be either random or deterministic. The subscript 1 in μ1(u) emphasizes that it is for the internal data population (D = 1), which may be different from μ(u) = E(Y | U = u), a mixture of quantities from the internal and external data populations.

When external datasets also have measurements Y and U, we may simply combine the internal and external datasets when the populations for internal and external data are identical (homogeneous). However, heterogeneity typically exists among populations for different datasets, especially when there are multiple external datasets collected in different ways and/or different time periods. In Section 2, we propose a method to handle heterogeneity among different populations and derive a kernel regression more efficient than the one using internal data alone. The result is also a crucial building block for the more complicated case in Section 3 where external datasets contain fewer measured covariates as described next.

In applications, it often occurs that an external dataset has measured Y and X from each subject, where X is a part of the vector U; that is, some components of U are not measured, due to high measurement cost, the progress of technology, and/or scientific relevance. With some components of U unmeasured, the external dataset cannot be directly used to estimate μ1(u) in (Equation1), since conditioning on the entire U is involved. To solve this problem, Dai and Shao (Citation2023) propose a two-step kernel regression using external information as a constraint to improve kernel regression based on internal data alone, following the idea of using constraints in Chatterjee et al. (Citation2016) and H. Zhang et al. (Citation2020). However, these three cited papers mainly assume that the internal and external datasets share the same population, which may be unrealistic. The challenge in dealing with heterogeneity among different populations is similar to the difficulty in handling nonignorable missing data if the unmeasured components of U are treated as missing data, although in missing data problems we usually want to estimate μ(u) = E(Y | U = u) rather than μ1(u) in (Equation1).

In Section 3, we develop a methodology to handle population heterogeneity for internal and external datasets, which extends the procedure in Dai and Shao (Citation2023) to heterogeneous populations and greatly widens its application scope.

Under each scenario, we derive asymptotic normality in Section 4 for the proposed kernel estimators and obtain explicitly the asymptotic variances, which is important for large sample inference. Some simulation results are presented in Section 5 to compare the finite sample performance of several estimators. Discussions on extensions and on handling high-dimensional covariates are given in Section 6. All technical details are in the Appendix.

Our research fits into a general framework of data integration (Kim et al., Citation2021; Lohr & Raghunathan, Citation2017; Merkouris, Citation2004; Rao, Citation2021; Yang & Kim, Citation2020; Y. Zhang et al., Citation2017).

2. Efficient kernel estimation by combining datasets

The internal dataset contains observations (Yi, Ui), i = 1, …, n, independent and identically distributed (iid) from P1, the internal population of (Y, U), where Y is the response and U is a p-dimensional covariate vector associated with Y. We are interested in the estimation of the conditional expectation μ1(u) in (Equation1). The standard kernel regression estimator of μ1(u) based on the internal dataset alone is (2) μˆ1(u) = ∑_{i=1}^n Yi κb(u − Ui) / ∑_{i=1}^n κb(u − Ui), where κb(a) = b^{−p} κ(a/b), κ(⋅) is a given kernel function on U (the range of U), and b > 0 is a bandwidth depending on n. We assume that U is standardized so that the same bandwidth b is used for every component of U in kernel regression. Because of the well-known curse of dimensionality for kernel-type methods, we focus on a low dimension p not varying with n. A discussion of handling a large-dimensional U is given in Section 6.
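
To fix ideas, a minimal sketch of (Equation2) with a product Gaussian kernel is given below; this is our own illustration (the function and variable names are ours), not code from the paper.

```python
import numpy as np

def gaussian_kernel(t):
    # product Gaussian kernel evaluated row-wise on an (m, p) array of scaled differences
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def nw_estimate(u, U, Y, b):
    """Standard kernel (Nadaraya-Watson) estimate of mu_1(u) as in (2).

    u : (p,) evaluation point;  U : (n, p) internal covariates;
    Y : (n,) internal responses;  b : bandwidth.
    kappa_b(a) = b^{-p} kappa(a/b); the factor b^{-p} cancels in the ratio.
    """
    w = gaussian_kernel((u - U) / b)       # kappa((u - U_i)/b), i = 1, ..., n
    return np.sum(w * Y) / np.sum(w)       # sum_i Y_i w_i / sum_i w_i
```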

We consider the case with one external dataset, independent of the internal dataset. Extension to multiple external datasets is straightforward and discussed in Section 6.

In this section we consider the situation where the external dataset contains iid observations (Yi,Ui), i=n+1,,N, from P0, the external population of (Y,U).

2.1. Combining data from homogeneous populations

If we assume that the two populations P1 and P0 are identical, then we can simply combine the two datasets to obtain the kernel estimator (3) μˆ1E1(u) = ∑_{i=1}^N Yi κb(u − Ui) / ∑_{i=1}^N κb(u − Ui), which is obviously more efficient than μˆ1(u) in (Equation2) as the sample size is increased to N > n. The estimator μˆ1E1(u) in (Equation3), however, is not correct (i.e., it is biased) when the populations P1 and P0 are different, because E(Y | U = u, D = 0) for the external population may differ from μ1(u) = E(Y | U = u, D = 1) for the internal population.

2.2. Combining data from heterogeneous populations

We now derive a kernel estimator that uses the two datasets and is asymptotically correct regardless of whether P1 and P0 are the same or not. Let f(y|u, D) be the conditional density of Y given U = u and D = 1 or 0 (for the internal or external population). Then (4) μ1(u) = E(Y | U = u, D = 1) = E{Y f(Y|u, D = 1)/f(Y|u, D = 0) | U = u, D = 0}. The ratio f(Y|u, D = 1)/f(Y|u, D = 0) links the internal and external populations so that we can overcome the difficulty in utilizing the external data under heterogeneous populations.
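
Identity (Equation4) is a change-of-measure argument; writing the conditional expectation as an integral over f(y|u, D = 0) makes it explicit:

```latex
\begin{aligned}
E\left\{Y\,\frac{f(Y\mid u,D=1)}{f(Y\mid u,D=0)}\,\Big|\,U=u,\,D=0\right\}
&=\int y\,\frac{f(y\mid u,D=1)}{f(y\mid u,D=0)}\,f(y\mid u,D=0)\,dy\\
&=\int y\,f(y\mid u,D=1)\,dy=\mu_1(u).
\end{aligned}
```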

If we can construct an estimator fˆ(y|u, D) of f(y|u, D) for every y, u, and D = 0 or 1, then we can modify the estimator in (Equation3) by replacing every Yi with i > n by the constructed response Yˆi = Yi fˆ(Yi|Ui, D = 1)/fˆ(Yi|Ui, D = 0). The resulting kernel estimator is (5) μˆ1E2(u) = {∑_{i=1}^n Yi κb(u − Ui) + ∑_{i=n+1}^N Yˆi κb(u − Ui)} / ∑_{i=1}^N κb(u − Ui). Note that we use the internal data (Yi, Ui), i = 1, …, n, to obtain the estimator fˆ(Yi|Ui, D = 1) and the external data (Yi, Ui), i = n+1, …, N, to construct the estimator fˆ(Yi|Ui, D = 0). Applying kernel estimation, we obtain (6) fˆ(y|U = u, D = 1) = ∑_{i=1}^n κ~b~(y − Yi, u − Ui) / ∑_{i=1}^n κ¯b¯(u − Ui) and fˆ(y|U = u, D = 0) = ∑_{i=n+1}^N κ~b~(y − Yi, u − Ui) / ∑_{i=n+1}^N κ¯b¯(u − Ui), where κ~ and κ¯ are kernels of dimensions p + 1 and p with bandwidths b~ and b¯, respectively. The estimator in (Equation5) is asymptotically valid under some regularity conditions on the kernels and bandwidths, summarized in Theorem 4.1 of Section 4.
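
A compact sketch of (Equation5)–(Equation6) follows; it is our own illustration under a product Gaussian kernel (all names are ours), with the density-ratio step implemented as the plug-in in (Equation6).

```python
import numpy as np

def prod_gauss(t):
    # product Gaussian kernel on an (m, d) array of scaled differences
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def cond_density(y, u, Y, U, b_joint, b_marg):
    """Kernel estimate of f(y | U = u) from one sample (Y, U), as in (6)."""
    p = U.shape[1]
    joint = np.column_stack([Y - y, U - u])                       # (y - Y_i, u - U_i)
    num = np.sum(prod_gauss(joint / b_joint)) / b_joint ** (p + 1)
    den = np.sum(prod_gauss((U - u) / b_marg)) / b_marg ** p
    return num / den

def mu1_E2(u, Y_int, U_int, Y_ext, U_ext, b, b_joint, b_marg):
    """Estimator (5): reweight external responses by the estimated density ratio, then pool."""
    ratio = np.array([
        cond_density(yi, ui, Y_int, U_int, b_joint, b_marg) /     # f-hat(Y_i | U_i, D = 1)
        cond_density(yi, ui, Y_ext, U_ext, b_joint, b_marg)       # f-hat(Y_i | U_i, D = 0)
        for yi, ui in zip(Y_ext, U_ext)
    ])
    Y_hat_ext = Y_ext * ratio                                     # constructed responses Y-hat_i
    U_all = np.vstack([U_int, U_ext])
    Y_all = np.concatenate([Y_int, Y_hat_ext])
    w = prod_gauss((u - U_all) / b)
    return np.sum(w * Y_all) / np.sum(w)
```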

2.3. Combining data from heterogeneous populations with additional information

If additional information exists, then the approach in Section 2.2 can be improved. Assume that the internal and external datasets are formed according to a random binary indicator D such that (Yi, Ui, Di), i = 1, …, N, are iid distributed as (Y, U, D), where Yi and Ui are observed internal data when Di = 1, Yi and Ui are observed external data when Di = 0, and N is still the known total sample size for the internal and external data. In this situation, the internal and external sample sizes are n = ∑_{i=1}^N Di and N − n, respectively, both of which are random. In most applications, the assumption of a random D is not restrictive. From the identity (7) f(Y|u, D = 1)/f(Y|u, D = 0) = {P(D = 1|U = u, Y)/P(D = 0|U = u, Y)} × {P(D = 0|U = u)/P(D = 1|U = u)}, we just need to estimate P(D = 1|U = u, Y) and P(D = 1|U = u) for every u, which can be constructed using, for example, the nonparametric estimators in Fan et al. (Citation1998) for a binary response. For each estimator, both internal and external data on (Y, U) and the indicator D are used.

A further improvement can be made if the following semi-parametric model holds: (8) P(D = 0|U, Y)/P(D = 1|U, Y) = exp{α(U) + γY}, where α(⋅) is an unspecified unknown function and γ is an unknown parameter. From (Equation7)–(Equation8), (9) f(Y|u, D = 1)/f(Y|u, D = 0) = e^{−γY} E(e^{γY} | U = u, D = 1). If γ = 0, then f(Y|u, D = 1) = f(Y|u, D = 0) and the estimator μˆ1E1(u) in (Equation3) is correct. Under (Equation9) with γ ≠ 0, we just need to derive an estimator γˆ of γ and apply kernel estimation to estimate E(e^{γˆY} | U = u, D = 1) as a function of u. Note that we do not need to estimate the unspecified function α(⋅) in (Equation8), which is a nice feature of the semi-parametric model (Equation8).

We now derive an estimator γˆ. Applying (Equation7)–(Equation8) to (Equation4), we obtain that
μ1(u) = E{Y P(D = 1|U = u, Y)/P(D = 0|U = u, Y) | U = u, D = 0} × P(D = 0|U = u)/P(D = 1|U = u)
= E(Y e^{−α(u)−γY} | U = u, D = 0) × E{P(D = 0|U = u, Y) | U = u}/P(D = 1|U = u)
= e^{−α(u)} E(Y e^{−γY} | U = u, D = 0) × E{e^{α(u)+γY} P(D = 1|U = u, Y) | U = u}/P(D = 1|U = u)
= E(Y e^{−γY} | U = u, D = 0) × E{e^{γY} E(D|U = u, Y) | U = u}/P(D = 1|U = u)
= E(Y e^{−γY} | U = u, D = 0) × E(e^{γY} D | U = u)/P(D = 1|U = u)
= E(Y e^{−γY} | U = u, D = 0) × E(e^{γY} | U = u, D = 1),
where the second and third equalities follow from (Equation8) and the last equality follows from E(e^{γY} D | U = u) = E(e^{γY} D | U = u, D = 1) P(D = 1|U = u) + E(e^{γY} D | U = u, D = 0) P(D = 0|U = u) = E(e^{γY} | U = u, D = 1) P(D = 1|U = u), as E(e^{γY} D | U = u, D = 0) = 0. For every real number t, define h(u, t) = E(Y e^{−tY} | U = u, D = 0) E(e^{tY} | U = u, D = 1). Its estimator by kernel regression is
(10) hˆ(u, t) = [∑_{i=1}^N (1 − Di) κˇbˇ(u − Ui) Yi e^{−tYi} / ∑_{i=1}^N (1 − Di) κˇbˇ(u − Ui)] × [∑_{i=1}^N Di κˇbˇ(u − Ui) e^{tYi} / ∑_{i=1}^N Di κˇbˇ(u − Ui)],
where κˇ is a kernel and bˇ is a bandwidth. Then, we estimate γ by (11) γˆ = argmin_t (1/N) ∑_{i=1}^N Di {Yi − hˆ(Ui, t)}², motivated by the fact that the objective function for minimization in (Equation11) approximates E[D{Y − h(U, t)}² | D = 1] and, for any t, E[D{Y − h(U, γ)}² | D = 1] ≤ E[D{Y − h(U, t)}² | D = 1], because h(u, γ) = μ1(u).
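
To illustrate (Equation10)–(Equation11), the sketch below computes hˆ(u, t) from the two kernel regressions and minimizes the criterion in (Equation11) over a user-supplied grid of t values; the grid search and all names are our own simplifications, not the authors' implementation.

```python
import numpy as np

def prod_gauss(t):
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def h_hat(u, t, Y, U, D, b_check):
    """Estimator (10): kernel estimate of E(Y e^{-tY} | U=u, D=0) times that of E(e^{tY} | U=u, D=1)."""
    w = prod_gauss((u - U) / b_check)
    w0, w1 = w * (1 - D), w * D                                   # external / internal weights
    ext_part = np.sum(w0 * Y * np.exp(-t * Y)) / np.sum(w0)
    int_part = np.sum(w1 * np.exp(t * Y)) / np.sum(w1)
    return ext_part * int_part

def gamma_hat(Y, U, D, b_check, t_grid):
    """Estimator (11): least squares over the internal observations, minimized on a grid of t values."""
    Y1, U1 = Y[D == 1], U[D == 1]
    losses = []
    for t in t_grid:
        fitted = np.array([h_hat(ui, t, Y, U, D, b_check) for ui in U1])
        losses.append(np.mean((Y1 - fitted) ** 2))                # proportional to the criterion in (11)
    return t_grid[int(np.argmin(losses))]
```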

Once γˆ is obtained, our estimator of μ1(u) is (12) μˆ1E3(u) = {∑_{i=1}^N Di Yi κb(u − Ui) + ∑_{i=1}^N (1 − Di) Yˆi κb(u − Ui)} / ∑_{i=1}^N κb(u − Ui), with Yˆi = Yi e^{−γˆYi} ∑_{j=1}^n e^{γˆYj} κˇbˇ(Ui − Uj) / ∑_{j=1}^n κˇbˇ(Ui − Uj), in view of (Equation9).
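
Continuing the sketch, (Equation12) reweights each external response by e^{−γˆYi} times a kernel estimate of E(e^{γˆY} | U = Ui, D = 1) built from the internal data, and then pools; again this is our own hedged illustration with our own names.

```python
import numpy as np

def prod_gauss(t):
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def mu1_E3(u, Y, U, D, gamma, b, b_check):
    """Estimator (12), given an estimate gamma of the tilt parameter in (8)."""
    Y_int, U_int = Y[D == 1], U[D == 1]
    Y_ext, U_ext = Y[D == 0], U[D == 0]

    def tilt(ui):
        # kernel estimate of E(e^{gamma Y} | U = ui, D = 1) from the internal data
        w = prod_gauss((ui - U_int) / b_check)
        return np.sum(w * np.exp(gamma * Y_int)) / np.sum(w)

    Y_hat_ext = Y_ext * np.exp(-gamma * Y_ext) * np.array([tilt(ui) for ui in U_ext])
    U_all = np.vstack([U_int, U_ext])
    Y_all = np.concatenate([Y_int, Y_hat_ext])
    w = prod_gauss((u - U_all) / b)
    return np.sum(w * Y_all) / np.sum(w)
```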

In applications, we need to choose bandwidths with the given sample sizes n and N − n. We can apply k-fold cross-validation as described in Györfi et al. (Citation2002). Requirements on the rates of the bandwidths are given in the theorems in Section 4.

3. Constrained kernel regression with unmeasured covariates

We still consider the case with one external dataset, independent of the internal dataset. In this section, the external dataset contains iid observations (Yi,Xi), i=n+1,,N, from the external population P0, where X is a q-dimensional sub-vector of U with q<p.

Since the external dataset has only X, not the entire U, we cannot apply the method in Section 2 when q < p. Instead, we consider kernel regression using external information in a constraint. First, we consider the estimation of the n-dimensional vector μ1 = (μ1(U1), …, μ1(Un))′, where A′ denotes the transpose of a vector or matrix A throughout. Note that the standard kernel regression (Equation2) estimates μ1 as μˆ1 = (∑_{i=1}^n Yi κb(U1 − Ui)/∑_{i=1}^n κb(U1 − Ui), …, ∑_{i=1}^n Yi κb(Un − Ui)/∑_{i=1}^n κb(Un − Ui))′. Taking partial derivatives with respect to the μi's and setting them to zero, we obtain that (13) μˆ1 = argmin_{μ1,…,μn} ∑_{i=1}^n ∑_{j=1}^n κb(Ui − Uj)(Yj − μi)²/∑_{k=1}^n κb(Ui − Uk). We improve μˆ1 by the following constrained minimization, (14) μˆ1Cj = argmin_{μ1,…,μn} ∑_{i=1}^n ∑_{j=1}^n κl(Ui − Uj)(Yj − μi)²/∑_{k=1}^n κl(Ui − Uk) (15) subject to ∑_{i=1}^n {μi − hˆ1Ej(Xi)} g(Xi) = 0, where g(x) = (1, x′)′, l in (Equation14) is a bandwidth that may be different from b in (Equation2) or (Equation13), and hˆ1Ej(x) is the kernel estimator of h1(x) = E(Y | X = x, D = 1) using the jth of the three methods described in Section 2, j = 1, 2, 3. Specifically, hˆ1E1(x) is given by (Equation3), hˆ1E2(x) is given by (Equation5), and hˆ1E3(x) is given by (Equation12), with u and U replaced by x and X, respectively, and kernels and bandwidths suitably adjusted as the dimensions of U and X are different. Note that hˆ1Ej can be computed, as both the internal and external datasets have measured Xi's.

It turns out that μˆ1Cj in (Equation14) has the explicit form μˆ1Cj = μˆ1 + G(G′G)^{−1}G′(hˆ1Ej − μˆ1), where G is the n × (q + 1) matrix whose ith row is g(Xi)′ and hˆ1Ej is the n-dimensional vector whose ith component is hˆ1Ej(Xi). Constraint (Equation15) is an empirical analog of the theoretical constraint E[{μ1(U) − h1(X)} g(X) | D = 1] = 0 (based on internal data), as E{E(Y | U, D = 1) | X, D = 1} = E(Y | X, D = 1) = h1(X). Thus, if hˆ1Ej(⋅) is a good estimator of h1(⋅), then μˆ1Cj in (Equation14) is more accurate than the unconstrained μˆ1 in (Equation13).

To obtain an improved estimator of the entire regression function μ1(⋅) in (Equation1), not just the function at u = Ui, i = 1, …, n, we apply the standard kernel regression with the response vector (Y1, …, Yn)′ replaced by μˆ1Cj in (Equation14), which results in the following three estimators of μ1(u): (16) μˆ1Cj(u) = ∑_{i=1}^n μˆiCj κb(u − Ui)/∑_{i=1}^n κb(u − Ui), j = 1, 2, 3, where μˆiCj is the ith component of μˆ1Cj in (Equation14) and b is the same bandwidth as in (Equation2). The first estimator μˆ1C1(u) is simple, but can be incorrect when the populations P1 and P0 are different. The asymptotic validity of μˆ1C2 and μˆ1C3 is established in the next section.
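
Putting Section 3 together, the sketch below computes the unconstrained fits, applies the explicit projection form of (Equation14)–(Equation15), and re-smooths as in (Equation16). It is our own illustration: the values hˆ1Ej(Xi) are assumed to have been computed beforehand by one of the Section 2 estimators, and all function names are ours.

```python
import numpy as np

def prod_gauss(t):
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def nw_fits(U_eval, U, Y, bw):
    """Kernel regression of Y on U, evaluated at each row of U_eval."""
    return np.array([
        np.sum(prod_gauss((u - U) / bw) * Y) / np.sum(prod_gauss((u - U) / bw))
        for u in U_eval
    ])

def mu1_C(u, Y_int, U_int, X_int, h1_hat_at_X, b, l):
    """Constrained estimator: projection step (explicit form of (14)-(15)) with bandwidth l,
    then the final smoothing (16) with bandwidth b."""
    mu_hat = nw_fits(U_int, U_int, Y_int, l)                      # unconstrained fits at U_1, ..., U_n
    G = np.column_stack([np.ones(len(Y_int)), X_int])             # rows g(X_i)' = (1, X_i')
    coef, *_ = np.linalg.lstsq(G, h1_hat_at_X - mu_hat, rcond=None)
    mu_hat_C = mu_hat + G @ coef                                  # mu_hat + G (G'G)^{-1} G'(h1_hat - mu_hat)
    w = prod_gauss((u - U_int) / b)
    return np.sum(w * mu_hat_C) / np.sum(w)                       # re-smoothing as in (16)
```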

4. Asymptotic normality

We now establish the asymptotic normality of μˆ1Ej(u) and μˆ1Cj(u) for a fixed u, as the sample size of the internal dataset increases to infinity. All technical proofs are given in the Appendix.

The first result is about μˆ1E2(u) in (Equation5). The result is also applicable to μˆ1E1(u) in (Equation3) with an added condition that P1=P0.

Theorem 4.1

Assume the following conditions.

(B1)

The densities f1(u) and f0(u) of U under the internal and external populations, respectively, have continuous and bounded first- and second-order partial derivatives.

(B2)

μ1²(u)fk(u), σk²(u)fk(u), and the first- and second-order partial derivatives of μ1(u)fk(u) are continuous and bounded, where σ1²(u) = E[{Y − μ1(U)}² | U = u, D = 1], σ0²(u) = E[{Y~ − μ1(U)}² | U = u, D = 0], and Y~ = Y f(Y|U, D = 1)/f(Y|U, D = 0). Also, E(|Y|^s | U = u, D = 1) f1(u) and E(|Y~|^s | U = u, D = 0) f0(u) are bounded for a constant s > 2.

(B3)

The kernel κ is of second order, i.e., ∫ u κ(u) du = 0 and 0 < ∫ u u′ κ(u) du < ∞.

(B4)

The bandwidth b satisfies b → 0 and (a + 1) n b^{p+4} → c ∈ [0, ∞), where a = lim_{n→∞} (N − n)/n (assumed to exist without loss of generality).

(B5)

The kernels κ~ and κ¯ in (Equation6) have bounded supports and orders m~ > 2 + 2/p and m¯ > 2, respectively, as defined by Bierens (Citation1987); f(y, u | D = 1) and f(y, u | D = 0) are m~th-order continuously differentiable with bounded partial derivatives, and f1(u) and f0(u) are m¯th-order continuously differentiable with bounded partial derivatives. The functions f(y, u | D = 0) and f1(u) are bounded away from zero. The bandwidths b~ and b¯ satisfy n b~^{p+1}/log(n) → ∞ and n b¯^{p}/log(n) → ∞.

Then, for any fixed u with f0(u) > 0 and f1(u) > 0 and μˆ1E2 in (Equation5), (17) √(nb^p) {μˆ1E2(u) − μ1(u)} →d N(Ba(u), Va(u)), where →d denotes convergence in distribution as n → ∞,
Ba(u) = c^{1/2} {f1(u) A1(u) + a f0(u) A0(u)} / [(a + 1)^{1/2} {f1(u) + a f0(u)}],
A1(u) = ∫ κ(v) {½ v′∇²μ1(u) v + v′∇log f1(u) ∇μ1(u)′ v} dv,
A0(u) = ∫ κ(v) {½ v′∇²μ1(u) v + v′∇log f0(u) ∇μ1(u)′ v} dv,
Va(u) = {f1(u) σ1²(u) + a f0(u) σ0²(u)} / {f1(u) + a f0(u)}² ∫ κ(v)² dv.

Conditions (B1)–(B4) are typically assumed for kernel estimation (Bierens, Citation1987). Condition (B5) is a sufficient condition for (18) max_{i=n+1,…,N} |fˆ(Yi|U = Ui, D = 1)/fˆ(Yi|U = Ui, D = 0) − f(Yi|U = Ui, D = 1)/f(Yi|U = Ui, D = 0)| = op(1)/√(nb^p) (Lemma 8.10 in Newey & McFadden, Citation1994), where op(1) denotes a term tending to 0 in probability. Result (Equation18) implies that the estimation of the ratio f(Y|U, D = 1)/f(Y|U, D = 0) does not affect the asymptotic distribution of μˆ1E2(u) in (Equation5).

Note that both the squared bias Ba²(u) and the variance Va(u) in (Equation17) are decreasing in the limit a = lim_{n→∞}(N − n)/n, a quantity reflecting how much external data we have. In the extreme case of a = 0, i.e., when the size of the external dataset is negligible compared with the size of the internal dataset, result (Equation17) reduces to the well-known asymptotic normality for the standard kernel estimator μˆ1(u) in (Equation2) (Bierens, Citation1987). In the other extreme case of a = ∞, on the other hand, Ba(u) = Va(u) = 0 and, hence, μˆ1E2(u) has a convergence rate faster than 1/√(nb^p), the convergence rate of the standard kernel estimator μˆ1(u).
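
For instance, substituting a = 0 into the expressions of Theorem 4.1 recovers the standard kernel regression bias and variance:

```latex
B_0(u) = c^{1/2} A_1(u), \qquad
V_0(u) = \frac{\sigma_1^2(u)}{f_1(u)} \int \kappa(v)^2\,dv .
```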

The next result is about μˆ1C2(u) in (Equation16) as described in Section 3.

Theorem 4.2

Assume (B1)–(B5) with U and p replaced by X and q, respectively, and the following conditions, where fk(u) and σk²(u), k = 0, 1, are defined in (B1)–(B2).

(C1)

The range U of U is a compact set in the p-dimensional Euclidean space and f1(u) is bounded away from infinity and zero on U; f1(u) and f0(u) have continuous and bounded first- and second-order partial derivatives.

(C2)

Functions μ1(u) = E(Y|U = u) and σ1²(u) are Lipschitz continuous; μ1(u) has bounded third-order partial derivatives; h1(x) = E(Y | X = x, D = 1) has bounded first- and second-order partial derivatives; and E(|Y|^s | U = u, D = 1) is bounded with s > 2 + p/2.

(C3)

All kernel functions are positive, bounded, and Lipschitz continuous with mean zero and finite sixth moments.

(C4)

a = lim_{n→∞}(N − n)/n > 0 and the bandwidths b in (Equation2) and l in (Equation14) satisfy b → 0, l → 0, l/b → r ∈ (0, ∞), nb^p → ∞, and nb^{4+p} → c ∈ [0, ∞), as n → ∞.

(C5)

The densities fX1(x) and fX0(x) of X under the internal and external populations, respectively, are bounded away from zero. There exists a constant s > 4 such that E(|Y|^s | D = 1) and E(|Y~|^s | D = 0) are finite, E(|Y|^s | X = x, D = 1) fX1(x) and E(|Y~|^s | X = x, D = 0) fX0(x) are bounded, and the bandwidth bh for hˆ1 satisfies n^{1−2/s} bh^q / log(n) → ∞.

Then, for any fixed u ∈ U and μˆ1C2(u) in (Equation16), (19) √(nb^p){μˆ1C2(u) − μ1(u)} →d N(Br(u), Vr(u)), where
Br(u) = c^{1/2}[(1 + r²) A1(u) − r² g(x)′ Σg^{−1} E{g(X) A1(U) | D = 1}],
A1(u) = ∫ κ(v){½ v′∇²μ1(u) v + v′∇log f1(u) ∇μ1(u)′ v} dv,
Vr(u) = {σ1²(u)/f1(u)} ∫ {∫ κ(v − rw) κ(w) dw}² dv,

and Σg = E{g(X)g(X)′ | D = 1} is assumed to be positive definite without loss of generality.

The next result is about γˆ in (Equation11).

Theorem 4.3

Suppose that (Equation8) holds for binary random D indicating internal and external data. Assume also the following conditions.

(D1)

The kernel κˇ in (Equation10) is Lipschitz continuous, satisfies ∫ κˇ(u) du = 1, has a bounded support, and has order d > max{(p + 4)/2, p}.

(D2)

The bandwidth bˇ in (Equation10) satisfies N bˇ^{2q}/(log N)² → ∞ and N bˇ^{2d} → 0 as the total sample size of the internal and external datasets N → ∞, where d is given in (D1).

(D3)

γ in (Equation8) is an interior point of a compact domain Γ and is the unique solution to h1(⋅) = h(⋅, t), t ∈ Γ. For any u, h(u, t) is second-order continuously differentiable in t, and h, ∂th, and ∂t²h are bounded over t and u. As t → γ, h(⋅, t), ∂th(⋅, t), and ∂t²h(⋅, t) converge uniformly.

(D4)

suptΓEWt4< and suptΓE[Wt4|U]fU(U) is bounded, where a2=aa, Wt=(DetY,(1D)YetY,D,(1D),DYetY,(1D)Y2etY, DY2etY,(1D)Y3etY), and fU is the density of U. Furthermore, there is a function τ(Y,D) with E{τ(Y,D)}< such that WtWt<τ(Y,D)|tt|.

(D5)

The function ωt(u)=E(Wt|U=u)fU(u) is bounded away from zero, and it is dth-order continuously differentiable with bounded partial derivatives on an open set containing the support of U. There is a functional G(Y,D,ω) linear in ω such that |G(Y,D,ω)|ι(Y,D)ω and, for small enough ωωγ, |ψ(Y,D,ω)ψ(Y,D,ωγ)G(Y,D,ωωγ)| ι(Y,D)ωωγ2, where ι(Y,D) is a function with E{ι(Y,D)}<, ψ(Y,D,ω) =2D(Yω1ω2ω3ω4)(ω2ω5ω1ω6ω3ω4), ωj is the jth component of ω, ω=supxUω(u), ωωγ=supxUω(u)ωγ(u), and U is the range of U. Also, there exists an almost everywhere continuous 8-dimensional function ν(U) with ν(u)du< and E{supδϵν(U+δ)4}< for some ϵ>0 such that E{G(Y,D,ω)}=ν(u)ω(u)du for all ω<.

Then, as the total sample size of internal and external datasets N, (20) N(γˆγ)dN(0,σγ2),(20) where σγ2=[2E{Dγh(U,γ)}2]1Var[ψ(Y,D,ωγ)+ν(U)WγE{ν(U)Wγ}].

Conditions (D1)–(D5) are technical assumptions discussed in Lemmas 8.11 and 8.12 in Newey and McFadden (Citation1994). As discussed by Newey and McFadden (Citation1994), the condition that κˇ has a bounded support can be relaxed, as it is imposed for a simple proof.

Combining Theorems 4.1–4.3, we obtain the following result for μˆ1E3(u) in (Equation12) or μˆ1C3(u) in (Equation16).

Corollary 4.1

Suppose that (Equation8) holds for the binary random D indicating internal and external data.

(i)

Under (B1)–(B4) and (D1)–(D5), result (Equation17) holds with μˆ1E2(u) replaced by μˆ1E3(u).

(ii)

Under (C1)–(C4) and (D1)–(D5) with U and p replaced by X and q, respectively, result (Equation19) holds with μˆ1C2(u) replaced by μˆ1C3(u).

5. Simulation results

5.1. The performance of μˆ1Cj given by (16)

We first present simulation results to examine and compare the performance of the standard kernel estimator μˆ1 in (Equation2) without using external information and our proposed estimator (Equation16) with three variations, μˆ1C1, μˆ1C2, and μˆ1C3, as described at the end of Section 3. We consider U = (X, Z) with univariate covariates X and Z, where Z is unmeasured in the external dataset (p = 2 and q = 1). The covariates are generated in two ways:

  1. normal covariates: (X,Z) is bivariate normal with means 0, variances 1, and correlation 0.5;

  2. bounded covariates: X = BW1 + (1 − B)W2 and Z = BW1 + (1 − B)W3, where W1, W2, and W3 are identically distributed as uniform on [−1, 1], B is uniform on [0, 1], and W1, W2, W3, and B are independent.

Conditioned on (X,Z), the response Y is normal with mean μ(X,Z) and variance 1, where μ(X,Z) follows one of the following four models:

(M1)

μ(X,Z) = X/2 − Z²/4;

(M2)

μ(X,Z)=cos(2X)/2+sin(Z);

(M3)

μ(X,Z)=cos(2XZ)/2+sin(Z);

(M4)

μ(X,Z) = X/2 − Z²/4 + cos(XZ)/4.

Note that all four models are nonlinear in (X,Z); (M1)-(M2) are additive models, while (M3)-(M4) are non-additive.

A total of N = 1,200 data points are generated from the population of (Y, X, Z) as previously described. A data point is treated as internal or external according to a random binary D with conditional probability P(D = 1 | Y, X, Z) = 1/{1 + exp(γ0 + 2|X| + γY)}, where γ = 0 or 1/2 and γ0 = 1 or −1.5. Under γ0 = 1 or −1.5, the unconditional P(D = 1) ≈ n/N is around 13% or 50%, respectively.

The simulation studies the performance of the kernel estimators in terms of the mean integrated squared error (MISE). The following measure is calculated by simulation with S replications: (21) MISE(μ~1) = (1/S) ∑_{s=1}^S (1/T) ∑_{t=1}^T {μ~1(Us,t) − μ1(Us,t)}², where {Us,t : t = 1, …, T} are test data for each simulation replication s, the simulation is repeated independently for s = 1, …, S, and μ~1 denotes the estimator under evaluation, one of μˆ1, μˆ1C1, μˆ1C2, and μˆ1C3, constructed independently of the test data. We consider two ways of generating the test data Us,t. The first is to use T = 121 fixed grid points equally spaced on [−1, 1] × [−1, 1]. The second is to take a random sample of size T = 121 without replacement from the covariates U of the internal dataset, for each fixed s = 1, …, S and independently across s.

To show the benefit of using external information, we calculate the improvement in efficiency defined as (22) IMP = 1 − min{MISE(μ~1)}/MISE(μˆ1), where the minimum is over μ~1 = one of μˆ1, μˆ1C1, μˆ1C2, and μˆ1C3.
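
A minimal sketch of how (Equation21) and (Equation22) can be computed from stored simulation output is given below (our own bookkeeping; the array shapes and names are ours).

```python
import numpy as np

def mise(fitted, truth):
    """MISE (21): fitted and truth are (S, T) arrays of estimated and true values
    of mu_1 at the T test points of each of the S replications."""
    return float(np.mean(np.mean((fitted - truth) ** 2, axis=1)))

def improvement(mise_by_estimator, baseline):
    """IMP (22): one minus the minimum MISE over the candidate estimators divided by
    the MISE of the standard estimator; mise_by_estimator maps labels to MISE values."""
    return 1.0 - min(mise_by_estimator.values()) / mise_by_estimator[baseline]
```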

In all cases, we use the Gaussian kernel. The bandwidths b and l affect the performance of the kernel methods, and we consider two types of bandwidths in the simulation. The first is 'the best bandwidth': for each method, we evaluate the MISE over a pool of bandwidths and display the one with the minimal MISE. This shows the best we can achieve in terms of bandwidth, but it cannot be used in applications. The second is to select the bandwidth from a pool of bandwidths via 10-fold cross-validation (Györfi et al., Citation2002), which produces a decent bandwidth that can be applied to real data.
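
For completeness, a generic k-fold cross-validation for the bandwidth of the standard kernel estimator is sketched below, in the spirit of Györfi et al. (Citation2002); the authors' exact implementation is not specified, and the procedure and names here are our own.

```python
import numpy as np

def cv_bandwidth(Y, U, bandwidth_pool, n_folds=10, seed=0):
    """Pick a bandwidth from a pool by k-fold cross-validation:
    minimize the held-out squared prediction error of the kernel estimator (2)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), n_folds)
    scores = []
    for b in bandwidth_pool:
        err = 0.0
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            for i in test:
                w = np.exp(-0.5 * np.sum(((U[i] - U[train]) / b) ** 2, axis=1))
                err += (Y[i] - np.sum(w * Y[train]) / np.sum(w)) ** 2
        scores.append(err)
    return bandwidth_pool[int(np.argmin(scores))]
```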

The simulated MISE values based on S = 200 replications are shown in Tables 1–4.

Table 1. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only X, with S = 200 under γ = 0, n/N ≈ 13%.

Table 2. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only X, with S = 200 under γ = 0, n/N ≈ 50%.

Table 3. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only X, with S = 200 under γ = 0.5, n/N ≈ 13%.

Table 4. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only X, with S = 200 under γ = 0.5, n/N ≈ 50%.

Consider first the results in Tables 1 and 2. Since γ = 0, all three estimators, μˆ1C1, μˆ1C2, and μˆ1C3, are correct and more efficient than the standard estimator μˆ1 in (Equation2), which does not use external information. The estimator μˆ1C1 is the best, as it uses the correct information that the populations are homogeneous (γ = 0) and is simpler than μˆ1C2 and μˆ1C3.

Next, the results in Tables 3 and 4 for γ = 1/2 indicate that the estimators μˆ1C2 and μˆ1C3, which use a correct constraint, are better than the estimator μˆ1C1 using an incorrect constraint and the estimator μˆ1 that does not use external information. Since μˆ1C3 uses more information, it is in general better than μˆ1C2. Furthermore, with an incorrect constraint, μˆ1C1 can be much worse than μˆ1 without using external information.

5.2. The performance of μˆ1Ej given by (3), (5), or (12)

Under the same simulation setting as described in Section 5.1 but with the covariate Z measured in both the internal and external datasets, we compare the performance of three estimators, μˆ1E1, μˆ1E2, and μˆ1E3 given by (Equation3), (Equation5), and (Equation12), respectively, with the standard kernel estimator μˆ1 in (Equation2) without using external information. The mean integrated squared error (MISE) and improvement (IMP) are calculated using formulas (Equation21) and (Equation22), respectively, with μ~1 = one of μˆ1, μˆ1E1, μˆ1E2, and μˆ1E3.

Tables 5–8 present the simulation results. The relative performance of μˆ1E1, μˆ1E2, μˆ1E3, and μˆ1 follows the same pattern as that of μˆ1C1, μˆ1C2, μˆ1C3, and μˆ1 in Section 5.1.

Table 5. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains both X and Z, with S = 200 under γ = 0, n/N ≈ 13%.

Table 6. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains both X and Z, with S = 200 under γ = 0, n/N ≈ 50%.

Table 7. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains both X and Z, with S = 200 under γ = 0.5, n/N ≈ 13%.

Table 8. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains both X and Z, with S = 200 under γ = 0.5, n/N ≈ 50%.

The only difference between the results here and those in Section 5.1 is that the use of more external data (a smaller n/N) results in a better performance of μˆ1E2 or μˆ1E3 (or μˆ1E1 when it is correct). This is consistent with Theorem 4.1 in Section 4, which shows that both the squared bias Ba²(u) and the variance Va(u) in (Equation17) are decreasing in the limit a = lim_{n→∞}(N − n)/n. On the other hand, the simulation results in Section 5.1 and Theorem 4.2 in Section 4 do not show a clear indication that using more external data produces better estimators. The main reason is that, when Z is not observed in the external dataset, the estimator μˆ1Cj relies more on the internal data to recover the loss of Z from the external dataset in a complicated way.

5.3. The performance of μˆ1Cj given by (16) with q = 2

We re-consider the simulation in Section 5.1 but with the dimension of X equal to q = 2, i.e., U = (X1, X2, Z). We only consider normally distributed covariates with means 0, variances 1, and correlations of 0.5, 0.5, and 0.25 for (X1, Z), (X2, Z), and (X1, X2), respectively. Given U, the response variable Y is normally distributed with mean μ(X1, X2, Z) = X1/2 + X2/4 − Z²/4 and variance 1. Moreover, P(D = 1 | Y, X, Z) = 1/{1 + exp(γ0 + 2|X1| + γY)}, while the remaining settings are the same as in Section 5.1. In calculating the MISE (Equation21), we only use randomly sampled Us,t with T = 121, not fixed grid points. Also, we only evaluate the performance of the estimators μˆ1Cj, since the estimators μˆ1Ej are simpler.

The results are shown in Table 9. Compared with the results in Tables 1–4 for the case of q = 1, the MISEs here are larger due to having more covariates (q = 2). But the relative performance of the estimators is the same as that shown in Tables 1–4.

Table 9. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only normally distributed (X1,X2), with S = 200.

6. Discussion

The curse of dimensionality is a well-known problem for nonparametric methods. Thus, the proposed method in Section 2 is intended for a low-dimensional covariate U, i.e., p is small. If p is not small, then we should reduce the dimension of U prior to applying the proposed constrained kernel regression, or any kernel method. For example, consider a single-index model assumption (K.-C. Li, Citation1991), i.e., μ1(U) in (Equation1) is assumed to satisfy (23) μ1(U) = μ1(η′U), where η is an unknown p-dimensional vector. The well-known SIR technique (K.-C. Li, Citation1991) can be applied to obtain a consistent and asymptotically normal estimator ηˆ of η in (Equation23). Once η is replaced by ηˆ, the kernel method can be applied with U replaced by the one-dimensional 'covariate' ηˆ′U, as sketched below. We can also apply other dimension reduction techniques developed under assumptions weaker than (Equation23) (Cook & Weisberg, Citation1991; B. Li & Wang, Citation2007; Ma & Zhu, Citation2012; Y. Shao et al., Citation2007; Xia et al., Citation2002).
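
The following sketch pairs a basic SIR step (K.-C. Li, 1991) with a one-dimensional kernel regression on ηˆ′U; it is our own illustration of the idea, with our own slicing choices and names, not the authors' implementation.

```python
import numpy as np

def sir_direction(Y, U, n_slices=10):
    """Basic sliced inverse regression: estimate the index direction eta in (23)."""
    n, p = U.shape
    L = np.linalg.cholesky(np.cov(U, rowvar=False))
    Z = (U - U.mean(axis=0)) @ np.linalg.inv(L).T        # standardized covariates
    M = np.zeros((p, p))
    for s in np.array_split(np.argsort(Y), n_slices):    # slice observations by the order of Y
        m = Z[s].mean(axis=0)
        M += (len(s) / n) * np.outer(m, m)               # weighted covariance of slice means
    _, vecs = np.linalg.eigh(M)
    eta = np.linalg.inv(L).T @ vecs[:, -1]               # map the top direction back to the U scale
    return eta / np.linalg.norm(eta)

def single_index_nw(u, Y, U, eta, b):
    """Kernel regression on the one-dimensional 'covariate' eta'U."""
    s_eval, s_data = float(u @ eta), U @ eta
    w = np.exp(-0.5 * ((s_eval - s_data) / b) ** 2)
    return np.sum(w * Y) / np.sum(w)
```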

We now turn to the dimension of X in the external dataset. When the dimension of X is high, we may consider the following approach. Instead of using constraint (Equation15), we use the component-wise constraints (24) ∑_{i=1}^n {μi − hˆ1(k)(Xi(k))} gk(Xi(k)) = 0, k = 1, …, q, where Xi(k) is the kth component of Xi, gk(X(k)) = (1, X(k))′, and hˆ1(k)(Xi(k)) is an estimator of h1(k)(X(k)) = E(Y | X(k), D = 1) obtained using the methods described in Section 2. More constraints are involved in (Equation24), but the estimation only involves the one-dimensional X(k), k = 1, …, q.

The kernel κ adopted in (Equation2) and (Equation16) is a second-order kernel, so the convergence rate of μˆ1E(u) − μ1(u) is n^{−2/(4+p)}. An mth-order kernel with m > 2, as defined by Bierens (Citation1987), may be used to achieve the convergence rate n^{−m/(2m+p)}. Alternatively, we may apply other nonparametric smoothing techniques such as local polynomials (Fan et al., Citation1997) to achieve the convergence rate n^{−m/(2m+p)} with m ≥ 2.

Our results can be extended to scenarios where several external datasets are available. Since each external source may provide different covariate variables, we may need to apply the component-wise constraints (Equation24), estimating hˆ1(k) by combining all the external sources that collect covariate X(k). If the populations of the external datasets are different, then we may have to apply a combination of the methods described in Section 2.

Acknowledgments

The authors would like to thank two anonymous referees for helpful comments and suggestions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

Jun Shao's research was partially supported by the National Natural Science Foundation of China [Grant Number 11831008] and the U.S. National Science Foundation [Grant Number DMS-1914411].

References

  • Bierens, H. J. (1987). Kernel estimators of regression functions. In Advances in Econometrics: Fifth World Congress (Vol. 1, pp. 99–144). Cambridge University Press.
  • Chatterjee, N., Chen, Y. H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
  • Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414), 328–332. https://doi.org/10.2307/2290564
  • Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press. https://doi.org/10.5705/ss.202021.0446
  • Fan, J., Farmen, M., & Gijbels, I. (1998). Local maximum likelihood estimation and inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(3), 591–608. https://doi.org/10.1111/1467-9868.00142
  • Fan, J., Gasser, T., Gijbels, I., Brockmann, M., & Engel, J. (1997). Local polynomial regression: optimal kernels and asymptotic minimax efficiency. Annals of the Institute of Statistical Mathematics, 49(1), 79–99. https://doi.org/10.1023/A:1003162622169
  • Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. Springer.
  • Kim, H. J., Wang, Z., & Kim, J. K. (2021). Survey data integration for regression analysis using model calibration. arXiv 2107.06448.
  • Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479), 997–1008. https://doi.org/10.1198/016214507000000536
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. https://doi.org/10.1080/01621459.1991.10475035
  • Lohr, S. L., & Raghunathan, T. E. (2017). Combining survey data with other data sources. Statistical Science, 32(2), 293–312. https://doi.org/10.1214/16-STS584
  • Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497), 168–179. https://doi.org/10.1080/01621459.2011.646925
  • Merkouris, T. (2004). Combining independent regression estimators from multiple surveys. Journal of the American Statistical Association, 99(468), 1131–1139. https://doi.org/10.1198/016214504000000601
  • Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142. https://doi.org/10.1137/1109020
  • Newey, W. K. (1994). Kernel estimation of partial means and a general variance estimator. Econometric Theory, 10(2), 1–21. https://doi.org/10.1017/S0266466600008409.
  • Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245. https://doi.org/10.1016/S1573-4412(05)80005-4
  • Rao, J. (2021). On making valid inferences by integrating data from surveys and other sources. Sankhya B, 83(1), 242–272. https://doi.org/10.1007/s13571-020-00227-w
  • Shao, J. (2003). Mathematical statistics. 2nd ed., Springer.
  • Shao, Y., Cook, R. D., & Weisberg, S. (2007). Marginal tests with sliced average variance estimation. Biometrika, 94(2), 285–296. https://doi.org/10.1093/biomet/asm021
  • Wand, M. P., & Jones, M. C. (1994, December). Kernel smoothing. Number 60 in Chapman & Hall/CRC Monographs on Statistics & Applied Probability, Boca Raton.
  • Wasserman, L. (2006). All of nonparametric statistics. Springer.
  • Xia, Y., Tong, H., Li, W. K., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 363–410. https://doi.org/10.1111/1467-9868.03411
  • Yang, S., & Kim, J. K. (2020). Statistical data integration in survey sampling: a review. Japanese Journal of Statistics and Data Science, 3(2), 625–650. https://doi.org/10.1007/s42081-020-00093-w
  • Zhang, H., Deng, L., Schiffman, M., Qin, J., & Yu, K. (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107(3), 689–703. https://doi.org/10.1093/biomet/asaa014
  • Zhang, Y., Ouyang, Z., & Zhao, H. (2017). A statistical framework for data integration through graphical models with application to cancer genomics. The Annals of Applied Statistics, 11(1), 161–184. https://doi.org/10.1214/16-AOAS998

Appendix

Proof of Theorem 4.1.

Let μ~1(u)=pˆ(u)μˆ1(u)+{1pˆ(u)}μˆ0(u), where μˆ1(u)=i=1nκb(uUi)Yi/i=1nκb(uUi), μˆ0(u)=i=n+1Nκb(uUi)Y~i/i=n+1Nκb(uUi), and pˆ(u)=i=1nκb(uUi)/i=1Nκb(uUi). Under (B3)–(B4), Theorem 2 in Nadaraya (Citation1964) shows that pˆ(u) converges to P(D=1|U=u) in probability. Under (B1)–(B4), nbp{μˆ1(u)μ1(u)}dN(B1(u),V1(u)), B1(u)=c1/2A1(u), V1(u)=σ12(u)f1(u)κ(v)2dv, and n/(Nn)(Nn)bp{μˆ0(u)μ1(u)}dN(B0(u),V0(u)), B0(u)=c1/2A0(u), V0(u)=σ02(u)af0(u)κ(v)2dv. Then (Equation17) holds for μ~1(u), by Slutsky's theorem, the independence between μˆ1 and μˆ0, and the definition of a. The desired result (Equation17) follows from the fact that |μˆ1E2(u)μ~1(u)| is bounded by (A1) {1pˆ(u)}maxi>n|fˆ(Yi|U=Ui,D=1)fˆ(Yi|U=Ui,D=0)f(Yi|U=Ui,D=1)f(Yi|U=Ui,D=0)|i=n+1N|Yi|κb(uUi)i=n+1Nκb(uUi),(A1) which is op(1/nbp) by result (Equation18) under condition (B5).

Proof of Theorem 4.2.

Write (A2) nbp{μˆ1C2(u)μ1(u)}=T1++T6,(A2) where T1=n1/2bp/2δb(u)(InP)Bl1Δlϵ/fˆb(u), T2=n1/2bp/2δb(u){μ1μ1(u)1n}/fˆb(u), T3=n1/2bp/2δb(u)(Bl1Δlμ1μ1)/fˆb(u), T4=n1/2bp/2δb(u)P(Bl1Δlμ1μ1)/fˆb(u), T5=n1/2bp/2δb(u)P(hˆ1h1)/fˆb(u), T6=n1/2bp/2δb(u)P(h1μ1)/fˆb(u), fˆb(u)=i=1nκb(uUi)/n, δb(u)=(κb(uU1),,κb(uUn)), In is the identity matrix of order n, 1n is the n-vector with all components being 1, Bl is the n×n diagonal matrix whose ith diagonal element is fˆl(Ui), Δl is the n×n matrix whose (i,j)th entry is κl(UiUj)/n, ϵ=(ϵ1,,ϵn) with ϵi=Yiμ1(Ui), h1 is the n-dimensional vector whose ith component is h1(Xi), P=G(GG)1G, and G, hˆ1, and μ1 are defined in Section 2.

We first show that T1 in (EquationA2) is asymptotically normal with mean 0 and variance Vr(u) defined in Theorem 4.2. Consider a further decomposition T1=nV+T11+T12+T13, where V=1n2j=1ni=1nS(Ui,ϵi,Uj,ϵj)is a V-statistic with S(Ui,ϵi,Uj,ϵj)=bp/22f1(u){κb(uUi)κl(UiUj)ϵjf1(Ui)+κb(uUj)κl(UjUi)ϵif1(Uj)},T11=bp/2n3/2i=1nκb(uUi)κl(0)ϵif1(u)f1(Ui),T12=bp/2n3/2j=1ni=1nκb(uUi)κl(UiUj)f1(u)f1(Ui){f1(u)f1(Ui)fˆb(u)fˆl(Ui)1}ϵj,and T13=n1/2bp/2δb(u)PBl1Δlϵ/fˆb(u). Note that S1(U1,ϵ1)=E{S(U1,ϵ1,U2,ϵ2)U1,ϵ1}=bp/22f1(u){κl(u2U1)κb(uu2)du2}ϵ1having variance Var{S1(U1,ϵ1)}=bp/24f12(u)f1(u1)σ12(u1){κl(u2u1)κb(uu2)du2}2du1=bp/24f12(u)f1(u1)σ12(u1){κl(v)κb(uu1lν)dν}2du1=14f12(u)f1(ubw)σ12(ubw){κ(v)κ(wνlb)dν}2dw,where σ12() is given in condition (C2), the second and third equalities follow from changing variables u2u1=lν and uu1=bw, respectively. From the continuity of f1() and σ12(), Var{S1(u1,ϵ1)} converges to Vr(u). Therefore, by the theory for asymptotic normality of V-statistics (e.g., Theorem 3.16 in J. Shao, Citation2003), nVdN(0,Vr(u)).

Conditioned on U1,,Un, T11 has mean 0 and variance Var(T11|U1,,Un)=bp4f12(u)n3i=1nκb(uUi)2κl(0)2σ12(Ui)f1(Ui)supuUκ(u)34f12(u)n3b2pi=1nκb(uUi)σ12(Ui)f1(Ui)=op(1).This proves that T11=op(1). Note that E(T12U1,,Un)=0 and Var(T12U1,,Un) is bounded by max{1f12(u),1fˆb2(u)}maxi=1,,n|f1(Ui)fˆl(Ui)1|2Var(nV+T11|U1,,Un).Therefore, under the assumed condition that f1 is bounded away from zero, Lemma 3 in Dai and Shao (Citation2023) implies T12=op(1). Note that T13=bp/2n1/2j=1nWj(u)ϵj,Wj(u)=1ni=1nκb(uUi)g(Xi)fˆb(u)(GG)1i=1nκl(UiUj)g(Xi)fˆl(Ui).Conditioned on U1,,Un, T13 has mean 0 and variance Var(T13U1,,Un)=bpnj=1nWj2(u)σ12(Uj)=Op(bp)=op(1),because, under the assumed condition that f1 is bounded away from zero, Lemma 3 in Dai and Shao (Citation2023) implies maxj=1,,n|Wj(u)g(u)Σg1g(Xj)|=op(1). Thus, T13=op(1). Consequently, T1 has the same asymptotic distribution as nV, the claimed result.

From Lemma 4 in Dai and Shao (Citation2023) and (C4), T2=cA1(u){1+op(1)}. Note that T3=nbpl2nfˆb(u)j=1nκb(uUj)[1nl2fˆb(Uj)i=1nκl(uUi){μ1(Ui)μ1(Uj)}]={cr2nfˆb(u)j=1nκb(uUj)A1(Uj)}{1+op(1)}=cr2A1(u){1+op(1)},where the second equality follows from (A4) and Lemmas 3–4 in Dai and Shao (Citation2023), and the last equality follows from Lemma 2 in Dai and Shao (Citation2023) and continuity of A1(). Also, n1/2T4bp/2=1ni=1nκb(uUi)g(Xi)fˆb(u)(GG)1j=1ng(Xj)nfˆb(Uj)i=1nκl(uUi){μ1(Ui)μ1(Uj)}={g(x)Σg11nj=1ng(Xj)nfˆb(Uj)i=1nκl(uUi){μ1(Ui)μ1(Uj)}}{1+op(1)}={g(x)Σg1l2/pnj=1ng(Xj)A1(Uj)}{1+op(1)}=cr2g(x)Σg1E{g(X)A1(U)}{1+op(1)},where the first equality follows from Lemma 3 in Dai and Shao (Citation2023) and the law of large numbers, the second equality follows from Lemma 4 in Dai and Shao (Citation2023), and the last equality follows from the law of large numbers. Similarly, n1/2T5bp/2=1ni=1nκb(uUi)g(Xi)fˆb(u)(GG)1i=1ng(Xi){hˆ1(Xi)h1(Xi)}=[g(x)Σg11ni=1ng(Xi){hˆ1(Xi)h1(Xi)}]{1+op(1)}{1+op(1)}Op(1)maxj=1,,n|hˆ1(Xj)h1(Xj)|,where the second equality follows from Lemma 3 in Dai and Shao (Citation2023). Under (B1)–(B5) with U and p replaced by X and q, and (C5), Lemma 8.10 in Newey and McFadden (Citation1994) implies that (A3) max|hˆ1(Xi)h1(X1)|=Op(log(n)n2/(q+4)),(A3) which is op(1/nbp)=op(n2/(p+4)) and, hence, T5=op(1). From Lemma 3 in Dai and Shao (Citation2023) and the Central Limit Theorem, T6=bp/2n1/2i=1nκb(uUi)g(Xi)fˆb(u)(GG)1i=1ng(Xi){h1(Xi)μ1(Ui)}=Op(bp/2)=op(1).Combining these results, we obtain that T2++T6=Br(u)+op(1). This completes the proof.

Proof of Theorem 4.3.

Define ωˆt1=1Ni=1NRiκˇb(uUi)etYi,ωˆt2=1Ni=1N(1Ri)κˇb(uUi)YietYi,ωˆt3=1Ni=1NRiκˇb(uUi),ωˆt4=1Ni=1N(1Ri)κˇb(uUi),ωˆt5=1Ni=1NRiκˇb(uUi)YietYi,ωˆt6=1Ni=1N(1Ri)κˇb(uUi)Yi2etYi,ωˆt7=1Ni=1NRiκˇb(uUi)Yi2etYi,ωˆt8=1Ni=1N(1Ri)κˇb(uUi)Yi3etYi.Then, hˆ(u,t)=ωˆt1ωˆt2/ωˆt3ωˆt4, thˆ(u,t)=(ωˆt2ωˆt5ωˆt1ωˆt6)/ωˆt3ωˆt4, and t2hˆ(u,t)=(ωˆt1ωˆt82ωˆt5ωˆt6+ωˆt2ωˆt7)/ωˆt3ωˆt4. Let L(t)=E[R{Yh(U,t)}2], Lˆn(t)=N1i=1NRi{Yihˆ(Ui,t)}2, and Ln(t)=N1i=1NRi{Yih(Ui,t)}2. Taking derivatives with respect to t, we obtain tLˆn(t)=1Ni=1N2Ri{Yihˆ(Ui,t)}thˆ(Ui,t)=1Ni=1Nψ{Yi,Ri,ωˆt(Ui)},tLn(t)=1Ni=1N2Ri{Yih(Ui,t)}th(Ui,t)=1Ni=1Nψ{Yi,Ri,ωt(Ui)},and tL(t)=2E[R{Yh(u,t)}th(u,t)]=E[ψ{Y,R,ωt(U)}],where ψ is given in (D5). Note that tL(γ)=0 and t2L(γ)=2E[{th(U,γ)}2R]=νγ0. We establish the asymptotic normality of γˆ in the following four steps.

Step 1: Since γ is the unique minimizer of L(t), from Theorem 2.1 in Newey and McFadden (Citation1994), it suffices to prove that suptΓ|tLˆn(t)tL(t)|p0. Note that suptΓ|tLˆn(t)tL(t)|suptΓ|tLn(t)tL(t)|+suptΓ|tLˆn(t)tLn(t)|suptΓ|tLn(t)tL(t)|+2ni=1nRi|Yi|{suptΓ,xU|thˆ(u,t)th(u,t)|+suptΓ,uU|hˆ(u,t)thˆ(u,t)h(u,t)th(u,t)|}From (D3), |2R{Yh(u,t)}th(U,t)| is bounded by c|Y| for a constant c and hence Lemma 2.4 in Newey and McFadden (Citation1994) implies that  suptΓ|tLn(t)tL(t)|=op(1). Based on Lemma B.3 in Newey (Citation1994), conditions (D1)–(D4) imply that supuU|ωˆt(u)ωt(u)|0 for all tΓ. As a result, by a similar argument of the proof of Lemma B.3 in Newey (Citation1994), we obtain that suptΓ,uU|ωˆt(u)ωt(u)|p0. Since ωt is bounded away from zero and h(,t) and th(,t) are Lipschitz continuous functions with respect to ωt, suptΓ,uU|hˆ(u,t)h(u,t)|p0 and suptΓ,uU|thˆ(u,t)th(u,t)|p0. These results together with the previous inequality implies that γˆpγ.

Step 2: Conditions (D1)–(D5) ensure that Lemma 8.11 in Newey and McFadden (Citation1994) holds and hence NtLˆn(γ)dN(0,σL2) with σL2=Var{m(Y,R,U,ωγ)+τ(Y,R,U,γ)}.

Step 3: Note that t2Ln(t)=N1i=1N2Ri{Yih(Ui,t)}t2h(Ui,t)+2Ri{th(Ui,t)}2 and sup|tγ||γˆγ||t2Lˆn(t)t2L(γ)|A1+A2+A3, where A1=|t2Ln(γ)t2L(γ)|, A2=suptΓ|t2Lˆn(t)t2Ln(t)|, and the last term A3=sup|tγ||γˆγ||t2Ln(t)t2Ln(γ)|. The law of large numbers guarantees that A1=op(1). A similar argument in Step 1 shows that A2=op(1). For A3, we have |t2Ln(t)t2Ln(γ)|2Ni=1N|{th(Ui,t)}2{th(Ui,γ)}2|+2Ni=1N|Yi||t2h(Ui,t)t2h(Ui,γ)|+2Ni=1N|h(Ui,t)t2h(Ui,t)h(Ui,γ)t2h(Ui,γ)|.Under (D3), h(,t), ∇h(,t), and ∇h(,t) converge uniformly for all x as tγ and, thus, the A3=op(1) because γˆpγ. This shows that sup|tγ||γˆγ||t2Lˆn(t)t2L(γ)|p0.

Step 4: By Taylor's expansion, Lˆn(γˆ)Lˆn(γ)=0Lˆn(γ)=tLˆn(ξ)(γˆγ) for some ξ(γ,γˆ). From the results in Steps 1-3, N(γˆγ)dN(0,[2E{Rγh(U,γ)}2]1σL2). This completes the proof of (Equation20).

Proof of Corollary 4.1.

  1. From Theorem 4.3, (Equation20) shows that γˆγ=Op(1/N). Furthermore, Lemma 8.10 in Newey and McFadden (Citation1994) shows that (A4) maxi|eγYij=1neγYjκˇbˇ(UiUj)j=1nκˇbˇ(UiUj)f(Yi|U=Ui,D=1)f(Yi|U=Ui,D=0)|=Op(logNNbˇp+bˇd),(A4) which is op(N2/(p+4))=op(1)/nbp under the assumed condition d>max{(p+4)/2,p} and Nbˇ2d0. Since γˆγ converges faster than (EquationA4), (Equation18) holds. As a result, (Equation17) holds with μ1E2(u) replaced by μ1E3(u) under (B1)–(B4) and (D1)–(D5).

  2. Under (D1)–(D5) with U replaced by X and p replaced by q, Lemma 8.10 in Newey and McFadden (Citation1994) implies that supxX|hˆ(x,γ)h1(x)|=Op((logN)1/2(Nbˇq)1/2+bˇd)=op(n2/(p+4)).From the asymptotic normality of γˆ, γˆγ=Op(1/N), which converges to 0 faster than supxX|hˆ(x,γ)h1(x)|0. Hence (EquationA3) holds while hˆ1 is estimated by hˆ(X,γˆ). Then, the rest of proof of the second claims follows the argument in the proof of Theorem 4.2.