
Kernel regression utilizing heterogeneous datasets

Pages 51-68 | Received 05 Dec 2022, Accepted 08 Apr 2023, Published online: 28 Apr 2023

Abstract

Data analysis in modern scientific research and practice has shifted from analysing a single dataset to coupling several datasets. We propose and study a kernel regression method that can handle the challenge of heterogeneous populations. It greatly extends the constrained kernel regression [Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press], which requires the different datasets to come from a homogeneous population. The asymptotic normality of the proposed estimators is established under some conditions, and simulation results are presented to confirm our theory and to quantify the improvement gained from datasets with heterogeneous populations.

1. Introduction

With advanced technologies in data collection and storage, in modern statistical analyses we have not only a primary random sample from a population of interest, which results in a dataset referred to as the internal dataset, but also some independent external datasets from sources such as past investigations and publicly available datasets. In this paper, we consider nonparametric kernel regression (Bierens, Citation1987; Wand & Jones, Citation1994, December; Wasserman, Citation2006) between a univariate response Y and a covariate vector U from a sampled subject, using the internal dataset with help from independent external datasets. Specifically, we consider kernel estimation of the conditional expectation (regression function) of Y given U = u under the internal data population, (1) μ1(u) = E(Y | U = u, D = 1), where D = 1 indicates the internal population and u is a fixed point in U, the range of U. The indicator D can be either random or deterministic. The subscript 1 in μ1(u) emphasizes that it is for the internal data population (D = 1), which may be different from μ(u) = E(Y | U = u), a mixture of quantities from the internal and external data populations.

When external datasets also have measurements Y and U, we may simply combine the internal and external datasets when the populations for internal and external data are identical (homogeneous). However, heterogeneity typically exists among populations for different datasets, especially when there are multiple external datasets collected in different ways and/or different time periods. In Section 2, we propose a method to handle heterogeneity among different populations and derive a kernel regression more efficient than the one using internal data alone. The result is also a crucial building block for the more complicated case in Section 3 where external datasets contain fewer measured covariates as described next.

In applications, it often occurs that an external dataset has measured Y and X from each subject, where X is a part of the vector U; that is, some components of U are not measured, due to high measurement cost, the progress of technology, and/or scientific relevance. With some components of U unmeasured, the external dataset cannot be directly used to estimate μ1(u) in (Equation1), since conditioning on the entire U is involved. To solve this problem, Dai and Shao (Citation2023) propose a two-step kernel regression using external information as a constraint to improve kernel regression based on internal data alone, following the idea of using constraints in Chatterjee et al. (Citation2016) and H. Zhang et al. (Citation2020). However, these three cited papers mainly assume that the internal and external datasets share the same population, which may be unrealistic. The challenge in dealing with heterogeneity among different populations is similar to the difficulty in handling nonignorable missing data if the unmeasured components of U are treated as missing data, although in missing data problems we usually want to estimate μ(u) = E(Y | U = u) rather than μ1(u) in (Equation1).

In Section 3, we develop a methodology to handle population heterogeneity for internal and external datasets, which extends the procedure in Dai and Shao (Citation2023) to heterogeneous populations and greatly widens its application scope.

Under each scenario, we derive asymptotic normality in Section 4 for the proposed kernel estimators and obtain explicitly the asymptotic variances, which is important for large sample inference. Some simulation results are presented in Section 5 to compare the finite sample performance of several estimators. Discussions on extensions and on handling high-dimensional covariates are given in Section 6. All technical details are in the Appendix.

Our research fits into a general framework of data integration (Kim et al., Citation2021; Lohr & Raghunathan, Citation2017; Merkouris, Citation2004; Rao, Citation2021; Yang & Kim, Citation2020; Y. Zhang et al., Citation2017).

2. Efficient kernel estimation by combining datasets

The internal dataset contains observations (Yi, Ui), i = 1, …, n, independent and identically distributed (iid) from P1, the internal population of (Y, U), where Y is the response and U is a p-dimensional covariate vector associated with Y. We are interested in the estimation of the conditional expectation μ1(u) in (Equation1). The standard kernel regression estimator of μ1(u) based on the internal dataset alone is (2) μˆ1(u) = ∑_{i=1}^n Yi κb(u − Ui) / ∑_{i=1}^n κb(u − Ui), where κb(a) = b^{−p} κ(a/b), κ(⋅) is a given kernel function on U (the range of U), and b > 0 is a bandwidth depending on n. We assume that U is standardized so that the same bandwidth b is used for every component of U in kernel regression. Because of the well-known curse of dimensionality for kernel-type methods, we focus on a low dimension p not varying with n. A discussion of handling a large-dimensional U is given in Section 6.
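
To fix ideas, a minimal sketch of (Equation2) with a product Gaussian kernel is given below; this is our own illustration (the function and variable names are ours), not code from the paper.

```python
import numpy as np

def gaussian_kernel(t):
    # product Gaussian kernel evaluated row-wise on an (m, p) array of scaled differences
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def nw_estimate(u, U, Y, b):
    """Standard kernel (Nadaraya-Watson) estimate of mu_1(u) as in (2).

    u : (p,) evaluation point;  U : (n, p) internal covariates;
    Y : (n,) internal responses;  b : bandwidth.
    kappa_b(a) = b^{-p} kappa(a/b); the factor b^{-p} cancels in the ratio.
    """
    w = gaussian_kernel((u - U) / b)       # kappa((u - U_i)/b), i = 1, ..., n
    return np.sum(w * Y) / np.sum(w)       # sum_i Y_i w_i / sum_i w_i
```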

We consider the case with one external dataset, independent of the internal dataset. Extension to multiple external datasets is straightforward and discussed in Section 6.

In this section we consider the situation where the external dataset contains iid observations (Yi,Ui), i=n+1,,N, from P0, the external population of (Y,U).

2.1. Combining data from homogeneous populations

If we assume that the two populations P1 and P0 are identical, then we can simply combine the two datasets to obtain the kernel estimator (3) μˆ1E1(u) = ∑_{i=1}^N Yi κb(u − Ui) / ∑_{i=1}^N κb(u − Ui), which is obviously more efficient than μˆ1(u) in (Equation2) as the sample size is increased to N > n. The estimator μˆ1E1(u) in (Equation3), however, is not correct (i.e., it is biased) when the populations P1 and P0 are different, because E(Y | U = u, D = 0) for the external population may differ from μ1(u) = E(Y | U = u, D = 1) for the internal population.

2.2. Combining data from heterogeneous populations

We now derive a kernel estimator that uses the two datasets and is asymptotically correct regardless of whether P1 and P0 are the same or not. Let f(y|u, D) be the conditional density of Y given U = u and D = 1 or 0 (for the internal or external population). Then (4) μ1(u) = E(Y | U = u, D = 1) = E{Y f(Y|u, D = 1)/f(Y|u, D = 0) | U = u, D = 0}. The ratio f(Y|u, D = 1)/f(Y|u, D = 0) links the internal and external populations so that we can overcome the difficulty in utilizing the external data under heterogeneous populations.
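
Identity (Equation4) is a change-of-measure argument; writing the conditional expectation as an integral over f(y|u, D = 0) makes it explicit:

```latex
\begin{aligned}
E\left\{Y\,\frac{f(Y\mid u,D=1)}{f(Y\mid u,D=0)}\,\Big|\,U=u,\,D=0\right\}
&=\int y\,\frac{f(y\mid u,D=1)}{f(y\mid u,D=0)}\,f(y\mid u,D=0)\,dy\\
&=\int y\,f(y\mid u,D=1)\,dy=\mu_1(u).
\end{aligned}
```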

If we can construct an estimator fˆ(y|u, D) of f(y|u, D) for every y, u, and D = 0 or 1, then we can modify the estimator in (Equation3) by replacing every Yi with i > n by the constructed response Yˆi = Yi fˆ(Yi|Ui, D = 1)/fˆ(Yi|Ui, D = 0). The resulting kernel estimator is (5) μˆ1E2(u) = {∑_{i=1}^n Yi κb(u − Ui) + ∑_{i=n+1}^N Yˆi κb(u − Ui)} / ∑_{i=1}^N κb(u − Ui). Note that we use the internal data (Yi, Ui), i = 1, …, n, to obtain the estimator fˆ(Yi|Ui, D = 1) and the external data (Yi, Ui), i = n+1, …, N, to construct the estimator fˆ(Yi|Ui, D = 0). Applying kernel estimation, we obtain (6) fˆ(y|U = u, D = 1) = ∑_{i=1}^n κ~b~(y − Yi, u − Ui) / ∑_{i=1}^n κ¯b¯(u − Ui) and fˆ(y|U = u, D = 0) = ∑_{i=n+1}^N κ~b~(y − Yi, u − Ui) / ∑_{i=n+1}^N κ¯b¯(u − Ui), where κ~ and κ¯ are kernels of dimensions p + 1 and p with bandwidths b~ and b¯, respectively. The estimator in (Equation5) is asymptotically valid under some regularity conditions on the kernels and bandwidths, summarized in Theorem 4.1 of Section 4.
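
A compact sketch of (Equation5)–(Equation6) follows; it is our own illustration under a product Gaussian kernel (all names are ours), with the density-ratio step implemented as the plug-in in (Equation6).

```python
import numpy as np

def prod_gauss(t):
    # product Gaussian kernel on an (m, d) array of scaled differences
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def cond_density(y, u, Y, U, b_joint, b_marg):
    """Kernel estimate of f(y | U = u) from one sample (Y, U), as in (6)."""
    p = U.shape[1]
    joint = np.column_stack([Y - y, U - u])                       # (y - Y_i, u - U_i)
    num = np.sum(prod_gauss(joint / b_joint)) / b_joint ** (p + 1)
    den = np.sum(prod_gauss((U - u) / b_marg)) / b_marg ** p
    return num / den

def mu1_E2(u, Y_int, U_int, Y_ext, U_ext, b, b_joint, b_marg):
    """Estimator (5): reweight external responses by the estimated density ratio, then pool."""
    ratio = np.array([
        cond_density(yi, ui, Y_int, U_int, b_joint, b_marg) /     # f-hat(Y_i | U_i, D = 1)
        cond_density(yi, ui, Y_ext, U_ext, b_joint, b_marg)       # f-hat(Y_i | U_i, D = 0)
        for yi, ui in zip(Y_ext, U_ext)
    ])
    Y_hat_ext = Y_ext * ratio                                     # constructed responses Y-hat_i
    U_all = np.vstack([U_int, U_ext])
    Y_all = np.concatenate([Y_int, Y_hat_ext])
    w = prod_gauss((u - U_all) / b)
    return np.sum(w * Y_all) / np.sum(w)
```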

2.3. Combining data from heterogeneous populations with additional information

If additional information exists, then the approach in Section 2.2 can be improved. Assume that the internal and external datasets are formed according to a random binary indicator D such that (Yi, Ui, Di), i = 1, …, N, are iid distributed as (Y, U, D), where Yi and Ui are observed internal data when Di = 1, Yi and Ui are observed external data when Di = 0, and N is still the known total sample size for the internal and external data. In this situation, the internal and external sample sizes are n = ∑_{i=1}^N Di and N − n, respectively, both of which are random. In most applications, the assumption of a random D is not restrictive. From the identity (7) f(Y|u, D = 1)/f(Y|u, D = 0) = {P(D = 1|U = u, Y)/P(D = 0|U = u, Y)} × {P(D = 0|U = u)/P(D = 1|U = u)}, we just need to estimate P(D = 1|U = u, Y) and P(D = 1|U = u) for every u, which can be constructed using, for example, the nonparametric estimators in Fan et al. (Citation1998) for a binary response. For each estimator, both internal and external data on (Y, U) and the indicator D are used.

A further improvement can be made if the following semi-parametric model holds: (8) P(D = 0|U, Y)/P(D = 1|U, Y) = exp{α(U) + γY}, where α(⋅) is an unspecified unknown function and γ is an unknown parameter. From (Equation7)–(Equation8), (9) f(Y|u, D = 1)/f(Y|u, D = 0) = e^{−γY} E(e^{γY} | U = u, D = 1). If γ = 0, then f(Y|u, D = 1) = f(Y|u, D = 0) and the estimator μˆ1E1(u) in (Equation3) is correct. Under (Equation9) with γ ≠ 0, we just need to derive an estimator γˆ of γ and apply kernel estimation to estimate E(e^{γˆY} | U = u, D = 1) as a function of u. Note that we do not need to estimate the unspecified function α(⋅) in (Equation8), which is a nice feature of the semi-parametric model (Equation8).

We now derive an estimator γˆ. Applying (Equation7)–(Equation8) to (Equation4), we obtain that
μ1(u) = E{Y P(D = 1|U = u, Y)/P(D = 0|U = u, Y) | U = u, D = 0} × P(D = 0|U = u)/P(D = 1|U = u)
= E(Y e^{−α(u)−γY} | U = u, D = 0) × E{P(D = 0|U = u, Y) | U = u}/P(D = 1|U = u)
= e^{−α(u)} E(Y e^{−γY} | U = u, D = 0) × E{e^{α(u)+γY} P(D = 1|U = u, Y) | U = u}/P(D = 1|U = u)
= E(Y e^{−γY} | U = u, D = 0) × E{e^{γY} E(D|U = u, Y) | U = u}/P(D = 1|U = u)
= E(Y e^{−γY} | U = u, D = 0) × E(e^{γY} D | U = u)/P(D = 1|U = u)
= E(Y e^{−γY} | U = u, D = 0) × E(e^{γY} | U = u, D = 1),
where the second and third equalities follow from (Equation8) and the last equality follows from E(e^{γY} D | U = u) = E(e^{γY} D | U = u, D = 1) P(D = 1|U = u) + E(e^{γY} D | U = u, D = 0) P(D = 0|U = u) = E(e^{γY} | U = u, D = 1) P(D = 1|U = u), as E(e^{γY} D | U = u, D = 0) = 0. For every real number t, define h(u, t) = E(Y e^{−tY} | U = u, D = 0) E(e^{tY} | U = u, D = 1). Its estimator by kernel regression is
(10) hˆ(u, t) = [∑_{i=1}^N (1 − Di) κˇbˇ(u − Ui) Yi e^{−tYi} / ∑_{i=1}^N (1 − Di) κˇbˇ(u − Ui)] × [∑_{i=1}^N Di κˇbˇ(u − Ui) e^{tYi} / ∑_{i=1}^N Di κˇbˇ(u − Ui)],
where κˇ is a kernel and bˇ is a bandwidth. Then, we estimate γ by (11) γˆ = argmin_t (1/N) ∑_{i=1}^N Di {Yi − hˆ(Ui, t)}², motivated by the fact that the objective function for minimization in (Equation11) approximates E[D{Y − h(U, t)}² | D = 1] and, for any t, E[D{Y − h(U, γ)}² | D = 1] ≤ E[D{Y − h(U, t)}² | D = 1], because h(u, γ) = μ1(u).
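
To illustrate (Equation10)–(Equation11), the sketch below computes hˆ(u, t) from the two kernel regressions and minimizes the criterion in (Equation11) over a user-supplied grid of t values; the grid search and all names are our own simplifications, not the authors' implementation.

```python
import numpy as np

def prod_gauss(t):
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def h_hat(u, t, Y, U, D, b_check):
    """Estimator (10): kernel estimate of E(Y e^{-tY} | U=u, D=0) times that of E(e^{tY} | U=u, D=1)."""
    w = prod_gauss((u - U) / b_check)
    w0, w1 = w * (1 - D), w * D                                   # external / internal weights
    ext_part = np.sum(w0 * Y * np.exp(-t * Y)) / np.sum(w0)
    int_part = np.sum(w1 * np.exp(t * Y)) / np.sum(w1)
    return ext_part * int_part

def gamma_hat(Y, U, D, b_check, t_grid):
    """Estimator (11): least squares over the internal observations, minimized on a grid of t values."""
    Y1, U1 = Y[D == 1], U[D == 1]
    losses = []
    for t in t_grid:
        fitted = np.array([h_hat(ui, t, Y, U, D, b_check) for ui in U1])
        losses.append(np.mean((Y1 - fitted) ** 2))                # proportional to the criterion in (11)
    return t_grid[int(np.argmin(losses))]
```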

Once γˆ is obtained, our estimator of μ1(u) is (12) μˆ1E3(u) = {∑_{i=1}^N Di Yi κb(u − Ui) + ∑_{i=1}^N (1 − Di) Yˆi κb(u − Ui)} / ∑_{i=1}^N κb(u − Ui), with Yˆi = Yi e^{−γˆYi} ∑_{j=1}^n e^{γˆYj} κˇbˇ(Ui − Uj) / ∑_{j=1}^n κˇbˇ(Ui − Uj), in view of (Equation9).
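
Continuing the sketch, (Equation12) reweights each external response by e^{−γˆYi} times a kernel estimate of E(e^{γˆY} | U = Ui, D = 1) built from the internal data, and then pools; again this is our own hedged illustration with our own names.

```python
import numpy as np

def prod_gauss(t):
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def mu1_E3(u, Y, U, D, gamma, b, b_check):
    """Estimator (12), given an estimate gamma of the tilt parameter in (8)."""
    Y_int, U_int = Y[D == 1], U[D == 1]
    Y_ext, U_ext = Y[D == 0], U[D == 0]

    def tilt(ui):
        # kernel estimate of E(e^{gamma Y} | U = ui, D = 1) from the internal data
        w = prod_gauss((ui - U_int) / b_check)
        return np.sum(w * np.exp(gamma * Y_int)) / np.sum(w)

    Y_hat_ext = Y_ext * np.exp(-gamma * Y_ext) * np.array([tilt(ui) for ui in U_ext])
    U_all = np.vstack([U_int, U_ext])
    Y_all = np.concatenate([Y_int, Y_hat_ext])
    w = prod_gauss((u - U_all) / b)
    return np.sum(w * Y_all) / np.sum(w)
```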

In applications, we need to choose bandwidths with the given sample sizes n and N − n. We can apply k-fold cross-validation as described in Györfi et al. (Citation2002). Requirements on the rates of the bandwidths are given in the theorems in Section 4.

3. Constrained kernel regression with unmeasured covariates

We still consider the case with one external dataset, independent of the internal dataset. In this section, the external dataset contains iid observations (Yi,Xi), i=n+1,,N, from the external population P0, where X is a q-dimensional sub-vector of U with q<p.

Since the external dataset has only X, not the entire U, we cannot apply the method in Section 2 when q < p. Instead, we consider kernel regression using external information in a constraint. First, we consider the estimation of the n-dimensional vector μ1 = (μ1(U1), …, μ1(Un))′, where A′ denotes the transpose of a vector or matrix A throughout. Note that the standard kernel regression (Equation2) estimates μ1 as μˆ1 = (∑_{i=1}^n Yi κb(U1 − Ui)/∑_{i=1}^n κb(U1 − Ui), …, ∑_{i=1}^n Yi κb(Un − Ui)/∑_{i=1}^n κb(Un − Ui))′. Taking partial derivatives with respect to the μi's and setting them to zero, we obtain that (13) μˆ1 = argmin_{μ1,…,μn} ∑_{i=1}^n ∑_{j=1}^n κb(Ui − Uj)(Yj − μi)²/∑_{k=1}^n κb(Ui − Uk). We improve μˆ1 by the following constrained minimization, (14) μˆ1Cj = argmin_{μ1,…,μn} ∑_{i=1}^n ∑_{j=1}^n κl(Ui − Uj)(Yj − μi)²/∑_{k=1}^n κl(Ui − Uk) (15) subject to ∑_{i=1}^n {μi − hˆ1Ej(Xi)} g(Xi) = 0, where g(x) = (1, x′)′, l in (Equation14) is a bandwidth that may be different from b in (Equation2) or (Equation13), and hˆ1Ej(x) is the kernel estimator of h1(x) = E(Y | X = x, D = 1) using the jth of the three methods described in Section 2, j = 1, 2, 3. Specifically, hˆ1E1(x) is given by (Equation3), hˆ1E2(x) is given by (Equation5), and hˆ1E3(x) is given by (Equation12), with u and U replaced by x and X, respectively, and kernels and bandwidths suitably adjusted as the dimensions of U and X are different. Note that hˆ1Ej can be computed, as both the internal and external datasets have measured Xi's.

It turns out that μˆ1Cj in (Equation14) has the explicit form μˆ1Cj = μˆ1 + G(G′G)^{−1}G′(hˆ1Ej − μˆ1), where G is the n × (q + 1) matrix whose ith row is g(Xi)′ and hˆ1Ej is the n-dimensional vector whose ith component is hˆ1Ej(Xi). Constraint (Equation15) is an empirical analog of the theoretical constraint E[{μ1(U) − h1(X)} g(X) | D = 1] = 0 (based on internal data), as E{E(Y | U, D = 1) | X, D = 1} = E(Y | X, D = 1) = h1(X). Thus, if hˆ1Ej(⋅) is a good estimator of h1(⋅), then μˆ1Cj in (Equation14) is more accurate than the unconstrained μˆ1 in (Equation13).

To obtain an improved estimator of the entire regression function μ1(⋅) in (Equation1), not just the function at u = Ui, i = 1, …, n, we apply the standard kernel regression with the response vector (Y1, …, Yn)′ replaced by μˆ1Cj in (Equation14), which results in the following three estimators of μ1(u): (16) μˆ1Cj(u) = ∑_{i=1}^n μˆiCj κb(u − Ui)/∑_{i=1}^n κb(u − Ui), j = 1, 2, 3, where μˆiCj is the ith component of μˆ1Cj in (Equation14) and b is the same bandwidth as in (Equation2). The first estimator μˆ1C1(u) is simple, but can be incorrect when the populations P1 and P0 are different. The asymptotic validity of μˆ1C2 and μˆ1C3 is established in the next section.
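
Putting Section 3 together, the sketch below computes the unconstrained fits, applies the explicit projection form of (Equation14)–(Equation15), and re-smooths as in (Equation16). It is our own illustration: the values hˆ1Ej(Xi) are assumed to have been computed beforehand by one of the Section 2 estimators, and all function names are ours.

```python
import numpy as np

def prod_gauss(t):
    return np.exp(-0.5 * np.sum(np.atleast_2d(t) ** 2, axis=1))

def nw_fits(U_eval, U, Y, bw):
    """Kernel regression of Y on U, evaluated at each row of U_eval."""
    return np.array([
        np.sum(prod_gauss((u - U) / bw) * Y) / np.sum(prod_gauss((u - U) / bw))
        for u in U_eval
    ])

def mu1_C(u, Y_int, U_int, X_int, h1_hat_at_X, b, l):
    """Constrained estimator: projection step (explicit form of (14)-(15)) with bandwidth l,
    then the final smoothing (16) with bandwidth b."""
    mu_hat = nw_fits(U_int, U_int, Y_int, l)                      # unconstrained fits at U_1, ..., U_n
    G = np.column_stack([np.ones(len(Y_int)), X_int])             # rows g(X_i)' = (1, X_i')
    coef, *_ = np.linalg.lstsq(G, h1_hat_at_X - mu_hat, rcond=None)
    mu_hat_C = mu_hat + G @ coef                                  # mu_hat + G (G'G)^{-1} G'(h1_hat - mu_hat)
    w = prod_gauss((u - U_int) / b)
    return np.sum(w * mu_hat_C) / np.sum(w)                       # re-smoothing as in (16)
```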

4. Asymptotic normality

We now establish the asymptotic normality of μˆ1Ej(u) and μˆ1Cj(u) for a fixed u, as the sample size of the internal dataset increases to infinity. All technical proofs are given in the Appendix.

The first result is about μˆ1E2(u) in (Equation5). The result is also applicable to μˆ1E1(u) in (Equation3) with an added condition that P1=P0.

Theorem 4.1

Assume the following conditions.

(B1)

The densities f1(u) and f0(u) of U under the internal and external populations, respectively, have continuous and bounded first- and second-order partial derivatives.

(B2)

μ1²(u)fk(u), σk²(u)fk(u), and the first- and second-order partial derivatives of μ1(u)fk(u) are continuous and bounded, where σ1²(u) = E[{Y − μ1(U)}² | U = u, D = 1], σ0²(u) = E[{Y~ − μ1(U)}² | U = u, D = 0], and Y~ = Y f(Y|U, D = 1)/f(Y|U, D = 0). Also, E(|Y|^s | U = u, D = 1) f1(u) and E(|Y~|^s | U = u, D = 0) f0(u) are bounded for a constant s > 2.

(B3)

The kernel κ is of second order, i.e., ∫ u κ(u) du = 0 and 0 < ∫ u u′ κ(u) du < ∞.

(B4)

The bandwidth b satisfies b → 0 and (a + 1) n b^{p+4} → c ∈ [0, ∞), where a = lim_{n→∞} (N − n)/n (assumed to exist without loss of generality).

(B5)

The kernels κ~ and κ¯ in (Equation6) have bounded supports and orders m~ > 2 + 2/p and m¯ > 2, respectively, as defined by Bierens (Citation1987); f(y, u | D = 1) and f(y, u | D = 0) are m~th-order continuously differentiable with bounded partial derivatives, and f1(u) and f0(u) are m¯th-order continuously differentiable with bounded partial derivatives. The functions f(y, u | D = 0) and f1(u) are bounded away from zero. The bandwidths b~ and b¯ satisfy n b~^{p+1}/log(n) → ∞ and n b¯^{p}/log(n) → ∞.

Then, for any fixed u with f0(u) > 0 and f1(u) > 0 and μˆ1E2 in (Equation5), (17) √(nb^p) {μˆ1E2(u) − μ1(u)} →d N(Ba(u), Va(u)), where →d denotes convergence in distribution as n → ∞,
Ba(u) = c^{1/2} {f1(u) A1(u) + a f0(u) A0(u)} / [(a + 1)^{1/2} {f1(u) + a f0(u)}],
A1(u) = ∫ κ(v) {½ v′∇²μ1(u) v + v′∇log f1(u) ∇μ1(u)′ v} dv,
A0(u) = ∫ κ(v) {½ v′∇²μ1(u) v + v′∇log f0(u) ∇μ1(u)′ v} dv,
Va(u) = {f1(u) σ1²(u) + a f0(u) σ0²(u)} / {f1(u) + a f0(u)}² ∫ κ(v)² dv.

Conditions (B1)–(B4) are typically assumed for kernel estimation (Bierens, Citation1987). Condition (B5) is a sufficient condition for (18) max_{i=n+1,…,N} |fˆ(Yi|U = Ui, D = 1)/fˆ(Yi|U = Ui, D = 0) − f(Yi|U = Ui, D = 1)/f(Yi|U = Ui, D = 0)| = op(1)/√(nb^p) (Lemma 8.10 in Newey & McFadden, Citation1994), where op(1) denotes a term tending to 0 in probability. Result (Equation18) implies that the estimation of the ratio f(Y|U, D = 1)/f(Y|U, D = 0) does not affect the asymptotic distribution of μˆ1E2(u) in (Equation5).

Note that both the squared bias Ba²(u) and the variance Va(u) in (Equation17) are decreasing in the limit a = lim_{n→∞}(N − n)/n, a quantity reflecting how much external data we have. In the extreme case of a = 0, i.e., when the size of the external dataset is negligible compared with the size of the internal dataset, result (Equation17) reduces to the well-known asymptotic normality for the standard kernel estimator μˆ1(u) in (Equation2) (Bierens, Citation1987). In the other extreme case of a = ∞, on the other hand, Ba(u) = Va(u) = 0 and, hence, μˆ1E2(u) has a convergence rate faster than 1/√(nb^p), the convergence rate of the standard kernel estimator μˆ1(u).
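
For instance, substituting a = 0 into the expressions of Theorem 4.1 recovers the standard kernel regression bias and variance:

```latex
B_0(u) = c^{1/2} A_1(u), \qquad
V_0(u) = \frac{\sigma_1^2(u)}{f_1(u)} \int \kappa(v)^2\,dv .
```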

The next result is about μˆ1C2(u) in (Equation16) as described in Section 3.

Theorem 4.2

Assume (B1)–(B5) with U and p replaced by X and q, respectively, and the following conditions, where fk(u) and σk²(u), k = 0, 1, are defined in (B1)–(B2).

(C1)

The range U of U is a compact set in the p-dimensional Euclidean space and f1(u) is bounded away from infinity and zero on U; f1(u) and f0(u) have continuous and bounded first- and second-order partial derivatives.

(C2)

Functions μ1(u) = E(Y|U = u) and σ1²(u) are Lipschitz continuous; μ1(u) has bounded third-order partial derivatives; h1(x) = E(Y | X = x, D = 1) has bounded first- and second-order partial derivatives; and E(|Y|^s | U = u, D = 1) is bounded with s > 2 + p/2.

(C3)

All kernel functions are positive, bounded, and Lipschitz continuous with mean zero and finite sixth moments.

(C4)

a = lim_{n→∞}(N − n)/n > 0 and the bandwidths b in (Equation2) and l in (Equation14) satisfy b → 0, l → 0, l/b → r ∈ (0, ∞), nb^p → ∞, and nb^{4+p} → c ∈ [0, ∞), as n → ∞.

(C5)

The densities fX1(x) and fX0(x) of X under the internal and external populations, respectively, are bounded away from zero. There exists a constant s > 4 such that E(|Y|^s | D = 1) and E(|Y~|^s | D = 0) are finite, E(|Y|^s | X = x, D = 1) fX1(x) and E(|Y~|^s | X = x, D = 0) fX0(x) are bounded, and the bandwidth bh for hˆ1 satisfies n^{1−2/s} bh^q / log(n) → ∞.

Then, for any fixed u ∈ U and μˆ1C2(u) in (Equation16), (19) √(nb^p){μˆ1C2(u) − μ1(u)} →d N(Br(u), Vr(u)), where
Br(u) = c^{1/2}[(1 + r²) A1(u) − r² g(x)′ Σg^{−1} E{g(X) A1(U) | D = 1}],
A1(u) = ∫ κ(v){½ v′∇²μ1(u) v + v′∇log f1(u) ∇μ1(u)′ v} dv,
Vr(u) = {σ1²(u)/f1(u)} ∫ {∫ κ(v − rw) κ(w) dw}² dv,

and Σg = E{g(X)g(X)′ | D = 1} is assumed to be positive definite without loss of generality.

The next result is about γˆ in (Equation11).

Theorem 4.3

Suppose that (Equation8) holds for binary random D indicating internal and external data. Assume also the following conditions.

(D1)

The kernel κˇ in (Equation10) is Lipschitz continuous, satisfies ∫ κˇ(u) du = 1, has a bounded support, and has order d > max{(p + 4)/2, p}.

(D2)

The bandwidth bˇ in (Equation10) satisfies N bˇ^{2q}/(log N)² → ∞ and N bˇ^{2d} → 0 as the total sample size of the internal and external datasets N → ∞, where d is given in (D1).

(D3)

γ in (Equation8) is an interior point of a compact domain Γ and is the unique solution to h1(⋅) = h(⋅, t), t ∈ Γ. For any u, h(u, t) is second-order continuously differentiable in t, and h, ∂th, and ∂t²h are bounded over t and u. As t → γ, h(⋅, t), ∂th(⋅, t), and ∂t²h(⋅, t) converge uniformly.

(D4)

suptΓEWt4< and suptΓE[Wt4|U]fU(U) is bounded, where a2=aa, Wt=(DetY,(1D)YetY,D,(1D),DYetY,(1D)Y2etY, DY2etY,(1D)Y3etY), and fU is the density of U. Furthermore, there is a function τ(Y,D) with E{τ(Y,D)}< such that WtWt<τ(Y,D)|tt|.

(D5)

The function ωt(u)=E(Wt|U=u)fU(u) is bounded away from zero, and it is dth-order continuously differentiable with bounded partial derivatives on an open set containing the support of U. There is a functional G(Y,D,ω) linear in ω such that |G(Y,D,ω)|ι(Y,D)ω and, for small enough ωωγ, |ψ(Y,D,ω)ψ(Y,D,ωγ)G(Y,D,ωωγ)| ι(Y,D)ωωγ2, where ι(Y,D) is a function with E{ι(Y,D)}<, ψ(Y,D,ω) =2D(Yω1ω2ω3ω4)(ω2ω5ω1ω6ω3ω4), ωj is the jth component of ω, ω=supxUω(u), ωωγ=supxUω(u)ωγ(u), and U is the range of U. Also, there exists an almost everywhere continuous 8-dimensional function ν(U) with ν(u)du< and E{supδϵν(U+δ)4}< for some ϵ>0 such that E{G(Y,D,ω)}=ν(u)ω(u)du for all ω<.

Then, as the total sample size of internal and external datasets N, (20) N(γˆγ)dN(0,σγ2),(20) where σγ2=[2E{Dγh(U,γ)}2]1Var[ψ(Y,D,ωγ)+ν(U)WγE{ν(U)Wγ}].

Conditions (D1)–(D5) are technical assumptions discussed in Lemmas 8.11 and 8.12 in Newey and McFadden (Citation1994). As discussed by Newey and McFadden (Citation1994), the condition that κˇ has a bounded support can be relaxed, as it is imposed for a simple proof.

Combining Theorems 4.1–4.3, we obtain the following result for μˆ1E3(u) in (Equation12) or μˆ1C3(u) in (Equation16).

Corollary 4.1

Suppose that (Equation8) holds for the binary random D indicating internal and external data.

(i)

Under (B1)–(B4) and (D1)–(D5), result (Equation17) holds with μˆ1E2(u) replaced by μˆ1E3(u).

(ii)

Under (C1)–(C4) and (D1)–(D5) with U and p replaced by X and q, respectively, result (Equation19) holds with μˆ1C2(u) replaced by μˆ1C3(u).

5. Simulation results

5.1. The performance of μˆ1Cj given by (16)

We first present simulation results to examine and compare the performance of the standard kernel estimator μˆ1 in (Equation2) without using external information and our proposed estimator (Equation16) with three variations, μˆ1C1, μˆ1C2, and μˆ1C3, as described at the end of Section 3. We consider U = (X, Z) with univariate covariates X and Z, where Z is unmeasured in the external dataset (p = 2 and q = 1). The covariates are generated in two ways:

  1. normal covariates: (X,Z) is bivariate normal with means 0, variances 1, and correlation 0.5;

  2. bounded covariates: X = BW1 + (1 − B)W2 and Z = BW1 + (1 − B)W3, where W1, W2, and W3 are identically distributed as uniform on [−1, 1], B is uniform on [0, 1], and W1, W2, W3, and B are independent.

Conditioned on (X,Z), the response Y is normal with mean μ(X,Z) and variance 1, where μ(X,Z) follows one of the following four models:

(M1)

μ(X,Z) = X/2 − Z²/4;

(M2)

μ(X,Z)=cos(2X)/2+sin(Z);

(M3)

μ(X,Z)=cos(2XZ)/2+sin(Z);

(M4)

μ(X,Z) = X/2 − Z²/4 + cos(XZ)/4.

Note that all four models are nonlinear in (X,Z); (M1)-(M2) are additive models, while (M3)-(M4) are non-additive.

A total of N = 1,200 data points are generated from the population of (Y, X, Z) as previously described. A data point is treated as internal or external according to a random binary D with conditional probability P(D = 1 | Y, X, Z) = 1/{1 + exp(γ0 + 2|X| + γY)}, where γ = 0 or 1/2 and γ0 = 1 or −1.5. Under γ0 = 1 or −1.5, the unconditional P(D = 1) ≈ n/N is around 13% or 50%, respectively.

The simulation studies the performance of the kernel estimators in terms of the mean integrated squared error (MISE). The following measure is calculated by simulation with S replications: (21) MISE(μ~1) = (1/S) ∑_{s=1}^S (1/T) ∑_{t=1}^T {μ~1(Us,t) − μ1(Us,t)}², where {Us,t : t = 1, …, T} are test data for each simulation replication s, the simulation is repeated independently for s = 1, …, S, and μ~1 denotes the estimator under evaluation, one of μˆ1, μˆ1C1, μˆ1C2, and μˆ1C3, constructed independently of the test data. We consider two ways of generating the test data Us,t. The first is to use T = 121 fixed grid points equally spaced on [−1, 1] × [−1, 1]. The second is to take a random sample of size T = 121 without replacement from the covariates U of the internal dataset, for each fixed s = 1, …, S and independently across s.

To show the benefit of using external information, we calculate the improvement in efficiency defined as (22) IMP = 1 − min{MISE(μ~1)}/MISE(μˆ1), where the minimum is over μ~1 = one of μˆ1, μˆ1C1, μˆ1C2, and μˆ1C3.
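
A minimal sketch of how (Equation21) and (Equation22) can be computed from stored simulation output is given below (our own bookkeeping; the array shapes and names are ours).

```python
import numpy as np

def mise(fitted, truth):
    """MISE (21): fitted and truth are (S, T) arrays of estimated and true values
    of mu_1 at the T test points of each of the S replications."""
    return float(np.mean(np.mean((fitted - truth) ** 2, axis=1)))

def improvement(mise_by_estimator, baseline):
    """IMP (22): one minus the minimum MISE over the candidate estimators divided by
    the MISE of the standard estimator; mise_by_estimator maps labels to MISE values."""
    return 1.0 - min(mise_by_estimator.values()) / mise_by_estimator[baseline]
```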

In all cases, we use the Gaussian kernel. The bandwidths b and l affect the performance of the kernel methods, and we consider two types of bandwidths in the simulation. The first is 'the best bandwidth': for each method, we evaluate the MISE over a pool of bandwidths and display the one with the minimal MISE. This shows the best we can achieve in terms of bandwidth, but it cannot be used in applications. The second is to select the bandwidth from a pool of bandwidths via 10-fold cross-validation (Györfi et al., Citation2002), which produces a decent bandwidth that can be applied to real data.
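
For completeness, a generic k-fold cross-validation for the bandwidth of the standard kernel estimator is sketched below, in the spirit of Györfi et al. (Citation2002); the authors' exact implementation is not specified, and the procedure and names here are our own.

```python
import numpy as np

def cv_bandwidth(Y, U, bandwidth_pool, n_folds=10, seed=0):
    """Pick a bandwidth from a pool by k-fold cross-validation:
    minimize the held-out squared prediction error of the kernel estimator (2)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), n_folds)
    scores = []
    for b in bandwidth_pool:
        err = 0.0
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            for i in test:
                w = np.exp(-0.5 * np.sum(((U[i] - U[train]) / b) ** 2, axis=1))
                err += (Y[i] - np.sum(w * Y[train]) / np.sum(w)) ** 2
        scores.append(err)
    return bandwidth_pool[int(np.argmin(scores))]
```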

The simulated MISE values based on S = 200 replications are shown in Tables 1–4.

Table 1. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only X, with S = 200 under γ = 0, n/N ≈ 13%.

Table 2. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only X, with S = 200 under γ = 0, n/N ≈ 50%.

Table 3. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only X, with S = 200 under γ = 0.5, n/N ≈ 13%.

Table 4. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only X, with S = 200 under γ = 0.5, n/N ≈ 50%.

Consider first the results in Tables 1 and 2. Since γ = 0, all three estimators, μˆ1C1, μˆ1C2, and μˆ1C3, are correct and more efficient than the standard estimator μˆ1 in (Equation2), which does not use external information. The estimator μˆ1C1 is the best, as it uses the correct information that the populations are homogeneous (γ = 0) and is simpler than μˆ1C2 and μˆ1C3.

Next, the results in Tables 3 and 4 for γ = 1/2 indicate that the estimators μˆ1C2 and μˆ1C3, which use a correct constraint, are better than the estimator μˆ1C1 using an incorrect constraint and the estimator μˆ1 that does not use external information. Since μˆ1C3 uses more information, it is in general better than μˆ1C2. Furthermore, with an incorrect constraint, μˆ1C1 can be much worse than μˆ1 without using external information.

5.2. The performance of μˆ1Ej given by (3), (5), or (12)

Under the same simulation setting as described in Section 5.1 but with the covariate Z measured in both the internal and external datasets, we compare the performance of three estimators, μˆ1E1, μˆ1E2, and μˆ1E3 given by (Equation3), (Equation5), and (Equation12), respectively, with the standard kernel estimator μˆ1 in (Equation2) without using external information. The mean integrated squared error (MISE) and improvement (IMP) are calculated using formulas (Equation21) and (Equation22), respectively, with μ~1 = one of μˆ1, μˆ1E1, μˆ1E2, and μˆ1E3.

Tables 5–8 present the simulation results. The relative performance of μˆ1E1, μˆ1E2, μˆ1E3, and μˆ1 follows the same pattern as that of μˆ1C1, μˆ1C2, μˆ1C3, and μˆ1 in Section 5.1.

Table 5. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains both X and Z, with S = 200 under γ = 0, n/N ≈ 13%.

Table 6. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains both X and Z, with S = 200 under γ = 0, n/N ≈ 50%.

Table 7. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains both X and Z, with S = 200 under γ = 0.5, n/N ≈ 13%.

Table 8. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains both X and Z, with S = 200 under γ = 0.5, n/N ≈ 50%.

The only difference between the results here and those in Section 5.1 is that the use of more external data (a smaller n/N) results in a better performance of μˆ1E2 or μˆ1E3 (or μˆ1E1 when it is correct). This is consistent with Theorem 4.1 in Section 4, which shows that both the squared bias Ba²(u) and the variance Va(u) in (Equation17) are decreasing in the limit a = lim_{n→∞}(N − n)/n. On the other hand, the simulation results in Section 5.1 and Theorem 4.2 in Section 4 do not show a clear indication that using more external data produces better estimators. The main reason is that, when Z is not observed in the external dataset, the estimator μˆ1Cj relies more on the internal data to recover the loss of Z from the external dataset in a complicated way.

5.3. The performance of μˆ1Cj given by (16) with q = 2

We re-consider the simulation in Section 5.1 but with the dimension of X equal to q = 2, i.e., U = (X1, X2, Z). We only consider normally distributed covariates with means 0, variances 1, and correlations of 0.5, 0.5, and 0.25 for (X1, Z), (X2, Z), and (X1, X2), respectively. Given U, the response variable Y is normally distributed with mean μ(X1, X2, Z) = X1/2 + X2/4 − Z²/4 and variance 1. Moreover, P(D = 1 | Y, X, Z) = 1/{1 + exp(γ0 + 2|X1| + γY)}, while the remaining settings are the same as in Section 5.1. In calculating the MISE (Equation21), we only use randomly sampled Us,t with T = 121, not fixed grid points. Also, we only evaluate the performance of the estimators μˆ1Cj, since the estimators μˆ1Ej are simpler.

The results are shown in Table 9. Compared with the results in Tables 1–4 for the case of q = 1, the MISEs here are larger due to having more covariates (q = 2). But the relative performance of the estimators is the same as that shown in Tables 1–4.

Table 9. Simulated MISE (Equation21) and IMP (Equation22) when the external dataset contains only normally distributed (X1,X2), with S = 200.

6. Discussion

The curse of dimensionality is a well-known problem for nonparametric methods. Thus, the proposed method in Section 2 is intended for a low-dimensional covariate U, i.e., p is small. If p is not small, then we should reduce the dimension of U prior to applying the proposed constrained kernel regression, or any kernel method. For example, consider a single-index model assumption (K.-C. Li, Citation1991), i.e., μ1(U) in (Equation1) is assumed to satisfy (23) μ1(U) = μ1(η′U), where η is an unknown p-dimensional vector. The well-known SIR technique (K.-C. Li, Citation1991) can be applied to obtain a consistent and asymptotically normal estimator ηˆ of η in (Equation23). Once η is replaced by ηˆ, the kernel method can be applied with U replaced by the one-dimensional 'covariate' ηˆ′U, as sketched below. We can also apply other dimension reduction techniques developed under assumptions weaker than (Equation23) (Cook & Weisberg, Citation1991; B. Li & Wang, Citation2007; Ma & Zhu, Citation2012; Y. Shao et al., Citation2007; Xia et al., Citation2002).
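
The following sketch pairs a basic SIR step (K.-C. Li, 1991) with a one-dimensional kernel regression on ηˆ′U; it is our own illustration of the idea, with our own slicing choices and names, not the authors' implementation.

```python
import numpy as np

def sir_direction(Y, U, n_slices=10):
    """Basic sliced inverse regression: estimate the index direction eta in (23)."""
    n, p = U.shape
    L = np.linalg.cholesky(np.cov(U, rowvar=False))
    Z = (U - U.mean(axis=0)) @ np.linalg.inv(L).T        # standardized covariates
    M = np.zeros((p, p))
    for s in np.array_split(np.argsort(Y), n_slices):    # slice observations by the order of Y
        m = Z[s].mean(axis=0)
        M += (len(s) / n) * np.outer(m, m)               # weighted covariance of slice means
    _, vecs = np.linalg.eigh(M)
    eta = np.linalg.inv(L).T @ vecs[:, -1]               # map the top direction back to the U scale
    return eta / np.linalg.norm(eta)

def single_index_nw(u, Y, U, eta, b):
    """Kernel regression on the one-dimensional 'covariate' eta'U."""
    s_eval, s_data = float(u @ eta), U @ eta
    w = np.exp(-0.5 * ((s_eval - s_data) / b) ** 2)
    return np.sum(w * Y) / np.sum(w)
```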

We now turn to the dimension of X in the external dataset. When the dimension of X is high, we may consider the following approach. Instead of using constraint (Equation15), we use the component-wise constraints (24) ∑_{i=1}^n {μi − hˆ1(k)(Xi(k))} gk(Xi(k)) = 0, k = 1, …, q, where Xi(k) is the kth component of Xi, gk(X(k)) = (1, X(k))′, and hˆ1(k)(Xi(k)) is an estimator of h1(k)(X(k)) = E(Y | X(k), D = 1) obtained using the methods described in Section 2. More constraints are involved in (Equation24), but the estimation only involves the one-dimensional X(k), k = 1, …, q.

The kernel κ adopted in (Equation2) and (Equation16) is a second-order kernel, so the convergence rate of μˆ1E(u) − μ1(u) is n^{−2/(4+p)}. An mth-order kernel with m > 2, as defined by Bierens (Citation1987), may be used to achieve the convergence rate n^{−m/(2m+p)}. Alternatively, we may apply other nonparametric smoothing techniques such as local polynomials (Fan et al., Citation1997) to achieve the convergence rate n^{−m/(2m+p)} with m ≥ 2.

Our results can be extended to scenarios where several external datasets are available. Since each external source may provide different covariate variables, we may need to apply the component-wise constraints (Equation24), estimating hˆ1(k) by combining all the external sources that collect covariate X(k). If the populations of the external datasets are different, then we may have to apply a combination of the methods described in Section 2.

Acknowledgments

The authors would like to thank two anonymous referees for helpful comments and suggestions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

Jun Shao's research was partially supported by the National Natural Science Foundation of China [Grant Number 11831008] and the U.S. National Science Foundation [Grant Number DMS-1914411].

References

  • Bierens, H. J. (1987). Kernel estimators of regression functions. In Advances in Econometrics: Fifth World Congress (Vol. 1, pp. 99–144). Cambridge University Press.
  • Chatterjee, N., Chen, Y. H., Maas, P., & Carroll, R. J. (2016). Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. Journal of the American Statistical Association, 111(513), 107–117. https://doi.org/10.1080/01621459.2015.1123157
  • Cook, R. D., & Weisberg, S. (1991). Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414), 328–332. https://doi.org/10.2307/2290564
  • Dai, C.-S., & Shao, J. (2023). Kernel regression utilizing external information as constraints. Statistica Sinica, 33, in press. https://doi.org/10.5705/ss.202021.0446
  • Fan, J., Farmen, M., & Gijbels, I. (1998). Local maximum likelihood estimation and inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(3), 591–608. https://doi.org/10.1111/1467-9868.00142
  • Fan, J., Gasser, T., Gijbels, I., Brockmann, M., & Engel, J. (1997). Local polynomial regression: optimal kernels and asymptotic minimax efficiency. Annals of the Institute of Statistical Mathematics, 49(1), 79–99. https://doi.org/10.1023/A:1003162622169
  • Györfi, L., Kohler, M., Krzyżak, A., & Walk, H. (2002). A distribution-free theory of nonparametric regression. Springer.
  • Kim, H. J., Wang, Z., & Kim, J. K. (2021). Survey data integration for regression analysis using model calibration. arXiv 2107.06448.
  • Li, B., & Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479), 997–1008. https://doi.org/10.1198/016214507000000536
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327. https://doi.org/10.1080/01621459.1991.10475035
  • Lohr, S. L., & Raghunathan, T. E. (2017). Combining survey data with other data sources. Statistical Science, 32(2), 293–312. https://doi.org/10.1214/16-STS584
  • Ma, Y., & Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107(497), 168–179. https://doi.org/10.1080/01621459.2011.646925
  • Merkouris, T. (2004). Combining independent regression estimators from multiple surveys. Journal of the American Statistical Association, 99(468), 1131–1139. https://doi.org/10.1198/016214504000000601
  • Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142. https://doi.org/10.1137/1109020
  • Newey, W. K. (1994). Kernel estimation of partial means and a general variance estimator. Econometric Theory, 10(2), 1–21. https://doi.org/10.1017/S0266466600008409.
  • Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics, 4, 2111–2245. https://doi.org/10.1016/S1573-4412(05)80005-4
  • Rao, J. (2021). On making valid inferences by integrating data from surveys and other sources. Sankhya B, 83(1), 242–272. https://doi.org/10.1007/s13571-020-00227-w
  • Shao, J. (2003). Mathematical statistics. 2nd ed., Springer.
  • Shao, Y., Cook, R. D., & Weisberg, S. (2007). Marginal tests with sliced average variance estimation. Biometrika, 94(2), 285–296. https://doi.org/10.1093/biomet/asm021
  • Wand, M. P., & Jones, M. C. (1994, December). Kernel smoothing. Number 60 in Chapman & Hall/CRC Monographs on Statistics & Applied Probability, Boca Raton.
  • Wasserman, L. (2006). All of nonparametric statistics. Springer.
  • Xia, Y., Tong, H., Li, W. K., & Zhu, L.-X. (2002). An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3), 363–410. https://doi.org/10.1111/1467-9868.03411
  • Yang, S., & Kim, J. K. (2020). Statistical data integration in survey sampling: a review. Japanese Journal of Statistics and Data Science, 3(2), 625–650. https://doi.org/10.1007/s42081-020-00093-w
  • Zhang, H., Deng, L., Schiffman, M., Qin, J., & Yu, K. (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika, 107(3), 689–703. https://doi.org/10.1093/biomet/asaa014
  • Zhang, Y., Ouyang, Z., & Zhao, H. (2017). A statistical framework for data integration through graphical models with application to cancer genomics. The Annals of Applied Statistics, 11(1), 161–184. https://doi.org/10.1214/16-AOAS998

Appendix

Proof of Theorem 4.1.

Let μ~1(u)=pˆ(u)μˆ1(u)+{1pˆ(u)}μˆ0(u), where μˆ1(u)=i=1nκb(uUi)Yi/i=1nκb(uUi), μˆ0(u)=i=n+1Nκb(uUi)Y~i/i=n+1Nκb(uUi), and pˆ(u)=i=1nκb(uUi)/i=1Nκb(uUi). Under (B3)–(B4), Theorem 2 in Nadaraya (Citation1964) shows that pˆ(u) converges to P(D=1|U=u) in probability. Under (B1)–(B4), nbp{μˆ1(u)μ1(u)}dN(B1(u),V1(u)), B1(u)=c1/2A1(u), V1(u)=σ12(u)f1(u)κ(v)2dv, and n/(Nn)(Nn)bp{μˆ0(u)μ1(u)}dN(B0(u),V0(u)), B0(u)=c1/2A0(u), V0(u)=σ02(u)af0(u)κ(v)2dv. Then (Equation17) holds for μ~1(u), by Slutsky's theorem, the independence between μˆ1 and μˆ0, and the definition of a. The desired result (Equation17) follows from the fact that |μˆ1E2(u)μ~1(u)| is bounded by (A1) {1pˆ(u)}maxi>n|fˆ(Yi|U=Ui,D=1)fˆ(Yi|U=Ui,D=0)f(Yi|U=Ui,D=1)f(Yi|U=Ui,D=0)|i=n+1N|Yi|κb(uUi)i=n+1Nκb(uUi),(A1) which is op(1/nbp) by result (Equation18) under condition (B5).

Proof of Theorem 4.2.

Write (A2) nbp{μˆ1C2(u)μ1(u)}=T1++T6,(A2) where T1=n1/2bp/2δb(u)(InP)Bl1Δlϵ/fˆb(u), T2=n1/2bp/2δb(u){μ1μ1(u)1n}/fˆb(u), T3=n1/2bp/2δb(u)(Bl1Δlμ1μ1)/fˆb(u), T4=n1/2bp/2δb(u)P(Bl1Δlμ1μ1)/fˆb(u), T5=n1/2bp/2δb(u)P(hˆ1h1)/fˆb(u), T6=n1/2bp/2δb(u)P(h1μ1)/fˆb(u), fˆb(u)=i=1nκb(uUi)/n, δb(u)=(κb(uU1),,κb(uUn)), In is the identity matrix of order n, 1n is the n-vector with all components being 1, Bl is the n×n diagonal matrix whose ith diagonal element is fˆl(Ui), Δl is the n×n matrix whose (i,j)th entry is κl(UiUj)/n, ϵ=(ϵ1,,ϵn) with ϵi=Yiμ1(Ui), h1 is the n-dimensional vector whose ith component is h1(Xi), P=G(GG)1G, and G, hˆ1, and μ1 are defined in Section 2.

We first show that T1 in (EquationA2) is asymptotically normal with mean 0 and variance Vr(u) defined in Theorem 4.2. Consider a further decomposition T1=nV+T11+T12+T13, where V=1n2j=1ni=1nS(Ui,ϵi,Uj,ϵj)is a V-statistic with S(Ui,ϵi,Uj,ϵj)=bp/22f1(u){κb(uUi)κl(UiUj)ϵjf1(Ui)+κb(uUj)κl(UjUi)ϵif1(Uj)},T11=bp/2n3/2i=1nκb(uUi)κl(0)ϵif1(u)f1(Ui),T12=bp/2n3/2j=1ni=1nκb(uUi)κl(UiUj)f1(u)f1(Ui){f1(u)f1(Ui)fˆb(u)fˆl(Ui)1}ϵj,and T13=n1/2bp/2δb(u)PBl1Δlϵ/fˆb(u). Note that S1(U1,ϵ1)=E{S(U1,ϵ1,U2,ϵ2)U1,ϵ1}=bp/22f1(u){κl(u2U1)κb(uu2)du2}ϵ1having variance Var{S1(U1,ϵ1)}=bp/24f12(u)f1(u1)σ12(u1){κl(u2u1)κb(uu2)du2}2du1=bp/24f12(u)f1(u1)σ12(u1){κl(v)κb(uu1lν)dν}2du1=14f12(u)f1(ubw)σ12(ubw){κ(v)κ(wνlb)dν}2dw,where σ12() is given in condition (C2), the second and third equalities follow from changing variables u2u1=lν and uu1=bw, respectively. From the continuity of f1() and σ12(), Var{S1(u1,ϵ1)} converges to Vr(u). Therefore, by the theory for asymptotic normality of V-statistics (e.g., Theorem 3.16 in J. Shao, Citation2003), nVdN(0,Vr(u)).

Conditioned on U1,,Un, T11 has mean 0 and variance Var(T11|U1,,Un)=bp4f12(u)n3i=1nκb(uUi)2κl(0)2σ12(Ui)f1(Ui)supuUκ(u)34f12(u)n3b2pi=1nκb(uUi)σ12(Ui)f1(Ui)=op(1).This proves that T11=op(1). Note that E(T12U1,,Un)=0 and Var(T12U1,,Un) is bounded by max{1f12(u),1fˆb2(u)}maxi=1,,n|f1(Ui)fˆl(Ui)1|2Var(nV+T11|U1,,Un).Therefore, under the assumed condition that f1 is bounded away from zero, Lemma 3 in Dai and Shao (Citation2023) implies T12=op(1). Note that T13=bp/2n1/2j=1nWj(u)ϵj,Wj(u)=1ni=1nκb(uUi)g(Xi)fˆb(u)(GG)1i=1nκl(UiUj)g(Xi)fˆl(Ui).Conditioned on U1,,Un, T13 has mean 0 and variance Var(T13U1,,Un)=bpnj=1nWj2(u)σ12(Uj)=Op(bp)=op(1),because, under the assumed condition that f1 is bounded away from zero, Lemma 3 in Dai and Shao (Citation2023) implies maxj=1,,n|Wj(u)g(u)Σg1g(Xj)|=op(1). Thus, T13=op(1). Consequently, T1 has the same asymptotic distribution as nV, the claimed result.

From Lemma 4 in Dai and Shao (Citation2023) and (C4), T2=cA1(u){1+op(1)}. Note that T3=nbpl2nfˆb(u)j=1nκb(uUj)[1nl2fˆb(Uj)i=1nκl(uUi){μ1(Ui)μ1(Uj)}]={cr2nfˆb(u)j=1nκb(uUj)A1(Uj)}{1+op(1)}=cr2A1(u){1+op(1)},where the second equality follows from (A4) and Lemmas 3–4 in Dai and Shao (Citation2023), and the last equality follows from Lemma 2 in Dai and Shao (Citation2023) and continuity of A1(). Also, n1/2T4bp/2=1ni=1nκb(uUi)g(Xi)fˆb(u)(GG)1j=1ng(Xj)nfˆb(Uj)i=1nκl(uUi){μ1(Ui)μ1(Uj)}={g(x)Σg11nj=1ng(Xj)nfˆb(Uj)i=1nκl(uUi){μ1(Ui)μ1(Uj)}}{1+op(1)}={g(x)Σg1l2/pnj=1ng(Xj)A1(Uj)}{1+op(1)}=cr2g(x)Σg1E{g(X)A1(U)}{1+op(1)},where the first equality follows from Lemma 3 in Dai and Shao (Citation2023) and the law of large numbers, the second equality follows from Lemma 4 in Dai and Shao (Citation2023), and the last equality follows from the law of large numbers. Similarly, n1/2T5bp/2=1ni=1nκb(uUi)g(Xi)fˆb(u)(GG)1i=1ng(Xi){hˆ1(Xi)h1(Xi)}=[g(x)Σg11ni=1ng(Xi){hˆ1(Xi)h1(Xi)}]{1+op(1)}{1+op(1)}Op(1)maxj=1,,n|hˆ1(Xj)h1(Xj)|,where the second equality follows from Lemma 3 in Dai and Shao (Citation2023). Under (B1)–(B5) with U and p replaced by X and q, and (C5), Lemma 8.10 in Newey and McFadden (Citation1994) implies that (A3) max|hˆ1(Xi)h1(X1)|=Op(log(n)n2/(q+4)),(A3) which is op(1/nbp)=op(n2/(p+4)) and, hence, T5=op(1). From Lemma 3 in Dai and Shao (Citation2023) and the Central Limit Theorem, T6=bp/2n1/2i=1nκb(uUi)g(Xi)fˆb(u)(GG)1i=1ng(Xi){h1(Xi)μ1(Ui)}=Op(bp/2)=op(1).Combining these results, we obtain that T2++T6=Br(u)+op(1). This completes the proof.

Proof of Theorem 4.3.

Define ωˆt1=1Ni=1NRiκˇb(uUi)etYi,ωˆt2=1Ni=1N(1Ri)κˇb(uUi)YietYi,ωˆt3=1Ni=1NRiκˇb(uUi),ωˆt4=1Ni=1N(1Ri)κˇb(uUi),ωˆt5=1Ni=1NRiκˇb(uUi)YietYi,ωˆt6=1Ni=1N(1Ri)κˇb(uUi)Yi2etYi,ωˆt7=1Ni=1NRiκˇb(uUi)Yi2etYi,ωˆt8=1Ni=1N(1Ri)κˇb(uUi)Yi3etYi.Then, hˆ(u,t)=ωˆt1ωˆt2/ωˆt3ωˆt4, thˆ(u,t)=(ωˆt2ωˆt5ωˆt1ωˆt6)/ωˆt3ωˆt4, and t2hˆ(u,t)=(ωˆt1ωˆt82ωˆt5ωˆt6+ωˆt2ωˆt7)/ωˆt3ωˆt4. Let L(t)=E[R{Yh(U,t)}2], Lˆn(t)=N1i=1NRi{Yihˆ(Ui,t)}2, and Ln(t)=N1i=1NRi{Yih(Ui,t)}2. Taking derivatives with respect to t, we obtain tLˆn(t)=1Ni=1N2Ri{Yihˆ(Ui,t)}thˆ(Ui,t)=1Ni=1Nψ{Yi,Ri,ωˆt(Ui)},tLn(t)=1Ni=1N2Ri{Yih(Ui,t)}th(Ui,t)=1Ni=1Nψ{Yi,Ri,ωt(Ui)},and tL(t)=2E[R{Yh(u,t)}th(u,t)]=E[ψ{Y,R,ωt(U)}],where ψ is given in (D5). Note that tL(γ)=0 and t2L(γ)=2E[{th(U,γ)}2R]=νγ0. We establish the asymptotic normality of γˆ in the following four steps.

Step 1: Since γ is the unique minimizer of L(t), from Theorem 2.1 in Newey and McFadden (Citation1994), it suffices to prove that suptΓ|tLˆn(t)tL(t)|p0. Note that suptΓ|tLˆn(t)tL(t)|suptΓ|tLn(t)tL(t)|+suptΓ|tLˆn(t)tLn(t)|suptΓ|tLn(t)tL(t)|+2ni=1nRi|Yi|{suptΓ,xU|thˆ(u,t)th(u,t)|+suptΓ,uU|hˆ(u,t)thˆ(u,t)h(u,t)th(u,t)|}From (D3), |2R{Yh(u,t)}th(U,t)| is bounded by c|Y| for a constant c and hence Lemma 2.4 in Newey and McFadden (Citation1994) implies that  suptΓ|tLn(t)tL(t)|=op(1). Based on Lemma B.3 in Newey (Citation1994), conditions (D1)–(D4) imply that supuU|ωˆt(u)ωt(u)|0 for all tΓ. As a result, by a similar argument of the proof of Lemma B.3 in Newey (Citation1994), we obtain that suptΓ,uU|ωˆt(u)ωt(u)|p0. Since ωt is bounded away from zero and h(,t) and th(,t) are Lipschitz continuous functions with respect to ωt, suptΓ,uU|hˆ(u,t)h(u,t)|p0 and suptΓ,uU|thˆ(u,t)th(u,t)|p0. These results together with the previous inequality implies that γˆpγ.

Step 2: Conditions (D1)–(D5) ensure that Lemma 8.11 in Newey and McFadden (Citation1994) holds and hence NtLˆn(γ)dN(0,σL2) with σL2=Var{m(Y,R,U,ωγ)+τ(Y,R,U,γ)}.

Step 3: Note that t2Ln(t)=N1i=1N2Ri{Yih(Ui,t)}t2h(Ui,t)+2Ri{th(Ui,t)}2 and sup|tγ||γˆγ||t2Lˆn(t)t2L(γ)|A1+A2+A3, where A1=|t2Ln(γ)t2L(γ)|, A2=suptΓ|t2Lˆn(t)t2Ln(t)|, and the last term A3=sup|tγ||γˆγ||t2Ln(t)t2Ln(γ)|. The law of large numbers guarantees that A1=op(1). A similar argument in Step 1 shows that A2=op(1). For A3, we have |t2Ln(t)t2Ln(γ)|2Ni=1N|{th(Ui,t)}2{th(Ui,γ)}2|+2Ni=1N|Yi||t2h(Ui,t)t2h(Ui,γ)|+2Ni=1N|h(Ui,t)t2h(Ui,t)h(Ui,γ)t2h(Ui,γ)|.Under (D3), h(,t), ∇h(,t), and ∇h(,t) converge uniformly for all x as tγ and, thus, the A3=op(1) because γˆpγ. This shows that sup|tγ||γˆγ||t2Lˆn(t)t2L(γ)|p0.

Step 4: By Taylor's expansion, Lˆn(γˆ)Lˆn(γ)=0Lˆn(γ)=tLˆn(ξ)(γˆγ) for some ξ(γ,γˆ). From the results in Steps 1-3, N(γˆγ)dN(0,[2E{Rγh(U,γ)}2]1σL2). This completes the proof of (Equation20).

Proof of Corollary 4.1.

  1. From Theorem 4.3, (Equation20) shows that γˆγ=Op(1/N). Furthermore, Lemma 8.10 in Newey and McFadden (Citation1994) shows that (A4) maxi|eγYij=1neγYjκˇbˇ(UiUj)j=1nκˇbˇ(UiUj)f(Yi|U=Ui,D=1)f(Yi|U=Ui,D=0)|=Op(logNNbˇp+bˇd),(A4) which is op(N2/(p+4))=op(1)/nbp under the assumed condition d>max{(p+4)/2,p} and Nbˇ2d0. Since γˆγ converges faster than (EquationA4), (Equation18) holds. As a result, (Equation17) holds with μ1E2(u) replaced by μ1E3(u) under (B1)–(B4) and (D1)–(D5).

  2. Under (D1)–(D5) with U replaced by X and p replaced by q, Lemma 8.10 in Newey and McFadden (Citation1994) implies that supxX|hˆ(x,γ)h1(x)|=Op((logN)1/2(Nbˇq)1/2+bˇd)=op(n2/(p+4)).From the asymptotic normality of γˆ, γˆγ=Op(1/N), which converges to 0 faster than supxX|hˆ(x,γ)h1(x)|0. Hence (EquationA3) holds while hˆ1 is estimated by hˆ(X,γˆ). Then, the rest of proof of the second claims follows the argument in the proof of Theorem 4.2.