
Estimating the density in a subgroup with imputed subgroup indicators

Article: 2279720 | Received 31 Jul 2023, Accepted 30 Oct 2023, Published online: 11 Dec 2023

Abstract

In this paper, I consider the problem of how to estimate the density in a subgroup when some of the subgroup indicators are missing at random. Four different imputation estimators are compared to each other and to an inverse probability weighted estimator suggested previously. An optimal estimator is derived. I also provide expressions for the asymptotic variance of the imputation estimators including terms of order 1/(nh) and 1/n.

1 Introduction

Missing data is a problem in many applications of statistics. Observations are incompletely observed either by accident or intentionally, and failure to take the incompleteness into account in the statistical analysis may bias the results. Strategies for analyzing incomplete data range from ad hoc methods to principled methods. The focus in this paper is on imputation, i.e., filling in suitable values for the missing observations. This leads to a complete data set, which allows us to use the estimator we would have used had we had complete data. The disadvantage of imputation is that the estimation of standard errors is complicated by the fact that the imputed data does not give the same information as the original data. Hence using the formula for the standard error of the complete data estimator will typically underestimate the uncertainty.

The problem considered in this paper is how to estimate the density of a subgroup of the sample when it is not known for some of the observations whether they belong to the relevant subgroup. If the reason for the missing group membership is related to the variable for which we wish to estimate the density, then a complete case estimator—a density estimator based on observations known to be in the group of interest—will be biased. A typical example of such data comes from medical statistics, where it may be too costly or too difficult to ascertain the disease status of all patients. As a consequence, disease status will be more likely to be known for patients that are more likely to be ill. If we are interested in the distribution of a measurement in the diseased sub-population, those known to be ill will then be a biased sample. A proxy measurement may be available for all patients, allowing us to impute the missing disease status. In business economics, data with partially missing group information occur for instance in auditing and rejection inference. Not all records are audited, and in many cases the records selected for audit are not chosen completely at random but based on dollar amount (“sampling proportional to size”) or perceived likelihood of being flawed (“non-statistical” sampling). If we are interested in, e.g., the distribution of the dollar amount among the flawed records, the inspected records may be a biased sample. When applying for a loan, the application is granted or denied based on the applicant’s credit status and other information. Rejection inference aims at improving credit scoring by incorporating information from rejected applicants who would have repaid the loan had it been granted. But whether an applicant would have repaid their loan is obviously missing for rejected applicants, and the non-rejected applicants will typically be a biased sample.

The problem of estimating a subgroup density with missing group membership information has previously been considered by Tang, He, and Gunzler (Citation2012), who use inverse probability weighted kernel density estimation to get a consistent asymptotically normal estimator of the unknown density. Imputation—replacing missing group membership information with suitable predictions or simulations—is an obvious alternative approach. In this paper, I will consider a number of imputation estimators and compare them to the estimator suggested by Tang, He, and Gunzler (Citation2012).

The remainder of this paper is structured as follows. In Section 2, I define notation and specify the regularity assumptions that will be used to derive the main asymptotic results, which are given in Section 3. The following section gives results from a small simulation experiment, and the estimators are then applied to data in Section 5. In Section 6, I present higher-order expressions for the asymptotic variance of our estimators, before I conclude in Section 7. Proofs are deferred to the Appendix.

2 Set-up

Let $Y_1, \ldots, Y_n$ be iid real random variables with an unknown density $f$. Let $D_1, \ldots, D_n$ be iid 0–1 variables, such that $D_i = 1$ if and only if the $i$th observation belongs to the group of interest. Hence, I am interested in estimating the conditional density, $f_1$, of $Y_i$ given $D_i = 1$.

Let
$$p(y) = P\{D_i = 1 \mid Y_i = y\}$$
and note that for all $y \in \mathbb{R}$,
$$p(y)f(y) = f_1(y)\,p \quad \text{with} \quad p = P\{D_i = 1\}.$$

I assume that p(y) > 0 whenever f(y) > 0, so that the marginal distribution of Yi and the conditional distribution of Yi given Di = 1 share support.

I will assume that the $D_i$’s are missing at random (MAR), i.e., that the probability that $D_i$ is missing does not depend on $D_i$ but only on observed data. As Tang, He, and Gunzler (Citation2012) do, I will allow this probability to depend not only on $Y_i$ but also on additional variables $X_i$. If $R_i$ denotes the “response indicator” such that $R_i = 1$ if and only if $D_i$ is observed, the assumption of MAR is that
$$\pi(x,y) = P\{R_i = 1 \mid X_i = x, Y_i = y\} = P\{R_i = 1 \mid X_i = x, Y_i = y, D_i = d\}, \quad d = 0, 1. \qquad (1)$$

I let πi denote π(Xi,Yi).

As mentioned in Section 1, a complete case estimator of the unknown conditional density will in general be biased if the probability that $D_i$ is missing depends on $Y_i$. Estimating the density using inverse probability weights requires that we can estimate the $\pi_i$’s; the assumption of MAR makes this feasible. Estimating the density using imputation requires, on the other hand, that we can estimate the distribution of $D_i$ given $Y_i$ and $X_i$. A consequence of MAR is that
$$p(x,y) = P\{D_i = 1 \mid X_i = x, Y_i = y\} = P\{D_i = 1 \mid X_i = x, Y_i = y, R_i = 1\}. \qquad (2)$$

Thus, the conditional probability of belonging to the group of interest can be estimated from the complete cases, i.e., observations with $R_i = 1$. I put $p_i = p(X_i, Y_i)$. I will assume throughout this paper that the vectors $(Y_i, D_i, R_i, X_i)$, $i = 1, \ldots, n$, are iid.

As a model for the conditional probability $p(x, y)$ of $D_i = 1$ given $X_i = x$ and $Y_i = y$, I will assume a generalized linear model with linear predictor $Z_i'\beta$, where $\beta$ is an unknown parameter and $Z_i$ is a vector of explanatory variables constructed from $Y_i$ and $X_i$. Assuming that the components of $Z_i$ have second moments, it follows that
$$\sqrt{n}(\hat\beta - \beta) = \frac{1}{\sqrt n}\sum_{i=1}^n S_i + o_P(1)$$
where
$$S_i = \frac{R_i(D_i - p_i)}{p_i(1 - p_i)}\,p_i'\,I(\beta)^{-1}Z_i \qquad (3)$$
with $I(\beta)$ equal to the expected information matrix (given $R_i = 1$) and $p_i'$ the derivative of the inverse link function taken at the $i$th linear predictor, i.e., $\frac{d}{d\beta}p_i = p_i' \cdot Z_i$. The factor $R_i$ in (3) is due to the fact that the estimation is based on complete cases only. I will assume that the inverse link function of this generalized linear model has a bounded derivative. Commonly used link functions, such as the logit and probit, satisfy this assumption. I let $\hat p_i$ equal the fitted probability at $(X_i, Y_i)$. I will refer to the assumptions on the distribution of the data made so far as assumption A.
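To fix ideas, here is a minimal numpy sketch—not code from the paper—of fitting such a binary regression on the complete cases only, which (2) licenses. A logit link is assumed, and the construction of the design matrix Z from $(Y_i, X_i)$ is left unspecified.

```python
import numpy as np

def fit_binary_glm(Z, d, r, n_iter=25):
    """Fit P(D = 1 | Z) by logistic regression, using complete cases (r == 1) only.

    Z : (n, k) design matrix built from (Y_i, X_i); d : 0-1 group indicators,
    arbitrary where r == 0; r : 0-1 response indicators. Returns beta_hat."""
    Zc, dc = Z[r == 1], d[r == 1]
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Zc @ beta)))
        W = p * (1.0 - p)                      # GLM working weights
        # Newton-Raphson step: (Zc' W Zc)^{-1} Zc' (dc - p)
        beta += np.linalg.solve(Zc.T @ (W[:, None] * Zc), Zc.T @ (dc - p))
    return beta

def fitted_probabilities(Z, beta):
    """p_hat at every observation; these drive the imputations in Section 3."""
    return 1.0 / (1.0 + np.exp(-(Z @ beta)))
```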

I choose a kernel function, $K$, such that
$$\int K(u)\,du = 1, \qquad \int uK(u)\,du = 0, \qquad \int u^2K(u)\,du = 1, \qquad K_2 \equiv \int K(u)^2\,du < \infty. \qquad (4)$$

These assumptions, which I will refer to as assumption B, are satisfied for instance if the kernel $K$ is a bounded symmetric probability density with variance equal to 1. Finally, I need some smoothness of various functions of $y$: $f(y)$, $f_1(y)$, $p(y)$, $E[p_i^2 \mid Y_i = y]$, $E[\pi_i p_i \mid Y_i = y]$, and $E[\pi_i p_i^2 \mid Y_i = y]$ are all assumed to be twice continuously differentiable with a second derivative which is locally Lipschitz, so that for each $y \in \mathbb{R}$ there is a constant (possibly depending on $y$) such that for $f_1(y)$
$$|f_1''(y) - f_1''(y + \delta)| \le \text{const} \cdot \delta$$
for $\delta$ sufficiently small (possibly depending on $y$), and similarly for the other functions. I do not need smoothness of all of these functions for each result. Consequently, I will let assumption C1 be the assumption about $f(y)$, C2 the assumption on $f_1(y)$, and so on. For completeness, I have listed the assumptions in the Appendix.

3 Asymptotic results

With complete data, the kernel density estimator for $f_1$ is given by
$$\hat f_1(y) = \frac{\sum_{i=1}^n D_i K\left(\frac{y - Y_i}{h}\right)/h}{\sum_{i=1}^n D_i}. \qquad (5)$$

Provided that $n \to \infty$ and $h \to 0$ such that $nh \to \infty$,
$$\hat f_1(y) \overset{\text{approx}}{\sim} N\left(f_1(y) + \frac{h^2}{2}f_1''(y),\; \frac{f_1(y)}{pnh}K_2\right).$$
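As a point of reference for the sketches that follow, the complete data estimator (5) with a standard normal kernel (which satisfies assumption B) can be coded in a few lines; this helper is reused below and is an illustration, not the paper's implementation.

```python
import numpy as np

def kde_subgroup(y_grid, Y, D, h):
    """Estimator (5): kernel density estimate of Y restricted to the D = 1 group.

    D may hold 0-1 indicators or, as in later sections, imputed values in [0, 1]."""
    u = (y_grid[:, None] - Y[None, :]) / h            # grid points x observations
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)    # standard normal kernel
    return (K @ D) / (h * D.sum())
```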

With missing data, a complete case estimator, i.e., replacing $D_i$ by $D_iR_i$ in (5), will be biased, with an asymptotic mean equal to
$$\frac{1}{E[\pi_1 p_1]}\left(E[\pi_1 p_1 \mid Y_1 = y]f(y) + \frac{h^2}{2}\frac{d^2}{dy^2}\Big(E[\pi_1 p_1 \mid Y_1 = y]f(y)\Big)\right) + o(h^2),$$
the density of $Y_i$ given that $D_i = 1$ and $R_i = 1$ plus $O(h^2)$-terms.

One way of eliminating this bias is to weight each observed $D_i$ with the inverse probability of observing it, i.e., to replace $D_i$ in (5) by $D_iR_i/\hat\pi_i$, where $\hat\pi_i$ is an estimator of $\pi_i$. Tang, He, and Gunzler (Citation2012) show, using a logistic regression for estimating $\pi$, that the asymptotic distribution of the resulting estimator is
$$N\left(f_1(y) + \frac{h^2}{2}f_1''(y),\; \frac{E[1/\pi_1 \mid Y_1 = y]f_1(y)}{pnh}K_2\right)$$
as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$, for $y$ in a bounded set. It seems that they implicitly assume that $p(x, y)$ only depends on $y$. They also consider the case where $\pi_i$ is known and show that the weighted estimator using the known $\pi_i$’s has the same asymptotic distribution as when the $\pi_i$’s are estimated.
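In the notation of the sketch above, the weighted estimator amounts to a one-line reweighting; pi_hat would come from a binary regression of $R_i$ on $(X_i, Y_i)$, which is not spelled out here.

```python
def kde_ipw(y_grid, Y, D, R, pi_hat, h):
    """Inverse probability weighted estimator: D_i in (5) is replaced by
    R_i * D_i / pi_hat_i, so observed indicators are upweighted where response
    is unlikely; reuses kde_subgroup from the sketch above.
    D may be arbitrary where R == 0, as those entries get weight 0."""
    W = np.where(R == 1, D / pi_hat, 0.0)
    return kde_subgroup(y_grid, Y, W, h)
```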

In the following subsections, I will consider the asymptotic distribution of the density estimator (5), when missing Di’s are replaced by imputations.

3.1 Single imputation

The simplest imputation estimator is obtained by replacing missing $D_i$’s by simulated 0–1 variables whose success probabilities are given by $\hat p_i = \hat p(X_i, Y_i)$, the fitted values from the generalized linear model. Hence, the estimator is given by
$$\hat f_{1,SI}(y) = \frac{\sum_{i=1}^n \tilde D_i K\left(\frac{y - Y_i}{h}\right)/h}{\sum_{i=1}^n \tilde D_i} \qquad (6)$$

with
$$\tilde D_i = R_iD_i + (1 - R_i)\mathbf{1}\{U_i \le \hat p_i\},$$
where $(U_i)_{i=1,\ldots,n}$ is a sequence of independent uniformly distributed random variables, independent of the observations. For this estimator, I obtain the following result:

Theorem 1.

Under assumptions A, B, C1, C2, and C4–C6, for any $y$ in the interior of the support of $Y_i$,
$$\hat f_{1,SI}(y) \overset{\text{approx}}{\sim} N\left(f_1(y) + \frac{h^2}{2}f_1''(y),\; \frac{f_1(y)}{pnh}K_2\right)$$
as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$.

I only give results for y’s in the interior of the support. At boundary points the bias of the kernel density estimator may be considerably larger. I conjecture that using a boundary kernel in this missing data set-up will give results similar to those found in complete data cases but proving this is beyond the scope of this paper. Alternatively, boundary effects can be mitigated using transformation as I do in Section 5.

The proof of Theorem 1 may be found in the Appendix. The asymptotic variance may be consistently estimated by
$$\hat f_{1,SI}(y)\Big/\Big(h\sum_{i=1}^n \tilde D_i\Big) \cdot K_2.$$

It is somewhat surprising that not only does the asymptotic distribution of the imputation estimator $\hat f_{1,SI}(y)$ not depend on the missing data mechanism in any way, it is in fact the same asymptotic distribution as that of the complete data estimator (5). In particular, the estimator is more efficient than the inverse probability weighted estimator considered by Tang, He, and Gunzler (Citation2012). This is due to the fact that the rate of convergence of $\hat\beta$ is $\sqrt n$, whereas the rate of convergence of the kernel density estimator is only $\sqrt{nh}$. Therefore the estimation uncertainty regarding $\beta$ may be ignored asymptotically, and the imputations may be treated as if they were the actual (missing) observations. As I will discuss later, the estimation of $\beta$ may well influence the precision of the imputation estimator (6) in finite samples.
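A sketch of (6), under the same assumptions as the helpers above:

```python
def kde_single_imputation(y_grid, Y, D, R, p_hat, h, rng):
    """Single imputation (6): draw missing D_i as Bernoulli(p_hat_i), apply (5)."""
    U = rng.uniform(size=Y.size)                      # the U_i of the definition
    D_tilde = np.where(R == 1, D, (U <= p_hat).astype(float))
    return kde_subgroup(y_grid, Y, D_tilde, h)

# e.g.: f_hat = kde_single_imputation(grid, Y, D, R, p_hat, h, np.random.default_rng(1))
```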

3.2 Multiple imputation

Multiple imputation (Rubin Citation1987) is a popular alternative to the single imputation considered in the previous subsection. Rather than just one set of imputations ($\tilde D_i$ for $i$ such that $R_i = 0$), $B > 1$ sets of imputations are generated, $B$ imputation estimators similar to (6) are calculated, and they are then averaged to obtain the estimator
$$\hat f_{1,MI}(y) = \frac{1}{B}\sum_{j=1}^B \hat f_{1,j}(y) \quad \text{where} \quad \hat f_{1,j}(y) = \frac{\sum_{i=1}^n \tilde D_{i,j} K\left(\frac{y - Y_i}{h}\right)/h}{\sum_{i=1}^n \tilde D_{i,j}}$$

with
$$\tilde D_{i,j} = R_iD_i + (1 - R_i)\mathbf{1}\{U_{i,j} \le \hat p_{i,j}\},$$
where $(U_{i,j})_{i=1,\ldots,n,\,j=1,\ldots,B}$ are independent uniformly distributed random variables independent of the data; $\hat p_{i,j}$ will be discussed below. Apart from decreasing the simulation noise, multiple imputation is attractive because—under suitable regularity assumptions—the variance of the estimator may be estimated by a combination of the average of the “complete data variance estimators” and the empirical variance of the $B$ imputation estimators:
$$\frac{1}{B}\sum_{j=1}^B \frac{\hat f_{1,j}(y)}{h\sum_{i=1}^n \tilde D_{i,j}}K_2 + \left(1 + \frac{1}{B}\right)\frac{1}{B-1}\sum_{j=1}^B \Big(\hat f_{1,j}(y) - \hat f_{1,MI}(y)\Big)^2. \qquad (7)$$

Rubin (Citation1987) suggests that a requirement for the suitability of (7) as an estimator of the asymptotic variance of the multiple imputation estimator $\hat f_{1,MI}(y)$ is that the imputations are generated at least “approximately” from a Bayesian model. Thus, $\hat p_{i,j}$ should be the predicted probability of $D_i = 1$ given $(X_i, Y_i)$ using $\tilde\beta_j$ as the value of the unknown parameter, where $\tilde\beta_j$, $j = 1, \ldots, B$, are independent draws from the posterior distribution of $\beta$ given the observed data. This is my assumption D.

In the present case I obtain the following result:

Theorem 2.

Under assumptions A, B, C1, C2, C4–C6, and D, for any $y$ in the interior of the support of $Y_i$,
$$\hat f_{1,MI}(y) \overset{\text{approx}}{\sim} N\left(f_1(y) + \frac{h^2}{2}f_1''(y),\; \Big(v_1(y) + \frac{1}{B}v_2(y)\Big)\frac{f(y)}{p^2nh}K_2\right)$$
with
$$v_1(y) = E[\pi_1 p_1 + (1 - \pi_1)p_1^2 \mid Y_1 = y], \qquad v_2(y) = E[(1 - \pi_1)p_1(1 - p_1) \mid Y_1 = y],$$

as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$. Moreover, the variance estimator (7) estimates
$$\frac{f_1(y)}{pnh}K_2 + \left(1 + \frac{1}{B}\right)\frac{v_2(y)f(y)}{p^2nh}K_2.$$

The asymptotic distribution of $\hat f_{1,MI}(y)$ does not depend on whether the $\tilde\beta_j$’s are drawn from a Bayesian posterior distribution or are all equal to the MLE $\hat\beta$. As in the previous section, this is due to the faster rate of convergence. In fact, the Bernstein–von Mises theorem says that, given the observed data,
$$\tilde\beta_j \overset{\text{approx}}{\sim} N\Big(\hat\beta,\; I(\hat\beta)^{-1}/n\Big). \qquad (8)$$

This means that the difference between the $\tilde\beta_j$’s and $\hat\beta$ is negligible compared to the $\sqrt{nh}$-rate of convergence of the kernel density estimator.

Note that $v_1(y) + v_2(y) = p(y)$, so that when $B = 1$ the asymptotic variance of $\hat f_{1,MI}(y)$ equals the asymptotic variance of $\hat f_{1,SI}(y)$, as it should. Furthermore, the asymptotic variance of $\hat f_{1,MI}(y)$ is strictly decreasing as $B$ increases. Note also that the variance estimator (7) is biased. This is due to the fact that the imputations asymptotically may be treated as if they were actual observations and not simulations. A simple asymptotically unbiased estimator may in principle be obtained by
$$\frac{1}{B}\sum_{j=1}^B \frac{\hat f_{1,j}(y)}{h\sum_{i=1}^n \tilde D_{i,j}}K_2 + \left(\frac{1}{B} - 1\right)\frac{1}{B-1}\sum_{j=1}^B \Big(\hat f_{1,j}(y) - \hat f_{1,MI}(y)\Big)^2.$$

In practice, however, this estimator will be negative with positive probability, as the first term is asymptotically a constant divided by $nh$, and the second term is asymptotically $\chi^2$-distributed with $B - 1$ degrees of freedom and a scale parameter inversely proportional to $nh$. A consistent estimator may be obtained for instance by kernel regression of $R_i\hat p_i + (1 - R_i)\hat p_i^2$ and of $(1 - R_i)\hat p_i(1 - \hat p_i)$ on $Y_i$ to obtain estimates of $v_1(y)$ and $v_2(y)$, respectively, combined with a kernel density estimator of the marginal density $f(y)$.
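To make the estimator and Rubin's formula (7) concrete, here is a sketch of the multiple imputation procedure; the rows of p_hat_draws are fitted probabilities under B posterior draws of β (assumption D), and passing B identical rows gives the MLE-based variant discussed above.

```python
K2 = 1.0 / (2.0 * np.sqrt(np.pi))   # integral of the squared standard normal kernel

def kde_multiple_imputation(y_grid, Y, D, R, p_hat_draws, h, rng):
    """Multiple imputation estimator with Rubin's variance formula (7)."""
    B, n = p_hat_draws.shape
    f_js, within = [], []
    for j in range(B):
        U = rng.uniform(size=n)
        D_t = np.where(R == 1, D, (U <= p_hat_draws[j]).astype(float))
        f_j = kde_subgroup(y_grid, Y, D_t, h)
        f_js.append(f_j)
        within.append(f_j / (h * D_t.sum()) * K2)   # "complete data" variance estimate
    f_js = np.asarray(f_js)
    f_MI = f_js.mean(axis=0)
    # Rubin's combination (7): mean within-variance plus inflated between-variance
    var_7 = np.mean(within, axis=0) + (1.0 + 1.0 / B) * f_js.var(axis=0, ddof=1)
    return f_MI, var_7
```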

3.3 Fixed imputation

The multiple imputation estimator is asymptotically equivalent to a “single imputation” estimator based on the average of the imputations, i.e., (6) with $\tilde D_i = \frac{1}{B}\sum_{j=1}^B \tilde D_{i,j}$. Generating the imputations using $\hat\beta$ instead of $\tilde\beta_j$ and letting $B$ tend to infinity, I obtain the estimator
$$\hat f_{1,FI}(y) = \frac{\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)\hat p_i\Big)K\left(\frac{y - Y_i}{h}\right)/h}{\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)\hat p_i\Big)}.$$

This is often called a fixed imputation estimator. In many cases, “random imputations” (as used in the two previous sections) are preferable to “fixed imputations”, as “fixed imputations” may lead to biased estimators if a nonlinear transformation is applied to the imputations. However, “fixed imputations” do not lead to biased estimation of the density here:

Theorem 3.

Under assumptions A, B, C1, C2, and C4–C6, for any $y$ in the interior of the support of $Y_i$,
$$\hat f_{1,FI}(y) \overset{\text{approx}}{\sim} N\left(f_1(y) + \frac{h^2}{2}f_1''(y),\; E[\pi_1 p_1 + (1 - \pi_1)p_1^2 \mid Y_1 = y]\frac{f(y)}{p^2nh}K_2\right)$$
as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$.

As in the previous subsection, the asymptotic variance may be estimated from a kernel regression and a marginal density estimate.
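In code, the fixed imputation estimator differs from the single imputation sketch only in what is filled in for the missing indicators:

```python
def kde_fixed_imputation(y_grid, Y, D, R, p_hat, h):
    """Fixed imputation: each missing D_i is replaced by its estimated
    conditional mean p_hat_i instead of a simulated 0-1 value."""
    return kde_subgroup(y_grid, Y, np.where(R == 1, D, p_hat), h)
```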

3.4 Optimal estimation

So far I have shown that, if the distribution of $D_i$ given $(X_i, Y_i)$ can be modeled parametrically, then imputed data are asymptotically as good as observed data for estimating the conditional density of $Y_i$ given $D_i = 1$. Improvements in the form of lower asymptotic variances are obtained by the multiple imputation and fixed imputation schemes. As fixed imputations are better than random imputations, and random imputations are as good as real data, it is natural to suppose that imputing all the $D_i$’s (using fixed imputations), regardless of whether they are missing or not, will lead to an even better estimator:
$$\hat f_{1,RB}(y) = \frac{\sum_{i=1}^n \hat p_i K\left(\frac{y - Y_i}{h}\right)/h}{\sum_{i=1}^n \hat p_i}.$$

Indeed, I obtain:

Theorem 4.

Under assumptions A, B, and C1–C4, for any $y$ in the interior of the support of $Y_i$,
$$\hat f_{1,RB}(y) \overset{\text{approx}}{\sim} N\left(f_1(y) + \frac{h^2}{2}f_1''(y),\; E[p_1^2 \mid Y_1 = y]\frac{f(y)}{p^2nh}K_2\right)$$
as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$.

As such the result—that the estimator is improved if the group indicators are replaced by estimates of the conditional probabilities—is not surprising. The faster rate of convergence ensures that the estimated pi’s may be treated as known, and then the improvement is just an example of a “Rao-Blackwellization”. This phenomenon occurs in other missing data examples, too. For instance, Müller (Citation2009) shows that what she calls “full imputation” (replacing observed data with imputations) is efficient in her model, a semi-parametric regression with outcomes missing at random, when the imputations are chosen optimally. She does not consider smoothing, so her results are not directly applicable to the problem considered in this paper.
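The corresponding sketch simply ignores the observed indicators altogether:

```python
def kde_rao_blackwell(y_grid, Y, p_hat, h):
    """'Rao-Blackwellized' estimator: every subgroup indicator, observed or not,
    is replaced by the fitted probability p_hat_i."""
    return kde_subgroup(y_grid, Y, p_hat, h)
```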

The question is whether there are further improvements that may easily be obtained. I restrict attention to estimators of the form
$$\hat f_{1,d}(y) = \frac{\sum_{i=1}^n d_i K\left(\frac{y - Y_i}{h}\right)/h}{\sum_{i=1}^n d_i} \qquad (9)$$

with $d_i = d(Y_i, X_i, R_i, D_i, U_i)$, where $U_i$ is a (simulated) uniformly distributed random variable (so that random imputation estimators are included), such that
$$d(Y_i, X_i, 0, 1, U_i) = d(Y_i, X_i, 0, 0, U_i) \quad \text{and} \quad E[d_i \mid Y_i = y] = p(y).$$

The first of these restrictions ensures that the estimator only depends on observed data, i.e., does not depend on $D_i$ when $R_i = 0$. The second ensures that the estimator (9) is asymptotically unbiased. The imputation estimators belong to this class if $p(x, y)$ is known, as does the estimator considered by Tang, He, and Gunzler (Citation2012) when $\pi$ is known. The asymptotic variance of an estimator of this form is $E[d_1^2 \mid Y_1 = y]\frac{f(y)}{p^2nh}K_2$, and as
$$E[d_1^2 \mid Y_1 = y] = \operatorname{Var}[d_1 \mid Y_1 = y] + p(y)^2 \ge p(y)^2,$$
with equality if and only if $d_i = p(Y_i)$ almost surely, the optimal estimator (within the specified class) is given by
$$\hat f_1(y) = \frac{\sum_{i=1}^n p(Y_i)K\left(\frac{y - Y_i}{h}\right)/h}{\sum_{i=1}^n p(Y_i)}.$$

In practice, $p(y)$ is typically unknown, making the optimal estimator infeasible. If, however, we are able to correctly specify a parametric model satisfying the regularity conditions listed in Section 2 for this conditional probability, then Theorem 4 shows that the estimator has the same asymptotic distribution as the infeasible optimal estimator:
$$\hat f_1(y) \overset{\text{approx}}{\sim} N\left(f_1(y) + \frac{h^2}{2}f_1''(y),\; p(y)\,\frac{f_1(y)}{pnh}K_2\right).$$

However, estimating $p(y)$ may be complicated in our set-up. If the missingness depends on $X_i$ as well as on $Y_i$, then a regression of $D_i$ on $Y_i$ using complete cases only will give a biased estimator of $p(y)$. One option could be to use inverse probability weighting,
$$\frac{1}{n}\sum_{i=1}^n \frac{R_i}{\hat\pi_i}\tilde S_i(\gamma) = 0,$$
where $\tilde S_i(\gamma)$ is the score function for a binomial regression of $D_i$ on $Y_i$. With suitable regularity assumptions (see Robins, Rotnitzky, and Zhao Citation1995), the fitted $\hat p(Y_i)$ will be $\sqrt n$-consistent, and we will obtain the asymptotic result outlined above.
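The weighted score equation is an ordinary binomial regression of $D_i$ on $Y_i$ over the complete cases with case weights $R_i/\hat\pi_i$; a sketch with a logit link (the paper does not fix the link for this regression) might look as follows.

```python
def fit_p_of_y_ipw(Zy, D, R, pi_hat, n_iter=25):
    """Solve (1/n) sum_i (R_i / pi_hat_i) S_tilde_i(gamma) = 0 by Newton-Raphson.

    Zy : (n, k) design in Y_i only (e.g. an intercept column and Y_i).
    Cases with R_i = 0 receive weight 0, so their missing D_i never enters."""
    w = np.where(R == 1, 1.0 / pi_hat, 0.0)
    d = np.where(R == 1, D, 0.0)             # filler; multiplied by weight 0 anyway
    gamma = np.zeros(Zy.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Zy @ gamma)))
        H = Zy.T @ ((w * p * (1.0 - p))[:, None] * Zy)   # weighted information
        gamma += np.linalg.solve(H, Zy.T @ (w * (d - p)))
    return gamma
```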

3.5 Bandwidth selection

It is well known that the performance of a kernel estimator is highly dependent on the chosen bandwidth. For their estimator, Tang, He, and Gunzler (Citation2012) suggest choosing the bandwidth that minimizes the mean integrated squared error derived from the asymptotic distribution, and obviously this would be a possibility for the estimators considered in this paper as well. More computationally demanding methods such as cross validation or Ruppert’s (Citation1997) EBBS-method could also be considered. These methods make the choice of bandwidth data-driven, and it should be considered to what degree this influences the asymptotic results given above. Doing this is however outside the scope of this paper. Note however that the bandwidths used in the simulations in the next section are random, as they depend on the sample standard deviation. The effect of this seems to be minor compared to the effects of the kernel estimator itself and the effect of estimating the parameters of the regression of Di on (Y i, Xi).

4 Simulations

I will now present some simulation results aimed at comparing the four estimators—the single imputation estimator, the multiple imputation estimator, the fixed imputation estimator, and the “Rao-Blackwellized” estimator—discussed in the previous section to each other as well as to the complete data estimator and the weighted estimator considered by Tang, He, and Gunzler (Citation2012). Note that I present results for the complete data estimator, not for the complete case estimator, as I am interested in the comparison of different (asymptotically) unbiased estimators, not in how they improve on a biased estimator, such as the complete case estimator. The multiple imputation estimator is based on 5 sets of imputations. I have on purpose chosen a small number of imputations to make differences between the multiple imputation estimator and the single and fixed imputation estimators stand out. The simulation results are based on 2000 replications.

In the simulations, the distribution of $Y$ given $D = d$ is a normal distribution with mean equal to $1 - d$ and standard deviation equal to 1. Hence, the density I am trying to estimate is a standard normal density. I simulate $D$ such that $P\{D = d\} = 1/2$ for $d = 0, 1$. Thus, the two groups are of approximately equal size, and the distribution of $D$ given $Y = y$ is given by a logistic regression. The sample size equals 400.

As “response mechanism” I use $\pi(y) = 1/(1 + \exp(-2 - 2y))$. Hence the probability of $D$ being unobserved decreases with $y$, and the distribution of $Y$ given that $D$ is observed to be equal to 1 is a right-skewed distribution. With this missing data mechanism, approximately 22.5% of the observations with $D = 1$ have missing data; overall, approximately 15% of the observations have missing group indicators.

I use a standard normal kernel function and Silverman’s rule of thumb based on the complete data in the $D = 1$ group to choose the bandwidth. This will probably give an advantage to the complete data estimator. I estimate the density in 21 equally spaced points between $-2$ and 2. Over this grid, $\pi(y)$ increases from 0.12 to almost 1. The results of the simulations are presented in Figures 1 and 2. In Table 1, I present the results for the estimates of the density at $y = -1, 0, 1$.
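For concreteness, one replication of this design can be generated as follows (a sketch consistent with the description above, with the bandwidth choice omitted):

```python
def simulate_one(n=400, rng=None):
    """One replication: P(D = 1) = 1/2, Y | D = d ~ N(1 - d, 1), and D observed
    with probability pi(y) = 1/(1 + exp(-2 - 2y)), not depending on D (MAR)."""
    rng = rng or np.random.default_rng()
    D = rng.integers(0, 2, size=n)
    Y = rng.normal(loc=1.0 - D, scale=1.0)
    pi = 1.0 / (1.0 + np.exp(-2.0 - 2.0 * Y))
    R = (rng.uniform(size=n) <= pi).astype(int)
    return Y, D, R
```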

Fig. 1 Simulation results. The points are the averages of 2000 simulated estimators, the vertical lines represent ±1 times the standard deviations of the estimators. The grey curve is the density we are trying to estimate.

Fig. 2 Simulation results. The first panel shows the absolute bias of the estimators, the second the standard deviations of the estimators. The final panel shows the bias of the estimators of the standard deviations of the estimators.

Table 1 Simulation results for $y = -1, 0, 1$.

Figure 1 shows that all estimators behave as expected: they have similar biases, and the largest bias is found at $y = 0$, where the underlying normal density has the largest curvature. The weighted estimator has a considerably larger standard deviation for small values of $y$, where $\pi(y)$ is small and more data is missing. The Rao-Blackwellized estimator has noticeably smaller standard deviations than the other estimators.

Figure 2 provides a closer look at the bias and the standard deviations of the six estimators. The weighted estimator has smaller bias when $y$ is small than the other estimators, but otherwise the estimators seem to have the same bias, as the asymptotic results suggest. Indeed, the bias of the weighted estimator is too small compared to what the asymptotic results predict when $\pi(y)$ is small. When it comes to the standard deviation, the imputation estimators clearly outperform the weighted estimator, in particular when $\pi(y)$ is small. For small $y$, the four imputation estimators have essentially the same standard deviation as the complete data estimator, but as $y$ increases the standard deviation of the Rao-Blackwellized estimator becomes considerably smaller than the standard deviations of the others. This is due to the fact that for small $y$ almost all the observations belong to the $D = 1$ group, and consequently $\hat p(y) \approx 1$ when $y$ is small. As $y$ increases, the Rao-Blackwellized estimator begins to take advantage of the observations in the second group, and this lowers the standard deviation of the estimator. There is very little difference between the four other estimators, but where there is a discernible difference the ordering of the imputation estimators is as expected: the fixed imputation estimator has a smaller standard deviation than the multiple imputation estimator, which is slightly better than the single imputation estimator. Contrary to what the asymptotic results suggest, the standard deviations of the single imputation estimator differ from the standard deviations of the complete data estimator. Hence, it seems that in finite samples there may well be an effect of estimating the probability of $D$ given $Y = y$; I will investigate this in Section 6. It seems that the complete data estimator has a larger standard deviation than the single imputation estimator for $y$ close to 0. It is difficult to see from the graph, but when $y$ is further away from 0, the complete data estimator has smaller standard deviations than the single imputation estimator.

The right-most panel of Figure 2 shows the bias of the estimators of the standard deviations of the six different estimators. As the bias is quite large compared to the true standard deviation, the variance estimators derived above may be quite misleading in finite samples. I will investigate this aspect further in Section 6.

5 Application to rejection inference data

As an illustration of the difference between the weighted estimator and the imputation estimator as well as the effect of incorporating the missing data on the results, I look at an example from rejection inference.

I use data from the online peer-to-peer lending market Lending Club previously used by Li et al. (Citation2020). The data set contains information on 233318 loan applications, of which 27157 were approved. This means that for 88.4% of the observations the information regarding whether the loan would have been repaid, had it been granted, is missing. The data set contains information on debt-to-income ratios, a Fair Isaac Corporation score (“FICO”), employment length, and a categorized bad risk rate based on geography for each loan application. I will focus on the distribution of the FICO score for those who would repay their loan. The FICO score is one of the variables used to decide whether a loan is granted, so its distribution among those who would repay, had the application been granted, will differ from its distribution among those who have been granted a loan and repaid it.

A little more than half the applications have a FICO score below 660 even though a FICO score above 660 is a requirement for a loan. Without any accepted applications with FICO scores below 660 the weighted kernel density estimator is not able to estimate the density of the FICO score below 660 in any reasonable way. The imputation estimator can, but the estimate will be based on extrapolations from the imputation model, so I remove loan applications with FICO scores below 660 and effectively estimate the density of the FICO score conditional on this being at least 660 in the group of applicants who would repay their loan. I end up with 114679 loan applications, of which 76.4% were turned down.

As the density of the truncated FICO scores is supported on the interval $[660, \infty)$, I estimate the density of the logarithm of the scores minus 660 to mitigate the boundary bias, and then transform my estimate to get an estimate of the density that I am interested in.

For the imputations I use a generalized linear model with a complementary log-log link using all the explanatory variables. The debt-to-income ratio is log-transformed, and both this variable and the FICO score are included as second order polynomials. As a model for the missing data mechanism I again use a generalized linear model with a complementary log-log link and all the explanatory variables. In this model the FICO score, the log-transformed debt-to-income ratio, and the loan amount are included as second order polynomials, whereas I use a linear spline with a knot at 2 for the employment length. Both models appear to fit the observed data.

An important assumption for the weighted estimator is that the probability of response is bounded away from 0. A generalized linear model with a standard link function will not have this property unless the covariates are bounded, but in many real data examples the estimated probability of response will be sufficiently far away from 0 for this not to cause any problems. In this example, however, the missing data model gives estimated probabilities of response very close to zero, which makes the weighted density estimator highly unstable. It spikes at the lowest FICO scores, and it is essentially zero everywhere else. As an ad hoc fix, I replace all estimated probabilities of response smaller than a cutoff by the cutoff value. The results of this are shown in Figure 3 next to the imputation estimator.

Fig. 3 Estimated FICO score density. Grey curves are complete case estimates, dashed curves are asymptotic pointwise confidence bands.

Compared to the complete case estimator (the grey curve), the imputation estimator puts more probability mass on smaller FICO scores. This is what we should expect: low FICO scores lead to rejection, so the unapproved applications have smaller FICO scores. I only show the single imputation estimator here, as all the imputation estimators are visually indistinguishable. The three following graphs show weighted kernel estimators with small probabilities of response replaced by a cutoff value. A cutoff value of 0.001 seems too small, and a cutoff value of 0.1 is probably too large: here the weighted density estimator is too close to the presumably biased complete case estimator. With a cutoff of 0.01, I get a density with more of the mass at very small and large FICO scores compared to the complete case estimator.

Obviously, the results depend on the chosen bandwidth. I have used the bandwidth calculated using Silverman’s rule of thumb based on the complete cases for all estimators. A larger bandwidth might improve the weighted estimator with a cutoff value of 0.01. But this example shows how unstable the weighted estimator is when the probability of response varies as much as it does here. The weighted estimator also has a much larger standard error. Again, this is particularly bad in this example due to the low probability of response. The imputation estimator works well even though most of the data is missing. Compared to the complete case estimator, the imputation estimator moves some of the probability mass to lower values of the FICO score. It seems that low FICO scores could perhaps play a smaller role in the decision of whether to grant a loan or not.

6 Higher order expressions for the asymptotic variance

Due to the faster rate of convergence, the estimation uncertainty in the estimate of β may be ignored asymptotically in the density estimator. However, as the simulations indicate, it is not ignorable in small samples where 1/n may not be negligible compared to 1/(nh). In this section I will derive an expression for the asymptotic variance which includes terms of order up to 1/n.

Including terms of order $1/n$, the variance of the complete data estimator (5) can be written as
$$\operatorname{Var}[\hat f_1(y)] = \frac{f_1(y)}{pnh}K_2 - \frac{f_1'(y)}{pn}\int uK(u)^2\,du - \frac{f_1(y)^2}{pn} + o(1/n). \qquad (10)$$

With a symmetric kernel $K$, the second term of this expression is 0, and the asymptotic variance of the estimator may be written as
$$\frac{f_1(y)}{pnh}K_2\cdot\left(1 - \frac{hf_1(y)}{K_2}\right). \qquad (11)$$

If h is larger than K2/f1(y), this expression is negative, and in this case the o(1/n)-term in (10) is not negligible. The expression for the asymptotic variance given in (11) is clearly smaller than the usual one. For instance, with f1 equal to the standard normal density, a standard normal kernel, and h = 0.367 (corresponding to using Silverman’s rule of thumb with np = 200), (11) is less than half the usual expression when y = 0. An unbiased estimator of the variance of a kernel density estimator (without missing data and with a nonrandom bandwidth) is easy to obtain: The kernel estimator is an average of iid terms, so the empirical variance based on these terms divided by n is an unbiased estimate.
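In code, for a fixed bandwidth and complete data, this unbiased variance estimate is immediate (a sketch; the subgroup versions would restrict to the $D = 1$ terms):

```python
def kde_point_unbiased_variance(y0, Y, h):
    """Kernel density estimate at y0 plus an unbiased variance estimate:
    the estimator is the mean of the iid terms T_i = K((y0 - Y_i)/h)/h, so the
    empirical variance of the T_i divided by n is unbiased for its variance."""
    T = np.exp(-0.5 * ((y0 - Y) / h)**2) / (np.sqrt(2.0 * np.pi) * h)
    return T.mean(), T.var(ddof=1) / T.size
```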

For our imputation estimators, I obtain the following results:

Theorem 5.

Under the same assumptions as in Section 3 and the additional assumptions listed in A.3, the asymptotic variances may be written as follows:
$$\operatorname{Var}[\hat f_{1,SI}(y)] = \frac{f_1(y)}{pnh}K_2 - \frac{f_1(y)^2}{pn} + \frac{1}{p^2n}A_1(y)'I(\beta)^{-1}A_1(y) + \frac{2}{p^2n}A_1(y)'I(\beta)^{-1}A_2(y) + o(1/n)$$
$$\operatorname{Var}[\hat f_{1,MI}(y)] = \frac{1}{p^2nh}\Big(v_1(y) + v_2(y)/B\Big)f(y)K_2 - \frac{2}{p^2n}\Big(v_1(y) + v_2(y)/B\Big)f(y)f_1(y) + \frac{1}{p^2n}E[v_1(Y_i) + v_2(Y_i)/B]f_1(y)^2$$
$$\qquad\qquad + \left(1 + \frac{1}{B}\right)\frac{1}{p^2n}A_1(y)'I(\beta)^{-1}A_1(y) + \frac{1}{B}\,\frac{2}{p^2n}A_1(y)'I(\beta)^{-1}A_2(y) + o(1/n)$$
$$\operatorname{Var}[\hat f_{1,FI}(y)] = \frac{1}{p^2nh}v_1(y)f(y)K_2 - \frac{2}{p^2n}v_1(y)f(y)f_1(y) + \frac{1}{p^2n}E[v_1(Y_i)]f_1(y)^2 + \frac{1}{p^2n}A_1(y)'I(\beta)^{-1}A_1(y) + o(1/n)$$
$$\operatorname{Var}[\hat f_{1,RB}(y)] = \frac{1}{p^2nh}E[p_1^2 \mid Y_1 = y]f(y)K_2 - \frac{2}{p^2n}E[p_1^2 \mid Y_1 = y]f(y)f_1(y) + \frac{1}{p^2n}E[p_1^2]f_1(y)^2 + \frac{1}{p^2n}A(y)'I(\beta)^{-1}A(y) + o(1/n)$$
where
$$A_1(y) = E[(1 - \pi_1)p_1'Z_1 \mid Y_1 = y]f(y) - E[(1 - \pi_1)p_1'Z_1]f_1(y)$$
$$A_2(y) = E[\pi_1 p_1'Z_1 \mid Y_1 = y]f(y) - E[\pi_1 p_1'Z_1]f_1(y)$$
$$A(y) = E[p_1'Z_1 \mid Y_1 = y]f(y) - E[p_1'Z_1]f_1(y)$$
$$v_1(y) = E[\pi_1 p_1 + (1 - \pi_1)p_1^2 \mid Y_1 = y], \qquad v_2(y) = E[(1 - \pi_1)p_1(1 - p_1) \mid Y_1 = y]$$
as $n \to \infty$ and $h \to 0$ such that $nh \to \infty$.

An expression for the variance of the weighted estimator similar to those in the theorem above may be derived from Tang, He, and Gunzler (Citation2012)’s Theorem 3. Figure 4 provides a comparison of these formulae and the simulation results from Section 4. It is clear that the higher order expressions are better able to capture the true variability of the estimators.

Fig. 4 Standard deviations of density estimators. The open circles are the standard deviations from the simulations in Section 4, the filled circles are asymptotic standard deviations obtained from Theorem 5.

As with the formula (10), the expressions for the variances in Theorem 5 may give negative values for $h$ sufficiently large. In the (somewhat uninteresting) case where $\pi$ is a constant, it is easy to see that the sum of the last two terms of the variance of the single imputation estimator is positive (unless $h$ is large). Thus, if the missing data mechanism is missing completely at random, the (infeasible) complete data estimator is more efficient than the single imputation estimator. In other words, there is a price to be paid for estimating $\beta$. However, the expressions for the variances given in the theorem above confirm the findings from the simulations in Section 4: in the set-up from the simulations, where $\pi$ depends on $y$, the variance of the single imputation estimator (ignoring $o(1/n)$-terms) is smaller than the variance of the complete data estimator for $y$ close to 0.

The expressions for the asymptotic variances in Theorem 5 are difficult to compare to each other algebraically. It is, however, clear from the proof that the variance of the “fixed imputation” estimator is smaller than the variances of the single and the multiple imputation estimators. The single and the multiple imputation estimators are not easy to compare. Though more imputations (larger $B$) clearly decrease the asymptotic variance, the variance of the multiple imputation estimator is actually larger than the variance of the single imputation estimator when $B = 1$. This is due to the extra randomness inherent in simulating from a Bayesian predictive distribution. As $B$ increases, the asymptotic variance of the multiple imputation estimator approaches the variance of the fixed imputation estimator and in particular becomes smaller than the variance of the single imputation estimator. Comparing the variance of the “Rao-Blackwellized” estimator to the variance of the fixed imputation estimator, note that $E[p_1^2 \mid Y_1 = y] \le v_1(y)$. This implies that
$$\frac{1}{p^2nh}E[p_1^2 \mid Y_1 = y]f(y)K_2 - \frac{2}{p^2n}E[p_1^2 \mid Y_1 = y]f(y)f_1(y) + \frac{1}{p^2n}E[p_1^2]f_1(y)^2$$
$$\le \frac{1}{p^2nh}v_1(y)f(y)K_2 - \frac{2}{p^2n}v_1(y)f(y)f_1(y) + \frac{1}{p^2n}E[v_1(Y_1)]f_1(y)^2$$
for $h$ sufficiently small ($h \le K_2/(2f_1(y))$). However, $\frac{1}{p^2n}A(y)'I(\beta)^{-1}A(y) \ge \frac{1}{p^2n}A_1(y)'I(\beta)^{-1}A_1(y)$ in the case where $\pi_i$ is constant, and this will presumably be the case also when $\pi_i$ is not constant, at least for some values of $y$. Hence, comparing these two estimators algebraically does not seem possible. Eventually, of course, as $n$ increases and $h$ decreases, the variance of $\hat f_{1,RB}(y)$ will be the smaller of the two, but I am not able to rule out that the fixed imputation estimator may have a smaller variance in small samples.

Even though it is possible to estimate the different terms in these expressions consistently using kernel regressions, so that these formulae may be used to get improved estimation of the variance of the imputation estimators, it seems too complicated for routine use. Moreover, even if such a plug-in estimator would be consistent, the finite sample bias in the kernel estimators will presumably bias the variance estimators to some degree. Bootstrapping may be a useful alternative way of obtaining accurate standard errors.

7 Conclusion

In this paper, I have considered four different imputation estimators for estimating the density in a subgroup when group membership is not known for every observation. The single random imputation estimator is attractive because it is easily implemented and has the same asymptotic distribution as the complete data estimator. The “Rao-Blackwellized” estimator, where every subgroup indicator is replaced by its conditional mean, is more efficient, but it may require a little more work to implement, and estimating its variance will be somewhat harder, especially if the estimated probability of belonging to the subgroup of interest depends on Xi as well as Yi. The other two estimators—the multiple imputation estimator and the fixed imputation estimator—are more complicated to work with than the single imputation estimator and less efficient than the “Rao-Blackwellized” estimator. Thus, for practical applications, either the single imputation or the Rao-Blackwellized estimator should be used. As shown, the asymptotic variance ignoring terms of smaller order than 1/(nh) may be quite misleading in finite samples, and I have provided more accurate formulae for the asymptotic variance. These formulae will require more work to implement in practice, where a simple bootstrap may be an easier option.

All the imputation estimators are more efficient than the weighted estimator considered by Tang, He, and Gunzler (Citation2012). This is due to the weighting with the inverse of potentially very small probabilities, which inflates the variance. Robins, Rotnitzky, and Zhao (Citation1995) show how the efficiency of inverse probability weighted estimators for semi-parametric regression models may be improved, and this is probably possible also for the kernel density estimator considered here. It seems unlikely, though, that one would be able to obtain better results using weighting than by using imputation in the situation considered in this paper. In general, a reason for preferring weighting to imputation is that it is easier to fit a binary regression model for the response mechanism than it is to fit an imputation model, but here the imputation model is also a binary regression, and as imputation leads to more efficient estimators there is little reason to prefer weighting.

The simplicity of the results is due to the fact that the faster rate of convergence of the parameters from the imputation model allows us to treat the imputation model as known. Even though a parametric binary regression model is a flexible tool, it would be of interest to see how the asymptotic results would change if we used a non-parametric or a semi-parametric binary regression model for the imputation model. I hope to address this in a future paper.

Disclosure statement

The author reports there are no competing interests to declare.

References

  • Li Z, Hu X, Li K, Zhou F, Shen F. 2020. Inferring the outcomes of rejected loans: an application of semisupervised clustering. J R Stat Soc Ser A Stat Soc. 183:631–654.
  • Müller UU. 2009. Estimating linear functionals in nonlinear regression with responses missing at random. Ann Stat. 37:2245–2277.
  • Robins JM, Rotnitzky A, Zhao LP. 1995. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc. 90:106–121.
  • Rubin D. 1987. Multiple imputation for nonresponse in surveys. New York: Wiley.
  • Ruppert D. 1997. Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. J Am Stat Assoc. 92:1049–1062.
  • Tang W, He H, Gunzler D. 2012. Kernel smoothing density estimation when group membership is subject to missing. J Stat Plan Inference. 142:685–694.

Appendix A:

Proofs

A.1 Assumptions for the main results

Assumption A

The random variables $(Y_i, D_i, R_i, X_i)$, $i = 1, \ldots, n$, are iid, and the $D_i$’s are missing at random.

The conditional distribution of Di given Xi = x and Y i = y is from a generalized linear model with an inverse link function that has a bounded derivative.

Assumption B

The kernel function satisfies (4).

Assumptions C1–C6

C1: f(y) is twice continuously differentiable with a second derivative that is locally Lipschitz.

C2: f1(y) is twice continuously differentiable with a second derivative that is locally Lipschitz.

C3: p(y) is twice continuously differentiable with a second derivative that is locally Lipschitz.

C4: $E[p_i^2 \mid Y_i = y]$ is twice continuously differentiable with a second derivative that is locally Lipschitz.

C5: $E[\pi_i p_i \mid Y_i = y]$ is twice continuously differentiable with a second derivative that is locally Lipschitz.

C6: $E[\pi_i p_i^2 \mid Y_i = y]$ is twice continuously differentiable with a second derivative that is locally Lipschitz.

A.2 Asymptotic distribution

I first consider the centered single imputation estimator
$$\hat f_{1,SI}(y) - \left(f_1(y) + \frac{h^2}{2}f_1''(y)\right) = \frac{\frac{1}{n}\sum_{i=1}^n \tilde D_i\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)}{\frac{1}{n}\sum_{i=1}^n \tilde D_i}.$$

The denominator converges in probability to $p$. To see this, first note that
$$\frac{1}{n}\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)\mathbf{1}\{U_i \le p_i\}\Big) \to p \quad \text{in probability}.$$

Next, note that
$$E\left|\frac{1}{n}\sum_{i=1}^n \tilde D_i - \frac{1}{n}\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)\mathbf{1}\{U_i \le p_i\}\Big)\right| \le E\left[\frac{1}{n}\sum_{i=1}^n \Big|\mathbf{1}\{U_i \le \hat p_i\} - \mathbf{1}\{U_i \le p_i\}\Big|\right] \le E\left[\frac{2}{n}\sum_{i=1}^n |\hat p_i - p_i|\right].$$

This mean converges to 0, as the average is bounded and
$$\frac{1}{n}\sum_{i=1}^n |\hat p_i - p_i| \le \text{const} \cdot \frac{1}{n}\sum_{i=1}^n |Z_i| \cdot |\hat\beta - \beta|,$$
which converges to 0 in probability. Similar arguments hold for the denominators of the other estimators. Thus the denominators may be treated as known.

To derive the asymptotic distribution, I decompose the numerator as follows:
$$\frac{1}{n}\sum_{i=1}^n \tilde D_i\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right) = \frac{1}{n}\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)\hat p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right) \qquad (12)$$
$$\qquad + \frac{1}{n}\sum_{i=1}^n (1 - R_i)\Big(\mathbf{1}\{U_i \le \hat p_i\} - \hat p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right). \qquad (13)$$

The second term (13) has mean 0 and variance
$$\frac{1}{n^2}\sum_{i=1}^n (1 - R_i)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)^2 \hat p_i(1 - \hat p_i)$$
conditional on the observed data. Moreover, as $|\hat p_i(1 - \hat p_i) - p_i(1 - p_i)| \le \text{const}\,|Z_i| \cdot |\hat\beta - \beta| = |Z_i|\,O_P(1/\sqrt n)$, it follows that unconditionally
$$\frac{h}{n}\sum_{i=1}^n (1 - R_i)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)^2 \hat p_i(1 - \hat p_i) \overset{P}{\to} E[(1 - \pi_i)p_i(1 - p_i) \mid Y_i = y]f(y)K_2.$$

It follows by the central limit theorem for triangular arrays that, given the observed data,
$$\sqrt{nh}\,\frac{1}{n}\sum_{i=1}^n (1 - R_i)\Big(\mathbf{1}\{U_i \le \hat p_i\} - \hat p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right) \overset{D}{\to} N\Big(0,\; E[(1 - \pi_1)p_1(1 - p_1) \mid Y_1 = y]f(y)K_2\Big) \qquad (14)$$
in probability. As the limiting distribution does not depend on the observed data, the convergence is also valid unconditionally, and (13) is asymptotically independent of (12).

Now turn to the term (12), which is decomposed as follows:
$$\sqrt{nh}\,\frac{1}{n}\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)\hat p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)$$
$$= \sqrt{nh}\,\frac{1}{n}\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right) + \sqrt{nh}\,\frac{1}{n}\sum_{i=1}^n (1 - R_i)\Big(\hat p_i - p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right).$$

Here the second term is $o_P(1)$, as it is numerically bounded by a constant times
$$\sqrt h\,\frac{1}{n}\sum_{i=1}^n |Z_i| \cdot \left|K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right| \cdot \Big|\sqrt n(\hat\beta - \beta)\Big| = O_P(\sqrt h).$$

Ignoring the remainder term, it follows that (12) has mean 0 and a variance that is
$$E[\pi_1 p_1 + (1 - \pi_1)p_1^2 \mid Y_1 = y]f(y)K_2 + o(1).$$

Thus, I obtain
$$\sqrt{nh}\,\frac{1}{n}\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)\hat p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right) \overset{D}{\to} N\Big(0,\; E[\pi_1 p_1 + (1 - \pi_1)p_1^2 \mid Y_1 = y]f(y)K_2\Big).$$

Combining this with (14) I obtain Theorem 1.

The other imputation estimators are handled similarly. The “fixed imputation estimator” in Section 3.3 corresponds to ignoring (13), whereas the estimators in Section 3.4 are obtained by letting $R_i = 0$ in (12) and omitting (13). The multiple imputation estimator is slightly more complicated. The $U_i$’s should be replaced by $U_{i,j}$’s, $\hat p_i$ should be replaced by $\hat p_{i,j}$ in (13) (but not in (12)), and a new term
$$\frac{1}{n}\sum_{i=1}^n (1 - R_i)\Big(\hat p_{i,j} - \hat p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right) \qquad (15)$$
should be included between (12) and (13). The term (13) may be handled as in the single imputation case by conditioning on $\tilde\beta_j$ as well as the observed data; the term (12) is unchanged. The new term (15) is handled conditionally on the observed data:
$$\frac{1}{n}\sum_{i=1}^n (1 - R_i)\Big(\hat p_{i,j} - \hat p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right) = \frac{1}{n}\sum_{i=1}^n (1 - R_i)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)\tilde p_{i,j}'Z_i'\Big(\tilde\beta_j - \hat\beta\Big),$$
where $\tilde p_{i,j}'$ is the derivative of the inverse link function at a point between $Z_i'\tilde\beta_j$ and $Z_i'\hat\beta$. As $\tilde\beta_j - \hat\beta = O_P(1/\sqrt n)$, the term (15) is $o_P(1/\sqrt{nh})$ and may be ignored asymptotically. Thus, the multiple imputation estimator is asymptotically equivalent to the sum of (12) and an average of $B$ independent copies of (13), divided by $p$.

A.3 Variances

For Theorem 5, I need a few additional assumptions: I will assume that the derivative of the inverse link function in the model for $p(x, y)$ is Lipschitz continuous and that $E[p_i'Z_i \mid Y_i = y]$ and $E[\pi_i p_i'Z_i \mid Y_i = y]$ are twice continuously differentiable (as functions of $y$) with a locally Lipschitz continuous second derivative.

Focusing first on the single imputation estimator, the variance of (13) may be written as
$$\frac{1}{nh}v_2(y)f(y)K_2 - \frac{2}{n}v_2(y)f(y)f_1(y) + \frac{1}{n}E[v_2(Y_1)]f_1(y)^2 + o(1/n),$$
with $v_2(y)$ given in Theorem 5.

A little more work is required to obtain a higher order expression for the variance of (12), as the effect of $\hat p_i - p_i$, which is $O_P(1/\sqrt n)$, has to be incorporated. Note, however, that (13) and (12) are uncorrelated. Taylor expanding $p_i$ as a function of the linear predictor, we get
$$\hat p_i = p_i + p_i'Z_i'(\hat\beta - \beta) + (\bar p_i' - p_i')Z_i'(\hat\beta - \beta),$$
where $\bar p_i'$ is the derivative of the inverse link function at a point between $Z_i'\hat\beta$ and $Z_i'\beta$. Moreover, as
$$|\bar p_i' - p_i'| \le \text{const}\,|Z_i'\hat\beta - Z_i'\beta| \le \text{const}\,|Z_i| \cdot |\hat\beta - \beta|,$$

I get
$$\hat p_i - p_i = p_i'Z_i'(\hat\beta - \beta) + o_P(1/\sqrt n) = p_i'Z_i'\,\frac{1}{n}\sum_{j=1}^n S_j + o_P(1/\sqrt n).$$

Hence, I obtain
$$\sqrt{nh}\,\frac{1}{n}\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)\hat p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right) = \sqrt{nh}\,\frac{1}{n}\sum_{i=1}^n \Big(R_iD_i + (1 - R_i)p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right) + \sqrt h\,A_1(y)'\frac{1}{\sqrt n}\sum_{i=1}^n S_i + o_P(1),$$
with $A_1(y)$ given in Theorem 5, as
$$\frac{1}{n}\sum_{i=1}^n (1 - R_i)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)\hat p_i'Z_i' \overset{P}{\to} A_1(y)'.$$

It now follows that the asymptotic variance of (12) may be written as
$$\frac{1}{nh}v_1(y)f(y)K_2 - \frac{2}{n}v_1(y)f(y)f_1(y) + \frac{1}{n}E[v_1(Y_i)]f_1(y)^2 + \frac{1}{n}A_1(y)'I(\beta)^{-1}A_1(y) + \frac{2}{n}A_1(y)'I(\beta)^{-1}A_2(y) + o(1/n),$$
with $v_1(y)$ and $A_2(y)$ given in Theorem 5, as
$$E\left[\Big(R_iD_i + (1 - R_i)p_i\Big)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)S_i\right] = E\left[D_iS_i\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)\right] = E\left[E[D_iS_i \mid Y_i]\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)\right] = I(\beta)^{-1}A_2(y)$$

since
$$E[D_iS_i \mid Y_i] = E\left[\frac{R_iD_i(1 - p_i)}{p_i(1 - p_i)}\,p_i'\,I(\beta)^{-1}Z_i \,\middle|\, Y_i\right] = I(\beta)^{-1}E[\pi_i p_i'Z_i \mid Y_i].$$

Noting that v2(y)+v1(y)=p(y) gives the desired result.

As in the proofs of the asymptotic results, the expressions for the asymptotic variances of $\hat f_{1,FI}(y)$ and $\hat f_{1,RB}(y)$ follow easily by ignoring the expression coming from (13) and, in the case of $\hat f_{1,RB}(y)$, by formally putting $R_i = 0$ and noting that the “covariance term” vanishes, as
$$E[p_iS_i \mid Y_i] = E\left[\frac{R_i(D_i - p_i)}{1 - p_i}\,p_i'\,I(\beta)^{-1}Z_i \,\middle|\, Y_i\right] = 0.$$

Turning to the multiple imputation estimator, the terms corresponding to (13) contribute
$$\frac{1}{B}\left(\frac{1}{nh}v_2(y)f(y)K_2 - \frac{2}{n}v_2(y)f(y)f_1(y) + \frac{1}{n}E[v_2(Y_i)]f_1(y)^2\right) + o(1/n)$$
to the variance of the average, whereas the contribution from the term (12) is unchanged. The term (15) may be written as
$$\frac{1}{n}\sum_{i=1}^n (1 - R_i)\left(K\left(\frac{y - Y_i}{h}\right)/h - f_1(y) - \frac{h^2}{2}f_1''(y)\right)\tilde p_{i,j}'Z_i'\Big(\tilde\beta_j - \hat\beta\Big) = A_1(y)'\Big(\tilde\beta_j - \hat\beta\Big) + o_P(1/\sqrt n).$$

As $E[\hat p_{i,j} - \hat p_i \mid \text{data}] = o_P(1/\sqrt n)$, the correlation between this term and (12) is negligible. Thus the contribution of this term to the variance of the average over the $B$ imputations is $\frac{1}{Bn}A_1(y)'I(\beta)^{-1}A_1(y)$.