Research Article

Variable selection and subgroup analysis for high-dimensional censored data

Received 24 Jul 2023, Accepted 29 Feb 2024, Published online: 13 Mar 2024

Abstract

This paper proposes a penalized method for high-dimensional variable selection and subgroup identification in the Tobit model. Based on Olsen's [(1978). Note on the uniqueness of the maximum likelihood estimator for the Tobit model. Econometrica: Journal of the Econometric Society, 46(5), 1211–1215. https://doi.org/10.2307/1911445] convex reparameterization of the Tobit negative log-likelihood, we develop an efficient algorithm for minimizing the objective function by combining the alternating direction method of multipliers (ADMM) and generalized coordinate descent (GCD). We also establish the oracle properties of our proposed estimator under mild regularity conditions. Furthermore, extensive simulations and an empirical data study demonstrate the performance of the proposed approach.

1. Introduction

Subgroup analysis has broad applicability in precision medicine, economics and sociology, where there is an increasing need to distinguish homogeneous subgroups of individuals, detect the subgroup structure and model the relationship between the response variable and predictors within each subgroup. A vast array of statistical methods for subgroup analysis has therefore been developed, including mixture models (Everitt, 2013) and regularization methods (Ma & Huang, 2017). Mixture model methods assume that the data come from a mixture of subgroups and require the specification of an underlying distribution. Shen and He (2015) proposed a structured logistic-normal mixture model to identify subgroups. However, such methods often require the number of subgroups to be specified in advance, which can be difficult to do in practice. In contrast, Ma and Huang (2017) developed a pairwise fusion approach using concave penalty functions, such as the smoothly clipped absolute deviation (SCAD, J. Fan & Li, 2001) penalty and the minimax concave penalty (MCP, Zhang, 2010), that automatically identifies subgroup structures and estimates subgroup-specific effects. Ma et al. (2019) considered a heterogeneous treatment effects model, and Wang et al. (2019) proposed a general framework of spatial subgroup analysis for spatial data with repeated measures.

In numerous regression problems, the dependent variable can only be observed within a restricted range. For instance, when studying the determinants of different household expenditures, some families may spend nothing on items such as medical insurance. Similarly, when studying individual or collective income, negative income generated by debt cannot enter the income calculation. These scenarios involve left-censored data, which are common in economics, and they motivate models specially tailored to such situations. Tobin (1958) developed the Tobit model to study the relationship between annual expenditure on durable goods and household income. Owing to the many left-censoring scenarios in economics and the social sciences, the Tobit model has remained popular, and it has been thoroughly studied and extended to deal with other types of censored data (Amemiya, 1984). Regarding subgroup analysis with censored data, Dagne (2016) proposed a method that simultaneously addresses left-censoring and unobserved heterogeneity in longitudinal data, and Yan et al. (2021) developed a censored linear regression model with heterogeneous treatment effects.

The advent of advanced data collection techniques has increased the prevalence of high-dimensional data in the aforementioned fields. For high-dimensional problems, J. Fan and Lv (2010) provided a comprehensive overview of variable selection approaches, including the methods of J. Fan and Li (2001). In the context of high-dimensional censored models, Müller and van de Geer (2016) and Zhou and Liu (2016) introduced the lasso and adaptive lasso penalties, respectively, into the least absolute deviation (LAD) estimator (Powell, 1984) for variable selection. Johnson (2009) and Soret et al. (2018) proposed lasso penalties for Buckley-James estimators for right- and left-censored data, respectively. Moreover, Bradic et al. (2011) provided a non-concave penalized approach for the Cox proportional hazards model with non-polynomial dimensionality. Alhamzawi (2016) and Alhamzawi (2020) developed Bayesian methods for penalized censored regression. Recently, Jacobson and Zou (2023) extended the Tobit model to high-dimensional regression. However, none of these methods address subgroup analysis in high-dimensional censored data.

This paper focuses on subgroup identification and variable selection for a high-dimensional Tobit model. To the best of our knowledge, subgroup analysis for high-dimensional Tobit models has not been discussed in the existing literature. We adopt a penalized approach that identifies the subgroup structure and selects covariates simultaneously. The subgroup structure is determined by penalizing pairwise differences between subject-specific effects, while significant covariates are chosen through a penalty on the coefficients. To ensure the sparsity and unbiasedness of the proposed estimators, we consider two commonly used concave penalties, SCAD (J. Fan & Li, 2001) and MCP (Zhang, 2010). Due to the non-convexity of the negative log-likelihood of the Tobit model (Tobin, 1958), optimization is challenging in high-dimensional settings. To address this issue, we employ a convex reparameterization of the negative log-likelihood, building on the idea of Olsen (1978). This reparameterization enables us to solve the problem using convex optimization techniques. Our computational algorithm combines the alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011) and the generalized coordinate descent (GCD) algorithm (Jacobson & Zou, 2023; Yang & Zou, 2013) with either of the two concave penalties, SCAD or MCP. Furthermore, we conduct a theoretical analysis of the proposed estimators and establish their oracle properties under mild conditions.

The remainder of this paper is organized as follows. Section 2 introduces the main problem and outlines the proposed method. In Section 3, we propose an algorithm for identifying the subgroup structure and performing variable selection. We state technical assumptions and establish the theoretical properties of the proposed approach in Section 4. Section 5 provides extensive simulation studies illustrating the empirical performance of the proposed method, while Section 6 presents its application to empirical data. A summary and directions for future research are given in Section 7, and all technical proofs are collected in the Appendix.

2. Model and method

2.1. Model setting

Suppose that $x_i=(x_{i1},\ldots,x_{ip})^\top$ is a $p$-dimensional vector of covariates for the $i$th subject, and that the response $y_i\ge c$ is observed only over a restricted range, where $c$ is a known constant. Without loss of generality, we assume that $c=0$ in the following. The Tobit model assumes that the observed response satisfies $y=\max(y^*,c)$, where $y^*$ is a latent variable. In the homogeneous case, the classical linear model takes the form
\[
y_i^*=\mu+x_i^\top\beta+\epsilon_i,\quad i=1,\ldots,n, \tag{1}
\]
where $\mu$ is the unknown intercept, $\beta=(\beta_1,\ldots,\beta_p)^\top$ is the vector of coefficients for the covariates $x_i$, and the $\epsilon_i$ are assumed to be independent and identically distributed as $N(0,\sigma^2)$.

If individuals come from different groups, each with its own intercept $\mu_i$, the homogeneity assumption in model (1) is invalid. To model the subject-specific effects, we consider the subject-specific linear model
\[
y_i^*=\mu_i+x_i^\top\beta+\epsilon_i,\quad i=1,\ldots,n. \tag{2}
\]
We assume $(y_1^*,\ldots,y_n^*)$ arise from $K$ different groups, with $K\ge 1$ unknown, and that subjects from the same group share the same intercept. In other words, $\mu_i=\pi_k$ for all $i\in G_k$, where $\pi_k$ is the common value of the intercept $\mu_i$ in subgroup $G_k$ and $\mathcal{G}=(G_1,\ldots,G_K)$ is a mutually exclusive partition of $\{1,\ldots,n\}$. In practice, the number of subgroups $K$ is unknown and smaller than the sample size $n$.

Define $d_i=I(y_i^*\ge 0)$, where $I(\cdot)$ is the indicator function. Then the observed data $(y_1,\ldots,y_n)$ satisfy the following Tobit model
\[
y_i=d_i\,y_i^*=\begin{cases}\mu_i+x_i^\top\beta+\epsilon_i, & \text{if } y_i^*\ge 0,\\ 0, & \text{if } y_i^*<0,\end{cases}\qquad i=1,\ldots,n, \tag{3}
\]
with subgroup structure
\[
\mu_i=\begin{cases}\pi_1, & \text{if } i\in G_1,\\ \pi_2, & \text{if } i\in G_2,\\ \ \ \vdots\\ \pi_K, & \text{if } i\in G_K.\end{cases} \tag{4}
\]
Let $\Phi(\cdot)$ denote the standard normal cumulative distribution function (CDF). Then
\[
P(y_i^*<0)=P(\epsilon_i<-\mu_i-x_i^\top\beta)=\Phi\Big(-\frac{\mu_i+x_i^\top\beta}{\sigma}\Big),
\]
and the Tobit likelihood is given by
\[
L_n(\mu,\beta,\sigma^2)=\prod_{i=1}^n\Big[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{1}{2\sigma^2}(y_i-\mu_i-x_i^\top\beta)^2\Big\}\Big]^{d_i}\Big[\Phi\Big(-\frac{\mu_i+x_i^\top\beta}{\sigma}\Big)\Big]^{1-d_i},
\]
where $\mu=(\mu_1,\ldots,\mu_n)^\top$. Therefore, after omitting an inconsequential constant, the log-likelihood function of the Tobit model is
\[
\log L_n(\mu,\beta,\sigma^2)=\sum_{i=1}^n\Big(d_i\Big[-\tfrac{1}{2}(y_i-\mu_i-x_i^\top\beta)^2/\sigma^2-\log(\sigma)\Big]+(1-d_i)\log\Phi\Big(-\frac{\mu_i+x_i^\top\beta}{\sigma}\Big)\Big).
\]
It is apparent that $\log L_n(\mu,\beta,\sigma^2)$ is non-concave with respect to the parameters $(\mu,\beta,\sigma^2)$. Adopting the reparameterization suggested in the works of Olsen (1978) and Jacobson and Zou (2023), with $\delta=\beta/\sigma$, $\alpha_i=\mu_i/\sigma$ and $\gamma=1/\sigma$, we obtain a function that is concave with respect to the parameters $(\alpha,\delta,\gamma)$,
\[
\log L_n(\alpha,\delta,\gamma)=\sum_{i=1}^n\Big\{d_i\Big[\log(\gamma)-\tfrac{1}{2}(\gamma y_i-\alpha_i-x_i^\top\delta)^2\Big]+(1-d_i)\log\Phi(-\alpha_i-x_i^\top\delta)\Big\},
\]
where $\alpha=(\alpha_1,\ldots,\alpha_n)^\top$.
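To make the reparameterized likelihood concrete, the following sketch evaluates the negative average log-likelihood (the Tobit loss used in Section 3). The code is ours, not the authors'; the function name and interface are illustrative.

```python
import numpy as np
from scipy.stats import norm

def tobit_loss(alpha, delta, gamma, X, y):
    """Tobit loss ell_n under Olsen's reparameterization
    (delta = beta/sigma, alpha_i = mu_i/sigma, gamma = 1/sigma)."""
    d = (y > 0).astype(float)          # d_i = I(y_i > 0)
    lin = alpha + X @ delta            # alpha_i + x_i' delta
    # uncensored part: log(gamma) - (gamma*y_i - alpha_i - x_i'delta)^2 / 2
    uncensored = np.log(gamma) - 0.5 * (gamma * y - lin) ** 2
    # censored part: log Phi(-alpha_i - x_i'delta)
    censored = norm.logcdf(-lin)
    return -np.mean(d * uncensored + (1.0 - d) * censored)
```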

2.2. Method

There are usually redundant covariates in high-dimensional scenarios, and regularization is the most commonly used approach for identifying the sparsity of regression coefficient vectors (Bondell & Reich, 2008; J. Fan & Lv, 2010; Y. Fan & Tang, 2013). The subgroup structure (4) can be reformulated as a fusion sparsity structure: $\alpha_i-\alpha_j=(\mu_i-\mu_j)/\sigma=0$ for $i,j\in G_k$, $k=1,\ldots,K$. In order to estimate the parameters $\alpha$, $\delta$ and $\gamma$, and to select the proper covariates through the sparsity assumption on $\delta$, we propose a new method that combines penalized Tobit regression (Jacobson & Zou, 2023) with subgroup analysis by concave pairwise fusion penalization (Ma & Huang, 2017; Ma et al., 2019), which can be expressed as minimizing the following loss function
\[
Q(\alpha,\delta,\gamma;\lambda_1,\lambda_2)=-\frac{1}{n}\log L_n(\alpha,\delta,\gamma)+\sum_{i=1}^p P_{\lambda_1}(|\delta_i|)+\sum_{i<j}P_{\lambda_2}(|\alpha_i-\alpha_j|), \tag{5}
\]
where $P_{\lambda_1}(\cdot)$ and $P_{\lambda_2}(\cdot)$ are penalty functions, and $\lambda_1,\lambda_2\ge 0$ are tuning parameters that control the strengths of the regularization of $|\delta_i|$ and $|\alpha_i-\alpha_j|$, respectively. Note that the sparsity of $\delta$ is achieved through the first penalty term, while homogeneity detection is achieved by the second penalty term. Additionally, when $\lambda_1=0$ the problem reduces to subgroup analysis in censored regression; when $\lambda_2=0$ the problem reduces to penalized Tobit regression.

It is important to note that lasso estimators may fail to achieve consistent model selection unless a stringent 'irrepresentable condition' holds (Zhao & Yu, 2006; Zou, 2006). In particular, the $L_1$ penalty tends to over-shrink the non-zero differences $|\alpha_i-\alpha_j|$, which can result in an inflated number of subgroups. To address this limitation, we consider two common concave penalty functions for identifying the subgroup structure and selecting variables, namely the smoothly clipped absolute deviation (SCAD, J. Fan & Li, 2001) and the minimax concave penalty (MCP, Zhang, 2010). These penalties provide alternative approaches to the challenges associated with variable selection and subgroup detection.

The SCAD penalty is defined as
\[
P_\lambda(t)=\lambda\int_0^t\Big[I(x\le\lambda)+\frac{(a\lambda-x)_+}{(a-1)\lambda}I(x>\lambda)\Big]\,dx,\quad a>2,
\]
and the MCP is
\[
P_\lambda(t)=\int_0^t\Big(\lambda-\frac{x}{a}\Big)_+\,dx,\quad a>1,
\]
where $a$ is a parameter that controls the concavity of the penalty function.
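Both penalties have simple closed forms once the integrals above are evaluated; the sketch below (our code, with illustrative names and default values of $a$ taken from Section 5) implements them elementwise.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD: lam*|t| near zero, a quadratic blend, then constant for |t| > a*lam."""
    t = np.abs(t)
    mid = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, 0.5 * lam ** 2 * (a + 1)))

def mcp_penalty(t, lam, a=3.0):
    """MCP: lam*|t| - t^2/(2a) for |t| <= a*lam, else constant a*lam^2/2."""
    t = np.abs(t)
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), 0.5 * a * lam ** 2)
```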

3. Computational algorithm

In this section, we propose an algorithm utilizing the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) in conjunction with generalized coordinate descent (GCD) (Jacobson & Zou, 2023; Yang & Zou, 2013) to address the minimization problem (5). Since the penalty function is not separable with respect to the $\alpha_i$, it is challenging to minimize the objective function (5) directly. We introduce a new set of parameters $\eta_{ij}=\alpha_i-\alpha_j$, so that the minimization problem can be written as the constrained optimization problem
\[
S(\alpha,\delta,\eta,\gamma;\lambda_1,\lambda_2)=\ell_n(\alpha,\delta,\gamma)+\sum_{i=1}^p P_{\lambda_1}(|\delta_i|)+\sum_{i<j}P_{\lambda_2}(|\eta_{ij}|),\quad\text{subject to}\ \alpha_i-\alpha_j-\eta_{ij}=0,
\]
where $\ell_n(\alpha,\delta,\gamma)=-\frac{1}{n}\log L_n(\alpha,\delta,\gamma)$ is the negative log-likelihood, which we call the Tobit loss for short, and $\eta=\{\eta_{ij},\,i<j\}^\top$. Applying the augmented Lagrangian method, the estimates of the parameters can be obtained by minimizing
\[
L(\alpha,\delta,\eta,\varphi,\gamma;\lambda_1,\lambda_2,\rho)=S(\alpha,\delta,\eta,\gamma;\lambda_1,\lambda_2)+\sum_{i<j}\varphi_{ij}(\alpha_i-\alpha_j-\eta_{ij})+\frac{\rho}{2}\sum_{i<j}(\alpha_i-\alpha_j-\eta_{ij})^2, \tag{6}
\]
where $\varphi=\{\varphi_{ij},\,i<j\}^\top$ are Lagrange multipliers and $\rho$ is the penalty parameter.

To obtain the minimum in (6), we propose the following iterative algorithm based on the ADMM. Let $t$ denote the iteration step. We update the estimates of $(\alpha,\delta,\gamma)$, $\eta$ and $\varphi$ iteratively at the $(t+1)$th iteration step as follows:
\[
(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)})=\arg\min_{\alpha,\delta,\gamma}\,L(\alpha,\delta,\gamma,\eta^{(t)},\varphi^{(t)}), \tag{7}
\]
\[
\eta^{(t+1)}=\arg\min_{\eta}\,L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta,\varphi^{(t)}), \tag{8}
\]
\[
\varphi^{(t+1)}=\varphi^{(t)}+\rho\,(\Delta\alpha^{(t+1)}-\eta^{(t+1)}), \tag{9}
\]
where $\Delta=\{(e_i-e_j)^\top,\,i<j\}$.
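The outer loop therefore has the familiar three-block ADMM form. A minimal skeleton is sketched below; `update_primal` stands for the inner GCD solver built from the updates (11)-(14) and `update_eta` for the thresholding step (16)-(17), both sketched later in this section. The names and the $\rho$/tolerance defaults are ours, not the authors'.

```python
import numpy as np
from itertools import combinations

def admm_gcd(X, y, lam1, lam2, update_primal, update_eta,
             rho=2.0, max_iter=200, eps_b=1e-5):
    """Skeleton of the ADMM iterations (7)-(9)."""
    n, p = X.shape
    pairs = list(combinations(range(n), 2))
    Delta = np.zeros((len(pairs), n))   # maps alpha to pairwise differences
    for r, (i, j) in enumerate(pairs):
        Delta[r, i], Delta[r, j] = 1.0, -1.0
    alpha, delta, gamma = np.zeros(n), np.zeros(p), 1.0
    eta = np.zeros(len(pairs))
    phi = np.zeros(len(pairs))          # Lagrange multipliers
    for _ in range(max_iter):
        # (7): inner GCD loop over (alpha, delta, gamma)
        alpha, delta, gamma = update_primal(X, y, alpha, delta, gamma,
                                            eta, phi, Delta, lam1, rho)
        # (8): elementwise thresholding of zeta = Delta @ alpha + phi / rho
        eta = update_eta(Delta @ alpha + phi / rho, lam2, rho)
        # (9): dual update; stop when the primal residual is small
        r_primal = Delta @ alpha - eta
        phi = phi + rho * r_primal
        if np.sum(r_primal ** 2) < eps_b:
            break
    return alpha, delta, gamma, eta
```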

It is worth noting that, given $(\eta^{(t)},\varphi^{(t)})$, the objective function in the first minimization problem (7) can be written as
\[
L(\alpha,\delta,\gamma,\eta,\varphi)=\frac{1}{n}\sum_{i=1}^n\ell_i(\alpha_i,\delta,\gamma)+\sum_{i=1}^p P_{\lambda_1}(|\delta_i|)+\sum_{i<j}\varphi_{ij}(\alpha_i-\alpha_j-\eta_{ij})+\frac{\rho}{2}\sum_{i<j}(\alpha_i-\alpha_j-\eta_{ij})^2+C, \tag{10}
\]
where $\ell_i(\alpha_i,\delta,\gamma)=-d_i\big[\log(\gamma)-\frac{1}{2}(\gamma y_i-\alpha_i-x_i^\top\delta)^2\big]-(1-d_i)\log\Phi(-x_i^\top\delta-\alpha_i)$, and $C$ is a constant that does not depend on $(\alpha,\delta,\gamma)$.

Due to the complexity of the function (10) with respect to $(\alpha,\delta,\gamma)$, we apply the generalized coordinate descent (GCD) method, which iteratively updates each variable while holding the others fixed.

Let $\tilde\alpha$, $\tilde\delta$ and $\tilde\gamma$ denote the current values of $\alpha$, $\delta$ and $\gamma$, respectively. For simplicity of notation, let $v_{(j)}$ denote the vector $v$ with its $j$th element removed in the subsequent context. To derive the update of $\alpha_i$, we begin by expressing the Tobit loss $\ell_n$ as a function of $\alpha_i$,
\[
\ell_n(\alpha_i\mid\tilde\alpha_{(i)},\tilde\delta,\tilde\gamma)=\frac{1}{n}\Big\{\frac{1}{2}d_i(\tilde\gamma y_i-\alpha_i-x_i^\top\tilde\delta)^2-(1-d_i)\log\Phi(-x_i^\top\tilde\delta-\alpha_i)\Big\}.
\]
We observe that, after dropping negligible constants, the Tobit loss associated with $\alpha_i$ depends solely on the data from subject $i$. In line with Theorem 1 of Jacobson and Zou (2023), the quadratic majorization of $\ell_n(\alpha_i\mid\tilde\alpha_{(i)},\tilde\delta,\tilde\gamma)$ takes the form
\[
Q_\alpha(\alpha_i\mid\tilde\alpha_{(i)},\tilde\delta,\tilde\gamma)=\ell_n(\tilde\alpha_i\mid\tilde\delta,\tilde\gamma)+\dot\ell_n(\tilde\alpha_i\mid\tilde\delta,\tilde\gamma)(\alpha_i-\tilde\alpha_i)+\frac{1}{2}(\alpha_i-\tilde\alpha_i)^2,
\]
where $\dot\ell_n(\tilde\alpha_i\mid\tilde\delta,\tilde\gamma)$ represents the derivative of the loss with respect to $\alpha_i$ evaluated at $\tilde\alpha_i$. Following the MM principle, the update of $\alpha_i$ is obtained by minimizing $Q_\alpha(\alpha_i\mid\tilde\alpha_{(i)},\tilde\delta,\tilde\gamma)+\frac{\rho}{2}\sum_{k<j}\{(e_k-e_j)^\top\alpha-\eta_{kj}+\rho^{-1}\varphi_{kj}\}^2$, where $e_j$ is the $n\times 1$ vector whose $j$th element is 1 and whose remaining elements are 0. Therefore, for fixed $\delta^{(t,k)}$ and $\gamma^{(t,k)}$ at the $k$th inner step, the update of $\alpha$ is
\[
\alpha^{(t,k+1)}=(I+\rho\,\Delta^\top\Delta)^{-1}\big(\rho\,\Delta^\top\eta^{(t)}-\Delta^\top\varphi^{(t)}+a^{(t,k)}\big), \tag{11}
\]
where $I$ is the identity matrix and $a^{(t,k)}=\alpha^{(t,k)}-\dot\ell_n(\alpha^{(t,k)})$ with $\dot\ell_n(\alpha^{(t,k)})=(\dot\ell_n(\alpha_1^{(t,k)}),\ldots,\dot\ell_n(\alpha_n^{(t,k)}))^\top$.
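The $\alpha$-update (11) is a single ridge-type linear solve once the gradient of the Tobit loss is available. A sketch follows; the gradient uses the inverse Mills ratio $g(s)=\phi(s)/\Phi(s)$, and the exact $\rho$/$n$ scaling follows our reconstruction of (11), so treat it as illustrative rather than the authors' exact code.

```python
import numpy as np
from scipy.stats import norm

def update_alpha(alpha, delta, gamma, X, y, eta, phi, Delta, rho):
    """One MM step (11): solve (I + rho*Delta'Delta) alpha_new
    = rho*Delta'eta - Delta'phi + (alpha - grad)."""
    n = len(alpha)
    d = (y > 0).astype(float)
    lin = alpha + X @ delta
    # d_i*(alpha_i + x_i'delta - gamma*y_i) on uncensored rows,
    # (1-d_i)*g(-alpha_i - x_i'delta) on censored rows (g = phi/Phi)
    mills = np.exp(norm.logpdf(-lin) - norm.logcdf(-lin))
    grad = (d * (lin - gamma * y) + (1.0 - d) * mills) / n
    A = np.eye(n) + rho * Delta.T @ Delta
    b = rho * Delta.T @ eta - Delta.T @ phi + (alpha - grad)
    return np.linalg.solve(A, b)
```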

Now we consider coordinate-wise updates of $\delta_j$, $j=1,\ldots,p$. Let $M_j=\frac{1}{n}\sum_{i=1}^n x_{ij}^2$. Similarly to $\alpha_i$, the quadratic majorization of $\ell_n(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)$ with respect to $\delta_j$ has the form
\[
Q_\delta(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)=\ell_n(\tilde\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)+\dot\ell_n(\tilde\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)(\delta_j-\tilde\delta_j)+\frac{M_j}{2}(\delta_j-\tilde\delta_j)^2,
\]
where $\dot\ell_n(\tilde\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)$ is the derivative of $\ell_n(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)$ with respect to $\delta_j$. Here, the Tobit loss with respect to $\delta_j$ can be expressed as
\[
\ell_n(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)=\frac{1}{n}\sum_{i=1}^n\Big[\frac{1}{2}d_i(\tilde\gamma y_i-\tilde\alpha_i-x_{i,(j)}^\top\tilde\delta_{(j)}-x_{ij}\delta_j)^2-(1-d_i)\log\Phi(-x_{i,(j)}^\top\tilde\delta_{(j)}-\tilde\alpha_i-x_{ij}\delta_j)\Big].
\]
We can update $\delta_j$ by minimizing $Q_\delta(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)+P_{\lambda_1}(|\delta_j|)$ through the MM principle. For $j=1,\ldots,p$, let $\upsilon_j^{(t,k)}=\delta_j^{(t,k)}-\frac{1}{M_j}\dot\ell_n(\delta_j^{(t,k)}\mid\alpha^{(t,k+1)},\delta^{(t,k)},\gamma^{(t,k)})$. Hence, for the SCAD penalty with $a_1>\max_j\{M_j^{-1}\}+1$, the update of $\delta_j$ at the $(k+1)$th step is
\[
\delta_j^{(t,k+1)}=\begin{cases}\mathrm{ST}(\upsilon_j^{(t,k)},\lambda_1/M_j), & \text{if } |\upsilon_j^{(t,k)}|\le\lambda_1+\lambda_1/M_j,\\[4pt] \dfrac{\mathrm{ST}(\upsilon_j^{(t,k)},a_1\lambda_1/((a_1-1)M_j))}{1-((a_1-1)M_j)^{-1}}, & \text{if } \lambda_1+\lambda_1/M_j<|\upsilon_j^{(t,k)}|\le a_1\lambda_1,\\[4pt] \upsilon_j^{(t,k)}, & \text{if } |\upsilon_j^{(t,k)}|>a_1\lambda_1,\end{cases} \tag{12}
\]
where $\mathrm{ST}(t,\lambda)=\mathrm{sign}(t)(|t|-\lambda)_+$ is the soft-thresholding rule and $(x)_+=\max\{x,0\}$. When $a_1>\max_j\{M_j^{-1}\}$ for the MCP penalty, the updated value is
\[
\delta_j^{(t,k+1)}=\begin{cases}\dfrac{\mathrm{ST}(\upsilon_j^{(t,k)},\lambda_1/M_j)}{1-(a_1M_j)^{-1}}, & \text{if } |\upsilon_j^{(t,k)}|\le a_1\lambda_1,\\[4pt] \upsilon_j^{(t,k)}, & \text{if } |\upsilon_j^{(t,k)}|>a_1\lambda_1.\end{cases} \tag{13}
\]
Lastly, for given $\alpha^{(t,k+1)}$ and $\delta^{(t,k+1)}$, we minimize $\ell_n(\gamma\mid\alpha^{(t,k+1)},\delta^{(t,k+1)})$ to update $\gamma$:
\[
\gamma^{(t,k+1)}=\frac{\sum_{i=1}^n d_iy_i(\alpha_i^{(t,k+1)}+x_i^\top\delta^{(t,k+1)})+\sqrt{\Big(\sum_{i=1}^n d_iy_i(\alpha_i^{(t,k+1)}+x_i^\top\delta^{(t,k+1)})\Big)^2+4\Big(\sum_{i=1}^n d_iy_i^2\Big)\sum_{i=1}^n d_i}}{2\sum_{i=1}^n d_iy_i^2}. \tag{14}
\]
Once convergence is achieved, we denote the final GCD iterate by $(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)})$.
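The coordinate updates (12)-(14) are simple scalar operations; a sketch follows (our code; `v` is the gradient-adjusted point $\upsilon_j$ and `M` the curvature $M_j$). Note that (14) is the positive root of the quadratic $\gamma^2\sum d_iy_i^2-\gamma\sum d_iy_i(\alpha_i+x_i^\top\delta)-\sum d_i=0$.

```python
import numpy as np

def soft_threshold(t, lam):
    """ST(t, lam) = sign(t) * (|t| - lam)_+."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def delta_update_scad(v, lam, M, a):
    """Coordinate update (12) for delta_j under SCAD."""
    if abs(v) <= lam + lam / M:
        return soft_threshold(v, lam / M)
    if abs(v) <= a * lam:
        return soft_threshold(v, a * lam / ((a - 1) * M)) / (1 - 1 / ((a - 1) * M))
    return v

def delta_update_mcp(v, lam, M, a):
    """Coordinate update (13) for delta_j under MCP."""
    if abs(v) <= a * lam:
        return soft_threshold(v, lam / M) / (1 - 1 / (a * M))
    return v

def gamma_update(y, lin):
    """Closed-form update (14); `lin` holds alpha_i + x_i' delta."""
    d = (y > 0).astype(float)
    s2 = np.sum(d * y ** 2)
    s1 = np.sum(d * y * lin)
    return (s1 + np.sqrt(s1 ** 2 + 4.0 * s2 * np.sum(d))) / (2.0 * s2)
```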

As for the second minimization problem (8), after eliminating constants that have no effect on the minimization, the problem simplifies to
\[
\eta_{ij}=\arg\min_{\eta_{ij}}\ \frac{\rho}{2}(\eta_{ij}-\zeta_{ij})^2+P_{\lambda_2}(|\eta_{ij}|), \tag{15}
\]
where $\zeta_{ij}=\alpha_i-\alpha_j+\rho^{-1}\varphi_{ij}$. It is worth noting that (15) is convex with respect to each $\eta_{ij}$ when $a_2>\rho^{-1}$ for the MCP or $a_2>\rho^{-1}+1$ for the SCAD. Hence, the closed-form solution for the MCP penalty at the $(t+1)$th iteration is
\[
\eta_{ij}^{(t+1)}=\begin{cases}\dfrac{\mathrm{ST}(\zeta_{ij}^{(t+1)},\lambda_2/\rho)}{1-(a_2\rho)^{-1}}, & \text{if } |\zeta_{ij}^{(t+1)}|\le a_2\lambda_2,\\[4pt] \zeta_{ij}^{(t+1)}, & \text{if } |\zeta_{ij}^{(t+1)}|>a_2\lambda_2,\end{cases} \tag{16}
\]
and for the SCAD penalty it is
\[
\eta_{ij}^{(t+1)}=\begin{cases}\mathrm{ST}(\zeta_{ij}^{(t+1)},\lambda_2/\rho), & \text{if } |\zeta_{ij}^{(t+1)}|\le\lambda_2+\lambda_2/\rho,\\[4pt] \dfrac{\mathrm{ST}(\zeta_{ij}^{(t+1)},a_2\lambda_2/((a_2-1)\rho))}{1-((a_2-1)\rho)^{-1}}, & \text{if } \lambda_2+\lambda_2/\rho<|\zeta_{ij}^{(t+1)}|\le a_2\lambda_2,\\[4pt] \zeta_{ij}^{(t+1)}, & \text{if } |\zeta_{ij}^{(t+1)}|>a_2\lambda_2.\end{cases} \tag{17}
\]
We provide the complete algorithm in Algorithm 1, referred to as the ADMM-GCD algorithm for convenience.
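The $\eta$-update is separable across pairs, so a vectorized sketch (ours, with illustrative names) is:

```python
import numpy as np

def update_eta(zeta, lam2, rho, a2=3.0, penalty="mcp"):
    """Closed-form updates (16)/(17) applied elementwise to
    zeta_ij = alpha_i - alpha_j + phi_ij / rho."""
    st = lambda t, l: np.sign(t) * np.maximum(np.abs(t) - l, 0.0)
    if penalty == "mcp":                       # convex when a2 > 1/rho
        shrunk = st(zeta, lam2 / rho) / (1 - 1 / (a2 * rho))
        return np.where(np.abs(zeta) <= a2 * lam2, shrunk, zeta)
    # SCAD branch: convex when a2 > 1/rho + 1
    inner = np.where(np.abs(zeta) <= lam2 + lam2 / rho,
                     st(zeta, lam2 / rho),
                     st(zeta, a2 * lam2 / ((a2 - 1) * rho))
                     / (1 - 1 / ((a2 - 1) * rho)))
    return np.where(np.abs(zeta) > a2 * lam2, zeta, inner)
```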

Remark

The algorithm involves two nested iterations. For the inner GCD iteration, the convergence criterion is
\[
\big\|\big((\alpha^{(t,k+1)}-\alpha^{(t,k)})^\top,(\delta^{(t,k+1)}-\delta^{(t,k)})^\top,\gamma^{(t,k+1)}-\gamma^{(t,k)}\big)\big\|_2^2\le\epsilon_a.
\]
For the ADMM, we employ the criterion
\[
\|r^{(t+1)}\|_2^2=\|\Delta\alpha^{(t+1)}-\eta^{(t+1)}\|_2^2<\epsilon_b.
\]
In our simulation studies we set $\epsilon_a=10^{-4}$ and $\epsilon_b=10^{-5}$, respectively.

The following Proposition 3.1 presents the convergence properties of the proposed algorithm.

Proposition 3.1

Given $a_1>\max_j\{1/M_j\}$ and $a_2>1/\rho$ for the MCP, or $a_1>\max_j\{1/M_j\}+1$ and $a_2>1/\rho+1$ for the SCAD, any accumulation point $(\delta^{(t+1)},\alpha^{(t+1)},\gamma^{(t+1)},\eta^{(t+1)})$ generated by Algorithm 1 is a coordinate-wise minimum of $L(\delta,\alpha,\gamma,\eta,\varphi^{(t)})$. In addition, the primal residual $r^{(t+1)}=\Delta\alpha^{(t+1)}-\eta^{(t+1)}$ and the dual residual $s^{(t+1)}=\rho\Delta^\top(\eta^{(t+1)}-\eta^{(t)})$ of the ADMM satisfy $\lim_{t\to\infty}\|r^{(t)}\|_2^2=0$ and $\lim_{t\to\infty}\|s^{(t)}\|_2^2=0$ for both MCP and SCAD penalties.

4. Theoretical properties

In this section, we study the theoretical properties of the proposed estimators. We first introduce $\mathcal{M}_{\mathcal{G}}$, a subspace of $\mathbb{R}^n$, defined as
\[
\mathcal{M}_{\mathcal{G}}=\{\alpha\in\mathbb{R}^n:\alpha_i=\alpha_j,\ \text{for any}\ i,j\in G_k,\ 1\le k\le K\}.
\]
Each $\alpha\in\mathcal{M}_{\mathcal{G}}$ can also be written as $\alpha=Z\tau$, where $Z=\{z_{ik}\}$ is the $n\times K$ indicator matrix defined by $z_{ik}=1$ for $i\in G_k$ and $z_{ik}=0$ otherwise, and $\tau$ is a $K\times 1$ vector of parameters. Letting $|G_k|$ denote the number of elements in $G_k$, we have $T=Z^\top Z=\mathrm{diag}(|G_1|,\ldots,|G_K|)$ by direct calculation. Define $|G_{\min}|=\min_{1\le k\le K}|G_k|$ and $|G_{\max}|=\max_{1\le k\le K}|G_k|$.

First, we denote the true values of the parameters $\delta$, $\alpha$ and $\gamma$ by $\delta^*$, $\alpha^*$ and $\gamma^*$, respectively. Let $\tau^*=(\tau_k^*,k=1,\ldots,K)^\top$, where $\tau_k^*$ is the underlying common intercept for group $G_k$. We let $\mathcal{A}=\{j:\delta_j^*\ne 0\}\subseteq\{1,\ldots,p\}$ denote the true support set of $\delta^*$ and define $\mathcal{A}^*=\mathcal{A}\cup\{p+1,\ldots,p+K+1\}$, $\mathcal{A}_1=\mathcal{A}\cup\{p+K+1\}$ and $s=|\mathcal{A}|$. Under the sparsity assumption, $s\ll p$.

To separate the true variables, we write $\delta^*=(\delta_1^{*\top},\delta_0^{*\top})^\top$, where $\delta_1^*$ collects the nonzero components $\{\delta_j^*\ne 0\}$ and $\delta_0^*$ the zero components, and define the subspace of $\mathbb{R}^p$
\[
\mathcal{M}_{\mathcal{B}}=\{\delta\in\mathbb{R}^p:\delta_i=0\ \text{for}\ i\in\mathcal{A}^c\}.
\]
Additionally, we write $\Theta:=(\alpha^\top,\delta^\top,\gamma)^\top$ and $\Xi:=(\tau^\top,\delta_1^\top,\gamma)^\top$ for notational convenience.

Since censored and uncensored observations enter the Tobit likelihood differently, we use a more convenient notation to distinguish them clearly. Defining $n_1$ as the number of observations with $y_i>0$ and $n_0=n-n_1$, we can re-block the observations as
\[
X=\begin{bmatrix}X_0\\ X_1\end{bmatrix}\qquad\text{and}\qquad y=\begin{bmatrix}y_0\\ y_1\end{bmatrix},
\]
where $X_0$ is the $n_0\times(p+1)$ matrix of predictors corresponding to the observations with $y_i\le 0$, while $X_1$ is the $n_1\times(p+1)$ matrix of predictors corresponding to the observations with $y_i>0$. Similarly, $y_0$ and $y_1$ denote the responses not greater than and greater than 0, respectively. Then, by the definitions above,
\[
\alpha=\begin{bmatrix}\alpha_0\\ \alpha_1\end{bmatrix}=\begin{bmatrix}Z_0\tau\\ Z_1\tau\end{bmatrix}.
\]

4.1. Technical conditions

In this subsection, we will introduce several mild conditions and discuss their relevance in detail.

(C1):

Assume $\|X_j\|_2^2=n$ for $1\le j\le p$, $\|X\|_\infty\le C_1\sqrt{s}$, $\|\delta^*\|\le C_2\sqrt{s}$, $\|\alpha^*\|\le C_3\sqrt{n}$, and $|\gamma^*|\le C_0$ for some constants $0<C_0,C_1,C_2,C_3<\infty$.

(C2):

The penalty function $P_\lambda(t)$ is symmetric in $t$, and is nondecreasing and concave for $t\in[0,\infty)$. The function $\rho(t)=P_\lambda(t)/\lambda$ is constant for $t\ge a\lambda$, with $\rho(0)=0$. Moreover, $\rho'(t)$ exists and is continuous except at a finite number of points, and $\rho'(0+)=1$.

(C3):

The errors $\epsilon_i$, $i=1,\ldots,n$, are i.i.d. normally distributed with mean zero and variance $\sigma^2$, such that $P(|\epsilon_i|>t)\le 2c\exp\{-\frac{1}{2c^2}t^2\}/t$, where $c$ is a constant.

Conditions (C2) and (C3) are widely adopted in high-dimensional settings. The penalties mentioned in this article, including MCP and SCAD, satisfy (C2).

When the true group memberships $G_1,\ldots,G_K$ and the true support set $\mathcal{A}$ are known, the oracle estimators of $\alpha$, $\delta$ and $\gamma$ are defined as
\[
\hat\Theta^{or}=(\hat\alpha^{or\top},\hat\delta^{or\top},\hat\gamma^{or})^\top=\arg\max_{\alpha\in\mathcal{M}_{\mathcal{G}},\,\delta\in\mathcal{M}_{\mathcal{B}},\,\gamma\in\mathbb{R}}\log L_n(\alpha,\delta,\gamma).
\]
After removing the redundant variables, we write $\tilde X=(Z,X_{\mathcal{A}})$ and $\tilde\delta=(\tau^\top,\delta_1^\top)^\top$ for ease of calculation. Then the oracle estimators of $\tau$, $\delta_1$ and $\gamma$ are given by
\[
\hat\Xi^{or}=(\hat\tau^{or\top},\hat\delta_1^{or\top},\hat\gamma^{or})^\top=\arg\max_{\tilde\delta\in\mathbb{R}^{K+s},\,\gamma\in\mathbb{R}}\sum_{i=1}^n d_i\Big[\log(\gamma)-\frac{1}{2}(\gamma y_i-\tilde x_i^\top\tilde\delta)^2\Big]+(1-d_i)\log\Phi(-\tilde x_i^\top\tilde\delta). \tag{18}
\]
Problem (18) reduces to the traditional Tobit model. To obtain the estimator by maximum likelihood, it is necessary to calculate the matrix of second partial derivatives. We define $g(s)=\phi(s)/\Phi(s)$ and $h(s)=g(s)(s+g(s))$. The matrix can be expressed as
\[
\nabla^2\log L_n(\Xi)=-\begin{bmatrix}\tilde X & -y\end{bmatrix}^\top\begin{bmatrix}D(\tilde\delta_1)&0\\0&I\end{bmatrix}\begin{bmatrix}\tilde X & -y\end{bmatrix}-\begin{bmatrix}0&0\\0&n_1\gamma^{-2}\end{bmatrix}=-\begin{bmatrix}\tilde X_0^\top D(\tilde\delta_1)\tilde X_0+\tilde X_1^\top\tilde X_1 & -\tilde X_0^\top D(\tilde\delta_1)y_0-\tilde X_1^\top y_1\\ -y_0^\top D(\tilde\delta_1)\tilde X_0-y_1^\top\tilde X_1 & y_0^\top D(\tilde\delta_1)y_0+y_1^\top y_1+n_1\gamma^{-2}\end{bmatrix}, \tag{19}
\]
where $D(\tilde\delta_1)$ is the $n_0\times n_0$ diagonal matrix with $[D(\tilde\delta_1)]_{ii}=h_i=h(-\tilde x_i^\top\tilde\delta)$.

Olsen (1978) found that the matrix in (19) is negative semidefinite. Theorem 1 in Amemiya (1973) established the asymptotic result that this matrix is nonsingular with probability one, which ensures the invertibility of the above Hessian matrix. Moreover, we introduce an additional condition to support Theorem 4.2.

(C4):

The tuning parameters satisfy $\lambda_1\gg C_1\sqrt{s}\,\max\big\{\frac{s^{3/2}}{n_0},\frac{1}{\sqrt{n_0}},\frac{\sqrt{\log n}}{n_1}\big\}$ and $\lambda_2\gg|G_{\min}|^{-1}\max\big\{\frac{s^{3/2}}{n_0},\frac{1}{\sqrt{n_0}},\frac{\sqrt{\log n}}{n_1}\big\}$.

4.2. Theoretical results

Theorem 4.1

Suppose that $y_i^*=\mu_i+x_i^\top\beta+\epsilon_i$, where $\epsilon_i\overset{\mathrm{iid}}{\sim}N(0,\sigma^2)$, and define $y_i=y_i^*I(y_i^*>0)$ for $i=1,\ldots,n$. Let $\hat\Xi^{or}$ denote the oracle solution of the Tobit model when the true group memberships and the true support set of $\beta$ are known. Suppose conditions (C1)-(C3) hold. Then $\hat\Xi^{or}$ corresponds to the unique maximum of the likelihood function and is a consistent estimator of the true parameter value $\Xi^*$ such that
\[
\sqrt{n}\,(\hat\Xi^{or}-\Xi^*)\rightarrow N(0,\Sigma), \tag{20}
\]
where $\Sigma=\lim_{n\to\infty}\big[-\frac{1}{n}\nabla^2\log L_n(\Xi)\big|_{\Xi=\Xi^*}\big]^{-1}$. Moreover, denote by $\lambda_{\max}$ the maximum eigenvalue of the matrix $\Sigma$. If $\lambda_{\max}=O(1)$, then with probability at least $1-p_1=1-C\sqrt{\frac{\log n}{n}}\exp\{-\frac{n}{2\log n}\}$,
\[
\|\hat\Theta^{or}-\Theta^*\|_\infty\le\phi_n, \tag{21}
\]
where $\phi_n=1/\sqrt{\log n}$ and $C$ is a constant.

For $K\ge 2$, let
\[
b_n=\min_{i\in G_k,\,j\in G_{k'},\,k\ne k'}|\alpha_i^*-\alpha_j^*|=\min_{k\ne k'}|\tau_k^*-\tau_{k'}^*|
\]
be the minimal difference of the common intercept values between any two groups.

Theorem 4.2

Suppose the conditions of Theorem 4.1 hold and $K\ge 2$. Assume the minimum signal strength of $\delta^*$ satisfies $|\delta_{\mathcal{A}}^*|_{\min}>(a+1)\lambda_1$ and $b_n>a\lambda_2$, where $a$ is the constant given in (C2). If $\lambda_1,\lambda_2\gg\phi_n$, then there exists a local minimizer $\hat\Theta(\lambda_1,\lambda_2)=(\hat\alpha^\top,\hat\delta^\top,\hat\gamma)^\top$ of the objective function $Q(\alpha,\delta,\gamma)$ given in (5) satisfying
\[
P\big((\hat\alpha^\top,\hat\delta^\top,\hat\gamma)=((\hat\alpha^{or})^\top,(\hat\delta^{or})^\top,\hat\gamma^{or})\big)\rightarrow 1,
\]
that is, $P(\hat\Theta(\lambda_1,\lambda_2)=\hat\Theta^{or})\rightarrow 1$.

The proofs of these theorems are given in the Appendix.

5. Simulation studies

In this section, we conduct extensive simulation studies to investigate the numerical performance of the proposed approaches. We generate data from the censored heterogeneous linear model
\[
y_i^*=\mu_i+x_i^\top\beta+\epsilon_i,\quad i=1,\ldots,n,
\]
where the covariates $x_{ij}$, $j=1,\ldots,p$, are generated independently from the normal distribution $N(1,1)$, and the error terms $\epsilon_i$ are independent $N(0,0.5^2)$. We observe $y_i=\max\{0,y_i^*\}$, with censoring rate $q$. The true coefficient vector is $\beta=(5,1,2,0.5,0.1,0,\ldots,0)^\top$, a $p$-dimensional vector with $p-5$ zero elements. To investigate the effect of the magnitude of the difference between subgroup-specific effects, we consider two cases for the subgroup structure (a data-generation sketch follows the two cases below):

Case 1:

$K=2$, $P(\mu_i=-2)=P(\mu_i=2)=\frac{1}{2}$;

Case 2:

$K=3$, $P(\mu_i=-2)=P(\mu_i=2)=P(\mu_i=0.5)=\frac{1}{3}$.
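A minimal data-generation sketch for this design follows. The negative sign on one of the $\pm 2$ intercepts is our reading of the garbled case descriptions, and the paper does not spell out how the censoring rate $q$ is calibrated, so the sketch simply censors at zero.

```python
import numpy as np

def simulate_case(n=100, p=50, K=2, sigma=0.5, seed=0):
    """One dataset from the censored heterogeneous linear model of Section 5."""
    rng = np.random.default_rng(seed)
    X = rng.normal(1.0, 1.0, size=(n, p))             # x_ij ~ N(1, 1)
    beta = np.zeros(p)
    beta[:5] = [5.0, 1.0, 2.0, 0.5, 0.1]              # p - 5 zero coefficients
    levels = [-2.0, 2.0] if K == 2 else [-2.0, 2.0, 0.5]
    mu = rng.choice(levels, size=n)                   # subgroup intercepts
    y_star = mu + X @ beta + rng.normal(0.0, sigma, size=n)
    y = np.maximum(y_star, 0.0)                       # left censoring at zero
    return X, y, mu, beta
```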

We evaluate the estimators obtained by the proposed method with three different penalties (SCAD, MCP and lasso), and compare them to the penalized Tobit approach (Tobit SCAD, Jacobson & Zou, 2023), which assumes a homogeneous intercept effect $\mu$. Additionally, we present the results of the oracle estimators (Tobit Oracle). For each setting, we generate 100 datasets with sample size $n=100$, for every combination of $q\in\{20\%,40\%\}$ and $p=10,50,200$. Specifically, we set $\rho=2$, and set $a_1=a_2=3.7$ for the SCAD penalty and $a_1=a_2=3$ for the MCP penalty. We select the optimal tuning parameters by minimizing the modified BIC (Wang et al., 2007):
\[
\mathrm{BIC}(\lambda_1,\lambda_2)=-2\log L_n(\hat\alpha,\hat\delta,\hat\gamma)+C_n\log(n)(\hat K+\hat s+1), \tag{22}
\]
where $\hat K$ is the estimated number of subgroups and $\hat s$ the number of selected covariates. Wang et al. (2009) used $C_n=\log(\log(d))$ to accommodate a number of predictors that diverges with the sample size in high-dimensional scenarios. In this article, we let $C_n=c\log(\log(d))$, where $d=n+p+1$ and $c$ is a positive constant that we set to 1.5.
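Computing (22) is a one-liner once the fitted log-likelihood and model sizes are available; a sketch (our code) is:

```python
import numpy as np

def modified_bic(loglik, n, p, K_hat, s_hat, c=1.5):
    """Modified BIC (22) with C_n = c * log(log(d)) and d = n + p + 1."""
    C_n = c * np.log(np.log(n + p + 1))
    return -2.0 * loglik + C_n * np.log(n) * (K_hat + s_hat + 1)
```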

We evaluate the methods in three respects: accuracy of the coefficient estimates, performance of the variable selection, and identification of the subgroup structure. To measure the estimation accuracy of $\hat\mu$, $\hat\beta$ and $\hat\sigma$, we use the square roots of their mean squared errors. Letting $\mu^*$, $\beta^*$ and $\sigma^*$ denote the true parameters, the error measures are defined by $\mathrm{err}(\hat\mu)=\|\hat\mu-\mu^*\|_2/\sqrt{n}$, $\mathrm{err}(\hat\beta)=\|\hat\beta-\beta^*\|_2$ and $\mathrm{err}(\hat\sigma)=|\hat\sigma-\sigma^*|$, respectively.

In order to assess the variable selection of these methods, we report the number of true variables not included (NT) and the number of error variables included (NE). Moreover, to evaluate the performance of the subgroup analysis, we present the estimated number of groups ($\hat K$), the rate of false estimation of the number of groups (FK%) and the Rand Index (RI, Rand, 1971), which is defined by
\[
\mathrm{RI}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}},
\]
where a true positive (TP) indicates that two observations from the same true group are allocated to the same group, a true negative (TN) means two observations from different groups are allocated to different groups, a false positive (FP) denotes two observations from different groups allocated to the same group, and a false negative (FN) represents two observations from the same group allocated to different groups. A high Rand Index indicates that a substantial proportion of individuals are assigned to the correct subgroups.
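The pair counts are straightforward to accumulate; a direct sketch of the RI (our code) is:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """RI = (TP + TN) / (TP + FP + FN + TN) over all pairs of subjects."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif (not same_true) and (not same_pred):
            tn += 1
        elif (not same_true) and same_pred:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)
```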

Tables 1 and 2 present the average square root of the mean squared error (RMSE) of the estimates for the five methods, across varying values of $p$ and censoring rates. It is evident from the tables that the proposed methods with MCP and SCAD consistently yield smaller RMSE values than the lasso and Tobit SCAD methods across all simulations. Moreover, the method with the lasso penalty consistently underestimates the number of groups; the substantial deviation of that estimator can therefore be attributed to the loss of heterogeneous intercept information. Similarly, the Tobit SCAD method, which lacks the subgroup recognition capability, shows comparably inflated errors.

Table 1. Sample means and standard deviations of the evaluation measures when K = 2.

Table 2. Sample means and standard deviations of the evaluation measures when K = 3.

In Tables 3 and 4, we report the results for variable selection and subgroup identification. The NT and NE values indicate that our method is comparable to the other methods in identifying relevant variables while outperforming them in screening out irrelevant variables. The proposed approaches also achieve higher RI and lower FK values, further establishing their advantage over the alternative methods. Despite the increased difficulty of model recovery under higher censoring rates and larger parameter dimensions, our method maintains a remarkably low error rate for the estimated number of subgroups; incorrect estimates occur only when the censoring rate reaches 40%, underscoring the robustness of our approach even under such demanding conditions.

Table 3. Sample means and standard deviations of the evaluation measures when K = 2.

Table 4. Sample means and standard deviations of the evaluation measures when K = 3.

6. An empirical application to the HIV drug resistance data

Antiretroviral therapy (ART) is a common medical treatment for human immunodeficiency virus (HIV). However, the high mutation rate of HIV leads to drug-resistant mutations (DRMs) in HIV-infected patients receiving ART. In response to this challenge, physicians regularly monitor HIV viral load. When the patient's treatment regimen fails to suppress the virus, genotypic testing is conducted to check for DRMs, and subsequently, they can update the patient's drug regimen appropriately.

To identify DRMs and quantify the degree of resistance they confer against different ART treatments, we apply the proposed approach to model the relationship between HIV viral load and mutations in the virus's genome. The data used in this section are from the OPTIONS trial of the AIDS Clinical Trials Group (Gandhi et al., 2020) and can be downloaded from the Stanford HIV Drug Resistance Database (Shafer, 2006). Specifically, the OPTIONS trial enrolled 412 HIV-infected participants who were receiving protease inhibitor (PI)-based treatment and experiencing virological failure. Each individual was administered an individualized ART regimen on the basis of their drug resistance and treatment history. Individuals exhibiting moderate drug resistance were randomly allocated either to include nucleoside reverse transcriptase inhibitors (NRTIs) in their optimized treatment regimens or to exclude NRTIs from these regimens. Individuals with high drug resistance were all provided with optimized regimens that included NRTIs.

Our dataset includes n = 407 participants who completed a comprehensive 12-week follow-up assessment. The dataset contains p = 601 predictors, including 99 protease (PR) and 240 reverse transcriptase (RT) gene mutation indicators. Because of technical limitations, the assays cannot measure HIV viral load below a threshold of 50 copies/ml, so the response variable is left-censored. As proposed by Soret et al. (2018), we use the log10 HIV viral load as our response because of its prevalent conformity to a normal distribution. In this trial, 35.6% of individuals have no detectable viral load, which implies a left-censoring ratio of 35.6% for the data sample. We compare our proposed methods (SCAD and MCP) and Tobit SCAD (Jacobson & Zou, 2023) for modelling HIV viral load 12 weeks after drug regimen assignment as a function of several variables, including HIV genotypic mutations, current drug regimen, baseline viral load and observation week.

First, to evaluate model selection performance, we fit the entire dataset with Tobit SCAD and with the proposed methods (SCAD and MCP). The constants $a_1$, $a_2$ and $\rho$ are set as in Section 5. Sparse models containing the covariates M184V, the baseline viral load and RAL are consistently selected by all three approaches. Here, M184V is an indicator of a mutation in the reverse transcriptase (RT) gene, and RAL indicates whether the participant was taking raltegravir. The Stanford University HIV Drug Resistance Database lists M184V as a major NRTI resistance mutation (Jacobson & Zou, 2023; Shafer, 2006). Moreover, the absence of additional NRTIs among the selected variables aligns with the results in Gandhi et al. (2020). This congruence highlights the practical performance of the proposed approaches in model selection. The estimated coefficients are shown in Table 5.

Table 5. The estimators of parameters in the example.

The key difference between the proposed methods and Tobit SCAD lies in the capacity of the proposed methods to identify subgroup structures in the intercepts. In Figure 1, we present the estimated density function of $y_i-x_i^\top\hat\beta^{TS}$, where $\hat\beta^{TS}$ is the coefficient vector estimate of the Tobit SCAD model (Jacobson & Zou, 2023). The distribution of $y_i-x_i^\top\hat\beta^{TS}$ is clearly multimodal, so it appears more appropriate to fit the dataset with a model with heterogeneous intercepts. Subsequently, in Figure 2, we present the density estimates of $y_i-x_i^\top\hat\beta^{PS}$ for the two subgroups characterized by different intercepts, where $\hat\beta^{PS}$ is the coefficient vector estimate of the proposed SCAD model. The density estimates for the proposed MCP method exhibit similar characteristics and are therefore omitted. Compared with the density shown in Figure 1, each subgroup in Figure 2 displays a unimodal distribution with greater homogeneity.

Figure 1. Density plot of the response variable after adjusting for the effects of the covariates in the empirical example.


Figure 2. Density plots of response variable after adjusting for the effects of the covariates under two different intercept groups for proposed SCAD method.


The proposed SCAD method detects two subgroups with sample sizes $n_1=47$ and $n_2=360$, respectively. It is worth noting that the patients in the OPTIONS design were originally categorized into two groups: a randomized group of 356 patients and a highly resistant group of 51 patients. We therefore treat this historical grouping as a control group structure and calculate the Rand Index (RI) of the proposed SCAD method against it. The RI is 0.757, which indicates a substantial degree of agreement between the detected subgroup structure and the historical one. This suggests that the subgroup composition may change over the course of antiretroviral therapy, a phenomenon that is quite plausible in the context of HIV patient management.

In addition, we compare the in-sample and out-of-sample errors of the aforementioned methods and of the Tobit SCAD method with known subgroup structure. These errors are defined as the means over all samples of (i) $\mathrm{err}(y_i,\hat y_i)=(y_i-\hat y_i)^2$ and (ii) $\mathrm{err}(y_i,\check y_i)=(y_i-\check y_i)^2$, $i=1,\ldots,n$, where $\hat y_i$ is the fitted response and $\check y_i$ is the predicted response from a 5-fold cross-validation. To ensure comparability, we employ stratified sampling to maintain a consistent left-censoring ratio across the test sets. Table 6 presents the two types of criteria. For both types of error, the Tobit SCAD method applied with the known subgroup structure exhibits smaller errors than the homogeneous model. This underscores the importance of capturing the inherent heterogeneity of the dataset, as it enables more effective modelling of the variables influencing the response. While the errors under the control grouping are slightly larger than those of the proposed methods, these discrepancies may stem from changes in drug resistance among patients during treatment. Moreover, the disparities with the control group could serve as a basis for further examination of specific patient cases: the existence of distinct groups may be attributable to differences in the treatment trajectories of patients or to inherent physiological differences among individuals. In clinical terms, these findings suggest the potential for devising more precise and tailored management protocols for specific patient subgroups, improving overall treatment efficacy.
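A sketch of the stratified 5-fold out-of-sample error follows; `fit` and `predict` are placeholders for any of the compared estimators, and the 50 copies/ml detection limit on the log10 scale is our assumed definition of the censoring stratum.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_out_of_sample_error(X, y, fit, predict, n_splits=5, seed=0):
    """Mean squared prediction error with folds stratified on censoring."""
    censored = (y <= np.log10(50)).astype(int)   # left-censoring indicator
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train, test in skf.split(X, censored):
        model = fit(X[train], y[train])
        y_check = predict(model, X[test])
        errors.append(np.mean((y[test] - y_check) ** 2))
    return float(np.mean(errors))
```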

Table 6. The in-sample and out-of-sample errors.

7. Conclusion

This paper introduces a novel approach to analysing censored data with potential heterogeneity in intercept effects by combining the penalized Tobit likelihood with a concave fusion penalty. The proposed method automatically identifies heterogeneous structures in intercept effects and conducts variable selection.

To solve the optimization problem associated with the method, we propose an algorithm based on the generalized coordinate descent method and the alternating direction method of multipliers. The algorithm simplifies the optimization and reduces computational cost by replacing a complex nonlinear optimization problem with quadratic majorization subproblems. This choice ensures efficient computation, particularly in high-complexity scenarios, while maintaining good properties of the estimator. Furthermore, we establish the oracle property of the parameter estimators: a local minimizer of the objective function coincides with the oracle solution with probability tending to one, providing theoretical support for the correctness of the method. Our ADMM-GCD algorithm with SCAD or MCP performs well in both extensive simulation studies and the application to real data.

The proposed method can also be extended to incorporate the fusion of coefficient terms. However, there are still challenges in applying this method to generalized linear models or survival models, and further research is needed to develop algorithms and establish theoretical properties in these models.

Acknowledgements

We are very grateful to the editor, managing editor and the referee for their comments that improved this article.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [grant numbers 12171450 and 71921001].

References

  • Alhamzawi, A. (2020). A new Bayesian elastic net for Tobit regression. Journal of Physics: Conference Series, 1664(1), 012047.
  • Alhamzawi, R. (2016). Bayesian elastic net Tobit quantile regression. Communications in Statistics-Simulation and Computation, 45(7), 2409–2427. https://doi.org/10.1080/03610918.2014.904341
  • Amemiya, T. (1973). Regression analysis when the dependent variable is truncated normal. Econometrica: Journal of the Econometric Society, 41(6), 997–1016. https://doi.org/10.2307/1914031
  • Amemiya, T. (1984). Tobit models: A survey. Journal of Econometrics, 24(1–2), 3–61. https://doi.org/10.1016/0304-4076(84)90074-5
  • Bondell, H. D., & Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1), 115–123. https://doi.org/10.1111/biom.2008.64.issue-1
  • Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122. https://doi.org/10.1561/2200000016
  • Bradic, J., Fan, J., & Jiang, J. (2011). Regularization for Cox's proportional hazards model with NP-dimensionality. Annals of Statistics, 39(6), 3092. https://doi.org/10.1214/11-AOS911
  • Dagne, G. A. (2016). A growth mixture Tobit model: Application to AIDS studies. Journal of Applied Statistics, 43(7), 1174–1185. https://doi.org/10.1080/02664763.2015.1092114
  • Everitt, B. (2013). Finite mixture distributions. Springer Science & Business Media.
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
  • Fan, J., & Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1), 101–148.
  • Fan, Y., & Tang, C. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 531–552. https://doi.org/10.1111/rssb.12001
  • Gandhi, R. T., Tashima, K. T., Smeaton, L. M., Vu, V., Ritz, J., Andrade, A., Eron, J. J., Hogg, E., & Fichtenbaum, C. J. (2020). Long-term outcomes in a large randomized trial of HIV-1 salvage therapy: 96-week results of AIDS Clinical Trials Group A5241 (OPTIONS). The Journal of Infectious Diseases, 221(9), 1407–1415. https://doi.org/10.1093/infdis/jiz281
  • Jacobson, T., & Zou, H. (2023). High-dimensional censored regression via the penalized Tobit likelihood. Journal of Business & Economic Statistics, 42(1), 286–297. https://doi.org/10.1080/07350015.2023.2182309
  • Johnson, B. A. (2009). On lasso for censored data. Electronic Journal of Statistics, 3, 485–506. https://doi.org/10.1214/08-EJS322
  • Ma, S., & Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517), 410–423. https://doi.org/10.1080/01621459.2016.1148039
  • Ma, S., Huang, J., Zhang, Z., & Liu, M. (2019). Exploration of heterogeneous treatment effects via concave fusion. The International Journal of Biostatistics, 16(1), 20180026. https://doi.org/10.1515/ijb-2018-0026
  • Müller, P., & van de Geer, S. (2016). Censored linear model in high dimensions: Penalised linear regression on high-dimensional data with left-censored response variable. Test, 25(1), 75–92. https://doi.org/10.1007/s11749-015-0441-7
  • Olsen, R. J. (1978). Note on the uniqueness of the maximum likelihood estimator for the Tobit model. Econometrica: Journal of the Econometric Society, 46(5), 1211–1215. https://doi.org/10.2307/1911445
  • Powell, J. L. (1984). Least absolute deviations estimation for the censored regression model. Journal of Econometrics, 25(3), 303–325. https://doi.org/10.1016/0304-4076(84)90004-6
  • Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850. https://doi.org/10.1080/01621459.1971.10482356
  • Shafer, R. W. (2006). Rationale and uses of a public HIV drug-resistance database. The Journal of Infectious Diseases, 194(s1), S51–S58. https://doi.org/10.1086/jid.2006.194.issue-s1
  • Shen, J., & He, X. (2015). Inference for subgroup analysis with a structured logistic-normal mixture model. Journal of the American Statistical Association, 110(509), 303–312. https://doi.org/10.1080/01621459.2014.894763
  • Soret, P., Avalos, M., Wittkop, L., Commenges, D., & Thiébaut, R. (2018). Lasso regularization for left-censored Gaussian outcome and high-dimensional predictors. BMC Medical Research Methodology, 18(1), 1–13. https://doi.org/10.1186/s12874-018-0609-4
  • Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica: Journal of the Econometric Society, 26(1), 24–36. https://doi.org/10.2307/1907382
  • Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3), 475–494. https://doi.org/10.1023/A:1017501703105
  • Wang, H., Li, B., & Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3), 671–683. https://doi.org/10.1111/j.1467-9868.2008.00693.x
  • Wang, H., Li, R., & Tsai, C. L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3), 553–568. https://doi.org/10.1093/biomet/asm053
  • Wang, X., Zhu, Z., & Zhang, H. H. (2019). Spatial automatic subgroup analysis for areal data with repeated measures. arXiv:1906.01853.
  • Yan, X., Yin, G., & Zhao, X. (2021). Subgroup analysis in censored linear regression. Statistica Sinica, 31(2), 1027–1054.
  • Yang, Y., & Zou, H. (2013). An efficient algorithm for computing the HHSVM and its generalizations. Journal of Computational and Graphical Statistics, 22(2), 396–415. https://doi.org/10.1080/10618600.2012.680324
  • Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
  • Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. The Journal of Machine Learning Research, 7(90), 2541–2563.
  • Zhou, X., & Liu, G. (2016). LAD-lasso variable selection for doubly censored median regression models. Communications in Statistics-Theory and Methods, 45(12), 3658–3667. https://doi.org/10.1080/03610926.2014.904357
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735

Appendix

A.1. Proof of Proposition 3.1

First, according to the definition of $\eta^{(t+1)}$, we have
\[
L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t+1)},\varphi^{(t)})\le L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta,\varphi^{(t)})
\]
for any $\eta$. Define
\[
f^{(t+1)}=\inf_{\Delta\alpha^{(t+1)}-\eta=0}\big\{L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta,\varphi^{(t)})\big\},
\]
and then we have $L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t+1)},\varphi^{(t)})\le f^{(t+1)}$. Let $k$ be a non-negative integer. Since $\varphi^{(t+k-1)}=\varphi^{(t)}+\rho\sum_{i=1}^{k-1}(\Delta\alpha^{(t+i)}-\eta^{(t+i)})$, we obtain
\[
L(\alpha^{(t+k)},\delta^{(t+k)},\gamma^{(t+k)},\eta^{(t+k)},\varphi^{(t+k-1)})=\ell_n(\alpha^{(t+k)},\delta^{(t+k)},\gamma^{(t+k)})+\sum_{j=1}^p P_{\lambda_1}(|\delta_j^{(t+k)}|)+\sum_{i<j}P_{\lambda_2}(|\eta_{ij}^{(t+k)}|)+\varphi^{(t+k-1)\top}(\Delta\alpha^{(t+k)}-\eta^{(t+k)})+\frac{\rho}{2}\|\Delta\alpha^{(t+k)}-\eta^{(t+k)}\|_2^2
\]
\[
=S(\alpha^{(t+k)},\delta^{(t+k)},\gamma^{(t+k)},\eta^{(t+k)})+\frac{\rho}{2}\|\Delta\alpha^{(t+k)}-\eta^{(t+k)}\|_2^2+\Big[\varphi^{(t)}+\rho\sum_{i=1}^{k-1}(\Delta\alpha^{(t+i)}-\eta^{(t+i)})\Big]^\top(\Delta\alpha^{(t+k)}-\eta^{(t+k)})\le f^{(t+k)}.
\]
According to Yang and Zou (2013), the GCD limit point $(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)})$ is a coordinate-wise minimum of $L(\alpha,\delta,\gamma,\eta^{(t)},\varphi^{(t)})$. Additionally, the objective function is convex with respect to $\eta$. Then, according to Theorem 4.1 in Tseng (2001), the sequence $(\alpha^{(t)},\delta^{(t)},\gamma^{(t)},\eta^{(t)})$ converges to a coordinate-wise minimum point $(\bar\alpha,\bar\delta,\bar\gamma,\bar\eta)$. Thus we have
\[
\bar f=\lim_{t\to\infty}f^{(t+1)}=\lim_{t\to\infty}f^{(t+k)}=\inf_{\Delta\bar\alpha-\eta=0}\Big\{\ell_n(\bar\alpha,\bar\delta,\bar\gamma)+\sum_{j=1}^p P_{\lambda_1}(|\bar\delta_j|)+\sum_{i<j}P_{\lambda_2}(|\eta_{ij}|)\Big\},
\]
and for all $k\ge 0$,
\[
\bar f\ge\lim_{t\to\infty}L(\alpha^{(t+k)},\delta^{(t+k)},\gamma^{(t+k)},\eta^{(t+k)},\varphi^{(t+k-1)})=\ell_n(\bar\alpha,\bar\delta,\bar\gamma)+\sum_{j=1}^p P_{\lambda_1}(|\bar\delta_j|)+\sum_{i<j}P_{\lambda_2}(|\bar\eta_{ij}|)+\lim_{t\to\infty}\varphi^{(t)\top}(\Delta\bar\alpha-\bar\eta)+\Big(k-\frac{1}{2}\Big)\rho\,\|\Delta\bar\alpha-\bar\eta\|_2^2.
\]
Therefore,
\[
\Big(k-\frac{1}{2}\Big)\rho\,\|\Delta\bar\alpha-\bar\eta\|_2^2\le\inf_{\Delta\bar\alpha-\eta=0}\Big\{\sum_{i<j}P_{\lambda_2}(|\eta_{ij}|)\Big\}-\sum_{i<j}P_{\lambda_2}(|\bar\eta_{ij}|)-\lim_{t\to\infty}\varphi^{(t)\top}(\Delta\bar\alpha-\bar\eta).
\]
Since the above inequality holds for all $k\ge 0$, it follows that $\lim_{t\to\infty}\|r^{(t)}\|_2^2=\|\Delta\bar\alpha-\bar\eta\|_2^2=0$.

Moreover, since $(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)})$ minimizes $L(\alpha,\delta,\gamma,\eta^{(t)},\varphi^{(t)})$, by definition we have
\[
\partial L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)},\varphi^{(t)})/\partial\alpha=\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\varphi^{(t)}+\rho\,\Delta^\top(\Delta\alpha^{(t+1)}-\eta^{(t)})
\]
\[
=\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\big(\varphi^{(t)}+\rho(\Delta\alpha^{(t+1)}-\eta^{(t)})\big)=\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\varphi^{(t+1)}+\rho\,\Delta^\top(\eta^{(t+1)}-\eta^{(t)})=0.
\]
Thus,
\[
s^{(t+1)}=\rho\,\Delta^\top(\eta^{(t+1)}-\eta^{(t)})=-\big(\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\varphi^{(t+1)}\big).
\]
Since $\lim_{t\to\infty}\|r^{(t)}\|_2^2=\|\Delta\bar\alpha-\bar\eta\|_2^2=0$,
\[
\lim_{t\to\infty}\partial L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)},\varphi^{(t)})/\partial\alpha=\lim_{t\to\infty}\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\varphi^{(t+1)}=0.
\]
Therefore, $\lim_{t\to\infty}\|s^{(t+1)}\|_2^2=0$.

A.2. Proof of Theorem 4.1

Let $\theta=\hat\Xi^{or}-\Xi^*$ and $q=K+s+1$. From (20), $\sqrt{n}\,\theta\to N(0,\Sigma)$, where $\Sigma=\big[-\frac{1}{n}\nabla^2\log L_n(\Xi)\big|_{\Xi=\Xi^*}\big]^{-1}$.

In order to bound the tail probability of the event in (21), we introduce a $q$-dimensional vector $a=(a_1,\ldots,a_q)^\top$ satisfying $\|a\|_2=1$, that is, $a_1^2+\cdots+a_q^2=1$. It is clear that $a^\top\theta\sim N(0,\frac{1}{n}a^\top\Sigma a)$ asymptotically. Since $|a^\top\theta|=|a_1\theta_1+\cdots+a_q\theta_q|\le|a_1\theta_1|+\cdots+|a_q\theta_q|\le\|a\|_1\max_j|\theta_j|$, and taking $a=e_j$ attains $|\theta_j|$, it is easy to get $\max_{\|a\|_2=1}|a^\top\theta|=\max_j|\theta_j|$. Since $\Sigma$ is a symmetric positive definite matrix, the maximum value of the quadratic form $f=a^\top\Sigma a$ over $\|a\|_2=1$ is the largest eigenvalue $\lambda_{\max}$ of the matrix $\Sigma$.

For a normal random variable $X\sim N(0,\sigma^2)$, the two-sided tail probability inequality is
\[
P(|X|<c)\ge 1-\sqrt{\frac{2}{\pi}}\,\frac{\sigma}{c}\exp\Big\{-\frac{c^2}{2\sigma^2}\Big\}. \tag{A1}
\]
So
\[
P(|a^\top\theta|<\phi_n)\ge 1-\sqrt{\frac{2}{\pi}}\,\frac{\sqrt{f/n}}{\phi_n}\exp\Big\{-\frac{\phi_n^2}{2(f/n)}\Big\}.
\]
Since the right-hand side of the above inequality is decreasing in $f$,
\[
P\Big(\max_{\|a\|_2=1}|a^\top\theta|<\phi_n\Big)\ge 1-\sqrt{\frac{2}{\pi}}\,\frac{\sqrt{\lambda_{\max}/n}}{\phi_n}\exp\Big\{-\frac{\phi_n^2}{2(\lambda_{\max}/n)}\Big\}.
\]
Therefore,
\[
P(\|\hat\Theta^{or}-\Theta^*\|_\infty<\phi_n)=P(\|\hat\Xi^{or}-\Xi^*\|_\infty<\phi_n)=P\big(\max_j|\theta_j|<\phi_n\big)=P\Big(\max_{\|a\|_2=1}|a^\top\theta|<\phi_n\Big)\ge 1-\sqrt{\frac{2}{\pi}}\,\frac{\sqrt{\lambda_{\max}/n}}{\phi_n}\exp\Big\{-\frac{\phi_n^2}{2(\lambda_{\max}/n)}\Big\}. \tag{A2}
\]
Since $\lambda_{\max}=O(1)$, we set $\phi_n=1/\sqrt{\log n}$ and $t=\sqrt{\log n/n}$. Then $P(\|\hat\Theta^{or}-\Theta^*\|_\infty<\phi_n)\ge 1-C\sqrt{\frac{\log n}{n}}\exp\{-\frac{n}{2\log n}\}=1-Ct\exp\{-\frac{1}{2t^2}\}$, where $C$ is a constant. Obviously, $t\to 0$, and then $P(\|\hat\Theta^{or}-\Theta^*\|_\infty<\phi_n)\to 1$ as $n\to\infty$.

A.3. Proof of Theorem 4.2

First, given the true group division $G_1,\ldots,G_K$ and the true support set $\mathcal{A}$, we define
\[
L_n(\alpha,\delta,\gamma)=\ell_n(\alpha,\delta,\gamma),\quad P_{\lambda_1}(\delta)=\lambda_1\sum_{j=1}^p\rho(|\delta_j|),\quad P_{\lambda_2}(\alpha)=\lambda_2\sum_{i<j}\rho(|\alpha_i-\alpha_j|),
\]
\[
L_n^O(\tau,\delta_1,\gamma)=\ell_n^O(\tau,\delta_1,\gamma),\quad P_{\lambda_1}^O(\delta_1)=\lambda_1\sum_{j\in\mathcal{A}}\rho(|\delta_j|),\quad P_{\lambda_2}^O(\tau)=\lambda_2\sum_{k<k'}|G_k||G_{k'}|\,\rho(|\tau_k-\tau_{k'}|),
\]
and denote
\[
Q_n(\alpha,\delta,\gamma)=L_n(\alpha,\delta,\gamma)+P_{\lambda_1}(\delta)+P_{\lambda_2}(\alpha),\qquad Q_n^O(\tau,\delta_1,\gamma)=L_n^O(\tau,\delta_1,\gamma)+P_{\lambda_1}^O(\delta_1)+P_{\lambda_2}^O(\tau).
\]
Let $T:\mathcal{M}_{\mathcal{G}}\to\mathbb{R}^K$ be the mapping such that $T(\alpha)$ is the $K\times 1$ vector whose $k$th coordinate equals the common value of $\alpha_i$ for $i\in G_k$, and let $T_0:\mathbb{R}^n\to\mathbb{R}^K$ be the mapping $T_0(\alpha)=\{|G_k|^{-1}\sum_{i\in G_k}\alpha_i\}_{k=1}^K$. Let $S:\mathbb{R}^p\to\mathbb{R}^s$ be the mapping that keeps only the components of $\delta$ indexed by $\mathcal{A}$, that is, $S(\delta)=\delta_{\mathcal{A}}$, and let $S_1:\mathbb{R}^s\to\mathcal{M}_{\mathcal{B}}$ be the mapping $S_1(\delta_1)=(\delta_1^\top,0_{\mathcal{A}^c}^\top)^\top$. Obviously, when $\alpha\in\mathcal{M}_{\mathcal{G}}$ and $\delta\in\mathcal{M}_{\mathcal{B}}$, $T(\alpha)=T_0(\alpha)$ and $\delta_{\mathcal{A}}=S(\delta)$. Moreover, for every $\delta\in\mathcal{M}_{\mathcal{B}}$ and $\alpha\in\mathcal{M}_{\mathcal{G}}$, we have $P_{\lambda_1}(\delta)=P_{\lambda_1}^O(\delta_{\mathcal{A}})$ and $P_{\lambda_2}(\alpha)=P_{\lambda_2}^O(T(\alpha))$; for every $\delta_1\in\mathbb{R}^s$ and $\tau\in\mathbb{R}^K$, we have $P_{\lambda_1}(S_1(\delta_1))=P_{\lambda_1}^O(\delta_1)$ and $P_{\lambda_2}(T^{-1}(\tau))=P_{\lambda_2}^O(\tau)$. Hence
\[
Q_n(\alpha,\delta,\gamma)=Q_n^O(T(\alpha),\delta_{\mathcal{A}},\gamma),\qquad Q_n^O(\tau,\delta_1,\gamma)=Q_n(T^{-1}(\tau),S_1(\delta_1),\gamma). \tag{A3}
\]
Consider the neighbourhood of $(\alpha^*,\delta^*,\gamma^*)$
\[
\Psi=\big\{\alpha\in\mathbb{R}^n,\delta\in\mathbb{R}^p,\gamma\in\mathbb{R}:\|((\alpha-\alpha^*)^\top,(\delta-\delta^*)^\top,\gamma-\gamma^*)\|_\infty\le\phi_n\big\}.
\]
Define the event $E_1=\{\hat\Theta^{or}\in\Psi\}$. By Theorem 4.1 we have $P(E_1^c)\le p_1$. For any $\alpha\in\mathbb{R}^n$ and $\delta\in\mathbb{R}^p$, let $\alpha^0=T^{-1}(T_0(\alpha))$ and $\delta^0=S_1(\delta_{\mathcal{A}})$. We will prove that $(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ is a local minimizer of the objective function $Q_n(\alpha,\delta,\gamma)$ with probability approaching 1 through the following two steps.

  1. On the event $E_1$, $Q_n(\alpha^0,\delta^0,\gamma)\ge Q_n(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ for any $(\alpha,\delta,\gamma)\in\Psi$.

  2. There is an event $E_2$ such that $P(E_2^c)\le p_2=\frac{n_1}{n\sqrt{\log n}}$. On $E_1\cap E_2$, there is a neighbourhood of $(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$, denoted by $\Psi_n=\{\alpha,\delta:\|((\alpha-\hat\alpha^{or})^\top,(\delta-\hat\delta^{or})^\top)\|_\infty\le t_n\}$, such that $Q_n(\alpha,\delta,\gamma)\ge Q_n(\alpha^0,\delta^0,\gamma)$ for any $(\alpha,\delta,\gamma)\in\Psi\cap\Psi_n$ for sufficiently large $n$.

Therefore, combining the results of (i) and (ii), we have $Q_n(\alpha,\delta,\gamma)\ge Q_n(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ for any $(\alpha,\delta,\gamma)\in\Psi\cap\Psi_n$, so that $(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ is a strict local minimizer of $Q_n(\alpha,\delta,\gamma)$ on the event $E_1\cap E_2$, with $P(E_1\cap E_2)\ge 1-p_1-p_2$ for sufficiently large $n$.

Step (i): Since $(\hat\tau^{or},\hat\delta_1^{or},\hat\gamma^{or})$ is a global minimizer of $L_n^O(\tau,\delta_1,\gamma)$, we have $L_n^O(T_0(\alpha),S(\delta),\gamma)\ge L_n^O(\hat\tau^{or},\hat\delta_1^{or},\hat\gamma^{or})$ for all $(\alpha,\delta,\gamma)\in\Psi$. We next show that $P_{\lambda_1}^O(S(\delta))$ is a constant not depending on $\delta$ for $\delta\in\Psi$, and that $P_{\lambda_2}^O(T_0(\alpha))$ is a constant not depending on $\alpha$ for $\alpha\in\Psi$. Let $T_0(\alpha)=\tau=(\tau_1,\ldots,\tau_K)^\top$. For any $k\ne k'$, since
\[
|\tau_k^*-\tau_{k'}^*|=|\tau_k^*-\tau_k+\tau_k-\tau_{k'}+\tau_{k'}-\tau_{k'}^*|\le|\tau_k-\tau_{k'}|+|\tau_k-\tau_k^*|+|\tau_{k'}-\tau_{k'}^*|,
\]
we have
\[
|\tau_k-\tau_{k'}|\ge|\tau_k^*-\tau_{k'}^*|-2\sup_k|\tau_k-\tau_k^*|,
\]
and
\[
\sup_k|\tau_k-\tau_k^*|=\sup_k\Big|\sum_{i\in G_k}\alpha_i/|G_k|-\tau_k^*\Big|=\sup_k\Big|\sum_{i\in G_k}(\alpha_i-\alpha_i^*)/|G_k|\Big|\le\sup_k\sup_{i\in G_k}|\alpha_i-\alpha_i^*|=\|\alpha-\alpha^*\|_\infty. \tag{A4}
\]
Therefore, for all $k$ and $k'$,
\[
|\tau_k-\tau_{k'}|\ge|\tau_k^*-\tau_{k'}^*|-2\|\alpha-\alpha^*\|_\infty\ge b_n-2\phi_n>a\lambda_2,
\]
which indicates that $\rho(|\tau_k-\tau_{k'}|)$ is a constant by condition (C2), and as a result $P_{\lambda_2}^O(T_0(\alpha))$ is also a constant. Similarly, for any $j\in\mathcal{A}$, since
\[
|\delta_j^*|=|\delta_j^*-\delta_j+\delta_j|\le|\delta_j-\delta_j^*|+|\delta_j|,
\]
by the condition $|\delta_{\mathcal{A}}^*|_{\min}>(a+1)\lambda_1$ we have
\[
|\delta_j|\ge|\delta_j^*|-\|\delta-\delta^*\|_\infty\ge(a+1)\lambda_1-\phi_n\ge a\lambda_1.
\]
As a result, both $\rho(|\delta_j|)$ and $P_{\lambda_1}^O(S(\delta))$ are constants.

In conclusion, we have $Q_n^O(T_0(\alpha),S(\delta),\gamma)\ge Q_n^O(\hat\tau^{or},\hat\delta_1^{or},\hat\gamma^{or})$ for all $(\alpha,\delta,\gamma)\in\Psi$. In addition, $Q_n^O(\hat\tau^{or},\hat\delta_1^{or},\hat\gamma^{or})=Q_n(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ and $Q_n^O(T_0(\alpha),S(\delta),\gamma)=Q_n(T^{-1}(T_0(\alpha)),S_1(\delta_{\mathcal{A}}),\gamma)=Q_n(\alpha^0,\delta^0,\gamma)$. Hence we get $Q_n(\alpha^0,\delta^0,\gamma)\ge Q_n(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$, and the result in (i) is proved.

Step (ii): First, we introduce the neighbourhood $\Psi_n=\{\alpha,\delta:\|((\alpha-\hat\alpha^{or})^\top,(\delta-\hat\delta^{or})^\top)\|_\infty\le t_n\}$ for a positive sequence $t_n$. For $(\alpha,\delta,\gamma)\in\Psi\cap\Psi_n$, by Taylor's expansion at $(\alpha^0,\delta^0)$, we have
\[
Q_n(\alpha,\delta,\gamma)-Q_n(\alpha^0,\delta^0,\gamma)=w^\top(\alpha-\alpha^0)+\sum_{i=1}^n\frac{\partial P_{\lambda_2}(\alpha^m)}{\partial\alpha_i}(\alpha_i-\alpha_i^0)-v^\top(\delta-\delta^0)+\sum_{j=1}^p\frac{\partial P_{\lambda_1}(\delta^m)}{\partial\delta_j}(\delta_j-\delta_j^0)=\Gamma_1+\Gamma_2+\Gamma_3+\Gamma_4,
\]
where
\[
w=\Big[\frac{1}{n_1}(\gamma y_1-\alpha_1^m-X_1\delta^m)^\top,\ \frac{1}{n_0}g(-\alpha_0^m-X_0\delta^m)^\top\Big]^\top=\Big(\frac{1}{n_1}\Lambda_1^\top,\ \frac{1}{n_0}\Lambda_2^\top\Big)^\top
\]
and
\[
v=\frac{1}{n_1}X_1^\top(\gamma y_1-\alpha_1^m-X_1\delta^m)+\frac{1}{n_0}X_0^\top g(-\alpha_0^m-X_0\delta^m)=\frac{1}{n_1}X_1^\top\Lambda_1+\frac{1}{n_0}X_0^\top\Lambda_2,
\]
in which $\alpha^m=\zeta_1\alpha+(1-\zeta_1)\alpha^0$ and $\delta^m=\zeta_2\delta+(1-\zeta_2)\delta^0$ for some $\zeta_1,\zeta_2\in(0,1)$. Firstly, since $\alpha_i^0=|G_k|^{-1}\sum_{j\in G_k}\alpha_j$ for $i\in G_k$,
\[
\Gamma_1=w^\top(\alpha-\alpha^0)=\sum_{k=1}^K\sum_{i,j\in G_k}\frac{w_i(\alpha_i-\alpha_j)}{|G_k|}=\sum_{k=1}^K\sum_{i,j\in G_k}\frac{w_i(\alpha_i-\alpha_j)}{2|G_k|}+\sum_{k=1}^K\sum_{i,j\in G_k}\frac{w_j(\alpha_j-\alpha_i)}{2|G_k|}=\sum_{k=1}^K\sum_{i,j\in G_k}\frac{(w_j-w_i)(\alpha_j-\alpha_i)}{2|G_k|}=\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\frac{(w_j-w_i)(\alpha_j-\alpha_i)}{|G_k|}. \tag{A5}
\]
As shown in (A4),
\[
\|\alpha^0-\alpha^*\|_\infty=\|\tau-\tau^*\|_\infty\le\|\alpha-\alpha^*\|_\infty. \tag{A6}
\]
Since $\alpha^m=\zeta_1\alpha+(1-\zeta_1)\alpha^0$,
\[
\|\alpha^m-\alpha^*\|_\infty\le\|\alpha-\alpha^*\|_\infty\le\phi_n. \tag{A7}
\]
By the same steps as in (A4) and (A6),
\[
\|\delta^0-\delta^*\|_\infty=\|\delta_1-\delta_1^*\|_\infty\le\|\delta-\delta^*\|_\infty. \tag{A8}
\]
Since $\delta^m=\zeta_2\delta+(1-\zeta_2)\delta^0$,
\[
\|\delta^m-\delta^*\|_\infty\le\|\delta-\delta^*\|_\infty\le\phi_n. \tag{A9}
\]
Then, by condition (C1),
\[
\Lambda_1=\gamma y_1-\alpha_1^m-X_1\delta^m=\frac{\gamma}{\gamma^*}(\alpha_1^*+X_1\delta^*+\varepsilon_1)-X_1\delta^m-\alpha_1^m=\frac{\gamma-\gamma^*}{\gamma^*}(\alpha_1^*+X_1\delta^*)+\alpha_1^*-\alpha_1^m+X_1(\delta^*-\delta^m)+\frac{\gamma-\gamma^*+\gamma^*}{\gamma^*}\varepsilon_1,
\]
so that
\[
\|\Lambda_1\|\le\phi_nC_0\big(C_3\sqrt{n_1}+C_1\sqrt{s}\,C_2\sqrt{s}\big)+\phi_n+C_1\sqrt{s}\,\phi_n+(\phi_n+C_0)\|\varepsilon_1\|.
\]
Consider the $i$th element of the vector $\Lambda_2$ and define a positive constant $\xi_i$ satisfying
\[
g_i(-X_0\delta^m-\alpha_0^m)\le|x_i^\top\delta^m+\alpha_i^m|+\xi_i\le|x_i^\top(\delta^m-\delta^*)|+|\alpha_i^m-\alpha_i^*|+|x_i^\top\delta^*+\alpha_i^*|+\xi_i.
\]
Define $\xi=\max\{\xi_1,\ldots,\xi_{n_0}\}$, so that
\[
\|\Lambda_2\|\le\|X_0(\delta^m-\delta^*)\|+\|\alpha_0^m-\alpha_0^*\|+\|X_0\delta^*+\alpha_0^*\|+\xi\le C_1\sqrt{s}\,\phi_n+\phi_n+C_1\sqrt{s}\,C_2\sqrt{s}+C_3\sqrt{n_0}+\xi.
\]
Therefore,
\[
\max_{i,j}|w_j-w_i|\le 2\|w\|_\infty\le 2\max\Big\{\frac{\|\Lambda_1\|}{n_1},\frac{\|\Lambda_2\|}{n_0}\Big\}.
\]
By condition (C3), there is a constant $c_1$ such that
\[
P\big(\|\varepsilon_1\|_\infty>c_1\sqrt{\log n}\big)\le\sum_{i=1}^{n_1}P\big(|\varepsilon_i|>c_1\sqrt{\log n}\big)\le\frac{n_1}{n\sqrt{\log n}}.
\]
Thus there is an event $E_2$ such that $P(E_2^c)\le\frac{n_1}{n\sqrt{\log n}}$, and on the event $E_2$,
\[
|G_{\min}|^{-1}\max_{i,j}|w_j-w_i|\le 2|G_{\min}|^{-1}\max\Big\{\frac{\phi_n}{n_1}\big(C_1n_1+C_2s^{3/2}+C_3s+C_4\sqrt{\log n}\big)+\frac{C_5\sqrt{\log n}}{n_1},\ \frac{1}{n_0}\big(C_6\sqrt{n_0}+C_7s^{3/2}+C_8\phi_ns+C_9\phi_n+\xi\big)\Big\}.
\]
Under condition (C4), $\lambda_2\gg|G_{\min}|^{-1}\max\big\{\frac{s^{3/2}}{n_0},\frac{1}{\sqrt{n_0}},\frac{\sqrt{\log n}}{n_1}\big\}$, and hence
\[
\lambda_2\gg|G_{\min}|^{-1}\max_{i,j}|w_j-w_i|. \tag{A10}
\]
Then, denoting $\bar\rho(t)=\rho'(|t|)\,\mathrm{sgn}(t)$,
\[
\Gamma_2=\lambda_2\sum_{i=1}^n\sum_{j\ne i}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_i-\alpha_i^0)=\lambda_2\sum_{i<j}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_i-\alpha_i^0)+\lambda_2\sum_{i>j}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_i-\alpha_i^0).
\]
Swapping $i$ and $j$ in the second sum,
\[
\Gamma_2=\lambda_2\sum_{i<j}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_i-\alpha_i^0)-\lambda_2\sum_{i<j}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_j-\alpha_j^0)=\lambda_2\sum_{i<j}\bar\rho(\alpha_i^m-\alpha_j^m)\{(\alpha_i-\alpha_i^0)-(\alpha_j-\alpha_j^0)\}. \tag{A11}
\]
When $i,j\in G_k$, $\alpha_i^0=\alpha_j^0$, and $\alpha_i^m-\alpha_j^m$ has the same sign as $\alpha_i-\alpha_j$, and hence
\[
\Gamma_2=\lambda_2\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\rho'(|\alpha_i^m-\alpha_j^m|)\,|\alpha_i-\alpha_j|+\lambda_2\sum_{k<k'}\sum_{i\in G_k,\,j\in G_{k'}}\bar\rho(\alpha_i^m-\alpha_j^m)\{(\alpha_i-\alpha_i^0)-(\alpha_j-\alpha_j^0)\}.
\]
Then, for $k\ne k'$, $i\in G_k$, $j\in G_{k'}$, since
\[
|\alpha_i^*-\alpha_j^*|\le|\alpha_i^m-\alpha_j^m|+|\alpha_i^*-\alpha_i^m|+|\alpha_j^m-\alpha_j^*|,
\]
we have
\[
|\alpha_i^m-\alpha_j^m|\ge\min_{i\in G_k,\,j\in G_{k'}}|\alpha_i^*-\alpha_j^*|-2\|\alpha^m-\alpha^*\|_\infty\ge b_n-2\|\alpha-\alpha^*\|_\infty\ge b_n-2\phi_n\ge a\lambda_2,
\]
and thus $\bar\rho(\alpha_i^m-\alpha_j^m)=0$. Therefore,
\[
\Gamma_2=\lambda_2\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\rho'(|\alpha_i^m-\alpha_j^m|)\,|\alpha_i-\alpha_j|. \tag{A12}
\]
Moreover, by the same reasoning as in (A4), for $i,j\in G_k$ we have $\|\alpha^0-\hat\alpha^{or}\|_\infty\le\|\alpha-\hat\alpha^{or}\|_\infty$. Then
\[
|\alpha_i^m-\alpha_j^m|\le|\alpha_i^m-\alpha_i^0|+|\alpha_j^m-\alpha_j^0|\le 2\|\alpha^m-\alpha^0\|_\infty\le 2\|\alpha-\alpha^0\|_\infty\le 2\big(\|\alpha-\hat\alpha^{or}\|_\infty+\|\alpha^0-\hat\alpha^{or}\|_\infty\big)\le 4\|\alpha-\hat\alpha^{or}\|_\infty\le 4t_n. \tag{A13}
\]
Since $\rho(\cdot)$ is concave, $\rho'(|\alpha_i^m-\alpha_j^m|)\ge\rho'(4t_n)$. As a result,
\[
\Gamma_2\ge\lambda_2\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\rho'(4t_n)\,|\alpha_i-\alpha_j|. \tag{A14}
\]
On the other hand, we have
\[
\Gamma_3=-v^\top(\delta-\delta^0)=-\Big(\sum_{j\in\mathcal{A}}v_j(\delta_j-\delta_j^0)+\sum_{j\in\mathcal{A}^c}v_j(\delta_j-\delta_j^0)\Big)=-\sum_{j\in\mathcal{A}^c}v_j\delta_j. \tag{A15}
\]
Writing $\Lambda_3=X_1^\top\Lambda_1$ and $\Lambda_4=X_0^\top\Lambda_2$, on the event $E_2$,
\[
\max_j|v_j|\le\Big\|\frac{X_1}{n_1}\Big\|_\infty\|\Lambda_1\|+\Big\|\frac{X_0}{n_0}\Big\|_\infty\|\Lambda_2\|\le C_1\sqrt{s}\Big(\frac{1}{n_1}\|\Lambda_1\|+\frac{1}{n_0}\|\Lambda_2\|\Big)\le C_1\sqrt{s}\Big\{\frac{\phi_n}{n_1}\big(C_1n_1+C_2s^{3/2}+C_3s+C_4\sqrt{\log n}\big)+\frac{C_5\sqrt{\log n}}{n_1}+\frac{1}{n_0}\big(C_6\sqrt{n_0}+C_7s^{3/2}+C_8\phi_ns+C_9\phi_n+\xi\big)\Big\}. \tag{A16}
\]
Under condition (C4), we get
\[
\lambda_1\gg\max_j|v_j|. \tag{A17}
\]
Then,
\[
\Gamma_4=\lambda_1\sum_{j=1}^p\bar\rho(\delta_j^m)(\delta_j-\delta_j^0)=\lambda_1\Big(\sum_{j\in\mathcal{A}}\bar\rho(\delta_j^m)(\delta_j-\delta_j^0)+\sum_{j\in\mathcal{A}^c}\bar\rho(\delta_j^m)(\delta_j-\delta_j^0)\Big). \tag{A18}
\]
When $j\in\mathcal{A}^c$, $\delta_j^0=0$, and $\delta_j^m$ has the same sign as $\delta_j$. Hence
\[
\Gamma_4=\lambda_1\Big(\sum_{j\in\mathcal{A}^c}\rho'(|\delta_j^m|)\,|\delta_j|+\sum_{j\in\mathcal{A}}\bar\rho(\delta_j^m)(\delta_j-\delta_j^0)\Big). \tag{A19}
\]
For $j\in\mathcal{A}$, by (A9),
\[
|\delta_j^m|\ge\min_{j\in\mathcal{A}}|\delta_j^*|-\|\delta^*-\delta^m\|_\infty\ge(a+1)\lambda_1-\phi_n\ge a\lambda_1, \tag{A20}
\]
and thus $\bar\rho(\delta_j^m)=0$. Therefore,
\[
\Gamma_4=\lambda_1\sum_{j\in\mathcal{A}^c}\rho'(|\delta_j^m|)\,|\delta_j|. \tag{A21}
\]
Furthermore, by the same process as in (A13), for $j\in\mathcal{A}^c$,
\[
|\delta_j^m|\le\|\delta^m-\delta^0\|_\infty\le\|\delta-\delta^0\|_\infty\le\|\delta-\hat\delta^{or}\|_\infty+\|\delta^0-\hat\delta^{or}\|_\infty\le 2\|\delta-\hat\delta^{or}\|_\infty\le 2t_n. \tag{A22}
\]
Let $t_n=o(1)$. Then $\rho'(4t_n)\to 1$ and $\rho'(2t_n)\to 1$.
Therefore, by (A5), (A10) and (A14),
\[
\Gamma_1+\Gamma_2\ge\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\big[\lambda_2\rho'(4t_n)-|G_{\min}|^{-1}\max_{i,j}|w_j-w_i|\big]\,|\alpha_i-\alpha_j|\ge 0. \tag{A23}
\]
And by (A15), (A17) and (A21),
\[
\Gamma_3+\Gamma_4\ge\sum_{j\in\mathcal{A}^c}\big[\lambda_1\rho'(2t_n)-\max_j|v_j|\big]\,|\delta_j|\ge 0. \tag{A24}
\]
Therefore, for sufficiently large $n$,
\[
Q_n(\alpha,\delta,\gamma)-Q_n(\alpha^0,\delta^0,\gamma)=\Gamma_1+\Gamma_2+\Gamma_3+\Gamma_4\ge 0,
\]
so that the result in (ii) is proved.