Research Article

Variable selection and subgroup analysis for high-dimensional censored data

Received 24 Jul 2023, Accepted 29 Feb 2024, Published online: 13 Mar 2024

Abstract

This paper proposes a penalized method for high-dimensional variable selection and subgroup identification in the Tobit model. Based on Olsen's [(1978). Note on the uniqueness of the maximum likelihood estimator for the Tobit model. Econometrica: Journal of the Econometric Society, 46(5), 1211–1215. https://doi.org/10.2307/1911445] convex reparameterization of the Tobit negative log-likelihood, we develop an efficient algorithm for minimizing the objective function by combining the alternating direction method of multipliers (ADMM) and generalized coordinate descent (GCD). We also establish the oracle properties of our proposed estimator under mild regularity conditions. Furthermore, extensive simulations and an empirical data study demonstrate the performance of the proposed approach.

1. Introduction

Subgroup analysis has broad applicability in precision medicine, economics and sociology, where there is an increasing need to distinguish homogeneous subgroups of individuals, detect the subgroup structure and model the relationship between the response variable and predictors within each subgroup. A vast array of statistical methods for subgroup analysis has therefore been developed, including mixture models (Everitt, 2013) and regularization methods (Ma & Huang, 2017). Mixture model methods assume that the data come from a mixture of subgroups and require the specification of an underlying distribution. Shen and He (2015) proposed a structured logistic-normal mixture model to identify subgroups. However, such methods often require the number of subgroups to be specified in advance, which can be difficult to do in practice. In contrast, Ma and Huang (2017) developed a pairwise fusion approach using concave penalty functions, such as the smoothly clipped absolute deviation (SCAD, J. Fan & Li, 2001) penalty and the minimax concave penalty (MCP, Zhang, 2010), that automatically identifies subgroup structures and estimates subgroup-specific effects. Ma et al. (2019) considered a heterogeneous treatment effects model, and Wang et al. (2019) proposed a general framework of spatial subgroup analysis for spatial data with repeated measures.

In numerous regression problems, the dependent variable can only be observed within a restricted range. For instance, when studying the determinants of different household expenditures, some families may spend nothing on items such as medical insurance. Similarly, when studying individual or collective income, negative income generated by debt cannot enter the income calculation. These scenarios involve left-censored data, which are common in economics, and they motivate models specially tailored to such situations. Tobin (1958) developed the Tobit model to study the relationship between annual expenditure on durable goods and household income. Owing to the many left-censoring scenarios in economics and the social sciences, the Tobit model has remained popular, and it has been thoroughly studied and extended to deal with other types of censored data (Amemiya, 1984). Regarding subgroup analysis with censored data, Dagne (2016) proposed a method that simultaneously addresses left-censoring and unobserved heterogeneity in longitudinal data, and Yan et al. (2021) developed a censored linear regression model with heterogeneous treatment effects.

The advent of advanced data collection techniques has increased the prevalence of high-dimensional data in the aforementioned fields. For high-dimensional problems, J. Fan and Lv (2010) provided a comprehensive overview of variable selection approaches, including the methods of J. Fan and Li (2001). In the context of high-dimensional censored models, Müller and van de Geer (2016) and Zhou and Liu (2016) introduced the lasso and adaptive lasso penalties, respectively, into the least absolute deviation (LAD) estimator (Powell, 1984) for variable selection. Johnson (2009) and Soret et al. (2018) proposed lasso penalties for Buckley-James estimators for right- and left-censored data, respectively. Moreover, Bradic et al. (2011) provided a non-concave penalized approach for the Cox proportional hazards model with non-polynomial dimensionality. Alhamzawi (2016) and Alhamzawi (2020) developed Bayesian methods for penalized censored regression. Recently, Jacobson and Zou (2023) extended the Tobit model to high-dimensional regression. However, none of these methods address subgroup analysis in high-dimensional censored data.

This paper focuses on subgroup identification and variable selection for a high-dimensional Tobit model. To the best of our knowledge, subgroup analysis for high-dimensional Tobit models has not been discussed in the existing literature. We adopt a penalized approach that identifies the subgroup structure and selects covariates simultaneously. The subgroup structure is determined by penalizing pairwise differences between subject-specific effects, while significant covariates are chosen through a penalty on the coefficients. To ensure the sparsity and unbiasedness of the proposed estimators, we consider two commonly used concave penalties, SCAD (J. Fan & Li, 2001) and MCP (Zhang, 2010). Due to the non-convexity of the negative log-likelihood of the Tobit model (Tobin, 1958), optimization is challenging in high-dimensional settings. To address this issue, we employ a convex reparameterization of the negative log-likelihood, building on the idea of Olsen (1978). This reparameterization enables us to solve the problem using convex optimization techniques. Our computational algorithm combines the alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011) and the generalized coordinate descent (GCD) algorithm (Jacobson & Zou, 2023; Yang & Zou, 2013) with either of the two concave penalties, SCAD or MCP. Furthermore, we conduct a theoretical analysis of the proposed estimators and establish their oracle properties under mild conditions.

The remainder of this paper is organized as follows. Section 2 introduces the main problem and outlines the proposed method. In Section 3, we propose an algorithm for identifying the subgroup structure and performing variable selection. We state technical assumptions and establish the theoretical properties of the proposed approach in Section 4. Section 5 provides extensive simulation studies illustrating the empirical performance of the proposed method, while Section 6 presents its application to empirical data. A summary and directions for future research are given in Section 7, and all technical proofs are collected in the Appendix.

2. Model and method

2.1. Model setting

Suppose that $x_i=(x_{i1},\ldots,x_{ip})^\top$ is a $p$-dimensional vector of covariates for the $i$th subject, and that the response $y_i\ge c$ is observed only over a restricted range, where $c$ is a known constant. Without loss of generality, we assume that $c=0$ in the following. The Tobit model assumes that the observed response satisfies $y=\max(y^*,c)$, where $y^*$ is a latent variable. In the homogeneous case, the classical linear model takes the form
\[
y_i^*=\mu+x_i^\top\beta+\epsilon_i,\quad i=1,\ldots,n, \tag{1}
\]
where $\mu$ is the unknown intercept, $\beta=(\beta_1,\ldots,\beta_p)^\top$ is the vector of coefficients for the covariates $x_i$, and the $\epsilon_i$ are assumed to be independent and identically distributed as $N(0,\sigma^2)$.

If individuals come from different groups, each with its own intercept $\mu_i$, the homogeneity assumption in model (1) is invalid. To model the subject-specific effects, we consider the subject-specific linear model
\[
y_i^*=\mu_i+x_i^\top\beta+\epsilon_i,\quad i=1,\ldots,n. \tag{2}
\]
We assume $(y_1^*,\ldots,y_n^*)$ arise from $K$ different groups, with $K\ge 1$ unknown, and that subjects from the same group share the same intercept. In other words, $\mu_i=\pi_k$ for all $i\in G_k$, where $\pi_k$ is the common value of the intercept $\mu_i$ in subgroup $G_k$ and $\mathcal{G}=(G_1,\ldots,G_K)$ is a mutually exclusive partition of $\{1,\ldots,n\}$. In practice, the number of subgroups $K$ is unknown and smaller than the sample size $n$.

Define $d_i=I(y_i^*\ge 0)$, where $I(\cdot)$ is the indicator function. Then the observed data $(y_1,\ldots,y_n)$ satisfy the following Tobit model
\[
y_i=d_i\,y_i^*=\begin{cases}\mu_i+x_i^\top\beta+\epsilon_i, & \text{if } y_i^*\ge 0,\\ 0, & \text{if } y_i^*<0,\end{cases}\qquad i=1,\ldots,n, \tag{3}
\]
with subgroup structure
\[
\mu_i=\begin{cases}\pi_1, & \text{if } i\in G_1,\\ \pi_2, & \text{if } i\in G_2,\\ \ \ \vdots\\ \pi_K, & \text{if } i\in G_K.\end{cases} \tag{4}
\]
Let $\Phi(\cdot)$ denote the standard normal cumulative distribution function (CDF). Then
\[
P(y_i^*<0)=P(\epsilon_i<-\mu_i-x_i^\top\beta)=\Phi\Big(-\frac{\mu_i+x_i^\top\beta}{\sigma}\Big),
\]
and the Tobit likelihood is given by
\[
L_n(\mu,\beta,\sigma^2)=\prod_{i=1}^n\Big[\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{1}{2\sigma^2}(y_i-\mu_i-x_i^\top\beta)^2\Big\}\Big]^{d_i}\Big[\Phi\Big(-\frac{\mu_i+x_i^\top\beta}{\sigma}\Big)\Big]^{1-d_i},
\]
where $\mu=(\mu_1,\ldots,\mu_n)^\top$. Therefore, after omitting an inconsequential constant, the log-likelihood function of the Tobit model is
\[
\log L_n(\mu,\beta,\sigma^2)=\sum_{i=1}^n\Big(d_i\Big[-\tfrac{1}{2}(y_i-\mu_i-x_i^\top\beta)^2/\sigma^2-\log(\sigma)\Big]+(1-d_i)\log\Phi\Big(-\frac{\mu_i+x_i^\top\beta}{\sigma}\Big)\Big).
\]
It is apparent that $\log L_n(\mu,\beta,\sigma^2)$ is non-concave with respect to the parameters $(\mu,\beta,\sigma^2)$. Adopting the reparameterization suggested in the works of Olsen (1978) and Jacobson and Zou (2023), with $\delta=\beta/\sigma$, $\alpha_i=\mu_i/\sigma$ and $\gamma=1/\sigma$, we obtain a function that is concave with respect to the parameters $(\alpha,\delta,\gamma)$,
\[
\log L_n(\alpha,\delta,\gamma)=\sum_{i=1}^n\Big\{d_i\Big[\log(\gamma)-\tfrac{1}{2}(\gamma y_i-\alpha_i-x_i^\top\delta)^2\Big]+(1-d_i)\log\Phi(-\alpha_i-x_i^\top\delta)\Big\},
\]
where $\alpha=(\alpha_1,\ldots,\alpha_n)^\top$.
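To make the reparameterized likelihood concrete, the following sketch evaluates the negative average log-likelihood (the Tobit loss used in Section 3). The code is ours, not the authors'; the function name and interface are illustrative.

```python
import numpy as np
from scipy.stats import norm

def tobit_loss(alpha, delta, gamma, X, y):
    """Tobit loss ell_n under Olsen's reparameterization
    (delta = beta/sigma, alpha_i = mu_i/sigma, gamma = 1/sigma)."""
    d = (y > 0).astype(float)          # d_i = I(y_i > 0)
    lin = alpha + X @ delta            # alpha_i + x_i' delta
    # uncensored part: log(gamma) - (gamma*y_i - alpha_i - x_i'delta)^2 / 2
    uncensored = np.log(gamma) - 0.5 * (gamma * y - lin) ** 2
    # censored part: log Phi(-alpha_i - x_i'delta)
    censored = norm.logcdf(-lin)
    return -np.mean(d * uncensored + (1.0 - d) * censored)
```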

2.2. Method

There are usually redundant covariates in high-dimensional scenarios, and regularization is the most commonly used approach for identifying the sparsity of regression coefficient vectors (Bondell & Reich, 2008; J. Fan & Lv, 2010; Y. Fan & Tang, 2013). The subgroup structure (4) can be reformulated as a fusion sparsity structure: $\alpha_i-\alpha_j=(\mu_i-\mu_j)/\sigma=0$ for $i,j\in G_k$, $k=1,\ldots,K$. In order to estimate the parameters $\alpha$, $\delta$ and $\gamma$, and to select the proper covariates through the sparsity assumption on $\delta$, we propose a new method that combines penalized Tobit regression (Jacobson & Zou, 2023) with subgroup analysis by concave pairwise fusion penalization (Ma & Huang, 2017; Ma et al., 2019), which can be expressed as minimizing the following loss function
\[
Q(\alpha,\delta,\gamma;\lambda_1,\lambda_2)=-\frac{1}{n}\log L_n(\alpha,\delta,\gamma)+\sum_{i=1}^p P_{\lambda_1}(|\delta_i|)+\sum_{i<j}P_{\lambda_2}(|\alpha_i-\alpha_j|), \tag{5}
\]
where $P_{\lambda_1}(\cdot)$ and $P_{\lambda_2}(\cdot)$ are penalty functions, and $\lambda_1,\lambda_2\ge 0$ are tuning parameters that control the strengths of the regularization of $|\delta_i|$ and $|\alpha_i-\alpha_j|$, respectively. Note that the sparsity of $\delta$ is achieved through the first penalty term, while homogeneity detection is achieved by the second penalty term. Additionally, when $\lambda_1=0$ the problem reduces to subgroup analysis in censored regression; when $\lambda_2=0$ the problem reduces to penalized Tobit regression.

It is important to note that lasso estimators may fail to achieve consistent model selection unless a stringent 'irrepresentable condition' holds (Zhao & Yu, 2006; Zou, 2006). In particular, the $L_1$ penalty tends to over-shrink the non-zero differences $|\alpha_i-\alpha_j|$, which can result in an inflated number of subgroups. To address this limitation, we consider two common concave penalty functions for identifying the subgroup structure and selecting variables, namely the smoothly clipped absolute deviation (SCAD, J. Fan & Li, 2001) and the minimax concave penalty (MCP, Zhang, 2010). These penalties provide alternative approaches to the challenges associated with variable selection and subgroup detection.

The SCAD penalty is defined as
\[
P_\lambda(t)=\lambda\int_0^t\Big[I(x\le\lambda)+\frac{(a\lambda-x)_+}{(a-1)\lambda}I(x>\lambda)\Big]\,dx,\quad a>2,
\]
and the MCP is
\[
P_\lambda(t)=\int_0^t\Big(\lambda-\frac{x}{a}\Big)_+\,dx,\quad a>1,
\]
where $a$ is a parameter that controls the concavity of the penalty function.
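Both penalties have simple closed forms once the integrals above are evaluated; the sketch below (our code, with illustrative names and default values of $a$ taken from Section 5) implements them elementwise.

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD: lam*|t| near zero, a quadratic blend, then constant for |t| > a*lam."""
    t = np.abs(t)
    mid = (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1))
    return np.where(t <= lam, lam * t,
                    np.where(t <= a * lam, mid, 0.5 * lam ** 2 * (a + 1)))

def mcp_penalty(t, lam, a=3.0):
    """MCP: lam*|t| - t^2/(2a) for |t| <= a*lam, else constant a*lam^2/2."""
    t = np.abs(t)
    return np.where(t <= a * lam, lam * t - t ** 2 / (2 * a), 0.5 * a * lam ** 2)
```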

3. Computational algorithm

In this section, we propose an algorithm utilizing the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) in conjunction with generalized coordinate descent (GCD) (Jacobson & Zou, 2023; Yang & Zou, 2013) to address the minimization problem (5). Since the penalty function is not separable with respect to the $\alpha_i$, it is challenging to minimize the objective function (5) directly. We introduce a new set of parameters $\eta_{ij}=\alpha_i-\alpha_j$, so that the minimization problem can be written as the constrained optimization problem
\[
S(\alpha,\delta,\eta,\gamma;\lambda_1,\lambda_2)=\ell_n(\alpha,\delta,\gamma)+\sum_{i=1}^p P_{\lambda_1}(|\delta_i|)+\sum_{i<j}P_{\lambda_2}(|\eta_{ij}|),\quad\text{subject to}\ \alpha_i-\alpha_j-\eta_{ij}=0,
\]
where $\ell_n(\alpha,\delta,\gamma)=-\frac{1}{n}\log L_n(\alpha,\delta,\gamma)$ is the negative log-likelihood, which we call the Tobit loss for short, and $\eta=\{\eta_{ij},\,i<j\}^\top$. Applying the augmented Lagrangian method, the estimates of the parameters can be obtained by minimizing
\[
L(\alpha,\delta,\eta,\varphi,\gamma;\lambda_1,\lambda_2,\rho)=S(\alpha,\delta,\eta,\gamma;\lambda_1,\lambda_2)+\sum_{i<j}\varphi_{ij}(\alpha_i-\alpha_j-\eta_{ij})+\frac{\rho}{2}\sum_{i<j}(\alpha_i-\alpha_j-\eta_{ij})^2, \tag{6}
\]
where $\varphi=\{\varphi_{ij},\,i<j\}^\top$ are Lagrange multipliers and $\rho$ is the penalty parameter.

To obtain the minimum in (6), we propose the following iterative algorithm based on the ADMM. Let $t$ denote the iteration step. We update the estimates of $(\alpha,\delta,\gamma)$, $\eta$ and $\varphi$ iteratively at the $(t+1)$th iteration step as follows:
\[
(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)})=\arg\min_{\alpha,\delta,\gamma}\,L(\alpha,\delta,\gamma,\eta^{(t)},\varphi^{(t)}), \tag{7}
\]
\[
\eta^{(t+1)}=\arg\min_{\eta}\,L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta,\varphi^{(t)}), \tag{8}
\]
\[
\varphi^{(t+1)}=\varphi^{(t)}+\rho\,(\Delta\alpha^{(t+1)}-\eta^{(t+1)}), \tag{9}
\]
where $\Delta=\{(e_i-e_j)^\top,\,i<j\}$.
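The outer loop therefore has the familiar three-block ADMM form. A minimal skeleton is sketched below; `update_primal` stands for the inner GCD solver built from the updates (11)-(14) and `update_eta` for the thresholding step (16)-(17), both sketched later in this section. The names and the $\rho$/tolerance defaults are ours, not the authors'.

```python
import numpy as np
from itertools import combinations

def admm_gcd(X, y, lam1, lam2, update_primal, update_eta,
             rho=2.0, max_iter=200, eps_b=1e-5):
    """Skeleton of the ADMM iterations (7)-(9)."""
    n, p = X.shape
    pairs = list(combinations(range(n), 2))
    Delta = np.zeros((len(pairs), n))   # maps alpha to pairwise differences
    for r, (i, j) in enumerate(pairs):
        Delta[r, i], Delta[r, j] = 1.0, -1.0
    alpha, delta, gamma = np.zeros(n), np.zeros(p), 1.0
    eta = np.zeros(len(pairs))
    phi = np.zeros(len(pairs))          # Lagrange multipliers
    for _ in range(max_iter):
        # (7): inner GCD loop over (alpha, delta, gamma)
        alpha, delta, gamma = update_primal(X, y, alpha, delta, gamma,
                                            eta, phi, Delta, lam1, rho)
        # (8): elementwise thresholding of zeta = Delta @ alpha + phi / rho
        eta = update_eta(Delta @ alpha + phi / rho, lam2, rho)
        # (9): dual update; stop when the primal residual is small
        r_primal = Delta @ alpha - eta
        phi = phi + rho * r_primal
        if np.sum(r_primal ** 2) < eps_b:
            break
    return alpha, delta, gamma, eta
```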

It is worth noting that, given $(\eta^{(t)},\varphi^{(t)})$, the objective function in the first minimization problem (7) can be written as
\[
L(\alpha,\delta,\gamma,\eta,\varphi)=\frac{1}{n}\sum_{i=1}^n\ell_i(\alpha_i,\delta,\gamma)+\sum_{i=1}^p P_{\lambda_1}(|\delta_i|)+\sum_{i<j}\varphi_{ij}(\alpha_i-\alpha_j-\eta_{ij})+\frac{\rho}{2}\sum_{i<j}(\alpha_i-\alpha_j-\eta_{ij})^2+C, \tag{10}
\]
where $\ell_i(\alpha_i,\delta,\gamma)=-d_i\big[\log(\gamma)-\frac{1}{2}(\gamma y_i-\alpha_i-x_i^\top\delta)^2\big]-(1-d_i)\log\Phi(-x_i^\top\delta-\alpha_i)$, and $C$ is a constant that does not depend on $(\alpha,\delta,\gamma)$.

Due to the complexity of the function (10) with respect to $(\alpha,\delta,\gamma)$, we apply the generalized coordinate descent (GCD) method, which iteratively updates each variable while holding the others fixed.

Let $\tilde\alpha$, $\tilde\delta$ and $\tilde\gamma$ denote the current values of $\alpha$, $\delta$ and $\gamma$, respectively. For simplicity of notation, let $v_{(j)}$ denote the vector $v$ with its $j$th element removed in the subsequent context. To derive the update of $\alpha_i$, we begin by expressing the Tobit loss $\ell_n$ as a function of $\alpha_i$,
\[
\ell_n(\alpha_i\mid\tilde\alpha_{(i)},\tilde\delta,\tilde\gamma)=\frac{1}{n}\Big\{\frac{1}{2}d_i(\tilde\gamma y_i-\alpha_i-x_i^\top\tilde\delta)^2-(1-d_i)\log\Phi(-x_i^\top\tilde\delta-\alpha_i)\Big\}.
\]
We observe that, after dropping negligible constants, the Tobit loss associated with $\alpha_i$ depends solely on the data from subject $i$. In line with Theorem 1 of Jacobson and Zou (2023), the quadratic majorization of $\ell_n(\alpha_i\mid\tilde\alpha_{(i)},\tilde\delta,\tilde\gamma)$ takes the form
\[
Q_\alpha(\alpha_i\mid\tilde\alpha_{(i)},\tilde\delta,\tilde\gamma)=\ell_n(\tilde\alpha_i\mid\tilde\delta,\tilde\gamma)+\dot\ell_n(\tilde\alpha_i\mid\tilde\delta,\tilde\gamma)(\alpha_i-\tilde\alpha_i)+\frac{1}{2}(\alpha_i-\tilde\alpha_i)^2,
\]
where $\dot\ell_n(\tilde\alpha_i\mid\tilde\delta,\tilde\gamma)$ represents the derivative of the loss with respect to $\alpha_i$ evaluated at $\tilde\alpha_i$. Following the MM principle, the update of $\alpha_i$ is obtained by minimizing $Q_\alpha(\alpha_i\mid\tilde\alpha_{(i)},\tilde\delta,\tilde\gamma)+\frac{\rho}{2}\sum_{k<j}\{(e_k-e_j)^\top\alpha-\eta_{kj}+\rho^{-1}\varphi_{kj}\}^2$, where $e_j$ is the $n\times 1$ vector whose $j$th element is 1 and whose remaining elements are 0. Therefore, for fixed $\delta^{(t,k)}$ and $\gamma^{(t,k)}$ at the $k$th inner step, the update of $\alpha$ is
\[
\alpha^{(t,k+1)}=(I+\rho\,\Delta^\top\Delta)^{-1}\big(\rho\,\Delta^\top\eta^{(t)}-\Delta^\top\varphi^{(t)}+a^{(t,k)}\big), \tag{11}
\]
where $I$ is the identity matrix and $a^{(t,k)}=\alpha^{(t,k)}-\dot\ell_n(\alpha^{(t,k)})$ with $\dot\ell_n(\alpha^{(t,k)})=(\dot\ell_n(\alpha_1^{(t,k)}),\ldots,\dot\ell_n(\alpha_n^{(t,k)}))^\top$.
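The $\alpha$-update (11) is a single ridge-type linear solve once the gradient of the Tobit loss is available. A sketch follows; the gradient uses the inverse Mills ratio $g(s)=\phi(s)/\Phi(s)$, and the exact $\rho$/$n$ scaling follows our reconstruction of (11), so treat it as illustrative rather than the authors' exact code.

```python
import numpy as np
from scipy.stats import norm

def update_alpha(alpha, delta, gamma, X, y, eta, phi, Delta, rho):
    """One MM step (11): solve (I + rho*Delta'Delta) alpha_new
    = rho*Delta'eta - Delta'phi + (alpha - grad)."""
    n = len(alpha)
    d = (y > 0).astype(float)
    lin = alpha + X @ delta
    # d_i*(alpha_i + x_i'delta - gamma*y_i) on uncensored rows,
    # (1-d_i)*g(-alpha_i - x_i'delta) on censored rows (g = phi/Phi)
    mills = np.exp(norm.logpdf(-lin) - norm.logcdf(-lin))
    grad = (d * (lin - gamma * y) + (1.0 - d) * mills) / n
    A = np.eye(n) + rho * Delta.T @ Delta
    b = rho * Delta.T @ eta - Delta.T @ phi + (alpha - grad)
    return np.linalg.solve(A, b)
```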

Now we consider coordinate-wise updates of $\delta_j$, $j=1,\ldots,p$. Let $M_j=\frac{1}{n}\sum_{i=1}^n x_{ij}^2$. Similarly to $\alpha_i$, the quadratic majorization of $\ell_n(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)$ with respect to $\delta_j$ has the form
\[
Q_\delta(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)=\ell_n(\tilde\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)+\dot\ell_n(\tilde\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)(\delta_j-\tilde\delta_j)+\frac{M_j}{2}(\delta_j-\tilde\delta_j)^2,
\]
where $\dot\ell_n(\tilde\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)$ is the derivative of $\ell_n(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)$ with respect to $\delta_j$. Here, the Tobit loss with respect to $\delta_j$ can be expressed as
\[
\ell_n(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)=\frac{1}{n}\sum_{i=1}^n\Big[\frac{1}{2}d_i(\tilde\gamma y_i-\tilde\alpha_i-x_{i,(j)}^\top\tilde\delta_{(j)}-x_{ij}\delta_j)^2-(1-d_i)\log\Phi(-x_{i,(j)}^\top\tilde\delta_{(j)}-\tilde\alpha_i-x_{ij}\delta_j)\Big].
\]
We can update $\delta_j$ by minimizing $Q_\delta(\delta_j\mid\tilde\alpha,\tilde\delta_{(j)},\tilde\gamma)+P_{\lambda_1}(|\delta_j|)$ through the MM principle. For $j=1,\ldots,p$, let $\upsilon_j^{(t,k)}=\delta_j^{(t,k)}-\frac{1}{M_j}\dot\ell_n(\delta_j^{(t,k)}\mid\alpha^{(t,k+1)},\delta^{(t,k)},\gamma^{(t,k)})$. Hence, for the SCAD penalty with $a_1>\max_j\{M_j^{-1}\}+1$, the update of $\delta_j$ at the $(k+1)$th step is
\[
\delta_j^{(t,k+1)}=\begin{cases}\mathrm{ST}(\upsilon_j^{(t,k)},\lambda_1/M_j), & \text{if } |\upsilon_j^{(t,k)}|\le\lambda_1+\lambda_1/M_j,\\[4pt] \dfrac{\mathrm{ST}(\upsilon_j^{(t,k)},a_1\lambda_1/((a_1-1)M_j))}{1-((a_1-1)M_j)^{-1}}, & \text{if } \lambda_1+\lambda_1/M_j<|\upsilon_j^{(t,k)}|\le a_1\lambda_1,\\[4pt] \upsilon_j^{(t,k)}, & \text{if } |\upsilon_j^{(t,k)}|>a_1\lambda_1,\end{cases} \tag{12}
\]
where $\mathrm{ST}(t,\lambda)=\mathrm{sign}(t)(|t|-\lambda)_+$ is the soft-thresholding rule and $(x)_+=\max\{x,0\}$. When $a_1>\max_j\{M_j^{-1}\}$ for the MCP penalty, the updated value is
\[
\delta_j^{(t,k+1)}=\begin{cases}\dfrac{\mathrm{ST}(\upsilon_j^{(t,k)},\lambda_1/M_j)}{1-(a_1M_j)^{-1}}, & \text{if } |\upsilon_j^{(t,k)}|\le a_1\lambda_1,\\[4pt] \upsilon_j^{(t,k)}, & \text{if } |\upsilon_j^{(t,k)}|>a_1\lambda_1.\end{cases} \tag{13}
\]
Lastly, for given $\alpha^{(t,k+1)}$ and $\delta^{(t,k+1)}$, we minimize $\ell_n(\gamma\mid\alpha^{(t,k+1)},\delta^{(t,k+1)})$ to update $\gamma$:
\[
\gamma^{(t,k+1)}=\frac{\sum_{i=1}^n d_iy_i(\alpha_i^{(t,k+1)}+x_i^\top\delta^{(t,k+1)})+\sqrt{\Big(\sum_{i=1}^n d_iy_i(\alpha_i^{(t,k+1)}+x_i^\top\delta^{(t,k+1)})\Big)^2+4\Big(\sum_{i=1}^n d_iy_i^2\Big)\sum_{i=1}^n d_i}}{2\sum_{i=1}^n d_iy_i^2}. \tag{14}
\]
Once convergence is achieved, we denote the final GCD iterate by $(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)})$.
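The coordinate updates (12)-(14) are simple scalar operations; a sketch follows (our code; `v` is the gradient-adjusted point $\upsilon_j$ and `M` the curvature $M_j$). Note that (14) is the positive root of the quadratic $\gamma^2\sum d_iy_i^2-\gamma\sum d_iy_i(\alpha_i+x_i^\top\delta)-\sum d_i=0$.

```python
import numpy as np

def soft_threshold(t, lam):
    """ST(t, lam) = sign(t) * (|t| - lam)_+."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def delta_update_scad(v, lam, M, a):
    """Coordinate update (12) for delta_j under SCAD."""
    if abs(v) <= lam + lam / M:
        return soft_threshold(v, lam / M)
    if abs(v) <= a * lam:
        return soft_threshold(v, a * lam / ((a - 1) * M)) / (1 - 1 / ((a - 1) * M))
    return v

def delta_update_mcp(v, lam, M, a):
    """Coordinate update (13) for delta_j under MCP."""
    if abs(v) <= a * lam:
        return soft_threshold(v, lam / M) / (1 - 1 / (a * M))
    return v

def gamma_update(y, lin):
    """Closed-form update (14); `lin` holds alpha_i + x_i' delta."""
    d = (y > 0).astype(float)
    s2 = np.sum(d * y ** 2)
    s1 = np.sum(d * y * lin)
    return (s1 + np.sqrt(s1 ** 2 + 4.0 * s2 * np.sum(d))) / (2.0 * s2)
```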

As for the second minimization problem (8), after eliminating constants that have no effect on the minimization, the problem simplifies to
\[
\eta_{ij}=\arg\min_{\eta_{ij}}\ \frac{\rho}{2}(\eta_{ij}-\zeta_{ij})^2+P_{\lambda_2}(|\eta_{ij}|), \tag{15}
\]
where $\zeta_{ij}=\alpha_i-\alpha_j+\rho^{-1}\varphi_{ij}$. It is worth noting that (15) is convex with respect to each $\eta_{ij}$ when $a_2>\rho^{-1}$ for the MCP or $a_2>\rho^{-1}+1$ for the SCAD. Hence, the closed-form solution for the MCP penalty at the $(t+1)$th iteration is
\[
\eta_{ij}^{(t+1)}=\begin{cases}\dfrac{\mathrm{ST}(\zeta_{ij}^{(t+1)},\lambda_2/\rho)}{1-(a_2\rho)^{-1}}, & \text{if } |\zeta_{ij}^{(t+1)}|\le a_2\lambda_2,\\[4pt] \zeta_{ij}^{(t+1)}, & \text{if } |\zeta_{ij}^{(t+1)}|>a_2\lambda_2,\end{cases} \tag{16}
\]
and for the SCAD penalty it is
\[
\eta_{ij}^{(t+1)}=\begin{cases}\mathrm{ST}(\zeta_{ij}^{(t+1)},\lambda_2/\rho), & \text{if } |\zeta_{ij}^{(t+1)}|\le\lambda_2+\lambda_2/\rho,\\[4pt] \dfrac{\mathrm{ST}(\zeta_{ij}^{(t+1)},a_2\lambda_2/((a_2-1)\rho))}{1-((a_2-1)\rho)^{-1}}, & \text{if } \lambda_2+\lambda_2/\rho<|\zeta_{ij}^{(t+1)}|\le a_2\lambda_2,\\[4pt] \zeta_{ij}^{(t+1)}, & \text{if } |\zeta_{ij}^{(t+1)}|>a_2\lambda_2.\end{cases} \tag{17}
\]
We provide the complete algorithm in Algorithm 1, referred to as the ADMM-GCD algorithm for convenience.
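The $\eta$-update is separable across pairs, so a vectorized sketch (ours, with illustrative names) is:

```python
import numpy as np

def update_eta(zeta, lam2, rho, a2=3.0, penalty="mcp"):
    """Closed-form updates (16)/(17) applied elementwise to
    zeta_ij = alpha_i - alpha_j + phi_ij / rho."""
    st = lambda t, l: np.sign(t) * np.maximum(np.abs(t) - l, 0.0)
    if penalty == "mcp":                       # convex when a2 > 1/rho
        shrunk = st(zeta, lam2 / rho) / (1 - 1 / (a2 * rho))
        return np.where(np.abs(zeta) <= a2 * lam2, shrunk, zeta)
    # SCAD branch: convex when a2 > 1/rho + 1
    inner = np.where(np.abs(zeta) <= lam2 + lam2 / rho,
                     st(zeta, lam2 / rho),
                     st(zeta, a2 * lam2 / ((a2 - 1) * rho))
                     / (1 - 1 / ((a2 - 1) * rho)))
    return np.where(np.abs(zeta) > a2 * lam2, zeta, inner)
```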

Remark

The algorithm involves two nested iterations. For the inner GCD iteration, the convergence criterion is
\[
\big\|\big((\alpha^{(t,k+1)}-\alpha^{(t,k)})^\top,(\delta^{(t,k+1)}-\delta^{(t,k)})^\top,\gamma^{(t,k+1)}-\gamma^{(t,k)}\big)\big\|_2^2\le\epsilon_a.
\]
For the ADMM, we employ the criterion
\[
\|r^{(t+1)}\|_2^2=\|\Delta\alpha^{(t+1)}-\eta^{(t+1)}\|_2^2<\epsilon_b.
\]
In our simulation studies we set $\epsilon_a=10^{-4}$ and $\epsilon_b=10^{-5}$, respectively.

The following Proposition 3.1 presents the convergence properties of the proposed algorithm.

Proposition 3.1

Given $a_1>\max_j\{1/M_j\}$ and $a_2>1/\rho$ for the MCP, or $a_1>\max_j\{1/M_j\}+1$ and $a_2>1/\rho+1$ for the SCAD, any accumulation point $(\delta^{(t+1)},\alpha^{(t+1)},\gamma^{(t+1)},\eta^{(t+1)})$ generated by Algorithm 1 is a coordinate-wise minimum of $L(\delta,\alpha,\gamma,\eta,\varphi^{(t)})$. In addition, the primal residual $r^{(t+1)}=\Delta\alpha^{(t+1)}-\eta^{(t+1)}$ and the dual residual $s^{(t+1)}=\rho\Delta^\top(\eta^{(t+1)}-\eta^{(t)})$ of the ADMM satisfy $\lim_{t\to\infty}\|r^{(t)}\|_2^2=0$ and $\lim_{t\to\infty}\|s^{(t)}\|_2^2=0$ for both MCP and SCAD penalties.

4. Theoretical properties

In this section, we study the theoretical properties of the proposed estimators. We first introduce $\mathcal{M}_{\mathcal{G}}$, a subspace of $\mathbb{R}^n$, defined as
\[
\mathcal{M}_{\mathcal{G}}=\{\alpha\in\mathbb{R}^n:\alpha_i=\alpha_j,\ \text{for any}\ i,j\in G_k,\ 1\le k\le K\}.
\]
Each $\alpha\in\mathcal{M}_{\mathcal{G}}$ can also be written as $\alpha=Z\tau$, where $Z=\{z_{ik}\}$ is the $n\times K$ indicator matrix defined by $z_{ik}=1$ for $i\in G_k$ and $z_{ik}=0$ otherwise, and $\tau$ is a $K\times 1$ vector of parameters. Letting $|G_k|$ denote the number of elements in $G_k$, we have $T=Z^\top Z=\mathrm{diag}(|G_1|,\ldots,|G_K|)$ by direct calculation. Define $|G_{\min}|=\min_{1\le k\le K}|G_k|$ and $|G_{\max}|=\max_{1\le k\le K}|G_k|$.

First, we denote the true values of the parameters $\delta$, $\alpha$ and $\gamma$ by $\delta^*$, $\alpha^*$ and $\gamma^*$, respectively. Let $\tau^*=(\tau_k^*,k=1,\ldots,K)^\top$, where $\tau_k^*$ is the underlying common intercept for group $G_k$. We let $\mathcal{A}=\{j:\delta_j^*\ne 0\}\subseteq\{1,\ldots,p\}$ denote the true support set of $\delta^*$ and define $\mathcal{A}^*=\mathcal{A}\cup\{p+1,\ldots,p+K+1\}$, $\mathcal{A}_1=\mathcal{A}\cup\{p+K+1\}$ and $s=|\mathcal{A}|$. Under the sparsity assumption, $s\ll p$.

To separate the true variables, we write $\delta^*=(\delta_1^{*\top},\delta_0^{*\top})^\top$, where $\delta_1^*$ collects the nonzero components $\{\delta_j^*\ne 0\}$ and $\delta_0^*$ the zero components, and define the subspace of $\mathbb{R}^p$
\[
\mathcal{M}_{\mathcal{B}}=\{\delta\in\mathbb{R}^p:\delta_i=0\ \text{for}\ i\in\mathcal{A}^c\}.
\]
Additionally, we write $\Theta:=(\alpha^\top,\delta^\top,\gamma)^\top$ and $\Xi:=(\tau^\top,\delta_1^\top,\gamma)^\top$ for notational convenience.

Since censored and uncensored observations enter the Tobit likelihood differently, we use a more convenient notation to distinguish them clearly. Defining $n_1$ as the number of observations with $y_i>0$ and $n_0=n-n_1$, we can re-block the observations as
\[
X=\begin{bmatrix}X_0\\ X_1\end{bmatrix}\qquad\text{and}\qquad y=\begin{bmatrix}y_0\\ y_1\end{bmatrix},
\]
where $X_0$ is the $n_0\times(p+1)$ matrix of predictors corresponding to the observations with $y_i\le 0$, while $X_1$ is the $n_1\times(p+1)$ matrix of predictors corresponding to the observations with $y_i>0$. Similarly, $y_0$ and $y_1$ denote the responses not greater than and greater than 0, respectively. Then, by the definitions above,
\[
\alpha=\begin{bmatrix}\alpha_0\\ \alpha_1\end{bmatrix}=\begin{bmatrix}Z_0\tau\\ Z_1\tau\end{bmatrix}.
\]

4.1. Technical conditions

In this subsection, we will introduce several mild conditions and discuss their relevance in detail.

(C1):

Assume $\|X_j\|_2^2=n$ for $1\le j\le p$, $\|X\|_\infty\le C_1\sqrt{s}$, $\|\delta^*\|\le C_2\sqrt{s}$, $\|\alpha^*\|\le C_3\sqrt{n}$, and $|\gamma^*|\le C_0$ for some constants $0<C_0,C_1,C_2,C_3<\infty$.

(C2):

The penalty function $P_\lambda(t)$ is symmetric in $t$, and is nondecreasing and concave for $t\in[0,\infty)$. The function $\rho(t)=P_\lambda(t)/\lambda$ is constant for $t\ge a\lambda$, with $\rho(0)=0$. Moreover, $\rho'(t)$ exists and is continuous except at a finite number of points, and $\rho'(0+)=1$.

(C3):

The errors $\epsilon_i$, $i=1,\ldots,n$, are i.i.d. normally distributed with mean zero and variance $\sigma^2$, such that $P(|\epsilon_i|>t)\le 2c\exp\{-\frac{1}{2c^2}t^2\}/t$, where $c$ is a constant.

Conditions (C2) and (C3) are widely adopted in high-dimensional settings. The penalties mentioned in this article, including MCP and SCAD, satisfy (C2).

When the true group memberships $G_1,\ldots,G_K$ and the true support set $\mathcal{A}$ are known, the oracle estimators of $\alpha$, $\delta$ and $\gamma$ are defined as
\[
\hat\Theta^{or}=(\hat\alpha^{or\top},\hat\delta^{or\top},\hat\gamma^{or})^\top=\arg\max_{\alpha\in\mathcal{M}_{\mathcal{G}},\,\delta\in\mathcal{M}_{\mathcal{B}},\,\gamma\in\mathbb{R}}\log L_n(\alpha,\delta,\gamma).
\]
After removing the redundant variables, we write $\tilde X=(Z,X_{\mathcal{A}})$ and $\tilde\delta=(\tau^\top,\delta_1^\top)^\top$ for ease of calculation. Then the oracle estimators of $\tau$, $\delta_1$ and $\gamma$ are given by
\[
\hat\Xi^{or}=(\hat\tau^{or\top},\hat\delta_1^{or\top},\hat\gamma^{or})^\top=\arg\max_{\tilde\delta\in\mathbb{R}^{K+s},\,\gamma\in\mathbb{R}}\sum_{i=1}^n d_i\Big[\log(\gamma)-\frac{1}{2}(\gamma y_i-\tilde x_i^\top\tilde\delta)^2\Big]+(1-d_i)\log\Phi(-\tilde x_i^\top\tilde\delta). \tag{18}
\]
Problem (18) reduces to the traditional Tobit model. To obtain the estimator by maximum likelihood, it is necessary to calculate the matrix of second partial derivatives. We define $g(s)=\phi(s)/\Phi(s)$ and $h(s)=g(s)(s+g(s))$. The matrix can be expressed as
\[
\nabla^2\log L_n(\Xi)=-\begin{bmatrix}\tilde X & -y\end{bmatrix}^\top\begin{bmatrix}D(\tilde\delta_1)&0\\0&I\end{bmatrix}\begin{bmatrix}\tilde X & -y\end{bmatrix}-\begin{bmatrix}0&0\\0&n_1\gamma^{-2}\end{bmatrix}=-\begin{bmatrix}\tilde X_0^\top D(\tilde\delta_1)\tilde X_0+\tilde X_1^\top\tilde X_1 & -\tilde X_0^\top D(\tilde\delta_1)y_0-\tilde X_1^\top y_1\\ -y_0^\top D(\tilde\delta_1)\tilde X_0-y_1^\top\tilde X_1 & y_0^\top D(\tilde\delta_1)y_0+y_1^\top y_1+n_1\gamma^{-2}\end{bmatrix}, \tag{19}
\]
where $D(\tilde\delta_1)$ is the $n_0\times n_0$ diagonal matrix with $[D(\tilde\delta_1)]_{ii}=h_i=h(-\tilde x_i^\top\tilde\delta)$.

Olsen (1978) found that the matrix in (19) is negative semidefinite. Theorem 1 in Amemiya (1973) established the asymptotic result that this matrix is nonsingular with probability one, which ensures the invertibility of the above Hessian matrix. Moreover, we introduce an additional condition to support Theorem 4.2.

(C4):

The tuning parameters satisfy $\lambda_1\gg C_1\sqrt{s}\,\max\big\{\frac{s^{3/2}}{n_0},\frac{1}{\sqrt{n_0}},\frac{\sqrt{\log n}}{n_1}\big\}$ and $\lambda_2\gg|G_{\min}|^{-1}\max\big\{\frac{s^{3/2}}{n_0},\frac{1}{\sqrt{n_0}},\frac{\sqrt{\log n}}{n_1}\big\}$.

4.2. Theoretical results

Theorem 4.1

Suppose that $y_i^*=\mu_i+x_i^\top\beta+\epsilon_i$, where $\epsilon_i\overset{\mathrm{iid}}{\sim}N(0,\sigma^2)$, and define $y_i=y_i^*I(y_i^*>0)$ for $i=1,\ldots,n$. Let $\hat\Xi^{or}$ denote the oracle solution of the Tobit model when the true group memberships and the true support set of $\beta$ are known. Suppose conditions (C1)-(C3) hold. Then $\hat\Xi^{or}$ corresponds to the unique maximum of the likelihood function and is a consistent estimator of the true parameter value $\Xi^*$ such that
\[
\sqrt{n}\,(\hat\Xi^{or}-\Xi^*)\rightarrow N(0,\Sigma), \tag{20}
\]
where $\Sigma=\lim_{n\to\infty}\big[-\frac{1}{n}\nabla^2\log L_n(\Xi)\big|_{\Xi=\Xi^*}\big]^{-1}$. Moreover, denote by $\lambda_{\max}$ the maximum eigenvalue of the matrix $\Sigma$. If $\lambda_{\max}=O(1)$, then with probability at least $1-p_1=1-C\sqrt{\frac{\log n}{n}}\exp\{-\frac{n}{2\log n}\}$,
\[
\|\hat\Theta^{or}-\Theta^*\|_\infty\le\phi_n, \tag{21}
\]
where $\phi_n=1/\sqrt{\log n}$ and $C$ is a constant.

For $K\ge 2$, let
\[
b_n=\min_{i\in G_k,\,j\in G_{k'},\,k\ne k'}|\alpha_i^*-\alpha_j^*|=\min_{k\ne k'}|\tau_k^*-\tau_{k'}^*|
\]
be the minimal difference of the common intercept values between any two groups.

Theorem 4.2

Suppose the conditions of Theorem 4.1 hold and $K\ge 2$. Assume the minimum signal strength of $\delta^*$ satisfies $|\delta_{\mathcal{A}}^*|_{\min}>(a+1)\lambda_1$ and $b_n>a\lambda_2$, where $a$ is the constant given in (C2). If $\lambda_1,\lambda_2\gg\phi_n$, then there exists a local minimizer $\hat\Theta(\lambda_1,\lambda_2)=(\hat\alpha^\top,\hat\delta^\top,\hat\gamma)^\top$ of the objective function $Q(\alpha,\delta,\gamma)$ given in (5) satisfying
\[
P\big((\hat\alpha^\top,\hat\delta^\top,\hat\gamma)=((\hat\alpha^{or})^\top,(\hat\delta^{or})^\top,\hat\gamma^{or})\big)\rightarrow 1,
\]
that is, $P(\hat\Theta(\lambda_1,\lambda_2)=\hat\Theta^{or})\rightarrow 1$.

The proofs of these theorems are given in the Appendix.

5. Simulation studies

In this section, we conduct extensive simulation studies to investigate the numerical performance of the proposed approaches. We generate data from the censored heterogeneous linear model
\[
y_i^*=\mu_i+x_i^\top\beta+\epsilon_i,\quad i=1,\ldots,n,
\]
where the covariates $x_{ij}$, $j=1,\ldots,p$, are generated independently from the normal distribution $N(1,1)$, and the error terms $\epsilon_i$ are independent $N(0,0.5^2)$. We observe $y_i=\max\{0,y_i^*\}$, with censoring rate $q$. The true coefficient vector is $\beta=(5,1,2,0.5,0.1,0,\ldots,0)^\top$, a $p$-dimensional vector with $p-5$ zero elements. To investigate the effect of the magnitude of the difference between subgroup-specific effects, we consider two cases for the subgroup structure (a data-generation sketch follows the two cases below):

Case 1:

$K=2$, $P(\mu_i=-2)=P(\mu_i=2)=\frac{1}{2}$;

Case 2:

$K=3$, $P(\mu_i=-2)=P(\mu_i=2)=P(\mu_i=0.5)=\frac{1}{3}$.
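A minimal data-generation sketch for this design follows. The negative sign on one of the $\pm 2$ intercepts is our reading of the garbled case descriptions, and the paper does not spell out how the censoring rate $q$ is calibrated, so the sketch simply censors at zero.

```python
import numpy as np

def simulate_case(n=100, p=50, K=2, sigma=0.5, seed=0):
    """One dataset from the censored heterogeneous linear model of Section 5."""
    rng = np.random.default_rng(seed)
    X = rng.normal(1.0, 1.0, size=(n, p))             # x_ij ~ N(1, 1)
    beta = np.zeros(p)
    beta[:5] = [5.0, 1.0, 2.0, 0.5, 0.1]              # p - 5 zero coefficients
    levels = [-2.0, 2.0] if K == 2 else [-2.0, 2.0, 0.5]
    mu = rng.choice(levels, size=n)                   # subgroup intercepts
    y_star = mu + X @ beta + rng.normal(0.0, sigma, size=n)
    y = np.maximum(y_star, 0.0)                       # left censoring at zero
    return X, y, mu, beta
```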

We evaluate the estimators obtained by the proposed method with three different penalties (SCAD, MCP and lasso), and compare them to the penalized Tobit approach (Tobit SCAD, Jacobson & Zou, 2023), which assumes a homogeneous intercept effect $\mu$. Additionally, we present the results of the oracle estimators (Tobit Oracle). For each setting, we generate 100 datasets with sample size $n=100$, for every combination of $q\in\{20\%,40\%\}$ and $p=10,50,200$. Specifically, we set $\rho=2$, and set $a_1=a_2=3.7$ for the SCAD penalty and $a_1=a_2=3$ for the MCP penalty. We select the optimal tuning parameters by minimizing the modified BIC (Wang et al., 2007):
\[
\mathrm{BIC}(\lambda_1,\lambda_2)=-2\log L_n(\hat\alpha,\hat\delta,\hat\gamma)+C_n\log(n)(\hat K+\hat s+1), \tag{22}
\]
where $\hat K$ is the estimated number of subgroups and $\hat s$ the number of selected covariates. Wang et al. (2009) used $C_n=\log(\log(d))$ to accommodate a number of predictors that diverges with the sample size in high-dimensional scenarios. In this article, we let $C_n=c\log(\log(d))$, where $d=n+p+1$ and $c$ is a positive constant that we set to 1.5.
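Computing (22) is a one-liner once the fitted log-likelihood and model sizes are available; a sketch (our code) is:

```python
import numpy as np

def modified_bic(loglik, n, p, K_hat, s_hat, c=1.5):
    """Modified BIC (22) with C_n = c * log(log(d)) and d = n + p + 1."""
    C_n = c * np.log(np.log(n + p + 1))
    return -2.0 * loglik + C_n * np.log(n) * (K_hat + s_hat + 1)
```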

We evaluate the methods in three respects: accuracy of the coefficient estimates, performance of the variable selection, and identification of the subgroup structure. To measure the estimation accuracy of $\hat\mu$, $\hat\beta$ and $\hat\sigma$, we use the square roots of their mean squared errors. Letting $\mu^*$, $\beta^*$ and $\sigma^*$ denote the true parameters, the error measures are defined by $\mathrm{err}(\hat\mu)=\|\hat\mu-\mu^*\|_2/\sqrt{n}$, $\mathrm{err}(\hat\beta)=\|\hat\beta-\beta^*\|_2$ and $\mathrm{err}(\hat\sigma)=|\hat\sigma-\sigma^*|$, respectively.

In order to assess the variable selection of these methods, we report the number of true variables not included (NT) and the number of error variables included (NE). Moreover, to evaluate the performance of the subgroup analysis, we present the estimated number of groups ($\hat K$), the rate of false estimation of the number of groups (FK%) and the Rand Index (RI, Rand, 1971), which is defined by
\[
\mathrm{RI}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}},
\]
where a true positive (TP) indicates that two observations from the same true group are allocated to the same group, a true negative (TN) means two observations from different groups are allocated to different groups, a false positive (FP) denotes two observations from different groups allocated to the same group, and a false negative (FN) represents two observations from the same group allocated to different groups. A high Rand Index indicates that a substantial proportion of individuals are assigned to the correct subgroups.
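The pair counts are straightforward to accumulate; a direct sketch of the RI (our code) is:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """RI = (TP + TN) / (TP + FP + FN + TN) over all pairs of subjects."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif (not same_true) and (not same_pred):
            tn += 1
        elif (not same_true) and same_pred:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)
```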

Tables 1 and 2 present the average square root of the mean squared error (RMSE) of the estimates for the five methods, across varying values of $p$ and censoring rates. It is evident from the tables that the proposed methods with MCP and SCAD consistently yield smaller RMSE values than the lasso and Tobit SCAD methods across all simulations. Moreover, the method with the lasso penalty consistently underestimates the number of groups; the substantial deviation of that estimator can therefore be attributed to the loss of heterogeneous intercept information. Similarly, the Tobit SCAD method, which lacks the subgroup recognition capability, shows comparably inflated errors.

Table 1. Sample means and standard deviations of the evaluation measures when K = 2.

Table 2. Sample means and standard deviations of the evaluation measures when K = 3.

In Tables 3 and 4, we report the results for variable selection and subgroup identification. The NT and NE values indicate that our method is comparable to the other methods in identifying relevant variables while outperforming them in screening out irrelevant variables. The proposed approaches also achieve higher RI and lower FK values, further establishing their advantage over the alternative methods. Despite the increased difficulty of model recovery under higher censoring rates and larger parameter dimensions, our method maintains a remarkably low error rate for the estimated number of subgroups; incorrect estimates occur only when the censoring rate reaches 40%, underscoring the robustness of our approach even under such demanding conditions.

Table 3. Sample means and standard deviations of the evaluation measures when K = 2.

Table 4. Sample means and standard deviations of the evaluation measures when K = 3.

6. An empirical application to the HIV drug resistance data

Antiretroviral therapy (ART) is a common medical treatment for human immunodeficiency virus (HIV). However, the high mutation rate of HIV leads to drug-resistant mutations (DRMs) in HIV-infected patients receiving ART. In response to this challenge, physicians regularly monitor HIV viral load. When the patient's treatment regimen fails to suppress the virus, genotypic testing is conducted to check for DRMs, and subsequently, they can update the patient's drug regimen appropriately.

To identify DRMs and quantify the degree of resistance they confer against different ART treatments, we apply the proposed approach to model the relationship between HIV viral load and mutations in the virus's genome. The data used in this section are from the OPTIONS trial of the AIDS Clinical Trials Group (Gandhi et al., 2020) and can be downloaded from the Stanford HIV Drug Resistance Database (Shafer, 2006). Specifically, the OPTIONS trial enrolled 412 HIV-infected participants who were receiving protease inhibitor (PI)-based treatment and experiencing virological failure. Each individual was administered an individualized ART regimen on the basis of their drug resistance and treatment history. Individuals exhibiting moderate drug resistance were randomly allocated either to include nucleoside reverse transcriptase inhibitors (NRTIs) in their optimized treatment regimens or to exclude NRTIs from these regimens. Individuals with high drug resistance were all provided with optimized regimens that included NRTIs.

Our dataset includes n = 407 participants who completed a comprehensive 12-week follow-up assessment. The dataset contains p = 601 predictors, including 99 protease (PR) and 240 reverse transcriptase (RT) gene mutation indicators. Because of technical limitations, the assays cannot measure HIV viral load below a threshold of 50 copies/ml, so the response variable is left-censored. As proposed by Soret et al. (2018), we use the log10 HIV viral load as our response because of its prevalent conformity to a normal distribution. In this trial, 35.6% of individuals have no detectable viral load, which implies a left-censoring ratio of 35.6% for the data sample. We compare our proposed methods (SCAD and MCP) and Tobit SCAD (Jacobson & Zou, 2023) for modelling HIV viral load 12 weeks after drug regimen assignment as a function of several variables, including HIV genotypic mutations, current drug regimen, baseline viral load and observation week.

First, to evaluate model selection performance, we fit the entire dataset with Tobit SCAD and with the proposed methods (SCAD and MCP). The constants $a_1$, $a_2$ and $\rho$ are set as in Section 5. Sparse models containing the covariates M184V, the baseline viral load and RAL are consistently selected by all three approaches. Here, M184V is an indicator of a mutation in the reverse transcriptase (RT) gene, and RAL indicates whether the participant was taking raltegravir. The Stanford University HIV Drug Resistance Database lists M184V as a major NRTI resistance mutation (Jacobson & Zou, 2023; Shafer, 2006). Moreover, the absence of additional NRTIs among the selected variables aligns with the results in Gandhi et al. (2020). This congruence highlights the practical performance of the proposed approaches in model selection. The estimated coefficients are shown in Table 5.

Table 5. The estimators of parameters in the example.

The key difference between the proposed methods and Tobit SCAD lies in the capacity of the proposed methods to identify subgroup structures in the intercepts. In Figure 1, we present the estimated density function of $y_i-x_i^\top\hat\beta^{TS}$, where $\hat\beta^{TS}$ is the coefficient vector estimate of the Tobit SCAD model (Jacobson & Zou, 2023). The distribution of $y_i-x_i^\top\hat\beta^{TS}$ is clearly multimodal, so it appears more appropriate to fit the dataset with a model with heterogeneous intercepts. Subsequently, in Figure 2, we present the density estimates of $y_i-x_i^\top\hat\beta^{PS}$ for the two subgroups characterized by different intercepts, where $\hat\beta^{PS}$ is the coefficient vector estimate of the proposed SCAD model. The density estimates for the proposed MCP method exhibit similar characteristics and are therefore omitted. Compared with the density shown in Figure 1, each subgroup in Figure 2 displays a unimodal distribution with greater homogeneity.

Figure 1. Density plot of the response variable after adjusting for the effects of the covariates in the empirical example.


Figure 2. Density plots of response variable after adjusting for the effects of the covariates under two different intercept groups for proposed SCAD method.


The proposed SCAD method detects two subgroups with sample sizes $n_1=47$ and $n_2=360$, respectively. It is worth noting that the patients in the OPTIONS design were originally categorized into two groups: a randomized group of 356 patients and a highly resistant group of 51 patients. We therefore treat this historical grouping as a control group structure and calculate the Rand Index (RI) of the proposed SCAD method against it. The RI is 0.757, which indicates a substantial degree of agreement between the detected subgroup structure and the historical one. This suggests that the subgroup composition may change over the course of antiretroviral therapy, a phenomenon that is quite plausible in the context of HIV patient management.

In addition, we compare the in-sample and out-of-sample errors of the aforementioned methods and of the Tobit SCAD method with known subgroup structure. These errors are defined as the means over all samples of (i) $\mathrm{err}(y_i,\hat y_i)=(y_i-\hat y_i)^2$ and (ii) $\mathrm{err}(y_i,\check y_i)=(y_i-\check y_i)^2$, $i=1,\ldots,n$, where $\hat y_i$ is the fitted response and $\check y_i$ is the predicted response from a 5-fold cross-validation. To ensure comparability, we employ stratified sampling to maintain a consistent left-censoring ratio across the test sets. Table 6 presents the two types of criteria. For both types of error, the Tobit SCAD method applied with the known subgroup structure exhibits smaller errors than the homogeneous model. This underscores the importance of capturing the inherent heterogeneity of the dataset, as it enables more effective modelling of the variables influencing the response. While the errors under the control grouping are slightly larger than those of the proposed methods, these discrepancies may stem from changes in drug resistance among patients during treatment. Moreover, the disparities with the control group could serve as a basis for further examination of specific patient cases: the existence of distinct groups may be attributable to differences in the treatment trajectories of patients or to inherent physiological differences among individuals. In clinical terms, these findings suggest the potential for devising more precise and tailored management protocols for specific patient subgroups, improving overall treatment efficacy.
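A sketch of the stratified 5-fold out-of-sample error follows; `fit` and `predict` are placeholders for any of the compared estimators, and the 50 copies/ml detection limit on the log10 scale is our assumed definition of the censoring stratum.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_out_of_sample_error(X, y, fit, predict, n_splits=5, seed=0):
    """Mean squared prediction error with folds stratified on censoring."""
    censored = (y <= np.log10(50)).astype(int)   # left-censoring indicator
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train, test in skf.split(X, censored):
        model = fit(X[train], y[train])
        y_check = predict(model, X[test])
        errors.append(np.mean((y[test] - y_check) ** 2))
    return float(np.mean(errors))
```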

Table 6. The in-sample and out-of-sample errors.

7. Conclusion

This paper introduces a novel approach to analysing censored data with potential heterogeneity in intercept effects by combining the penalized Tobit likelihood with a concave fusion penalty. The proposed method automatically identifies heterogeneous structures in intercept effects and conducts variable selection.

To solve the optimization problem associated with the method, we propose an algorithm based on the generalized coordinate descent method and the alternating direction method of multipliers. The algorithm simplifies the optimization and reduces computational cost by replacing a complex nonlinear optimization problem with quadratic majorization subproblems. This choice ensures efficient computation, particularly in high-complexity scenarios, while maintaining good properties of the estimator. Furthermore, we establish the oracle property of the parameter estimators: a local minimizer of the objective function coincides with the oracle solution with probability tending to one, providing theoretical support for the correctness of the method. Our ADMM-GCD algorithm with SCAD or MCP performs well in both extensive simulation studies and the application to real data.

The proposed method can also be extended to incorporate the fusion of coefficient terms. However, there are still challenges in applying this method to generalized linear models or survival models, and further research is needed to develop algorithms and establish theoretical properties in these models.

Acknowledgements

We are very grateful to the editor, managing editor and the referee for their comments that improved this article.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [grant numbers 12171450 and 71921001].

References

  • Alhamzawi, A. (2020). A new Bayesian elastic net for Tobit regression. Journal of Physics: Conference Series, 1664(1), 012047.
  • Alhamzawi, R. (2016). Bayesian elastic net Tobit quantile regression. Communications in Statistics-Simulation and Computation, 45(7), 2409–2427. https://doi.org/10.1080/03610918.2014.904341
  • Amemiya, T. (1973). Regression analysis when the dependent variable is truncated normal. Econometrica: Journal of the Econometric Society, 41(6), 997–1016. https://doi.org/10.2307/1914031
  • Amemiya, T. (1984). Tobit models: A survey. Journal of Econometrics, 24(1–2), 3–61. https://doi.org/10.1016/0304-4076(84)90074-5
  • Bondell, H. D., & Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1), 115–123. https://doi.org/10.1111/biom.2008.64.issue-1
  • Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122. https://doi.org/10.1561/2200000016
  • Bradic, J., Fan, J., & Jiang, J. (2011). Regularization for Cox's proportional hazards model with NP-dimensionality. Annals of Statistics, 39(6), 3092. https://doi.org/10.1214/11-AOS911
  • Dagne, G. A. (2016). A growth mixture Tobit model: Application to AIDS studies. Journal of Applied Statistics, 43(7), 1174–1185. https://doi.org/10.1080/02664763.2015.1092114
  • Everitt, B. (2013). Finite mixture distributions. Springer Science & Business Media.
  • Fan, J., & Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360. https://doi.org/10.1198/016214501753382273
  • Fan, J., & Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1), 101–148.
  • Fan, Y., & Tang, C. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 531–552. https://doi.org/10.1111/rssb.12001
  • Gandhi, R. T., Tashima, K. T., Smeaton, L. M., Vu, V., Ritz, J., Andrade, A., Eron, J. J., Hogg, E., & Fichtenbaum, C. J. (2020). Long-term outcomes in a large randomized trial of HIV-1 salvage therapy: 96-week results of AIDS Clinical Trials Group A5241 (OPTIONS). The Journal of Infectious Diseases, 221(9), 1407–1415. https://doi.org/10.1093/infdis/jiz281
  • Jacobson, T., & Zou, H. (2023). High-dimensional censored regression via the penalized Tobit likelihood. Journal of Business & Economic Statistics, 42(1), 286–297. https://doi.org/10.1080/07350015.2023.2182309
  • Johnson, B. A. (2009). On lasso for censored data. Electronic Journal of Statistics, 3, 485–506. https://doi.org/10.1214/08-EJS322
  • Ma, S., & Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517), 410–423. https://doi.org/10.1080/01621459.2016.1148039
  • Ma, S., Huang, J., Zhang, Z., & Liu, M. (2019). Exploration of heterogeneous treatment effects via concave fusion. The International Journal of Biostatistics, 16(1), 20180026. https://doi.org/10.1515/ijb-2018-0026
  • Müller, P., & van de Geer, S. (2016). Censored linear model in high dimensions: Penalised linear regression on high-dimensional data with left-censored response variable. Test, 25(1), 75–92. https://doi.org/10.1007/s11749-015-0441-7
  • Olsen, R. J. (1978). Note on the uniqueness of the maximum likelihood estimator for the Tobit model. Econometrica: Journal of the Econometric Society, 46(5), 1211–1215. https://doi.org/10.2307/1911445
  • Powell, J. L. (1984). Least absolute deviations estimation for the censored regression model. Journal of Econometrics, 25(3), 303–325. https://doi.org/10.1016/0304-4076(84)90004-6
  • Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850. https://doi.org/10.1080/01621459.1971.10482356
  • Shafer, R. W. (2006). Rationale and uses of a public HIV drug-resistance database. The Journal of Infectious Diseases, 194(s1), S51–S58. https://doi.org/10.1086/jid.2006.194.issue-s1
  • Shen, J., & He, X. (2015). Inference for subgroup analysis with a structured logistic-normal mixture model. Journal of the American Statistical Association, 110(509), 303–312. https://doi.org/10.1080/01621459.2014.894763
  • Soret, P., Avalos, M., Wittkop, L., Commenges, D., & Thiébaut, R. (2018). Lasso regularization for left-censored Gaussian outcome and high-dimensional predictors. BMC Medical Research Methodology, 18(1), 1–13. https://doi.org/10.1186/s12874-018-0609-4
  • Tobin, J. (1958). Estimation of relationships for limited dependent variables. Econometrica: Journal of the Econometric Society, 26(1), 24–36. https://doi.org/10.2307/1907382
  • Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3), 475–494. https://doi.org/10.1023/A:1017501703105
  • Wang, H., Li, B., & Leng, C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3), 671–683. https://doi.org/10.1111/j.1467-9868.2008.00693.x
  • Wang, H., Li, R., & Tsai, C. L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3), 553–568. https://doi.org/10.1093/biomet/asm053
  • Wang, X., Zhu, Z., & Zhang, H. H. (2019). Spatial automatic subgroup analysis for areal data with repeated measures. arXiv:1906.01853.
  • Yan, X., Yin, G., & Zhao, X. (2021). Subgroup analysis in censored linear regression. Statistica Sinica, 31(2), 1027–1054.
  • Yang, Y., & Zou, H. (2013). An efficient algorithm for computing the HHSVM and its generalizations. Journal of Computational and Graphical Statistics, 22(2), 396–415. https://doi.org/10.1080/10618600.2012.680324
  • Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
  • Zhao, P., & Yu, B. (2006). On model selection consistency of Lasso. The Journal of Machine Learning Research, 7(90), 2541–2563.
  • Zhou, X., & Liu, G. (2016). LAD-lasso variable selection for doubly censored median regression models. Communications in Statistics-Theory and Methods, 45(12), 3658–3667. https://doi.org/10.1080/03610926.2014.904357
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429. https://doi.org/10.1198/016214506000000735

Appendix

A.1. Proof of Proposition 3.1

First, according to the definition of $\eta^{(t+1)}$, we have
\[
L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t+1)},\varphi^{(t)})\le L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta,\varphi^{(t)})
\]
for any $\eta$. Define
\[
f^{(t+1)}=\inf_{\Delta\alpha^{(t+1)}-\eta=0}\big\{L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta,\varphi^{(t)})\big\},
\]
and then we have $L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t+1)},\varphi^{(t)})\le f^{(t+1)}$. Let $k$ be a non-negative integer. Since $\varphi^{(t+k-1)}=\varphi^{(t)}+\rho\sum_{i=1}^{k-1}(\Delta\alpha^{(t+i)}-\eta^{(t+i)})$, we obtain
\[
L(\alpha^{(t+k)},\delta^{(t+k)},\gamma^{(t+k)},\eta^{(t+k)},\varphi^{(t+k-1)})=\ell_n(\alpha^{(t+k)},\delta^{(t+k)},\gamma^{(t+k)})+\sum_{j=1}^p P_{\lambda_1}(|\delta_j^{(t+k)}|)+\sum_{i<j}P_{\lambda_2}(|\eta_{ij}^{(t+k)}|)+\varphi^{(t+k-1)\top}(\Delta\alpha^{(t+k)}-\eta^{(t+k)})+\frac{\rho}{2}\|\Delta\alpha^{(t+k)}-\eta^{(t+k)}\|_2^2
\]
\[
=S(\alpha^{(t+k)},\delta^{(t+k)},\gamma^{(t+k)},\eta^{(t+k)})+\frac{\rho}{2}\|\Delta\alpha^{(t+k)}-\eta^{(t+k)}\|_2^2+\Big[\varphi^{(t)}+\rho\sum_{i=1}^{k-1}(\Delta\alpha^{(t+i)}-\eta^{(t+i)})\Big]^\top(\Delta\alpha^{(t+k)}-\eta^{(t+k)})\le f^{(t+k)}.
\]
According to Yang and Zou (2013), the GCD limit point $(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)})$ is a coordinate-wise minimum of $L(\alpha,\delta,\gamma,\eta^{(t)},\varphi^{(t)})$. Additionally, the objective function is convex with respect to $\eta$. Then, according to Theorem 4.1 in Tseng (2001), the sequence $(\alpha^{(t)},\delta^{(t)},\gamma^{(t)},\eta^{(t)})$ converges to a coordinate-wise minimum point $(\bar\alpha,\bar\delta,\bar\gamma,\bar\eta)$. Thus we have
\[
\bar f=\lim_{t\to\infty}f^{(t+1)}=\lim_{t\to\infty}f^{(t+k)}=\inf_{\Delta\bar\alpha-\eta=0}\Big\{\ell_n(\bar\alpha,\bar\delta,\bar\gamma)+\sum_{j=1}^p P_{\lambda_1}(|\bar\delta_j|)+\sum_{i<j}P_{\lambda_2}(|\eta_{ij}|)\Big\},
\]
and for all $k\ge 0$,
\[
\bar f\ge\lim_{t\to\infty}L(\alpha^{(t+k)},\delta^{(t+k)},\gamma^{(t+k)},\eta^{(t+k)},\varphi^{(t+k-1)})=\ell_n(\bar\alpha,\bar\delta,\bar\gamma)+\sum_{j=1}^p P_{\lambda_1}(|\bar\delta_j|)+\sum_{i<j}P_{\lambda_2}(|\bar\eta_{ij}|)+\lim_{t\to\infty}\varphi^{(t)\top}(\Delta\bar\alpha-\bar\eta)+\Big(k-\frac{1}{2}\Big)\rho\,\|\Delta\bar\alpha-\bar\eta\|_2^2.
\]
Therefore,
\[
\Big(k-\frac{1}{2}\Big)\rho\,\|\Delta\bar\alpha-\bar\eta\|_2^2\le\inf_{\Delta\bar\alpha-\eta=0}\Big\{\sum_{i<j}P_{\lambda_2}(|\eta_{ij}|)\Big\}-\sum_{i<j}P_{\lambda_2}(|\bar\eta_{ij}|)-\lim_{t\to\infty}\varphi^{(t)\top}(\Delta\bar\alpha-\bar\eta).
\]
Since the above inequality holds for all $k\ge 0$, it follows that $\lim_{t\to\infty}\|r^{(t)}\|_2^2=\|\Delta\bar\alpha-\bar\eta\|_2^2=0$.

Moreover, since $(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)})$ minimizes $L(\alpha,\delta,\gamma,\eta^{(t)},\varphi^{(t)})$, by definition we have
\[
\partial L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)},\varphi^{(t)})/\partial\alpha=\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\varphi^{(t)}+\rho\,\Delta^\top(\Delta\alpha^{(t+1)}-\eta^{(t)})
\]
\[
=\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\big(\varphi^{(t)}+\rho(\Delta\alpha^{(t+1)}-\eta^{(t)})\big)=\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\varphi^{(t+1)}+\rho\,\Delta^\top(\eta^{(t+1)}-\eta^{(t)})=0.
\]
Thus,
\[
s^{(t+1)}=\rho\,\Delta^\top(\eta^{(t+1)}-\eta^{(t)})=-\big(\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\varphi^{(t+1)}\big).
\]
Since $\lim_{t\to\infty}\|r^{(t)}\|_2^2=\|\Delta\bar\alpha-\bar\eta\|_2^2=0$,
\[
\lim_{t\to\infty}\partial L(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)},\varphi^{(t)})/\partial\alpha=\lim_{t\to\infty}\partial S(\alpha^{(t+1)},\delta^{(t+1)},\gamma^{(t+1)},\eta^{(t)})/\partial\alpha+\Delta^\top\varphi^{(t+1)}=0.
\]
Therefore, $\lim_{t\to\infty}\|s^{(t+1)}\|_2^2=0$.

A.2. Proof of Theorem 4.1

Let $\theta=\hat\Xi^{or}-\Xi^*$ and $q=K+s+1$. From (20), $\sqrt{n}\,\theta\to N(0,\Sigma)$, where $\Sigma=\big[-\frac{1}{n}\nabla^2\log L_n(\Xi)\big|_{\Xi=\Xi^*}\big]^{-1}$.

In order to bound the tail probability of the event in (21), we introduce a $q$-dimensional vector $a=(a_1,\ldots,a_q)^\top$ satisfying $\|a\|_2=1$, that is, $a_1^2+\cdots+a_q^2=1$. It is clear that $a^\top\theta\sim N(0,\frac{1}{n}a^\top\Sigma a)$ asymptotically. Since $|a^\top\theta|=|a_1\theta_1+\cdots+a_q\theta_q|\le|a_1\theta_1|+\cdots+|a_q\theta_q|\le\|a\|_1\max_j|\theta_j|$, and taking $a=e_j$ attains $|\theta_j|$, it is easy to get $\max_{\|a\|_2=1}|a^\top\theta|=\max_j|\theta_j|$. Since $\Sigma$ is a symmetric positive definite matrix, the maximum value of the quadratic form $f=a^\top\Sigma a$ over $\|a\|_2=1$ is the largest eigenvalue $\lambda_{\max}$ of the matrix $\Sigma$.

For a normal random variable $X\sim N(0,\sigma^2)$, the two-sided tail probability inequality is
\[
P(|X|<c)\ge 1-\sqrt{\frac{2}{\pi}}\,\frac{\sigma}{c}\exp\Big\{-\frac{c^2}{2\sigma^2}\Big\}. \tag{A1}
\]
So
\[
P(|a^\top\theta|<\phi_n)\ge 1-\sqrt{\frac{2}{\pi}}\,\frac{\sqrt{f/n}}{\phi_n}\exp\Big\{-\frac{\phi_n^2}{2(f/n)}\Big\}.
\]
Since the right-hand side of the above inequality is decreasing in $f$,
\[
P\Big(\max_{\|a\|_2=1}|a^\top\theta|<\phi_n\Big)\ge 1-\sqrt{\frac{2}{\pi}}\,\frac{\sqrt{\lambda_{\max}/n}}{\phi_n}\exp\Big\{-\frac{\phi_n^2}{2(\lambda_{\max}/n)}\Big\}.
\]
Therefore,
\[
P(\|\hat\Theta^{or}-\Theta^*\|_\infty<\phi_n)=P(\|\hat\Xi^{or}-\Xi^*\|_\infty<\phi_n)=P\big(\max_j|\theta_j|<\phi_n\big)=P\Big(\max_{\|a\|_2=1}|a^\top\theta|<\phi_n\Big)\ge 1-\sqrt{\frac{2}{\pi}}\,\frac{\sqrt{\lambda_{\max}/n}}{\phi_n}\exp\Big\{-\frac{\phi_n^2}{2(\lambda_{\max}/n)}\Big\}. \tag{A2}
\]
Since $\lambda_{\max}=O(1)$, we set $\phi_n=1/\sqrt{\log n}$ and $t=\sqrt{\log n/n}$. Then $P(\|\hat\Theta^{or}-\Theta^*\|_\infty<\phi_n)\ge 1-C\sqrt{\frac{\log n}{n}}\exp\{-\frac{n}{2\log n}\}=1-Ct\exp\{-\frac{1}{2t^2}\}$, where $C$ is a constant. Obviously, $t\to 0$, and then $P(\|\hat\Theta^{or}-\Theta^*\|_\infty<\phi_n)\to 1$ as $n\to\infty$.

A.3. Proof of Theorem 4.2

First, given the true group division $G_1,\ldots,G_K$ and the true support set $\mathcal{A}$, we define
\[
L_n(\alpha,\delta,\gamma)=\ell_n(\alpha,\delta,\gamma),\quad P_{\lambda_1}(\delta)=\lambda_1\sum_{j=1}^p\rho(|\delta_j|),\quad P_{\lambda_2}(\alpha)=\lambda_2\sum_{i<j}\rho(|\alpha_i-\alpha_j|),
\]
\[
L_n^O(\tau,\delta_1,\gamma)=\ell_n^O(\tau,\delta_1,\gamma),\quad P_{\lambda_1}^O(\delta_1)=\lambda_1\sum_{j\in\mathcal{A}}\rho(|\delta_j|),\quad P_{\lambda_2}^O(\tau)=\lambda_2\sum_{k<k'}|G_k||G_{k'}|\,\rho(|\tau_k-\tau_{k'}|),
\]
and denote
\[
Q_n(\alpha,\delta,\gamma)=L_n(\alpha,\delta,\gamma)+P_{\lambda_1}(\delta)+P_{\lambda_2}(\alpha),\qquad Q_n^O(\tau,\delta_1,\gamma)=L_n^O(\tau,\delta_1,\gamma)+P_{\lambda_1}^O(\delta_1)+P_{\lambda_2}^O(\tau).
\]
Let $T:\mathcal{M}_{\mathcal{G}}\to\mathbb{R}^K$ be the mapping such that $T(\alpha)$ is the $K\times 1$ vector whose $k$th coordinate equals the common value of $\alpha_i$ for $i\in G_k$, and let $T_0:\mathbb{R}^n\to\mathbb{R}^K$ be the mapping $T_0(\alpha)=\{|G_k|^{-1}\sum_{i\in G_k}\alpha_i\}_{k=1}^K$. Let $S:\mathbb{R}^p\to\mathbb{R}^s$ be the mapping that keeps only the components of $\delta$ indexed by $\mathcal{A}$, that is, $S(\delta)=\delta_{\mathcal{A}}$, and let $S_1:\mathbb{R}^s\to\mathcal{M}_{\mathcal{B}}$ be the mapping $S_1(\delta_1)=(\delta_1^\top,0_{\mathcal{A}^c}^\top)^\top$. Obviously, when $\alpha\in\mathcal{M}_{\mathcal{G}}$ and $\delta\in\mathcal{M}_{\mathcal{B}}$, $T(\alpha)=T_0(\alpha)$ and $\delta_{\mathcal{A}}=S(\delta)$. Moreover, for every $\delta\in\mathcal{M}_{\mathcal{B}}$ and $\alpha\in\mathcal{M}_{\mathcal{G}}$, we have $P_{\lambda_1}(\delta)=P_{\lambda_1}^O(\delta_{\mathcal{A}})$ and $P_{\lambda_2}(\alpha)=P_{\lambda_2}^O(T(\alpha))$; for every $\delta_1\in\mathbb{R}^s$ and $\tau\in\mathbb{R}^K$, we have $P_{\lambda_1}(S_1(\delta_1))=P_{\lambda_1}^O(\delta_1)$ and $P_{\lambda_2}(T^{-1}(\tau))=P_{\lambda_2}^O(\tau)$. Hence
\[
Q_n(\alpha,\delta,\gamma)=Q_n^O(T(\alpha),\delta_{\mathcal{A}},\gamma),\qquad Q_n^O(\tau,\delta_1,\gamma)=Q_n(T^{-1}(\tau),S_1(\delta_1),\gamma). \tag{A3}
\]
Consider the neighbourhood of $(\alpha^*,\delta^*,\gamma^*)$
\[
\Psi=\big\{\alpha\in\mathbb{R}^n,\delta\in\mathbb{R}^p,\gamma\in\mathbb{R}:\|((\alpha-\alpha^*)^\top,(\delta-\delta^*)^\top,\gamma-\gamma^*)\|_\infty\le\phi_n\big\}.
\]
Define the event $E_1=\{\hat\Theta^{or}\in\Psi\}$. By Theorem 4.1 we have $P(E_1^c)\le p_1$. For any $\alpha\in\mathbb{R}^n$ and $\delta\in\mathbb{R}^p$, let $\alpha^0=T^{-1}(T_0(\alpha))$ and $\delta^0=S_1(\delta_{\mathcal{A}})$. We will prove that $(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ is a local minimizer of the objective function $Q_n(\alpha,\delta,\gamma)$ with probability approaching 1 through the following two steps.

  1. On the event $E_1$, $Q_n(\alpha^0,\delta^0,\gamma)\ge Q_n(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ for any $(\alpha,\delta,\gamma)\in\Psi$.

  2. There is an event $E_2$ such that $P(E_2^c)\le p_2=\frac{n_1}{n\sqrt{\log n}}$. On $E_1\cap E_2$, there is a neighbourhood of $(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$, denoted by $\Psi_n=\{\alpha,\delta:\|((\alpha-\hat\alpha^{or})^\top,(\delta-\hat\delta^{or})^\top)\|_\infty\le t_n\}$, such that $Q_n(\alpha,\delta,\gamma)\ge Q_n(\alpha^0,\delta^0,\gamma)$ for any $(\alpha,\delta,\gamma)\in\Psi\cap\Psi_n$ for sufficiently large $n$.

Therefore, combining the results of (i) and (ii), we have $Q_n(\alpha,\delta,\gamma)\ge Q_n(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ for any $(\alpha,\delta,\gamma)\in\Psi\cap\Psi_n$, so that $(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ is a strict local minimizer of $Q_n(\alpha,\delta,\gamma)$ on the event $E_1\cap E_2$, with $P(E_1\cap E_2)\ge 1-p_1-p_2$ for sufficiently large $n$.

Step (i): Since $(\hat\tau^{or},\hat\delta_1^{or},\hat\gamma^{or})$ is a global minimizer of $L_n^O(\tau,\delta_1,\gamma)$, we have $L_n^O(T_0(\alpha),S(\delta),\gamma)\ge L_n^O(\hat\tau^{or},\hat\delta_1^{or},\hat\gamma^{or})$ for all $(\alpha,\delta,\gamma)\in\Psi$. We next show that $P_{\lambda_1}^O(S(\delta))$ is a constant not depending on $\delta$ for $\delta\in\Psi$, and that $P_{\lambda_2}^O(T_0(\alpha))$ is a constant not depending on $\alpha$ for $\alpha\in\Psi$. Let $T_0(\alpha)=\tau=(\tau_1,\ldots,\tau_K)^\top$. For any $k\ne k'$, since
\[
|\tau_k^*-\tau_{k'}^*|=|\tau_k^*-\tau_k+\tau_k-\tau_{k'}+\tau_{k'}-\tau_{k'}^*|\le|\tau_k-\tau_{k'}|+|\tau_k-\tau_k^*|+|\tau_{k'}-\tau_{k'}^*|,
\]
we have
\[
|\tau_k-\tau_{k'}|\ge|\tau_k^*-\tau_{k'}^*|-2\sup_k|\tau_k-\tau_k^*|,
\]
and
\[
\sup_k|\tau_k-\tau_k^*|=\sup_k\Big|\sum_{i\in G_k}\alpha_i/|G_k|-\tau_k^*\Big|=\sup_k\Big|\sum_{i\in G_k}(\alpha_i-\alpha_i^*)/|G_k|\Big|\le\sup_k\sup_{i\in G_k}|\alpha_i-\alpha_i^*|=\|\alpha-\alpha^*\|_\infty. \tag{A4}
\]
Therefore, for all $k$ and $k'$,
\[
|\tau_k-\tau_{k'}|\ge|\tau_k^*-\tau_{k'}^*|-2\|\alpha-\alpha^*\|_\infty\ge b_n-2\phi_n>a\lambda_2,
\]
which indicates that $\rho(|\tau_k-\tau_{k'}|)$ is a constant by condition (C2), and as a result $P_{\lambda_2}^O(T_0(\alpha))$ is also a constant. Similarly, for any $j\in\mathcal{A}$, since
\[
|\delta_j^*|=|\delta_j^*-\delta_j+\delta_j|\le|\delta_j-\delta_j^*|+|\delta_j|,
\]
by the condition $|\delta_{\mathcal{A}}^*|_{\min}>(a+1)\lambda_1$ we have
\[
|\delta_j|\ge|\delta_j^*|-\|\delta-\delta^*\|_\infty\ge(a+1)\lambda_1-\phi_n\ge a\lambda_1.
\]
As a result, both $\rho(|\delta_j|)$ and $P_{\lambda_1}^O(S(\delta))$ are constants.

In conclusion, we have $Q_n^O(T_0(\alpha),S(\delta),\gamma)\ge Q_n^O(\hat\tau^{or},\hat\delta_1^{or},\hat\gamma^{or})$ for all $(\alpha,\delta,\gamma)\in\Psi$. In addition, $Q_n^O(\hat\tau^{or},\hat\delta_1^{or},\hat\gamma^{or})=Q_n(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$ and $Q_n^O(T_0(\alpha),S(\delta),\gamma)=Q_n(T^{-1}(T_0(\alpha)),S_1(\delta_{\mathcal{A}}),\gamma)=Q_n(\alpha^0,\delta^0,\gamma)$. Hence we get $Q_n(\alpha^0,\delta^0,\gamma)\ge Q_n(\hat\alpha^{or},\hat\delta^{or},\hat\gamma^{or})$, and the result in (i) is proved.

Step (ii): First, we introduce the neighbourhood $\Psi_n=\{\alpha,\delta:\|((\alpha-\hat\alpha^{or})^\top,(\delta-\hat\delta^{or})^\top)\|_\infty\le t_n\}$ for a positive sequence $t_n$. For $(\alpha,\delta,\gamma)\in\Psi\cap\Psi_n$, by Taylor's expansion at $(\alpha^0,\delta^0)$, we have
\[
Q_n(\alpha,\delta,\gamma)-Q_n(\alpha^0,\delta^0,\gamma)=w^\top(\alpha-\alpha^0)+\sum_{i=1}^n\frac{\partial P_{\lambda_2}(\alpha^m)}{\partial\alpha_i}(\alpha_i-\alpha_i^0)-v^\top(\delta-\delta^0)+\sum_{j=1}^p\frac{\partial P_{\lambda_1}(\delta^m)}{\partial\delta_j}(\delta_j-\delta_j^0)=\Gamma_1+\Gamma_2+\Gamma_3+\Gamma_4,
\]
where
\[
w=\Big[\frac{1}{n_1}(\gamma y_1-\alpha_1^m-X_1\delta^m)^\top,\ \frac{1}{n_0}g(-\alpha_0^m-X_0\delta^m)^\top\Big]^\top=\Big(\frac{1}{n_1}\Lambda_1^\top,\ \frac{1}{n_0}\Lambda_2^\top\Big)^\top
\]
and
\[
v=\frac{1}{n_1}X_1^\top(\gamma y_1-\alpha_1^m-X_1\delta^m)+\frac{1}{n_0}X_0^\top g(-\alpha_0^m-X_0\delta^m)=\frac{1}{n_1}X_1^\top\Lambda_1+\frac{1}{n_0}X_0^\top\Lambda_2,
\]
in which $\alpha^m=\zeta_1\alpha+(1-\zeta_1)\alpha^0$ and $\delta^m=\zeta_2\delta+(1-\zeta_2)\delta^0$ for some $\zeta_1,\zeta_2\in(0,1)$. Firstly, since $\alpha_i^0=|G_k|^{-1}\sum_{j\in G_k}\alpha_j$ for $i\in G_k$,
\[
\Gamma_1=w^\top(\alpha-\alpha^0)=\sum_{k=1}^K\sum_{i,j\in G_k}\frac{w_i(\alpha_i-\alpha_j)}{|G_k|}=\sum_{k=1}^K\sum_{i,j\in G_k}\frac{w_i(\alpha_i-\alpha_j)}{2|G_k|}+\sum_{k=1}^K\sum_{i,j\in G_k}\frac{w_j(\alpha_j-\alpha_i)}{2|G_k|}=\sum_{k=1}^K\sum_{i,j\in G_k}\frac{(w_j-w_i)(\alpha_j-\alpha_i)}{2|G_k|}=\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\frac{(w_j-w_i)(\alpha_j-\alpha_i)}{|G_k|}. \tag{A5}
\]
As shown in (A4),
\[
\|\alpha^0-\alpha^*\|_\infty=\|\tau-\tau^*\|_\infty\le\|\alpha-\alpha^*\|_\infty. \tag{A6}
\]
Since $\alpha^m=\zeta_1\alpha+(1-\zeta_1)\alpha^0$,
\[
\|\alpha^m-\alpha^*\|_\infty\le\|\alpha-\alpha^*\|_\infty\le\phi_n. \tag{A7}
\]
By the same steps as in (A4) and (A6),
\[
\|\delta^0-\delta^*\|_\infty=\|\delta_1-\delta_1^*\|_\infty\le\|\delta-\delta^*\|_\infty. \tag{A8}
\]
Since $\delta^m=\zeta_2\delta+(1-\zeta_2)\delta^0$,
\[
\|\delta^m-\delta^*\|_\infty\le\|\delta-\delta^*\|_\infty\le\phi_n. \tag{A9}
\]
Then, by condition (C1),
\[
\Lambda_1=\gamma y_1-\alpha_1^m-X_1\delta^m=\frac{\gamma}{\gamma^*}(\alpha_1^*+X_1\delta^*+\varepsilon_1)-X_1\delta^m-\alpha_1^m=\frac{\gamma-\gamma^*}{\gamma^*}(\alpha_1^*+X_1\delta^*)+\alpha_1^*-\alpha_1^m+X_1(\delta^*-\delta^m)+\frac{\gamma-\gamma^*+\gamma^*}{\gamma^*}\varepsilon_1,
\]
so that
\[
\|\Lambda_1\|\le\phi_nC_0\big(C_3\sqrt{n_1}+C_1\sqrt{s}\,C_2\sqrt{s}\big)+\phi_n+C_1\sqrt{s}\,\phi_n+(\phi_n+C_0)\|\varepsilon_1\|.
\]
Consider the $i$th element of the vector $\Lambda_2$ and define a positive constant $\xi_i$ satisfying
\[
g_i(-X_0\delta^m-\alpha_0^m)\le|x_i^\top\delta^m+\alpha_i^m|+\xi_i\le|x_i^\top(\delta^m-\delta^*)|+|\alpha_i^m-\alpha_i^*|+|x_i^\top\delta^*+\alpha_i^*|+\xi_i.
\]
Define $\xi=\max\{\xi_1,\ldots,\xi_{n_0}\}$, so that
\[
\|\Lambda_2\|\le\|X_0(\delta^m-\delta^*)\|+\|\alpha_0^m-\alpha_0^*\|+\|X_0\delta^*+\alpha_0^*\|+\xi\le C_1\sqrt{s}\,\phi_n+\phi_n+C_1\sqrt{s}\,C_2\sqrt{s}+C_3\sqrt{n_0}+\xi.
\]
Therefore,
\[
\max_{i,j}|w_j-w_i|\le 2\|w\|_\infty\le 2\max\Big\{\frac{\|\Lambda_1\|}{n_1},\frac{\|\Lambda_2\|}{n_0}\Big\}.
\]
By condition (C3), there is a constant $c_1$ such that
\[
P\big(\|\varepsilon_1\|_\infty>c_1\sqrt{\log n}\big)\le\sum_{i=1}^{n_1}P\big(|\varepsilon_i|>c_1\sqrt{\log n}\big)\le\frac{n_1}{n\sqrt{\log n}}.
\]
Thus there is an event $E_2$ such that $P(E_2^c)\le\frac{n_1}{n\sqrt{\log n}}$, and on the event $E_2$,
\[
|G_{\min}|^{-1}\max_{i,j}|w_j-w_i|\le 2|G_{\min}|^{-1}\max\Big\{\frac{\phi_n}{n_1}\big(C_1n_1+C_2s^{3/2}+C_3s+C_4\sqrt{\log n}\big)+\frac{C_5\sqrt{\log n}}{n_1},\ \frac{1}{n_0}\big(C_6\sqrt{n_0}+C_7s^{3/2}+C_8\phi_ns+C_9\phi_n+\xi\big)\Big\}.
\]
Under condition (C4), $\lambda_2\gg|G_{\min}|^{-1}\max\big\{\frac{s^{3/2}}{n_0},\frac{1}{\sqrt{n_0}},\frac{\sqrt{\log n}}{n_1}\big\}$, and hence
\[
\lambda_2\gg|G_{\min}|^{-1}\max_{i,j}|w_j-w_i|. \tag{A10}
\]
Then, denoting $\bar\rho(t)=\rho'(|t|)\,\mathrm{sgn}(t)$,
\[
\Gamma_2=\lambda_2\sum_{i=1}^n\sum_{j\ne i}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_i-\alpha_i^0)=\lambda_2\sum_{i<j}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_i-\alpha_i^0)+\lambda_2\sum_{i>j}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_i-\alpha_i^0).
\]
Swapping $i$ and $j$ in the second sum,
\[
\Gamma_2=\lambda_2\sum_{i<j}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_i-\alpha_i^0)-\lambda_2\sum_{i<j}\bar\rho(\alpha_i^m-\alpha_j^m)(\alpha_j-\alpha_j^0)=\lambda_2\sum_{i<j}\bar\rho(\alpha_i^m-\alpha_j^m)\{(\alpha_i-\alpha_i^0)-(\alpha_j-\alpha_j^0)\}. \tag{A11}
\]
When $i,j\in G_k$, $\alpha_i^0=\alpha_j^0$, and $\alpha_i^m-\alpha_j^m$ has the same sign as $\alpha_i-\alpha_j$, and hence
\[
\Gamma_2=\lambda_2\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\rho'(|\alpha_i^m-\alpha_j^m|)\,|\alpha_i-\alpha_j|+\lambda_2\sum_{k<k'}\sum_{i\in G_k,\,j\in G_{k'}}\bar\rho(\alpha_i^m-\alpha_j^m)\{(\alpha_i-\alpha_i^0)-(\alpha_j-\alpha_j^0)\}.
\]
Then, for $k\ne k'$, $i\in G_k$, $j\in G_{k'}$, since
\[
|\alpha_i^*-\alpha_j^*|\le|\alpha_i^m-\alpha_j^m|+|\alpha_i^*-\alpha_i^m|+|\alpha_j^m-\alpha_j^*|,
\]
we have
\[
|\alpha_i^m-\alpha_j^m|\ge\min_{i\in G_k,\,j\in G_{k'}}|\alpha_i^*-\alpha_j^*|-2\|\alpha^m-\alpha^*\|_\infty\ge b_n-2\|\alpha-\alpha^*\|_\infty\ge b_n-2\phi_n\ge a\lambda_2,
\]
and thus $\bar\rho(\alpha_i^m-\alpha_j^m)=0$. Therefore,
\[
\Gamma_2=\lambda_2\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\rho'(|\alpha_i^m-\alpha_j^m|)\,|\alpha_i-\alpha_j|. \tag{A12}
\]
Moreover, by the same reasoning as in (A4), for $i,j\in G_k$ we have $\|\alpha^0-\hat\alpha^{or}\|_\infty\le\|\alpha-\hat\alpha^{or}\|_\infty$. Then
\[
|\alpha_i^m-\alpha_j^m|\le|\alpha_i^m-\alpha_i^0|+|\alpha_j^m-\alpha_j^0|\le 2\|\alpha^m-\alpha^0\|_\infty\le 2\|\alpha-\alpha^0\|_\infty\le 2\big(\|\alpha-\hat\alpha^{or}\|_\infty+\|\alpha^0-\hat\alpha^{or}\|_\infty\big)\le 4\|\alpha-\hat\alpha^{or}\|_\infty\le 4t_n. \tag{A13}
\]
Since $\rho(\cdot)$ is concave, $\rho'(|\alpha_i^m-\alpha_j^m|)\ge\rho'(4t_n)$. As a result,
\[
\Gamma_2\ge\lambda_2\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\rho'(4t_n)\,|\alpha_i-\alpha_j|. \tag{A14}
\]
On the other hand, we have
\[
\Gamma_3=-v^\top(\delta-\delta^0)=-\Big(\sum_{j\in\mathcal{A}}v_j(\delta_j-\delta_j^0)+\sum_{j\in\mathcal{A}^c}v_j(\delta_j-\delta_j^0)\Big)=-\sum_{j\in\mathcal{A}^c}v_j\delta_j. \tag{A15}
\]
Writing $\Lambda_3=X_1^\top\Lambda_1$ and $\Lambda_4=X_0^\top\Lambda_2$, on the event $E_2$,
\[
\max_j|v_j|\le\Big\|\frac{X_1}{n_1}\Big\|_\infty\|\Lambda_1\|+\Big\|\frac{X_0}{n_0}\Big\|_\infty\|\Lambda_2\|\le C_1\sqrt{s}\Big(\frac{1}{n_1}\|\Lambda_1\|+\frac{1}{n_0}\|\Lambda_2\|\Big)\le C_1\sqrt{s}\Big\{\frac{\phi_n}{n_1}\big(C_1n_1+C_2s^{3/2}+C_3s+C_4\sqrt{\log n}\big)+\frac{C_5\sqrt{\log n}}{n_1}+\frac{1}{n_0}\big(C_6\sqrt{n_0}+C_7s^{3/2}+C_8\phi_ns+C_9\phi_n+\xi\big)\Big\}. \tag{A16}
\]
Under condition (C4), we get
\[
\lambda_1\gg\max_j|v_j|. \tag{A17}
\]
Then,
\[
\Gamma_4=\lambda_1\sum_{j=1}^p\bar\rho(\delta_j^m)(\delta_j-\delta_j^0)=\lambda_1\Big(\sum_{j\in\mathcal{A}}\bar\rho(\delta_j^m)(\delta_j-\delta_j^0)+\sum_{j\in\mathcal{A}^c}\bar\rho(\delta_j^m)(\delta_j-\delta_j^0)\Big). \tag{A18}
\]
When $j\in\mathcal{A}^c$, $\delta_j^0=0$, and $\delta_j^m$ has the same sign as $\delta_j$. Hence
\[
\Gamma_4=\lambda_1\Big(\sum_{j\in\mathcal{A}^c}\rho'(|\delta_j^m|)\,|\delta_j|+\sum_{j\in\mathcal{A}}\bar\rho(\delta_j^m)(\delta_j-\delta_j^0)\Big). \tag{A19}
\]
For $j\in\mathcal{A}$, by (A9),
\[
|\delta_j^m|\ge\min_{j\in\mathcal{A}}|\delta_j^*|-\|\delta^*-\delta^m\|_\infty\ge(a+1)\lambda_1-\phi_n\ge a\lambda_1, \tag{A20}
\]
and thus $\bar\rho(\delta_j^m)=0$. Therefore,
\[
\Gamma_4=\lambda_1\sum_{j\in\mathcal{A}^c}\rho'(|\delta_j^m|)\,|\delta_j|. \tag{A21}
\]
Furthermore, by the same process as in (A13), for $j\in\mathcal{A}^c$,
\[
|\delta_j^m|\le\|\delta^m-\delta^0\|_\infty\le\|\delta-\delta^0\|_\infty\le\|\delta-\hat\delta^{or}\|_\infty+\|\delta^0-\hat\delta^{or}\|_\infty\le 2\|\delta-\hat\delta^{or}\|_\infty\le 2t_n. \tag{A22}
\]
Let $t_n=o(1)$. Then $\rho'(4t_n)\to 1$ and $\rho'(2t_n)\to 1$.
Therefore, by (A5), (A10) and (A14),
\[
\Gamma_1+\Gamma_2\ge\sum_{k=1}^K\sum_{i,j\in G_k,\,i<j}\big[\lambda_2\rho'(4t_n)-|G_{\min}|^{-1}\max_{i,j}|w_j-w_i|\big]\,|\alpha_i-\alpha_j|\ge 0. \tag{A23}
\]
And by (A15), (A17) and (A21),
\[
\Gamma_3+\Gamma_4\ge\sum_{j\in\mathcal{A}^c}\big[\lambda_1\rho'(2t_n)-\max_j|v_j|\big]\,|\delta_j|\ge 0. \tag{A24}
\]
Therefore, for sufficiently large $n$,
\[
Q_n(\alpha,\delta,\gamma)-Q_n(\alpha^0,\delta^0,\gamma)=\Gamma_1+\Gamma_2+\Gamma_3+\Gamma_4\ge 0,
\]
so that the result in (ii) is proved.