Abstract
Zero-inflated explanatory variables, as opposed to outcome variables, are common, for example, in environmental sciences. In this article, we address the problem of an excess of zero values in some continuous explanatory variables that are subject to multi-outcome lasso-regularized variable selection. In short, the problem arises because lasso-type shrinkage methods do not recognize any difference between a zero value occurring in a regression coefficient and one occurring in the corresponding value of the explanatory variable. This kind of confounding inflates the number of false positives: not all non-zero regression coefficients necessarily represent true outcome effects. We present here the adaptive LAD-lasso for multiple outcomes, which extends the earlier work on the multi-outcome LAD-lasso with adaptive penalization. In addition to the well-known property of producing less biased regression coefficients, we show that adaptivity also improves the method's ability to recover from the influence of excess zero values measured in continuous covariates.
1 Introduction
In high-dimensional regression problems, the number of parameters is often larger than the number of individuals in the sample ($p > n$). Ordinary least squares and other conventional estimation methods do not work in such situations. Simultaneous estimation and variable selection using the lasso (least absolute shrinkage and selection operator) (Tibshirani Citation1996; Li and Sillanpää Citation2012) is a popular shrinkage-estimation approach for obtaining sparse estimates of regression coefficients in high-dimensional regression problems, with desirable statistical properties (Fan and Li Citation2001; Fan and Peng Citation2004). However, a generally known drawback of shrinkage-inducing methods, including the lasso, is that they improve estimation accuracy but also introduce downward bias into the estimates (due to the bias-variance tradeoff). To alleviate this problem, a re-estimation procedure has been proposed, where the effects of the selected predictors are re-estimated without penalty (Efron et al. Citation2004; Meinshausen Citation2007). The adaptive lasso has also been proposed (Zou Citation2006), in which the selected predictors are subject to a reduced penalty and the non-selected positions obtain a heavier penalty. This is performed using either a two-stage or an iterative estimation strategy. See also the related work on the thresholded lasso (Zhou Citation2010; Van De Geer, Bühlmann, and Zhou Citation2011). One would expect to see larger variance as a consequence of the reduced bias in the adaptive lasso. However, it has been observed that the opposite can happen in the adaptive lasso context, meaning that we may see improved accuracy together with reduced bias, especially in the high-dimensional situation (Huang, Ma, and Zhang Citation2008).
When the data set contains some outlying observations, it is possible to apply robust least absolute deviation (LAD) regression instead of ordinary regression. In the lasso context, the LAD-lasso has been proposed for the univariate (Wang, Li, and Jiang Citation2007) and multi-outcome cases (Möttönen and Sillanpää Citation2015; Li, Möttönen, and Sillanpää Citation2015). The multi-outcome LAD-lasso is related to the group lasso (Yuan and Lin Citation2006) because the objective function of the multi-outcome LAD-lasso also contains a group lasso penalty. Like other shrinkage-inducing methods, LAD-lasso methods also suffer from downward bias in the estimates. To alleviate this, an adaptive LAD-lasso has been proposed for the univariate case (Arslan Citation2012). In this article, we study the multi-outcome adaptive LAD-lasso and make comparisons to the non-adaptive or standard version, which we refer to as the ordinary LAD-lasso. Note that the adaptive group lasso has been presented by Wang and Leng (Citation2008), and it has already been used in an empirical application modeling old-age retirement (Lähderanta et al. Citation2022).
In our earlier work with empirical data, we identified a potential problem in lasso methods in general when there is an excess of true zero values in some covariates (Lähderanta et al. Citation2022). Here we describe this problem in more detail for the LAD-lasso and provide some potential solutions using a controlled simulation and a real data example. In short, the problem occurs because lasso-type shrinkage methods do not recognize any difference between a zero value occurring in the regression coefficient and one occurring in the corresponding value of the covariate. As a result, there is a danger of misinterpretation, because not all non-zero regression coefficients necessarily represent true effects on the outcomes. As a potential solution, we show empirically how the use of the adaptive LAD-lasso minimizes this kind of confounding in practice.
In this article, we show the potential benefits of the multi-outcome adaptive LAD-lasso method with three examples. In the first example, we utilize a public genotype data set (Crooks et al. Citation2009) to simulate new phenotypes with tri-variate traits. In the second example, we simulate several general data sets with zero-inflated (excess of zeros) values in continuous explanatory variables, which are typical in many fields of science. Examples of zero-inflated data can be found in environmental sciences (Maldonado, Aguilera, and Salmerón Citation2016; Kamarianakis et al. Citation2008). We demonstrate that this problem can be mitigated with the adaptive version of the lasso. In the third example, we use a real-life data set to study the main socioeconomic factors that affect the pensions and lengths of working lives of disability pension retirees, both of which have highly skewed outcome distributions.
2 Methodology
In this section, we present the multi-outcome ordinary LAD-lasso (Möttönen and Sillanpää Citation2015; Li, Möttönen, and Sillanpää Citation2015) and the novel adaptive version of the method.
2.1 Multi-outcome ordinary LAD-lasso
Consider the multiple regression model with multiple outcomes
$$ \mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, $$
where $\mathbf{Y}$ is an $n \times q$ matrix of observed values of $q$ outcome variables, $\mathbf{X}$ is an $n \times (p+1)$ matrix of observed values of $p$ explanatory variables (the first column corresponding to the intercept), $\mathbf{B}$ is a $(p+1) \times q$ matrix of regression coefficients, and $\mathbf{E}$ is an $n \times q$ matrix of residuals. We further assume that the rows $\boldsymbol{\varepsilon}_1, \ldots, \boldsymbol{\varepsilon}_n$ of $\mathbf{E}$ form a random sample of size $n$ from a $q$-variate distribution centered at the origin.
The multi-outcome lasso estimation method is based on the penalized objective function
$$ D_\lambda(\mathbf{B}) = \sum_{i=1}^{n} \|\mathbf{y}_i - \mathbf{B}^\top\mathbf{x}_i\|^2 + \lambda \sum_{j=1}^{p} \|\boldsymbol{\beta}_j\| \tag{1} $$
where $\mathbf{y}_i$ and $\mathbf{x}_i$ are the $i$th rows of $\mathbf{Y}$ and $\mathbf{X}$, respectively, $\boldsymbol{\beta}_j$ is the row of $\mathbf{B}$ corresponding to the $j$th explanatory variable, and $\|\cdot\|$ denotes the Euclidean norm
(e.g., Turlach et al. Citation2005; Yuan and Lin Citation2006; Yuan et al. Citation2007). The minimizer of the objective function (1) gives the multi-outcome lasso estimate for the regression coefficient matrix $\mathbf{B}$. Note that Equation (1) is also an objective function of the group lasso, where each of the $p$ explanatory variables (the intercept terms are omitted) has $q$ regression coefficients (one for each outcome variable), which form a group.
The multi-outcome lasso method gives sparse solutions, but it is obviously not very robust. A more robust version of the multi-outcome lasso, the multi-outcome LAD-lasso (Möttönen and Sillanpää Citation2015; Li, Möttönen, and Sillanpää Citation2015), can be obtained by replacing the squared norm in the objective function (1) with the unsquared (L1-type) norm:
$$ D_\lambda(\mathbf{B}) = \sum_{i=1}^{n} \|\mathbf{y}_i - \mathbf{B}^\top\mathbf{x}_i\| + \lambda \sum_{j=1}^{p} \|\boldsymbol{\beta}_j\|. \tag{2} $$
It is easily seen that if we define
$$ \mathbf{x}_{n+j} = \lambda \mathbf{e}_j \quad \text{and} \quad \mathbf{y}_{n+j} = \mathbf{0}, \qquad j = 1, \ldots, p, $$
where $\mathbf{e}_j$ is a unit vector with the element corresponding to the $j$th explanatory variable equal to one and all the other elements zero, and $\mathbf{0}$ is a $q$-vector of zeros, the objective function (2) reduces to the LAD estimation objective function (Oja Citation2010) of the form
$$ D_\lambda(\mathbf{B}) = \sum_{i=1}^{n+p} \|\mathbf{y}_i - \mathbf{B}^\top\mathbf{x}_i\|, \tag{3} $$
which shows that we can use any multi-outcome LAD regression estimation routine to find the multi-outcome LAD-lasso estimates. One can, for example, use the function mv.l1lm of the R package MNM (Nordhausen, Möttönen, and Oja Citation2016; Nordhausen and Oja Citation2011).
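To make the augmentation concrete, the following Python sketch builds the extended data and solves the resulting LAD problem with an iteratively reweighted least-squares (Weiszfeld-type) loop. This is our own illustrative construction, not the paper's implementation (the paper uses mv.l1lm from the R package MNM); the function name and the smoothing constant are our choices.

```python
import numpy as np

def mv_lad_lasso(Y, X, lam, iters=200, eps=1e-8):
    """Multi-outcome LAD-lasso via data augmentation:
    append p pseudo-observations (y = 0, x = lam * e_j) so that each
    penalty term lam * ||beta_j|| becomes an ordinary LAD residual,
    then minimize the sum of residual row norms by iteratively
    reweighted least squares (a Weiszfeld-type MM scheme)."""
    n, p = X.shape
    q = Y.shape[1]
    X1 = np.hstack([np.ones((n, 1)), X])                       # add intercept column
    Xa = np.vstack([X1, np.hstack([np.zeros((p, 1)), lam * np.eye(p)])])
    Ya = np.vstack([Y, np.zeros((p, q))])                      # pseudo-responses are zero
    B = np.linalg.lstsq(Xa, Ya, rcond=None)[0]                 # least-squares start
    for _ in range(iters):
        R = Ya - Xa @ B
        w = 1.0 / np.maximum(np.linalg.norm(R, axis=1), eps)   # weight 1/||r_i||
        XW = Xa * w[:, None]
        B = np.linalg.solve(Xa.T @ XW, XW.T @ Ya)              # weighted LS update
    return B                                                   # (p+1) x q, row 0 = intercept
```

With a large $\lambda$ the pseudo-observations dominate and whole coefficient rows shrink to zero, which is exactly the group-lasso behavior of the penalty.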
2.2 Multi-outcome adaptive LAD-lasso
The multi-outcome LAD-lasso estimate for a fixed $\lambda$ can be defined as
$$ \hat{\mathbf{B}}(\lambda) = \operatorname*{argmin}_{\mathbf{B}} \Bigl\{ \sum_{i=1}^{n} \|\mathbf{y}_i - \mathbf{B}^\top\mathbf{x}_i\| + \lambda \sum_{j=1}^{p} \|\boldsymbol{\beta}_j\| \Bigr\}, $$
where argmin stands for "the value minimizing". This brings up the question of how to choose the tuning parameter $\lambda$. If we are mainly concerned about recovering the right model, then we can use, for example, Akaike's information criterion (AIC) or the Bayesian information criterion (BIC). On the other hand, if we are mainly concerned about prediction accuracy, then cross-validation (CV) is often a good choice. For a general survey on CV methods, see Arlot and Celisse (Citation2010). We continue this study with a CV criterion, as robust AIC or BIC are not readily available in the multi-outcome LAD-lasso context. We have used fivefold cross-validation, where the original sample of $n$ observations is randomly partitioned into five approximately equal-sized subsamples. Let $I_k$ denote the set of indices of the observations that belong to the $k$th subsample. The CV criterion is then defined as the mean absolute error over the five subsamples,
$$ \mathrm{CV}(\lambda) = \frac{1}{5}\sum_{k=1}^{5} \frac{1}{n_k} \sum_{i \in I_k} \bigl\| \mathbf{y}_i - \hat{\mathbf{B}}_{(-k)}(\lambda)^\top \mathbf{x}_i \bigr\|, $$
where $n_k$ is the number of observations in the $k$th subsample and $\hat{\mathbf{B}}_{(-k)}(\lambda)$ is the multi-outcome LAD-lasso estimate computed using all but the observations in the $k$th subsample. For a BIC-like criterion in the multi-outcome LAD-lasso context, see Möttönen and Sillanpää (Citation2015).
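A generic version of this fivefold criterion is easy to write down. The Python sketch below is our own illustration: the fold splitting and the mean held-out residual norm follow the criterion above, while a closed-form ridge fit stands in for the LAD-lasso fit purely to keep the example self-contained and fast.

```python
import numpy as np

def cv_criterion(Y, X, fit, lams, K=5, seed=1):
    """Fivefold CV for a multi-outcome penalized fit: the score for
    each lambda is the mean Euclidean norm of the held-out residual
    vectors, averaged over the K folds."""
    n = Y.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)
    scores = []
    for lam in lams:
        fold_err = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            B = fit(Y[train], X[train], lam)          # fit without fold k
            R = Y[test] - X[test] @ B                 # held-out residuals
            fold_err.append(np.mean(np.linalg.norm(R, axis=1)))
        scores.append(np.mean(fold_err))
    return np.asarray(scores)

def ridge_fit(Y, X, lam):
    # closed-form ridge fit as a stand-in for the LAD-lasso fit (assumption)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
B_true = np.zeros((5, 2))
B_true[0] = [1.0, -1.0]
Y = X @ B_true + 0.5 * rng.normal(size=(50, 2))
lams = [0.01, 0.1, 1.0, 10.0]
scores = cv_criterion(Y, X, ridge_fit, lams)
lam_hat = lams[int(np.argmin(scores))]   # tuning value minimizing the CV criterion
```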
Let
$$ \hat{\lambda} = \operatorname*{argmin}_{\lambda} \mathrm{CV}(\lambda) $$
be the value of the tuning parameter which minimizes the cross-validation criterion function for the multi-outcome LAD-lasso. The multi-outcome LAD-lasso estimate can then be defined as $\hat{\mathbf{B}} = \hat{\mathbf{B}}(\hat{\lambda})$.
It has been shown that lasso estimation tends to underestimate the regression coefficients, and the same is true for the multi-outcome LAD-lasso estimate $\hat{\mathbf{B}}$. Zou (Citation2006) proposed the adaptive lasso method, which gives an estimate whose bias is smaller than that of the ordinary lasso. Similarly, the adaptive method can also be used in the LAD-lasso (Arslan Citation2012). We extend this method here to the multi-outcome LAD-lasso case. Since the performance of the adaptive LAD-lasso might be sensitive to the initial weights, we propose an iterative $M$-step adaptive estimation method. The $m$th ($m = 1, \ldots, M$) step multi-outcome adaptive LAD-lasso estimate for fixed $\lambda$ is defined as
$$ \hat{\mathbf{B}}^{(m)}(\lambda) = \operatorname*{argmin}_{\mathbf{B}} \Bigl\{ \sum_{i=1}^{n} \|\mathbf{y}_i - \mathbf{B}^\top\mathbf{x}_i\| + \lambda \sum_{j=1}^{p} w_j^{(m)} \|\boldsymbol{\beta}_j\| \Bigr\}, $$
where
$$ w_j^{(m)} = \frac{1}{\|\hat{\boldsymbol{\beta}}_j^{(m-1)}\|} $$
is the weight based on the $j$th row $\hat{\boldsymbol{\beta}}_j^{(m-1)}$ of the $(m-1)$th step multi-outcome LAD-lasso estimate $\hat{\mathbf{B}}^{(m-1)}(\lambda)$, and $m = 1, \ldots, M$. The multi-outcome LAD-lasso estimate is used as the initial value $\hat{\mathbf{B}}^{(0)}(\lambda)$. It can be easily shown that the $m$th step multi-outcome adaptive LAD-lasso estimate for fixed $\lambda$ can be written in the form
$$ \hat{\mathbf{B}}^{(m)}(\lambda) = \operatorname*{argmin}_{\mathbf{B}} \Bigl\{ \sum_{i=1}^{n} \|\mathbf{y}_i - \mathbf{B}^\top\mathbf{x}_i\| + \sum_{j=1}^{p} \bigl\| \mathbf{0} - \mathbf{B}^\top (\lambda w_j^{(m)} \mathbf{e}_j) \bigr\| \Bigr\}, $$
which further implies that it can be written in the multi-outcome LAD regression form
$$ \hat{\mathbf{B}}^{(m)}(\lambda) = \operatorname*{argmin}_{\mathbf{B}} \sum_{i=1}^{n+p} \|\mathbf{y}_i^{*} - \mathbf{B}^\top\mathbf{x}_i^{*}\|, \tag{4} $$
where $\mathbf{y}_i^{*} = \mathbf{y}_i$ and $\mathbf{x}_i^{*} = \mathbf{x}_i$ for $i = 1, \ldots, n$, and $\mathbf{y}_{n+j}^{*} = \mathbf{0}$ and $\mathbf{x}_{n+j}^{*} = \lambda w_j^{(m)} \mathbf{e}_j$ for $j = 1, \ldots, p$.
The estimate can then be found using the function mv.l1lm of the R-package MNM (Nordhausen, Möttönen, and Oja Citation2016; Nordhausen and Oja Citation2011).
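The iterative reweighting itself is only a few lines. In this Python sketch (our own construction; a weighted ridge fit stands in for the weighted multi-outcome LAD-lasso fit, and the intercept is omitted for simplicity), step 0 is the ordinary fit with unit weights, and each subsequent step penalizes covariate $j$ in proportion to the reciprocal norm of its previous coefficient row:

```python
import numpy as np

def ridge_weighted(Y, X, lam, w):
    # weighted ridge as a stand-in for the weighted LAD-lasso fit (assumption)
    return np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ Y)

def adaptive_steps(Y, X, lam, fit_weighted, M=5, eps=1e-6):
    """M-step adaptive loop: at step m covariate j carries the penalty
    weight w_j = 1 / ||beta_j^(m-1)||, so rows that were large in the
    previous step are penalized less and small rows are penalized more."""
    p = X.shape[1]
    B = fit_weighted(Y, X, lam, np.ones(p))        # step 0: ordinary (unweighted) fit
    for _ in range(M):
        w = 1.0 / np.maximum(np.linalg.norm(B, axis=1), eps)
        B = fit_weighted(Y, X, lam, w)
    return B

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
Y = np.outer(X[:, 0], [2.0, 2.0]) + 0.3 * rng.normal(size=(60, 2))
B = adaptive_steps(Y, X, lam=5.0, fit_weighted=ridge_weighted, M=5)
```

In the paper each weighted fit is itself a multi-outcome LAD regression on the augmented data, with the pseudo-observations rescaled by $\lambda w_j^{(m)}$ at every step.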
3 Simulation studies
In this section, we present two simulation experiments with simulated data sets to provide insights into the multi-outcome adaptive LAD-lasso in certain scenarios. R code for the simulations can be found on GitHub (Lähderanta and Möttönen Citation2021).
3.1 Bias reduction
The first experiment stems from medical sciences, where single- and multi-trait quantitative trait locus (QTL) mapping is commonly applied. In QTL mapping, for each individual in the population sample, we have quantitative trait phenotypes and genotype patterns measured in a high number of marker loci distributed over the genome. Based on these data, the target is to find small subsets of trait-associated loci out of many candidates. The trait-associated findings, as well as the true causative loci, are both interchangeably called QTLs. More details of QTL mapping can be found in Balding (Citation2006) and Li and Sillanpää (Citation2012).
As a first step, we simulated new phenotypes with tri-variate traits using the public genotype data set from the 12th QTL-MAS workshop in Uppsala, Sweden, in 2008 (Crooks et al. Citation2009). The original genotype data set consists of 5865 individuals and 6000 markers. We took a random sample of 300 individuals and chose every 30th marker, for a resulting total of 200 markers. We then simulated tri-variate traits by using the multi-outcome multiple regression model $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$, where $\mathbf{Y}$ is a $300 \times 3$ matrix of tri-variate traits, $\mathbf{X}$ is the design matrix containing the observed marker genotypes, $\mathbf{B}$ is a coefficient matrix in which the four QTLs appear as non-zero rows, and $\mathbf{E}$ is an error matrix with i.i.d. rows drawn from a tri-variate normal distribution.
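A simulation of this form can be sketched in Python. Note that the marker positions, effect sizes, genotype coding, and error covariance below are illustrative placeholders, not the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(2008)
n, p, q = 300, 200, 3                       # individuals, markers, traits
X = rng.integers(0, 3, size=(n, p)).astype(float)   # hypothetical 0/1/2 genotype codes
B = np.zeros((p, q))
for j in (24, 73, 121, 180):                # four hypothetical QTL positions
    B[j] = rng.normal(0.0, 1.0, size=q)     # a non-zero row of B = one QTL
Sigma = np.array([[1.0, 0.3, 0.3],
                  [0.3, 1.0, 0.3],
                  [0.3, 0.3, 1.0]])         # assumed error covariance of the traits
E = rng.multivariate_normal(np.zeros(q), Sigma, size=n)
Y = X @ B + E                               # simulated tri-variate traits
```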
We estimated the tuning parameter $\lambda$ by using fivefold cross-validation. The estimated marker effects for the ordinary LAD-lasso show that the method correctly finds all four QTLs.
In the next step, we studied the bias of the LAD-lasso estimates of the non-zero coefficient vectors. It is clear that the ordinary LAD-lasso estimates are biased in most cases. In the final step, we used the multi-outcome adaptive LAD-lasso estimation method. The results show that the biases decrease and the regression coefficients stabilize after a few adaptive steps. Small biases remain after five adaptive steps, but they are smaller than the biases corresponding to the ordinary LAD-lasso. Note that with the adaptive LAD-lasso the accuracy of the estimates has also improved on average.
The marker effects corresponding to the adaptive LAD-lasso with five steps confirm that the method correctly finds the true QTLs. Note that the true zeros indicated by the adaptive technique are not displayed.
3.2 Excess of zeros in covariates
In the next experiment, we simulate a different kind of data set to demonstrate the excess-of-zeros-in-covariates scenario with the multi-outcome adaptive LAD-lasso and the multi-outcome ordinary LAD-lasso.
Results from multiple data sets are shown to illustrate the performance of the algorithms in a wide variety of situations. In practice, we generate two types of data sets: one with a high number of observations compared to covariates ($n > p$) and, conversely, one with a high number of covariates compared to observations ($p > n$). Moreover, we vary the proportion of zeros $\pi$ in the covariates from 0.1 to 0.4 and apply two different distributions for the errors $\mathbf{E}$: uniform and asymmetric Laplace. When all the combinations of these properties are considered, 16 different scenarios are examined in total. In each scenario, we simulate 100 data sets to further evaluate the performance of the methods. The data sets and the varied proportions of zeros are summarized in the accompanying table.
In practice, the data sets are generated with the multi-outcome regression model $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$, with a covariate matrix $\mathbf{X}$ of size $n \times p$ and a response matrix $\mathbf{Y}$ of size $n \times q$. The observed covariate values are zero-inflated: each entry of $\mathbf{X}$ is drawn from a normal distribution with probability $1 - \pi$ and set to zero with probability $\pi$, where $\pi$ is the proportion of zeros. $\mathbf{B}$ is a $p \times q$ coefficient matrix whose non-zero rows indicate the true effects, and $\mathbf{E}$ is an $n \times q$ error matrix whose rows follow either a uniform or an asymmetric Laplace distribution, depending on the chosen error distribution.
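The zero-inflation mechanism for the covariates can be written in a few lines of Python (a standard normal is assumed here for the non-zero part; the exact distribution parameters used in the simulations may differ):

```python
import numpy as np

def zero_inflated_X(n, p, pi0, rng):
    """Draw covariates from N(0, 1), then set each entry to exactly
    zero with probability pi0 (the proportion of zeros)."""
    Z = rng.normal(size=(n, p))
    mask = rng.random(size=(n, p)) < pi0
    Z[mask] = 0.0
    return Z

rng = np.random.default_rng(0)
X = zero_inflated_X(1000, 10, 0.4, rng)
print(np.mean(X == 0.0))   # close to 0.4
```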
From the simulations, we can see that the adaptive LAD-lasso is superior to the ordinary LAD-lasso in every scenario when we compare the correctly identified zero coefficients. In one of the scenarios, the difference between the methods is somewhat smaller.
4 Empirical data example: retirement analysis
In this section, we apply the presented adaptive method to model retirement behavior. The data consist of disability pension retirees in the Finnish statutory earnings-related pension system for public-sector employees. A random sample of 3598 municipal employees was drawn from comprehensive administrative registers. The study design includes two outcome variables, the length of working life at retirement in 2017 (years) and the amount of the final pension (Eur/month), and 40 socioeconomic explanatory variables preceding retirement in 2005/2010–2016.
The distributions of both outcomes have long tails, both left and right, and no zeros. The explanatory variables include measures (spells) of social security benefits available to all employees – unemployment, occupational rehabilitation, long-term sickness, and occupational accident compensation (days per year) – and wages (Eur/year). The distributions of the explanatory variables have long right tails and an excess number of zeros, as some of the aforementioned benefits are rarely used. Note that, in practice, zero yearly wages follow from absence from work for different reasons, such as working in the private sector, being self-employed, or drawing social security benefits. Wages and pensions are transformed using the asinh transformation (Bellemare and Wichman Citation2020).
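The asinh (inverse hyperbolic sine) transformation is convenient here because, unlike the logarithm, it is defined at zero while behaving like $\log(2x)$ for large values, so the zero-heavy wage variables need no offset. A quick Python check (the wage values are made up):

```python
import numpy as np

wages = np.array([0.0, 500.0, 2500.0, 40000.0])     # hypothetical Eur/year values
t = np.arcsinh(wages)            # asinh(x) = log(x + sqrt(x^2 + 1))
print(t[0])                      # 0.0 -- zero wages remain finite (log would not)
print(np.arcsinh(40000.0) - np.log(2 * 40000.0))    # ~ 0 for large x
```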
The results indicate that the adaptive technique yields more non-selected variables for several social security benefits compared to the ordinary LAD-lasso. From a practical point of view and a variable selection perspective, it is important to separate which coefficients become truly zero and which are nearly so. We can also see that benefits drawn near retirement (2017) tend to have mild (in terms of the values of the estimates) yet meaningful effects on both outcomes. Occupational rehabilitation and unemployment benefits have a positive effect on work ability and working life, whereas long-term sickness indicates weakening work ability and a risk to the working life. On the other hand, drawing social security benefits in general indicates a negative effect on the final pension. Wages have a positive effect on pension, which is expected in an earnings-related pension system. The effects of yearly wages on the length of working life are somewhat mixed.
The spike in the estimate of the effect of wages in 2005 on pension is a result of the high correlation between wages in consecutive years. Such multicollinearity in regression analysis can generally lead to instability in the estimates and to a problem where the explained variance is allocated arbitrarily among the correlated variables (Farrar and Glauber Citation1967).
5 Concluding remarks
We have shown that the multi-outcome adaptive LAD-lasso is a versatile and robust tool, capable of reducing the bias of effect estimates as well as alleviating the problem of an excess of zero values in continuous covariates. The competitive performance of the adaptive version compared to the ordinary LAD-lasso has been illustrated with simulated and real-life examples.
The empirical data example illustrates the practical importance of a multi-outcome modeling strategy with highly skewed outcome distributions (including outliers) and a high share of zeros in the covariates, which is a common situation, for example, in social sciences analyzing complex socioeconomic phenomena. Researchers in other fields of science with high-dimensional study designs could certainly also benefit from the robust estimates provided by the multi-outcome adaptive LAD-lasso regression technique.
As future steps, we have considered multiple ways to improve the LAD-lasso modeling approach. First, it would be beneficial to develop a robust BIC criterion for the multi-outcome LAD-lasso context in order to take into account linear dependencies between outcomes. Second, as we demonstrated with the empirical data, multicollinearity in the explanatory variables can lead to some unintuitive estimates. To tackle this problem, some interesting approaches have been developed, such as the fused lasso (Tibshirani et al. Citation2005) and the clustering algorithm in regression via data-driven segmentation (CARDS) (Ke, Fan, and Wu Citation2015). Implementing these concepts in the LAD-lasso framework might benefit applications with correlated explanatory variables. Third, the inclusion of categorical variables in the adaptive LAD-lasso modeling approach would be intriguing in many applications, especially in life-course studies, where several categorical factors exist in the data sets. This requires a careful implementation of the coding scheme so that the re-coded variables are managed correctly in the adaptive algorithm (O’Grady and Medoff Citation1988). Last, the size of the data sets in many fields brings challenges to the estimation. Naturally, a more scalable algorithm is often needed to perform the estimation in a reasonable time.
Disclosure statement
No potential competing interest was reported by the authors.
References
- Arlot, S., and A. Celisse. 2010. A survey of cross-validation procedures for model selection. Statistics Surveys 4:40–79. doi: 10.1214/09-SS054.
- Arslan, O. 2012. Weighted LAD-lasso method for robust parameter estimation and variable selection in regression. Computational Statistics & Data Analysis 56 (6):1952–65. doi: 10.1016/j.csda.2011.11.022.
- Balding, D. J. 2006. A tutorial on statistical methods for population association studies. Nature Reviews Genetics 7:781–91.
- Bellemare, M. F., and C. J. Wichman. 2020. Elasticities and the inverse hyperbolic sine transformation. Oxford Bulletin of Economics and Statistics 82 (1):50–61. doi: 10.1111/obes.12325.
- Crooks, L., G. Sahana, D.-J. de Koning, M. Sandø Lund, and Ö. Carlborg. 2009. Comparison of analyses of the QTLMAS XII common dataset. II: Genome-wide association and fine mapping. BMC Proceedings 3 (Suppl 1):S2. doi: 10.1186/1753-6561-3-S1-S2.
- Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani. 2004. Least angle regression. The Annals of Statistics 32 (2):407–99. doi: 10.1214/009053604000000067.
- Fan, J., and R. Li. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96:1348–60. doi: 10.1198/016214501753382273.
- Fan, J., and H. Peng. 2004. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics 32:928–61. doi: 10.1214/009053604000000256.
- Farrar, D. E., and R. R. Glauber. 1967. Multicollinearity in regression analysis: The problem revisited. The Review of Economics and Statistics 49 (1):92–107.
- Huang, J., S. Ma, and C.-H. Zhang. 2008. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica 18:1603–18.
- Kamarianakis, Y., H. Feidas, G. Kokolatos, N. Chrysoulakis, and V. Karatzias. 2008. Evaluating remotely sensed rainfall estimates using nonlinear mixed models and geographically weighted regression. Environmental Modelling & Software 23 (12):1438–47.
- Ke, Z. T., J. Fan, and Y. Wu. 2015. Homogeneity pursuit. Journal of the American Statistical Association 110 (509):175–94. PMID: 26085701. doi: 10.1080/01621459.2014.892882.
- Lähderanta, T., and J. Möttönen. 2021. Multioutcome LAD-lasso. https://github.com/terolahderanta/multioutcome_lad_lasso.
- Lähderanta, T., J. Salonen, J. Möttönen, and M. J. Sillanpää. 2022. Modelling old-age retirement: An adaptive multi-outcome LAD-lasso regression approach. International Social Security Review 75 (1):3–29. doi: 10.1111/issr.12287.
- Li, Z., J. Möttönen, and M. J. Sillanpää. 2015. A robust multiple-locus method for quantitative trait locus analysis of non-normally distributed multiple traits. Heredity 115:556–64. doi: 10.1038/hdy.2015.61.
- Li, Z., and M. J. Sillanpää. 2012. Overview of lasso-related penalized regression methods for quantitative trait mapping and genomic selection. Theoretical and Applied Genetics 125 (3):419–35. doi: 10.1007/s00122-012-1892-9.
- Maldonado, A. D., P. A. Aguilera, and A. Salmerón. 2016. Modeling zero-inflated explanatory variables in hybrid Bayesian network classifiers for species occurrence prediction. Environmental Modelling & Software 82:31–43.
- Meinshausen, N. 2007. Relaxed lasso. Computational Statistics & Data Analysis 52 (1):374–93. doi: 10.1016/j.csda.2006.12.019.
- Möttönen, J., and M. J. Sillanpää. 2015. Robust variable selection and coefficient estimation in multivariate multiple regression using LAD-lasso. In Modern Nonparametric, Robust and Multivariate Methods, ed. K. Nordhausen and S. Taskinen. Cham: Springer. doi: 10.1007/978-3-319-22404-6_14.
- Nordhausen, K., J. Möttönen, and H. Oja. 2016. MNM: Multivariate nonparametric methods – An approach based on spatial signs and ranks. R package version 1.0-2. http://CRAN.R-project.org/package=MNM.
- Nordhausen, K., and H. Oja. 2011. Multivariate L1 methods: The package MNM. Journal of Statistical Software 43 (5):1–28. doi: 10.18637/jss.v043.i05.
- O’Grady, K. E., and D. R. Medoff. 1988. Categorical variables in multiple regression: Some cautions. Multivariate Behavioral Research 23 (2):243–60. doi: 10.1207/s15327906mbr2302_7.
- Oja, H. 2010. Multivariate nonparametric methods with R – An approach based on spatial signs and ranks. New York: Springer. doi: 10.1007/978-1-4419-0468-3.
- Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 58 (1):267–88.
- Tibshirani, R., M. Saunders, S. Rosset, J. Zhu, and K. Knight. 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1):91–108. doi: 10.1111/j.1467-9868.2005.00490.x.
- Turlach, B. A., W. N. Venables, and S. J. Wright. 2005. Simultaneous variable selection. Technometrics 47 (3):349–63. doi: 10.1198/004017005000000139.
- Van De Geer, S., P. Bühlmann, and S. Zhou. 2011. The adaptive and the thresholded lasso for potentially misspecified models (and a lower bound for the lasso). Electronic Journal of Statistics 5:688–749. doi: 10.1214/11-EJS624.
- Wang, H., and C. Leng. 2008. A note on adaptive group lasso. Computational Statistics & Data Analysis 52 (12):5277–86. doi: 10.1016/j.csda.2008.05.006.
- Wang, H., G. Li, and G. Jiang. 2007. Robust regression shrinkage and consistent variable selection through the LAD-lasso. Journal of Business & Economic Statistics 25:347–55. doi: 10.1198/073500106000000251.
- Yuan, M., A. Ekici, Z. Lu, and R. Monteiro. 2007. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (3):329–46.
- Yuan, M., and Y. Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68 (1):49–67. doi: 10.1111/j.1467-9868.2005.00532.x.
- Zhou, S. 2010. Thresholded lasso for high dimensional variable selection and statistical estimation. arXiv preprint arXiv:1002.1583.
- Zou, H. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101 (476):1418–29. doi: 10.1198/016214506000000735.