
Comparative study of computational algorithms for the Lasso with high-dimensional, highly correlated data

Applied Intelligence

Abstract

Variable selection is important in high-dimensional data analysis. Lasso regression is useful because it combines sparsity, a soft-decision rule, and computational efficiency. However, since the Lasso penalized likelihood contains a nondifferentiable term, standard optimization tools cannot be applied. Many computational algorithms for optimizing the Lasso penalized likelihood in high-dimensional settings have been proposed, including the coordinate descent (CD) algorithm, majorization-minimization (MM) using local quadratic approximation, the fast iterative shrinkage-thresholding algorithm (FISTA), and the alternating direction method of multipliers (ADMM). In this paper, we undertake a comparative study that analyzes the relative merits of these algorithms, with particular attention to numerical sensitivity to the correlation between the covariates. We conduct a simulation study considering factors that affect the condition number of the covariance matrix of the covariates, as well as the level of penalization. We apply the algorithms to cancer biomarker discovery and compare convergence speed and stability.
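
As a concrete illustration of the soft-decision rule that the coordinate descent algorithm exploits, the following Python sketch implements plain cyclic coordinate descent for the Lasso with the soft-thresholding operator. It is a minimal sketch for exposition, not the implementation benchmarked in the paper; the function names and the assumption that the columns of X are standardized are ours.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: the closed-form solution of the
    one-dimensional Lasso problem, sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/(2n)) * ||y - X beta||^2 + lam * ||beta||_1.
    Illustrative only; assumes the columns of X are (roughly) standardized."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n               # (1/n) * x_j' x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]  # partial residual excluding x_j
            rho = X[:, j] @ r_j / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
        # (no convergence check here; a real implementation would monitor the change in beta)
    return beta
```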



References

  1. Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Second International Symposium on Information Theory, pp 267–281

  2. Beck A, Teboulle M (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci 2(1):183–202. doi:10.1137/080716542

  3. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc Ser B 57(1):289–300


  4. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122


  5. Cagle PT, Allen TC, Olsen RJ (2013) Lung cancer biomarkers: present status and future developments. Arch Pathol Labor Med 137(9):1191–1198


  6. Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499


  7. El-Telbany A, Ma PC (2012) Cancer genes in lung cancer: racial disparities: are there any? Genes Cancer 3:467–480


  8. Friedman J, Hastie T, Hofling H, Tibshirani R (2007) Pathwise coordinate optimization. Ann Appl Stat 1(2):307–332


  9. Gemmeke JF, Hamme HV, Cranen B, Boves L (2010) Compressive sensing for missing data imputation in noise robust speech recognition. J Sel Topics Signal Process 4(2):272–287


  10. Huang DW, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4(1):44–57


  11. Hunter DR, Lange K (2000) Quantile regression via an MM algorithm. J Comput Graph Stat 9(1):60–77

  12. Hunter DR, Li R (2005) Variable selection using MM algorithms. Ann Stat 33(4):1617–1642


  13. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 31(4):e15


  14. Jemal A, Siegel R, Xu J, Ward E (2010) Cancer statistics, 2010. CA Cancer J Clin 60(5):277–300


  15. Kati C, Alacam H, Duran L, Guzel A, Akdemir HU, Sisman B, Sahin C, Yavuz Y, Altintas N, Murat N, Okuyucu A (2014) The effectiveness of the serum surfactant protein D (SP-D) level to indicate lung injury in pulmonary embolism. Clin Lab 60(9):1457–1464


  16. Parikh N, Boyd S (2013) Proximal algorithms. Found Trends Optim 1(3):123–231


  17. Peng J, Wang P, Zhou N, Zhu J (2012) Partial correlation estimation by joint sparse regression models. J Am Stat Assoc

  18. Pounds S, Morris SW (2003) Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19(10):1236–1242


  19. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464


  20. Shedden K, Taylor JM, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, Chang AC, Zhu CQ, Strumpf D, Hanash S, Shepherd FA, Ding K, Seymour L, Naoki K, Pennell N, Weir B, Verhaak R, Ladd-Acosta C, Golub T, Gruidl M, Sharma A, Szoke J, Zakowski M, Rusch V, Kris M, Viale A, Motoi N, Travis W, Conley B, Seshan VE, Meyerson M, Kuick R, Dobbin KK, Lively T, Jacobson JW, Beer DG (2008) Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 14(8):822–827


  21. Shewchuk JR (1994) An introduction to the conjugate gradient method without the agonizing pain. Carnegie Mellon University, Pittsburgh, PA


  22. Tang H, Xiao G, Behrens C, Schiller J, Allen J, Chow CW, Suraokar M, Corvalan A, Mao J, White MA, Wistuba II, Minna JD, Xie Y (2013) A 12-gene set predicts survival benefits from adjuvant chemotherapy in non-small cell lung cancer patients. Clin Cancer Res 19(6):1577–1586


  23. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B 58:267–288


  24. Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ (2012) Strong rules for discarding predictors in lasso-type problems. J Royal Stat Soc Ser B (Stat Methodol) 74(2):245–266


  25. Woenckhaus M, Klein-Hitpass L, Grepmeier U, Merk J, Pfeifer M, Wild P, Bettstetter M, Wuensch P, Blaszyk H, Hartmann A, et al. (2006) Smoking and cancer-related gene expression in bronchial epithelium and non-small-cell lung cancers. J Pathol 210(2):192–204


  26. Wright J, Yang AY, Ganesh A, Sastry S, Ma Y (2009) Robust face recognition via sparse representation. IEEE Trans Pattern Anal Mach Intell 31(2):210–227


  27. Wu TT, Lange K (2008) Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2(1):224–244


  28. Yang AY, Zhou Z, Ganesh A, Shankar SS, Ma Y (2013) Fast l1-minimization algorithms for robust face recognition. IEEE Trans Image Process 22(8)


  29. Yu D, Son W, Lim J, Xiao G (2015) Statistical completion of partially identified graph with application to estimation of gene regulatory network. Biostatistics 16(4):670–685



Acknowledgments

Donghyeon Yu was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP, No. 2015R1C1A1A02036312). Joong-Ho Won was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP, Nos. 2013R1A1A1057949 and 2014R1A4A1007895).

Author information

Corresponding author

Correspondence to Joong-Ho Won.

Appendix

1.1 A preconditioned conjugate gradient (PCG) method

The conjugate gradient (CG) method solves positive definite linear systems Ax = b and is typically applied to sparse systems that are too large to solve by Cholesky decomposition. Instead of solving the linear system directly, CG minimizes the function f(x),

$$f(x) = \frac{1}{2}x^{T}Ax -b^{T}x. $$

For a positive definite A, two nonzero vectors u, v are said to be conjugate with respect to A if they satisfy

$$\langle u,v\rangle_{A} \triangleq u^{T}Av=0. $$

Define the set P of n mutually conjugate direction vectors,

$$P=\left\lbrace p_{k} : \forall i \neq k, \langle p_{i},p_{k}\rangle_{A}=0 \right\rbrace, $$

Then P forms a basis of \(\mathbb{R}^{n}\), and x can be represented in the form

$$x=\sum\limits_{i=1}^{n} \alpha_{i} p_{i}. $$

Multiplying both sides by the matrix A decomposes b as

$$b=Ax=\sum\limits_{i=1}^{n} \alpha_{i} Ap_{i}. $$

Multiplying on the left by an arbitrary direction vector \(p_{k} \in P\) gives

$${p_{k}^{T}}b={p_{k}^{T}}Ax=\sum\limits_{i=1}^{n} \alpha_{i} {p_{k}^{T}} A p_{i} = \alpha_{k} {p_{k}^{T}} A p_{k}. $$

Accordingly, the explicit form of \(\alpha_{k}\) is derived as follows:

$$\alpha_{k}=\frac{{p_{k}^{T}}b}{{p_{k}^{T}}Ap_{k}}=\frac{\langle p_{k},b\rangle} {\|p_{k}\|_{A}^{2}}. $$

If mutually conjugate direction vectors are not given in advance, the conjugate gradient method constructs them iteratively. Set \(x_{0}\) as an initial value of x; then the linear system

$$Az=b-Ax_{0} $$

becomes the target to solve. Regarding \(r_{k} = b - Ax_{k}\) as the k-th residual, \(r_{k}\) is the negative gradient of the convex function f at \(x = x_{k}\), since

$$\nabla f(x_{k}) = Ax_{k} -b, $$

which means that the conjugate gradient method moves in the direction of \(r_{k}\). Since all direction vectors must be mutually conjugate with respect to A, the k-th direction \(p_{k}\) is given by

$$p_{k}=r_{k}-\sum\limits_{i>k}\frac{{p_{i}^{T}}Ar_{k}}{{p_{i}^{T}}Ap_{i}}p_{i}. $$

Following this direction, the next iterate of x is updated as

$$x_{k+1}=x_{k} +\alpha_{k} p_{k}, $$

where

$$\alpha_{k}=\frac{{p_{k}^{T}}b}{{p_{k}^{T}}Ap_{k}}=\frac{{p_{k}^{T}}r_{k-1}}{{p_{k}^{T}}Ap_{k}}. $$
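
The iteration above can be written compactly. Below is a minimal Python sketch of the conjugate gradient method for a symmetric positive definite A; it uses the standard two-term recursion for the search direction, which is equivalent in exact arithmetic to the full sum over previous directions given above. Names and defaults are illustrative, not from the paper.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, max_iter=None):
    """Minimal CG sketch for A x = b with A symmetric positive definite."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                      # residual = negative gradient of f at x
    p = r.copy()                       # first search direction
    rs_old = r @ r
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # step size alpha_k
        x += alpha * p                 # x_{k+1} = x_k + alpha_k p_k
        r -= alpha * Ap                # updated residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:      # stop when the residual is small
            break
        p = r + (rs_new / rs_old) * p  # next direction, A-conjugate to the previous ones
        rs_old = rs_new
    return x
```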

The convergence rate of the conjugate gradient method depends on the condition number of A and, more specifically, on the distribution of the eigenvalues of A [21]. Accordingly, the problem Ax = b can be replaced by the equivalent preconditioned system obtained by multiplying both sides by the inverse of a preconditioner M,

$$M^{-1}Ax=M^{-1}b. $$

An appropriate preconditioner should satisfy the following conditions:

  • M is a symmetric and positive definite matrix.

  • \(M^{-1}A\) is well conditioned and has few extreme eigenvalues.

  • Mx = b is easy to solve.

Widely used preconditioners that satisfy these conditions include the following (a PCG code sketch follows the list):

  1. Diagonal: \(M=\text{diag}(1/A_{11},\ldots,1/A_{nn})\),

  2. Incomplete (approximate) Cholesky factorization: \(M=\hat{A}^{-1}\), where \(\hat{A}=\hat{L}\hat{L}^{T}\).
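
For illustration, the following Python sketch adds preconditioning to the CG iteration. The preconditioner is passed as a function M_inv that applies the inverse of M to a vector; the example at the bottom uses the diagonal (Jacobi) choice. Function names, the test matrix, and the tolerance are ours, not taken from the paper.

```python
import numpy as np

def pcg(A, b, M_inv, x0=None, tol=1e-8, max_iter=None):
    """Minimal preconditioned CG sketch; M_inv(v) applies the preconditioner's inverse."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                      # residual
    z = M_inv(r)                       # preconditioned residual
    p = z.copy()
    rz_old = r @ z
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz_old) * p  # direction update with preconditioned residual
        rz_old = rz_new
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 20))
    A = X.T @ X + 0.1 * np.eye(20)              # a symmetric positive definite test matrix
    b = rng.standard_normal(20)
    d = np.diag(A)
    x_hat = pcg(A, b, M_inv=lambda v: v / d)    # Jacobi (diagonal) preconditioning
    print(np.linalg.norm(A @ x_hat - b))        # should be close to zero
```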



Cite this article

Kim, B., Yu, D. & Won, JH. Comparative study of computational algorithms for the Lasso with high-dimensional, highly correlated data. Appl Intell 48, 1933–1952 (2018). https://doi.org/10.1007/s10489-016-0850-7
