Abstract
As powerful tools, machine learning and data mining techniques have been widely applied in many areas. However, in many real-world applications, besides building accurate black-box predictors, we are also interested in white-box mechanisms, such as discovering predictive patterns in data that enhance our understanding of underlying physical, biological, and other natural processes. For these purposes, sparse representation and its variants have been a central focus. More recently, structural sparsity has attracted increasing attention. In previous research, structural sparsity was often achieved by imposing convex but non-smooth norms such as the \({\ell _{2}/\ell _{1}}\) and group \({\ell _{2}/\ell _{1}}\) norms. In this paper, we present explicit \({\ell _2/\ell _0}\) and group \({\ell _2/\ell _0}\) norms to enforce structural sparsity directly. To tackle the resulting intractable \({\ell _2/\ell _0}\) optimization problems, we develop a general Lipschitz auxiliary function that leads to simple iterative algorithms. In each iteration, an optimal solution of the induced subproblem is obtained, and convergence is guaranteed. Furthermore, the local convergence rate is also theoretically bounded. We test our optimization techniques on the multitask feature learning problem. Experimental results suggest that our approaches outperform competing methods on both synthetic and real-world data sets.
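To make the distinction concrete, the following minimal sketch (an illustration, not the paper's algorithm) computes the two row-sparsity measures contrasted in the abstract for a weight matrix \(W\) whose rows correspond to features and columns to tasks: the convex \(\ell _{2}/\ell _{1}\) norm sums the Euclidean norms of the rows, while the explicit \(\ell _{2}/\ell _{0}\) norm counts the nonzero rows. The function names and the tolerance parameter are assumptions introduced here for illustration.

```python
import numpy as np

def l2_row_norms(W):
    # Euclidean norm of each row of W (features x tasks)
    return np.sqrt((W ** 2).sum(axis=1))

def l21_norm(W):
    # Convex surrogate: sum of row-wise l2 norms
    return l2_row_norms(W).sum()

def l20_norm(W, tol=1e-12):
    # Explicit structural sparsity: number of rows that are not (numerically) zero
    return int((l2_row_norms(W) > tol).sum())

W = np.array([[0.0, 0.0, 0.0],   # feature shared by no task
              [3.0, 4.0, 0.0],   # row norm 5
              [0.0, 0.0, 2.0]])  # row norm 2

print(l21_norm(W))  # 7.0
print(l20_norm(W))  # 2
```

Minimizing the \(\ell _{2}/\ell _{0}\) count directly is combinatorial, which is why prior work relied on the convex \(\ell _{2}/\ell _{1}\) surrogate; the paper's contribution is an auxiliary-function scheme that handles the \(\ell _{2}/\ell _{0}\) objective itself.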
Acknowledgments
This research was partially supported by NSF CCF-0830780, 0917274, NSF DMS-0915228, NSF CNS-0923494, 1035913.
Cite this article
Luo, D., Ding, C. & Huang, H. Toward structural sparsity: an explicit \(\ell _{2}/\ell _0\) approach. Knowl Inf Syst 36, 411–438 (2013). https://doi.org/10.1007/s10115-012-0545-2