
Hierarchically penalized support vector machine with grouped variables

Original Article · International Journal of Machine Learning and Cybernetics

Abstract

When input features are naturally grouped or generated by factors in a linear classification problem, it is more meaningful to identify important groups or factors rather than individual features. The F-norm support vector machine (SVM) and the group lasso penalized SVM have been developed to perform simultaneous classification and factor selection. However, these group-wise penalized SVM methods may suffer from estimation inefficiency and model selection inconsistency because they cannot perform feature selection within an identified group. To overcome this limitation, we propose the hierarchically penalized SVM (H-SVM) that not only effectively identifies important groups but also removes irrelevant features within an identified group. Numerical results are presented to demonstrate the competitive performance of the proposed H-SVM over existing SVM methods.



Acknowledgments

The authors are grateful to the editor and the reviewers for their constructive and insightful comments and suggestions, which helped to dramatically improve the quality of this paper. This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by (1) the Ministry of Science, ICT and Future Planning (NRF-2013R1A1A1007536) for S. Bang and (2) the Ministry of Education (NRF-2013R1A1A2A10007545) for M. Jhun.

Author information

Correspondence to Eunkyung Kim.

Appendix: Proofs

Proof of Lemma 1

Let \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) denote the criterion that we would like to minimize in problem (2.5), let \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\,\,\varvec{\theta})\) denote the corresponding criterion in problem (2.6), and let \((\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\,\hat{\varvec{\theta }}^{*} )\) denote a local minimizer of \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\). We will prove that \((\hat{\varvec{\gamma }}^{\diamondsuit } = \lambda_{1} \hat{\varvec{\gamma }}^{*} ,\;\hat{\theta }_{0}^{\diamondsuit } \varvec{ = }\hat{\theta }_{0}^{*} ,\,\hat{\varvec{\theta }}^{\diamondsuit } = \hat{\varvec{\theta }}^{*} /\lambda_{1} )\) is a local minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\,\,\varvec{\theta})\).

We immediately have \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma},\theta_{0} ,\varvec{\theta}) = Q^{\diamondsuit } (\lambda ,\lambda_{1} \varvec{\gamma},\theta_{0} ,\varvec{\theta}/\lambda_{1} )\), since the loss depends on \((\varvec{\gamma},\varvec{\theta})\) only through the products \(\gamma_{k} \theta_{kj}\) and the penalties satisfy \(\lambda_{1} \sum\nolimits_{k} {\gamma_{k} } + \lambda_{2} \sum\nolimits_{k} {||\varvec{\theta}_{(k)} ||_{1} } = \sum\nolimits_{k} {(\lambda_{1} \gamma_{k} )} + \lambda \sum\nolimits_{k} {||\varvec{\theta}_{(k)} /\lambda_{1} ||_{1} }\) with \(\lambda = \lambda_{1} \lambda_{2}\). Since \((\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\hat{\varvec{\theta }}^{*} )\) is a local minimizer of \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\), there exists δ > 0 such that if \((\varvec{\gamma^{\prime}},\theta^{\prime}_{0} ,\varvec{\theta^{\prime}})\) satisfies \(\left\| {\varvec{\gamma^{\prime}} - \hat{\varvec{\gamma }}^{*} } \right\|_{1} + \left\| {\theta^{\prime}_{0} - \hat{\theta }_{0}^{*} } \right\|_{1} + \left\| {\varvec{\theta^{\prime}} - \hat{\varvec{\theta }}^{*} } \right\|_{1} < \delta\), then \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\hat{\varvec{\theta }}^{*} ) \le Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma^{\prime}},\theta^{\prime}_{0} ,\varvec{\theta^{\prime}})\).

Choose \(\delta^{\prime}\) such that \(\delta^{\prime} /\min (\lambda_{1} ,1/\lambda_{1} ) \le \delta\). Then for any \((\varvec{\gamma^{\prime\prime}},\theta^{\prime\prime}_{0} ,\varvec{\theta^{\prime\prime}})\) satisfying \(\left\| {\varvec{\gamma^{\prime\prime}} - \hat{\varvec{\gamma }}^{\diamondsuit } } \right\|_{1} + \left\| {\theta^{\prime\prime}_{0} - \hat{\theta }_{0}^{\diamondsuit } } \right\|_{1} + \left\| {\varvec{\theta^{\prime\prime}} - \hat{\varvec{\theta }}^{\diamondsuit } } \right\|_{1} < \delta^{\prime}\), we have

$$\left\| {\frac{{\varvec{\gamma^{\prime\prime}}}}{{\lambda_{1} }} - \hat{\varvec{\gamma }}^{*} } \right\|_{1} + \left\| {\theta^{\prime\prime}_{0} - \hat{\theta }_{0}^{*} } \right\|_{1} + \left\| {\lambda_{1} \varvec{\theta^{\prime\prime}} - \hat{\varvec{\theta }}^{*} } \right\|_{1} \le \frac{{\lambda_{1} \left\| {\frac{{\varvec{\gamma^{\prime\prime}}}}{{\lambda_{1} }} - \hat{\varvec{\gamma }}^{*} } \right\|_{1} + \left\| {\theta^{\prime\prime}_{0} - \hat{\theta }_{0}^{*} } \right\|_{1} + \frac{1}{{\lambda_{1} }}\left\| {\lambda_{1} \varvec{\theta^{\prime\prime}} - \hat{\varvec{\theta }}^{*} } \right\|_{1} }}{{\hbox{min} \left( {\lambda_{1} ,\frac{1}{{\lambda_{1} }}} \right)}} < \frac{{\delta^{\prime}}}{{\hbox{min} \left( {\lambda_{1} ,\frac{1}{{\lambda_{1} }}} \right)}} \le \delta .$$

Hence

$$Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{\diamondsuit } ,\hat{\theta }_{0}^{\diamondsuit } ,\,\hat{\varvec{\theta }}^{\diamondsuit } ) = Q^{*} (\lambda_{1} ,\lambda_{2} ,\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\hat{\varvec{\theta }}^{*} ) \le Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma^{\prime\prime}}/\lambda_{1} ,\theta^{\prime\prime}_{0} ,\,\lambda_{1} \varvec{\theta^{\prime\prime}}) = Q^{\diamondsuit } (\lambda ,\varvec{\gamma^{\prime\prime}},\,\,\theta^{\prime\prime}_{0} \,,\,\,\varvec{\theta^{\prime\prime}}).$$

Therefore, \((\hat{\varvec{\gamma }}^{\diamondsuit } = \lambda_{1} \hat{\varvec{\gamma }}^{*} ,\;\hat{\theta }_{0}^{\diamondsuit } = \hat{\theta }_{0}^{*} ,\,\hat{\varvec{\theta }}^{\diamondsuit } = \hat{\varvec{\theta }}^{*} /\lambda_{1} )\) is a local minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\).

Similarly, we can prove that for any local minimizer \((\hat{\varvec{\gamma }}^{\diamondsuit } ,\hat{\theta }_{0}^{\diamondsuit } ,\hat{\varvec{\theta }}^{\diamondsuit } )\) of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\), there is a corresponding local minimizer \((\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\hat{\varvec{\theta }}^{*} )\) of \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) satisfying \(\hat{\gamma }^{*}_{k} \hat{\theta }^{*}_{kj} = \hat{\gamma }^{\diamondsuit }_{k} \hat{\theta }^{\diamondsuit }_{kj}\) and \(\hat{\theta }^{*}_{0} = \hat{\theta }^{\diamondsuit }_{0}\). □
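The equivalence in Lemma 1 is easy to sanity-check numerically. The sketch below rests on an illustrative assumption about the exact forms of (2.5) and (2.6), which are not reproduced here: a hinge loss in \(\beta_{kj} = \gamma_{k}\theta_{kj}\) plus the penalties \(\lambda_{1}\sum_{k}\gamma_{k} + \lambda_{2}\sum_{k}\|\varvec{\theta}_{(k)}\|_{1}\) and \(\sum_{k}\gamma_{k} + \lambda\sum_{k}\|\varvec{\theta}_{(k)}\|_{1}\) with \(\lambda = \lambda_{1}\lambda_{2}\); the data and group structure are arbitrary.

```python
import numpy as np

# Illustrative data: 15 observations, two groups of sizes 2 and 3 (assumptions,
# not taken from the paper).
rng = np.random.default_rng(1)
n, groups = 15, [(0, 2), (2, 5)]
X = rng.normal(size=(n, 5))
y = np.where(rng.random(n) < 0.5, -1.0, 1.0)

lam1, lam2 = 0.7, 0.4
lam = lam1 * lam2  # the single parameter of problem (2.6)

def beta_of(gamma, theta):
    # beta_kj = gamma_k * theta_kj, groupwise
    return np.concatenate([gamma[k] * theta[a:b] for k, (a, b) in enumerate(groups)])

def hinge(theta0, beta):
    return np.maximum(0.0, 1.0 - y * (theta0 + X @ beta)).sum()

def Q_star(gamma, theta0, theta):  # assumed form of criterion (2.5)
    return hinge(theta0, beta_of(gamma, theta)) + lam1 * gamma.sum() + lam2 * np.abs(theta).sum()

def Q_dia(gamma, theta0, theta):   # assumed form of criterion (2.6)
    return hinge(theta0, beta_of(gamma, theta)) + gamma.sum() + lam * np.abs(theta).sum()

# The map (gamma, theta0, theta) -> (lam1 * gamma, theta0, theta / lam1)
# preserves the criterion value at every point, so it carries local minimizers
# of one problem to local minimizers of the other.
for _ in range(5):
    gamma, theta0, theta = rng.random(2) + 0.1, rng.normal(), rng.normal(size=5)
    assert np.isclose(Q_star(gamma, theta0, theta),
                      Q_dia(lam1 * gamma, theta0, theta / lam1))
```

Because the two criteria agree pointwise under this bijective rescaling, a δ-neighborhood around a minimizer of one problem maps into a neighborhood around the corresponding point of the other, which is what the proof formalizes.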

Proof of Lemma 2

Without loss of generality, let \(\beta_{0}\) and \(\varvec{\beta}\) be fixed at \(\hat{\beta }_{0}\) and \(\hat{\varvec{\beta }}\), respectively, and let \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) be the corresponding criterion that we would like to minimize in problem (2.6). Then \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) depends on \((\varvec{\gamma},\varvec{\theta})\) only through the penalty term \(\sum\nolimits_{k = 1}^{K} {\gamma_{k} } + \lambda \sum\nolimits_{k = 1}^{K} {||\varvec{\theta}_{(k)} ||_{1} }\). For any k with \(\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} \ne 0\), the corresponding penalty term is \(\gamma_{k} + \lambda \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} /\gamma_{k}\), which is minimized at \(\hat{\gamma }_{k} = \lambda^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} = \left( {\lambda \sum\nolimits_{j = 1}^{{p_{k} }} {|\hat{\beta }_{kj} |} } \right)^{1/2}\). Denote \(\Delta\varvec{\beta} = (\Delta\varvec{\beta}_{(1)}^{T} ,\ldots,\Delta\varvec{\beta}_{(K)}^{T} )^{T}\), \(\Delta\varvec{\beta}^{(1)} = (\Delta\varvec{\beta}_{(1)}^{(1)T} ,\ldots,\Delta\varvec{\beta}_{(K)}^{(1)T} )^{T}\) and \(\Delta\varvec{\beta}^{(2)} = (\Delta\varvec{\beta}_{(1)}^{(2)T} ,\ldots,\Delta\varvec{\beta}_{(K)}^{(2)T} )^{T}\), where \(\Delta\varvec{\beta}_{(k)}^{(1)} = {\mathbf{0}}\) if \(\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} = 0\) and \(\Delta\varvec{\beta}_{(k)}^{(2)} = {\mathbf{0}}\) if \(\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} \ne 0\) for k = 1, …, K. We thus have \(\left\| {\Delta\varvec{\beta}} \right\|_{1} = \left\| {\Delta\varvec{\beta}^{(1)} } \right\|_{1} + \left\| {\Delta\varvec{\beta}^{(2)} } \right\|_{1}\). Let \(Q(\lambda ,\beta_{0} ,\varvec{\beta})\) be the corresponding criterion in problem (2.7).
Now we show that there exists a \(\delta^{\prime} > 0\) such that if \(\max \{ |\Delta \beta_{0} |,||\Delta\varvec{\beta}||_{1} \} < \delta^{\prime}\), then \(Q(\lambda ,\hat{\beta }_{0} ,\hat{\varvec{\beta }}) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta})\).

We first show \(Q(\lambda ,\hat{\beta }_{0} ,\,\hat{\varvec{\beta }}) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )\). By the argument given at the beginning of the proof, we have \(\hat{\gamma }_{k} = \lambda^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2}\) and \(\hat{\varvec{\theta }}_{(k)} = \hat{\varvec{\beta }}_{(k)} /\hat{\gamma }_{k}\) if \(|\hat{\gamma }_{k} | \ne 0\), and \(\hat{\varvec{\theta }}_{(k)} = {\mathbf{0}}\) if \(|\hat{\gamma }_{k} | = 0\). Clearly \(\hat{\theta }_{0} = \hat{\beta }_{0}\). Furthermore, let \(\hat{\gamma }^{\prime}_{k} = \lambda^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2}\) and \(\varvec{\hat{\theta }^{\prime}}_{\left( k \right)} = (\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} )/\hat{\gamma }^{\prime}_{k}\) if \(|\hat{\gamma }_{k} | \ne 0\), and let \(\hat{\gamma }^{\prime}_{k} = 0\) and \(\varvec{\hat{\theta }^{\prime}}_{(k)} = {\mathbf{0}}\) if \(|\hat{\gamma }_{k} | = 0\) and let \(\hat{\theta }^{\prime}_{0} = \hat{\beta }_{0} + \Delta \beta_{0}\). Then we have \(Q^{\diamondsuit } (\lambda ,\,\varvec{\hat{\gamma }^{\prime}},\,\hat{\theta }^{\prime}_{0} ,\,\varvec{\hat{\theta }^{\prime}}) = Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )\) and \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\,\hat{\varvec{\theta }})\) \(= Q(\lambda ,\hat{\beta }_{0} ,\,\hat{\varvec{\beta }})\). 
As \((\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\hat{\varvec{\theta }})\) is a local minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\), there exists a δ > 0 such that for any \((\varvec{\gamma^{\prime}},\theta^{\prime}_{0} ,\varvec{\theta^{\prime}})\) satisfying \(\left\| {\varvec{\gamma^{\prime}} - \hat{\varvec{\gamma }}} \right\|_{1} + \left\| {\theta_{0}^{\prime } - \hat{\theta }_{0} } \right\|_{1} + \left\| {\varvec{\theta^{\prime}} - \hat{\varvec{\theta }}} \right\|_{1} < \delta\), we have \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\hat{\varvec{\theta }}) \le Q^{\diamondsuit } (\lambda ,\varvec{\gamma^{\prime}},\theta_{0}^{\prime } ,\varvec{\theta^{\prime}})\). For \(a = \min \{ \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} :\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} \ne 0,\,k = 1,\ldots,K\}\) and \(\delta^{\prime} < a/2\), we have \(|\hat{\gamma }^{\prime}_{k} - \hat{\gamma }_{k} | = \sqrt \lambda \left| {\left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} - \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \right| \le \sqrt \lambda \left| {\left( {\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} + \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} } \right)^{1/2} - \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \right| \le \frac{\sqrt \lambda \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \le \frac{\sqrt \lambda \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\sqrt a },\) by the triangle inequality \(||\varvec{a} + \varvec{b}||_{1}^{1/2} \le (||\varvec{a}||_{1} + ||\varvec{b}||_{1} )^{1/2}\) and the concavity bound \((||\varvec{a}||_{1} + ||\varvec{b}||_{1} )^{1/2} - ||\varvec{a}||_{1}^{1/2} \le \frac{||\varvec{b}||_{1} }{2||\varvec{a}||_{1}^{1/2} }\).
Furthermore, by using the inequality \(\sqrt \lambda \left| {\left( {\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} + \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} } \right)^{1/2} - \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \right|\) \(\le \frac{{\sqrt \lambda \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }}{{2\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} }}\), we have

$$\begin{aligned} \left\| {\varvec{\hat{\theta }^{\prime}}_{(k)} - \hat{\varvec{\theta }}_{(k)} } \right\|_{1} &= \left\| {\frac{\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} }{\hat{\gamma }^{\prime}_{k} } - \frac{\hat{\varvec{\beta }}_{(k)} }{\hat{\gamma }_{k} }} \right\|_{1} = \left\| {\frac{\Delta\varvec{\beta}_{(k)}^{(1)} }{\hat{\gamma }^{\prime}_{k} } - \frac{\hat{\varvec{\beta }}_{(k)} (\hat{\gamma }^{\prime}_{k} - \hat{\gamma }_{k} )}{\hat{\gamma }^{\prime}_{k} \hat{\gamma }_{k} }} \right\|_{1} \\ &\le \left\| {\frac{\Delta\varvec{\beta}_{(k)}^{(1)} }{\sqrt \lambda \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} }} \right\|_{1} + \left\| {\frac{\hat{\varvec{\beta }}_{(k)} \left( {\left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} - \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \right)}{\sqrt \lambda \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} }} \right\|_{1} \\ &\le \frac{\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{\sqrt \lambda \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} } + \frac{\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} }{\sqrt \lambda \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} } \cdot \frac{\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \\ &\le \frac{1}{\sqrt \lambda }\left( {\frac{\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{\left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} } + \frac{\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} }} \right) \\ &\le \frac{3\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\sqrt \lambda } \cdot \frac{1}{\sqrt {\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} - \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} } } \le \frac{3\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\sqrt \lambda } \cdot \frac{1}{\sqrt {a - a/2} } = \frac{3\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{\sqrt {2\lambda a} }. \end{aligned}$$

Therefore, we are able to choose a \(\delta^{\prime}\) satisfying \(\delta^{\prime} < a/2\) such that \(\left\| {\varvec{\hat{\gamma}^{\prime}} - \hat{\varvec{\gamma }}} \right\|_{1} + \left\| {\hat{\theta }_{0}^{\prime } - \hat{\theta }_{0} } \right\|_{1} + \left\| {\varvec{\hat{\theta}^{\prime}} - \hat{\varvec{\theta }}} \right\|_{1} < \delta\) whenever \(\left\| {\Delta\varvec{\beta}^{(1)} } \right\|_{1} < \delta^{\prime}\). Hence we have \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\hat{\varvec{\theta }}) \le Q^{\diamondsuit } (\lambda ,\varvec{\hat{\gamma}^{\prime}},\hat{\theta }^{\prime}_{0} ,\varvec{\hat{\theta}^{\prime}})\) by local minimality, and therefore \(Q(\lambda ,\hat{\beta }_{0} ,\hat{\varvec{\beta }}) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )\).

Next we show \(Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} ) \le Q(\lambda ,\,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} )\). Note that the Lipschitz continuity of the hinge loss function \(\left\lfloor {h(\beta_{0} ,\,\,\varvec{\beta})} \right\rfloor_{ + }\) implies that

$$\left| {\,\,\left\lfloor {h(\hat{\beta }_{0} + \Delta \beta_{0} ,\,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} )} \right\rfloor_{ + } - \left\lfloor {h(\hat{\beta }_{0} + \Delta \beta_{0} ,\,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )} \right\rfloor_{ + } \,} \right| \le L^{\prime}||\Delta\varvec{\beta}^{(2)} ||_{1}$$

for some \(L^{\prime} > 0\), where \(h(\beta_{0} ,\varvec{\beta}) = 1 - y_{i} \left( {\beta_{0} + \sum\nolimits_{k = 1}^{K} {\varvec{x}_{i,(k)}^{T} \varvec{\beta}_{(k)} } } \right)\). Moreover, we can choose a real number L such that

$$\left\lfloor {h(\hat{\beta }_{0} + \Delta \beta_{0} ,\,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} )} \right\rfloor_{ + } - \left\lfloor {h(\hat{\beta }_{0} + \Delta \beta_{0} ,\,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )} \right\rfloor_{ + } = L||\Delta\varvec{\beta}^{(2)} ||_{1} .$$

Hence, there exists a number L in \({\mathbb{R}}\) such that

$$Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} ) - Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }}\text{ + }\Delta\varvec{\beta}^{(1)} ) = L\left\| {\Delta\varvec{\beta}^{(2)} } \right\|_{1} + 2\lambda^{1/2} \sum\nolimits_{k = 1}^{K} {\left\| {\Delta\varvec{\beta}_{(k)}^{(2)} } \right\|_{1}^{1/2} } .$$

Since \(\left\| {\Delta\varvec{\beta}^{(2)} } \right\|_{1} < \delta^{\prime}\) for a small enough \(\delta^{\prime}\), the second term on the right side of the above equality dominates the first term; hence we have \(Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} ) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} )\). Thus we have shown that there exists a \(\delta^{\prime} > 0\) such that if \(\max \{ |\Delta \beta_{0} |,\left\| {\Delta\varvec{\beta}} \right\|_{1} \} < \delta^{\prime}\), then \(Q(\lambda ,\hat{\beta }_{0} ,\hat{\varvec{\beta }}) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta})\), which implies that \((\hat{\beta }_{0} ,\hat{\varvec{\beta }})\) is a local minimizer of \(Q(\lambda ,\beta_{0} ,\varvec{\beta})\).

Similarly, we can prove that if \((\hat{\beta }_{0} ,\hat{\varvec{\beta }})\) is a local minimizer of \(Q(\lambda ,\beta_{0} ,\varvec{\beta})\), then there exists a local minimizer \((\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\hat{\varvec{\theta }})\) of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) satisfying \(\hat{\beta }_{0} = \hat{\theta }_{0}\) and \(\hat{\beta }_{kj} = \hat{\gamma }_{k} \hat{\theta }_{kj}\). □
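The key computation in the proof of Lemma 2, namely that \(\gamma_{k} + \lambda\|\hat{\varvec{\beta}}_{(k)}\|_{1}/\gamma_{k}\) is minimized over \(\gamma_{k} > 0\) at \(\hat{\gamma}_{k} = (\lambda\|\hat{\varvec{\beta}}_{(k)}\|_{1})^{1/2}\), can be checked against a brute-force grid search. The values of λ and \(\|\hat{\varvec{\beta}}_{(k)}\|_{1}\) below are arbitrary illustrations:

```python
import numpy as np

# Illustrative values (assumptions): lam plays the role of the tuning parameter
# and b1 the role of ||beta_hat_(k)||_1 for an active group k.
lam, b1 = 0.8, 1.7

def f(g):
    # the group-k portion of the profiled penalty: gamma_k + lam * b1 / gamma_k
    return g + lam * b1 / g

g_hat = np.sqrt(lam * b1)  # claimed closed-form minimizer

# brute-force check over a fine grid of gamma_k > 0
grid = np.linspace(1e-3, 10.0, 200001)
assert abs(grid[np.argmin(f(grid))] - g_hat) < 1e-3
assert np.all(f(grid) >= f(g_hat) - 1e-9)
```

Substituting \(\hat{\gamma}_{k}\) back gives the profiled value \(2(\lambda\|\hat{\varvec{\beta}}_{(k)}\|_{1})^{1/2}\), consistent with the \(2\lambda^{1/2}\|\cdot\|_{1}^{1/2}\) terms appearing later in the proof.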

Proof of Lemma 3

Clearly, \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) is bounded below because each of its terms is nonnegative. Let \(\hat{\varvec{\gamma }}^{(t)}\) and \((\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\) denote the minimizers of the criteria in the optimization problems (3.1) and (3.2) at the tth iteration, respectively. Note that the minimization problems in Steps 1 and 2 are equivalent to minimizing \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t - 1)} ,\theta_{0} ,\varvec{\theta})\) with respect to \((\theta_{0} ,\varvec{\theta})\) and minimizing \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\) with respect to \(\varvec{\gamma}\), respectively. Since \(\hat{\varvec{\gamma }}^{(t)}\) is the minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\), we have \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} ) \le Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t - 1)} ,\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\).
Similarly, since \((\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} )\) is the minimizer of \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\theta_{0} ,\varvec{\theta})\) and \(\hat{\varvec{\gamma }}^{(t + 1)}\) is the minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} )\), we have \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} ) \le Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\) and \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t + 1)} ,\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} ) \le Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} )\). Therefore, \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t + 1)} ,\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} ) \le Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\), so the criterion \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) never increases from one iteration to the next. □
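This monotone descent can be illustrated on a toy problem. Everything below is an illustrative assumption: synthetic data, an assumed hinge-plus-hierarchical-penalty criterion, and coordinate-wise ternary search standing in for the exact solvers of (3.1) and (3.2). Since every accepted update minimizes a convex one-dimensional slice, the criterion still never increases across iterations, which is the property the lemma establishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, groups = 20, 4, [(0, 2), (2, 4)]   # two groups of two features (assumed)
X = rng.normal(size=(n, p))
y = np.where(X[:, 0] - X[:, 2] > 0, 1.0, -1.0)
lam = 0.5                                 # illustrative tuning value

def Q(gamma, theta0, theta):
    # assumed criterion: hinge loss in beta_kj = gamma_k * theta_kj
    # plus sum_k gamma_k + lam * sum_k ||theta_(k)||_1, gamma_k >= 0
    beta = np.concatenate([gamma[k] * theta[a:b] for k, (a, b) in enumerate(groups)])
    hinge = np.maximum(0.0, 1.0 - y * (theta0 + X @ beta)).sum()
    return hinge + gamma.sum() + lam * np.abs(theta).sum()

def line_min(f, lo, hi, iters=80):
    # ternary search: minimizes a convex function of one variable on [lo, hi]
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

gamma, theta0, theta = np.ones(len(groups)), 0.0, np.zeros(p)
vals = [Q(gamma, theta0, theta)]
for _ in range(8):
    # Step 1: update (theta0, theta) with gamma held fixed
    theta0 = line_min(lambda v: Q(gamma, v, theta), -5.0, 5.0)
    for j in range(p):
        theta[j] = line_min(
            lambda v, j=j: Q(gamma, theta0, np.r_[theta[:j], v, theta[j + 1:]]),
            -5.0, 5.0)
    # Step 2: update gamma >= 0 with (theta0, theta) held fixed
    for k in range(len(groups)):
        gamma[k] = line_min(
            lambda v, k=k: Q(np.r_[gamma[:k], v, gamma[k + 1:]], theta0, theta),
            0.0, 5.0)
    vals.append(Q(gamma, theta0, theta))

# the recorded criterion values are non-increasing
assert all(vals[i + 1] <= vals[i] + 1e-8 for i in range(len(vals) - 1))
```

In practice Step 1 is an L1-penalized SVM solve with its own dedicated algorithm; any pair of sub-routines that never increase their respective sub-criteria inherits the same monotonicity and, combined with boundedness below, yields convergence of the criterion values.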


Cite this article

Bang, S., Kang, J., Jhun, M. et al. Hierarchically penalized support vector machine with grouped variables. Int. J. Mach. Learn. & Cyber. 8, 1211–1221 (2017). https://doi.org/10.1007/s13042-016-0494-2
