
Hierarchically penalized support vector machine with grouped variables

Original Article · International Journal of Machine Learning and Cybernetics

Abstract

When input features are naturally grouped or generated by factors in a linear classification problem, it is more meaningful to identify important groups or factors rather than individual features. The F-norm support vector machine (SVM) and the group lasso penalized SVM have been developed to perform simultaneous classification and factor selection. However, these group-wise penalized SVM methods may suffer from estimation inefficiency and model selection inconsistency because they cannot perform feature selection within an identified group. To overcome this limitation, we propose the hierarchically penalized SVM (H-SVM) that not only effectively identifies important groups but also removes irrelevant features within an identified group. Numerical results are presented to demonstrate the competitive performance of the proposed H-SVM over existing SVM methods.



Acknowledgments

The authors are grateful to the editor and the reviewers for their constructive and insightful comments and suggestions, which helped to dramatically improve the quality of this paper. This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by (1) the Ministry of Science, ICT and Future Planning (NRF-2013R1A1A1007536) for S. Bang and (2) the Ministry of Education (NRF-2013R1A1A2A10007545) for M. Jhun.

Author information

Correspondence to Eunkyung Kim.

Appendix: Proofs

Proof of Lemma 1

Let \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) denote the criterion that we would like to minimize in problem (2.5), let \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\,\,\varvec{\theta})\) denote the corresponding criterion in problem (2.6), and let \((\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\,\hat{\varvec{\theta }}^{*} )\) denote a local minimizer of \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\). We will prove that \((\hat{\varvec{\gamma }}^{\diamondsuit } = \lambda_{1} \hat{\varvec{\gamma }}^{*} ,\;\hat{\theta }_{0}^{\diamondsuit } \varvec{ = }\hat{\theta }_{0}^{*} ,\,\hat{\varvec{\theta }}^{\diamondsuit } = \hat{\varvec{\theta }}^{*} /\lambda_{1} )\) is a local minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\,\,\varvec{\theta})\).

We immediately have \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma},\theta_{0} ,\varvec{\theta}) = Q^{\diamondsuit } (\lambda ,\lambda_{1} \varvec{\gamma},\theta_{0} ,\varvec{\theta}/\lambda_{1} )\), since the loss depends on \((\varvec{\gamma},\varvec{\theta})\) only through the products \(\gamma_{k} \theta_{kj}\) and the penalties satisfy \(\lambda_{1} \sum\nolimits_{k} {\gamma_{k} } + \lambda_{2} \sum\nolimits_{k} {||\varvec{\theta}_{(k)} ||_{1} } = \sum\nolimits_{k} {(\lambda_{1} \gamma_{k} )} + \lambda \sum\nolimits_{k} {||\varvec{\theta}_{(k)} /\lambda_{1} ||_{1} }\) with \(\lambda = \lambda_{1} \lambda_{2}\). Since \((\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\hat{\varvec{\theta }}^{*} )\) is a local minimizer of \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\), there exists δ > 0 such that if \((\varvec{\gamma^{\prime}},\theta^{\prime}_{0} ,\varvec{\theta^{\prime}})\) satisfies \(\left\| {\varvec{\gamma^{\prime}} - \hat{\varvec{\gamma }}^{*} } \right\|_{1} + \left\| {\theta^{\prime}_{0} - \hat{\theta }_{0}^{*} } \right\|_{1} + \left\| {\varvec{\theta^{\prime}} - \hat{\varvec{\theta }}^{*} } \right\|_{1} < \delta\), then \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\hat{\varvec{\theta }}^{*} ) \le Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma^{\prime}},\theta^{\prime}_{0} ,\varvec{\theta^{\prime}})\).

Choose \(\delta^{\prime}\) such that \(\delta^{\prime} /\min (\lambda_{1} ,1/\lambda_{1} ) \le \delta\). Then for any \((\varvec{\gamma^{\prime\prime}},\theta^{\prime\prime}_{0} ,\varvec{\theta^{\prime\prime}})\) satisfying \(\left\| {\varvec{\gamma^{\prime\prime}} - \hat{\varvec{\gamma }}^{\diamondsuit } } \right\|_{1} + \left\| {\theta^{\prime\prime}_{0} - \hat{\theta }_{0}^{\diamondsuit } } \right\|_{1} + \left\| {\varvec{\theta^{\prime\prime}} - \hat{\varvec{\theta }}^{\diamondsuit } } \right\|_{1} < \delta^{\prime}\), we have

$$\left\| {\frac{{\varvec{\gamma^{\prime\prime}}}}{{\lambda_{1} }} - \hat{\varvec{\gamma }}^{*} } \right\|_{1} + \left\| {\theta^{\prime\prime}_{0} - \hat{\theta }_{0}^{*} } \right\|_{1} + \left\| {\lambda_{1} \varvec{\theta^{\prime\prime}} - \hat{\varvec{\theta }}^{*} } \right\|_{1} \le \frac{{\lambda_{1} \left\| {\frac{{\varvec{\gamma^{\prime\prime}}}}{{\lambda_{1} }} - \hat{\varvec{\gamma }}^{*} } \right\|_{1} + \left\| {\theta^{\prime\prime}_{0} - \hat{\theta }_{0}^{*} } \right\|_{1} + \frac{1}{{\lambda_{1} }}\left\| {\lambda_{1} \varvec{\theta^{\prime\prime}} - \hat{\varvec{\theta }}^{*} } \right\|_{1} }}{{\hbox{min} \left( {\lambda_{1} ,\frac{1}{{\lambda_{1} }}} \right)}} < \frac{{\delta^{\prime}}}{{\hbox{min} \left( {\lambda_{1} ,\frac{1}{{\lambda_{1} }}} \right)}} \le \delta .$$

Hence

$$Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{\diamondsuit } ,\hat{\theta }_{0}^{\diamondsuit } ,\,\hat{\varvec{\theta }}^{\diamondsuit } ) = Q^{*} (\lambda_{1} ,\lambda_{2} ,\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\hat{\varvec{\theta }}^{*} ) \le Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma^{\prime\prime}}/\lambda_{1} ,\theta^{\prime\prime}_{0} ,\,\lambda_{1} \varvec{\theta^{\prime\prime}}) = Q^{\diamondsuit } (\lambda ,\varvec{\gamma^{\prime\prime}},\,\,\theta^{\prime\prime}_{0} \,,\,\,\varvec{\theta^{\prime\prime}}).$$

Therefore, \((\hat{\varvec{\gamma }}^{\diamondsuit } = \lambda_{1} \hat{\varvec{\gamma }}^{*} ,\;\hat{\theta }_{0}^{\diamondsuit } = \hat{\theta }_{0}^{*} ,\,\hat{\varvec{\theta }}^{\diamondsuit } = \hat{\varvec{\theta }}^{*} /\lambda_{1} )\) is a local minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\).

Similarly, we can prove that for any local minimizer \((\hat{\varvec{\gamma }}^{\diamondsuit } ,\hat{\theta }_{0}^{\diamondsuit } ,\hat{\varvec{\theta }}^{\diamondsuit } )\) of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\), there is a corresponding local minimizer \((\hat{\varvec{\gamma }}^{*} ,\hat{\theta }_{0}^{*} ,\hat{\varvec{\theta }}^{*} )\) of \(Q^{*} (\lambda_{1} ,\lambda_{2} ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) satisfying \(\hat{\gamma }^{*}_{k} \hat{\theta }^{*}_{kj} = \hat{\gamma }^{\diamondsuit }_{k} \hat{\theta }^{\diamondsuit }_{kj}\) and \(\hat{\theta }^{*}_{0} = \hat{\theta }^{\diamondsuit }_{0}\). □
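The equivalence in Lemma 1 is easy to sanity-check numerically. The sketch below rests on an illustrative assumption about the exact forms of (2.5) and (2.6), which are not reproduced here: a hinge loss in \(\beta_{kj} = \gamma_{k}\theta_{kj}\) plus the penalties \(\lambda_{1}\sum_{k}\gamma_{k} + \lambda_{2}\sum_{k}\|\varvec{\theta}_{(k)}\|_{1}\) and \(\sum_{k}\gamma_{k} + \lambda\sum_{k}\|\varvec{\theta}_{(k)}\|_{1}\) with \(\lambda = \lambda_{1}\lambda_{2}\); the data and group structure are arbitrary.

```python
import numpy as np

# Illustrative data: 15 observations, two groups of sizes 2 and 3 (assumptions,
# not taken from the paper).
rng = np.random.default_rng(1)
n, groups = 15, [(0, 2), (2, 5)]
X = rng.normal(size=(n, 5))
y = np.where(rng.random(n) < 0.5, -1.0, 1.0)

lam1, lam2 = 0.7, 0.4
lam = lam1 * lam2  # the single parameter of problem (2.6)

def beta_of(gamma, theta):
    # beta_kj = gamma_k * theta_kj, groupwise
    return np.concatenate([gamma[k] * theta[a:b] for k, (a, b) in enumerate(groups)])

def hinge(theta0, beta):
    return np.maximum(0.0, 1.0 - y * (theta0 + X @ beta)).sum()

def Q_star(gamma, theta0, theta):  # assumed form of criterion (2.5)
    return hinge(theta0, beta_of(gamma, theta)) + lam1 * gamma.sum() + lam2 * np.abs(theta).sum()

def Q_dia(gamma, theta0, theta):   # assumed form of criterion (2.6)
    return hinge(theta0, beta_of(gamma, theta)) + gamma.sum() + lam * np.abs(theta).sum()

# The map (gamma, theta0, theta) -> (lam1 * gamma, theta0, theta / lam1)
# preserves the criterion value at every point, so it carries local minimizers
# of one problem to local minimizers of the other.
for _ in range(5):
    gamma, theta0, theta = rng.random(2) + 0.1, rng.normal(), rng.normal(size=5)
    assert np.isclose(Q_star(gamma, theta0, theta),
                      Q_dia(lam1 * gamma, theta0, theta / lam1))
```

Because the two criteria agree pointwise under this bijective rescaling, a δ-neighborhood around a minimizer of one problem maps into a neighborhood around the corresponding point of the other, which is what the proof formalizes.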

Proof of Lemma 2

Without loss of generality, let \(\beta_{0}\) and \(\varvec{\beta}\) be fixed at \(\hat{\beta }_{0}\) and \(\hat{\varvec{\beta }}\), respectively, and let \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) be the corresponding criterion that we would like to minimize in problem (2.6). Then \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) depends on \((\varvec{\gamma},\varvec{\theta})\) only through the penalty term \(\sum\nolimits_{k = 1}^{K} {\gamma_{k} } + \lambda \sum\nolimits_{k = 1}^{K} {||\varvec{\theta}_{(k)} ||_{1} }\). For any k with \(\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} \ne 0\), the corresponding penalty term is \(\gamma_{k} + \lambda \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} /\gamma_{k}\), which is minimized at \(\hat{\gamma }_{k} = \lambda^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} = \left( {\lambda \sum\nolimits_{j = 1}^{{p_{k} }} {|\hat{\beta }_{kj} |} } \right)^{1/2}\). Denote \(\Delta\varvec{\beta} = (\Delta\varvec{\beta}_{(1)}^{T} ,\ldots,\Delta\varvec{\beta}_{(K)}^{T} )^{T}\), \(\Delta\varvec{\beta}^{(1)} = (\Delta\varvec{\beta}_{(1)}^{(1)T} ,\ldots,\Delta\varvec{\beta}_{(K)}^{(1)T} )^{T}\) and \(\Delta\varvec{\beta}^{(2)} = (\Delta\varvec{\beta}_{(1)}^{(2)T} ,\ldots,\Delta\varvec{\beta}_{(K)}^{(2)T} )^{T}\), where \(\Delta\varvec{\beta}_{(k)}^{(1)} = {\mathbf{0}}\) if \(\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} = 0\) and \(\Delta\varvec{\beta}_{(k)}^{(2)} = {\mathbf{0}}\) if \(\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} \ne 0\) for k = 1, …, K. We thus have \(\left\| {\Delta\varvec{\beta}} \right\|_{1} = \left\| {\Delta\varvec{\beta}^{(1)} } \right\|_{1} + \left\| {\Delta\varvec{\beta}^{(2)} } \right\|_{1}\). Let \(Q(\lambda ,\beta_{0} ,\varvec{\beta})\) be the corresponding criterion in problem (2.7).
Now we show that there exists a \(\delta^{\prime} > 0\) such that if \(\max \{ |\Delta \beta_{0} |,||\Delta\varvec{\beta}||_{1} \} < \delta^{\prime}\), then \(Q(\lambda ,\hat{\beta }_{0} ,\hat{\varvec{\beta }}) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta})\).

We first show \(Q(\lambda ,\hat{\beta }_{0} ,\,\hat{\varvec{\beta }}) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )\). By the argument given at the beginning of the proof, we have \(\hat{\gamma }_{k} = \lambda^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2}\) and \(\hat{\varvec{\theta }}_{(k)} = \hat{\varvec{\beta }}_{(k)} /\hat{\gamma }_{k}\) if \(|\hat{\gamma }_{k} | \ne 0\), and \(\hat{\varvec{\theta }}_{(k)} = {\mathbf{0}}\) if \(|\hat{\gamma }_{k} | = 0\). Clearly \(\hat{\theta }_{0} = \hat{\beta }_{0}\). Furthermore, let \(\hat{\gamma }^{\prime}_{k} = \lambda^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2}\) and \(\varvec{\hat{\theta }^{\prime}}_{\left( k \right)} = (\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} )/\hat{\gamma }^{\prime}_{k}\) if \(|\hat{\gamma }_{k} | \ne 0\), and let \(\hat{\gamma }^{\prime}_{k} = 0\) and \(\varvec{\hat{\theta }^{\prime}}_{(k)} = {\mathbf{0}}\) if \(|\hat{\gamma }_{k} | = 0\) and let \(\hat{\theta }^{\prime}_{0} = \hat{\beta }_{0} + \Delta \beta_{0}\). Then we have \(Q^{\diamondsuit } (\lambda ,\,\varvec{\hat{\gamma }^{\prime}},\,\hat{\theta }^{\prime}_{0} ,\,\varvec{\hat{\theta }^{\prime}}) = Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )\) and \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\,\hat{\varvec{\theta }})\) \(= Q(\lambda ,\hat{\beta }_{0} ,\,\hat{\varvec{\beta }})\). 
As \((\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\hat{\varvec{\theta }})\) is a local minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\), there exists a δ > 0 such that for any \((\varvec{\gamma^{\prime}},\theta^{\prime}_{0} ,\varvec{\theta^{\prime}})\) satisfying \(\left\| {\varvec{\gamma^{\prime}} - \hat{\varvec{\gamma }}} \right\|_{1} + \left\| {\theta_{0}^{\prime } - \hat{\theta }_{0} } \right\|_{1} + \left\| {\varvec{\theta^{\prime}} - \hat{\varvec{\theta }}} \right\|_{1} < \delta\), we have \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\hat{\varvec{\theta }}) \le Q^{\diamondsuit } (\lambda ,\varvec{\gamma^{\prime}},\theta_{0}^{\prime } ,\varvec{\theta^{\prime}})\). For \(a = \min \{ \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} :\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} \ne 0,\,k = 1,\ldots,K\}\) and \(\delta^{\prime} < a/2\), we have \(|\hat{\gamma }^{\prime}_{k} - \hat{\gamma }_{k} | = \sqrt \lambda \left| {\left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} - \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \right| \le \sqrt \lambda \left| {\left( {\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} + \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} } \right)^{1/2} - \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \right| \le \frac{\sqrt \lambda \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \le \frac{\sqrt \lambda \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\sqrt a },\) by the triangle inequality \(||\varvec{a} + \varvec{b}||_{1}^{1/2} \le (||\varvec{a}||_{1} + ||\varvec{b}||_{1} )^{1/2}\) and the concavity bound \((||\varvec{a}||_{1} + ||\varvec{b}||_{1} )^{1/2} - ||\varvec{a}||_{1}^{1/2} \le \frac{||\varvec{b}||_{1} }{2||\varvec{a}||_{1}^{1/2} }\).
Furthermore, by using the inequality \(\sqrt \lambda \left| {\left( {\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} + \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} } \right)^{1/2} - \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \right|\) \(\le \frac{{\sqrt \lambda \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }}{{2\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} }}\), we have

$$\begin{aligned} \left\| {\varvec{\hat{\theta }^{\prime}}_{(k)} - \hat{\varvec{\theta }}_{(k)} } \right\|_{1} &= \left\| {\frac{\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} }{\hat{\gamma }^{\prime}_{k} } - \frac{\hat{\varvec{\beta }}_{(k)} }{\hat{\gamma }_{k} }} \right\|_{1} = \left\| {\frac{\Delta\varvec{\beta}_{(k)}^{(1)} }{\hat{\gamma }^{\prime}_{k} } - \frac{\hat{\varvec{\beta }}_{(k)} (\hat{\gamma }^{\prime}_{k} - \hat{\gamma }_{k} )}{\hat{\gamma }^{\prime}_{k} \hat{\gamma }_{k} }} \right\|_{1} \\ &\le \left\| {\frac{\Delta\varvec{\beta}_{(k)}^{(1)} }{\sqrt \lambda \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} }} \right\|_{1} + \left\| {\frac{\hat{\varvec{\beta }}_{(k)} \left( {\left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} - \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \right)}{\sqrt \lambda \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} }} \right\|_{1} \\ &\le \frac{\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{\sqrt \lambda \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} } + \frac{\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} }{\sqrt \lambda \left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} \left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} } \cdot \frac{\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1}^{1/2} } \\ &\le \frac{1}{\sqrt \lambda }\left( {\frac{\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{\left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} } + \frac{\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\left\| {\hat{\varvec{\beta }}_{(k)} + \Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1}^{1/2} }} \right) \\ &\le \frac{3\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\sqrt \lambda } \cdot \frac{1}{\sqrt {\left\| {\hat{\varvec{\beta }}_{(k)} } \right\|_{1} - \left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} } } \le \frac{3\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{2\sqrt \lambda } \cdot \frac{1}{\sqrt {a - a/2} } = \frac{3\left\| {\Delta\varvec{\beta}_{(k)}^{(1)} } \right\|_{1} }{\sqrt {2\lambda a} }. \end{aligned}$$

Therefore, we are able to choose a \(\delta^{\prime}\) satisfying \(\delta^{\prime} < a/2\) such that \(\left\| {\varvec{\hat{\gamma}^{\prime}} - \hat{\varvec{\gamma }}} \right\|_{1} + \left\| {\hat{\theta }_{0}^{\prime } - \hat{\theta }_{0} } \right\|_{1} + \left\| {\varvec{\hat{\theta}^{\prime}} - \hat{\varvec{\theta }}} \right\|_{1} < \delta\) whenever \(\left\| {\Delta\varvec{\beta}^{(1)} } \right\|_{1} < \delta^{\prime}\). Hence we have \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\hat{\varvec{\theta }}) \le Q^{\diamondsuit } (\lambda ,\varvec{\hat{\gamma}^{\prime}},\hat{\theta }^{\prime}_{0} ,\varvec{\hat{\theta}^{\prime}})\) by local minimality, and therefore \(Q(\lambda ,\hat{\beta }_{0} ,\hat{\varvec{\beta }}) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )\).

Next we show \(Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} ) \le Q(\lambda ,\,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} )\). Note that the Lipschitz continuity of the hinge loss function \(\left\lfloor {h(\beta_{0} ,\,\,\varvec{\beta})} \right\rfloor_{ + }\) implies that

$$\left| {\,\,\left\lfloor {h(\hat{\beta }_{0} + \Delta \beta_{0} ,\,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} )} \right\rfloor_{ + } - \left\lfloor {h(\hat{\beta }_{0} + \Delta \beta_{0} ,\,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )} \right\rfloor_{ + } \,} \right| \le L^{\prime}||\Delta\varvec{\beta}^{(2)} ||_{1}$$

for some \(L^{\prime} > 0\), where \(h(\beta_{0} ,\varvec{\beta}) = 1 - y_{i} \left( {\beta_{0} + \sum\nolimits_{k = 1}^{K} {\varvec{x}_{i,(k)}^{T} \varvec{\beta}_{(k)} } } \right)\). Moreover, we can choose a real number L such that

$$\left\lfloor {h(\hat{\beta }_{0} + \Delta \beta_{0} ,\,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} )} \right\rfloor_{ + } - \left\lfloor {h(\hat{\beta }_{0} + \Delta \beta_{0} ,\,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} )} \right\rfloor_{ + } = L||\Delta\varvec{\beta}^{(2)} ||_{1} .$$

Hence, there exists a number L in \({\mathbb{R}}\) such that

$$Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} ) - Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\,\hat{\varvec{\beta }}\text{ + }\Delta\varvec{\beta}^{(1)} ) = L\left\| {\Delta\varvec{\beta}^{(2)} } \right\|_{1} + 2\lambda^{1/2} \sum\nolimits_{k = 1}^{K} {\left\| {\Delta\varvec{\beta}_{(k)}^{(2)} } \right\|_{1}^{1/2} } .$$

Since \(\left\| {\Delta\varvec{\beta}^{(2)} } \right\|_{1} < \delta^{\prime}\) for a small enough \(\delta^{\prime}\), the second term on the right side of the above equality dominates the first term; hence we have \(Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} ) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta}^{(1)} + \Delta\varvec{\beta}^{(2)} )\). Thus we have shown that there exists a \(\delta^{\prime} > 0\) such that if \(\max \{ |\Delta \beta_{0} |,\left\| {\Delta\varvec{\beta}} \right\|_{1} \} < \delta^{\prime}\), then \(Q(\lambda ,\hat{\beta }_{0} ,\hat{\varvec{\beta }}) \le Q(\lambda ,\hat{\beta }_{0} + \Delta \beta_{0} ,\hat{\varvec{\beta }} + \Delta\varvec{\beta})\), which implies that \((\hat{\beta }_{0} ,\hat{\varvec{\beta }})\) is a local minimizer of \(Q(\lambda ,\beta_{0} ,\varvec{\beta})\).

Similarly, we can prove that if \((\hat{\beta }_{0} ,\hat{\varvec{\beta }})\) is a local minimizer of \(Q(\lambda ,\beta_{0} ,\varvec{\beta})\), then there exists a local minimizer \((\hat{\varvec{\gamma }},\hat{\theta }_{0} ,\hat{\varvec{\theta }})\) of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) satisfying \(\hat{\beta }_{0} = \hat{\theta }_{0}\) and \(\hat{\beta }_{kj} = \hat{\gamma }_{k} \hat{\theta }_{kj}\). □
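The key computation in the proof of Lemma 2, namely that \(\gamma_{k} + \lambda\|\hat{\varvec{\beta}}_{(k)}\|_{1}/\gamma_{k}\) is minimized over \(\gamma_{k} > 0\) at \(\hat{\gamma}_{k} = (\lambda\|\hat{\varvec{\beta}}_{(k)}\|_{1})^{1/2}\), can be checked against a brute-force grid search. The values of λ and \(\|\hat{\varvec{\beta}}_{(k)}\|_{1}\) below are arbitrary illustrations:

```python
import numpy as np

# Illustrative values (assumptions): lam plays the role of the tuning parameter
# and b1 the role of ||beta_hat_(k)||_1 for an active group k.
lam, b1 = 0.8, 1.7

def f(g):
    # the group-k portion of the profiled penalty: gamma_k + lam * b1 / gamma_k
    return g + lam * b1 / g

g_hat = np.sqrt(lam * b1)  # claimed closed-form minimizer

# brute-force check over a fine grid of gamma_k > 0
grid = np.linspace(1e-3, 10.0, 200001)
assert abs(grid[np.argmin(f(grid))] - g_hat) < 1e-3
assert np.all(f(grid) >= f(g_hat) - 1e-9)
```

Substituting \(\hat{\gamma}_{k}\) back gives the profiled value \(2(\lambda\|\hat{\varvec{\beta}}_{(k)}\|_{1})^{1/2}\), consistent with the \(2\lambda^{1/2}\|\cdot\|_{1}^{1/2}\) terms appearing later in the proof.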

Proof of Lemma 3

Clearly, \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) is bounded below because each of its terms is nonnegative. Let \(\hat{\varvec{\gamma }}^{(t)}\) and \((\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\) denote the minimizers of the criteria in the optimization problems (3.1) and (3.2) at the tth iteration, respectively. Note that the minimization problems in Steps 1 and 2 are equivalent to minimizing \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t - 1)} ,\theta_{0} ,\varvec{\theta})\) with respect to \((\theta_{0} ,\varvec{\theta})\) and minimizing \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\) with respect to \(\varvec{\gamma}\), respectively. Since \(\hat{\varvec{\gamma }}^{(t)}\) is the minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\), we have \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} ) \le Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t - 1)} ,\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\).
Similarly, since \((\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} )\) is the minimizer of \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\theta_{0} ,\varvec{\theta})\) and \(\hat{\varvec{\gamma }}^{(t + 1)}\) is the minimizer of \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} )\), we have \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} ) \le Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\) and \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t + 1)} ,\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} ) \le Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} )\). Therefore, \(Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t + 1)} ,\hat{\theta }_{0}^{(t + 1)} ,\hat{\varvec{\theta }}^{(t + 1)} ) \le Q^{\diamondsuit } (\lambda ,\hat{\varvec{\gamma }}^{(t)} ,\hat{\theta }_{0}^{(t)} ,\hat{\varvec{\theta }}^{(t)} )\), so the criterion \(Q^{\diamondsuit } (\lambda ,\varvec{\gamma},\theta_{0} ,\varvec{\theta})\) never increases from one iteration to the next. □
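This monotone descent can be illustrated on a toy problem. Everything below is an illustrative assumption: synthetic data, an assumed hinge-plus-hierarchical-penalty criterion, and coordinate-wise ternary search standing in for the exact solvers of (3.1) and (3.2). Since every accepted update minimizes a convex one-dimensional slice, the criterion still never increases across iterations, which is the property the lemma establishes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, groups = 20, 4, [(0, 2), (2, 4)]   # two groups of two features (assumed)
X = rng.normal(size=(n, p))
y = np.where(X[:, 0] - X[:, 2] > 0, 1.0, -1.0)
lam = 0.5                                 # illustrative tuning value

def Q(gamma, theta0, theta):
    # assumed criterion: hinge loss in beta_kj = gamma_k * theta_kj
    # plus sum_k gamma_k + lam * sum_k ||theta_(k)||_1, gamma_k >= 0
    beta = np.concatenate([gamma[k] * theta[a:b] for k, (a, b) in enumerate(groups)])
    hinge = np.maximum(0.0, 1.0 - y * (theta0 + X @ beta)).sum()
    return hinge + gamma.sum() + lam * np.abs(theta).sum()

def line_min(f, lo, hi, iters=80):
    # ternary search: minimizes a convex function of one variable on [lo, hi]
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

gamma, theta0, theta = np.ones(len(groups)), 0.0, np.zeros(p)
vals = [Q(gamma, theta0, theta)]
for _ in range(8):
    # Step 1: update (theta0, theta) with gamma held fixed
    theta0 = line_min(lambda v: Q(gamma, v, theta), -5.0, 5.0)
    for j in range(p):
        theta[j] = line_min(
            lambda v, j=j: Q(gamma, theta0, np.r_[theta[:j], v, theta[j + 1:]]),
            -5.0, 5.0)
    # Step 2: update gamma >= 0 with (theta0, theta) held fixed
    for k in range(len(groups)):
        gamma[k] = line_min(
            lambda v, k=k: Q(np.r_[gamma[:k], v, gamma[k + 1:]], theta0, theta),
            0.0, 5.0)
    vals.append(Q(gamma, theta0, theta))

# the recorded criterion values are non-increasing
assert all(vals[i + 1] <= vals[i] + 1e-8 for i in range(len(vals) - 1))
```

In practice Step 1 is an L1-penalized SVM solve with its own dedicated algorithm; any pair of sub-routines that never increase their respective sub-criteria inherits the same monotonicity and, combined with boundedness below, yields convergence of the criterion values.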


Cite this article

Bang, S., Kang, J., Jhun, M. et al. Hierarchically penalized support vector machine with grouped variables. Int. J. Mach. Learn. & Cyber. 8, 1211–1221 (2017). https://doi.org/10.1007/s13042-016-0494-2
