Sample-based online learning for bi-regular hinge loss

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Support vector machine (SVM), a state-of-the-art classifier for supervised classification tasks, is known for the strong generalization guarantees derived from its max-margin property. In this paper, we focus on the maximum-margin classification problem cast by SVM and study the bi-regular hinge loss model, which not only performs feature selection but also tends to select highly correlated features together. To solve this model, we propose an online learning algorithm that tackles a non-smooth minimization problem via an alternating iterative mechanism. Specifically, the proposed algorithm alternates between intrusion-sample detection and iterative optimization, and at each iteration it obtains a closed-form solution to the model. In theory, we prove that the proposed algorithm achieves an \(O(1/\sqrt{T})\) convergence rate under mild conditions, where T is the number of training samples received in online learning. Experimental results on synthetic data and benchmark datasets demonstrate the effectiveness of our approach in comparison with several popular algorithms, such as LIBSVM, SGD, PEGASOS, and SVRG.
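
As a rough, self-contained illustration of the problem class described above (and not the paper's algorithm, whose alternating detection/optimization steps and closed-form updates are developed in the main text), the sketch below runs online proximal subgradient steps on a hinge loss with an elastic-net-style penalty; the penalty form, the weights alpha and beta, and the \(O(1/\sqrt{t})\) step size are illustrative assumptions:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def online_elastic_net_hinge(stream, dim, alpha=0.01, beta=0.01):
    """Online proximal subgradient steps for
        hinge(w; x, y) + alpha * ||w||_1 + beta * ||w||_2^2.
    `stream` yields (x, y) pairs with y in {-1, +1}.  This only sketches the
    problem class discussed in the paper, not its alternating algorithm."""
    w = np.zeros(dim)
    cumulative_hinge = 0.0
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / np.sqrt(t)                          # O(1/sqrt(t)) step size
        margin = y * np.dot(w, x)
        cumulative_hinge += max(0.0, 1.0 - margin)      # loss on the new sample
        g = (-y * x if margin < 1.0 else np.zeros(dim)) + 2.0 * beta * w
        w = soft_threshold(w - eta * g, eta * alpha)    # prox step for the l1 term
    return w, cumulative_hinge

# toy usage on a linearly separable stream
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.0, 0.0])
X = rng.normal(size=(500, 4))
y = np.sign(X @ w_true)
w_hat, loss = online_elastic_net_hinge(zip(X, y), dim=4)
print(w_hat, loss)
```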


Notes

  1. Problems whose solutions are required to be over the integers are in general NP-hard [16].

  2. The learner in an online learning model makes predictions on a sequence of samples, one after the other, and receives a loss after each prediction. The goal is to minimize the accumulated losses [30, 31]. The first explicit models of online learning were proposed by Angluin [2] and Littlestone [28]. (A minimal code sketch of this predict-then-observe protocol is given at the end of these notes.)

  3. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets.

  4. http://archive.ics.uci.edu/ml/datasets.php.

  5. CVX is free Matlab software for disciplined convex programming; it can be downloaded from http://cvxr.com/cvx.

  6. The objective function values here include the hinge loss and the bi-regular term.
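
As referenced in note 2, the online learning protocol amounts to: predict on the incoming sample, observe its loss, update, and measure performance by regret against a fixed comparator. The sketch below is a generic illustration of that protocol only; the learner, loss, and comparator are illustrative choices, not the paper's model:

```python
import numpy as np

class PerceptronLearner:
    """A deliberately simple online learner, used only to make the protocol concrete."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
    def predict(self, x):
        return 1.0 if np.dot(self.w, x) >= 0 else -1.0
    def update(self, x, y):
        if y * np.dot(self.w, x) <= 0:      # mistake-driven update
            self.w += y * x

def run_online(stream, learner, loss, comparator):
    """Predict on each sample, then observe its loss, then let the learner update.
    Returns the regret: accumulated loss minus that of a fixed comparator."""
    cum_learner, cum_comparator = 0.0, 0.0
    for x, y in stream:
        cum_learner += loss(learner.predict(x), y)   # loss revealed after predicting
        cum_comparator += loss(comparator(x), y)
        learner.update(x, y)
    return cum_learner - cum_comparator

# toy usage: compare against the data-generating rule (an illustrative comparator)
rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0, 0.5])
X = rng.normal(size=(300, 3))
y = np.sign(X @ w_star)
zero_one = lambda prediction, truth: float(prediction != truth)
regret = run_online(zip(X, y), PerceptronLearner(3), zero_one,
                    comparator=lambda x: 1.0 if x @ w_star >= 0 else -1.0)
print(regret)
```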

References

  1. Akbari M, Gharesifard B, Linder T (2019) Individual regret bounds for the distributed online alternating direction method of multipliers. IEEE Trans Autom Control 64(4):1746–1752

  2. Angluin D (1988) Queries and concept learning. Mach Learn 2:319–342

  3. Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8:141–148

  4. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2010) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found Trends Mach Learn 3(1):1–122

  5. Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin

  6. Candès EJ, Wakin MB, Boyd SP (2008) Enhancing sparsity by reweighted \(l_1\) minimization. J Fourier Anal Appl 14(5):877–905

  7. Chang KW, Hsieh CJ, Lin CJ (2008) Coordinate descent method for large-scale l2-loss linear support vector machines. J Mach Learn Res 9:1369–1398

  8. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27:1–27:27

  9. Chauhan VK, Dahiya K, Sharma A (2019) Problem formulations and solvers in linear SVM: a review. Artif Intell Rev 52:803–855

  10. Chauhan VK, Sharma A, Dahiya K (2020) Stochastic trust region inexact Newton method for large-scale machine learning. Int J Mach Learn Cybern 11:1541–1555

  11. Cohen K, Nedić A, Srikant R (2017) On projected stochastic gradient descent algorithm with weighted averaging for least squares regression. IEEE Trans Autom Control 62(11):5974–5981

  12. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  13. De Mol C, De Vito E, Rosasco L (2009) Elastic-net regularization in learning theory. J Complex 25(2):201–230

  14. Duchi JC, Shalev-Shwartz S, Singer Y, Tewari A (2010) Composite Objective Mirror Descent. In: Proceedings of the 23rd annual conference on learning theory, pp 14–26

  15. Gabay D, Mercier B (1976) A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput Math Appl 2(1):17–40

  16. Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. Freeman, New York

  17. Glowinski R, Marrocco A (1975) Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité, d’une classe de problèmes de Dirichlet non linéaires. Rev Fr Automat Infor 9:41–76

  18. Gong Y, Xu W (2007) Machine learning for multimedia content analysis. Springer Science & Business Media, New York

  19. Hajewski J, Oliveira S, Stewart D (2018) Smoothed hinge loss and l1 support vector machines. In: Proceedings of the 2018 IEEE international conference on data mining workshops, pp 1217–1223

  20. He B, Yuan X (2012) On the \(O(1/n)\) convergence rate of the Douglas-Rachford alternating direction method. SIAM J Numer Anal 50(2):700–709

  21. Hsieh CJ, Chang KW, Lin CJ, Keerthi SS, Sundararajan S (2008) A dual coordinate descent method for large-scale linear SVM. In: Proceedings of the 25th international conference on machine learning, pp 408–415

  22. Huang F, Chen S, Huang H (2019) Faster stochastic alternating direction method of multipliers for nonconvex optimization. In: Proceedings of the 36th international conference on machine learning, pp 2839–2848

  23. Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 217–226

  24. Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in neural information processing systems, pp 315–323

  25. Khan ZA, Zubair S, Alquhayz H, Azeem M, Ditta A (2019) Design of momentum fractional stochastic gradient descent for recommender systems. IEEE Access 7:179575–179590

  26. Lin CJ, Weng RC, Sathiya Keerthi S (2008) Trust region Newton method for large-scale logistic regression. J Mach Learn Res 9:627–650

  27. Liu Y, Shang F, Cheng J (2017) Accelerated variance reduced stochastic ADMM. In: Proceedings of the 31st AAAI conference on artificial intelligence, pp 2287–2293

  28. Littlestone N (1988) Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Mach Learn 2(4):285–318

  29. Nalepa J, Kawulok M (2019) Selecting training sets for support vector machines: a review. Artif Intell Rev 52(2):857–900

  30. Sammut C, Webb GI (2011) Encyclopedia of Machine Learning. Springer Science & Business Media, New York

  31. Shalev-Shwartz S (2012) Online learning and online convex optimization. Found Trends Mach Learn 4(2):107–194

  32. Shalev-Shwartz S, Singer Y, Srebro N, Cotter A (2011) Pegasos: primal estimated sub-gradient solver for SVM. Math Program 127(1):3–30

  33. Singla M, Shukla KK (2020) Robust statistics-based support vector machine and its variants: a survey. Neural Comput Appl 32:11173–11194

  34. Song T, Li D, Liu Z, Yang W (2019) Online ADMM-based extreme learning machine for sparse supervised learning. IEEE Access 7:64533–64544

  35. Suzuki T (2013) Dual averaging and proximal gradient descent for online alternating direction multiplier method. In: Proceedings of the 30th international conference on machine learning, pp 392–400

  36. Tan C, Ma S, Dai YH, Qian Y (2016) Barzilai-Borwein step size for stochastic gradient descent. In: Advances in neural information processing systems, pp 685–693

  37. Vapnik V (1995) The nature of statistical learning theory. Springer, New York

  38. Wang L, Zhu J, Zou H (2006) The doubly regularized support vector machine. Stat Sinica 16:589–615

  39. Wang L, Zhu J, Zou H (2008) Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 24(3):412–419

  40. Wang Z, Hu R, Wang S, Jiang J (2014) Face hallucination via weighted adaptive sparse regularization. IEEE Trans Circuits Syst Video Technol 24(5):802–813

  41. Xiao L (2009) Dual averaging methods for regularized stochastic learning and online optimization. In: Advances in neural information processing systems, pp 2116–2124

  42. Xie Z, Li Y (2019) Large-scale support vector regression with budgeted stochastic gradient descent. Int J Mach Learn Cybern 10(6):1529–1541

  43. Xu Y, Akrotirianakis I, Chakraborty A (2016) Proximal gradient method for huberized support vector machine. Pattern Anal Appl 19(4):989–1005

  44. Xue W, Zhang W (2017) Learning a coupled linearized method in online setting. IEEE Trans Neural Netw Learn Syst 28(2):438–450

  45. Zamora E, Sossa H (2017) Dendrite morphological neurons trained by stochastic gradient descent. Neurocomputing 260:420–431

  46. Zhao P, Zhang T (2015) Stochastic optimization with importance sampling for regularized loss minimization. In: Proceedings of the 32nd international conference on machine learning

  47. Zhu J, Rosset S, Hastie T, Tibshirani R (2004) 1-norm support vector machines. In: Advances in neural information processing systems, pp 49–56

  48. Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th international conference on machine learning, pp 928–936

  49. Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc B 67(2):301–320

  50. Zou H, Zhang H (2009) On the adaptive elastic-net with a diverging number of parameters. Ann Stat 37(4):1733–1751

Acknowledgements

The authors thank the editor and the reviewers for their constructive comments and suggestions that greatly improved the quality and presentation of this paper. This work was partly supported by the National Natural Science Foundation of China (Grant nos. 12071104, 61671456, 61806004, 61971428), the China Postdoctoral Science Foundation (Grant no. 2020T130767), the Natural Science Foundation of the Anhui Higher Education Institutions of China (Grant no. KJ2019A0082), and the Natural Science Foundation of Zhejiang Province, China (Grant no. LD19A010002).

Author information

Correspondence to Wei Xue or Ping Zhong.

Appendices

Appendix 1: Auxiliary Lemmas

To prove Theorem 1, we first present the following auxiliary lemmas.

Lemma 1

For all \(\varvec{a}, \varvec{b}, \varvec{c} \in \mathbb{R}^n\) and \(h\in \mathbb{R}\), we have

$$\begin{aligned} h\langle \varvec{a}-\varvec{b}, \varvec{b}-\varvec{c}\rangle = \frac{h}{2} \Big (- \Vert \varvec{a}-\varvec{b}\Vert ^2 + \Vert \varvec{a}-\varvec{c}\Vert ^2 - \Vert \varvec{b}-\varvec{c}\Vert ^2 \Big ). \end{aligned}$$

Proof

$$\begin{aligned} h\left\langle \varvec{a}-\varvec{b}, \varvec{b}-\varvec{c} \right\rangle =&h\left\langle \varvec{a} - \frac{\varvec{b}+\varvec{c}}{2} + \frac{\varvec{b}+\varvec{c}}{2} -\varvec{b}, \varvec{b}-\varvec{c} \right\rangle \\ =&h\left\langle \frac{\varvec{a}-\varvec{b}}{2} + \frac{\varvec{a}-\varvec{c}}{2}, \varvec{b}-\varvec{c} \right\rangle + h\left\langle \frac{\varvec{c}-\varvec{b}}{2}, \varvec{b}-\varvec{c} \right\rangle \\ =&-\frac{h}{2}\Vert \varvec{a}-\varvec{b}\Vert ^2 + \frac{h}{2}\Vert \varvec{a}-\varvec{c}\Vert ^2 - \frac{h}{2}\Vert \varvec{b}-\varvec{c}\Vert ^2. \end{aligned}$$

\(\square\)

In particular, when the scalar h is replaced by a symmetric matrix H, we have

$$\begin{aligned} (\varvec{a}-\varvec{b})^TH(\varvec{b}-\varvec{c}) = - \frac{1}{2}\Vert \varvec{a}-\varvec{b}\Vert _H^2 + \frac{1}{2}\Vert \varvec{a}-\varvec{c}\Vert _H^2 - \frac{1}{2}\Vert \varvec{b}-\varvec{c}\Vert _H^2. \end{aligned}$$
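
As a quick numerical sanity check of both forms of the identity (not part of the original proof; the dimension, the random draws, and the construction of the symmetric matrix H below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
a, b, c = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
h = rng.normal()

# scalar form of Lemma 1
lhs = h * np.dot(a - b, b - c)
rhs = 0.5 * h * (-np.dot(a - b, a - b) + np.dot(a - c, a - c) - np.dot(b - c, b - c))
assert np.isclose(lhs, rhs)

# matrix form with a symmetric H, writing ||v||_H^2 for v^T H v
H = rng.normal(size=(n, n))
H = H + H.T
sq = lambda v: v @ H @ v
assert np.isclose((a - b) @ H @ (b - c), 0.5 * (-sq(a - b) + sq(a - c) - sq(b - c)))
print("Lemma 1 identities hold numerically")
```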

Following Lemma 11 in [35], we obtain the following result.

Lemma 2

Under the update rules of Algorithm 1, it holds that

$$\begin{aligned} \begin{aligned}&\langle \varvec{z}_*-\varvec{z}_t, \varvec{\lambda }_t-\tilde{\varvec{\lambda }}_t \rangle + \left\langle \varvec{\lambda }_*-\tilde{\varvec{\lambda }}_t, \frac{1}{\rho }(\tilde{\varvec{\lambda }}_t-\varvec{\lambda }_t) \right\rangle \\&\quad \le \frac{\rho }{2}\Vert \varvec{z}_t - \varvec{z}_*\Vert ^2 - \frac{\rho }{2}\Vert \varvec{z}_{t+1} - \varvec{z}_*\Vert ^2 \\&\qquad + \frac{1}{2\rho }(\Vert \varvec{\lambda }_t - \varvec{\lambda }_*\Vert ^2 - \Vert \varvec{\lambda }_{t+1} - \varvec{\lambda }_*\Vert ^2 - \Vert \varvec{\lambda }_t- \varvec{\lambda }_{t+1} \Vert ^2) \\&\qquad + \langle \varvec{z}_*- \varvec{z}_{t+1}, \varvec{\lambda }_* - \varvec{\lambda }_{t+1}\rangle - \langle \varvec{z}_*- \varvec{z}_t, \varvec{\lambda }_* - \varvec{\lambda }_t\rangle . \end{aligned} \end{aligned}$$

Appendix 2: Proof of Theorem 1

Proof

First, by the definition of \(\mathcal {R}_T\) and the optimality conditions for \(\varvec{w}_{t+1}\) and \(\varvec{z}_t\), we have

$$\begin{aligned} \begin{aligned} \mathcal {R}_T \le&\sum _{t=1}^T \Big \{ \langle \varvec{g}_t, \varvec{w}_t-\varvec{w}_* \rangle \\&\quad + \beta \langle \nabla q(\varvec{z}_t), \varvec{z}_t-\varvec{z}_*\rangle + \alpha \big [p(\varvec{w}_t)-p(\varvec{w}_*) \big ] \Big \}\\ =&\sum _{t=1}^T \Big \{ \langle \varvec{g}_t,\varvec{w}_{t+1}-\varvec{w}_{*}\rangle + \langle \varvec{g}_t,\varvec{w}_{t}-\varvec{w}_{t+1}\rangle \\&\quad + \langle -\varvec{\lambda }_t, \varvec{z}_t-\varvec{z}_*\rangle + \alpha (\Vert \varvec{w}_t\Vert ^2 - \Vert \varvec{w}_*\Vert ^2) \Big \} \\ \le&\sum _{t=1}^T \Big \{ \underbrace{\langle \tilde{\varvec{\lambda }}_t-\frac{1}{\eta _t}Q_t(\varvec{w}_{t+1} - \varvec{w}_*), \varvec{w}_{t+1} - \varvec{w}_*\rangle +\langle -\varvec{\lambda }_t, \varvec{z}_t-\varvec{z}_*\rangle }\\&\quad +\langle \varvec{g}_t,\varvec{w}_{t}-\varvec{w}_{t+1}\rangle +\alpha (\Vert \varvec{w}_t\Vert ^2 - \Vert \varvec{w}_*\Vert ^2)\Big \} \\ =&\sum _{t=1}^T \Bigg \{ \underbrace{ \left( \begin{array}{c} -\tilde{\varvec{\lambda }}_t \\ \tilde{\varvec{\lambda }}_t \\ \end{array} \right) ^T \left( \begin{array}{c} \varvec{w}_* - \varvec{w}_t \\ \varvec{z}_* - \varvec{z}_t \\ \end{array} \right) + \left( \begin{array}{c} \varvec{w}_* - \varvec{w}_{t+1} \\ \varvec{z}_* - \varvec{z}_t \\ \end{array} \right) ^T \left( \begin{array}{c} \frac{1}{\eta _t}Q_t(\varvec{w}_{t+1} - \varvec{w}_t) \\ \varvec{\lambda }_t - \tilde{\varvec{\lambda }}_t \\ \end{array} \right) \quad + \langle \tilde{\varvec{\lambda }}_t, \varvec{w}_{t+1} - \varvec{w}_{t} \rangle }\\&\quad + \langle \varvec{g}_t,\varvec{w}_{t}-\varvec{w}_{t+1}\rangle + \alpha (\Vert \varvec{w}_t\Vert ^2 - \Vert \varvec{w}_*\Vert ^2) \Bigg \} \\ \le&\sum _{t=1}^T \Bigg \{ \underbrace{ \left( \begin{array}{c} -\tilde{\varvec{\lambda }}_t \\ \tilde{\varvec{\lambda }}_t \\ \varvec{w}_t - \varvec{z}_t \\ \end{array} \right) ^T \left( \begin{array}{c} \varvec{w}_* - \varvec{w}_t \\ \varvec{z}_* - \varvec{z}_t \\ \varvec{\lambda }_* -\tilde{\varvec{\lambda }}_t \\ \end{array} \right) + \left( \begin{array}{c} \varvec{w}_* - \varvec{w}_{t+1} \\ \varvec{z}_* - \varvec{z}_t \\ \varvec{\lambda }_* -\tilde{\varvec{\lambda }}_t \\ \end{array} \right) ^T \left( \begin{array}{c} \frac{1}{\eta _t}Q_t(\varvec{w}_{t+1} - \varvec{w}_t) \\ \varvec{\lambda }_t - \tilde{\varvec{\lambda }}_t \\ \frac{\tilde{\varvec{\lambda }}_t - \varvec{\lambda }_t}{\rho } - (\varvec{w}_{t} - \varvec{w}_{t+1}) \\ \end{array} \right) + \langle \tilde{\varvec{\lambda }}_t, \varvec{w}_{t+1}-\varvec{w}_{t}\rangle } \\&+ \langle \varvec{g}_t,\varvec{w}_{t}-\varvec{w}_{t+1}\rangle + \alpha (\Vert \varvec{w}_t-\varvec{w}_*\Vert ^2) \Bigg \}. \end{aligned} \end{aligned}$$
(18)

By using Lemma 1, we have

$$\begin{aligned} \left\langle \varvec{w}_* - \varvec{w}_{t+1}, \frac{1}{\eta _t}Q_t(\varvec{w}_{t+1} - \varvec{w}_t)\right\rangle= & {} - \frac{1}{2\eta _t}(\Vert \varvec{w}_{t+1} - \varvec{w}_*\Vert _{Q_t}^2 \nonumber \\&- \Vert \varvec{w}_{t} - \varvec{w}_*\Vert _{Q_t}^2 \nonumber \\&+ \Vert \varvec{w}_{t+1} - \varvec{w}_t\Vert _{Q_t}^2). \end{aligned}$$
(19)

In addition, it holds that \(\langle \varvec{g}_t,\varvec{w}_{t}-\varvec{w}_{t+1}\rangle \le \frac{\eta _t}{2}\Vert \varvec{g}_t\Vert _{Q_t^{-1}}^2 + \frac{1}{2\eta _t}\Vert \varvec{w}_{t}-\varvec{w}_{t+1}\Vert _{Q_t}^2\). Plugging this inequality, Eq. (19), and Lemma 2 into the right-hand side of Eq. (18) gives

$$\begin{aligned} \begin{aligned} \mathcal {R}_T \le&\sum _{t=1}^T \langle -\varvec{\lambda }_*, \varvec{w}_t - \varvec{z}_t \rangle \\&\quad + \sum _{t=1}^T \Big (\frac{1}{2\eta _t}\Vert \varvec{w}_t-\varvec{w}_*\Vert _{Q_t}^2 - \frac{1}{2\eta _t}\Vert \varvec{w}_{t+1}-\varvec{w}_*\Vert _{Q_t}^2\Big )\\&\quad + \sum _{t=1}^T \frac{\eta _t}{2}\Vert \varvec{g}_t\Vert _{Q_t^{-1}}^2 \\&\quad + \sum _{t=1}^T \alpha \Vert \varvec{w}_t-\varvec{w}_*\Vert ^2 + \frac{\rho }{2}\Vert \varvec{z}_1 - \varvec{z}_*\Vert ^2 - \frac{\rho }{2}\Vert \varvec{z}_{T+1} - \varvec{z}_*\Vert ^2\\&\quad + \frac{1}{2\rho }\Vert \varvec{\lambda }_1 - \varvec{\lambda }_*\Vert ^2 - \frac{1}{2\rho }\Vert \varvec{\lambda }_{T+1} - \varvec{\lambda }_*\Vert ^2 \\&\quad - \sum _{t=1}^T \frac{1}{2\rho }\Vert \varvec{\lambda }_t - \varvec{\lambda }_{t+1}\Vert ^2 + \langle \varvec{\lambda }_*, \varvec{w}_{T+1} - \varvec{w}_1 \rangle \\&\quad + \langle \varvec{z}_*- \varvec{z}_{T+1}, \varvec{\lambda }_* - \varvec{\lambda }_{T+1}\rangle - \langle \varvec{z}_*- \varvec{z}_1, \varvec{\lambda }_* - \varvec{\lambda }_1\rangle . \end{aligned} \end{aligned}$$
(20)
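
For completeness, the bound on \(\langle \varvec{g}_t,\varvec{w}_{t}-\varvec{w}_{t+1}\rangle\) used above is a standard consequence of the Cauchy–Schwarz and Young inequalities in the \(Q_t\)-weighted norm (assuming \(Q_t \succ 0\), so that \(Q_t^{1/2}\) exists):

$$\begin{aligned} \langle \varvec{g}_t, \varvec{w}_t-\varvec{w}_{t+1}\rangle =&\langle Q_t^{-1/2}\varvec{g}_t, Q_t^{1/2}(\varvec{w}_t-\varvec{w}_{t+1})\rangle \\ \le&\Vert \varvec{g}_t\Vert _{Q_t^{-1}} \Vert \varvec{w}_t-\varvec{w}_{t+1}\Vert _{Q_t} \\ \le&\frac{\eta _t}{2}\Vert \varvec{g}_t\Vert _{Q_t^{-1}}^2 + \frac{1}{2\eta _t}\Vert \varvec{w}_{t}-\varvec{w}_{t+1}\Vert _{Q_t}^2, \end{aligned}$$

where the last step uses \(uv \le \frac{\eta _t}{2}u^2 + \frac{1}{2\eta _t}v^2\) for nonnegative scalars u and v.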

Noting that \(Q_t=(\gamma -\rho \eta _t)I\), we have

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^T \Big (\frac{1}{2\eta _t}\Vert \varvec{w}_t-\varvec{w}_*\Vert _{Q_t}^2 - \frac{1}{2\eta _t}\Vert \varvec{w}_{t+1}-\varvec{w}_*\Vert _{Q_t}^2\Big )\\&\quad \le \frac{\Vert \varvec{w}_1-\varvec{w}_*\Vert _{Q_1}^2}{2\eta _1} + \sum _{t=2}^T \Big (\frac{\Vert \varvec{w}_t-\varvec{w}_*\Vert _{Q_t}^2}{2\eta _t} - \frac{\Vert \varvec{w}_{t}-\varvec{w}_*\Vert _{Q_{t-1}}^2}{2\eta _{t-1}}\Big ) \\&\quad = \frac{\Vert \varvec{w}_1-\varvec{w}_*\Vert _{Q_1}^2}{2\eta _1} + \sum _{t=2}^T \Big (\frac{\gamma }{2\eta _t} - \frac{\gamma }{2\eta _{t-1}} \Big )\Vert \varvec{w}_t-\varvec{w}_*\Vert ^2. \end{aligned} \end{aligned}$$
(21)

Further, it follows from the optimality condition \(\langle \beta \nabla q(\varvec{z}_t) + \varvec{\lambda }_t, \varvec{z} -\varvec{z}_t \rangle \ge 0\) (taking \(t=T+1\) and \(\varvec{z}=\varvec{z}_*\)) that \(\langle \varvec{z}_*- \varvec{z}_{T+1}, \varvec{\lambda }_* - \varvec{\lambda }_{T+1}\rangle \le \langle \varvec{z}_* - \varvec{z}_{T+1}, \beta \nabla q(\varvec{z}_{T+1}) + \varvec{\lambda }_* \rangle\). Then plugging this inequality and Eq. (21) into the right-hand side of Eq. (20) yields

$$\begin{aligned} \begin{aligned} \mathcal {R}_T \le&\sum _{t=1}^T \alpha \Vert \varvec{w}_t-\varvec{w}_*\Vert ^2 + \sum _{t=2}^T \Big (\frac{\gamma }{2\eta _t} - \frac{\gamma }{2\eta _{t-1}} \Big )\Vert \varvec{w}_t-\varvec{w}_*\Vert ^2\\&\quad + \sum _{t=1}^T \frac{\eta _t}{2}\Vert \varvec{g}_t\Vert _{Q_t^{-1}}^2 + \frac{\Vert \varvec{w}_*\Vert _{Q_1}^2}{2\eta _1} + \frac{\rho }{2}\Vert \varvec{z}_*\Vert ^2 + \frac{1}{2\rho }\Vert \varvec{\lambda }_*\Vert ^2 \\&\quad + \langle T\varvec{\lambda }_*, \bar{\varvec{z}}_T - \bar{\varvec{w}}_T\rangle + \langle \varvec{\lambda }_*, \varvec{w}_{T+1} \rangle \\&\quad + \langle \varvec{z}_* - \varvec{z}_{T+1}, \beta \nabla q(\varvec{z}_{T+1}) + \varvec{\lambda }_* \rangle - \langle \varvec{z}_*, \varvec{\lambda }_*\rangle , \end{aligned} \end{aligned}$$

where we use \(\varvec{\lambda }_{t} = \varvec{\lambda }_{t-1} - \rho (\varvec{w}_{t}-\varvec{z}_{t})\) and the initial values of \(\varvec{w}_1\), \(\varvec{z}_1\), and \(\varvec{\lambda }_1\). This gives the result as desired. \(\square\)

Cite this article

Xue, W., Zhong, P., Zhang, W. et al. Sample-based online learning for bi-regular hinge loss. Int. J. Mach. Learn. & Cyber. 12, 1753–1768 (2021). https://doi.org/10.1007/s13042-020-01272-7
