Abstract
Different types of loss functions are often adopted in SVMs or their variants to meet practical requirements, and scaling up the corresponding SVMs to large datasets is becoming increasingly important in practice. In this paper, the extreme vector machine (EVM) is proposed to realize fast training of SVMs with different yet typical loss functions on large datasets. EVM begins with a fast approximation of the convex hull of the training data in the feature space, expressed by extreme vectors, and then completes the corresponding SVM optimization over the extreme vector set. When the hinge loss function is adopted, EVM coincides with the approximate extreme points support vector machine (AESVM) for classification. When the square hinge loss, least squares loss and Huber loss functions are adopted, EVM yields three versions, namely L2-EVM, LS-EVM and Hub-EVM, respectively, for classification or regression. In contrast to the most closely related machine, AESVM, EVM retains its theoretical advantage while being applicable to a wide variety of loss functions to meet practical requirements. Compared with the other state-of-the-art fast SVM training algorithms, CVM and FastKDE, EVM relaxes the restriction to least squares loss functions, and experimentally exhibits its superiority in training time, robustness and number of support vectors.
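At a high level, EVM trains in two stages: it first selects a small set of extreme vectors approximating the convex hull of the data in the kernel-induced feature space, and then runs the SVM optimization over that subset only. The sketch below illustrates the spirit of the first stage with a generic greedy farthest-point heuristic under an RBF kernel; it is an illustrative stand-in, not the paper's actual extreme-vector selection procedure, and all function names and parameters are hypothetical.

```python
import math
import random

def rbf(x, y, gamma=0.5):
    # RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def feature_dist2(x, y, gamma=0.5):
    # Squared distance in the RBF feature space:
    # ||phi(x) - phi(y)||^2 = k(x,x) + k(y,y) - 2 k(x,y) = 2 - 2 k(x,y)
    return 2.0 - 2.0 * rbf(x, y, gamma)

def select_extreme_vectors(X, m):
    # Greedy farthest-point traversal: repeatedly add the point farthest
    # (in feature space) from the current representative set. This tends
    # to pick points near the boundary of the data's convex hull.
    chosen = [0]
    while len(chosen) < m:
        best, best_d = None, -1.0
        for i in range(len(X)):
            if i in chosen:
                continue
            d = min(feature_dist2(X[i], X[j]) for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return chosen

random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
ev = select_extreme_vectors(X, 10)
print(len(ev))  # -> 10; the SVM would then be trained on X[ev] only
```

The payoff is that the subsequent SVM solve scales with the size of the extreme vector set rather than with the full training set.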
References
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
Vapnik V (1995) The nature of statistical learning theory. Springer, Berlin
Tahira M, Khan A (2016) Protein subcellular localization of fluorescence microscopy images: employing new statistical and Texton based image features and SVM based ensemble classification. Inf Sci 345(6):65–80
Li YJ, Leng QK, Fu YZ (2017) Cross kernel distance minimization for designing support vector machines. Int J Mach Learn Cybernet 8(5):1585–1593
Hu L, Lu SX, Wang XZ (2013) A new and informative active learning approach for support vector machine. Inf Sci 244(9):142–160
Bang S, Kang J, Jhun M, Kim E (2017) Hierarchically penalized support vector machine with grouped variables. Int J Mach Learn Cybernet 8(4):1211–1221
Reshma K, Pal A (2017) Tree based multi-category Laplacian TWSVM for content based image retrieval. Int J Mach Learn Cybernet 8(4):1197–1210
Muhammad T, Shubham K (2017) A regularization on Lagrangian twin support vector regression. Int J Mach Learn Cybernet 8(3):807–821
Williams C, Seeger M (2000) Using the Nyström method to speed up kernel machines. In: Proceedings of the 13th international conference on neural information processing systems, pp 661–667
Lin C (2007) On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Trans Neural Netw 18(6):1589–1595
Rahimi A, Recht B (2007) Random features for large-scale kernel machines. In: International conference on neural information processing systems. Curran Associates Inc., pp 1177–1184
Halko N, Martinsson PG, Tropp JA (2011) Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev 53(2):217–288
Keerthi S, Shevade S, Bhattacharyya C, Murthy K (2001) Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Comput 13(3):637–649
Peng XJ, Kong LY, Chen DJ (2017) A structural information-based twin-hypersphere support vector machine classifier. Int J Mach Learn Cybernet 8(1):295–308
Joachims T (1999) Making large-scale support vector machine learning practical. Advances in kernel methods. MIT Press, Cambridge, pp 169–184
Wang D, Qiao H, Zhang B, Wang M (2013) Online support vector machine based on convex hull vertices selection. IEEE Trans Neural Netw Learn Syst 24(4):593–609
Gu XQ, Chung FL, Wang ST (2018) Fast convex-hull vector machine for training on large-scale ncRNA data classification tasks. Knowl Based Syst 151(1):149–164
Osuna E, Castro OD (2002) Convex hull in feature space for support vector machines. In: Proceedings of advances in artificial intelligence, pp 411–419
Tsang I, Kwok J, Cheung P (2005) Core vector machines: fast SVM training on very large data sets. J Mach Learn Res 6:363–392
Tsang I, Kwok J, Zurada J (2006) Generalized core vector machines. IEEE Trans Neural Netw 17(5):1126–1140
Tsang I, Kocsor A, Kwok J (2007) Simpler core vector machines with enclosing balls. In: Proceedings of the 24th international conference on machine learning, pp 911–918
Wang ST, Wang J, Chung F (2014) Kernel density estimation, kernel methods, and fast learning in large data sets. IEEE Trans Cybernet 44(1):1–20
Nandan M, Khargonekar PP, Talathi SS (2014) Fast SVM training using approximate extreme points. J Mach Learn Res 15:59–98
Huang CQ, Chung FL, Wang ST (2016) Multi-view L2-SVM and its multi-view core vector machine. Neural Netw 75(3):110–125
Suykens J, Gestel T, Brabanter J, Moor B, Vandewalle J (2002) Least squares support vector machines. World Scientific Pub, Singapore
Xue H, Chen S, Yang Q (2009) Discriminatively regularized least-squares classification. Pattern Recogn 42(1):93–104
Karasuyama M, Takeuchi I (2010) Nonlinear regularization path for the modified Huber loss support vector machines. In: Proceedings of international joint conference on neural networks, pp 1–8
Cherkassky V, Ma Y (2004) Practical selection of SVM parameters and noise estimation for SVM regression. Neural Netw 17(1):113–126
Chau A, Li X, Yu W (2013) Large data sets classification using convex–concave hull and support vector machine. Soft Comput 17(5):793–804
Theodoridis S, Mavroforakis M (2007) Reduced convex hulls: a geometric approach to support vector machines. IEEE Signal Process Mag 24(3):119–122
Blum M, Floyd RW, Pratt V, Rivest RL, Tarjan RE (1973) Time bounds for selection. J Comput Syst Sci 7(8):448–461
Tax D, Duin R (1999) Support vector domain description. Pattern Recogn Lett 20(11):1191–1199
Chapelle O (2007) Training a support vector machine in the primal. Neural Comput 19(5):1155–1178
Charbonnier P, Blanc-Feraud L, Aubert G, Barlaud M (1997) Deterministic edge-preserving regularization in computed imaging. IEEE Trans Image Proc 6(2):298–311
Hartley R, Zisserman A (2003) Multiple view geometry in computer vision, 2nd edn. Cambridge University Press, Cambridge
Ye J, Xiong T (2007) SVM versus least squares SVM. In: Proceedings of the 7th international conference on artificial intelligence and statistics, pp 644–651
Lin C. LIBSVM data. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Accessed 28 Feb 2017
Alcalá-Fdez J, Fernandez A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Logic Soft Comput 17(2):255–287
Gao S, Tsang IW, Chia LT (2013) Sparse representation with kernels. IEEE Trans Image Process 22(2):423–434
Acknowledgements
This work was supported in part by the Hong Kong Polytechnic University under Grant G-UA3W, by the National Natural Science Foundation of China under Grant nos. 61572236, 61702225 and 61806026, by the Natural Science Foundation of Jiangsu Province under Grant BK20161268 and BK20180956.
Appendices
Appendix 1
1.1 Proof of Theorem 1
1. For the L2-SVM:

Let \(L_{L2-2}(\mathbf{w},b,\mathbf{X}^*)=(C/2N)\sum\nolimits_{t=1}^{M} l(\mathbf{w},b,\mathbf{z}_t)\sum\nolimits_{i=1}^{N} r_{i,t}\) and \(L_{L2-3}(\mathbf{w},b,\mathbf{X}^*)=(C/2N)\sum\nolimits_{i=1}^{N} l(\mathbf{w},b,\mathbf{u}_i)\), where \(\mathbf{u}_i=\sum\nolimits_{t=1}^{M} r_{i,t}\mathbf{z}_t\). From the precondition that \(y_i=y_j\) in each subset, we have
According to \(max(0,\;A+B) \leq max(0,\;A)+max(0,\;B)\),
Adding \((1/2){\left\| {\mathbf{w}} \right\|^2}\) to both sides of the inequality above, we get \({F_{L2-3}}{\text{(}}{\mathbf{w}},{\text{ }}b{\text{)}} \leq {F_{L2-2}}{\text{(}}{\mathbf{w}},{\text{ }}b{\text{)}}\).
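As a quick numerical spot-check (not part of the proof), the elementary inequality invoked in this step, \(\max(0, A+B) \leq \max(0, A) + \max(0, B)\), can be verified on random inputs:

```python
import random

# Spot-check the subadditivity of t -> max(0, t), which the proof applies
# to the hinge terms: max(0, a + b) <= max(0, a) + max(0, b).
random.seed(1)
for _ in range(10000):
    a = random.uniform(-5.0, 5.0)
    b = random.uniform(-5.0, 5.0)
    assert max(0.0, a + b) <= max(0.0, a) + max(0.0, b) + 1e-12
print("subadditivity holds on all samples")
```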
2. For the LS-SVM:

Let \(L_{LS-2}(\mathbf{w},b,\mathbf{X}^*)=(C/2N)\sum\nolimits_{t=1}^{M} l(\mathbf{w},b,\mathbf{z}_t)\sum\nolimits_{i=1}^{N} r_{i,t}\) and \(L_{LS-3}(\mathbf{w},b,\mathbf{X}^*)=(C/2N)\sum\nolimits_{i=1}^{N} l(\mathbf{w},b,\mathbf{u}_i)\). We have
According to Jensen’s inequality, if \(\lambda_1,\lambda_2,\ldots,\lambda_n\) are nonnegative real numbers such that \(\lambda_1+\lambda_2+\cdots+\lambda_n=1\) and \(\varphi(\cdot)\) is a real convex function, then \(\varphi(\lambda_1 x_1+\lambda_2 x_2+\cdots+\lambda_n x_n) \leq \lambda_1\varphi(x_1)+\lambda_2\varphi(x_2)+\cdots+\lambda_n\varphi(x_n)\) for any \(x_1,\ldots,x_n\). So, we can get

Adding \((1/2)\left\|\mathbf{w}\right\|^2\) to both sides of the inequality above, we get \(F_{LS-3}(\mathbf{w},\;b) \leq F_{LS-2}(\mathbf{w},\;b)\).
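A short numerical spot-check (again, illustrative only) of Jensen's inequality for the convex function \(\varphi(x)=x^2\) arising from the least squares loss:

```python
import random

# Verify phi(sum_i lam_i * x_i) <= sum_i lam_i * phi(x_i) for the convex
# function phi(x) = x^2, with nonnegative weights lam_i summing to 1.
def phi(x):
    return x * x

random.seed(2)
for _ in range(1000):
    n = random.randint(2, 6)
    raw = [random.random() for _ in range(n)]
    lam = [r / sum(raw) for r in raw]        # weights on the simplex
    xs = [random.uniform(-10.0, 10.0) for _ in range(n)]
    lhs = phi(sum(l * x for l, x in zip(lam, xs)))
    rhs = sum(l * phi(x) for l, x in zip(lam, xs))
    assert lhs <= rhs + 1e-9
print("Jensen's inequality holds on all samples")
```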
3. For the Hub-SVM:

Let \(L_{Hub-2app}(\mathbf{w},\mathbf{X}^*)=(C/2N)\sum\nolimits_{t=1}^{M} l_H(\mathbf{w},\mathbf{z}_t)\sum\nolimits_{i=1}^{N} r_{i,t}\) and \(L_{Hub-3app}(\mathbf{w},\mathbf{X}^*)=(C/2N)\sum\nolimits_{i=1}^{N} l_H(\mathbf{w},\mathbf{u}_i)\).
Adding \((1/2)\left\|\mathbf{w}\right\|^2\) to both sides of the inequality above, we get \(F_{Hub-3app}(\mathbf{w})=F_{Hub-2app}(\mathbf{w})\).
Appendix 2
2.1 Proof of Theorem 2
1. For the L2-SVM:

Let \(L_{L2-1}(\mathbf{w},b,\mathbf{X})=(C/2N)\sum\nolimits_{i=1}^{N} l(\mathbf{w},b,\mathbf{z}_i)\) denote the average square hinge loss minimized in (7), and let \(L_{L2-3}(\mathbf{w},b,\mathbf{X}^*)\) be as defined in Theorem 1. Then, we have
Setting \(\Delta_1=\max\nolimits_{1 \leq t \leq M} \max\{0,\;1-y_t(\mathbf{w}^T\mathbf{z}_t+b)\}\), we further have
Adding \((1/2){\left\| {\mathbf{w}} \right\|^2}\) to both sides of the inequality above, then this theorem is proved.
2. For the LS-SVM:

We have \(L_{LS-1}(\mathbf{w},b,\mathbf{X})=(C/2N)\sum\nolimits_{i=1}^{N} (y_i-(\mathbf{w}^T\mathbf{z}_i+b))^2\).
Assume that the maximum absolute value of \(y_t-(\mathbf{w}^T\mathbf{z}_t+b)\) over the dataset \(\mathbf{X}\) is \(\Delta_2=\max\nolimits_{1 \leq t \leq M} \left|y_t-(\mathbf{w}^T\mathbf{z}_t+b)\right|\). Then, we have
Adding \((1/2)\left\|\mathbf{w}\right\|^2\) to both sides of the inequality above and noting that M ≪ N, this theorem is proved.
3. For the Hub-SVM:

We have
Meanwhile, we can get:
Appendix 3
3.1 Proof of Corollary 1
1. For the L2-EVM, \(F_{L2-1}(\mathbf{w}_1^*,\;b_1^*)-F_{L2-2}(\mathbf{w}_2^*,\;b_2^*) \leq C^2\varepsilon/2+CM\Delta_1\sqrt{C\varepsilon}\).
Since \((\mathbf{w}_1^*,\;b_1^*)\) is the solution of (7), \(F_{L2-1}(\mathbf{w}_1^*,\;b_1^*) \leq F_{L2-1}(\mathbf{w}_2^*,\;b_2^*)\).
Using Theorem 1, we obtain
2. For the LS-EVM, \(F_{LS-1}(\mathbf{w}_1^*,\;b_1^*)-F_{LS-2}(\mathbf{w}_2^*,\;b_2^*) \leq CM\Delta_2\Omega\sqrt{\varepsilon}+C\Omega^2\varepsilon/2\).
Since \(F_{LS-1}(\mathbf{w}_2^*,\;b_2^*)-F_{LS-3}(\mathbf{w}_2^*,\;b_2^*) \leq (CM/N)\mathbf{w}_2^{*T}\Delta_2\sum\nolimits_{i=1}^{N}\tau_i+(C/2N)\sum\nolimits_{i=1}^{N}(\mathbf{w}_2^{*T}\tau_i)^2 \leq CM\Delta_2\Omega\sqrt{\varepsilon}+C\Omega^2\varepsilon/2\), we have
3. For the Hub-EVM,

$$-(C^2/4N)\sqrt{C\varepsilon(1+h)/2}-(\Delta_3 MC\sqrt{CN}/2N)\left(C\varepsilon(1+h)/2\right)^{1/4} \leq F_{Hub-1app}(\mathbf{w}_1^*)-F_{Hub-2app}(\mathbf{w}_2^*) \leq (C^2/4N)\sqrt{C\varepsilon(1+h)/2}+(\Delta_3 MC\sqrt{CN}/2N)\left(C\varepsilon(1+h)/2\right)^{1/4},$$

where \(\Delta_3=\max\nolimits_{1 \leq t \leq M}\sqrt{\left|1+h-(C/2N)y_t\mathbf{w}^T\mathbf{z}_t\right|}\).
According to Theorems 1 and 2, we get
Defining \(\Delta_3=\max\nolimits_{1 \leq t \leq M}\sqrt{\left|1+h-(C/2N)y_t\mathbf{w}_2^{*T}\mathbf{z}_t\right|}\) and using \(\left\|\mathbf{w}^*\right\| \leq \sqrt{C(1+h)/2}\), we immediately have
Appendix 4
4.1 Proof of Corollary 2
Based on Theorem 1, we know that \(F_{Hub-3app}(\mathbf{w}_3^*) \leq F_{Hub-3app}(\mathbf{w}_2^*)=F_{Hub-2app}(\mathbf{w}_2^*)\). Meanwhile, \(F_{Hub-3app}(\mathbf{w}_3^*)=F_{Hub-2app}(\mathbf{w}_3^*) \geq F_{Hub-2app}(\mathbf{w}_2^*)\). Hence, we have \(F_{Hub-3app}(\mathbf{w}_3^*)=F_{Hub-3app}(\mathbf{w}_2^*)\).
From these results, we get \(F_{Hub-3app}(\mathbf{w}_2^*)-F_{Hub-3app}(\mathbf{w}_1^*) \leq 0\). From Theorem 2, we have the following inequalities
and
Adding these two inequalities and using \(\left\|\mathbf{w}^*\right\| \leq \sqrt{C(1+h)/2}\), we get
Cite this article
Gu, X., Chung, Fl. & Wang, S. Extreme vector machine for fast training on large data. Int. J. Mach. Learn. & Cyber. 11, 33–53 (2020). https://doi.org/10.1007/s13042-019-00936-3