Abstract
Traditional Support Vector Machine (SVM) solvers suffer from O(n²) time complexity, which makes them impractical for very large datasets. To reduce this high computational cost, several data reduction methods have been proposed in previous studies. However, such methods are not effective at extracting informative patterns. In this paper, a two-stage informative pattern extraction approach is proposed. The first stage of our approach is data cleaning based on bootstrap sampling: an ensemble of weak SVM classifiers is constructed on the sampled datasets, and training samples correctly classified by all of the weak classifiers are removed, since they carry little useful information for training. To extract still more informative training data, two informative pattern extraction algorithms are proposed in the second stage. As most training data are eliminated and only the more informative samples remain, the final SVM training time is reduced significantly. The contributions of this paper are three-fold. (1) First, a parallelized bootstrap-sampling-based method is proposed to clean the initial training data, eliminating a large number of training samples that carry little information. (2) Second, we present two algorithms to effectively extract more informative training data. Both algorithms select samples with maximum information entropy, computed from the empirical misclassification probability of each sample estimated in the first stage; because the training set is further reduced, training time decreases accordingly. (3) Finally, empirical studies on four large datasets show the effectiveness of our approach in reducing the training data size and the computational cost, compared with state-of-the-art algorithms including PEGASOS, LIBLINEAR SVM and RSVM. Meanwhile, the generalization performance of our approach is comparable with that of the baseline methods.
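The two stages described above can be sketched in code. The following is a minimal, illustrative Python sketch, not the authors' implementation: function names, the use of `LinearSVC` as the weak classifier, and all parameter values (`n_classifiers`, `sample_frac`, `budget`) are assumptions for illustration. Stage 1 trains weak SVMs on bootstrap samples and drops every point that all weak classifiers get right; stage 2 ranks the survivors by the binary entropy of their empirical misclassification probability, which peaks at p = 0.5, i.e. near the decision boundary.

```python
import numpy as np
from sklearn.svm import LinearSVC

def bootstrap_clean(X, y, n_classifiers=10, sample_frac=0.1, rng=None):
    """Stage 1 (sketch): train weak SVMs on bootstrap samples and
    estimate each point's empirical misclassification probability.
    Assumes each bootstrap sample contains both classes."""
    rng = np.random.default_rng(rng)
    n = len(y)
    miscls = np.zeros(n)
    for _ in range(n_classifiers):
        idx = rng.choice(n, size=int(sample_frac * n), replace=True)
        clf = LinearSVC().fit(X[idx], y[idx])
        miscls += (clf.predict(X) != y)          # count ensemble errors
    p = miscls / n_classifiers                   # empirical misclassification prob.
    keep = p > 0                                 # drop points every weak SVM got right
    return keep, p

def entropy_select(p, keep, budget):
    """Stage 2 (sketch): among retained points, keep the `budget` samples
    with the highest binary entropy of p (most informative)."""
    eps = 1e-12                                  # avoid log(0)
    h = -(p * np.log(p + eps) + (1.0 - p) * np.log(1.0 - p + eps))
    cand = np.flatnonzero(keep)
    order = cand[np.argsort(h[cand])[::-1]]      # descending entropy
    return order[:budget]
```

A final (strong) SVM would then be trained only on the indices returned by `entropy_select`, which is where the reported training-time savings come from.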
Notes
d is a bound on the number of non-zero features per example in the dataset, and λ is the regularization parameter of the SVM.
The \(\widetilde {O}(\cdot)\) notation hides logarithmic factors.
References
Wang SZ, Li ZJ, Chao WH, Cao QH (2012) Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning. In: Proceedings of IJCNN
Cao YB, Xu J, Liu TY, Li H, Huang YL, Hon HW (2006) Adapting ranking SVM to document retrieval. In: Proceedings of SIGIR, pp 186–193
Hasan MA, Chaoji V, Salem S, Zaki M (2006) Link prediction using supervised learning. In: SIAM workshop on link analysis, counter-terrorism and security
Burges C (1999) Geometry and invariance in kernel based methods. In: Advances in kernel methods: support vector learning. MIT Press, Cambridge
Panda N, Chang EY, Wu G (2006) Concept boundary detection for speeding up SVMs. In: Proceedings of ICML, pp 681–688
Graf HP, Cosatto E, Bottou L, Durdanovic I, Vapnik V (2006) Parallel support vector machines: the cascade SVM. In: Advances in neural information processing system, vol 17. MIT Press, Cambridge, pp 521–528
Lawrence ND, Seeger M, Herbrich R (2003) Fast sparse Gaussian process methods: the informative vector machine. In: Advances in neural information processing systems. MIT Press, Cambridge
Yu H, Yang J, Han J (2003) Classifying large datasets using SVM with hierarchical clusters. In: Proceedings of KDD
Vapnik V (1998) Statistical learning theory. Wiley, New York
Platt JC (1999) Fast training of support vector machines using sequential minimal optimization. In: Advances in kernel methods—support vector learning. MIT Press, Cambridge, pp 185–208
Joachims T (1999) Making large-scale support vector machine learning practical. In: Advances in kernel methods—support vector learning. MIT Press, Cambridge, pp 169–184
Kao WC, Chung KM, Sun CL, Lin CJ (2004) Decomposition methods for linear support vector machines. Neural Comput 16(8):1689–1704
Tsang IW, Kwok JT, Cheung PM (2005) Core vector machines: fast SVM training on very large data sets. J Mach Learn Res 6:363–392
Lee YJ, Mangasarian OL (2001) RSVM: reduced support vector machines. In: Proceedings of SDM
Fine S, Scheinberg K (2001) Efficient SVM training using low-rank kernel representations. J Mach Learn Res 2:243–264
Shalev-Shwartz S, Srebro N (2008) SVM optimization: inverse dependence on training set size. In: Proceedings of ICML
Joachims T (2006) Training linear SVMs in linear time. In: Proceedings of KDD
Smola A, Vishwanathan S, Le Q (2008) Bundle methods for machine learning. In: Advances in neural information processing systems
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874
Shalev-Shwartz S, Singer Y, Srebro N (2007) Pegasos: primal estimated sub-gradient solver for SVM. In: Proceedings of ICML
Bartlett PL, Mendelson S (2002) Rademacher and Gaussian complexities: risk bounds and structural results. J Mach Learn Res 3:463–482
Guyon I, Matic N, Vapnik V (1994) Discovering informative patterns and data cleaning. In: Proceedings of AAAI workshop on knowledge discovery in databases
MacKay D (1992) Information-based objective functions for active data selection. Neural Comput 4(4):590–604
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27
Chang CC, Lin CJ (2001) IJCNN 2001 challenge: generalization ability and text decoding. In: Proceedings of IJCNN
Smits GF, Jordan EM (2002) Improved SVM regression using mixtures of kernels. In: Proceedings of IJCNN
Kumar A, Ghosh SK, Dadhwal VK (2006) Study of mixed kernel effect on classification accuracy using density estimation. In: Mid-term ISPRS symposium, ITC
Shi YH, Gao Y, Wang RL, Zhang Y, Wang D (2013) Transductive cost-sensitive lung cancer image classification. Appl Intell 38(1):16–28
Collobert R, Bengio S, Bengio Y (2002) A parallel mixture of SVMs for very large scale problems. Neural Comput 14:1105–1114
Wang CW, You WH (2013) Boosting-SVM: effective learning with reduced data dimension. Appl Intell 39(3):465–474
Idris A, Khan A, Lee YS (2013) Intelligent churn prediction in Telecom: employing mRMR feature selection and RotBoost based ensemble classification. Appl Intell 39(3):659–672
Maudes J, Diez JJR, Osorio CG, Pardo C (2011) Random projections for linear SVM ensembles. Appl Intell 34(3):347–359
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61170189, 61370126, 61202239), the Research Fund for the Doctoral Program of Higher Education (Grant No. 20111102130003), and the Fund of the State Key Laboratory of Software Development Environment (Grant No. SKLSDE-2013ZX-19).
About this article
Cite this article
Wang, S., Li, Z., Liu, C. et al. Training data reduction to speed up SVM training. Appl Intell 41, 405–420 (2014). https://doi.org/10.1007/s10489-014-0524-2