Abstract
When classifying very large-scale data sets, there are two major challenges: first, labeling a sufficient number of training samples is time-consuming and laborious; second, it is difficult to train a model that is both time-efficient and highly accurate. A high-accuracy model normally requires a large and representative training set, but a large training set also demands significantly more training time, so there is a trade-off between speed and accuracy in classification training, especially for large-scale data sets. To address this problem, a novel large-scale data classification strategy is proposed that combines K-means clustering with a multi-kernel support vector machine (SVM). First, K-means clustering is applied to a small portion of the original data set, using a special strategy to select representative training instances; this reduces both the size of the required training set and the subsequent manual labeling work. K-means has two well-known characteristics: (1) its result is strongly influenced by the cluster number k, and (2) its optimal result is difficult to achieve. The proposed strategy exploits both characteristics to find the most representative instances by choosing a relaxed cluster number k and running K-means repeatedly. In each K-means run, the instance nearest to each cluster center and the instance farthest from it are added to a selection set, so that the selected instances reflect the distribution of the whole data set while reducing the labeling effort. An outlier detection method is then applied to remove outlier instances according to their outlier scores.
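The nearest/farthest selection step can be sketched as follows. This is a minimal illustration of the idea only, not the paper's implementation; the function name `select_instances` and the parameters `k` and `n_rounds` (the relaxed cluster number and the number of repeated K-means runs) are illustrative assumptions.

```python
# Illustrative sketch: repeated K-means, keeping the nearest and
# farthest instance to each cluster center from every run.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def select_instances(X, k=5, n_rounds=3, seed=0):
    """Run K-means n_rounds times; collect, per cluster and per run,
    the point closest to the center and the point farthest from it."""
    rng = np.random.RandomState(seed)
    selected = set()
    for _ in range(n_rounds):
        km = KMeans(n_clusters=k, n_init=10,
                    random_state=rng.randint(1 << 30)).fit(X)
        # Distance of each point to its own cluster center.
        d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
        for c in range(k):
            idx = np.where(km.labels_ == c)[0]
            selected.add(int(idx[np.argmin(d[idx])]))  # most central instance
            selected.add(int(idx[np.argmax(d[idx])]))  # boundary instance
    return np.array(sorted(selected))

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
subset = select_instances(X)
print(len(subset), "of", len(X), "instances selected")
```

Because each run contributes at most 2k indices and runs overlap heavily on well-separated data, the selected subset stays far smaller than the original set, which is the point of the strategy.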
Finally, a multi-kernel SVM is trained on the selected instances, yielding a classifier that predicts labels for subsequent new instances. Evaluation results show that the proposed instance selection method significantly reduces both the size of the training set and the training time, while maintaining relatively good accuracy.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grants U1509207 and 61325019).
Ethics declarations
Conflict of interest
Tinglong Tang, Shengyong Chen, Meng Zhao, Wei Huang and Jake Luo declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Additional information
Communicated by V. Loia.
Cite this article
Tang, T., Chen, S., Zhao, M. et al. Very large-scale data classification based on K-means clustering and multi-kernel SVM. Soft Comput 23, 3793–3801 (2019). https://doi.org/10.1007/s00500-018-3041-0