
Very large-scale data classification based on K-means clustering and multi-kernel SVM

Methodologies and Application. Published in Soft Computing.

Abstract

Classifying very large-scale data sets poses two major challenges: first, labeling a sufficient number of training samples is time-consuming and laborious; second, training a model that is both time-efficient and accurate is difficult. A high-accuracy model normally requires a large, representative training set, but a large training set also demands significantly more training time, so there is a trade-off between speed and accuracy, especially for large-scale data sets. To address this problem, a novel large-scale classification strategy is proposed that combines K-means clustering with a multi-kernel support vector machine (SVM). First, K-means clustering is applied to a small portion of the original data set, using a special strategy to select representative training instances; this reduces both the size of the required training set and the subsequent manual labeling work. K-means has two well-known characteristics: (1) the result is strongly influenced by the cluster number k, and (2) the optimal result is difficult to achieve. The proposed strategy exploits both characteristics to find the most representative instances by defining a relaxed cluster number k and running K-means repeatedly. In each clustering run, both the nearest and the farthest instance to each cluster center are added to a selection set, so that the selected instances preserve a representative distribution of the original data set while reducing the labeling effort. An outlier detection method is then applied to remove outlier instances according to their outlier scores.
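The repeated-clustering selection step can be sketched as follows. This is a minimal illustration using numpy only, not the authors' implementation: the function name, the fixed list of relaxed k values, and the toy data are all assumptions; for each cluster, the member nearest to and the member farthest from its center are kept, as the abstract describes.

```python
import numpy as np

def select_representatives(X, k_values, n_iter=20, seed=0):
    """Run plain K-means once per k in k_values; for each resulting
    cluster, keep the instance nearest to and farthest from its center."""
    rng = np.random.default_rng(seed)
    selected = set()
    for k in k_values:
        # Lloyd's K-means with random initial centers
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            for j in range(k):
                members = np.where(labels == j)[0]
                if len(members):
                    centers[j] = X[members].mean(axis=0)
        # keep nearest and farthest member of each non-empty cluster
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            dists = np.linalg.norm(X[members] - centers[j], axis=1)
            selected.add(members[dists.argmin()])  # nearest to center
            selected.add(members[dists.argmax()])  # farthest from center
    return sorted(selected)

# toy data: two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
idx = select_representatives(X, k_values=[2, 3, 4])
print(len(idx), "representatives selected out of", len(X))
```

Because the nearest and farthest members of the same cluster may coincide across runs, the selected set is at most 2·(2+3+4) = 18 instances here, a small fraction of the original 100.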
Finally, a multi-kernel SVM is trained on the selected instances, yielding a classifier model for predicting subsequent new instances. The evaluation results show that the proposed instance selection method significantly reduces both the size of the training set and the training time while maintaining relatively good accuracy.
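In multi-kernel learning of the SimpleMKL style cited by the paper, the combined kernel is a convex combination K = Σ_m w_m K_m with non-negative weights summing to one; the weights are normally learned jointly with the SVM. The sketch below fixes the weights by hand purely for illustration (the RBF bandwidths and equal weights are assumptions, not values from the paper); the resulting matrix could then be passed to any SVM solver that accepts a precomputed kernel.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # pairwise squared Euclidean distances -> Gaussian kernel matrix
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def combined_kernel(X, gammas, weights):
    """Convex combination of base RBF kernels: K = sum_m w_m K_m,
    with weights normalized so that w_m >= 0 and sum w_m = 1."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * rbf_kernel(X, g) for w, g in zip(weights, gammas))

X = np.random.default_rng(0).normal(size=(20, 3))
K = combined_kernel(X, gammas=[0.1, 1.0, 10.0], weights=[1, 1, 1])
print(K.shape)
```

Since every base RBF kernel has a unit diagonal and the weights sum to one, the combined kernel is symmetric with K[i,i] = 1, as required of a valid kernel matrix.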



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grants U1509207 and 61325019).

Author information


Corresponding author

Correspondence to Shengyong Chen.

Ethics declarations

Conflict of interest

Tinglong Tang, Shengyong Chen, Meng Zhao, Wei Huang and Jake Luo declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Communicated by V. Loia.


Cite this article

Tang, T., Chen, S., Zhao, M. et al. Very large-scale data classification based on K-means clustering and multi-kernel SVM. Soft Comput 23, 3793–3801 (2019). https://doi.org/10.1007/s00500-018-3041-0
