Abstract
This paper proposes a novel method, ensemble support vector machine with segmentation (SeEn–SVM), for classifying imbalanced datasets. A vector quantization algorithm segments the majority class, generating several smaller datasets that are less imbalanced than the original one, and two different weighting functions are proposed to integrate the results of the base classifiers. The goal of SeEn–SVM is to improve prediction accuracy on the minority class, which is usually the class of greater practical interest. SeEn–SVM is applied to six UCI datasets, and the results confirm that it performs better than previously proposed methods for the imbalance problem.
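The segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which vector quantizer is used, so a Lloyd-style (k-means-like) quantizer is assumed here. The function names `quantize`, `build_subsets` and `combine`, the toy data, and the plain weighted vote are all hypothetical. The paper's two weighting functions are not given in the abstract and are not reproduced.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(points, k, iters=20, seed=0):
    """Lloyd-style vector quantization (an assumed choice of quantizer).

    Returns the non-empty segments (lists of points) of the final assignment.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial codewords drawn from the data
    segments = []
    for _ in range(iters):
        segments = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            segments[nearest].append(p)
        # Move each codeword to the mean of its segment; keep it if the segment is empty.
        centers = [
            tuple(sum(c) / len(seg) for c in zip(*seg)) if seg else centers[j]
            for j, seg in enumerate(segments)
        ]
    return [seg for seg in segments if seg]


def build_subsets(majority, minority, k, seed=0):
    """Pair every majority segment with the full minority class.

    Labels: -1 for majority points, +1 for minority points.
    """
    subsets = []
    for seg in quantize(majority, k, seed=seed):
        X = seg + minority
        y = [-1] * len(seg) + [1] * len(minority)
        subsets.append((X, y))
    return subsets


def combine(decision_values, weights):
    """Hypothetical weighted vote over base-classifier decision values."""
    score = sum(w * d for w, d in zip(weights, decision_values))
    return 1 if score >= 0 else -1


# Toy data (made up for illustration): 300 majority points, 30 minority points.
rng = random.Random(1)
majority = [(rng.gauss(0, 2), rng.gauss(0, 2)) for _ in range(300)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(30)]

subsets = build_subsets(majority, minority, k=5)
for X, y in subsets:
    n_major = y.count(-1)
    print(f"segment size {n_major}, imbalance ratio {n_major / len(minority):.1f}:1")
```

In a full pipeline, one base SVM would be trained on each `(X, y)` pair and its decision value for a test point passed to `combine`. The weights are left as an input here because the abstract only states that two weighting functions exist, not how they are defined.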
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 10971223, No. 11071252) and Chinese Universities Scientific Fund (2011JS039, 2012YJ130). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.
Cite this article
Li, Q., Yang, B., Li, Y. et al. Constructing support vector machine ensemble with segmentation for imbalanced datasets. Neural Comput & Applic 22 (Suppl 1), 249–256 (2013). https://doi.org/10.1007/s00521-012-1041-z
DOI: https://doi.org/10.1007/s00521-012-1041-z