Abstract
Data imbalance is known to significantly hinder the generalization performance of supervised learning algorithms. A common strategy to overcome this challenge is synthetic oversampling, in which synthetic minority class examples are generated to balance the distribution of examples between the majority and minority classes. We present a novel adaptive oversampling algorithm, Virtual, that combines the benefits of oversampling and active learning. Unlike traditional resampling methods, which require preprocessing of the data, Virtual generates synthetic examples for the minority class during the training process and therefore removes the need for an extra preprocessing stage. In the context of learning with Support Vector Machines, we demonstrate that Virtual outperforms competitive oversampling techniques in terms of both generalization performance and computational complexity.
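To make the idea of synthetic oversampling concrete, the following is a minimal sketch of the traditional, preprocessing-style approach that the abstract contrasts with Virtual. It is not the Virtual algorithm. It assumes a simple interpolation scheme in which each synthetic example is placed on the line segment between a minority example and one of its nearest minority neighbors. The function names, the parameter `k`, and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda q: math.dist(point, q))[:k]


def oversample_minority(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly chosen
    minority point and one of its k nearest minority neighbors.

    Illustrative assumption: this is a generic interpolation scheme applied as
    a preprocessing step, not the method proposed in the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbor = rng.choice(nearest_neighbors(base, others, min(k, len(others))))
        t = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy data (hypothetical): 20 majority points and 4 minority points in 2-D.
majority = [(float(x), float(y)) for x in range(5) for y in range(4)]
minority = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]

# Generate enough synthetic minority points to match the majority class size.
synthetic = oversample_minority(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + synthetic
```

In this sketch the balancing happens once, before any classifier is trained, which is the extra preprocessing stage that the abstract says Virtual avoids by generating synthetic examples during training.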
Copyright information
© 2013 Springer International Publishing Switzerland
About this paper
Cite this paper
Ertekin, Ş. (2013). Adaptive Oversampling for Imbalanced Data Classification. In: Gelenbe, E., Lent, R. (eds) Information Sciences and Systems 2013. Lecture Notes in Electrical Engineering, vol 264. Springer, Cham. https://doi.org/10.1007/978-3-319-01604-7_26
DOI: https://doi.org/10.1007/978-3-319-01604-7_26
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01603-0
Online ISBN: 978-3-319-01604-7
eBook Packages: Computer Science, Computer Science (R0)