Abstract
In this paper, we propose an approach to improve the results obtained by classification algorithms applied to imbalanced datasets. The method, called Incremental Synthetic Balancing Algorithm (ISBA), performs an iterative procedure based on large margin classifiers to generate synthetic samples that reduce the level of imbalance. In the process, the support vectors serve as references for generating new instances, positioning them in regions of greater representativeness. Furthermore, the new samples may exceed the limits of the instances used to generate them, which enables extrapolation of the boundaries of the minority class and more significant recognition of this class of interest. We present comparative experiments with other techniques, among them SMOTE, which provide strong evidence of the applicability of the proposed approach.
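The core idea described above can be illustrated with a minimal sketch: train a large margin classifier, take the minority-class support vectors as references, and generate synthetic points along the line toward a minority neighbour, allowing the interpolation factor to exceed 1 so that new samples extrapolate beyond the original class boundary. This is an illustrative approximation, not the authors' exact ISBA procedure; the function name, parameters, and the choice of a linear SVM are assumptions for the example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def oversample_once(X, y, minority_label, n_new=10, max_gap=1.5, seed=0):
    """Generate n_new synthetic minority samples guided by support vectors.

    Hypothetical sketch of support-vector-guided oversampling: each new
    point lies on the segment from a minority support vector toward a
    minority neighbour; a gap factor above 1 extrapolates past the
    neighbour, pushing samples beyond the original minority boundary.
    """
    rng = np.random.default_rng(seed)
    # Fit a large margin classifier and keep the minority support vectors.
    clf = SVC(kernel="linear").fit(X, y)
    sv_minority = clf.support_vectors_[y[clf.support_] == minority_label]
    X_min = X[y == minority_label]
    # Neighbour search restricted to the minority class.
    nn = NearestNeighbors(n_neighbors=2).fit(X_min)
    new_points = []
    for _ in range(n_new):
        ref = sv_minority[rng.integers(len(sv_minority))]
        # Index 1 skips the reference point itself (distance zero).
        _, idx = nn.kneighbors(ref.reshape(1, -1))
        neigh = X_min[idx[0, 1]]
        gap = rng.uniform(0.0, max_gap)  # gap > 1 extrapolates
        new_points.append(ref + gap * (neigh - ref))
    return np.vstack(new_points)
```

In an iterative scheme such as the one the paper describes, a step like this would be repeated, with the classifier retrained on the augmented data each round, until the class distribution reaches the desired balance.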
Ladeira Marques, M., Moraes Villela, S. & Hasenclever Borges, C.C. Large margin classifiers to generate synthetic data for imbalanced datasets. Appl Intell 50, 3678–3694 (2020). https://doi.org/10.1007/s10489-020-01719-y