Abstract
In this paper, we propose an approach to improve the results obtained by classification algorithms applied to imbalanced datasets. The method, called Incremental Synthetic Balancing Algorithm (ISBA), performs an iterative procedure based on large margin classifiers to generate synthetic samples that reduce the level of imbalance. In the process, the support vectors serve as references for generating new instances, positioning them in regions of greater representativeness. Furthermore, the new samples may exceed the limits of the instances used to generate them, which enables extrapolation of the boundaries of the minority class and more significant recognition of this class of interest. We present comparative experiments with other techniques, among them SMOTE, which provide strong evidence of the applicability of the proposed approach.
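The core idea described above can be illustrated with a minimal sketch: train a large margin classifier, take the minority-class support vectors as references, and generate synthetic points along the line toward a minority neighbour, allowing the interpolation factor to exceed 1 so that new samples extrapolate beyond the original class boundary. This is an illustrative approximation, not the authors' exact ISBA procedure; the function name, parameters, and the choice of a linear SVM are assumptions for the example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def oversample_once(X, y, minority_label, n_new=10, max_gap=1.5, seed=0):
    """Generate n_new synthetic minority samples guided by support vectors.

    Hypothetical sketch of support-vector-guided oversampling: each new
    point lies on the segment from a minority support vector toward a
    minority neighbour; a gap factor above 1 extrapolates past the
    neighbour, pushing samples beyond the original minority boundary.
    """
    rng = np.random.default_rng(seed)
    # Fit a large margin classifier and keep the minority support vectors.
    clf = SVC(kernel="linear").fit(X, y)
    sv_minority = clf.support_vectors_[y[clf.support_] == minority_label]
    X_min = X[y == minority_label]
    # Neighbour search restricted to the minority class.
    nn = NearestNeighbors(n_neighbors=2).fit(X_min)
    new_points = []
    for _ in range(n_new):
        ref = sv_minority[rng.integers(len(sv_minority))]
        # Index 1 skips the reference point itself (distance zero).
        _, idx = nn.kneighbors(ref.reshape(1, -1))
        neigh = X_min[idx[0, 1]]
        gap = rng.uniform(0.0, max_gap)  # gap > 1 extrapolates
        new_points.append(ref + gap * (neigh - ref))
    return np.vstack(new_points)
```

In an iterative scheme such as the one the paper describes, a step like this would be repeated, with the classifier retrained on the augmented data each round, until the class distribution reaches the desired balance.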
Ladeira Marques, M., Moraes Villela, S. & Hasenclever Borges, C.C. Large margin classifiers to generate synthetic data for imbalanced datasets. Appl Intell 50, 3678–3694 (2020). https://doi.org/10.1007/s10489-020-01719-y