Large margin classifiers to generate synthetic data for imbalanced datasets

Published in: Applied Intelligence

Abstract

In this paper we propose an approach for improving the results obtained by classification algorithms applied to imbalanced datasets. The method, called Incremental Synthetic Balancing Algorithm (ISBA), performs an iterative procedure based on large margin classifiers, generating synthetic samples to reduce the level of imbalance. In each iteration, the support vectors serve as reference points for the generation of new instances, so that the synthetic samples are placed in the most representative regions of the minority class. Furthermore, the new samples may be placed beyond the points used to generate them, which extrapolates the boundary of the minority class and yields more significant recognition of this class of interest. Comparative experiments against other techniques, among them SMOTE, provide strong evidence of the applicability of the proposed approach.
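The core idea described above can be illustrated with a short, hedged sketch. The code below is not the authors' ISBA implementation; it only mimics the two ingredients named in the abstract: SMOTE-style interpolation between minority points, plus a gap factor that may exceed 1 so a synthetic sample can land beyond its generating pair (the extrapolation of the minority-class boundary). In ISBA the seed points would be the support vectors of a large margin classifier; here, as a simplification, every minority point may act as a seed.

```python
import random

def smote_like_sample(x, neighbor, rnd, extrapolate=0.0):
    # Place a synthetic point along the line from x toward neighbor.
    # gap in [0, 1] interpolates (classic SMOTE); allowing gap > 1
    # pushes the sample past the neighbor, loosely mirroring ISBA's
    # extrapolation beyond the minority-class boundary.
    gap = rnd.uniform(0.0, 1.0 + extrapolate)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

def oversample(minority, n_new, seed=0, extrapolate=0.0):
    # Generate n_new synthetic minority samples. ISBA would restrict the
    # seed points to support vectors; this sketch pairs random minority
    # points instead.
    rnd = random.Random(seed)
    out = []
    for _ in range(n_new):
        i, j = rnd.sample(range(len(minority)), 2)
        out.append(smote_like_sample(minority[i], minority[j], rnd, extrapolate))
    return out
```

With `extrapolate=0.0` every synthetic point is a convex combination of two minority points, as in SMOTE; a positive value lets samples exceed the limits of the points used for their generation.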

References

  1. Marsland S (2015) Machine learning: an algorithmic perspective. CRC press, Boca Raton

  2. Murthy S K (1998) Automatic construction of decision trees from data: a multi-disciplinary survey. Data Min Knowl Discov 2(4):345–389

  3. Rosenblatt F (1962) Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan

  4. Rumelhart D E, Hinton G E, Williams R J (1985) Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science

  5. Howlett RJ, Jain LC (2013) Radial basis function networks 2: new advances in design, vol 67. Physica

  6. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

  7. Chawla N V, Japkowicz N, Kotcz A (2004) Special issue on learning from imbalanced data sets. ACM Sigkdd Explor Newslett 6(1):1–6

  8. Ganganwar V (2012) An overview of classification algorithms for imbalanced datasets. Int J Emerg Technol Adv Eng 2(4):42–47

  9. Liu A, Ghosh J, Martin CE (2007) Generative oversampling for mining imbalanced datasets. In: DMIN

  10. Chan P K, Stolfo S J (1998) Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: KDD

  11. Phua C, Alahakoon D, Lee V (2004) Minority report in fraud detection: classification of skewed data. Acm sigkdd Explor Newslett 6(1):50–59

  12. Kubat M, Holte RC, Matwin S (1998) Machine learning for the detection of oil spills in satellite radar images. ML

  13. Sun Y, Kamel M S, Wong AKC, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. PR

  14. Mazurowski M A, Habas P A, Zurada J M, Lo J Y, Baker J A, Tourassi G D (2008) Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Netw 21(2):427–436

  15. Everson R M, Fieldsend J E (2006) Multi-objective optimisation for receiver operating characteristic analysis. Multi-Object Mach Learn 16:533–556

  16. Raskutti B, Kowalczyk A (2004) Extreme re-balancing for svms: a case study. ACM Sigkdd Explor Newslett 6(1):60–69

  17. Manevitz L, Yousef M (2007) One-class document classification via neural networks. Neurocomputing 70 (7):1466–1481

  18. Tian J, Gu H, Liu W (2011) Imbalanced classification using support vector machine ensemble. NCA

  19. Núñez Castro H, González Abril L, Angulo Bahón C (2011) A post-processing strategy for svm learning from unbalanced data. In: 19th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp 195–200

  20. Tao X, Ji H, Xie Y (2007) A modified psvm and its application to unbalanced data classification. In: ICNC

  21. Yen S, Lee Y (2009) Cluster-based under-sampling approaches for imbalanced data distributions. ESA

  22. Han H, Wang W, Mao B (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: ICIC

  23. Torres LCB, Castro CL, Coelho F, Sill Torres F, Braga AP (2015) Distance-based large margin classifier suitable for integrated circuit implementation. Electron Lett 51(24):1967–1969

  24. Vitor de Campos Souza P (2018) Pruning fuzzy neural networks based on unineuron for problems of classification of patterns. J Intell Fuzzy Syst 35(2):2597–2605

  25. Zhang X, Fu Y, Zang A, Sigal L, Agam G (2015) Learning classifiers from synthetic data using a multichannel autoencoder. arXiv

  26. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

  27. He H, Bai Y, Garcia EA, Li S (2008) Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp 1322–1328

  28. Barua S, Islam M M, Yao X, Murase K (2014) Mwmote–majority weighted minority oversampling technique for imbalanced data set learning. TKDE

  29. Koto F (2014) Smote-out, smote-cosine and selected-smote: an enhancement strategy to handle imbalance in data level. In: ICACSIS

  30. Schapire R E, Freund Y, Bartlett P, Lee W S (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Ann Statist 26(5):1651–1686

  31. Feng W, Huang W, Ren J (2018) Class imbalance ensemble learning based on the margin theory. Appl Sci 8:815–842

  32. Wang Q, Luo Z, Huang J, Feng Y, Liu Z (2017) A novel ensemble method for imbalanced data learning: Bagging of extrapolation-smote svm. Computational Intelligence and Neuroscience

  33. Jingnian C, Shunxiang H, Li X (2018) Speeding up algorithm for support vector machine based on alien neighbor. Comput Eng 44:19–24

  34. Xie W, Liang G, Dong Z, Tan B, Zhang B (2019) An improved oversampling algorithm based on the samples’ selection strategy for classifying imbalanced data. Mathematical Problems in Engineering

  35. Attenberg J, Ertekin S (2013) Class imbalance and active learning. In: Imbalanced learning: foundations, algorithms, and applications. Wiley, pp 101–150

  36. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory. ACM, pp 144–152

  37. Villela S M, Leite S C, Fonseca Neto R (2016) Incremental p-margin algorithm for classification with arbitrary norm. Pattern Recogn 55:216–272

  38. Fernández A, Garcia S, Herrera F, Chawla N (2018) SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905

  39. Bache K, Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml

  40. Golub T R, Slonim D K, Tamayo P, Huard C, Gaasenbeek M, Mesirov J P, Coller H, Loh M L, Downing J R, Caligiuri M A, Bloomfield C D, Lander E S (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537

  41. Alon U, Barkai N, Notterman D A, Gish K, Ybarra S, Mack D, Levine A J (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A 96(12):6745–6750

  42. Alcalá-Fdez J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult-Valued Logic Soft Comput

  43. Kovács G (2019) An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets. Appl Soft Comput 83:105662

Author information

Correspondence to Carlos Cristiano Hasenclever Borges.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ladeira Marques, M., Moraes Villela, S. & Hasenclever Borges, C.C. Large margin classifiers to generate synthetic data for imbalanced datasets. Appl Intell 50, 3678–3694 (2020). https://doi.org/10.1007/s10489-020-01719-y
