An Adaptive Oversampling Technique for Imbalanced Datasets

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10933)

Abstract

Class imbalance is one of the challenging problems in the classification domain of data mining, largely because classifiers fail to classify minority examples correctly when the data are imbalanced. Classifier performance deteriorates further when within-class imbalance is present in addition to between-class imbalance. Although between-class imbalance has been well addressed in the literature, within-class imbalance has received far less attention. In this paper, we propose a method that adaptively handles both between-class and within-class imbalance simultaneously, while also taking into account the spread of the data in the feature space. We validate our approach on 12 publicly available datasets and compare its classification performance with that of other existing oversampling techniques. The experimental results demonstrate that the proposed method is statistically superior to the other methods in terms of various accuracy measures.
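The paper's adaptive method itself is behind the paywall, so as context for the abstract, here is a minimal sketch of the generic synthetic-oversampling idea (in the spirit of SMOTE) that the proposed technique builds on: new minority examples are created by interpolating between a minority point and one of its nearest minority-class neighbours. The function name and parameters are illustrative, not the authors' implementation.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    randomly chosen minority points and one of their k nearest minority
    neighbours (the classic SMOTE idea, NOT the paper's adaptive method)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbour
    k = min(k, n - 1)
    neighbours = np.argsort(d, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                    # seed points
    nb = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour each
    gap = rng.random((n_new, 1))                             # factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# usage: add 10 synthetic points to a toy 2-D minority class
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote_like_oversample(X_min, n_new=10, k=3, rng=0)
print(X_syn.shape)  # (10, 2)
```

Because each synthetic point is a convex combination of two existing minority points, all generated samples lie inside the minority class's convex hull; the paper's contribution is deciding *where* and *how many* such samples to generate, adaptively, for both between-class and within-class imbalance.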

Author information

Corresponding author

Correspondence to Usha Ananthakumar.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Shahee, S.A., Ananthakumar, U. (2018). An Adaptive Oversampling Technique for Imbalanced Datasets. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2018. Lecture Notes in Computer Science, vol. 10933. Springer, Cham. https://doi.org/10.1007/978-3-319-95786-9_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-95786-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-95785-2

  • Online ISBN: 978-3-319-95786-9

  • eBook Packages: Computer Science (R0)
