
Experimental Comparison of Oversampling Methods for Mixed Datasets

  • Conference paper
  • In: Pattern Recognition (MCPR 2021)

Abstract

In supervised classification, the class imbalance problem causes a bias that results in poor classification performance for the minority class. To address this problem, particularly in supervised classification with mixed data, several oversampling methods for mixed data have been reported in the literature. However, there is no experimental study comparing and evaluating these methods in a common setting. Therefore, in this paper, we present an experimental comparison of state-of-the-art oversampling methods designed specifically for mixed datasets. Our study identifies the best oversampling methods for mixed data in terms of oversampling quality and runtime, taking the imbalance ratio into account.
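The paper compares existing oversampling methods for mixed data rather than proposing a new one. As a rough illustration of the setting only (not the authors' experimental code), the sketch below oversamples a toy mixed numerical/categorical dataset with SMOTE-NC from the imbalanced-learn library and computes the imbalance ratio commonly used to characterize such datasets; the toy data and column layout are assumptions made for the example.

```python
# Illustrative sketch only: oversampling a mixed numerical/categorical dataset
# with SMOTE-NC (imbalanced-learn) and computing the imbalance ratio.
# The dataset below is synthetic and chosen purely for demonstration.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTENC

# Toy mixed dataset: columns 0-1 are numerical, column 2 is categorical
# (integer-encoded categories).
rng = np.random.default_rng(0)
X_num = rng.normal(size=(100, 2))
X_cat = rng.integers(0, 3, size=(100, 1))
X = np.hstack([X_num, X_cat])
y = np.array([0] * 90 + [1] * 10)   # 9:1 class imbalance

counts = Counter(y)
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"Imbalance ratio before oversampling: {imbalance_ratio:.1f}")

# SMOTE-NC interpolates numerical features and assigns categorical values
# by majority vote among the nearest minority neighbors.
sampler = SMOTENC(categorical_features=[2], random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print("Class counts after oversampling:", Counter(y_res))
```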



Acknowledgments

The corresponding author thanks the National Council of Science and Technology of Mexico for partially supporting this work through a scholarship grant.

Author information

Correspondence to Fredy Rodríguez-Torres.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Rodríguez-Torres, F., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F. (2021). Experimental Comparison of Oversampling Methods for Mixed Datasets. In: Roman-Rangel, E., Kuri-Morales, Á.F., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Olvera-López, J.A. (eds.) Pattern Recognition. MCPR 2021. Lecture Notes in Computer Science, vol. 12725. Springer, Cham. https://doi.org/10.1007/978-3-030-77004-4_8


  • DOI: https://doi.org/10.1007/978-3-030-77004-4_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77003-7

  • Online ISBN: 978-3-030-77004-4

  • eBook Packages: Computer Science, Computer Science (R0)
