Experimental Comparison of Oversampling Methods for Mixed Datasets

Rodríguez-Torres, Fredy; Carrasco-Ochoa, J. A.; Martínez-Trinidad, José Fco.

doi:10.1007/978-3-030-77004-4_8

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12725)

Included in the following conference series:

Mexican Conference on Pattern Recognition

Abstract

In supervised classification, class imbalance causes a bias that results in poor classification of the minority class. To address this problem in supervised classification with mixed data, several oversampling methods for mixed data have been reported in the literature. However, no experimental study has compared and evaluated these methods in a common setting. In this paper, we therefore present an experimental comparison of state-of-the-art oversampling methods designed specifically for mixed datasets. Our study reports which oversampling methods for mixed data perform best in terms of oversampling quality, taking the imbalance ratio into account, and in terms of runtime.
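As a small illustration of the imbalance ratio mentioned in the abstract, the sketch below computes it for a list of class labels. It assumes the common definition (majority-class count divided by minority-class count); the abstract does not state which definition the paper uses. The function names `imbalance_ratio` and `samples_needed_to_balance` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute an imbalance ratio")
    return max(counts.values()) / min(counts.values())


def samples_needed_to_balance(labels):
    """Number of synthetic objects each smaller class would need to match the majority."""
    counts = Counter(labels)
    majority = max(counts.values())
    return {cls: majority - n for cls, n in counts.items() if n < majority}


# Toy label vector: 90 majority objects and 10 minority objects.
labels = ["neg"] * 90 + ["pos"] * 10
print(imbalance_ratio(labels))            # 9.0
print(samples_needed_to_balance(labels))  # {'pos': 80}
```

An oversampling method would then generate the missing minority objects. How it synthesizes numerical and categorical feature values is what distinguishes the methods compared in the paper, and is not shown here.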

Acknowledgments

The corresponding author thanks the National Council of Science and Technology of Mexico for partly supporting this work through a scholarship grant.

Author information

Authors and Affiliations

Instituto Nacional de Astrofísica, Óptica y Electrónica, 08544, San Andres Cholula, Puebla, Mexico
Fredy Rodríguez-Torres, J. A. Carrasco-Ochoa & José Fco. Martínez-Trinidad

Corresponding author

Correspondence to Fredy Rodríguez-Torres.

Editor information

Editors and Affiliations

Instituto Tecnológico Autónomo de México, Mexico City, Mexico
Edgar Roman-Rangel
Instituto Tecnológico Autónomo de México, Mexico City, Mexico
Ángel Fernando Kuri-Morales
Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
José Francisco Martínez-Trinidad
Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
Jesús Ariel Carrasco-Ochoa
Autonomous University of Puebla, Puebla, Mexico
José Arturo Olvera-López

Cite this paper

Rodríguez-Torres, F., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F. (2021). Experimental Comparison of Oversampling Methods for Mixed Datasets. In: Roman-Rangel, E., Kuri-Morales, Á.F., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Olvera-López, J.A. (eds) Pattern Recognition. MCPR 2021. Lecture Notes in Computer Science, vol 12725. Springer, Cham. https://doi.org/10.1007/978-3-030-77004-4_8

DOI: https://doi.org/10.1007/978-3-030-77004-4_8
Published: 16 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77003-7
Online ISBN: 978-3-030-77004-4
eBook Packages: Computer Science, Computer Science (R0)