Abstract
In supervised classification, class imbalance biases the classifier toward the majority class, which results in poor classification of the minority class. To address this problem in supervised classification with mixed data, several oversampling methods for mixed data have been reported in the literature. However, no experimental study has compared and evaluated these methods in a common setting. Therefore, in this paper, we present an experimental comparison of state-of-the-art oversampling methods designed specifically for mixed datasets. Our study reports the best oversampling methods for mixed data in terms of both oversampling quality, taking the imbalance ratio into account, and runtime.
Acknowledgments
The corresponding author thanks the National Council of Science and Technology of Mexico for partially supporting this work through a scholarship grant.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Rodríguez-Torres, F., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F. (2021). Experimental Comparison of Oversampling Methods for Mixed Datasets. In: Roman-Rangel, E., Kuri-Morales, Á.F., Martínez-Trinidad, J.F., Carrasco-Ochoa, J.A., Olvera-López, J.A. (eds) Pattern Recognition. MCPR 2021. Lecture Notes in Computer Science, vol 12725. Springer, Cham. https://doi.org/10.1007/978-3-030-77004-4_8
DOI: https://doi.org/10.1007/978-3-030-77004-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77003-7
Online ISBN: 978-3-030-77004-4
eBook Packages: Computer Science, Computer Science (R0)