Abstract
Class imbalance is a common problem in (binary) classification. It arises in many application domains, such as text classification, fraud detection, churn prediction, and medical diagnosis. A widely used data-level approach to this problem is the Synthetic Minority Oversampling Technique (SMOTE), which uses the K-Nearest Neighbors (KNN) algorithm to generate new, artificial instances in the minority class. SMOTE is, however, known to be poorly suited to high-dimensional data. We therefore propose an alternative oversampling strategy for imbalanced classification problems in high dimensions. Our approach is based on the sparse inverse covariance matrix estimated through the Ledoit-Wolf method for high-dimensional data. The results show that our proposal performs competitively against popular alternatives.
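As a rough illustration of the two ingredients named in the abstract, the sketch below shows SMOTE-style interpolation between a minority point and one of its nearest minority neighbours, and a Ledoit-Wolf shrinkage estimate of the covariance matrix in the form given by Ledoit and Wolf (2004). It assumes NumPy is available; the function names `smote_sketch` and `ledoit_wolf_sketch` and the toy data are hypothetical. It does not reproduce the chapter's oversampling procedure, which the abstract does not spell out.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """SMOTE-style interpolation; assumes at least two minority samples."""
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)  # randomly chosen seed points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # random position on the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


def ledoit_wolf_sketch(X):
    """Shrink the sample covariance towards a scaled identity matrix."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n  # maximum-likelihood sample covariance
    mu = np.trace(S) / p  # scale of the identity target
    d2 = np.sum((S - mu * np.eye(p)) ** 2)
    # Estimated squared error of S; the 1/p factors cancel in the ratio.
    b2_bar = sum(np.sum((np.outer(x, x) - S) ** 2) for x in Xc) / n**2
    shrinkage = 0.0 if d2 == 0 else min(b2_bar, d2) / d2
    sigma = shrinkage * mu * np.eye(p) + (1.0 - shrinkage) * S
    return sigma, shrinkage


# Toy minority class with more features than samples (p > n).
rng = np.random.default_rng(42)
X_min = rng.normal(size=(20, 50))

X_new = smote_sketch(X_min, n_new=30)
sigma, alpha = ledoit_wolf_sketch(X_min)
# The inverse exists here because the shrunk estimate is positive definite.
precision = np.linalg.inv(sigma)
```

With 20 samples and 50 features the plain sample covariance is singular, so it has no inverse; shrinking towards the identity is what makes the precision matrix computable in this setting.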
Acknowledgments
We would like to thank VLIR (Vlaamse Interuniversitaire Raad, Flemish Interuniversity Council, Belgium) for supporting this work under the Cuban ICT NETWORK programme “Strengthening the ICT role in Cuban Universities for the development of the society”, specifically Project 1, “Strengthening the research on ICT and its knowledge transference to the Cuban society (RESICT)”, and also the Cuban national project “Plataforma para el análisis de grandes volúmenes de datos y su aplicación a sectores estratégicos”.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Leguen-de-Varona, I., Madera, J., Gonzalez, H., Tubex, L., Verdonck, T. (2024). Oversampling Method Based Covariance Matrix Estimation in High-Dimensional Imbalanced Classification. In: Hernández Heredia, Y., Milián Núñez, V., Ruiz Shulcloper, J. (eds) Progress in Artificial Intelligence and Pattern Recognition. IWAIPR 2023. Lecture Notes in Computer Science, vol 14335. Springer, Cham. https://doi.org/10.1007/978-3-031-49552-6_2
DOI: https://doi.org/10.1007/978-3-031-49552-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-49551-9
Online ISBN: 978-3-031-49552-6
eBook Packages: Computer Science, Computer Science (R0)