Abstract
Class imbalance is common in real-world data and remains problematic because biased supervision degrades classification performance. Undersampling is an effective approach to class imbalance, but conventional undersampling-based methods use a single fixed sampling ratio, even though different sampling ratios favor different classes. In this paper, an undersampling-based ensemble framework, MUEnsemble, is proposed. MUEnsemble combines weak classifiers trained at different sampling ratios and allows a flexible design for weighting the weak classifiers across those ratios. To demonstrate this design principle, three quadratic weighting functions and a Gaussian weighting function are presented. To reduce the effort users spend on parameter setting, a grid search-based parameter estimation automates the tuning. An experimental evaluation shows that MUEnsemble outperforms state-of-the-art undersampling-based and oversampling-based methods, and that the Gaussian weighting function is superior to the quadratic weighting functions. In addition, the parameter estimation finds near-optimal parameters, and MUEnsemble with the estimated parameters still outperforms the state-of-the-art methods.
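The abstract describes training weak classifiers at several undersampling ratios and combining them with ratio-dependent weights. The following is a minimal sketch of that idea, not the paper's actual algorithm: the specific ratios, the Gaussian parameters `mu` and `sigma`, the choice of decision trees as weak classifiers, and the weighted soft vote are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Imbalanced toy data: roughly 5% positive (minority) class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           n_informative=4, random_state=0)

def undersample(X, y, ratio, rng):
    """Keep all minority samples plus a random majority subset so that
    n_majority is about ratio * n_minority (larger ratio keeps more majority)."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    k = min(len(maj_idx), int(ratio * len(min_idx)))
    keep = np.concatenate([min_idx, rng.choice(maj_idx, size=k, replace=False)])
    return X[keep], y[keep]

# One weak classifier per sampling ratio (illustrative ratio grid).
ratios = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# Gaussian weighting over the ratios; mu and sigma are placeholders
# that a grid search could tune, as the paper's parameter estimation does.
mu, sigma = 4.0, 4.0
weights = np.exp(-((ratios - mu) ** 2) / (2 * sigma ** 2))
weights /= weights.sum()

clfs = []
for r in ratios:
    Xs, ys = undersample(X, y, r, rng)
    clfs.append(DecisionTreeClassifier(random_state=0).fit(Xs, ys))

# Weighted soft vote over the weak classifiers.
proba = sum(w * c.predict_proba(X)[:, 1] for w, c in zip(weights, clfs))
pred = (proba >= 0.5).astype(int)
```

Under this sketch, classifiers trained near the ratio `mu` dominate the vote, while those at extreme ratios, which are biased toward one class, contribute less.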
Notes
Handling categorical attributes is beyond the scope of this paper.
Acknowledgements
This work was partly supported by JSPS KAKENHI Grant Number JP18K18056 and the Kayamori Foundation of Informational Science Advancement.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Komamizu, T., Uehara, R., Ogawa, Y., Toyama, K. (2020). MUEnsemble: Multi-ratio Undersampling-Based Ensemble Framework for Imbalanced Data. In: Hartmann, S., Küng, J., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2020. Lecture Notes in Computer Science(), vol 12392. Springer, Cham. https://doi.org/10.1007/978-3-030-59051-2_14
DOI: https://doi.org/10.1007/978-3-030-59051-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59050-5
Online ISBN: 978-3-030-59051-2
eBook Packages: Computer Science; Computer Science (R0)