Clustering Based Undersampling for Effective Learning from Imbalanced Data: An Iterative Approach

Bhattacharya, Rajdeep; De, Rajonya; Chakraborty, Anuran; Sarkar, Ram

doi:10.1007/s42979-024-02717-4

Original Research
Published: 01 April 2024

Volume 5, article number 386, (2024)
SN Computer Science

Rajdeep Bhattacharya¹ (ORCID: orcid.org/0000-0002-9400-6037), Rajonya De¹, Anuran Chakraborty¹ & Ram Sarkar¹

Abstract

The class imbalance problem is prevalent in many classification tasks, such as disease identification using microarray data and network intrusion detection. In these tasks the class distribution is skewed towards one class, commonly known as the majority class. Traditional classifiers may then perform poorly because they tend to become biased towards the majority class. To address this problem, this paper proposes an intelligent undersampling technique. The method first groups the majority class samples into $l$ clusters, for a chosen number $l$, using the K-means clustering algorithm. The centroid of each cluster is selected to form the undersampled majority class set. A classifier is then trained on this undersampled dataset, which consists of the selected majority class samples and all the minority class samples. The trained model is used to predict the probability that each majority class sample belongs to the minority class. A Gaussian distribution is then constructed from these probabilities and used to select the top $p$ percent of samples from each cluster. The centroid of each cluster is recomputed using only these samples, and this centroid becomes the new representative sample for that cluster. The classifier is trained again on these samples together with the minority class samples, so that the classifier improves iteratively. Evaluated on some standard datasets, the proposed method performs better than most state-of-the-art methods.
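
The loop described in the abstract can be sketched in a few lines of Python. This is only an illustrative reading of the abstract, not the authors' implementation, and it rests on several assumptions that the abstract does not settle: the classifier is a stand-in (a small logistic regression), "top" is taken to mean the highest predicted minority probability, the Gaussian is fitted per cluster and used as a percentile threshold, and the number of rounds is fixed in advance. All function names and default values (`kmeans_labels`, `select_top`, `iterative_undersample`, `rounds=3`, and so on) are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [fmean(column) for column in zip(*points)]


def kmeans_labels(points, l, iters=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    centers = random.Random(seed).sample(points, l)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(l), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(l):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = centroid(members)
    return labels


def minority_proba(w, b, x):
    """Logistic-model probability that x belongs to the minority class."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def train_classifier(X, y, epochs=500, lr=0.1):
    """Stand-in classifier (assumption): logistic regression fitted by
    batch gradient descent. Returns the weights and the bias."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, target in zip(X, y):
            err = minority_proba(w, b, x) - target
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def select_top(cluster, probs, p):
    """Assumed selection rule: fit a Gaussian to the cluster's minority
    probabilities and keep samples at or above its (100 - p)th percentile.
    Requires 0 < p < 100."""
    mu, sigma = fmean(probs), pstdev(probs)
    if sigma == 0.0:  # all probabilities equal: nothing to rank
        return list(cluster)
    threshold = NormalDist(mu, sigma).inv_cdf(1.0 - p / 100.0)
    kept = [x for x, q in zip(cluster, probs) if q >= threshold]
    # Fall back to the single highest-probability sample if none pass.
    return kept or [cluster[probs.index(max(probs))]]


def iterative_undersample(majority, minority, l=3, p=30.0, rounds=3, seed=0):
    """Returns the majority representatives (at most l) and the final model."""
    labels = kmeans_labels(majority, l, seed=seed)
    clusters = [[x for x, lab in zip(majority, labels) if lab == c]
                for c in range(l)]
    clusters = [c for c in clusters if c]      # drop empty clusters
    reps = [centroid(c) for c in clusters]     # initial undersampled set
    targets = [0] * len(reps) + [1] * len(minority)
    for _ in range(rounds):
        model = train_classifier(reps + minority, targets)
        reps = [centroid(select_top(c, [minority_proba(*model, x) for x in c], p))
                for c in clusters]
    # Retrain on the last set of recomputed centroids.
    return reps, train_classifier(reps + minority, targets)
```

Calling `iterative_undersample(majority, minority)` with two lists of feature vectors returns one representative per non-empty cluster together with the `(weights, bias)` pair of the final stand-in model. In practice the hand-written K-means and logistic regression would likely be replaced by library implementations; they are written out here only to keep the sketch dependency-free.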

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Jadavpur University, 188 Raja S.C. Mallick Rd, Kolkata, 700032, India
Rajdeep Bhattacharya, Rajonya De, Anuran Chakraborty & Ram Sarkar

Corresponding author

Correspondence to Rajdeep Bhattacharya.

Ethics declarations

Conflict of Interests

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bhattacharya, R., De, R., Chakraborty, A. et al. Clustering Based Undersampling for Effective Learning from Imbalanced Data: An Iterative Approach. SN COMPUT. SCI. 5, 386 (2024). https://doi.org/10.1007/s42979-024-02717-4

Received: 04 February 2022
Accepted: 14 February 2024
Published: 01 April 2024
DOI: https://doi.org/10.1007/s42979-024-02717-4
