Robust hybrid data-level sampling approach to handle imbalanced data during classification

Kaur, Prabhjot; Gosain, Anjana

doi:10.1007/s00500-020-04901-z

Methodologies and Application
Published: 11 April 2020

Volume 24, pages 15715–15732, (2020)

Soft Computing

Prabhjot Kaur¹ &
Anjana Gosain²

Abstract

Classification plays a significant role in finding patterns in data. The performance of classifiers is strongly affected by data impurities such as class imbalance, noise, class overlap and differing data distributions within classes, and real-world data are often corrupted by several of these impurities at once. To address this issue, this paper proposes a hybrid data-level method that handles class imbalance, noise and differing data distributions within classes. The proposed approach works in phases. In the first phase, it identifies and removes noise from the data and then detects the minority and majority clusters using a kernel-based fuzzy clustering approach with a radial basis kernel. In the next phase, the minority and majority clusters are processed to balance the data: radial basis kernel fuzzy membership and an \(\alpha\)-cut reduce the size of the majority cluster, and a firefly-based SMOTE method intelligently generates synthetic data within the minority cluster. After the data impurities have been removed, a traditional classifier (Decision Tree) is used to classify the balanced data. The proposed method is tested on 3 synthetic data-sets and 44 real-world UCI data-sets with imbalance ratios ranging from 1.82 to 129.44. The area under the ROC curve is used to assess the proposed method and to compare it with 20 other data-level methods. Experimental results indicate that the proposed method outperformed every other method, especially on highly imbalanced data-sets.
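The balancing phase described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several simplifications are assumptions made here for illustration: the cluster centres are fixed at the class means instead of being found by clustering, the kernel-induced distance is taken as the common Gaussian-kernel form 2(1 - K(x, v)), the values of `sigma`, `m` and `alpha` are arbitrary, noise removal is omitted, and plain SMOTE-style interpolation stands in for the firefly-based variant. All function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, v, sigma):
    """Radial basis (Gaussian) kernel between each row of X and a centre v."""
    return np.exp(-np.sum((X - v) ** 2, axis=1) / (2.0 * sigma ** 2))

def kernel_fuzzy_memberships(X, centres, sigma=1.0, m=2.0):
    """Fuzzy c-means style memberships computed from kernel-induced distances."""
    # Assumed distance: d^2(x, v) = 2 * (1 - K(x, v)) for a Gaussian kernel.
    d2 = np.stack([2.0 * (1.0 - rbf_kernel(X, v, sigma)) for v in centres], axis=1)
    d2 = np.maximum(d2, 1e-12)            # avoid division by zero at a centre
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)   # each row sums to 1

def alpha_cut_undersample(X_maj, X_min, alpha=0.6, sigma=1.0):
    """Keep majority points whose majority-cluster membership is at least alpha."""
    # Assumption: centres are the class means rather than learned centres.
    centres = np.vstack([X_maj.mean(axis=0), X_min.mean(axis=0)])
    u = kernel_fuzzy_memberships(X_maj, centres, sigma)
    return X_maj[u[:, 0] >= alpha]

def smote_like(X_min, n_new, k=2, seed=0):
    """Plain SMOTE-style interpolation (stand-in for the firefly-based variant)."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))          # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

def imbalance_ratio(n_maj, n_min):
    """Imbalance ratio: majority count divided by minority count."""
    return n_maj / n_min

# Toy data: six majority points near the origin, one stray majority point
# sitting inside the minority region, and three minority points.
X_maj = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0],
                  [0.0, -0.1], [0.2, 0.2], [4.9, 5.0]])
X_min = np.array([[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

kept = alpha_cut_undersample(X_maj, X_min, alpha=0.6, sigma=1.0)
synthetic = smote_like(X_min, n_new=len(kept) - len(X_min))
X_min_balanced = np.vstack([X_min, synthetic])

ir_before = imbalance_ratio(len(X_maj), len(X_min))
ir_after = imbalance_ratio(len(kept), len(X_min_balanced))
```

In this toy setting the stray majority point receives a low majority-cluster membership and falls below the cut, while the synthetic minority points are convex combinations of existing minority points. The balanced arrays could then be passed to any classifier, such as a decision tree.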

Acknowledgements

The authors would like to thank the reviewers for their constructive comments, which helped to improve the quality of the paper.

Author information

Authors and Affiliations

Maharaja Surajmal Institute of Technology, GGSIP University, New Delhi, India
Prabhjot Kaur
USICT, GGSIP University, New Delhi, India
Anjana Gosain

Corresponding author

Correspondence to Prabhjot Kaur.

Ethics declarations

Conflict of interest

All the authors declare that there is no conflict of interest in publishing this paper.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1

Table 4 Properties of data-sets

About this article

Cite this article

Kaur, P., Gosain, A. Robust hybrid data-level sampling approach to handle imbalanced data during classification. Soft Comput 24, 15715–15732 (2020). https://doi.org/10.1007/s00500-020-04901-z

Published: 11 April 2020
Issue Date: October 2020
DOI: https://doi.org/10.1007/s00500-020-04901-z
