Abstract
As big data applications grow, class imbalance in datasets has become an increasingly serious problem. On high-dimensional imbalanced datasets, traditional classification algorithms tend to favor the majority class and neglect the minority class, which results in poor classification performance. In this paper, we study the classification of high-dimensional imbalanced datasets and propose a feature selection algorithm based on density clustering and importance measure (DBIM). DBIM first constructs multiple balanced subsets by randomly under-sampling the majority classes to the same number of samples as the minority classes, and uses DBSCAN as the base classifier. This process quickly discovers the density-based distribution characteristics of the features and generates the initial feature subspace. To select features with strong discriminative power for the class labels, we rank the features in the initial feature subspace by importance and select among them. To avoid redundancy between features and generate high-quality feature subsets, we further design a new weight index based on the class distribution and combine it with a redundancy evaluation index in DBIM to calculate the redundancy between features. Experimental results on eight publicly available datasets show that DBIM generates feature subsets with high relevance and low redundancy, effectively reduces the dimensionality of high-dimensional imbalanced datasets, and improves classification performance.
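The balanced-subset construction mentioned in the abstract can be illustrated with a minimal sketch. It covers only the random under-sampling step, not DBSCAN, the importance ranking, or the redundancy index. The function name `balanced_subsets`, its parameters, and the toy labels are hypothetical and chosen for illustration; the paper's actual sampling procedure may differ in its details.

```python
import random
from collections import defaultdict


def balanced_subsets(labels, n_subsets, seed=0):
    """Return n_subsets lists of sample indices, each class-balanced.

    Hypothetical helper: every subset draws, without replacement, as many
    samples from each class as the smallest class contains. The smallest
    class is therefore always kept in full.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Size of the minority (smallest) class.
    minority_size = min(len(indices) for indices in by_class.values())

    subsets = []
    for _ in range(n_subsets):
        chosen = []
        for indices in by_class.values():
            # Random under-sampling down to the minority-class size.
            chosen.extend(rng.sample(indices, minority_size))
        subsets.append(sorted(chosen))
    return subsets


# Toy data (assumed): 9 majority samples (label 0), 3 minority samples (label 1).
labels = [0] * 9 + [1] * 3
subsets = balanced_subsets(labels, n_subsets=4)
```

Each subset here holds six indices, three per class, and the subsets differ only in which majority samples they retain. Drawing several such subsets lets later steps see most of the majority class without any single subset being imbalanced.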
Availability of data and materials
Not applicable.
Acknowledgements
I would like to thank Professor Zhenfei Wang for encouraging me, helping me, and giving me pertinent guidance when I encountered bottlenecks.
Funding
National Natural Science Foundation of China, 61872324
Author information
Contributions
Z-FW contributed to the overall framework of the study. P-YY and L-YZ conceived and completed the experiments and wrote the manuscript, and Z-YC assisted with data analysis and constructive discussions during the experiments. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no competing or other interests that could affect the findings or conclusions described in this paper.
Ethics approval
Not applicable.
Consent to participate
Yes.
Consent for publication
Yes.
About this article
Cite this article
Wang, ZF., Yuan, PY., Cao, ZY. et al. Feature reduction of unbalanced data classification based on density clustering. Computing 106, 29–55 (2024). https://doi.org/10.1007/s00607-023-01206-5
DOI: https://doi.org/10.1007/s00607-023-01206-5