Feature reduction of unbalanced data classification based on density clustering

  • Regular Paper
  • Computing

Abstract

With the development of big data, imbalanced datasets are becoming an increasingly serious problem. On high-dimensional imbalanced datasets, traditional classification algorithms tend to favor the majority class and ignore the minority class, which results in poor classification performance. In this paper, we study the classification of high-dimensional imbalanced datasets and propose a feature selection algorithm based on density clustering and an importance measure (DBIM). DBIM first constructs multiple balanced subsets by randomly under-sampling the majority classes to the same number of samples as the minority classes, and uses DBSCAN as the base classifier. This process quickly discovers the density-based distribution characteristics of the features and generates the initial feature subspace. To select features with strong discriminative power for the class labels, we rank the features in the initial feature subspace by importance and select among them. To avoid redundancy between features and generate high-quality feature subsets, we further design a new class-distribution-based weight index and combine it with a redundancy evaluation index in DBIM to measure the redundancy between features. Experimental results on eight publicly available datasets show that DBIM generates feature subsets with high relevance and low redundancy, effectively reduces the dimensionality of high-dimensional imbalanced datasets, and improves classification performance.
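The first step described in the abstract, building several balanced subsets by randomly under-sampling the larger classes to the size of the smallest class, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `balanced_subsets` and the parameters `n_subsets` and `seed` are hypothetical, and the later DBSCAN, importance-ranking, and redundancy steps are not shown because the abstract does not specify them.

```python
import random
from collections import Counter


def balanced_subsets(labels, n_subsets=5, seed=0):
    """Return index lists that each hold an equal number of samples per class.

    Hypothetical helper: every class is randomly under-sampled (without
    replacement) to the size of the smallest class, so all minority samples
    appear in every subset while the majority samples vary between subsets.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Size of the smallest (minority) class.
    n_min = min(len(idxs) for idxs in by_class.values())

    subsets = []
    for _ in range(n_subsets):
        chosen = []
        for idxs in by_class.values():
            chosen.extend(rng.sample(idxs, n_min))
        subsets.append(sorted(chosen))
    return subsets


# Toy labels (assumed data): 90 majority samples (0) and 10 minority samples (1).
labels = [0] * 90 + [1] * 10
for subset in balanced_subsets(labels, n_subsets=3):
    print(Counter(labels[i] for i in subset))
```

Each returned index list can then be used to slice the feature matrix before the density-clustering step is applied to that subset.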

Availability of data and materials

Not applicable.

Acknowledgements

I would like to thank Professor Zhenfei Wang for encouraging me, helping me, and giving me pertinent guidance when I encountered bottlenecks.

Funding

National Natural Science Foundation of China, 61872324

Author information

Contributions

Z-FW contributed to the overall framework of the study. P-YY and L-YZ conceived and completed the experiments and wrote the manuscript, and Z-YC assisted with data analysis and constructive discussions during the experiments. All authors reviewed the manuscript.

Corresponding author

Correspondence to Li-Ying Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no competing or other interests that could affect the findings or conclusions described in this paper.

Ethics approval

Not applicable.

Consent to participate

Yes.

Consent for publication

Yes.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, ZF., Yuan, PY., Cao, ZY. et al. Feature reduction of unbalanced data classification based on density clustering. Computing 106, 29–55 (2024). https://doi.org/10.1007/s00607-023-01206-5

  • DOI: https://doi.org/10.1007/s00607-023-01206-5
