Feature reduction of unbalanced data classification based on density clustering

Wang, Zhen-Fei; Yuan, Pei-Yao; Cao, Zhong-Ya; Zhang, Li-Ying

doi:10.1007/s00607-023-01206-5

Regular Paper
Published: 21 August 2023

Volume 106, pages 29–55, (2024)
Computing

Zhen-Fei Wang¹,
Pei-Yao Yuan¹,
Zhong-Ya Cao¹ &
…
Li-Ying Zhang ORCID: orcid.org/0000-0001-7742-4985¹

Abstract

With the development of big data, the problem of imbalanced datasets is becoming increasingly serious. On high-dimensional imbalanced datasets, traditional classification algorithms tend to favor the majority class and ignore the minority class, which results in poor classification performance. In this paper, we study the classification of high-dimensional imbalanced datasets and propose a feature selection algorithm based on density clustering and importance measure (DBIM). DBIM first constructs multiple balanced subsets by randomly under-sampling the majority classes to the same number of samples as the minority classes, and uses DBSCAN as the base classifier. This process quickly discovers the distribution characteristics of the features based on density and generates the initial feature subspace. To select features that discriminate strongly between class labels, we rank the features in the initial feature subspace by importance and select among them. To avoid redundancy between features and generate high-quality feature subsets, we further design a new class distribution-based weight index and combine it with the redundancy evaluation index in the DBIM algorithm to measure the redundancy between features. Experimental results on eight publicly available datasets show that DBIM generates feature subsets with high relevance and low redundancy, effectively reduces the dimensionality of high-dimensional imbalanced datasets, and improves classification performance.
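The balanced-subset construction mentioned in the abstract can be illustrated with a minimal sketch. It covers only the random under-sampling step and is not the authors' implementation: the function name `balanced_subsets`, the parameters `n_subsets` and `seed`, and the toy label vector are all assumptions made for illustration. The DBSCAN, importance-ranking, and redundancy steps are not shown, because the abstract does not specify them in enough detail.

```python
import random
from collections import defaultdict


def balanced_subsets(labels, n_subsets, seed=0):
    """Return n_subsets lists of sample indices, each class-balanced.

    Every class is randomly under-sampled (without replacement) to the
    size of the smallest class, so each subset contains the same number
    of samples per class. Hypothetical helper, not from the paper.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # The smallest class determines how many samples each class keeps.
    minority_size = min(len(idxs) for idxs in by_class.values())

    subsets = []
    for _ in range(n_subsets):
        subset = []
        for idxs in by_class.values():
            # Sampling minority_size items keeps the whole minority class
            # and a random part of every larger class.
            subset.extend(rng.sample(idxs, minority_size))
        subsets.append(sorted(subset))
    return subsets


# Toy data (assumed): 90 majority samples (label 0), 10 minority (label 1).
labels = [0] * 90 + [1] * 10
subsets = balanced_subsets(labels, n_subsets=5)
```

Each returned subset keeps every minority sample and a different random draw of majority samples, so a base learner trained on each subset sees a balanced class distribution while the ensemble of subsets still covers much of the majority class.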

Availability of data and materials

Not applicable.

Acknowledgements

I would like to thank Professor Zhenfei Wang for encouraging me, helping me, and giving me pertinent guidance when I encountered bottlenecks.

Funding

National Natural Science Foundation of China, 61872324

Author information

Authors and Affiliations

School of Computer and Artificial Intelligence, Zhengzhou University, No. 100, Kexue Avenue, Zhengzhou, 450001, Henan, China
Zhen-Fei Wang, Pei-Yao Yuan, Zhong-Ya Cao & Li-Ying Zhang

Contributions

Z-FW contributed to the overall framework of the study. P-YY and L-YZ conceived and completed the experiments and wrote the manuscript, and Z-YC assisted with data analysis and constructive discussions during the experiments. The manuscript was reviewed by all authors.

Corresponding author

Correspondence to Li-Ying Zhang.

Ethics declarations

Conflict of interest

I declare that the authors have no competing or other interests that could be used to affect the findings and/or conclusions described in this paper.

Ethics approval

Not applicable.

Consent to participate

Yes.

Consent for publication

Yes.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, ZF., Yuan, PY., Cao, ZY. et al. Feature reduction of unbalanced data classification based on density clustering. Computing 106, 29–55 (2024). https://doi.org/10.1007/s00607-023-01206-5

Received: 24 May 2022
Accepted: 22 July 2023
Published: 21 August 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s00607-023-01206-5
