A Positive Sample Enhancement Algorithm with Fuzzy Nearest Neighbor Hybridization for Imbalance Data

Yang, Jiapeng; Shi, Lei; Lu, Tielin; Yuan, Lu; Cheng, Nanchang; Yang, Xiaohui; Luo, Jia; Xu, Mingying

doi:10.1007/s40815-024-01721-3

Published: 03 June 2024

Volume 26, pages 2707–2725 (2024)
International Journal of Fuzzy Systems

Jiapeng Yang¹,
Lei Shi¹,² (ORCID: orcid.org/0000-0002-5570-7818),
Tielin Lu³,
Lu Yuan¹,⁴,
Nanchang Cheng¹,
Xiaohui Yang¹,
Jia Luo⁵ &
…
Mingying Xu⁶

Abstract

The class imbalance problem is a critical research area in machine learning and deep learning and has received widespread attention from researchers. Typical current methods address it by using only positive samples to generate synthetic samples that resemble the minority class, ignoring the characteristic information of negative samples. As a result, when the positive samples are too few and have highly similar features, the classifier suffers from fitting problems. To address these issues, we propose a new positive sample enhancement algorithm (PENH) that tackles class imbalance by simulating the process of chromosome cross-fusion. We use the K-nearest neighbor algorithm to select the fuzzy negative sample set around each positive sample, and adopt Mixup (beyond empirical risk minimization) to randomly hybridize the positive sample with negative samples from that set. To overcome the sample imbalance, we use a One-class SVM that overfits the positive samples to select among the newly generated unlabeled samples and obtain a balanced dataset. We conduct multiple experiments on 20 open datasets. The results show that PENH outperforms six baseline methods on multiple evaluation indicators.
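The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the parameters `k`, `n_new`, `alpha` and `seed`, and the toy data are all assumptions made for illustration. In particular, the One-class SVM selection step is replaced here by a simple centroid-distance filter so that the example runs with the Python standard library alone.

```python
import math
import random


def k_nearest_negatives(positive, negatives, k):
    """Return the k negative samples closest (Euclidean) to one positive sample."""
    return sorted(negatives, key=lambda neg: math.dist(positive, neg))[:k]


def mixup(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b, applied feature by feature."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def hybridize(positives, negatives, k=3, n_new=10, alpha=2.0, seed=0):
    """Generate n_new unlabeled candidates by mixing positives with nearby negatives.

    k, n_new, alpha and seed are illustrative parameters, not values from the paper.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_new):
        pos = rng.choice(positives)
        neighbour = rng.choice(k_nearest_negatives(pos, negatives, k))
        lam = rng.betavariate(alpha, alpha)  # Mixup draws lambda from Beta(alpha, alpha)
        candidates.append(mixup(pos, neighbour, lam))
    return candidates


def centroid_filter(candidates, positives):
    """Stand-in for the One-class SVM selection step (NOT the paper's model).

    Keeps a candidate only if it lies no farther from the positive centroid
    than the farthest real positive sample does.
    """
    dim = len(positives[0])
    centroid = [sum(p[d] for p in positives) / len(positives) for d in range(dim)]
    radius = max(math.dist(p, centroid) for p in positives)
    return [c for c in candidates if math.dist(c, centroid) <= radius]


# Toy two-feature data, invented for this example.
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
negatives = [[2.0, 2.0], [2.2, 1.8], [5.0, 5.0], [6.0, 5.5], [1.8, 2.1]]

candidates = hybridize(positives, negatives)
accepted = centroid_filter(candidates, positives)
print(len(candidates), len(accepted))
```

In a fuller version, the stand-in filter could be swapped for a one-class model fitted on the positive samples only; the accepted candidates would then be added to the minority class until the two classes are balanced.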

Data availability

The data that support the findings of this study are available from the public KEEL dataset repository (https://sci2s.ugr.es/keel/datasets.php).

Acknowledgements

This work is supported by the National Key Research and Development Program of China (Nos. 2022YFE0197600, 2022YFC3302103), the Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE (No. 202306), the Guangxi Key Laboratory of Trusted Software (No. KX202315), the Fundamental Research Funds for the Central Universities (No. CUC23GZ017), the China Association of Higher Education 2023 Higher Education Science Research Planning Project “Exploration and Practical Research on the Education Path of Traditional Chinese Culture for International Students Coming to China in the Context of New Media” (No. 23LH0403), the National Natural Science Foundation of China (No. 72104016), and the R&D Program of the Beijing Municipal Education Commission (No. SM202110005011).

Author information

Authors and Affiliations

State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, 100024, China
Jiapeng Yang, Lei Shi, Lu Yuan, Nanchang Cheng & Xiaohui Yang
Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing, 100081, China
Lei Shi
Instrumentation Technology and Economy Institute, Beijing, 100055, China
Tielin Lu
School of Data Science and Media Intelligence, Communication University of China, Beijing, 100024, China
Lu Yuan
College of Economics and Management, Beijing University of Technology, Beijing, 100124, China
Jia Luo
School of Information Science, North China University of Technology, Beijing, 100144, China
Mingying Xu

Corresponding author

Correspondence to Lei Shi.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, J., Shi, L., Lu, T. et al. A Positive Sample Enhancement Algorithm with Fuzzy Nearest Neighbor Hybridization for Imbalance Data. Int. J. Fuzzy Syst. 26, 2707–2725 (2024). https://doi.org/10.1007/s40815-024-01721-3

Received: 01 November 2023
Revised: 20 February 2024
Accepted: 28 February 2024
Published: 03 June 2024
Issue Date: November 2024
DOI: https://doi.org/10.1007/s40815-024-01721-3
