Abstract
The class imbalance problem is a critical research area in machine learning and deep learning and has received widespread attention from researchers. To address it, typical current methods use only positive samples to generate synthetic samples similar to the minority class, ignoring the characteristic information of negative samples. Consequently, when the positive samples are too few and have highly similar features, the classifier is prone to fitting problems. In response, we propose a new positive sample enhancement algorithm (PENH) that addresses class imbalance by simulating the process of chromosome cross-fusion. We use the K-nearest neighbor algorithm to select the fuzzy negative sample set around each positive sample and adopt Mixup (beyond empirical risk minimization) to randomly hybridize the positive sample with negative samples from that set. To overcome the sample imbalance, we adopt a One-class SVM overfitted to the positive samples to select among the newly generated unlabeled samples and obtain a balanced dataset. We conduct multiple experiments on 20 open datasets. The results show that PENH outperforms six baseline methods on multiple evaluation indicators.
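The hybridization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the parameters `k`, `alpha`, and `max_tries`, and the folding of the Mixup coefficient toward the positive sample are all assumptions made for illustration. The One-class SVM selection step is represented only by a pluggable `keep` predicate, which in a fuller version could wrap a one-class model fitted on the positive samples.

```python
import math
import random


def nearest_negatives(pos, negatives, k):
    """Return the k negatives closest to `pos` by Euclidean distance."""
    return sorted(negatives, key=lambda neg: math.dist(pos, neg))[:k]


def mixup(pos, neg, lam):
    """Convex combination lam * pos + (1 - lam) * neg, feature by feature."""
    return [lam * p + (1.0 - lam) * n for p, n in zip(pos, neg)]


def hybridize(positives, negatives, n_new, k=3, alpha=1.0,
              keep=None, rng=None, max_tries=10_000):
    """Generate up to n_new synthetic positive candidates.

    Hypothetical helper: `keep` stands in for the selection step
    (the abstract uses a One-class SVM fitted on positive samples).
    """
    rng = rng or random.Random()
    keep = keep or (lambda candidate: True)
    synthetic = []
    tries = 0
    while len(synthetic) < n_new and tries < max_tries:
        tries += 1
        pos = rng.choice(positives)
        # Pick one negative from the neighborhood around this positive.
        neg = rng.choice(nearest_negatives(pos, negatives, k))
        lam = rng.betavariate(alpha, alpha)
        # Assumption of this sketch: fold lam into [0.5, 1] so the
        # hybrid stays closer to the positive parent.
        lam = max(lam, 1.0 - lam)
        candidate = mixup(pos, neg, lam)
        if keep(candidate):
            synthetic.append(candidate)
    return synthetic


if __name__ == "__main__":
    positives = [[0.0, 0.0], [1.0, 0.0]]
    negatives = [[0.0, 2.0], [1.0, 2.0], [50.0, 50.0]]
    for sample in hybridize(positives, negatives, n_new=3, k=2,
                            rng=random.Random(0)):
        print(sample)
```

Because each candidate mixes a positive sample with one of its nearest negatives, the synthetic points fall in the boundary region between the classes rather than only inside the positive cluster, which is the contrast the abstract draws with methods that use positive samples alone.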






Data availability
The data that support the findings of this study are available on request from public dataset websites (https://sci2s.ugr.es/keel/datasets.php).
Acknowledgements
This work is supported by the National Key Research and Development Program of China (Nos. 2022YFE0197600, 2022YFC3302103), the Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE (No. 202306), the Guangxi Key Laboratory of Trusted Software (No. KX202315), the Fundamental Research Funds for the Central Universities (No. CUC23GZ017), the China Association of Higher Education 2023 Higher Education Science Research Planning Project “Exploration and Practical Research on the Education Path of Traditional Chinese Culture for International Students Coming to China in the Context of New Media” (No. 23LH0403), the National Natural Science Foundation of China (No. 72104016), and the R&D Program of the Beijing Municipal Education Commission (No. SM202110005011).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, J., Shi, L., Lu, T. et al. A Positive Sample Enhancement Algorithm with Fuzzy Nearest Neighbor Hybridization for Imbalance Data. Int. J. Fuzzy Syst. 26, 2707–2725 (2024). https://doi.org/10.1007/s40815-024-01721-3
DOI: https://doi.org/10.1007/s40815-024-01721-3