Abstract
With the rapid development of network technology, the Internet has become central to many sectors of society. Because malicious attacks can have severe and unpredictable consequences, the detection of anomalous network traffic has attracted considerable research attention over the past few decades. Network traffic is generated rapidly and in massive volumes, so accurately labeling enough of it to form a training dataset within a short period is difficult. Furthermore, malicious traffic makes up only a small proportion of overall traffic, and the amount of traffic varies significantly across attack types. To address these challenges, this paper presents a novel network anomaly detection algorithm based on semi-supervised learning and adaptive multiclass balancing. Building on the assumption that labeled and unlabeled data follow a consistent distribution, the paper introduces a multiclass split balancing strategy and an adaptive confidence threshold function to tackle multiclass imbalance in traffic data. By exploiting the mutually beneficial relationship between semi-supervised learning and ensemble learning, the paper also presents the collaborative rotation forest algorithm, which is designed to improve anomaly detection performance when labels are scarce. Comparative experiments on the NSL-KDD, UNSW-NB15, and ToN-IoT datasets show that the proposed algorithm improves precision by 1.5–5.7%, recall by 1.5–5.7%, and F-measure by 1.4–4.3% compared with state-of-the-art algorithms.
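The abstract does not state the form of the adaptive confidence threshold function, so the following is only an illustrative sketch of the general idea: during pseudo-labeling, rarer classes are given a lower confidence bar so that minority attack types are not starved of new training samples. The function names, the linear scaling rule, and the `base` and `floor` parameters are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def adaptive_thresholds(labels, base=0.9, floor=0.6):
    """Return a per-class confidence threshold.

    Assumption (not from the paper): the threshold grows linearly with the
    class frequency relative to the largest class, from `floor` toward `base`.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    return {
        cls: floor + (base - floor) * count / largest
        for cls, count in counts.items()
    }


def select_pseudo_labels(probabilities, thresholds):
    """Pick unlabeled samples whose top predicted class clears that class's threshold.

    `probabilities` is a list of dicts mapping class name to predicted probability.
    Returns a list of (sample_index, predicted_class) pairs.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        predicted = max(probs, key=probs.get)
        if probs[predicted] >= thresholds[predicted]:
            selected.append((index, predicted))
    return selected


# Toy labeled set with a strong class imbalance (hypothetical class names).
labeled = ["normal"] * 8 + ["dos"] * 4 + ["u2r"] * 1
thresholds = adaptive_thresholds(labeled)

# Hypothetical classifier outputs for three unlabeled flows.
unlabeled_probs = [
    {"normal": 0.95, "dos": 0.03, "u2r": 0.02},
    {"normal": 0.80, "dos": 0.15, "u2r": 0.05},
    {"normal": 0.20, "dos": 0.10, "u2r": 0.70},
]
picked = select_pseudo_labels(unlabeled_probs, thresholds)
print(thresholds)
print(picked)
```

In this toy run the second flow is rejected because 0.80 falls below the majority-class threshold, while the third flow is accepted as the rare class at only 0.70 confidence. A fixed global threshold would have treated both the same way.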
Availability of data and materials
Not applicable.
Funding
This work was supported in part by the Fund of the China Scholarship Council; the National Natural Science Foundation of China under Grants U1804263 and 61877010; the Natural Science Foundation of Fujian Province, China, under Grants 2021J01616, 2020J01130167 and 2021J01625; and the Joint Straits Fund of Key Program of the National Natural Science Foundation of China under Grant U1705262.
Author information
Contributions
HZ contributed to the conception of the study and performed the data analyses. ZX performed the experiment and wrote the main manuscript text. JG contributed significantly to analysis and manuscript preparation. YL helped perform the analysis with constructive discussions. All authors reviewed the manuscript.
Ethics declarations
Ethical approval
Applicable for both human and/or animal studies.
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Zhang, H., Xiao, Z., Gu, J. et al. A network anomaly detection algorithm based on semi-supervised learning and adaptive multiclass balancing. J Supercomput 79, 20445–20480 (2023). https://doi.org/10.1007/s11227-023-05474-y
DOI: https://doi.org/10.1007/s11227-023-05474-y