A network anomaly detection algorithm based on semi-supervised learning and adaptive multiclass balancing

The Journal of Supercomputing

Abstract

With the rapid development of network technology, the Internet has brought significant convenience to many sectors of society. Because malicious attacks can have unpredictable and severe consequences, the detection of anomalous network traffic has attracted considerable attention from researchers over the past few decades. Network traffic is generated rapidly and in massive volumes, so accurately labeling enough of it to form a training dataset within a short period of time is challenging. Furthermore, malicious traffic makes up a relatively small proportion of overall traffic, and the amount of traffic also varies significantly across different types of attacks. To address these challenges, this paper presents a novel network anomaly detection algorithm based on semi-supervised learning and adaptive multiclass balancing. Building on the assumption that labeled and unlabeled data follow a consistent distribution, the paper introduces a multiclass split balancing strategy and an adaptive confidence threshold function to tackle multiclass imbalance in traffic data. By leveraging the mutually beneficial relationship between semi-supervised learning and ensemble learning, the paper presents the collaborative rotation forest algorithm, which is designed to enhance anomaly detection performance when labels are scarce. Comparative experiments on the NSL-KDD, UNSW-NB15, and ToN-IoT datasets demonstrate that the proposed algorithm achieves significant performance improvements: it improves precision by 1.5–5.7%, recall by 1.5–5.7%, and F-Measure by 1.4–4.3% compared with state-of-the-art algorithms.
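The abstract names an adaptive confidence threshold function but does not define it. As a rough illustration of the general idea only, the sketch below shows how a per-class threshold might gate pseudo-labels in a self-training loop so that rare attack classes are not starved of new samples. The linear rule, the function names, the `base` and `floor` parameters, and the toy data are all assumptions made for this example, not the paper's method.

```python
from collections import Counter


def adaptive_thresholds(labels, base=0.9, floor=0.6):
    """Per-class confidence thresholds: rarer classes get a lower bar.

    Hypothetical rule, not the paper's function: the most frequent class
    keeps roughly `base`, and the threshold shrinks linearly toward
    `floor` as a class becomes rarer relative to the largest class.
    """
    counts = Counter(labels)
    n_max = max(counts.values())
    return {c: floor + (base - floor) * (n / n_max) for c, n in counts.items()}


def select_pseudo_labels(probabilities, thresholds):
    """Keep unlabeled samples whose top predicted class clears its threshold.

    `probabilities` holds one {class: probability} dict per unlabeled
    sample. Returns a list of (sample_index, predicted_class) pairs.
    """
    selected = []
    for i, dist in enumerate(probabilities):
        predicted = max(dist, key=dist.get)
        # Classes never seen in the labeled set are never accepted.
        if dist[predicted] >= thresholds.get(predicted, 1.0):
            selected.append((i, predicted))
    return selected


# Toy, made-up data: an imbalanced labeled set and three classifier outputs.
labeled = ["normal"] * 80 + ["dos"] * 16 + ["probe"] * 4
thresholds = adaptive_thresholds(labeled)
unlabeled_probs = [
    {"normal": 0.95, "dos": 0.03, "probe": 0.02},
    {"normal": 0.85, "dos": 0.10, "probe": 0.05},
    {"normal": 0.20, "dos": 0.10, "probe": 0.70},
]
pseudo = select_pseudo_labels(unlabeled_probs, thresholds)
print(thresholds)
print(pseudo)
```

With these toy numbers, the second sample is rejected because its "normal" prediction falls short of the majority-class threshold, while the third is accepted because the rare "probe" class has a lower bar. A fixed global threshold of 0.9 would have discarded the third sample as well, which is the kind of bias against minority classes that a class-aware threshold is meant to reduce.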

Availability of data and materials

Not applicable.

Funding

This work was supported in part by the Fund of the China Scholarship Council, the National Natural Science Foundation of China under Grants U1804263 and 61877010, the Natural Science Foundation of Fujian Province China under Grants 2021J01616, 2020J01130167 and 2021J01625, and the Joint Straits Fund of Key Program of the National Natural Science Foundation of China under Grant U1705262.

Author information

Contributions

HZ contributed to the conception of the study and performed the data analyses. ZX performed the experiment and wrote the main manuscript text. JG contributed significantly to analysis and manuscript preparation. YL helped perform the analysis with constructive discussions. All authors reviewed the manuscript.

Corresponding author

Correspondence to Yanhua Liu.

Ethics declarations

Ethical approval

Applicable for both human and/or animal studies.

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, H., Xiao, Z., Gu, J. et al. A network anomaly detection algorithm based on semi-supervised learning and adaptive multiclass balancing. J Supercomput 79, 20445–20480 (2023). https://doi.org/10.1007/s11227-023-05474-y

  • DOI: https://doi.org/10.1007/s11227-023-05474-y
