Abstract
Class imbalance in network intrusion detection datasets tends to cause underfitting or bias in classifier training. This investigation applies Batched Variational AutoEncoders (B-VAE) to build a data generation model that balances intrusion detection datasets and thereby improves detection. To remedy the insufficient decoder training of the conventional VAE approach, B-VAE trains one decoder for each piece of data on a batch of duplicates of that data, forming multiple batched VAEs that together provide sufficient decoder training. As a result, the generated data are all similar to, but different from, the original data, which secures the data balance needed for better classifier training and classification results. Experimental evaluation comparing related balancing approaches shows that B-VAE outperforms the others in that it maintains the same classification accuracy (in terms of F1-scores) regardless of changes in the Imbalance Ratio (IR). Specifically, B-VAE solves the problem of insufficient decoder training in existing approaches and so enhances intrusion detection performance, mainly because sufficient decoder training and the use of exact features let it generate balanced data that lifts classification accuracy.
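The two quantities the evaluation relies on, the Imbalance Ratio and the F1-score, can be illustrated with a minimal sketch. It assumes the common definition of IR as the majority-class count divided by the minority-class count, which the abstract itself does not spell out. The function names and the toy labels are hypothetical and are not taken from the paper's experiments.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return majority-class count / minority-class count.

    Assumed definition: the abstract does not define IR explicitly.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute an imbalance ratio")
    return max(counts.values()) / min(counts.values())


def f1_score(y_true, y_pred, positive):
    """Return the F1-score (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 8 normal records and 2 attack records.
y_true = ["normal"] * 8 + ["attack"] * 2
# Hypothetical predictions: one false alarm and one missed attack.
y_pred = ["normal"] * 7 + ["attack"] + ["attack", "normal"]

ir = imbalance_ratio(y_true)
f1_attack = f1_score(y_true, y_pred, positive="attack")
print(ir, f1_attack)
```

In this toy case the minority (attack) class has one true positive, one false positive, and one false negative, so precision and recall are both 0.5. A balancing method would be judged by how stable this per-class F1 stays as the ratio between the two classes grows.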
Data availability
All of the material is owned by the authors and/or no permissions are required.
Funding
No funding.
Author information
Contributions
P-JC and P-YH wrote the main manuscript text, prepared all the figures and reviewed the manuscript.
Ethics declarations
Conflict of interests
The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.
Ethical Approval
Ethical committees, Internal Review Boards and guidelines followed must be named. When applicable, additional headings with statements on consent to participate and consent to publish are also required.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chuang, PJ., Huang, PY. B-VAE: a new dataset balancing approach using batched Variational AutoEncoders to enhance network intrusion detection. J Supercomput 79, 13262–13286 (2023). https://doi.org/10.1007/s11227-023-05171-w
DOI: https://doi.org/10.1007/s11227-023-05171-w