Abstract
Advanced persistent threats (APTs) present a significant cybersecurity challenge and call for innovative detection methods. This study integrates advanced data preparation with strategies for handling class imbalance, tailored to the SCVIC-APT-2021 dataset. We combine resampling, cost-sensitive learning, and ensemble methods with machine learning and deep learning models such as XGBoost, LightGBM, and artificial neural networks (ANNs) to enhance APT detection. Our strategy, which draws on the MITRE ATT&CK framework, targets each stage of an APT attack and significantly increases detection accuracy. Notably, we achieved a Macro F1-score of 95.20% with XGBoost and 96.67% with LightGBM, along with significant improvements in the area under the precision–recall curve for both models. Our exploration of the SCVIC-APT-2021 dataset marks a progressive step in APT detection research, with vital implications for future cybersecurity developments.
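To illustrate why the abstract reports a Macro F1-score and relies on cost-sensitive learning, the following is a minimal sketch in plain Python. It is not the authors' code: the toy labels, the helper names `macro_f1` and `balanced_class_weights`, and the inverse-frequency weighting heuristic are all assumptions introduced here for illustration only.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Hypothetical toy labels (NOT taken from SCVIC-APT-2021):
# 8 benign flows and 2 attack flows.
y_true = ["benign"] * 8 + ["attack"] * 2
# A degenerate classifier that never flags an attack.
y_pred = ["benign"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro F1:", macro_f1(y_true, y_pred))
print("class weights:", balanced_class_weights(y_true))
```

On this toy data the degenerate classifier still looks acceptable by accuracy, while Macro F1 is pulled down by the missed minority class. The inverse-frequency weights give the rare class a larger misclassification cost, which is one common way to make a learner cost-sensitive.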












Availability of data and materials
Data generated or analyzed during this study are included in this published article.
Code availability
The custom code developed for the experiments in this study is available upon request from the corresponding author.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Contributions
The primary contributor to the research was Dinh-Dong Dau, who created the test code and edited the research paper’s main content. Hanseok Kim assisted in reviewing language and grammar errors in the research paper. Professor Soojin Lee supervised the research and provided oversight for the entire content of the research paper. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interests
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable, as this study did not involve human participants or animals.
Consent to participate
Not applicable, as this study did not involve human participants.
Consent for publication
All authors have consented to the publication of this research.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dau, DD., Lee, S. & Kim, H. A comprehensive comparison study of ML models for multistage APT detection: focus on data preprocessing and resampling. J Supercomput 80, 14143–14179 (2024). https://doi.org/10.1007/s11227-024-06010-2
DOI: https://doi.org/10.1007/s11227-024-06010-2