Abstract
Cyberattacks can lead to data breaches, service disruptions, and economic losses, and may even threaten national security and social stability. Researchers have therefore proposed various methods based on public datasets to improve the intelligence and automation of cybersecurity defense techniques. However, these public datasets usually cover only a limited range of cyberattack types, so the proposed methods are ineffective against attacks that the datasets do not include. Meanwhile, cybersecurity defenders often need to study cyberattack scenarios involving specific assets that are usually not represented in public datasets. To address these challenges, we propose a new approach to controllable dataset generation for cybersecurity. Using a four-role architecture, our method can reproduce any cyberattack and generate customized private attack data that includes specific assets, which satisfies the needs of researchers. By integrating the private attack data with a cybersecurity knowledge base derived from open-source datasets, we construct a comprehensive cybersecurity dataset. Extensive experiments demonstrate that the dataset generated by our method is suitable for various common cybersecurity tasks, such as threat hunting, alert analysis, and knowledge reasoning.
Acknowledgments
This work was supported in part by the Major Key Project of PCL (Grant No. PCL2023A07-4), the National Natural Science Foundation of China (Grant No. 62372137), and the Guangxi Natural Science Foundation (No. 2022GXNSFBA035650).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Xie, Y., Wang, H., Tan, R., Song, X., Gu, Z. (2025). CDGM: Controllable Dataset Generation Method for Cybersecurity. In: Sheng, Q.Z., et al. Advanced Data Mining and Applications. ADMA 2024. Lecture Notes in Computer Science, vol 15392. Springer, Singapore. https://doi.org/10.1007/978-981-96-0850-8_16
DOI: https://doi.org/10.1007/978-981-96-0850-8_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0849-2
Online ISBN: 978-981-96-0850-8
eBook Packages: Computer Science, Computer Science (R0)