CDGM: Controllable Dataset Generation Method for Cybersecurity

  • Conference paper
Advanced Data Mining and Applications (ADMA 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15392)

Abstract

Cyberattacks can lead to data breaches, service disruptions, and economic losses, and may even threaten national security and social stability. Researchers have therefore proposed various methods, built on public datasets, to make cybersecurity defense techniques more intelligent and automated. However, these public datasets usually cover only a limited range of cyberattack types, so the proposed methods are ineffective against attacks that the datasets do not include. Meanwhile, cybersecurity defenders often need to study cyberattack scenarios involving specific assets that are usually not represented in public datasets. To address these challenges, we propose a new approach to controllable dataset generation for cybersecurity. Using our four-role architecture, our method can reproduce any cyberattack and generate customized private attack data that includes specific assets, which satisfies the needs of researchers. By integrating the private attack data with a cybersecurity knowledge base derived from open-source datasets, we construct a comprehensive cybersecurity dataset. Extensive experiments demonstrate that the cybersecurity dataset generated by our method is suitable for various common cybersecurity tasks, such as threat hunting, alert analysis, and knowledge reasoning.

Acknowledgments

This work was supported in part by the Major Key Project of PCL (Grant No. PCL2023A07-4), the National Natural Science Foundation of China (Grant No. 62372137), and the Guangxi Natural Science Foundation (No. 2022GXNSFBA035650).

Author information

Corresponding author

Correspondence to Zhaoquan Gu.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Xie, Y., Wang, H., Tan, R., Song, X., Gu, Z. (2025). CDGM: Controllable Dataset Generation Method for Cybersecurity. In: Sheng, Q.Z., et al. Advanced Data Mining and Applications. ADMA 2024. Lecture Notes in Computer Science, vol 15392. Springer, Singapore. https://doi.org/10.1007/978-981-96-0850-8_16

  • DOI: https://doi.org/10.1007/978-981-96-0850-8_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0849-2

  • Online ISBN: 978-981-96-0850-8

  • eBook Packages: Computer Science, Computer Science (R0)
