CDGM: Controllable Dataset Generation Method for Cybersecurity

Xie, Yushun; Wang, Haiyan; Tan, Runnan; Song, Xiangyu; Gu, Zhaoquan

doi:10.1007/978-981-96-0850-8_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15392)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Cyberattacks can lead to data breaches, service disruptions, and economic losses, and may even threaten national security and social stability. Researchers have therefore proposed various methods, built on public datasets, to make cybersecurity defense techniques more intelligent and automated. However, these public datasets usually cover only a limited range of cyberattack types, so the proposed methods are ineffective against attacks that the datasets do not include. Meanwhile, cybersecurity defenders often need to study cyberattack scenarios involving specific assets that are usually not represented in public datasets. To address these challenges, we propose a new approach to controllable dataset generation for cybersecurity. Using a four-role architecture, our method can reproduce any cyberattack and generate customized private attack data that includes specific assets, which satisfies the needs of researchers. By integrating the private attack data with a cybersecurity knowledge base derived from open-source datasets, we construct a comprehensive cybersecurity dataset. Extensive experiments demonstrate that the cybersecurity dataset generated by our method is suitable for various common cybersecurity tasks, such as threat hunting, alert analysis, and knowledge reasoning.

Acknowledgments

This work was supported in part by the Major Key Project of PCL (Grant No. PCL2023A07-4), the National Natural Science Foundation of China (Grant No. 62372137), and the Guangxi Natural Science Foundation (No. 2022GXNSFBA035650).

Author information

Authors and Affiliations

Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China
Yushun Xie
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Zhaoquan Gu
Cyberspace Institution of Advanced Technology, Guangzhou University, Guangzhou, China
Runnan Tan
Department of New Networks, Peng Cheng Laboratory, Shenzhen, China
Haiyan Wang, Xiangyu Song & Zhaoquan Gu

Corresponding author

Correspondence to Zhaoquan Gu.

Editor information

Editors and Affiliations

Macquarie University, Sydney, NSW, Australia
Quan Z. Sheng
University of Auckland, Auckland, New Zealand
Gill Dobbie
Australian National University, Canberra, ACT, Australia
Jing Jiang
Macquarie University, Sydney, NSW, Australia
Xuyun Zhang
The University of Adelaide, Adelaide, SA, Australia
Wei Emma Zhang
Open University of Cyprus, Nicosia, Cyprus
Yannis Manolopoulos
Macquarie University, Sydney, NSW, Australia
Jia Wu
University of Dubai, Dubai, United Arab Emirates
Wathiq Mansoor
Macquarie University, Sydney, NSW, Australia
Congbo Ma

About this paper

Cite this paper

Xie, Y., Wang, H., Tan, R., Song, X., Gu, Z. (2025). CDGM: Controllable Dataset Generation Method for Cybersecurity. In: Sheng, Q.Z., et al. Advanced Data Mining and Applications. ADMA 2024. Lecture Notes in Computer Science, vol 15392. Springer, Singapore. https://doi.org/10.1007/978-981-96-0850-8_16

DOI: https://doi.org/10.1007/978-981-96-0850-8_16
Published: 24 December 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0849-2
Online ISBN: 978-981-96-0850-8
eBook Packages: Computer Science, Computer Science (R0)
