VulMAE: Graph Masked Autoencoders for Vulnerability Detection from Source and Binary Codes

  • Conference paper
Foundations and Practice of Security (FPS 2023)

Abstract

This paper designs and develops the first graph masked autoencoder (GraphMAE) model for software vulnerability detection and compares it against other self-supervised learning (SSL) methods. The domain-specific GraphMAE model (VulMAE) shows exceptional promise on the vulnerability detection task, outperforming all other baseline models in the study. The approach is particularly well-suited to cybersecurity applications, where gathering substantial real-world labeled samples is difficult, since graph SSL methods (e.g., contrastive and generative models) can classify data without requiring vast amounts of labeled data for effective training.
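To make the term "graph masked autoencoder" concrete, the following is a minimal sketch of the masking step and reconstruction loss described in the original GraphMAE work: a random subset of node features is hidden, and the model is scored on how well it reconstructs them, using a scaled cosine error. This is not VulMAE's implementation. The function names, the all-zeros mask token (GraphMAE uses a learnable token), the mask rate, and the value of `gamma` are assumptions for illustration, and the GNN encoder and decoder are omitted entirely.

```python
import math
import random


def mask_nodes(features, mask_rate, rng, mask_token=0.0):
    """Hide the feature vectors of a random subset of nodes.

    Assumption: a constant mask token stands in for a learnable one.
    Returns the corrupted feature matrix and the sorted masked node ids.
    """
    n = len(features)
    k = max(1, int(round(mask_rate * n)))
    masked_ids = sorted(rng.sample(range(n), k))
    corrupted = [list(row) for row in features]
    for i in masked_ids:
        corrupted[i] = [mask_token] * len(features[i])
    return corrupted, masked_ids


def scaled_cosine_error(original, reconstructed, masked_ids, gamma=2.0):
    """Mean of (1 - cos(x_i, z_i)) ** gamma over the masked nodes only."""
    total = 0.0
    for i in masked_ids:
        x, z = original[i], reconstructed[i]
        dot = sum(a * b for a, b in zip(x, z))
        norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in z))
        cos = dot / norm if norm else 0.0
        total += (1.0 - cos) ** gamma
    return total / len(masked_ids)


# Hypothetical toy graph: four nodes with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
corrupted, masked_ids = mask_nodes(features, mask_rate=0.5, rng=random.Random(0))

# A perfect reconstruction yields a loss of (numerically) zero.
perfect_loss = scaled_cosine_error(features, features, masked_ids)

# An orthogonal reconstruction of a masked node yields a loss of one.
worst_loss = scaled_cosine_error([[1.0, 0.0]], [[0.0, 1.0]], [0])
```

In a full model, `corrupted` would be passed through an encoder and decoder, and the loss would compare the decoder output with `features` on the masked nodes.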

The study addresses a key gap in the literature on automated and machine-assisted discovery and patching of software security vulnerabilities. Such automation has become increasingly critical as modern software grows dramatically in complexity, yet graph neural network (GNN) approaches remain understudied relative to traditional processes such as manual source code auditing and fuzzing. The evaluation applies the models to source and binary software components drawn from the National Vulnerability Database (NVD). A new dataset is curated by extracting vulnerable code fragments from six applications with NVD-documented security flaws and converting them to four graph types using specialized tools based on code property graphs and binary semantics lifting. The data is used to train contrastive and generative learning models for comparison. VulMAE achieves a weighted F1 score of 0.936 and a weighted recall of 0.938, the highest among all tested methods.
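For readers unfamiliar with the reported metrics, the sketch below shows how weighted recall and weighted F1 are conventionally computed: per-class scores are averaged with weights proportional to each class's share of the true labels. The function name and the toy labels are hypothetical and are not drawn from the paper's dataset or code.

```python
from collections import Counter


def weighted_recall_f1(y_true, y_pred):
    """Return (weighted recall, weighted F1), weighting each class by its support."""
    support = Counter(y_true)
    total = len(y_true)
    w_recall = 0.0
    w_f1 = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n_cls
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = n_cls / total  # class share of the true labels
        w_recall += weight * recall
        w_f1 += weight * f1
    return w_recall, w_f1


# Hypothetical toy labels: 1 = vulnerable, 0 = benign.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
w_recall, w_f1 = weighted_recall_f1(y_true, y_pred)
```

Support-weighted recall always equals plain accuracy, because the class weights cancel the per-class denominators. The weighted F1 score differs from accuracy in general, since it also depends on per-class precision.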

Notes

  1. https://github.com/Saquibirtiza/VulMAE.git

Acknowledgments

This research was supported in part by DARPA Award N66001-21-C-4024, ONR Award N00014-21-1-2654, and ARO Award W911NF-21-1-0032. All recommendations, opinions, and conclusions are those of the authors and not necessarily of the above supporters.

Author information

Corresponding author

Correspondence to Saquib Irtiza.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zamani, M., Irtiza, S., Khan, L., Hamlen, K.W. (2024). VulMAE: Graph Masked Autoencoders for Vulnerability Detection from Source and Binary Codes. In: Mosbah, M., Sèdes, F., Tawbi, N., Ahmed, T., Boulahia-Cuppens, N., Garcia-Alfaro, J. (eds) Foundations and Practice of Security. FPS 2023. Lecture Notes in Computer Science, vol 14551. Springer, Cham. https://doi.org/10.1007/978-3-031-57537-2_12

  • DOI: https://doi.org/10.1007/978-3-031-57537-2_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-57536-5

  • Online ISBN: 978-3-031-57537-2

  • eBook Packages: Computer Science, Computer Science (R0)
