Software Vulnerability Detection via Multimodal Deep Learning

Zhou, Xin; Verma, Rakesh M.

doi:10.1007/978-3-031-29504-1_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13867)

Included in the following conference series:

International Workshop on Security and Trust Management

Abstract

Vulnerabilities in software are like ticking time bombs, and they are difficult to eliminate completely. For example, buffer overflow is a common vulnerability that occurs when a program receives more data than a buffer can hold; the excess can corrupt adjacent memory and be used to manipulate other data for malicious purposes. To detect potential vulnerabilities in source code, we treat the code as multisource data by extracting two semantically meaningful sub-graphs: the Abstract Syntax Tree Graph (ASTG) and the Tokenized Data Flow Graph (TDFG). We combine these with the original sequence of tokens and 49 heuristic features to train a multimodal deep learning network that detects vulnerable statements. We propose a Multisource Deep Learner (MDL) with joint representations based on pretrained attention-based Bidirectional Gated Recurrent Unit (BGRU) neural networks for vulnerability detection in source code. Our framework not only detects potential vulnerabilities but also locates the vulnerable statements and ranks them by importance using the Program Dependence Graph (PDG). Our results show that an MDL-based model using multiple modalities performs significantly better than a single-modality model. We also present comparisons with state-of-the-art methods.
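The idea of a joint representation over several views of the same code can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the per-element state vectors below are hand-written stand-ins for the outputs of the pretrained BGRU encoders, the `query` vectors stand in for learned attention parameters, and fusing by concatenation is an assumption, since the abstract does not say how the joint representation is built. All function and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states, query):
    """Collapse a sequence of state vectors into one vector.

    Each state is scored by its dot product with `query`; the softmax of
    those scores weights the sum. `query` stands in for a learned parameter.
    """
    scores = [sum(h * q for h, q in zip(state, query)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * state[d] for w, state in zip(weights, states))
              for d in range(dim)]
    return pooled, weights


def joint_representation(modalities, queries, heuristic_features):
    """Concatenate one pooled vector per modality with the heuristic features.

    Concatenation is an assumption of this sketch, not the paper's stated
    fusion method.
    """
    joint = []
    for name in sorted(modalities):  # fixed order keeps the layout stable
        pooled, _ = attention_pool(modalities[name], queries[name])
        joint.extend(pooled)
    joint.extend(heuristic_features)
    return joint


# Hand-written stand-ins for encoder outputs (hypothetical values).
modalities = {
    "tokens": [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]],
    "astg": [[0.5, 0.5], [0.3, 0.7]],
    "tdfg": [[0.9, 0.1], [0.2, 0.6], [0.7, 0.3], [0.0, 1.0]],
}
queries = {name: [1.0, -1.0] for name in modalities}
heuristic_features = [0.0] * 49  # the abstract mentions 49 heuristic features

joint = joint_representation(modalities, queries, heuristic_features)
print(len(joint))
```

In this sketch each modality contributes a fixed-size vector regardless of how many tokens or graph elements it contains, so a downstream classifier would see one input of constant length per code sample. The attention weights returned by `attention_pool` also indicate which elements dominated the pooled vector, which is loosely related to, but much simpler than, the statement ranking the abstract describes.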

Acknowledgments

Research partially supported by NSF grants 1433817 and 2210198, ARO grant W911NF-20-1-0254, and ONR award N00014-19-S-F009. Verma is the founder of Everest Cyber Security and Analytics, Inc.

Author information

Authors and Affiliations

University of Houston, Houston, TX, USA
Xin Zhou & Rakesh M. Verma

Corresponding authors

Correspondence to Xin Zhou or Rakesh M. Verma.

Editor information

Editors and Affiliations

University of Luxembourg, Esch-sur-Alzette, Luxembourg
Gabriele Lenzini
Technical University of Denmark, Kongens Lyngby, Denmark
Weizhi Meng

A Appendix

A.1 Limitations

Apart from the usual limitations of static analysis and machine learning, this work has two further limitations: (1) adversarial data may degrade the model's performance, and (2) the current implementation does not perform interprocedural analysis.

Cite this paper

Zhou, X., Verma, R.M. (2023). Software Vulnerability Detection via Multimodal Deep Learning. In: Lenzini, G., Meng, W. (eds) Security and Trust Management. STM 2022. Lecture Notes in Computer Science, vol 13867. Springer, Cham. https://doi.org/10.1007/978-3-031-29504-1_5

DOI: https://doi.org/10.1007/978-3-031-29504-1_5
Published: 04 April 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-29503-4
Online ISBN: 978-3-031-29504-1
eBook Packages: Computer Science, Computer Science (R0)
