Machine Learning for SAST: A Lightweight and Adaptable Approach

Hüther, Lorenz; Sohr, Karsten; Berger, Bernhard J.; Rothe, Hendrik; Edelkamp, Stefan

doi:10.1007/978-3-031-51482-1_5

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14347)

Included in the following conference series:

European Symposium on Research in Computer Security

Abstract

In this paper, we summarize a novel method for machine learning-based static application security testing (SAST), which was devised as part of a larger study funded by Germany’s Federal Office for Information Security (BSI). SAST is the practice of applying static analysis techniques to program code in order to detect security-critical software defects early in the development process. In the past, this was done with rule-based approaches, in which the program code is checked against a set of rules, each defining a pattern that is representative of a defect. Recently, a growing number of publications have discussed the application of machine learning methods to this problem. Our method is a lightweight approach to this concept and comprises two main contributions. First, we present a novel control-flow-based embedding method for program code; embedding the code into a metric space is necessary in order to apply machine learning techniques to SAST. Second, we describe how this method can be applied to generate expressive yet simple models of unwanted behavior. We have implemented these methods in a prototype for the C and C++ programming languages. Using tenfold cross-validation, we show that our prototype is capable of effectively predicting the location and type of software defects in previously unseen code.
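
The abstract mentions tenfold cross-validation as the evaluation protocol. As a rough illustration of what that protocol involves, the following is a minimal Python sketch of how k disjoint train/test splits can be formed. It is not the authors' evaluation code: the function name `k_fold_indices`, the fixed seed, and the sample count of 25 are assumptions made purely for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Hypothetical helper for illustration only.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the splits are reproducible.
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Everything outside the held-out fold is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold_no, (train, test) in enumerate(k_fold_indices(25, k=10)):
        print(fold_no, len(train), len(test))
```

Each sample appears in exactly one test fold, so every prediction that enters the final score is made on code the model did not see during training for that fold.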

Notes

1. https://samate.nist.gov/SARD/test-suites/112

Author information

Authors and Affiliations

Computer Science Department, Software Engineering Group, University of Bremen, Bibliothekstraße 5, 28359, Bremen, Germany
Lorenz Hüther & Karsten Sohr
Institute of Embedded Systems, Hamburg University of Technology, Am Schwarzenberg-Campus 3 (E), 21073, Hamburg, Germany
Bernhard J. Berger
Team Neusta GmbH, Konsul-Smidt-Straße 24, 28217, Bremen, Germany
Hendrik Rothe
Artificial Intelligence Center, Czech Technical University in Prague, Charles Square 13, 120 00, Prague, Czech Republic
Stefan Edelkamp

Corresponding author

Correspondence to Lorenz Hüther.

Editor information

Editors and Affiliations

University of California, Irvine, CA, USA
Gene Tsudik
University of Padua, Padua, Italy
Mauro Conti
Delft University of Technology, Delft, The Netherlands
Kaitai Liang
Delft University of Technology, Delft, The Netherlands
Georgios Smaragdakis

Appendix

Table 1. Results of the intrinsic evaluation.

Table 2. Results of the oracle test for the FICS tool, our method, and three conventional tools whose results are aggregated into a single column. True positive findings are marked with a “+” symbol, negative results with a “−” symbol, and cases where a tool failed to process the code entirely with an “o” symbol.

About this paper

Cite this paper

Hüther, L., Sohr, K., Berger, B.J., Rothe, H., Edelkamp, S. (2024). Machine Learning for SAST: A Lightweight and Adaptable Approach. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_5

DOI: https://doi.org/10.1007/978-3-031-51482-1_5
Published: 11 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51481-4
Online ISBN: 978-3-031-51482-1
eBook Packages: Computer Science, Computer Science (R0)
