Abstract
In this paper, we summarize a novel method for machine learning-based static application security testing (SAST), which was devised as part of a larger study funded by Germany’s Federal Office for Information Security (BSI). SAST is the practice of applying static analysis techniques to program code with the aim of detecting security-critical software defects early in the development process. Traditionally, this has been done with rule-based approaches, in which the program code is checked against a set of rules, each defining a pattern that is representative of a defect. Recently, a growing number of publications have discussed applying machine learning methods to this problem. Our method is a lightweight approach to this idea and comprises two main contributions. First, we present a novel control-flow-based embedding method for program code; embedding the code into a metric space is necessary in order to apply machine learning techniques to SAST. Second, we describe how this method can be applied to generate expressive, yet simple, models of unwanted behavior. We have implemented these methods in a prototype for the C and C++ programming languages. Using tenfold cross-validation, we show that our prototype is capable of effectively predicting the location and type of software defects in previously unseen code.
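For readers unfamiliar with the evaluation protocol named in the abstract, the following is a minimal sketch of generic k-fold cross-validation with k = 10. It is not the authors' evaluation code: the helper names `k_fold_indices` and `cross_validate`, the `fit`/`predict` callables, and the use of plain accuracy as the score are assumptions made purely for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: every sample lands in exactly one test fold,
    and fold sizes differ by at most one.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducible folds
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, test_idx


def cross_validate(samples, labels, fit, predict, k=10, seed=0):
    """Return the accuracy on each held-out fold.

    `fit(train_samples, train_labels)` returns a model and
    `predict(model, sample)` returns a label; both are placeholders
    for whatever classifier is being evaluated (an assumption here).
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k, seed):
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores
```

Because each sample is held out exactly once, every prediction that contributes to the score is made on code the model did not see during training, which is the sense in which the abstract speaks of previously unseen code.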
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hüther, L., Sohr, K., Berger, B.J., Rothe, H., Edelkamp, S. (2024). Machine Learning for SAST: A Lightweight and Adaptable Approach. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_5
DOI: https://doi.org/10.1007/978-3-031-51482-1_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51481-4
Online ISBN: 978-3-031-51482-1
eBook Packages: Computer Science (R0)