
Machine Learning for SAST: A Lightweight and Adaptable Approach

  • Conference paper
  • First Online:
Computer Security – ESORICS 2023 (ESORICS 2023)

Abstract

In this paper, we summarize a novel method for machine learning-based static application security testing (SAST), devised as part of a larger study funded by Germany’s Federal Office for Information Security (BSI). SAST refers to the practice of applying static analysis techniques to program code in order to detect security-critical software defects early in the development process. Traditionally, this has been done with rule-based approaches, in which the program code is checked against a set of rules, each defining a pattern that is representative of a defect. More recently, a growing number of publications have discussed the application of machine learning methods to this problem. Our method constitutes a lightweight approach to this concept and comprises two main contributions: First, we present a novel control-flow-based embedding method for program code; embedding the code into a metric space is a prerequisite for applying machine learning techniques to SAST. Second, we describe how this method can be used to generate expressive yet simple models of unwanted behavior. We have implemented these methods in a prototype for the C and C++ programming languages. Using tenfold cross-validation, we show that our prototype effectively predicts the location and type of software defects in previously unseen code.
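The abstract describes the pipeline only at a high level: functions are embedded into a metric space on the basis of their control flow, and a model of unwanted behavior is then trained and evaluated with tenfold cross-validation. The sketch below illustrates the general shape of such a pipeline; it is not the method from the paper. The feature scheme (counting transitions between coarse basic-block types in a control-flow graph), the choice of classifier, and all names such as embed_cfg are illustrative assumptions.

```python
# Hypothetical sketch: embed functions via control-flow transition counts,
# then evaluate a defect classifier with tenfold cross-validation.
# The feature scheme below is an illustrative assumption, not the paper's method.
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assume each function is given as a control-flow graph: a list of edges
# between basic blocks, where every block is tagged with a coarse node type.
NODE_TYPES = ["entry", "call", "branch", "loop", "return"]
TRANSITIONS = list(product(NODE_TYPES, repeat=2))


def embed_cfg(edges):
    """Map a CFG (list of (src_type, dst_type) edges) to a fixed-length vector
    of transition counts, i.e. a simple embedding into a metric space."""
    vec = np.zeros(len(TRANSITIONS))
    for edge in edges:
        if edge in TRANSITIONS:
            vec[TRANSITIONS.index(edge)] += 1
    return vec


# Toy data: a few synthetic CFGs with binary defect labels (1 = defective).
cfgs = [
    [("entry", "call"), ("call", "branch"), ("branch", "return")],
    [("entry", "loop"), ("loop", "call"), ("call", "return")],
    [("entry", "branch"), ("branch", "call"), ("call", "return")],
    [("entry", "call"), ("call", "return")],
] * 5
labels = np.array([1, 0, 1, 0] * 5)

X = np.vstack([embed_cfg(g) for g in cfgs])

# Tenfold cross-validation of a simple classifier over the embedded functions.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, labels, cv=StratifiedKFold(n_splits=10))
print(f"mean accuracy over 10 folds: {scores.mean():.2f}")
```

Any fixed-length vectorization of the control-flow structure would slot into the same place; the point of the sketch is only that, once functions live in a metric space, standard classifiers and cross-validation apply directly.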


Notes

  1. https://samate.nist.gov/SARD/test-suites/112.

References

  1. Ahmadi, M., Farkhani, R.M., Williams, R., Lu, L.: Finding bugs using your own code: detecting functionally-similar yet inconsistent code. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 2025–2040 (2021)

  2. Alexopoulos, N., Brack, M., Wagner, J.P., Grube, T., Mühlhäuser, M.: How long do vulnerabilities live in the code? A large-scale empirical measurement study on FOSS vulnerability lifetimes. In: 31st USENIX Security Symposium (USENIX Security 2022) (2022)

  3. Arp, D., et al.: Dos and Don’ts of machine learning in computer security. In: 31st USENIX Security Symposium (USENIX Security 2022), pp. 3971–3988 (2022)

  4. Arthur, D., Vassilvitskii, S.: K-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, pp. 1027–1035. Society for Industrial and Applied Mathematics, USA (2007)

  5. Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1984)

  6. Black, P.E.: Juliet 1.3 test suite: changes from 1.2. Technical report, NIST TN 1995, National Institute of Standards and Technology, Gaithersburg, MD (2018). https://doi.org/10.6028/NIST.TN.1995

  7. BSI: Bundesamt für Sicherheit in der Informationstechnik - GitHub Organization. https://github.com/BSI-Bund

  8. Cui, L., Hao, Z., Jiao, Y., Fei, H., Yun, X.: VulDetector: detecting vulnerabilities using weighted feature graph comparison. IEEE Trans. Inf. Forensics Secur. 16, 2004–2017 (2021). https://doi.org/10.1109/TIFS.2020.3047756

  9. Fan, J., Li, Y., Wang, S., Nguyen, T.N.: A C/C++ code vulnerability dataset with code changes and CVE summaries. In: Proceedings of the 17th International Conference on Mining Software Repositories, MSR 2020, pp. 508–512. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3379597.3387501

  10. Géron, A.: Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd edn. O’Reilly Media Inc., Beijing (2019)

  11. Giray, G., Bennin, K.E., Köksal, Ö., Babur, Ö., Tekinerdogan, B.: On the use of deep learning in software defect prediction. J. Syst. Softw. 195, 111537 (2023). https://doi.org/10.1016/j.jss.2022.111537

  12. GitHub Inc.: CodeQL. GitHub Inc. (2021)

  13. Horwitz, S., Reps, T., Binkley, D.: Interprocedural slicing using dependence graphs. ACM SIGPLAN Not. 23(7), 35–46 (1988)

  14. Hüther, L., et al.: Machine learning in the context of static application security testing - ML-SAST. Technical report, Federal Office for Information Security, P.O. Box 20 03 63, 53133 Bonn (2022)

  15. Johnson, J.M., Khoshgoftaar, T.M.: Thresholding strategies for deep learning with highly imbalanced big data. In: Wani, M.A., Khoshgoftaar, T.M., Palade, V. (eds.) Deep Learning Applications, Volume 2. AISC, vol. 1232, pp. 199–227. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-6759-9_9

  16. Landi, W., Ryder, B.G.: Pointer-induced aliasing: a problem classification. In: Proceedings of the 18th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 1991, pp. 93–103. Association for Computing Machinery, New York (1991). https://doi.org/10.1145/99583.99599

  17. Li, Z., et al.: VulDeePecker: a deep learning-based system for vulnerability detection. In: Proceedings 2018 Network and Distributed System Security Symposium (2018). https://doi.org/10.14722/ndss.2018.23158

  18. Marjanov, T., Pashchenko, I., Massacci, F.: Machine learning for source code vulnerability detection: what works and what isn’t there yet. IEEE Secur. Priv. 20(5), 60–76 (2022). https://doi.org/10.1109/MSEC.2022.3176058

  19. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2020)

  20. Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., Jaiswal, S.: Graph2vec: Learning Distributed Representations of Graphs (2017)

  21. Sui, Y., Xue, J.: SVF: interprocedural static value-flow analysis in LLVM. In: Proceedings of the 25th International Conference on Compiler Construction, Barcelona, Spain, pp. 265–266. ACM (2016). https://doi.org/10.1145/2892208.2892235

  22. Xia, S., Xiong, Z., Luo, Y., Xu, W., Zhang, G.: Effectiveness of the Euclidean distance in high dimensional spaces. Optik 126(24), 5614–5619 (2015). https://doi.org/10.1016/j.ijleo.2015.09.093

  23. Yamaguchi, F., Maier, A., Gascon, H., Rieck, K.: Automatic inference of search patterns for taint-style vulnerabilities. In: 2015 IEEE Symposium on Security and Privacy, pp. 797–812 (2015). https://doi.org/10.1109/SP.2015.54


Author information

Corresponding author

Correspondence to Lorenz Hüther.

Appendix

Table 1. Results of the intrinsic evaluation.
Table 2. Results of the oracle test for the FICS tool, our method, and three conventional tools, the latter aggregated into a single column. True positive findings are marked with a “+” symbol, negative results with a “−” symbol. Cases where a tool failed to process the code entirely are marked with an “o” symbol.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hüther, L., Sohr, K., Berger, B.J., Rothe, H., Edelkamp, S. (2024). Machine Learning for SAST: A Lightweight and Adaptable Approach. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_5

  • DOI: https://doi.org/10.1007/978-3-031-51482-1_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-51481-4

  • Online ISBN: 978-3-031-51482-1

  • eBook Packages: Computer Science, Computer Science (R0)
