Towards Attention Based Vulnerability Discovery Using Source Code Representation

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series (ICANN 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11730)

Abstract

Vulnerability discovery in software is an important task in the field of computer security. Because cyber criminals and other malicious actors can abuse vulnerabilities to exploit systems, it is crucial to keep software as free from vulnerabilities as possible. Traditional approaches often comprise code scanning tasks that find specific, already-known classes of vulnerabilities. However, these approaches do not in general discover new classes of vulnerabilities. In this paper, we use a machine learning approach that models source code through its syntax, semantics and control flow and infers vulnerable code patterns, in order to tackle large code bases and identify potential vulnerabilities that are missed by existing static software analysis tools. In addition, our attention-based bidirectional long short-term memory framework adaptively localises regions of code, illustrating where a possibly vulnerable code fragment exists. The highlighted region may provide informative guidance to human developers or security experts. The experimental results demonstrate the feasibility of the proposed approach for the problem of software vulnerability discovery.
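
The localisation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes a generic dot-product attention over per-token hidden states, and the token list, the two-dimensional vectors standing in for bidirectional LSTM outputs, the query vector and the function names `attention_pool` and `highlight` are all hypothetical choices made for illustration. The sketch shows only how attention weights over a token sequence can be turned into a highlighted region.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Generic dot-product attention (an assumption, not the paper's exact form).

    hidden_states: one vector per code token (stand-ins for BiLSTM outputs).
    query: a vector of the same dimension (learned in a real model).
    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


def highlight(tokens, weights, top_k=2):
    """Return the top_k highest-weighted tokens, in source order."""
    ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:top_k])]


# Hypothetical toy input: tokens of a C statement and hand-picked vectors.
tokens = ["char", "buf[8]", ";", "strcpy", "(", "buf", ",", "src", ")"]
hidden_states = [
    [0.1, 0.2], [1.5, 0.3], [0.0, 0.1], [2.0, 0.1], [0.0, 0.0],
    [0.8, 0.2], [0.0, 0.0], [0.9, 0.4], [0.0, 0.0],
]
query = [1.0, 0.0]  # hypothetical; a trained model would learn this

weights, context = attention_pool(hidden_states, query)
flagged = highlight(tokens, weights, top_k=2)
```

In a trained classifier the context vector would feed the vulnerable/non-vulnerable prediction, while the weights would be inspected afterwards to show a developer which tokens contributed most.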

Notes

  1. We do not distinguish a node in the AST and an edge in the CFG in the notation because we process them in the same way.

  2. C was chosen because of its ubiquity and the abundance of datasets. We believe that our technique would be applicable to other programming languages.

  3. The datasets are publicly available on GitHub, https://github.com/DanielLin1986/TransferRepresentationLearning.

  4. The description of the vulnerability can be found at https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-5907.

Author information

Corresponding author

Correspondence to Junae Kim.

Copyright information

© 2019 Crown

About this paper

Cite this paper

Kim, J., Hubczenko, D., Montague, P. (2019). Towards Attention Based Vulnerability Discovery Using Source Code Representation. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series. ICANN 2019. Lecture Notes in Computer Science, vol 11730. Springer, Cham. https://doi.org/10.1007/978-3-030-30490-4_58

  • DOI: https://doi.org/10.1007/978-3-030-30490-4_58

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30489-8

  • Online ISBN: 978-3-030-30490-4

  • eBook Packages: Computer Science, Computer Science (R0)
