Abstract
Vulnerability discovery in software is an important task in computer security. Because vulnerabilities allow cyber criminals and other malicious actors to exploit systems, it is crucial to keep software as free from vulnerabilities as possible. Traditional approaches often rely on code scanning to find specific, already-known classes of vulnerabilities. However, these approaches do not in general discover new classes of vulnerabilities. In this paper, we use a machine learning approach to model source code using its syntax, semantics and control flow, and to infer vulnerable code patterns. The aim is to handle large code bases and identify potential vulnerabilities that are missed by existing static software analysis tools. In addition, our attention-based bidirectional long short-term memory framework adaptively localises the regions of code where a possibly vulnerable fragment exists. The highlighted regions may provide informative guidance to human developers or security experts. The experimental results demonstrate the feasibility of the proposed approach for software vulnerability discovery.
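As a rough illustration of how attention weights can localise regions of code, the sketch below scores a per-token hidden state against a context vector, normalises the scores with a softmax, and reports the highest-weighted tokens. The token list, hidden states, context vector, and the helper names `attention_weights` and `highlight` are all hypothetical toy values chosen for illustration. In the described framework the hidden states would come from a bidirectional LSTM, and the dot-product scoring used here is one common choice that is assumed, not confirmed by the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(hidden_states, context):
    """Dot-product score of each hidden state against a context vector,
    normalised so the weights sum to 1 (assumed scoring function)."""
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context))
              for h in hidden_states]
    return softmax(scores)


def highlight(tokens, weights, top_k=2):
    """Return the top_k tokens by attention weight, kept in source order."""
    ranked = sorted(range(len(tokens)),
                    key=lambda i: weights[i], reverse=True)[:top_k]
    return [tokens[i] for i in sorted(ranked)]


# Toy inputs (hypothetical): one 2-dimensional hidden state per token.
tokens = ["char", "buf", "strcpy", "src", ";"]
hidden_states = [[0.1, 0.0], [0.4, 0.2], [2.0, 1.0], [0.8, 0.4], [0.0, 0.1]]
context = [1.0, 0.5]  # stands in for a learned attention vector

weights = attention_weights(hidden_states, context)
flagged = highlight(tokens, weights, top_k=2)
print(flagged)
```

The weights form a distribution over tokens, so the highest-weighted positions can be shown to a reviewer as the region most responsible for the prediction.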
Notes
1. We do not distinguish a node in the AST and an edge in the CFG in the notation because we process them in the same way.
2. C was chosen because of its ubiquity and the abundance of datasets. We believe that our technique would be applicable to other programming languages.
3. The datasets are publicly available on GitHub, https://github.com/DanielLin1986/TransferRepresentationLearning.
4. The description of the vulnerability can be found at https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-5907.
© 2019 Crown
About this paper
Cite this paper
Kim, J., Hubczenko, D., Montague, P. (2019). Towards Attention Based Vulnerability Discovery Using Source Code Representation. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series. ICANN 2019. Lecture Notes in Computer Science(), vol 11730. Springer, Cham. https://doi.org/10.1007/978-3-030-30490-4_58
DOI: https://doi.org/10.1007/978-3-030-30490-4_58
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30489-8
Online ISBN: 978-3-030-30490-4
eBook Packages: Computer Science (R0)