AutoVAS: An automated vulnerability analysis system with a deep learning approach
Introduction
Hidden flaws in software can lead to security vulnerabilities that allow attackers to compromise systems and applications. With recent advances in hacking technology, the number of software vulnerabilities has steadily increased (NIST, Accessed: Mar 2021e). In 2019 alone, more than 20,000 security vulnerabilities were registered in the Common Vulnerabilities and Exposures (CVE) system (NVD, Accessed: Mar 2021), a publicly available vulnerability database, which illustrates how rapidly software security vulnerabilities are increasing. A recent exploit incident (NIST, Accessed: Mar 2021c) showed that these security loopholes can have catastrophic financial and social impacts. These vulnerabilities are often caused by subtle programming errors and can propagate quickly, owing to the spread of open-source software and code reuse. Previously, security experts analyzed software to discover vulnerabilities and patched the software themselves. However, analyzing software and searching for vulnerabilities takes considerable time, and the speed of analysis depends on the experts' skill level, which makes a quick response difficult. To reduce this dependence on experts and the cost of vulnerability detection, techniques (Kim, Woo, Lee, Oh, 2017, Li, Zou, Xu, Ou, Jin, Wang, Deng, Zhong, Zou, Wang, Xu, Li, Jin, 2019) and tools (CheckMarx, Accessed: Mar 2021; Fortify, Accessed: Mar 2021) for automated static vulnerability detection have emerged.
Research on deep learning methods that minimize expert intervention and automatically learn vulnerability patterns has become a new trend in software vulnerability detection (Lin et al., 2020). This trend is in line with the automation of cyber defense promoted by initiatives such as the Cyber Grand Challenge (CGC) created by the Defense Advanced Research Projects Agency (DARPA) (DARPA, Accessed: Mar 2021). Existing deep learning-based vulnerability detection systems utilize the syntax and semantics of software to improve detection performance, but they have several limitations (Li et al., 2020). First, vulnerable code has long-range context dependencies. For example, a variable defined at the beginning of a program or function may be used only at the end, or the vulnerable code may call many functions. As a result, a deep learning algorithm may miss correlations between distant contexts when detecting vulnerabilities. Second, these methods suffer from an out-of-vocabulary (OoV) problem when detecting vulnerabilities in new programs whose identifiers rarely appear in the source code used for learning. Programmers have their own styles of naming identifiers (variables, functions, etc.), so a common vocabulary cannot cover all possible identifiers, which may degrade the vulnerability detection results. Finally, the effectiveness of deep learning-based vulnerability detection depends heavily on the amount and quality of the learning data.
To solve these problems, this study proposes a deep learning-based vulnerability detection framework called an automated vulnerability analysis system (AutoVAS). This framework utilizes a compiler-based program slicing method to solve the long context dependence problem and applies various embedding methods and symbolic representation techniques to solve the OoV problem. In addition, source code from various datasets in the National Vulnerability Database (NVD) (NIST, Accessed: Mar 2021e) and the Software Assurance Reference Dataset (SARD) (SARD, Accessed: Mar 2021) was used as the learning data so that various types of security vulnerabilities could be covered. The resulting learning dataset contains 98 vulnerability types from the Common Weakness Enumeration (CWE) database (NIST, Accessed: Mar 2021b) and 719 vulnerabilities from the CVE (see Appendix 1). Furthermore, an oversampling method was used to overcome the imbalance problem in the datasets. The contributions of the present study are summarized as follows:
- An optimal method for representing source code as input vectors in a deep learning model was proposed from the viewpoints of program slicing and embedding techniques. AutoVAS, a deep learning-based vulnerability detection framework, was also proposed, and its effectiveness was verified through experiments.
- The datasets were built based on the NVD and SARD projects and released on GitHub (GitHub, Accessed: Mar 2021b). The source lexing results of the datasets and the software were released to the public for use in related studies.
- A full-featured prototype of AutoVAS was implemented. The proposed technique achieved an FAR of 1.88%, an FRR of 3.62%, and an F1-score of 96.11%. Eleven vulnerabilities were detected in nine open-source projects. Notably, we discovered three zero-day vulnerabilities, two of which were patched after the maintainers were informed of the AutoVAS findings. The other vulnerability received a CVE ID after being detected by AutoVAS.
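As a reading aid for the reported figures, the following minimal sketch computes the three metrics from confusion-matrix counts. It assumes that FAR denotes the false-alarm (false-positive) rate FP / (FP + TN) and that FRR denotes the false-rejection (false-negative) rate FN / (FN + TP); this excerpt does not define the abbreviations, so both definitions are assumptions. The function name `detection_metrics` and the counts in the usage example are hypothetical and are not the paper's results.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute FAR, FRR, and F1 from confusion-matrix counts.

    Assumed definitions (not confirmed by the source text):
      FAR = FP / (FP + TN)   share of clean samples flagged as vulnerable
      FRR = FN / (FN + TP)   share of vulnerable samples that were missed
    """
    far = fp / (fp + tn) if (fp + tn) else 0.0
    frr = fn / (fn + tp) if (fn + tp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"far": far, "frr": frr, "f1": f1}


# Hypothetical counts chosen only to make the arithmetic easy to follow.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
print(metrics)
```

Under these assumed definitions, a low FAR means few clean code samples are wrongly flagged, and a low FRR means few vulnerable samples are missed.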
The remainder of this paper is organized as follows. Section 2 provides the background necessary for the design and implementation of AutoVAS. Section 3 presents the detailed design of AutoVAS. The evaluation results of the proposed mechanisms are summarized in Section 4. Section 5 discusses some limitations of the current model and possible improvements. Section 6 reviews related work, and finally, Section 7 concludes this paper.
Background
First, we describe vulnerability detection in Section 2.1. We then address the program slicing and word embedding methods used to produce the embedding vectors that serve as input to the deep learning model in Sections 2.2 and 2.3. Moreover, in Section 2.4, we introduce terminology to facilitate the understanding of this research. Finally, in Sections 2.5 and 2.6, we address the threat model and assumptions of this research.
Main method
The AutoVAS process is divided into a training phase to learn the detection model and a detection phase that uses the trained model to detect vulnerable code. As shown in Fig. 4, the training and detection phases use the same embedding process for the input data in the model after preprocessing the source code, except for the activities of model learning and detection. The learning and detecting activities will be explained in Section 4. Thus, this section focuses on the preprocessing and
Evaluation
We designed a series of experiments to evaluate the effectiveness of AutoVAS by investigating the following research questions:
- RQ1: What is the optimal embedding method for vulnerability discovery in source code? In this study, methods such as program slicing, word embedding, symbolization, and padding and split were used to embed the source code. We ran experiments to determine the combination of methods that provides the optimal performance. (Section 4.3)
- RQ2: How effective is AutoVAS in practice?
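To make the symbolization and padding-and-split steps named in RQ1 more concrete, the sketch below shows one simple way such preprocessing could look. It is not the AutoVAS implementation: the `VAR`/`FUN` naming scheme, the tiny keyword and library-function sets, the regex tokenizer, and the function names `symbolize` and `pad_or_split` are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Assumed, deliberately tiny vocabularies; a real system would use full lists.
C_KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return"}
KNOWN_FUNCS = {"strcpy", "memcpy", "malloc", "free", "printf"}

IDENT_RE = re.compile(r"[A-Za-z_]\w*")
# Identifiers, decimal integers, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def symbolize(code: str) -> List[str]:
    """Replace user-defined identifiers with symbolic names (VAR1, FUN1, ...)."""
    var_map: Dict[str, str] = {}
    fun_map: Dict[str, str] = {}
    tokens = TOKEN_RE.findall(code)
    out: List[str] = []
    for i, tok in enumerate(tokens):
        is_ident = IDENT_RE.fullmatch(tok) is not None
        if not is_ident or tok in C_KEYWORDS or tok in KNOWN_FUNCS:
            out.append(tok)  # keep keywords, library calls, literals, punctuation
            continue
        # Heuristic: an identifier directly followed by "(" is a function name.
        is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
        table, prefix = (fun_map, "FUN") if is_call else (var_map, "VAR")
        if tok not in table:
            table[tok] = f"{prefix}{len(table) + 1}"
        out.append(table[tok])
    return out


def pad_or_split(tokens: List[str], length: int, pad: str = "<PAD>") -> List[List[str]]:
    """Split a token list into fixed-length chunks, padding the last chunk."""
    chunks = [tokens[i:i + length] for i in range(0, len(tokens), length)] or [[]]
    return [chunk + [pad] * (length - len(chunk)) for chunk in chunks]


sample = "char buf[10]; strcpy(buf, user_input); my_log(buf);"
symbols = symbolize(sample)
print(symbols)
print(pad_or_split(symbols, 8))
```

Because every program-specific name collapses to a small set of symbols, the vocabulary seen by the embedding layer stays bounded, which is the general idea behind using symbolization against the OoV problem. Fixed-length chunks are needed because most neural network inputs have a fixed shape.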
Discussion and limitations
AutoVAS is useful and effective for detecting software vulnerabilities; however, there are still avenues for future improvement. This section presents the limitations of the current AutoVAS system and their workarounds and discusses enhancements for future work.
- Dealing with several languages and an executable binary: The current AutoVAS only supports the C/C++ language. Future research must be conducted to support other languages. Further, the current AutoVAS only detects vulnerabilities of
Related work
With increasing concerns about software vulnerabilities, several studies have been conducted on automated vulnerability detection to reduce developers' review efforts. In 2015, DARPA, a research agency under the US Department of Defense, held the CGC to encourage research on fully automated software vulnerability analysis systems. These systems are fully autonomous and can perform automated vulnerability detection, exploit generation, and software patching without human intervention (Song and
Conclusion
This paper presents AutoVAS, an automated vulnerability analysis system based on a deep learning approach, which aims to reduce human intervention and improve on the low performance of other vulnerability detection systems. To search for various security vulnerabilities, a dataset was constructed using the source code of various projects in the NVD and SARD. Thus, AutoVAS was designed to search for security vulnerabilities of 98 types. This study mitigates the impact of OoV and the lack of vulnerability
CRediT authorship contribution statement
Sanghoon Jeon: Conceptualization, Methodology, Software, Writing - original draft. Huy Kang Kim: Writing - review & editing, Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (83)
- Automated vulnerability detection in source code using minimum intermediate representation learning. Appl. Sci. (2020)
- Software clone detection: a systematic review. Inf. Softw. Technol. (2013)
- TensorFlow: a system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016)
- A survey of machine learning for big code and naturalness. ACM Comput. Surv. (CSUR) (2018)
- Bimodal modelling of source code and natural language. International Conference on Machine Learning (2015)
- code2vec: learning distributed representations of code. Proc. ACM Program. Lang. (2019)
- Automatic exploit generation. Commun. ACM (2014)
- Detecting bugs by discovering expectations and their violations. IEEE Trans. Softw. Eng. (2018)
- Coverage-based greybox fuzzing as Markov chain. IEEE Trans. Softw. Eng. (2017)
- Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. (2017)
- KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. OSDI
- Unleashing mayhem on binary code. 2012 IEEE Symposium on Security and Privacy
- SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res.
- Automatic feature learning for predicting vulnerable software components. IEEE Trans. Softw. Eng.
- Software vulnerability analysis and discovery using machine-learning and data-mining techniques: a survey. ACM Comput. Surv. (CSUR)
- Toward large-scale vulnerability discovery using machine learning. Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy
- Software verification with BLAST. International SPIN Workshop on Model Checking of Software
- Interprocedural slicing using dependence graphs. ACM Trans. Program. Lang. Syst. (TOPLAS)
- MSMOTE: improving classification performance when training data is imbalanced. 2009 Second International Workshop on Computer Science and Engineering
- ReDeBug: finding unpatched code clones in entire OS distributions. 2012 IEEE Symposium on Security and Privacy
- VUDDY: a scalable approach for vulnerable code clone discovery. 2017 IEEE Symposium on Security and Privacy (SP)
- Distributed representations of sentences and documents. International Conference on Machine Learning
- Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist.
- A new combination sampling method for imbalanced data. Proceedings of 2013 Chinese Intelligent Automation Conference
- CBCD: cloned buggy code detector. 2012 34th International Conference on Software Engineering (ICSE)