AutoVAS: An automated vulnerability analysis system with a deep learning approach

https://doi.org/10.1016/j.cose.2021.102308

Abstract

Owing to the advances in automated hacking and analysis technologies in recent years, numerous software security vulnerabilities have been announced. Software vulnerabilities are increasing rapidly, whereas methods to analyze and cope with them depend on manual analyses, which result in a slow response. In recent years, studies concerning the prediction of vulnerabilities or the detection of patterns of previous vulnerabilities have been conducted by applying deep learning algorithms in an automated vulnerability search based on source code. However, existing methods target only certain security vulnerabilities or make limited use of source code to compile information. Few studies have been conducted on methods that represent source code as an embedding vector. Thus, this study proposes a deep learning-based automated vulnerability analysis system (AutoVAS) that effectively represents source code as embedding vectors by using datasets from various projects in the National Vulnerability Database (NVD) and Software Assurance Reference Dataset (SARD). To evaluate AutoVAS, we present and share a dataset for deep learning models. Experimental results show that AutoVAS achieves a false negative rate (FNR) of 3.62%, a false positive rate (FPR) of 1.88%, and an F1-score of 96.11%, which represent lower FNR and FPR values than those achieved by other approaches. We further apply AutoVAS to nine open-source projects and detect eleven vulnerabilities, most of which are missed by the other approaches we experimented with. Notably, we discovered three zero-day vulnerabilities, two of which were patched after the maintainers were informed of the AutoVAS findings. The other vulnerability received a Common Vulnerabilities and Exposures (CVE) ID after being detected by AutoVAS.

Introduction

Hidden flaws in software can lead to security vulnerabilities that allow attackers to compromise systems and applications. With the recent advances in hacking technology, software vulnerabilities have steadily increased in number (NIST, Accessed: Mar 2021e). In 2019 alone, more than 20,000 security vulnerabilities were registered in the Common Vulnerabilities and Exposures (CVE) system (NVD, Accessed: Mar 2021), a publicly available vulnerability database. This demonstrates the rapid increase in software security vulnerabilities. A recent exploit incident (NIST, Accessed: Mar 2021c) showed that these security loopholes can have catastrophic financial and social impacts. These vulnerabilities are often caused by subtle programming errors and can propagate quickly, owing to the spread of open-source software and code reuse. Previously, security experts analyzed software to discover vulnerabilities and patched the software themselves. However, analyzing software and searching for vulnerabilities takes considerable time, and the speed at which vulnerabilities can be analyzed depends on the experts' skill level, which makes it difficult to achieve a quick response. To reduce this dependence on experts and the cost of vulnerability detection, techniques (Kim, Woo, Lee, Oh, 2017, Li, Zou, Xu, Ou, Jin, Wang, Deng, Zhong, Zou, Wang, Xu, Li, Jin, 2019) and tools (CheckMarx, Accessed: Mar 2021; Fortify, Accessed: Mar 2021) for automated static vulnerability detection have emerged.

Research on deep learning methods that minimize the intervention of experts and automatically learn patterns of vulnerabilities has become a new trend in software vulnerability detection (Lin et al., 2020). This trend is consistent with the push to automate cyber defense, as promoted by initiatives such as the Cyber Grand Challenge (CGC) created by the Defense Advanced Research Projects Agency (DARPA) (DARPA, Accessed: Mar 2021). Although existing deep learning-based vulnerability detection systems utilize the syntax and semantics of software to improve detection performance, they have several limitations (Li et al., 2020). First, vulnerable code often depends on context that lies far away in the program. For example, variables defined at the beginning of a program or function may be used only at the end, or the vulnerable code may call many functions. As a result, a deep learning algorithm may ignore correlations between contexts when detecting vulnerabilities. Second, these methods have an out-of-vocabulary (OoV) problem when detecting vulnerabilities in new programs whose identifiers rarely appear in the source code used for learning. All programmers have unique styles when naming identifiers (variables, functions, etc.) in the source code. As a result, a common vocabulary is not sufficient for covering all possible identifiers, which may degrade the vulnerability detection results. Finally, the effectiveness of deep learning-based vulnerability detection is highly dependent on the amount and quality of the learning data.
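To make the OoV problem concrete, the following is a minimal sketch of identifier symbolization, one common way to keep programmer-specific names out of the vocabulary. It is an illustration under stated assumptions, not the AutoVAS implementation: the `KEEP` list, the `VAR`/`FUNC` naming scheme, and the rule that an identifier followed by `(` is a function are all hypothetical simplifications.

```python
import re

# Hypothetical, deliberately tiny list of tokens to leave untouched; a real
# system would use a full C/C++ keyword table and a curated library API list.
KEEP = {"int", "char", "void", "if", "else", "for", "while", "return",
        "sizeof", "strcpy", "memcpy", "malloc", "free"}

# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def symbolize(code):
    """Replace user-defined identifiers with generic symbols (VAR1, FUNC1, ...)."""
    tokens = TOKEN_RE.findall(code)
    var_map, func_map = {}, {}
    out = []
    for i, tok in enumerate(tokens):
        if IDENT_RE.fullmatch(tok) is None or tok in KEEP:
            out.append(tok)  # punctuation, literals, keywords, known APIs
            continue
        # Assumption: an identifier directly followed by "(" is a function name.
        is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
        table, prefix = (func_map, "FUNC") if is_call else (var_map, "VAR")
        if tok not in table:
            table[tok] = prefix + str(len(table) + 1)
        out.append(table[tok])
    return out


a = symbolize("char buf[8]; strcpy(buf, user_input);")
b = symbolize("char dst[8]; strcpy(dst, src);")
print(a)
print(a == b)
```

The two snippets differ only in naming style, yet they map to the same token sequence, so a model trained on one has already seen every token of the other.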

To address these problems, this study proposes a deep learning-based vulnerability detection framework called the automated vulnerability analysis system (AutoVAS). The framework uses a compiler-based program slicing method to address the long context dependence problem and applies various embedding methods and symbolic representation techniques to address the OoV problem. To search for various types of security vulnerabilities, the source code used as learning data was drawn from various datasets in the National Vulnerability Database (NVD) (NIST, Accessed: Mar 2021e) and Software Assurance Reference Dataset (SARD) (SARD, Accessed: Mar 2021). From these, a learning dataset with 98 vulnerabilities from the Common Weakness Enumeration (CWE) database (NIST, Accessed: Mar 2021b) and 719 vulnerabilities from the CVE was constructed (see Appendix 1). Furthermore, an oversampling method was used to overcome the imbalance problem in the datasets. The contributions of the present study are summarized as follows:

  • An optimal method for representing source code as input vectors in a deep learning model was proposed from the viewpoints of program slicing and embedding techniques. AutoVAS, which is a deep learning-based vulnerability detection framework, was also proposed, and its effectiveness was verified through experiments.

  • The datasets were built based on the NVD and SARD projects and released on GitHub (GitHub, Accessed: Mar 2021b). The source lexing results of the datasets and software were released to the public to be utilized in related studies.

  • A full-featured prototype of AutoVAS was implemented. The proposed technique achieved an FPR of 1.88%, an FNR of 3.62%, and an F1-score of 96.11%. Eleven vulnerabilities were detected in nine open-source projects. Notably, we discovered three zero-day vulnerabilities, two of which were patched after the maintainers were informed of the AutoVAS findings. The other vulnerability received a CVE ID after being detected by AutoVAS.
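For readers less familiar with the reported metrics, the sketch below shows how the false positive rate, false negative rate, and F1-score follow from a binary confusion matrix. The function name and the counts in the example are hypothetical and are not taken from the AutoVAS experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute FPR, FNR, and F1 from confusion-matrix counts.

    tp/fn count vulnerable samples that were detected/missed;
    tn/fp count non-vulnerable samples that were passed/flagged.
    """
    fpr = fp / (fp + tn)          # share of clean samples wrongly flagged
    fnr = fn / (fn + tp)          # share of vulnerable samples missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)       # equals 1 - FNR
    f1 = 2 * precision * recall / (precision + recall)
    return {"FPR": fpr, "FNR": fnr, "F1": f1}


# Hypothetical counts, for illustration only.
m = detection_metrics(tp=90, fp=5, tn=95, fn=10)
print({name: round(value, 4) for name, value in m.items()})
```

Because recall is the complement of the FNR, lowering the FNR and the FPR together is what drives the F1-score up.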

The remainder of this paper is organized as follows. Section 2 provides the background necessary for the design and implementation of AutoVAS. Section 3 presents the detailed design of AutoVAS. The evaluation results of the proposed mechanisms are summarized in Section 4. Section 5 discusses some limitations of the current model and possible improvements. Section 6 reviews related work, and finally, Section 7 concludes this paper.

Section snippets

Background

First, we describe vulnerability detection in Section 2.1. In Sections 2.2 and 2.3, we then address the program slicing and word embedding methods used to turn source code into the embedding vectors that serve as input to the deep learning model. Moreover, in Section 2.4, we introduce terminology to facilitate the understanding of this research. Finally, in Sections 2.5 and 2.6, we address the threat model and assumptions of this research.

Main method

The AutoVAS process is divided into a training phase to learn the detection model and a detection phase that uses the trained model to detect vulnerable code. As shown in Fig. 4, both phases preprocess the source code and embed the input data for the model in the same way; they differ only in the model learning and detection activities. The learning and detecting activities will be explained in Section 4. Thus, this section focuses on the preprocessing and

Evaluation

We designed a series of experiments to evaluate the effectiveness of AutoVAS by investigating the following research questions:

  • RQ1: What is the optimal embedding method for vulnerability discovery in source code? In this study, methods such as program slicing, word embedding, symbolization, and padding and split were used to embed the source code. We experimented to determine the combination of methods that provides the optimal performance. (Section 4.3)

  • RQ2: How effective is AutoVAS in practice?
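The "padding and split" step named in RQ1 is not detailed in this excerpt. As a rough sketch of what such a step usually involves, the code below turns token-ID sequences of arbitrary length into fixed-length model inputs. The function name, the choice of `0` as the padding ID, and the policy of splitting long sequences into consecutive chunks are assumptions, not the method evaluated in the paper.

```python
def pad_and_split(token_ids, max_len, pad_id=0):
    """Return a list of fixed-length chunks covering token_ids.

    Sequences longer than max_len are split into consecutive chunks;
    the last (or only) chunk is right-padded with pad_id.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    chunks = [token_ids[i:i + max_len]
              for i in range(0, len(token_ids), max_len)]
    if not chunks:            # empty input still yields one all-padding chunk
        chunks = [[]]
    return [chunk + [pad_id] * (max_len - len(chunk)) for chunk in chunks]


print(pad_and_split([1, 2, 3, 4, 5], max_len=3))
print(pad_and_split([7], max_len=3))
```

Other policies, such as truncating long sequences or padding on the left, would fit the same interface.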

Discussion and limitations

AutoVAS is useful and effective for detecting software vulnerabilities; however, there are still avenues for future improvement. This section presents the limitations of the current AutoVAS system and their workarounds and discusses enhancements for future work.

  • Dealing with several languages and an executable binary: The current AutoVAS only supports the C/C++ language. Future research must be conducted to support other languages. Further, the current AutoVAS only detects vulnerabilities of

Related work

With increasing concerns about software vulnerabilities, several studies have been conducted on automated vulnerability detection to reduce developers' review efforts. In 2015, DARPA, a research agency under the US Department of Defense, held the CGC to encourage research on fully automated software vulnerability analysis systems. These systems are fully autonomous and can perform automated vulnerability detection, exploit generation, and software patching without human intervention (Song and

Conclusion

This paper presents AutoVAS, an automated vulnerability analysis system based on a deep learning approach, which aims to reduce human intervention and improve on the low performance of other vulnerability detection systems. To search for various security vulnerabilities, a dataset was constructed using the source code of various projects in NVD and SARD. Thus, AutoVAS was designed to search for security vulnerabilities of 98 types. This study mitigates the impact of OoV and the lack of vulnerability

CRediT authorship contribution statement

Sanghoon Jeon: Conceptualization, Methodology, Software, Writing - original draft. Huy Kang Kim: Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (83)

  • X. Li et al. Automated vulnerability detection in source code using minimum intermediate representation learning. Appl. Sci. (2020)
  • D. Rattan et al. Software clone detection: a systematic review. Inf. Softw. Technol. (2013)
  • M. Abadi et al. TensorFlow: a system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (2016)
  • M. Allamanis et al. A survey of machine learning for big code and naturalness. ACM Comput. Surv. (CSUR) (2018)
  • M. Allamanis et al. Bimodal modelling of source code and natural language. International Conference on Machine Learning (2015)
  • U. Alon et al. code2vec: learning distributed representations of code. Proc. ACM Program. Lang. (2019)
  • T. Avgerinos et al. Automatic exploit generation. Commun. ACM (2014)
  • P. Bian et al. Detecting bugs by discovering expectations and their violations. IEEE Trans. Softw. Eng. (2018)
  • M. Böhme et al. Coverage-based greybox fuzzing as Markov chain. IEEE Trans. Softw. Eng. (2017)
  • P. Bojanowski et al. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. (2017)
  • Brockschmidt, M., Allamanis, M., Gaunt, A. L., Polozov, O., 2018. Generative code modeling with graphs....
  • Brooks, T. N., 2017. Survey of automated vulnerability detection and exploit generation techniques in cyber reasoning...
  • C. Cadar et al. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. OSDI (2008)
  • S.K. Cha et al. Unleashing mayhem on binary code. 2012 IEEE Symposium on Security and Privacy (2012)
  • N.V. Chawla et al. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. (2002)
  • CheckMarx, Accessed: Mar 2021. Checkmarx....
  • Cho, K., Van Merriënboer, B., Bahdanau, D., Bengio, Y., 2014a. On the properties of neural machine translation:...
  • Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014b. Learning phrase...
  • Coverity, Accessed: Mar 2021. Coverity....
  • H.K. Dam et al. Automatic feature learning for predicting vulnerable software components. IEEE Trans. Softw. Eng. (2018)
  • DARPA, Accessed: Mar 2021. Cyber Grand Challenge (CGC)....
  • FlawFinder, Accessed: Mar 2021. Flawfinder....
  • ForAllSecure, Accessed: Mar 2021. ForAllSecure: Mayhem ensures your apps are secure in the face of the...
  • Fortify, Accessed: Mar 2021. Hp fortify....
  • Gamboa, J. C. B., 2017. Deep learning for time-series analysis....
  • S.M. Ghaffarian et al. Software vulnerability analysis and discovery using machine-learning and data-mining techniques: a survey. ACM Comput. Surv. (CSUR) (2017)
  • GitHub, Accessed: Mar 2021a. Database and Source Code of VulDeePecker....
  • GitHub, Accessed: Mar 2021b. Database of AutoVAS. https://github.com/kppw99/autoVAS....
  • GitHub, Accessed: Mar 2021c. Database of VulDeePecker....
  • GitHub, Accessed: Mar 2021d. LLVM-Slicing....
  • GrammaTech, Accessed: Mar 2021. DARPA Cyber Grand Challenge - Team TECHX | GrammaTech....
  • G. Grieco et al. Toward large-scale vulnerability discovery using machine learning. Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy (2016)
  • T.A. Henzinger et al. Software verification with BLAST. International SPIN Workshop on Model Checking of Software (2003)
  • S. Horwitz et al. Interprocedural slicing using dependence graphs. ACM Trans. Program. Lang. Syst. (TOPLAS) (1990)
  • S. Hu et al. MSMOTE: improving classification performance when training data is imbalanced. 2009 Second International Workshop on Computer Science and Engineering (2009)
  • J. Jang et al. ReDeBug: finding unpatched code clones in entire OS distributions. 2012 IEEE Symposium on Security and Privacy (2012)
  • S. Kim et al. VUDDY: a scalable approach for vulnerable code clone discovery. 2017 IEEE Symposium on Security and Privacy (SP) (2017)
  • Q. Le et al. Distributed representations of sentences and documents. International Conference on Machine Learning (2014)
  • O. Levy et al. Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. (2015)
  • H. Li et al. A new combination sampling method for imbalanced data. Proceedings of 2013 Chinese Intelligent Automation Conference (2013)
  • J. Li et al. CBCD: cloned buggy code detector. 2012 34th International Conference on Software Engineering (ICSE) (2012)