ABSTRACT
Detecting software vulnerabilities is an important and challenging problem. Existing studies have shown that deep learning-based approaches can significantly improve vulnerability detection performance because they automatically learn semantically rich code representations. However, deep learning-based source code vulnerability detection methods still have limited ability to learn long-range contextual dependencies between code statements. In this paper, we propose VulD-Transformer, a deep learning-based, code slice-level vulnerability detection approach that uses a Transformer model to capture the critical vulnerability features of long code slices. Specifically, we first obtain code slices containing data dependencies and control dependencies by extracting vulnerability syntax features and the Program Dependency Graphs of programs. Then, to improve the model's ability to learn features from distant code statements, we design a Transformer-based vulnerability detection model. Experimental results on four synthetic datasets show that, compared with the VulDeePecker, SySeVR-BGRU, SySeVR-ABGRU, and Russell approaches, VulD-Transformer improves accuracy, recall, and F1-measure by 6.12%, 8.01%, and 7.63% on average, respectively, when code slices are longer than 256 tokens. In addition, on two real-world source code vulnerability datasets, Devign and REVEAL, VulD-Transformer improves accuracy, recall, and F1-measure over these baselines by 9.01%, 38.51%, and 20.98% on average, respectively.
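The slicing step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical toy Program Dependency Graph given as `(source, target, kind)` edges over statement line numbers, treats a `memcpy` call on line 4 as the vulnerability syntax feature, and assumes the slice is the union of the backward and forward dependence closures of that statement. The names `reachable`, `code_slice`, and `TOY_EDGES` are illustrative only.

```python
from collections import defaultdict, deque

# Hypothetical toy program (statement number: source line):
#   1: char buf[8];
#   2: int n = read_len();
#   3: if (n > 0) {
#   4:     memcpy(buf, src, n);   <- assumed vulnerability syntax feature
#   5: }
#   6: log("done");
# Each edge is (source statement, dependent statement, dependence kind).
TOY_EDGES = [
    (1, 4, "data"),     # buf defined at 1, used at 4
    (2, 3, "data"),     # n defined at 2, used at 3
    (2, 4, "data"),     # n defined at 2, used at 4
    (3, 4, "control"),  # 4 executes only if the condition at 3 holds
]


def reachable(adjacency, start):
    """Breadth-first closure of `start` over an adjacency mapping."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def code_slice(edges, seed):
    """Return the statements that `seed` depends on, or that depend on it.

    Data and control edges are followed alike; the result is sorted by
    statement number so the slice keeps source order.
    """
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for src, dst, _kind in edges:
        successors[src].add(dst)
        predecessors[dst].add(src)
    backward = reachable(predecessors, seed)  # what the seed depends on
    forward = reachable(successors, seed)     # what depends on the seed
    return sorted(backward | forward)


if __name__ == "__main__":
    print(code_slice(TOY_EDGES, seed=4))
```

In this toy graph, statement 6 has no dependence relation with the `memcpy` call, so it is left out of the slice. A slice built this way can place statements that are far apart in the source file next to each other, and the resulting token sequences may be long, which is the setting the abstract targets.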
- [n. d.]. Common Vulnerabilities and Exposures. https://cve.mitre.org/
- Amritanshu Agrawal and Tim Menzies. 2018. Is "Better Data" Better Than "Better Data Miners"? In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). 1050–1061. https://doi.org/10.1145/3180155.3180197
- Wenyan An, Liwei Chen, Jinxin Wang, Gewangzi Du, Gang Shi, and Dan Meng. 2020. AVDHRAM: Automated Vulnerability Detection based on Hierarchical Representation and Attention Mechanism. In 2020 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SocialCom/SustainCom). 337–344. https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00068
- Mohammad Taneem Bin Nazim, Md Jobair Hossain Faruk, Hossain Shahriar, Md Abdullah Khan, Mohammad Masum, Nazmus Sakib, and Fan Wu. 2022. Systematic Analysis of Deep Learning Model for Vulnerable Code Detection. In 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC). 1768–1773. https://doi.org/10.1109/COMPSAC54236.2022.00281
- Sicong Cao, Xiaobing Sun, Lili Bo, Ying Wei, and Bin Li. 2021. BGNN4VD: Constructing Bidirectional Graph Neural-Network for Vulnerability Detection. Inf. Softw. Technol. 136, C (aug 2021). https://doi.org/10.1016/j.infsof.2021.106576
- Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. 2022. Deep Learning Based Vulnerability Detection: Are We There Yet? IEEE Transactions on Software Engineering 48, 9 (2022), 3280–3296. https://doi.org/10.1109/TSE.2021.3087402
- Xiao Deng, Wei Ye, Xie Rui, and Shikun Zhang. 2023. Survey of Source Code Bug Detection Based on Deep Learning. Journal of Software 34, 2 (2023), 625–654. https://doi.org/10.13328/j.cnki.jos.006696
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
- Xu Duan, Jingzheng Wu, Shouling Ji, Zhiqing Rui, Tianyue Luo, Mutian Yang, and Yanjun Wu. 2019. VulSniper: Focus Your Attention to Shoot Fine-Grained Vulnerabilities. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (Macao, China) (IJCAI’19). AAAI Press, 4665–4671.
- Adanma Cecilia Eberendu, Valentine Ikechukwu Udegbe, Edmond Onwubiko Ezennorom, Anita Chinonso Ibegbulam, and Titus Ifeanyi Chinebu. 2022. A Systematic Literature Review of Software Vulnerability Detection. European Journal of Computer Science and Information Technology 10, 1 (2022), 23–37.
- Hantao Feng, Xiaotong Fu, Hongyu Sun, He Wang, and Yuqing Zhang. 2020. Efficient Vulnerability Detection based on abstract syntax tree and Deep Learning. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). 722–727. https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9163061
- Qi Feng, Chendong Feng, and Weijiang Hong. 2020. Graph Neural Network-based Vulnerability Predication. 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME) (2020), 800–801.
- Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
- Seyed Mohammad Ghaffarian and Hamid Reza Shahriari. 2021. Neural software vulnerability analysis using rich intermediate graph representations of programs. Information Sciences 553 (2021), 189–207. https://doi.org/10.1016/j.ins.2020.11.053
- Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In 9th International Conference on Learning Representations. OpenReview.net. https://openreview.net/forum?id=jLoC4ez43PZ
- Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, 427–431. https://aclanthology.org/E17-2068
- Jian Li, Pinjia He, Jieming Zhu, and Michael R. Lyu. 2017. Software Defect Prediction via Convolutional Neural Network. In 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS). 318–328. https://doi.org/10.1109/QRS.2017.42
- Yun Li, Chenlin Huang, Zhongfeng Wang, Lu Yuan, and Xiaochuan Wang. 2020. Survey of software vulnerability mining methods based on machine learning. Journal of Software 31, 7 (2020), 2040–2061.
- Yi Li, Shaohua Wang, Tien N. Nguyen, and Son Van Nguyen. 2019. Improving Bug Detection via Context-Based Code Representation Learning and Attention-Based Neural Networks. Proc. ACM Program. Lang. 3, OOPSLA, Article 162 (oct 2019), 30 pages. https://doi.org/10.1145/3360588
- Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen. 2022. SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities. IEEE Transactions on Dependable and Secure Computing 19, 4 (2022), 2244–2258. https://doi.org/10.1109/TDSC.2021.3051525
- Z. Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong. 2018. VulDeePecker: A Deep Learning-Based System for Vulnerability Detection. ArXiv abs/1801.01681 (2018).
- Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, Yang Xiang, Olivier De Vel, and Paul Montague. 2018. Cross-Project Transfer Representation Learning for Vulnerable Function Discovery. IEEE Transactions on Industrial Informatics 14, 7 (2018), 3289–3297. https://doi.org/10.1109/TII.2018.2821768
- Henning Perl, Sergej Dechand, Matthew Smith, Daniel Arp, Fabian Yamaguchi, Konrad Rieck, Sascha Fahl, and Yasemin Acar. 2015. VCCFinder: Finding Potential Vulnerabilities in Open-Source Projects to Assist Code Audits (CCS ’15). Association for Computing Machinery, New York, NY, USA, 426–437. https://doi.org/10.1145/2810103.2813604
- Michael Pradel and Koushik Sen. 2018. DeepBugs: A Learning Approach to Name-Based Bug Detection. Proc. ACM Program. Lang. 2, OOPSLA, Article 147 (oct 2018), 25 pages. https://doi.org/10.1145/3276517
- Alec Radford and Karthik Narasimhan. 2018. Improving Language Understanding by Generative Pre-Training.
- Rebecca Russell, Louis Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, Paul Ellingwood, and Marc McConley. 2018. Automated Vulnerability Detection in Source Code Using Deep Representation Learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). 757–762. https://doi.org/10.1109/ICMLA.2018.00120
- Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 4263–4272. https://doi.org/10.18653/v1/D18-1458
- Gregory Tassey. 2002. The Economic Impacts of Inadequate Infrastructure for Software Testing. National Institute of Standards and Technology (2002).
- Chandra Thapa, Seung Ick Jang, Muhammad Ejaz Ahmed, Seyit Camtepe, Josef Pieprzyk, and Surya Nepal. 2022. Transformer-Based Language Models for Software Vulnerability Detection. In Proceedings of the 38th Annual Computer Security Applications Conference (Austin, TX, USA) (ACSAC ’22). Association for Computing Machinery, New York, NY, USA, 481–496. https://doi.org/10.1145/3564625.3567985
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.
- Haitao Wang, Jie He, Xiaohong Zhang, and Shufen Liu. 2020. A Short Text Classification Method Based on N-Gram and CNN. Chinese Journal of Electronics 29, 2 (2020), 248–254. https://doi.org/10.1049/cje.2020.01.001
- Shizhong Wu. 2009. Review and outlook of information security vulnerability analysis. Journal of Tsinghua University (Science and Technology) 49 (2009), 2065–2072.
- Shizhong Wu, Tao Guo, Guowei Dong, and Jiajie Wang. 2012. Progress in software vulnerability analysis technology. Journal of Tsinghua University (Science and Technology) 52, 10 (2012), 1309–1319.
- Fabian Yamaguchi, Felix Lindner, and Konrad Rieck. 2011. Vulnerability Extrapolation: Assisted Discovery of Vulnerabilities Using Machine Learning. In Proceedings of the 5th USENIX Conference on Offensive Technologies (San Francisco, CA) (WOOT’11). USENIX Association, USA, 13.
- Xin Zhang, Hongyu Sun, Zhipeng He, MianXue Gu, Jingyu Feng, and Yuqing Zhang. 2022. VDBWGDL: Vulnerability Detection Based On Weight Graph And Deep Learning. In 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). 186–190. https://doi.org/10.1109/DSN-W54100.2022.00039
- Xin Zhou, DongGyun Han, and David Lo. 2021. Assessing Generalizability of CodeBERT. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME). 425–436. https://doi.org/10.1109/ICSME52107.2021.00044
- Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Advances in Neural Information Processing Systems 32 (NIPS 2019). Neural Information Processing Systems (NIPS). https://nips.cc/Conferences/2019, https://papers.nips.cc/book/advances-in-neural-information-processing-systems-32-2019
Index Terms
- VulD-Transformer: Source Code Vulnerability Detection via Transformer