Defect-scanner: a comparative empirical study on language model and deep learning approach for software vulnerability detection

Pham, Van-Hau; Hien, Do Thi Thu; Hoang, Hien Do; Duy, Phan The

doi:10.1007/s10207-024-00901-4

Defect-scanner: a comparative empirical study on language model and deep learning approach for software vulnerability detection

Regular Contribution
Published: 13 August 2024

Volume 23, pages 3513–3526, (2024)
Cite this article

International Journal of Information Security Aims and scope Submit manuscript

Van-Hau Pham^1,2,
Do Thi Thu Hien^1,2,
Hien Do Hoang^1,2 &
…
Phan The Duy^1,2

335 Accesses
Explore all metrics

Abstract

The complex and rapidly evolving nature of modern software landscapes introduces challenges such as increasingly sophisticated cyber threats, the diversity in programming languages and coding styles, and the need to identify subtle patterns indicative of vulnerabilities. These hurdles underscore the necessity for advanced techniques that can effectively cope with the intricacies of software security. Hence, this paper gives a comparative empirical study in harnessing the potential of cutting-edge natural language processing (NLP) advancements, namely Word2Vec and CodeBERT to detect vulnerabilities in C and C++ programs in the proposed Defect-Scanner framework. With the capability of converting code components and source code into contextual embedding vectors, various potential NLP techniques are combined with several DL models to evaluate the precision and accuracy of identifying vulnerabilities within software systems. Moreover, the experimentations are conducted using datasets with different representation types of codes, aiming to figure out the best combination of NLP techniques and DL models to work with each form of input. As a result, besides the outperformance of CodeBERT-based models with accuracies of approximately 90%, this comparative study also provides a comprehensive evaluation of NLP-based software vulnerability detection in the face of intricate security challenges.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Source Code Vulnerability Detection Using Deep Learning Algorithms for Industrial Applications

Optimizing software vulnerability detection using RoBERTa and machine learning

Article 08 May 2024

A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python

References

SySeVR dataset. https://github.com/SySeVR/SySeVR
VulDeBERT dataset. https://github.com/SKKU-SecLab/VulDeBERT
VulDeePecker dataset. https://github.com/CGCL-codes/VulDeePecker
Ait Messaad, B., Chetioui, K., Balboul, Y., Rhachi, H.: Analyzing and detecting malware using machine learning and deep learning. In: The International Conference on Artificial Intelligence and Smart Environment, pp. 518–525. Springer (2023)
Brauckmann, A., Goens, A., Ertel, S., Castrillon, J.: Compiler-based graph representations for deep learning models of code. In: Proceedings of the 29th International Conference on Compiler Construction (2020). https://doi.org/10.1145/3377555.3377894
Cheng, X., Zhang, G., Wang, H., Sui, Y.: Path-sensitive code embedding via contrastive learning for software vulnerability detection. In: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022, pp. 519–531. Association for Computing Machinery, New York (2022)
Croft, R., Xie, Y., Babar, M.A.: Data preparation for software vulnerability prediction: a systematic literature review. IEEE Trans. Softw. Eng. 49(3), 1044–1063 (2022)
Article Google Scholar
Du, X., Wen, M., Wei, Z., Wang, S., Jin, H.: An extensive study on adversarial attack against pre-trained models of code. In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, pp. 489–501. Association for Computing Machinery, New York (2023). https://doi.org/10.1145/3611643.3616356
Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., Zhou, M.: CodeBERT: A Pre-Trained Model for Programming and Natural Languages. pp. 1536–1547 (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.139
Ghaffarian, S.M., Shahriari, H.R.: Software vulnerability analysis and discovery using machine-learning and data-mining techniques: a survey. ACM Comput. Surv. (CSUR) 50(4), 1–36 (2017)
Article Google Scholar
Hanif, H., Nasir, M.H.N.M., Razak, M.F.A., Firdaus, A., Anuar, N.B.: The rise of software vulnerability: taxonomy of software vulnerabilities detection and machine learning approaches. J. Netw. Comput. Appl. 179, 103009 (2021)
Article Google Scholar
Hariyanti, E., Djunaidy, A., Siahaan, D.: Information security vulnerability prediction based on business process model using machine learning approach. Comput. Secur. 110, 102422 (2021)
Article Google Scholar
Hin, D., Kan, A., Chen, H., Babar, M.A.: LineVD: statement-level vulnerability detection using graph neural networks. In: MSR ’22: Proceedings of the 19th International Conference on Mining Software Repositories (2022)
Khan, R.A., Khan, S.U., Khan, H.U., Ilyas, M.: Systematic mapping study on security approaches in secure software engineering. IEEE Access 9, 19139–19160 (2021)
Article Google Scholar
Khurana, D., Koli, A., Khatter, K., Singh, S.: Natural language processing: state of the art, current trends and challenges. Multimed. Tools Appl. 82(3), 3713–3744 (2023)
Article Google Scholar
Kim, S., Choi, J., Ahmed, M.E., Nepal, S., Kim, H.: VulDeBERT: a vulnerability detection system using BERT. In: 2022 IEEE ISSREW, pp. 69–74 (2022). https://doi.org/10.1109/ISSREW55968.2022.00042
Li, J., He, P., Zhu, J., Lyu, M.R.: Software defect prediction via convolutional neural network. In: 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS) (2017)
Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z.: SySeVR: a framework for using deep learning to detect software vulnerabilities. IEEE Trans. Dependable Secure Comput. 19(4), 2244–2258 (2022). https://doi.org/10.1109/TDSC.2021.3051525
Article Google Scholar
Li, Z., Zou, D., Xu, S., Ou, X., Jin, H., Wang, S., Deng, Z., Zhong, Y.: VulDeePecker: a deep learning-based system for vulnerability detection. In: NDSS Symposium (2018)
Lin, G., Wen, S., Han, Q.L., Zhang, J., Xiang, Y.: Software vulnerability detection using deep neural networks: a survey. Proc. IEEE 108(10), 1825–1848 (2020)
Article Google Scholar
Lin, T., Wang, Y., Liu, X., Qiu, X.: A survey of transformers. AI Open 3, 111–132 (2022). https://doi.org/10.1016/j.aiopen.2022.10.001
Article Google Scholar
Ling, X., Wu, L., Zhang, J., Qu, Z., Deng, W., Chen, X., Qian, Y., Wu, C., Ji, S., Luo, T., et al.: Adversarial attacks against Windows PE malware detection: a survey of the state-of-the-art. Comput. Secur. 128, 103134 (2023)
Article Google Scholar
Marjanov, T., Pashchenko, I., Massacci, F.: Machine learning for source code vulnerability detection: what works and what isn’t there yet. IEEE Secur. Priv. 20(5), 60–76 (2022)
Article Google Scholar
Medsker, L., Jain, L.C.: Recurrent neural network: design and applications (2001)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: International Conference on Learning Representations (2013)
O’Shea, K., Nash, R.: An Introduction to Convolutional Neural Networks (2015)
Schuster, M., Paliwal, K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45, 2673–2681 (1997). https://doi.org/10.1109/78.650093
Article Google Scholar
Shaukat, K., Luo, S., Varadharajan, V.: A novel method for improving the robustness of deep learning-based malware detectors against adversarial attacks. Eng. Appl. Artif. Intell. 116, 105461 (2022)
Article Google Scholar
Shaukat, K., Luo, S., Varadharajan, V.: A novel deep learning-based approach for malware detection. Eng. Appl. Artif. Intell. 122, 106030 (2023)
Article Google Scholar
Shaukat, K., Luo, S., Varadharajan, V., Hameed, I.A., Chen, S., Liu, D., Li, J.: Performance comparison and current challenges of using machine learning techniques in cybersecurity. Energies 13(10), 2509 (2020)
Article Google Scholar
Shaukat, K., Luo, S., Varadharajan, V., Hameed, I.A., Xu, M.: A survey on machine learning techniques for cyber security in the last decade. IEEE Access 8, 222310–222354 (2020)
Article Google Scholar
Tang, W., Tang, M., Ban, M., Zhao, Z., Feng, M.: CSGVD: A deep learning approach combining sequence and graph embedding for source code vulnerability detection. J. Syst. Softw. (2023)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017). arXiv:1706.03762pdf
Viet Phan, A., Le Nguyen, M., Thu Bui, L.: Convolutional neural networks over control flow graphs for software defect prediction. In: 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI) (2017). https://doi.org/10.1109/ICTAI.2017.00019
Wang, S., Liu, T., Tan, L.: Automatically learning semantic features for defect prediction. In: ICSE ’16: Proceedings of the 38th International Conference on Software Engineering (2016)
Wu, J.: Literature review on vulnerability detection using NLP technology. arXiv:2104.11230 (2021)
Yan, S., Ren, J., Wang, W., Sun, L., Zhang, W., Yu, Q.: A survey of adversarial attack and defense methods for malware classification in cyber security. IEEE Commun. Surv. Tutor. 25(1), 467–496 (2022)
Yang, Y., Fan, H., Lin, C., Li, Q., Zhao, Z., Shen, C.: Exploiting the adversarial example vulnerability of transfer learning of source code. IEEE Trans. Inf. Forensics Secur. 19, 5880–5894 (2024). https://doi.org/10.1109/TIFS.2024.3402153
Article Google Scholar
Yu, X., Li, Z., Huang, X., Zhao, S.: Advulcode: Generating adversarial vulnerable code against deep learning-based vulnerability detectors. Electronics 12(4), 936 (2023)
Article Google Scholar
Zeng, P., Lin, G., Pan, L., Tai, Y., Zhang, J.: Software vulnerability analysis and discovery using deep learning techniques: a survey. IEEE Access 8, 197158–197172 (2020)
Article Google Scholar
Zhang, H., Lu, S., Li, Z., Jin, Z., Ma, L., Liu, Y., Li, G.: Codebert-attack: adversarial attack against source code deep learning models via pre-trained model. J. Softw. Evol. Process 36(3), e2571 (2024)
Article Google Scholar
Zhang, Q., Wu, B.: Software defect prediction via transformer. In: 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) (2020). https://doi.org/10.1109/ITNEC48623.2020.9084745
Zhu, Y., Lin, G., Song, L., Zhang, J.: The application of neural network for software vulnerability detection: a review. Neural Comput. Appl. 35(2), 1279–1301 (2023)
Article Google Scholar
Ziems, N., Wu, S.: Security vulnerability detection using deep learning natural language processing. In: IEEE INFOCOM 2021 (2021). https://doi.org/10.1109/INFOCOMWKSHPS51825.2021.9484500
Zou, D., Wang, S., Xu, S., Li, Z., Jin, H.: VulDeePecker: a deep learning-based system for multiclass vulnerability detection. IEEE Trans. Dependable Secure Comput. 18(5), 2224–2236 (2019)
Google Scholar

Download references

Acknowledgements

research was supported by The VNUHCM-University of Information Technology’s Scientific Research Support Fund.

Author information

Authors and Affiliations

Information Security Laboratory, University of Information Technology, Ho Chi Minh City, Vietnam
Van-Hau Pham, Do Thi Thu Hien, Hien Do Hoang & Phan The Duy
Vietnam National University, Ho Chi Minh City, Vietnam
Van-Hau Pham, Do Thi Thu Hien, Hien Do Hoang & Phan The Duy

Authors

Van-Hau Pham
View author publications
You can also search for this author inPubMed Google Scholar
Do Thi Thu Hien
View author publications
You can also search for this author inPubMed Google Scholar
Hien Do Hoang
View author publications
You can also search for this author inPubMed Google Scholar
Phan The Duy
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Van-Hau Pham.

Ethics declarations

Conflict of interest

The authors have no Conflict of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Pham, VH., Hien, D.T.T., Hoang, H.D. et al. Defect-scanner: a comparative empirical study on language model and deep learning approach for software vulnerability detection. Int. J. Inf. Secur. 23, 3513–3526 (2024). https://doi.org/10.1007/s10207-024-00901-4

Download citation

Published: 13 August 2024
Issue Date: December 2024
DOI: https://doi.org/10.1007/s10207-024-00901-4

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Defect-scanner: a comparative empirical study on language model and deep learning approach for software vulnerability detection

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Source Code Vulnerability Detection Using Deep Learning Algorithms for Industrial Applications

Optimizing software vulnerability detection using RoBERTa and machine learning

A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now