Defect-scanner: a comparative empirical study on language model and deep learning approach for software vulnerability detection

  • Regular Contribution
International Journal of Information Security

Abstract

The complex and rapidly evolving nature of modern software introduces challenges such as increasingly sophisticated cyber threats, the diversity of programming languages and coding styles, and the need to identify subtle patterns indicative of vulnerabilities. These hurdles underscore the need for advanced techniques that can cope with the intricacies of software security. This paper therefore presents a comparative empirical study of how recent natural language processing (NLP) techniques, namely Word2Vec and CodeBERT, can be used to detect vulnerabilities in C and C++ programs within the proposed Defect-Scanner framework. Because these techniques can convert code components and source code into contextual embedding vectors, several NLP techniques are combined with several deep learning (DL) models to evaluate the precision and accuracy with which vulnerabilities in software systems are identified. The experiments are conducted on datasets that use different code representations, with the aim of finding the best combination of NLP technique and DL model for each form of input. The results show that CodeBERT-based models perform best, with accuracies of approximately 90%, and the study provides a comprehensive evaluation of NLP-based software vulnerability detection in the face of intricate security challenges.
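As a rough illustration of the kind of pipeline the abstract describes, the sketch below shows the preprocessing that typically precedes an embedding layer (splitting a C/C++ snippet into tokens and mapping them to integer ids) together with the two metrics the abstract names. It is a minimal sketch under stated assumptions, not the Defect-Scanner implementation: the token pattern and the helper names `tokenize`, `build_vocab`, `encode` and `accuracy_and_precision` are hypothetical, and the embedding step itself (Word2Vec or CodeBERT) and the DL classifier are left out.

```python
import re

# Hypothetical token pattern for C/C++ snippets: identifiers, integers,
# a few multi-character operators, then any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|->|==|!=|<=|>=|&&|\|\||\S")


def tokenize(code: str) -> list[str]:
    """Split a C/C++ snippet into lexical tokens (whitespace is dropped)."""
    return TOKEN_RE.findall(code)


def build_vocab(token_lists):
    """Assign each distinct token an integer id; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, the form an embedding layer would consume."""
    return [vocab.get(tok, 0) for tok in tokens]


def accuracy_and_precision(y_true, y_pred):
    """Binary metrics, with label 1 meaning 'vulnerable'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


if __name__ == "__main__":
    tokens = tokenize("strcpy(buf, src);")
    vocab = build_vocab([tokens])
    print(tokens)
    print(encode(tokens, vocab))
```

In a full system the integer ids would be replaced by, or fed into, learned embedding vectors before classification; a pre-trained model such as CodeBERT ships with its own subword tokenizer, so this hand-written tokenizer would apply only to a Word2Vec-style setup.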

Acknowledgements

This research was supported by the VNUHCM-University of Information Technology’s Scientific Research Support Fund.

Author information

Corresponding author

Correspondence to Van-Hau Pham.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pham, VH., Hien, D.T.T., Hoang, H.D. et al. Defect-scanner: a comparative empirical study on language model and deep learning approach for software vulnerability detection. Int. J. Inf. Secur. 23, 3513–3526 (2024). https://doi.org/10.1007/s10207-024-00901-4

  • DOI: https://doi.org/10.1007/s10207-024-00901-4
