Optimizing software vulnerability detection using RoBERTa and machine learning

Do, Cho Xuan; Luu, Nguyen Trong; Nguyen, Phuong Thi Lan

doi:10.1007/s10515-024-00440-1

Published: 08 May 2024

Automated Software Engineering, Volume 31, article number 40 (2024)

Cho Xuan Do¹,
Nguyen Trong Luu² &
Phuong Thi Lan Nguyen²

Abstract

Detecting vulnerabilities in source code written in C and C++ is essential, because attacks against systems work by finding and exploiting these vulnerabilities. In this article, to improve the effectiveness of source code vulnerability detection, we propose a new approach that builds and represents source code features using natural language processing (NLP) techniques. The proposal consists of two main stages: (i) building a feature profile of the source code using the RoBERTa model, and (ii) classifying the source code based on that feature profile using a supervised machine learning algorithm. Using the pre-trained RoBERTa model, we represent the important features of source code as complete vectors, thereby enhancing the effectiveness of the prediction and vulnerability detection models. In the experiments, we compare our proposal with other approaches on the FFmpeg + Qemu dataset. The results show that our approach outperforms the other approaches on all measures. The proposal to use NLP techniques based on the RoBERTa model therefore has both scientific significance, as a research direction not previously proposed for this application, and practical significance, given its effectiveness in all experiments.
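
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function and class names (roberta_embed, hashed_embed, NearestCentroid, detect), the roberta-base checkpoint, and the first-token pooling are all assumptions made for illustration. The abstract does not name the classifier, so a simple nearest-centroid rule stands in for the supervised machine learning algorithm, and a hashed bag-of-tokens embedder stands in for RoBERTa so that the example runs offline.

```python
import hashlib
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Embedder = Callable[[Sequence[str]], List[Vector]]


def roberta_embed(snippets: Sequence[str], model_name: str = "roberta-base") -> List[Vector]:
    """Stage (i) with a pre-trained RoBERTa encoder.

    Requires torch and transformers; the checkpoint name and the choice of the
    first-token vector as the snippet representation are assumptions.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(list(snippets), padding=True, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[:, 0, :].tolist()  # one vector per snippet


def hashed_embed(snippets: Sequence[str], dim: int = 64) -> List[Vector]:
    """Offline stand-in for stage (i): hashed bag-of-tokens, NOT RoBERTa."""
    vectors = []
    for code in snippets:
        vec = [0.0] * dim
        for token in code.replace("(", " ( ").replace(")", " ) ").split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])  # unit-length feature vector
    return vectors


class NearestCentroid:
    """Stage (ii) stand-in: assign each vector to the closest class mean."""

    def fit(self, X: List[Vector], y: Sequence[str]) -> "NearestCentroid":
        sums: Dict[str, Vector] = {}
        counts: Dict[str, int] = {}
        for vec, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, v in enumerate(vec):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {lab: [v / counts[lab] for v in s] for lab, s in sums.items()}
        return self

    def predict(self, X: List[Vector]) -> List[str]:
        def dist(a: Vector, b: Vector) -> float:
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids, key=lambda lab: dist(vec, self.centroids[lab]))
                for vec in X]


def detect(train_code: Sequence[str], train_labels: Sequence[str],
           test_code: Sequence[str], embed: Embedder = hashed_embed) -> List[str]:
    """Embed the snippets (stage i), then classify the feature vectors (stage ii)."""
    clf = NearestCentroid().fit(embed(train_code), train_labels)
    return clf.predict(embed(test_code))
```

Passing embed=roberta_embed to detect swaps the stand-in features for RoBERTa vectors without changing the classification stage; in practice the nearest-centroid rule would likewise be replaced by whichever supervised classifier is under evaluation.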

Data availability

The datasets generated and (or) analysed during the current study are available from the corresponding author on reasonable request. Replication package URL: https://github.com/Kkyn-ltcode/Optimizing-Software-vulnerability-detection-using-RoBERTa-and-Machine-Learning.

Funding

No funding was received for this work.

Author information

Authors and Affiliations

Faculty of Information Security, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
Cho Xuan Do
Faculty of Information Technology, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
Nguyen Trong Luu & Phuong Thi Lan Nguyen

Contributions

CDX raised the idea, initiated the project, and designed the experiments; N and P carried out the experiments under the supervision of CDX; both authors analyzed the data and results; CDX wrote the paper.

Corresponding author

Correspondence to Cho Xuan Do.

Ethics declarations

Competing interests

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Do, C.X., Luu, N.T. & Nguyen, P.T.L. Optimizing software vulnerability detection using RoBERTa and machine learning. Autom Softw Eng 31, 40 (2024). https://doi.org/10.1007/s10515-024-00440-1

Received: 05 April 2023
Accepted: 16 April 2024
Published: 08 May 2024
DOI: https://doi.org/10.1007/s10515-024-00440-1
