Abstract
Detecting vulnerabilities in source code written in C and C + + is currently essential as attack techniques against systems seek to find, exploit, and attack these vulnerabilities. In this article, to improve the effectiveness of the source code vulnerability detection process, we propose a new approach based on building and representing source code features using natural language processing (NLP) techniques. Our proposal in the article consists of two main stages: (i) building a feature profile of the source code using the RoBERTa model, and (ii) classifying source code based on the feature profile using a supervised machine learning algorithm. Specifically, with our proposal utilizing the pre-trained RoBERTa model, we have successfully built and represented important features of source code as complete vectors, thereby enhancing the effectiveness of prediction and vulnerability detection models. The experimental part of our article compared and evaluated our proposal with other approaches on the FFmpeg + Qume dataset. The experimental results in the article showed that the approach in this study was superior to other research directions on all measures. Therefore, the proposal to use NLP techniques based on the RoBERTa model not only has scientific significance as a new research direction that has not been proposed for application but also has practical significance when all experimental results are highly effective.








Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Data availability
The datasets generated and (or) analysed during the current study are available from the corresponding author on reasonable request. Replication package URL: https://github.com/Kkyn-ltcode/Optimizing-Software-vulnerability-detection-using-RoBERTa-and-Machine-Learning.
References
Ba JL, Kiros JR, Hinton GE. 2016. Layer normalization. arXiv:1607.06450.
Chen, D., Zhang, Yd., Wei, W., et al.: Efficient vulnerability detection based on an optimized rule-checking static analysis technique. Front. Inf. Technol. Electron. Eng. 18, 332–345 (2017)
Chen, T, Guestrin C.: XGBoost: a scalable tree boosting system. In: KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016).
Cho, D.X., Son, V.N., Duc, D.: Automatically detect software security vulnerabilities based on natural language processing techniques and ML algorithms. J. ICT Res. Appl. 16(1), 70–87 (2022)
Corinna, C., Vladimir, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). arXiv:1810.04805.
Ferrante, J., Ottenstein, K.J., Warren, J.D.: The program dependence graph and its use in optimization. ACM Trans. Programm. Lang. Syst. 9(3), 319–349 (1989)
Gascon, H., Yamaguchi, F., Arp, D., Rieck, K.: Structural detection of android malware using embedded call graphs. In: ACM workshop on Artificial intelligence and security, pp. 45–54 (2013)
Handa, A., Sharma, A., Shukla, S.K.: Machine learning in cybersecurity: a review. WIREs Data Min. Knowl. Discov. 9(4) (2019).
Harer, J.A., Kim, L., Russell, R.L., Ozdemir, O., et al.: Automated software vulnerability detection with machine learning (2018)
Haridas, P., Chennupati, G., Santhi, N., Romero, P., Eidenbenz, S.: Code characterization with graph convolutions and capsule networks. IEEE Access. 8, 136307–136315 (2020)
He, K., Zhang, X., Ren, S., Sun, S.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 70–778 (2016)
Hu, J., Chen, J., Zhang, L., Liu, Y., Bao, Q., Ackah-Arthur, H.: A memory-related vulnerability detection approach based on vulnerability features. Tsinghua Sci. Technol. 25(5), 604–613 (2020)
Khanna, C.: Byte-Pair Encoding: Subword-based tokenization algorithm (2021). Accessed 2022 Dec 20 https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0.
Lee, M., Cho, S., Jang, C., Park, H., Choi, E.: A rulebased security auditing tool for software vulnerability detection. Int. Conf. Hybrid Inf. Technol. 2, 505–512 (2006)
Leo, B.: Random forests. Mach. Learn. 45, 5–32 (2001)
Li, Z., Zou, D., Tang, J., Zhang, Z., Sun, M., Jin, H.: A comparative study of deep learning-based vulnerability detection system. IEEE Access. 7, 103184–103197 (2019)
Li, Z., Zou, D., Xu, S. et al.: VulDeePecker: a deep learning based system for vulnerability detection (2018a). arXiv:1801.01681
Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z.: SySeVR: a framework for using deep learning to detect software vulnerabilities. EEE Trans. Depend. Secure Comput. (2018b). arXiv:1807.06756
Li, M., Li, C., Li, S., Wu, Y., Zhang, B., Wen, Y.: ACGVD: vulnerability detection based on comprehensive graph via graph neural network with attention. In: ICICS 2021: information and communications security, 243–259 (2021)
Lin, G., Wen, S., Han, Q.L., Zhang, J., Xiang, Y.: Software vulnerability detection using deep neural networks: a survey. Proc. IEEE 108(10), 1825–1848 (2020)
Lin, G., et al.: Software vulnerability discovery via learning multi-domain knowledge bases. IEEE Trans. Depend. Secure Comput. 18(5), 2469–2485 (2021)
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Stoyanov, V.: RoBERTa: a robustly optimized bert pretraining approach (2019). arXiv:1907.11692
Martínez Torres, J., Iglesias Comesaña, C., García-Nieto, P.J.: Review: machine learning techniques applied to cybersecurity. Int. J. Mach. Learn. Cyber. 10, 2823–2836 (2019)
Russell, R. et al.: Automated vulnerability detection in source code using deep representation learning. In: 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 757–762 (2018)
Russell, R.L., et al.: Automated vulnerability detection in source code using deep representation learning (2018)
Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units (2015). arXiv:1508.07909
Shai, S.S., Shai, B.D.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
Tang, G., Yang, L., Ren, S., Meng, L., Yang, F., Wang, H.: An automatic source code vulnerability detection approach based on KELM. Mach. Learn. Cybersecur. Privacy Public Saf. Opport. Challeng. Emerg. Appl. (2021)
Tian, H., Xu, J., Lian, K., Zhang, Y.: Research on strong-association rule based web application vulnerability detection. In: International Conference on Computer Science and Information Technology (CSIT). 2 (2009).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems 30 (NIPS 2017) (2017).
Wang, H., Ye, G., Tang, Z., Tan, S.H., et al.: Combining graph-based learning with automated data collection for code vulnerability detection. IEEE Trans. Inf. Forens. Secur. 16, 1943–1958 (2020)
Wang, Y., Wang, W., Joty, S., Hoi, S.C.H.:. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation (2021). arXiv:2109.00859
Wu, P., Yin, L., Du, X., Jia, L., Dong, W.: Graph-based Vulnerability detection via extracting features from sliced code. In: IEEE 20th International Conference on Software Quality, Reliability and Security Companion (QRS-C) (2020)
Yamaguchi, F., Lottmann, M., Rieck, K.: Generalized vulnerability extrapolation using abstract syntax trees. Ann. Comput. Secur. Appl. Conf. 28, 358–368 (2012)
Zeng, P., Lin, G., Pan, L., Tai, Y., Zhang, J.: Software vulnerability analysis and discovery using deep learning techniques: a survey. IEEE Access. 8, 197158–197172 (2020)
Zheng, W., Gao, J., Wu, X. et al.: The impact factors on the performance of machine learning-based vulnerability detection: a comparative study. J. Syst. Softw. (2020).
Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y.: Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Adv. Neural Inf. Process. Syst. 10197–10207 (2019)
Funding
No funding was received for this work.
Author information
Authors and Affiliations
Contributions
CDX raised the idea, initialized the project and designed the experiments; N and P carried out the experiments under the supervision of CDX; Both authors analyze the data and results; CDX wrote the paper.
Corresponding author
Ethics declarations
Competing interests
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Do, C.X., Luu, N.T. & Nguyen, P.T.L. Optimizing software vulnerability detection using RoBERTa and machine learning. Autom Softw Eng 31, 40 (2024). https://doi.org/10.1007/s10515-024-00440-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s10515-024-00440-1