Optimizing software vulnerability detection using RoBERTa and machine learning

Automated Software Engineering

Abstract

Detecting vulnerabilities in source code written in C and C++ is essential because attackers actively search for and exploit such weaknesses in deployed systems. To make source code vulnerability detection more effective, we propose a new approach that builds and represents source code features using natural language processing (NLP) techniques. The proposal consists of two main stages: (i) building a feature profile of the source code using the RoBERTa model, and (ii) classifying the source code from that feature profile using a supervised machine learning algorithm. By using the pre-trained RoBERTa model, we represent the important features of the source code as complete vectors, which improves the effectiveness of the prediction and vulnerability detection models. In the experiments, we compared our proposal with other approaches on the FFmpeg + Qemu dataset. The results show that our approach outperformed the other approaches on all measures. The use of RoBERTa-based NLP techniques therefore has scientific significance, as a research direction that has not previously been proposed for this task, and practical significance, as it was highly effective in all experiments.
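The two-stage structure described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' implementation: to keep it runnable offline, stage (i) uses a hashed bag-of-tokens encoder as a stand-in for the pre-trained RoBERTa model, and stage (ii) uses a nearest-centroid rule as a stand-in for the supervised classifier. The names `embed`, `fit_centroids`, `predict`, the width `DIM`, and the toy snippets and labels are all hypothetical.

```python
import hashlib
import math
import re
from typing import Dict, List

DIM = 64  # width of the stand-in feature vector (hypothetical choice)


def embed(code: str) -> List[float]:
    """Stage (i) stand-in: map source code to a fixed-length unit vector.

    The article builds this vector with pre-trained RoBERTa; the hashed
    bag-of-tokens used here is only a placeholder for that encoder.
    """
    vec = [0.0] * DIM
    # Split into identifiers and single non-space characters.
    for tok in re.findall(r"[A-Za-z_]\w*|\S", code):
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fit_centroids(vectors: List[List[float]], labels: List[int]) -> Dict[int, List[float]]:
    """Stage (ii) stand-in: average the feature vectors of each class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * DIM)
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids: Dict[int, List[float]], vec: List[float]) -> int:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(label: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vec))
    return min(centroids, key=dist)


# Toy training data (hypothetical): 1 = vulnerable, 0 = not vulnerable.
train_code = [
    "void f(char *s) { char buf[8]; strcpy(buf, s); }",
    "void g(char *s) { char buf[8]; gets(buf); }",
    "void h(char *s) { char buf[8]; strncpy(buf, s, sizeof(buf) - 1); }",
    "int add(int a, int b) { return a + b; }",
]
train_labels = [1, 1, 0, 0]

model = fit_centroids([embed(c) for c in train_code], train_labels)
print(predict(model, embed("void k(char *p) { char b[4]; strcpy(b, p); }")))
```

In a fuller pipeline, `embed` would be replaced by a call to a pre-trained RoBERTa encoder and `fit_centroids`/`predict` by whichever supervised learner is chosen; the separation between feature extraction and classification stays the same.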

Data availability

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request. Replication package URL: https://github.com/Kkyn-ltcode/Optimizing-Software-vulnerability-detection-using-RoBERTa-and-Machine-Learning.

Funding

No funding was received for this work.

Author information

Contributions

CDX raised the idea, initiated the project, and designed the experiments; N and P carried out the experiments under the supervision of CDX; both authors analyzed the data and results; CDX wrote the paper.

Corresponding author

Correspondence to Cho Xuan Do.

Ethics declarations

Competing interests

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Do, C.X., Luu, N.T. & Nguyen, P.T.L. Optimizing software vulnerability detection using RoBERTa and machine learning. Autom Softw Eng 31, 40 (2024). https://doi.org/10.1007/s10515-024-00440-1
