An Empirical Study of the Imbalance Issue in Software Vulnerability Detection

  • Conference paper
  • Computer Security – ESORICS 2023 (ESORICS 2023)

Abstract

Vulnerability detection is crucial for protecting software security. Nowadays, deep learning (DL) is the most promising technique for automating this detection task, owing to its superior ability to extract patterns and representations from large volumes of code. Despite this promise, DL-based vulnerability detection remains in its early stages, with model performance varying considerably across datasets. Drawing on insights from better-explored application areas such as computer vision, we conjecture that the imbalance issue (the proportion of vulnerable code samples is extremely small) lies at the core of this phenomenon. To validate this conjecture, we conduct a comprehensive empirical study involving nine open-source datasets and two state-of-the-art DL models. The results confirm our conjecture. We also obtain insightful findings on how existing imbalance solutions perform in vulnerability detection: they, too, perform differently across datasets and evaluation metrics. Specifically, 1) focal loss is more suitable for improving precision, 2) mean false error and class-balanced loss encourage recall, and 3) random over-sampling facilitates the F1-measure. However, none of them excels across all metrics. To delve deeper, we explore external influences on these solutions and offer insights for developing new ones.
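The imbalance remedies compared in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, constants (gamma=2.0, alpha=0.25, beta=0.999), and the NumPy-only setting are illustrative assumptions. Focal loss down-weights easy, well-classified examples; class-balanced loss reweights classes by their "effective number" of samples; random over-sampling duplicates minority samples until classes are balanced.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Focal loss (Lin et al.): down-weights easy examples via (1 - p_t)^gamma.

    probs  -- predicted probability of the vulnerable (positive) class
    labels -- ground-truth labels, 0 (clean) or 1 (vulnerable)
    """
    p_t = np.where(labels == 1, probs, 1.0 - probs)      # prob of the true class
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)  # per-class weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

def class_balanced_weights(counts, beta=0.999):
    """Class-balanced weights (Cui et al.): w_c proportional to (1 - beta) / (1 - beta^n_c)."""
    weights = (1.0 - beta) / (1.0 - np.power(beta, counts))
    return weights / weights.sum() * len(counts)  # normalize to sum to #classes

def random_oversample(X, y, seed=0):
    """Randomly duplicate minority-class samples until all classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    picked = [rng.choice(np.where(y == c)[0], size=n_max, replace=True)
              for c in classes]
    idx = np.concatenate(picked)
    return X[idx], y[idx]
```

For example, with 8 clean and 2 vulnerable samples, `random_oversample` returns 8 of each; and the focal loss on a confidently correct prediction (p = 0.9) is far below the corresponding alpha-weighted cross-entropy, which is precisely what shifts training effort toward hard examples.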

This work is funded by the European Union’s Horizon Research and Innovation Programme under Grant Agreement No. 101070303.

Notes

  1. https://www.cvedetails.com/cve/CVE-2017-7597/?q=CVE-2017-7597.

  2. https://github.com/testing-cs/vulnerability-detection.git.

  3. The term “foundation model” is used in this paper because, in the literature, a “pre-trained model” can also mean a model trained by someone else and targeting a similar task [11, 24].

  4. Notice: the number of data points differs slightly from the original paper [30] because we remove empty source code files from the provided datasets. Empty files cause compilation errors and degrade model performance.

  5. https://github.com/microsoft/CodeXGLUE.

  6. https://github.com/microsoft/CodeBERT.

  7. https://huggingface.co/microsoft/codebert-base.

  8. https://huggingface.co/microsoft/graphcodebert-base.

References

  1. Amankwah, R., Kudjo, P., Yeboah, S.: Evaluation of software vulnerability detection methods and tools: a review. Int. J. Comput. Appl. 169, 22–27 (2017). https://doi.org/10.5120/ijca2017914750

  2. Arusoaie, A., Ciobâca, S., Craciun, V., Gavrilut, D., Lucanu, D.: A comparison of open-source static analysis tools for vulnerability detection in c/c++ code. In: 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp. 161–168. IEEE (2017). https://doi.org/10.1109/SYNASC.2017.00035

  3. Asterisk team: Asterisk website (2022). https://www.asterisk.org/. Accessed 25 Aug 2023

  4. Bellard, F.: Qemu website (2022). https://www.qemu.org/. Accessed 25 Aug 2023

  5. Bellard, F.: FFmpeg team: Repository of ffmpeg on github (2023). https://github.com/FFmpeg/FFmpeg. Accessed 25 Aug 2023

  6. Bommasani, R., Hudson, D.A., Adeli, E., et al.: On the opportunities and risks of foundation models. CoRR abs/2108.07258 (2021). https://arxiv.org/abs/2108.07258

  7. Brown, T., Mann, B., Ryder, N., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  8. Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018). https://doi.org/10.1016/j.neunet.2018.07.011

  9. Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep learning based vulnerability detection: are we there yet? IEEE Trans. Softw. Eng. 48(09), 3280–3296 (2022). https://doi.org/10.1109/TSE.2021.3087402

  10. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002). https://doi.org/10.1613/jair.953

  11. Choi, S., Yang, S., Choi, S., Yun, S.: Improving test-time adaptation via shift-agnostic weight regularization and nearest source prototypes. In: Computer Vision - ECCV 2022, pp. 440–458. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19827-4_26

  12. Croft, R., Xie, Y., Babar, M.A.: Data preparation for software vulnerability prediction: a systematic literature review. IEEE Trans. Softw. Eng. 49, 1044–1063 (2022). https://doi.org/10.1109/TSE.2022.3171202

  13. Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9260–9269. IEEE (2019). https://doi.org/10.1109/CVPR.2019.00949

  14. Devlin, J., Chang, M., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186. Association for Computational Linguistics (2019). https://aclanthology.org/N19-1423.pdf

  15. Drummond, C., Holte, R.: C4.5, class imbalance, and cost sensitivity: why under-sampling beats oversampling. In: International Conference on Machine Learning Workshop on Learning from Imbalanced Data Sets II, Washington, DC, USA (2003). https://www.site.uottawa.ca/~nat/Workshop2003/drummondc.pdf

  16. Fell, J.: A review of fuzzing tools and methods. PenTest Magazine (2017)

  17. Feng, Z., Guo, D., Tang, D., et al.: Codebert: a pre-trained model for programming and natural languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1536–1547. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.139

  18. Garg, A., Degiovanni, R., Jimenez, M., Cordy, M., Papadakis, M., Le Traon, Y.: Learning from what we know: how to perform vulnerability prediction using noisy historical data. Empir. Softw. Eng. 27(7) (2022). https://doi.org/10.1007/s10664-022-10197-4

  19. Guo, D., Ren, S., Lu, S., et al.: Graphcodebert: pre-training code representations with data flow. In: International Conference on Learning Representations (2021). https://openreview.net/pdf?id=jLoC4ez43PZ

  20. Han, X., Zhang, Z., Ding, N., et al.: Pre-trained models: past, present and future. AI Open 2, 225–250 (2021). https://doi.org/10.1016/j.aiopen.2021.08.002

  21. He, H., Ma, Y.: Imbalanced Learning: Foundations, Algorithms, and Applications, 1st edn. Wiley-IEEE Press, Hoboken (2013)

  22. Huang, C.Y., Dai, H.L.: Learning from class-imbalanced data: review of data driven methods and algorithm driven methods. Data Sci. Finan. Econ. 1(1), 21–36 (2021). https://doi.org/10.3934/DSFE.2021002

  23. Husain, H., Wu, H.H., Gazit, T., Allamanis, M., Brockschmidt, M.: Codesearchnet challenge: evaluating the state of semantic code search. CoRR abs/1909.09436 (2019). https://arxiv.org/abs/1909.09436

  24. Kim, J., Feldt, R., Yoo, S.: Guiding deep learning system testing using surprise adequacy. In: 41st International Conference on Software Engineering, pp. 1039–1049. IEEE Press (2019). https://doi.org/10.1109/ICSE.2019.00108

  25. Koh, P.W., Sagawa, S., Marklund, H., et al.: Wilds: a benchmark of in-the-wild distribution shifts. In: 38th International Conference on Machine Learning, pp. 5637–5664. PMLR (2021)

  26. Li, Z., Zou, D., Tang, J., Zhang, Z., Sun, M., Jin, H.: A comparative study of deep learning-based vulnerability detection system. IEEE Access 7, 103184–103197 (2019). https://doi.org/10.1109/ACCESS.2019.2930578

  27. Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z.: Sysevr: a framework for using deep learning to detect software vulnerabilities. IEEE Trans. Depend. Secure Comput. 19(04), 2244–2258 (2022). https://doi.org/10.1109/TDSC.2021.3051525

  28. Li, Z., et al.: Vuldeepecker: a deep learning-based system for vulnerability detection. In: 25th Annual Network and Distributed System Security Symposium. The Internet Society (2018). https://doi.org/10.14722/ndss.2018.23158

  29. Lin, G., Xiao, W., Zhang, J., Xiang, Y.: Deep learning-based vulnerable function detection: a benchmark. In: Zhou, J., Luo, X., Shen, Q., Xu, Z. (eds.) ICICS 2019. LNCS, vol. 11999, pp. 219–232. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-41579-2_13

  30. Lin, G., et al.: Cross-project transfer representation learning for vulnerable function discovery. IEEE Trans. Ind. Inf. 14(7), 3289–3297 (2018). https://doi.org/10.1109/TII.2018.2821768

  31. Lin, G., et al.: Repository of lin2018 on github (2019). https://github.com/DanielLin1986/TransferRepresentationLearning. Accessed 25 Aug 2023

  32. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 318–327 (2020). https://doi.org/10.1109/TPAMI.2018.2858826

  33. Liu, Y., Ott, M., Goyal, N., et al.: Roberta: a robustly optimized bert pretraining approach. CoRR abs/1907.11692 (2019). https://arxiv.org/abs/1907.11692

  34. Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: 33rd Conference on Neural Information Processing Systems (2019)

  35. Lu, S., Guo, D., Ren, S., Huang, J., et al.: Codexglue: a machine learning benchmark dataset for code understanding and generation. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. OpenReview.net (2021). https://openreview.net/forum?id=6lE4dQXaUcb

  36. Lu, S., Guo, D., Ren, S., et al.: Implementation of codexglue. https://github.com/microsoft/CodeXGLUE (2022). Accessed 25 Aug 2023

  37. Mazuera-Rozo, A., Mojica-Hanke, A., Linares-Vásquez, M., Bavota, G.: Shallow or deep? an empirical study on detecting vulnerabilities using deep learning. In: IEEE/ACM 29th International Conference on Program Comprehension, pp. 276–287 (2021). https://doi.org/10.1109/ICPC52881.2021.00034

  38. Mendoza, J., Mycroft, J., Milbury, L., Kahani, N., Jaskolka, J.: On the effectiveness of data balancing techniques in the context of ml-based test case prioritization. In: 18th International Conference on Predictive Models and Data Analytics in Software Engineering, pp. 72–81. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3558489.3559073

  39. Pidgin team: Pidgin website (2020). https://pidgin.im/. Accessed 25 Aug 2023

  40. Pinconschi, E.: Repository of devign on github (2020). https://github.com/epicosy/devign. Accessed 25 Aug 2023

  41. Sam Leffler, S.G.: Repository of libtiff on gitlab (2020). https://gitlab.com/libtiff/libtiff. Accessed 25 Aug 2023

  42. Sharma, T., et al.: A survey on machine learning techniques for source code analysis. CoRR abs/2110.09610 (2021). https://arxiv.org/abs/2110.09610

  43. Shen, Z., Chen, S., Coppolino, L.: A survey of automatic software vulnerability detection, program repair, and defect prediction techniques. Secur. Commun. Netw. 2020 (2020). https://doi.org/10.1155/2020/8858010

  44. Shu, R., Xia, T., Williams, L., Menzies, T.: Dazzle: using optimized generative adversarial networks to address security data class imbalance issue. In: 19th International Conference on Mining Software Repositories, pp. 144–155. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3524842.3528437

  45. Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: Videobert: a joint model for video and language representation learning. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7463–7472. IEEE Computer Society, Los Alamitos (2019). https://doi.org/10.1109/ICCV.2019.00756

  46. Truta, C., Randers-Pehrson, G., Dilger, A.E., Schalnat, G.E.: Repository of libpng on github (2023). https://github.com/glennrp/libpng. Accessed 25 Aug 2023

  47. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: 31st Conference on Neural Information Processing Systems. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  48. VLC team: Vlc media player website (2023). https://github.com/videolan/vlc. Accessed 25 Aug 2023

  49. Wang, S., Liu, W., Wu, J., Cao, L., Meng, Q., Kennedy, P.: Training deep neural networks on imbalanced data sets. In: International Joint Conference on Neural Networks, pp. 4368–4374. IEEE (2016). https://doi.org/10.1109/IJCNN.2016.7727770

  50. Yang, Z., Shi, J., He, J., Lo, D.: Natural attack for pre-trained models of code. In: International Conference on Software Engineering, pp. 1482–1493. Association for Computing Machinery (2022). https://doi.org/10.1145/3510003.3510146

  51. You, Y., Zhang, Z., Hsieh, C., Demmel, J.: 100-epoch imagenet training with alexnet in 24 minutes. CoRR abs/1709.05011 (2017). https://arxiv.org/abs/1709.05011

  52. Zhang, H., Li, Z., Li, G., Ma, L., Liu, Y., Jin, Z.: Generating adversarial examples for holding robustness of source code processing models. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1169–1176 (2020). https://doi.org/10.1609/aaai.v34i01.5469

  54. Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y.: Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In: 33rd International Conference on Neural Information Processing Systems, pp. 10197–10207. Curran Associates Inc., Red Hook (2019). https://dl.acm.org/doi/pdf/10.5555/3454287.3455202

  55. Zou, Y., Yu, Z., Vijaya Kumar, B.V.K., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 297–313. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_18

Author information

Correspondence to Qiang Hu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Guo, Y., Hu, Q., Tang, Q., Traon, Y.L. (2024). An Empirical Study of the Imbalance Issue in Software Vulnerability Detection. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_19

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-51482-1_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-51481-4

  • Online ISBN: 978-3-031-51482-1

  • eBook Packages: Computer Science (R0)
