Abstract
Vulnerability detection is crucial to protecting software security. Nowadays, deep learning (DL) is the most promising technique to automate this detection task, leveraging its superior ability to extract patterns and representations from large volumes of code. Despite its promise, DL-based vulnerability detection remains in its early stages, with model performance varying considerably across datasets. Drawing insights from other well-explored application areas such as computer vision, we conjecture that the imbalance issue (the number of vulnerable code samples is extremely small) is at the core of this phenomenon. To validate the conjecture, we conduct a comprehensive empirical study involving nine open-source datasets and two state-of-the-art DL models. The results confirm our conjecture. We also obtain insightful findings on how existing imbalance solutions perform in vulnerability detection: they, too, behave differently across datasets and evaluation metrics. Specifically: 1) focal loss is more suitable for improving precision, 2) mean false error and class-balanced loss encourage recall, and 3) random over-sampling facilitates the F1-measure. However, none of them excels across all metrics. To delve deeper, we explore external influences on these solutions and offer insights for developing new ones.
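As context for the imbalance solutions named above, the following is a minimal NumPy sketch of binary focal loss, class-balanced weighting via the effective number of samples, and random over-sampling. It illustrates the ideas only; it is not the implementation evaluated in the study, and the hyper-parameters (gamma, alpha, beta) are illustrative defaults from the cited papers.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss (Lin et al.): down-weights easy, well-classified
    examples so the rare vulnerable class contributes more to the loss."""
    pt = np.where(y == 1, p, 1 - p)          # probability of the true class
    at = np.where(y == 1, alpha, 1 - alpha)  # class-weighting factor
    return -np.mean(at * (1 - pt) ** gamma * np.log(pt + 1e-12))

def class_balanced_weights(counts, beta=0.999):
    """Class-balanced weighting (Cui et al.): weight each class by the
    inverse 'effective number' of samples, (1 - beta^n) / (1 - beta)."""
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    w = 1.0 / effective_num
    return w / w.sum() * len(counts)  # normalise so weights average to 1

def random_oversample(X, y, seed=0):
    """Random over-sampling: duplicate minority-class (label 1) samples
    until both classes have the same size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority = X[y == 1]
    n_extra = int((y == 0).sum() - (y == 1).sum())
    idx = rng.integers(0, len(minority), n_extra)
    X_bal = np.concatenate([X, minority[idx]])
    y_bal = np.concatenate([y, np.ones(n_extra, dtype=y.dtype)])
    return X_bal, y_bal
```

The study's finding that these techniques trade off precision, recall, and F1 differently follows from where each intervenes: the losses reshape gradients per example or per class, while over-sampling changes the training distribution itself.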
This work is funded by the European Union’s Horizon Research and Innovation Programme under Grant Agreement n\(^\circ \)101070303.
Notes
- 4.
Notice: the number of samples differs slightly from the original paper [30] because we removed empty source code files from the provided datasets. Empty files cause compilation errors and degrade model performance.
References
Amankwah, R., Kudjo, P., Yeboah, S.: Evaluation of software vulnerability detection methods and tools: a review. Int. J. Comput. Appl. 169, 22–27 (2017). https://doi.org/10.5120/ijca2017914750
Arusoaie, A., Ciobâca, S., Craciun, V., Gavrilut, D., Lucanu, D.: A comparison of open-source static analysis tools for vulnerability detection in c/c++ code. In: 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp. 161–168. IEEE (2017). https://doi.org/10.1109/SYNASC.2017.00035
Asterisk team: Asterisk website (2022). https://www.asterisk.org/. Accessed 25 Aug 2023
Bellard, F.: QEMU website (2022). https://www.qemu.org/. Accessed 25 Aug 2023
FFmpeg team: Repository of FFmpeg on github (2023). https://github.com/FFmpeg/FFmpeg. Accessed 25 Aug 2023
Bommasani, R., Hudson, D.A., Adeli, E., et al.: On the opportunities and risks of foundation models. CoRR abs/2108.07258 (2021). https://arxiv.org/abs/2108.07258
Brown, T., Mann, B., Ryder, N., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018). https://doi.org/10.1016/j.neunet.2018.07.011
Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep learning based vulnerability detection: are we there yet? IEEE Trans. Softw. Eng. 48(09), 3280–3296 (2022). https://doi.org/10.1109/TSE.2021.3087402
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002). https://doi.org/10.1613/jair.953
Choi, S., Yang, S., Choi, S., Yun, S.: Improving test-time adaptation via shift-agnostic weight regularization and nearest source prototypes. In: Computer Vision - ECCV 2022, pp. 440–458. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19827-4_26
Croft, R., Xie, Y., Babar, M.A.: Data preparation for software vulnerability prediction: a systematic literature review. IEEE Trans. Softw. Eng. 49, 1044–1063 (2022). https://doi.org/10.1109/TSE.2022.3171202
Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9260–9269. IEEE (2019). https://doi.org/10.1109/CVPR.2019.00949
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186. Association for Computational Linguistics (2019). https://aclanthology.org/N19-1423.pdf
Drummond, C., Holte, R.: C4.5, class imbalance, and cost sensitivity: why under-sampling beats oversampling. In: International Conference on Machine Learning Workshop on Learning from Imbalanced Data Sets II, Washington, DC, USA (2003). https://www.site.uottawa.ca/~nat/Workshop2003/drummondc.pdf
Fell, J.: A review of fuzzing tools and methods. PenTest Magazine (2017)
Feng, Z., Guo, D., Tang, D., et al.: CodeBERT: a pre-trained model for programming and natural languages. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1536–1547. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.139
Garg, A., Degiovanni, R., Jimenez, M., Cordy, M., Papadakis, M., Le Traon, Y.: Learning from what we know: how to perform vulnerability prediction using noisy historical data. Empir. Softw. Eng. 27(7) (2022). https://doi.org/10.1007/s10664-022-10197-4
Guo, D., Ren, S., Lu, S., et al.: GraphCodeBERT: pre-training code representations with data flow. In: International Conference on Learning Representations (2021). https://openreview.net/pdf?id=jLoC4ez43PZ
Han, X., Zhang, Z., Ding, N., et al.: Pre-trained models: past, present and future. AI Open 2, 225–250 (2021). https://doi.org/10.1016/j.aiopen.2021.08.002
He, H., Ma, Y.: Imbalanced Learning: Foundations, Algorithms, and Applications, 1st edn. Wiley-IEEE Press, Hoboken (2013)
Huang, C.Y., Dai, H.L.: Learning from class-imbalanced data: review of data driven methods and algorithm driven methods. Data Sci. Finan. Econ. 1(1), 21–36 (2021). https://doi.org/10.3934/DSFE.2021002
Husain, H., Wu, H.H., Gazit, T., Allamanis, M., Brockschmidt, M.: CodeSearchNet challenge: evaluating the state of semantic code search. CoRR abs/1909.09436 (2019). https://arxiv.org/abs/1909.09436
Kim, J., Feldt, R., Yoo, S.: Guiding deep learning system testing using surprise adequacy. In: 41st International Conference on Software Engineering, pp. 1039–1049. IEEE Press (2019). https://doi.org/10.1109/ICSE.2019.00108
Koh, P.W., Sagawa, S., Marklund, H., et al.: WILDS: a benchmark of in-the-wild distribution shifts. In: 38th International Conference on Machine Learning, pp. 5637–5664. PMLR (2021)
Li, Z., Zou, D., Tang, J., Zhang, Z., Sun, M., Jin, H.: A comparative study of deep learning-based vulnerability detection system. IEEE Access 7, 103184–103197 (2019). https://doi.org/10.1109/ACCESS.2019.2930578
Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z.: SySeVR: a framework for using deep learning to detect software vulnerabilities. IEEE Trans. Depend. Secure Comput. 19(04), 2244–2258 (2022). https://doi.org/10.1109/TDSC.2021.3051525
Li, Z., et al.: VulDeePecker: a deep learning-based system for vulnerability detection. In: 25th Annual Network and Distributed System Security Symposium. The Internet Society (2018). https://doi.org/10.14722/ndss.2018.23158
Lin, G., Xiao, W., Zhang, J., Xiang, Y.: Deep learning-based vulnerable function detection: a benchmark. In: Zhou, J., Luo, X., Shen, Q., Xu, Z. (eds.) ICICS 2019. LNCS, vol. 11999, pp. 219–232. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-41579-2_13
Lin, G., et al.: Cross-project transfer representation learning for vulnerable function discovery. IEEE Trans. Ind. Inf. 14(7), 3289–3297 (2018). https://doi.org/10.1109/TII.2018.2821768
Lin, G., et al.: Repository of lin2018 on github (2019). https://github.com/DanielLin1986/TransferRepresentationLearning. Accessed 25 Aug 2023
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 318–327 (2020). https://doi.org/10.1109/TPAMI.2018.2858826
Liu, Y., Ott, M., Goyal, N., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019). https://arxiv.org/abs/1907.11692
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: 33rd Conference on Neural Information Processing Systems (2019)
Lu, S., Guo, D., Ren, S., Huang, J., et al.: CodeXGLUE: a machine learning benchmark dataset for code understanding and generation. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. OpenReview.net (2021). https://openreview.net/forum?id=6lE4dQXaUcb
Lu, S., Guo, D., Ren, S., et al.: Implementation of CodeXGLUE (2022). https://github.com/microsoft/CodeXGLUE. Accessed 25 Aug 2023
Mazuera-Rozo, A., Mojica-Hanke, A., Linares-Vásquez, M., Bavota, G.: Shallow or deep? an empirical study on detecting vulnerabilities using deep learning. In: IEEE/ACM 29th International Conference on Program Comprehension, pp. 276–287 (2021). https://doi.org/10.1109/ICPC52881.2021.00034
Mendoza, J., Mycroft, J., Milbury, L., Kahani, N., Jaskolka, J.: On the effectiveness of data balancing techniques in the context of ml-based test case prioritization. In: 18th International Conference on Predictive Models and Data Analytics in Software Engineering, pp. 72–81. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3558489.3559073
Pidgin team: Pidgin website (2020). https://pidgin.im/. Accessed 25 Aug 2023
Pinconschi, E.: Repository of devign on github (2020). https://github.com/epicosy/devign. Accessed 25 Aug 2023
Leffler, S., et al.: Repository of libtiff on gitlab (2020). https://gitlab.com/libtiff/libtiff. Accessed 25 Aug 2023
Sharma, T., et al.: A survey on machine learning techniques for source code analysis. CoRR abs/2110.09610 (2021). https://arxiv.org/abs/2110.09610
Shen, Z., Chen, S., Coppolino, L.: A survey of automatic software vulnerability detection, program repair, and defect prediction techniques. Secur. Commun. Netw. 2020 (2020). https://doi.org/10.1155/2020/8858010
Shu, R., Xia, T., Williams, L., Menzies, T.: Dazzle: using optimized generative adversarial networks to address the security data class imbalance issue. In: 19th International Conference on Mining Software Repositories, pp. 144–155. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3524842.3528437
Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7463–7472. IEEE Computer Society, Los Alamitos (2019). https://doi.org/10.1109/ICCV.2019.00756
Truta, C., Randers-Pehrson, G., Dilger, A.E., Schalnat, G.E.: Repository of libpng on github (2023). https://github.com/glennrp/libpng. Accessed 25 Aug 2023
Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: 31st Conference on Neural Information Processing Systems. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
VLC team: VLC media player website (2023). https://github.com/videolan/vlc. Accessed 25 Aug 2023
Wang, S., Liu, W., Wu, J., Cao, L., Meng, Q., Kennedy, P.: Training deep neural networks on imbalanced data sets. In: International Joint Conference on Neural Networks, pp. 4368–4374. IEEE (2016). https://doi.org/10.1109/IJCNN.2016.7727770
Yang, Z., Shi, J., He, J., Lo, D.: Natural attack for pre-trained models of code. In: International Conference on Software Engineering, pp. 1482–1493. Association for Computing Machinery (2022). https://doi.org/10.1145/3510003.3510146
You, Y., Zhang, Z., Hsieh, C., Demmel, J.: 100-epoch imagenet training with alexnet in 24 minutes. CoRR abs/1709.05011 (2017). https://arxiv.org/abs/1709.05011
Zhang, H., Li, Z., Li, G., Ma, L., Liu, Y., Jin, Z.: Generating adversarial examples for holding robustness of source code processing models. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1169–1176 (2020). https://doi.org/10.1609/aaai.v34i01.5469
Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y.: Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In: 33rd International Conference on Neural Information Processing Systems, pp. 10197–10207. Curran Associates Inc., Red Hook (2019). https://dl.acm.org/doi/pdf/10.5555/3454287.3455202
Zou, Y., Yu, Z., Vijaya Kumar, B.V.K., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 297–313. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_18
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, Y., Hu, Q., Tang, Q., Traon, Y.L. (2024). An Empirical Study of the Imbalance Issue in Software Vulnerability Detection. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_19
DOI: https://doi.org/10.1007/978-3-031-51482-1_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51481-4
Online ISBN: 978-3-031-51482-1
eBook Packages: Computer Science (R0)