Abstract
Detecting vulnerabilities in software is crucial to guarantee the security of software systems. Most previous methods focus on training a classification or regression model on the text feature of the source code to predict vulnerabilities. However, it is not always easy to obtain the labeled vulnerabilities in practical applications, and using only the text feature is insufficient to find the vulnerabilities in complex software systems. To address these problems, in this paper, we propose an unsupervised method to detect vulnerable software functions, which uses both text and dependency features of the source code to improve the detection accuracy. Specifically, we first extract the text and dependency features from the source code and concatenate them to the combined feature. We then learn a deep autoencoder to transform the combined feature into low-dimensional embedding. We finally apply an outlier detection method on the embedding to predict the vulnerable functions. We extensively evaluated the proposed method on seven C/C++ program datasets, and the results illustrate that our method improves F1 score on average of 88 and 66% over comparison methods Rats and Joern, which verifies the effectiveness of our method.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00500-022-07775-5/MediaObjects/500_2022_7775_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00500-022-07775-5/MediaObjects/500_2022_7775_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00500-022-07775-5/MediaObjects/500_2022_7775_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00500-022-07775-5/MediaObjects/500_2022_7775_Fig4_HTML.png)
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Data Availability
Enquiries about data availability should be directed to the authors.
References
Aggarwal CC (2015) Time series and multidimensional streaming outlier detection. Outlier Analysis. Springer, New York, pp 225–264
Anowar F, Sadaoui S, Selim B (2021) Conceptual and empirical comparison of dimensionality reduction algorithms (pca, kpca, lda, mds, svd, lle, isomap, le, ica, t-sne). Comput Sci Rev 40(100):378
Aremu OO, Hyland-Wood D, McAree PR (2020) A machine learning approach to circumventing the curse of dimensionality in discontinuous time series machine data. Reliab Eng Syst Safety 195(106):706
Breunig MM, Kriegel HP, Ng RT (2000) LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on Management of data. ACM, Dallas, Texas, USA, pp 93–104
Chakraborty S, Krishna R, Ding Y (2022) Deep learning based vulnerability detection: are we there yet. IEEE Trans Softw Eng 48(9):3280–3296
Chibotaru V, Bichsel B, Raychev V (2019) Scalable taint specification inference with big code. In: Proceedings of the 40th ACM SIGPLAN conference on programming language design and implementation (PLDI ’19). ACM, Phoenix, AZ, pp 760–774
Dey T, Karnauch A, Mockus A (2021) Representation of developer expertise in open source software. In: 2021 IEEE/ACM 43rd international conference on software engineering (ICSE 2021). IEEE, Electr network, pp 995–1007
Duan X, Wu J, Luo T (2020) Vulnerability mining method based on code property graph and attention BILSTM. J Softw 31(11):3404–3420
Filus K, Boryszko P, Domanska J et al (2021) Efficient feature selection for static analysis vulnerability prediction. Sensors 21(4):1133
Han J, Pei J, Kamber M (eds) (2011) Data mining: concepts and techniques. Elsevier, USA
Hata H, Mizuno O, Kikuno T (2010) Fault-prone module detection using large-scale text features based on spam filtering. Empir Softw Eng 15:147–165
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Landman D, Serebrenik A, Vinju JJ (2017) Challenges for static analysis of java refection-literature review and empirical study. In: 39th IEEE/ACM international conference on software engineering (ICSE). IEEE, Buenos Aires, ARGENTINA, pp 507–518
Li B, Zhou Y, Wang Y (2005) Matrixbased component dependence representation and its applications in software quality assurance. ACM SIGPLAN Notices 40:29–36
Li Y, Xue Y, Chen H (2019) Cerebro: Context-aware adaptive fuzzing for effective vulnerability detection. In: ESEC/FSE’2019 proceedings of the 2019 27th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. ACM, Tallinn, ESTONIA, pp 533–544
Li Z, Zou D, Xu S (2021) Sysevr: A framework for using deep learning to detect software vulnerabilities. IEEE Trans Depend Secur Comput
Lin G, Wen S, Han QL (2020) Software vulnerability detection using deep neural networks: a survey. Proc IEEE 108(10):1825–1848
Liu Z, Qian P, Wang X (2021) Combining graph neural networks with expert knowledge for smart contract vulnerability detection. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2021.3095196
Neuhaus S, Zimmermann T, Holler C (2007) Predicting vulnerable software components. In: 14th ACM conference on computer and communication security. ACM, Alexandria, VA, pp 529–540
Nguyen VH, Tran LMS (2010) Predicting vulnerable software components with dependency graphs. In: Proceedings of the 6th international workshop on security measurements and metrics, pp 1–8
Pang Y, Xue X, Namin A (2015) Predicting vulnerable software components through n-gram analysis and statistical feature selection. In: 2015 IEEE 14th international conference on machine learning and applications (ICMLA). IEEE, Miami, pp 543–548
Pang Y, Xue X, Wang H (2017) Predicting vulnerable software components through deep neural network. In: Proceedings of the 2017 international conference on deep learning technologies. ACM, Chengdu, China, pp 6–10
Perl H, Dechand S, Smith M (2015) Vccfinder: Finding potential vulnerabilities in open-source projects to assist code audits. In: 22nd ACM SIGSAC conference on computer and communications security (CCS). ACM, Denver, CO, pp 426–437
Qasem A, Shirani P, Debbabi M (2021) Automatic vulnerability detection in embedded devices and firmware: survey and layered taxonomies. ACM Comput Surv 54(2):1–42
Russell RL, Kim L, Hamilton LH (2018) Automated vulnerability detection in source code using deep representation learning. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE, Orlando, FL, pp 757–762
Şahin CB, Dinler ÖB, Abualigah L (2021) Prediction of software vulnerability based deep symbiotic genetic algorithms: phenotyping of dominant-features. Appl Intell 51(11):8271–8287
Shirey R (2007) Internet security glossary, version 2. RFC 4949:1–365
Sun H, Cui L, Li L (2021) Vdsimilar: Vulnerability detection based on code similarity of vulnerabilities and patches. Comput Secur 110(102):417
Xie J, Girshick R, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: Weinberger K (ed) Balcan M. Unsupervised deep embedding for clustering analysis, New York, pp 478–487
Yamaguchi F, Maier A, Gascon H (2015) Automatic inference of search patterns for taint-style vulnerabilities. In: 2015 IEEE symposium on security and privacy SP 2015. IEEE, San Jose, CA, pp 797–812
Yan H, Sui Y, Chen S (2017) Machine-learning-guided typestate analysis for static use-after-free detection. In: 33rd annual computer security applications conference (ACSAC 2017). ACM, Orlando, FL, pp 42–54
Zhou C, Liu Y, Liu X (2017) Scalable graph embedding for asymmetric proximity. In: 31st AAAI conference on artificial intelligence. AAAI, San Francisco, CA, pp 2942–2948
Zhou Y, Liu S, Siow J (2019) Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Adv Neural Inf Proces Syst 32(10):197–207
Zou D, Wang S, Xu S (2019) \(\mu \)vuldeepecker: A deep learning-based system for multiclass vulnerability detection. IEEE Trans Depend Secur Comput 18(5):2224–2236
Acknowledgements
This work was supported by Natural Science Foundation of YunNan Provincial Department of Education (2019J0942).
Funding
The authors have not disclosed any funding.
Author information
Authors and Affiliations
Contributions
WX involved in conceptualization, methodology, experiment and writing—original draft. TL involved in writing—review and editing. JW involved in writing—review and editing. YT involved in data curation, resources and writing—review.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This material is the authors’ own original work, which has not been previously published elsewhere. The paper is not currently being considered for publication elsewhere. The paper reflects the authors’ own research and analysis in a truthful and complete manner.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, W., Li, T., Wang, J. et al. Detecting vulnerable software functions via text and dependency features. Soft Comput 27, 5425–5435 (2023). https://doi.org/10.1007/s00500-022-07775-5
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00500-022-07775-5