Machine learning techniques for software vulnerability prediction: a comparative study

Applied Intelligence

Abstract

Software vulnerabilities are a major cause of security problems. Vulnerability discovery models (VDMs) attempt to model the rate at which vulnerabilities are discovered in software. Although several VDMs have been proposed, not all of them are universally applicable, and most seldom give accurate predictions for every type of vulnerability dataset. Machine learning (ML) techniques, by contrast, have generally found success in a wide range of predictive tasks. In this paper, we therefore conduct an empirical study that applies well-known ML techniques, as well as statistical techniques, to predict software vulnerabilities on a variety of datasets. The following ML techniques are evaluated: cascade-forward back propagation neural network, feed-forward back propagation neural network, adaptive neuro-fuzzy inference system, multi-layer perceptron, support vector machine, bagging, M5Rules, M5P and reduced error pruning tree. The following statistical techniques are evaluated: the Alhazmi-Malaiya model, linear regression and the logistic regression model. The applicability of the techniques is examined using two separate approaches: goodness-of-fit, to see how well a model tracks the data, and prediction capability, assessed using different criteria. We observe that the ML techniques show a remarkable improvement over the statistical vulnerability prediction models in predicting software vulnerabilities.
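The two evaluation approaches named in the abstract can be illustrated with a minimal sketch. It is not the study's code or data. It assumes the commonly cited logistic form of the Alhazmi-Malaiya model, Ω(t) = B / (B·C·e^(−A·B·t) + 1), generates a synthetic cumulative-vulnerability curve with made-up parameters, fits a linear regression to the first 18 months, and then reports R² on the fitted months (goodness-of-fit) and mean absolute error on the held-out months (prediction capability). The abstract does not name its criteria, so R² and mean absolute error are assumptions here, as are all function names and parameter values.

```python
from math import exp


def aml_cumulative(t, a, b, c):
    """Cumulative vulnerabilities at time t under the assumed logistic AML form."""
    return b / (b * c * exp(-a * b * t) + 1.0)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic monthly data; the parameters a, b, c are illustrative only.
months = list(range(1, 25))
observed = [aml_cumulative(t, a=0.004, b=100.0, c=0.5) for t in months]

# First 18 months are used for fitting, the last 6 are held out.
train_t, test_t = months[:18], months[18:]
train_y, test_y = observed[:18], observed[18:]

slope, intercept = fit_line(train_t, train_y)
fitted = [slope * t + intercept for t in train_t]
predicted = [slope * t + intercept for t in test_t]

fit_r2 = r_squared(train_y, fitted)           # goodness-of-fit
test_mae = mean_abs_error(test_y, predicted)  # prediction capability

print(f"linear regression R^2 on fitted months: {fit_r2:.3f}")
print(f"linear regression MAE on held-out months: {test_mae:.2f}")
```

Keeping the two measures separate matters because a model can track the fitted months closely and still extrapolate poorly once the discovery curve saturates, which is the kind of gap a comparative study of this sort is designed to expose.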

Acknowledgments

This work has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 957212 and the ECSEL Joint Undertaking (JU) under grant agreement No. 101007350. D. Khan was supported in part by NSFC (No. 62150410433), Shenzhen Basic Research Program (JCYJ20180507182222355) and CAS-PIFI (No. 2020PT0013). We are thankful to the anonymous reviewers for their valuable comments and suggestions.

Author information

Corresponding author

Correspondence to Dawar Khan.

Cite this article

Jabeen, G., Rahim, S., Afzal, W. et al. Machine learning techniques for software vulnerability prediction: a comparative study. Appl Intell 52, 17614–17635 (2022). https://doi.org/10.1007/s10489-022-03350-5

  • DOI: https://doi.org/10.1007/s10489-022-03350-5
