Toward Large-Scale Vulnerability Discovery using Machine Learning

Authors: Gustavo Grieco, Guillermo Luis Grinblat, Josselin Feist, Laurent Mounier

CODASPY '16: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy

Pages 85 - 96

https://doi.org/10.1145/2857705.2857720

Published: 09 March 2016

Abstract

With the sustained growth of software complexity, finding security vulnerabilities in operating systems has become a necessity. Operating systems now ship with thousands of binary executables. Unfortunately, methodologies and tools for testing programs at OS scale within a limited time budget are still missing.

In this paper we present an approach that applies machine learning to lightweight static and dynamic features to predict whether a test case is likely to contain a software vulnerability. To show the effectiveness of our approach, we set up a large experiment to detect easily exploitable memory corruptions using 1039 Debian programs obtained from the Debian bug tracker: we collected 138,308 unique execution traces and statically explored 76,083 different subsequences of function calls. We managed to predict with reasonable accuracy which programs contained dangerous memory corruptions.

We also developed and implemented VDiscover, a tool that uses state-of-the-art machine learning techniques to predict vulnerabilities in test cases. The tool will be released as open source to encourage research on large-scale vulnerability discovery, together with VDiscovery, a public dataset of the raw analyzed data.
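
As a rough illustration of the idea in the abstract, the sketch below turns a sequence of function calls into counts of short call subsequences and feeds them to a small classifier. It is not VDiscover's actual pipeline: the names `call_ngrams` and `NgramNaiveBayes`, the choice of multinomial naive Bayes, the n-gram length, the toy traces, and the labels `flagged` and `clean` are all assumptions made for this example.

```python
from collections import Counter
from math import log


def call_ngrams(trace, n=2):
    """Count contiguous length-n subsequences of calls in one trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


class NgramNaiveBayes:
    """Multinomial naive Bayes over call n-grams, with add-one smoothing.

    Hypothetical stand-in classifier; the paper's own models may differ.
    """

    def fit(self, traces, labels, n=2):
        self.n = n
        self.counts = {}             # label -> Counter of n-gram counts
        self.docs = Counter(labels)  # label -> number of training traces
        for trace, label in zip(traces, labels):
            self.counts.setdefault(label, Counter()).update(call_ngrams(trace, n))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, trace):
        feats = call_ngrams(trace, self.n)
        total_docs = sum(self.docs.values())
        best_label, best_score = None, float("-inf")
        for label, grams in self.counts.items():
            # Smoothed log-likelihood of the trace's n-grams under this label.
            denom = sum(grams.values()) + len(self.vocab)
            score = log(self.docs[label] / total_docs)
            for gram, k in feats.items():
                score += k * log((grams[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up traces; real inputs would come from traced program executions.
train_traces = [
    ["fopen", "fread", "strcpy", "fclose"],
    ["fopen", "fread", "strcpy", "strcat", "fclose"],
    ["fopen", "fgets", "strncpy", "fclose"],
    ["fopen", "fgets", "strlen", "fclose"],
]
train_labels = ["flagged", "flagged", "clean", "clean"]

model = NgramNaiveBayes().fit(train_traces, train_labels, n=2)
prediction = model.predict(["fopen", "fread", "strcpy", "fclose"])
```

Representing a trace as a bag of call n-grams keeps feature extraction cheap, which matters when the number of traces is in the hundreds of thousands; any classifier that accepts sparse count vectors could replace the one used here.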

Index Terms

Toward Large-Scale Vulnerability Discovery using Machine Learning
1. Security and privacy
  1. Security services
    1. Access control
2. Software and its engineering
  1. Software organization and properties
    1. Software functional properties
      1. Formal methods
        Automated static analysis
        Software verification

Published In

CODASPY '16: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy

March 2016

340 pages

ISBN:9781450339353

DOI:10.1145/2857705

General Chairs: Elisa Bertino (Purdue University, USA), Ravi Sandhu (University of Texas at San Antonio, USA)

Program Chair: Alexander Pretschner (Technische Universität München, Germany)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CODASPY'16: Sixth ACM Conference on Data and Application Security and Privacy

Sponsor: SIGSAC

March 9-11, 2016

New Orleans, Louisiana, USA

Acceptance Rates

CODASPY '16 Paper Acceptance Rate 22 of 115 submissions, 19%;

Overall Acceptance Rate 149 of 789 submissions, 19%

Article Metrics

Total Citations: 169

Total Downloads: 2,000

Downloads (Last 12 months): 144

Downloads (Last 6 weeks): 13

Reflects downloads up to 08 Mar 2025

Cited By

Zhu X, Zhou W, Han Q, Ma W, Wen S, Xiang Y (2025). When Software Security Meets Large Language Models: A Survey. IEEE/CAA Journal of Automatica Sinica 12:2 (317-334). Online publication date: Feb-2025.
https://doi.org/10.1109/JAS.2024.124971
Du X, Zhou Y, Du H (2025). DMVL4AVD: a deep multi-view learning model for automated vulnerability detection. Neural Computing and Applications 37:8 (5873-5889). Online publication date: 6-Jan-2025.
https://doi.org/10.1007/s00521-024-10892-x
Wei X, Jinghao H, Zhengzhang H, Tao W, Chao P (2024). Vulnerability Detection Method Based on Word Vector Model. Scientific Insights and Discoveries Review 2:1 (227-237). Online publication date: 7-Oct-2024.
https://doi.org/10.59782/sidr.v2i1.119
Bagheri A, Hegedűs P (2024). Towards a Block-Level ML-Based Python Vulnerability Detection Tool. Acta Cybernetica 26:3 (323-371). Online publication date: 22-Jul-2024.
https://doi.org/10.14232/actacyb.299667
Yuan S, Liu C, Shi J, Liu X, Pu W, Yu J, Yang L (2024). A Static Detection Method for Code Defects Based on Transformer. Proceedings of the 2024 3rd International Conference on Networks, Communications and Information Technology (104-111). Online publication date: 7-Jun-2024.
https://dl.acm.org/doi/10.1145/3672121.3672141
Nguyen V, Le T, Tantithamthavorn C, Grundy J, Phung D (2024). Deep Domain Adaptation With Max-Margin Principle for Cross-Project Imbalanced Software Vulnerability Detection. ACM Transactions on Software Engineering and Methodology 33:6 (1-34). Online publication date: 27-Jun-2024.
https://dl.acm.org/doi/10.1145/3664602
Shao C, Li G, Wu J, Zheng X (2024). Exploring Semantic Redundancy using Backdoor Triggers: A Complementary Insight into the Challenges Facing DNN-based Software Vulnerability Detection. ACM Transactions on Software Engineering and Methodology 33:4 (1-28). Online publication date: 24-Jan-2024.
https://dl.acm.org/doi/10.1145/3640333
Wang X, Hu R, Gao C, Wen X, Chen Y, Liao Q, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). ReposVul: A Repository-Level High-Quality Vulnerability Dataset. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (472-483). Online publication date: 14-Apr-2024.
https://dl.acm.org/doi/10.1145/3639478.3647634
Chen Z, Hu X, Xia X, Gao Y, Xu T, Lo D, Yang X, Roychoudhury A, Paiva A, Abreu R, Storey M (2024). Exploiting Library Vulnerability via Migration Based Automating Test Generation. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). Online publication date: 20-May-2024.
https://dl.acm.org/doi/10.1145/3597503.3639583
Wen X, Gao C, Luo F, Wang H, Li G, Liao Q (2024). LIVABLE: Exploring Long-Tailed Classification of Software Vulnerability Types. IEEE Transactions on Software Engineering 50:6 (1325-1339). Online publication date: Jun-2024.
https://doi.org/10.1109/TSE.2024.3382361
