DOI: 10.1145/2857705.2857720
research-article

Toward Large-Scale Vulnerability Discovery using Machine Learning

Published: 09 March 2016

Abstract

With the sustained growth of software complexity, finding security vulnerabilities in operating systems has become a necessity. Operating systems now ship with thousands of binary executables. Unfortunately, methodologies and tools for testing programs at OS scale within a limited time budget are still missing.
In this paper we present an approach that uses lightweight static and dynamic features, together with machine learning techniques, to predict whether a test case is likely to contain a software vulnerability. To show the effectiveness of our approach, we set up a large experiment to detect easily exploitable memory corruptions using 1039 Debian programs obtained from the Debian bug tracker, collected 138,308 unique execution traces, and statically explored 76,083 different subsequences of function calls. We managed to predict with reasonable accuracy which programs contained dangerous memory corruptions.
We also developed and implemented VDiscover, a tool that uses state-of-the-art machine learning techniques to predict vulnerabilities in test cases. The tool will be released as open source to encourage research on vulnerability discovery at a large scale, together with VDiscovery, a public dataset that collects the raw analyzed data.
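
As an illustration only of the general idea of predicting from subsequences of function calls, the following minimal Python sketch counts call n-grams in a trace and feeds them to a small naive Bayes classifier. It is not VDiscover's implementation: the names `call_ngrams` and `NgramNaiveBayes`, the choice of bigrams and of naive Bayes, and the toy traces with their labels are all assumptions made for this example.

```python
from collections import Counter
from math import log


def call_ngrams(trace, n=2):
    """Count the n-grams (tuples of n consecutive calls) in one trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


class NgramNaiveBayes:
    """Multinomial naive Bayes over call n-grams with add-one smoothing.

    Hypothetical helper for illustration; not the classifier from the paper.
    """

    def __init__(self, n=2):
        self.n = n
        self.counts = {}       # label -> Counter of n-grams seen in training
        self.docs = Counter()  # label -> number of training traces
        self.vocab = set()     # every n-gram seen in training

    def fit(self, traces, labels):
        for trace, label in zip(traces, labels):
            grams = call_ngrams(trace, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def log_score(self, trace, label):
        """Log prior plus smoothed log likelihood of the trace's n-grams."""
        total = sum(self.counts[label].values()) + len(self.vocab)
        score = log(self.docs[label] / sum(self.docs.values()))
        for gram, k in call_ngrams(trace, self.n).items():
            score += k * log((self.counts[label][gram] + 1) / total)
        return score

    def predict(self, trace):
        return max(self.counts, key=lambda label: self.log_score(trace, label))


# Toy traces and labels, invented for this sketch (not data from the paper).
TRAIN = [
    (["fopen", "fread", "strcpy", "fclose"], "flagged"),
    (["getenv", "strcpy", "strcat", "puts"], "flagged"),
    (["fopen", "fgets", "strncpy", "fclose"], "ok"),
    (["getenv", "strlen", "snprintf", "puts"], "ok"),
]

model = NgramNaiveBayes(n=2).fit(
    [trace for trace, _ in TRAIN], [label for _, label in TRAIN]
)
print(model.predict(["fopen", "fread", "strcpy", "puts"]))
```

With these toy traces the final line prints `flagged`, because two of the three bigrams in the query occur only in traces carrying that label. A real experiment would need far more traces, held-out evaluation, and some handling of class imbalance.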


Published In

CODASPY '16: Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy
March 2016
340 pages
ISBN: 9781450339353
DOI: 10.1145/2857705
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dynamic analysis
  2. machine learning
  3. static analysis
  4. vulnerability detection

Qualifiers

  • Research-article

Conference

CODASPY'16
Acceptance Rates

CODASPY '16 Paper Acceptance Rate: 22 of 115 submissions, 19%
Overall Acceptance Rate: 149 of 789 submissions, 19%

Article Metrics

  • Downloads (Last 12 months): 144
  • Downloads (Last 6 weeks): 13

Reflects downloads up to 08 Mar 2025

Cited By

  • (2025) When Software Security Meets Large Language Models: A Survey. IEEE/CAA Journal of Automatica Sinica 12:2 (317-334). DOI: 10.1109/JAS.2024.124971. Online publication date: Feb-2025
  • (2025) DMVL4AVD: a deep multi-view learning model for automated vulnerability detection. Neural Computing and Applications 37:8 (5873-5889). DOI: 10.1007/s00521-024-10892-x. Online publication date: 6-Jan-2025
  • (2024) Vulnerability Detection Method Based on Word Vector Model. Scientific Insights and Discoveries Review 2:1 (227-237). DOI: 10.59782/sidr.v2i1.119. Online publication date: 7-Oct-2024
  • (2024) Towards a Block-Level ML-Based Python Vulnerability Detection Tool. Acta Cybernetica 26:3 (323-371). DOI: 10.14232/actacyb.299667. Online publication date: 22-Jul-2024
  • (2024) A Static Detection Method for Code Defects Based on Transformer. Proceedings of the 2024 3rd International Conference on Networks, Communications and Information Technology (104-111). DOI: 10.1145/3672121.3672141. Online publication date: 7-Jun-2024
  • (2024) Deep Domain Adaptation With Max-Margin Principle for Cross-Project Imbalanced Software Vulnerability Detection. ACM Transactions on Software Engineering and Methodology 33:6 (1-34). DOI: 10.1145/3664602. Online publication date: 27-Jun-2024
  • (2024) Exploring Semantic Redundancy using Backdoor Triggers: A Complementary Insight into the Challenges Facing DNN-based Software Vulnerability Detection. ACM Transactions on Software Engineering and Methodology 33:4 (1-28). DOI: 10.1145/3640333. Online publication date: 24-Jan-2024
  • (2024) ReposVul: A Repository-Level High-Quality Vulnerability Dataset. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (472-483). DOI: 10.1145/3639478.3647634. Online publication date: 14-Apr-2024
  • (2024) Exploiting Library Vulnerability via Migration Based Automating Test Generation. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (1-12). DOI: 10.1145/3597503.3639583. Online publication date: 20-May-2024
  • (2024) LIVABLE: Exploring Long-Tailed Classification of Software Vulnerability Types. IEEE Transactions on Software Engineering 50:6 (1325-1339). DOI: 10.1109/TSE.2024.3382361. Online publication date: Jun-2024
