DOI: 10.1145/3510003.3510229
research-article

VulCNN: an image-inspired scalable vulnerability detection system

Published: 05 July 2022

Abstract

Since deep learning (DL) can automatically learn features from source code, it has been widely used to detect source code vulnerabilities. To achieve scalable vulnerability scanning, some prior studies process the source code directly by treating it as text. To achieve accurate vulnerability detection, other approaches distill the program semantics into graph representations and use them to detect vulnerabilities. In practice, text-based techniques are scalable but inaccurate because they lack program semantics, whereas graph-based methods are accurate but not scalable because graph analysis is typically time-consuming.
In this paper, we aim to achieve both scalability and accuracy when scanning large-scale source code for vulnerabilities. Inspired by existing DL-based image classification, which can analyze millions of images accurately, we apply these techniques to accomplish our purpose. Specifically, we propose a novel idea that efficiently converts the source code of a function into an image while preserving the program details. We implement VulCNN and evaluate it on a dataset of 13,687 vulnerable functions and 26,970 non-vulnerable functions. Experimental results show that VulCNN achieves better accuracy than eight state-of-the-art vulnerability detectors (i.e., Checkmarx, FlawFinder, RATS, TokenCNN, VulDeePecker, SySeVR, VulDeeLocator, and Devign). As for scalability, VulCNN is about four times faster than VulDeePecker and SySeVR, about 15 times faster than VulDeeLocator, and about six times faster than Devign. Furthermore, we conduct a case study on more than 25 million lines of code, and the result indicates that VulCNN can detect vulnerabilities at large scale. From the scanning reports, we ultimately discover 73 vulnerabilities that are not reported in NVD.
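
To make the idea of turning a function into an image-like input concrete, here is a toy sketch. It is not the VulCNN pipeline: the abstract does not specify the conversion, so the hashed token counts, the grid size (`WIDTH`, `HEIGHT`), and the helper names `line_vector` and `function_to_image` are all assumptions made for illustration, and the sketch makes no attempt to preserve program semantics.

```python
import hashlib
import re

WIDTH = 8   # hypothetical length of each per-line vector (image width)
HEIGHT = 6  # hypothetical fixed number of rows (image height)


def line_vector(line, width=WIDTH):
    """Hash every token of one source line into `width` buckets and count hits."""
    vec = [0.0] * width
    # Tokens: identifiers/keywords, integer literals, or any single symbol.
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", line):
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % width] += 1.0
    return vec


def function_to_image(source, height=HEIGHT, width=WIDTH):
    """Stack per-line vectors into a height x width grid, truncating or zero-padding."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    rows = [line_vector(ln, width) for ln in lines[:height]]
    rows += [[0.0] * width for _ in range(height - len(rows))]
    return rows


example = """
void copy(char *dst, const char *src) {
    strcpy(dst, src);
}
"""
image = function_to_image(example)
```

Every function maps to a grid of the same shape regardless of its length, which is what lets an off-the-shelf image classifier consume it in fixed-size batches.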

    Published In

    ICSE '22: Proceedings of the 44th International Conference on Software Engineering
    May 2022
    2508 pages
    ISBN:9781450392211
    DOI:10.1145/3510003

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. CNN
    2. image
    3. large scale
    4. vulnerability detection

    Qualifiers

    • Research-article

    Funding Sources

    • the Key Program of National Science Foundation of China

    Conference

    ICSE '22

    Acceptance Rates

    Overall Acceptance Rate 276 of 1,856 submissions, 15%

    Cited By

    • (2025) Vulnerability detection with graph enhancement and global dependency representation learning. Automated Software Engineering 32:1. DOI: 10.1007/s10515-024-00484-3. Online publication date: 5-Jan-2025
    • (2024) FVD-DPM. Proceedings of the 33rd USENIX Conference on Security Symposium (7375-7392). DOI: 10.5555/3698900.3699312. Online publication date: 14-Aug-2024
    • (2024) Syntactic–Semantic Detection of Clone-Caused Vulnerabilities in the IoT Devices. Sensors 24:22 (7251). DOI: 10.3390/s24227251. Online publication date: 13-Nov-2024
    • (2024) A Comprehensive Review and Assessment of Cybersecurity Vulnerability Detection Methodologies. Journal of Cybersecurity and Privacy 4:4 (853-908). DOI: 10.3390/jcp4040040. Online publication date: 7-Oct-2024
    • (2024) Bridging the Gap: A Survey and Classification of Research-Informed Ethical Hacking Tools. Journal of Cybersecurity and Privacy 4:3 (410-448). DOI: 10.3390/jcp4030021. Online publication date: 16-Jul-2024
    • (2024) Vul-Mixer: Efficient and Effective Machine Learning–Assisted Software Vulnerability Detection. Electronics 13:13 (2538). DOI: 10.3390/electronics13132538. Online publication date: 28-Jun-2024
    • (2024) HotCFuzz: Enhancing Vulnerability Detection through Fuzzing and Hotspot Code Coverage Analysis. Electronics 13:10 (1909). DOI: 10.3390/electronics13101909. Online publication date: 13-May-2024
    • (2024) A Method for Processing Static Analysis Alarms Based on Deep Learning. Applied Sciences 14:13 (5542). DOI: 10.3390/app14135542. Online publication date: 26-Jun-2024
    • (2024) Unveil the Mystery of Critical Software Vulnerabilities. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (138-149). DOI: 10.1145/3663529.3663835. Online publication date: 10-Jul-2024
    • (2024) SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (235-247). DOI: 10.1145/3650212.3652124. Online publication date: 11-Sep-2024
