ABSTRACT
Since deep learning (DL) can automatically learn features from source code, it has been widely used to detect vulnerabilities in source code. To achieve scalable vulnerability scanning, some prior studies process the source code directly by treating it as text. To achieve accurate vulnerability detection, other approaches distill the program semantics into graph representations and use them to detect vulnerabilities. In practice, text-based techniques are scalable but not accurate because they lack program semantics, whereas graph-based methods are accurate but not scalable because graph analysis is typically time-consuming.
In this paper, we aim to achieve both scalability and accuracy when scanning large-scale source code for vulnerabilities. Inspired by existing DL-based image classification, which can analyze millions of images accurately, we adopt these techniques for our purpose. Specifically, we propose a novel idea that efficiently converts the source code of a function into an image while preserving the program details. We implement VulCNN and evaluate it on a dataset of 13,687 vulnerable functions and 26,970 non-vulnerable functions. Experimental results show that VulCNN achieves better accuracy than eight state-of-the-art vulnerability detectors (i.e., Checkmarx, FlawFinder, RATS, TokenCNN, VulDeePecker, SySeVR, VulDeeLocator, and Devign). As for scalability, VulCNN is about four times faster than VulDeePecker and SySeVR, about 15 times faster than VulDeeLocator, and about six times faster than Devign. Furthermore, we conduct a case study on more than 25 million lines of code, and the result indicates that VulCNN can detect vulnerabilities at large scale. From the scanning reports, we ultimately discover 73 vulnerabilities that are not reported in NVD.
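The abstract does not spell out how a function becomes an image, so the following is only a minimal sketch of one way the idea could look, not VulCNN's actual pipeline. It assumes each line of a function is embedded as a small vector and then scaled by how connected that line is in a dependency graph, so the rows of the resulting matrix form a single-channel "image". The helper names (`embed_line`, `degree_centrality`, `function_to_image`), the hashed-token embedding, the example function, and the hand-written edges are all hypothetical.

```python
import hashlib


def embed_line(line, dim=8):
    """Toy embedding (assumption): hash each whitespace-separated token into a count vector."""
    vec = [0.0] * dim
    for token in line.split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    return vec


def degree_centrality(num_nodes, edges):
    """Degree of each node divided by (n - 1), treating edges as undirected."""
    degree = [0] * num_nodes
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    denom = max(num_nodes - 1, 1)
    return [d / denom for d in degree]


def function_to_image(lines, edges, dim=8):
    """Return a len(lines) x dim matrix: each row is a line embedding scaled by its centrality."""
    weights = degree_centrality(len(lines), edges)
    return [[w * x for x in embed_line(line, dim)] for line, w in zip(lines, weights)]


# Hypothetical four-line function; edges are hand-written pairs of line indices
# standing in for dependencies that a real program analysis would extract.
lines = [
    "char buf[8];",
    "int n = read_len();",
    "memcpy(buf, src, n);",
    "return buf[0];",
]
edges = [(0, 2), (1, 2), (2, 3), (0, 3)]

image = function_to_image(lines, edges)
for row in image:
    print(row)
```

In this sketch the `memcpy` line touches the most edges, so its row keeps the largest weight; a convolutional classifier fed such matrices would see structurally important lines as brighter rows. A real system would replace the toy embedding and the hand-written edges with learned embeddings and an extracted program graph.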
Index Terms
- VulCNN: an image-inspired scalable vulnerability detection system