Vestige: Identifying Binary Code Provenance for Vulnerability Detection

Ji, Yuede; Cui, Lei; Huang, H. Howie

doi:10.1007/978-3-030-78375-4_12

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12727)

Included in the following conference series:

International Conference on Applied Cryptography and Network Security

Abstract

Identifying the compilation provenance of binary code helps to pinpoint the specific compilation tools and configurations that were used to produce the executable. Unfortunately, existing techniques cannot accurately differentiate among closely related executables, especially those generated with only slightly different compilation configurations. To address this problem, we have designed a new provenance identification system, Vestige. We build a new representation of the binary code, the attributed function call graph (AFCG), that covers three types of features: idiom features at the instruction level, graphlet features at the function level, and the function call graph at the binary level. Vestige applies a graph neural network model to the AFCG and generates representative embeddings for provenance identification. Our experiments show that Vestige achieves 96% accuracy on publicly available datasets of more than 6,000 binaries, which is significantly better than previous work. When applied to binary code vulnerability detection, Vestige can help improve the top-1 hit rate of three recent code vulnerability detection methods by up to 27%.
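
As a rough illustration of the representation described in the abstract, the sketch below stores one attribute vector per function together with the call edges, then averages each function's vector with those of its neighbors to obtain a graph-level embedding. It is a minimal sketch under stated assumptions and not the authors' implementation: the names `AFCG`, `propagate`, and `embed` are hypothetical, the attribute vectors stand in for idiom and graphlet features, call edges are treated as undirected, and plain mean aggregation replaces the learned graph neural network that Vestige uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AFCG:
    """Toy attributed function call graph (hypothetical structure)."""
    # Function name -> attribute vector (stand-in for idiom/graphlet features).
    features: dict[str, list[float]]
    # Caller -> list of callees; every name is assumed to appear in `features`.
    calls: dict[str, list[str]] = field(default_factory=dict)

    def neighbors(self, fn: str) -> list[str]:
        # Assumption: call edges are treated as undirected for aggregation.
        out = set(self.calls.get(fn, []))
        out.update(c for c, callees in self.calls.items() if fn in callees)
        out.discard(fn)
        return sorted(out)


def propagate(graph: AFCG) -> dict[str, list[float]]:
    """One round of mean aggregation over each function and its neighbors."""
    updated = {}
    for fn, vec in graph.features.items():
        group = [vec] + [graph.features[n] for n in graph.neighbors(fn)]
        updated[fn] = [sum(col) / len(group) for col in zip(*group)]
    return updated


def embed(graph: AFCG, rounds: int = 2) -> list[float]:
    """Graph-level embedding: mean of node vectors after `rounds` propagations."""
    current = graph
    for _ in range(rounds):
        current = AFCG(propagate(current), graph.calls)
    vectors = list(current.features.values())
    return [sum(col) / len(vectors) for col in zip(*vectors)]
```

In a real system the aggregation step would carry trained weights, and the resulting embedding would be passed to a classifier over compiler and configuration labels; the unweighted mean here only shows how function-level attributes and call-graph structure combine into a single vector.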

Acknowledgment

The authors would like to thank the anonymous reviewers from ACNS’21 for their help in improving this paper. We are also grateful to the authors of Genius, Gemini, and Origin (including Xiaozhu Meng) for sharing the source code and dataset with us. Lei Cui participated in this work while working as a postdoctoral researcher at the George Washington University from June 2017 to July 2018. This work was supported in part by DARPA under agreement number N66001-18-C-4033 and National Science Foundation CAREER award 1350766 and grants 1618706 and 1717774. The views, opinions, and/or findings expressed in this material are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense, National Science Foundation, or the U.S. Government.

Author information

Authors and Affiliations

Graph Computing Lab, George Washington University, Washington, D.C., USA
Yuede Ji, Lei Cui & H. Howie Huang

Corresponding author

Correspondence to Yuede Ji.

Editor information

Editors and Affiliations

Waseda University, Tokyo, Japan
Kazue Sako
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Nils Ole Tippenhauer

Cite this paper

Ji, Y., Cui, L., Huang, H.H. (2021). Vestige: Identifying Binary Code Provenance for Vulnerability Detection. In: Sako, K., Tippenhauer, N.O. (eds) Applied Cryptography and Network Security. ACNS 2021. Lecture Notes in Computer Science, vol 12727. Springer, Cham. https://doi.org/10.1007/978-3-030-78375-4_12

DOI: https://doi.org/10.1007/978-3-030-78375-4_12
Published: 10 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-78374-7
Online ISBN: 978-3-030-78375-4
eBook Packages: Computer Science, Computer Science (R0)
