Vestige: Identifying Binary Code Provenance for Vulnerability Detection

  • Conference paper
Applied Cryptography and Network Security (ACNS 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12727)

Abstract

Identifying the compilation provenance of binary code pinpoints the specific compilation tools and configurations used to produce an executable. Unfortunately, existing techniques cannot accurately differentiate among closely related executables, especially those generated with only slightly different compilation configurations. To address this problem, we design a new provenance identification system, Vestige. We build a new representation of binary code, the attributed function call graph (AFCG), which covers three types of features: idiom features at the instruction level, graphlet features at the function level, and the function call graph at the binary level. Vestige applies a graph neural network model to the AFCG and generates representative embeddings for provenance identification. Experiments show that Vestige achieves 96% accuracy on publicly available datasets of more than 6,000 binaries, which is significantly better than previous work. When applied to binary code vulnerability detection, Vestige improves the top-1 hit rate of three recent code vulnerability detection methods by up to 27%.
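
To make the three feature levels concrete, the following is a minimal sketch of how an attributed function call graph could be represented and reduced to a single vector. It is not the authors' implementation: the `AFCG` class, the helper names, the toy feature counts, and the use of plain mean aggregation in place of a trained graph neural network are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AFCG:
    """Toy attributed function call graph (hypothetical structure)."""
    attrs: dict = field(default_factory=dict)  # function name -> attribute vector
    calls: dict = field(default_factory=dict)  # function name -> set of callees

    def add_function(self, name, idiom_counts, graphlet_counts):
        # Node attribute: instruction-level idiom counts followed by
        # function-level graphlet counts (both made up for this example).
        self.attrs[name] = [float(x) for x in idiom_counts + graphlet_counts]
        self.calls.setdefault(name, set())

    def add_call(self, caller, callee):
        # Binary-level structure: a directed edge in the call graph.
        self.calls[caller].add(callee)


def mean(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def propagate(graph):
    """One round of mean aggregation over each function and its callees.

    This stands in for a learned GNN layer; it has no trainable weights.
    """
    return {
        name: mean([graph.attrs[name]]
                   + [graph.attrs[c] for c in sorted(graph.calls[name])])
        for name in graph.attrs
    }


def embed(graph):
    """Mean-pool the propagated node vectors into one binary-level vector."""
    return mean(list(propagate(graph).values()))


g = AFCG()
g.add_function("main", [2, 0], [1])
g.add_function("parse", [0, 4], [3])
g.add_call("main", "parse")
embedding = embed(g)
```

In a real system the resulting vector would be fed to a classifier over compiler, version, and optimization level; here it only shows how node attributes and call edges combine into one representation.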



Acknowledgment

The authors thank the anonymous reviewers of ACNS’21 for their help in improving this paper. We are also grateful to the authors of Genius, Gemini, and Origin (including Xiaozhu Meng) for sharing their source code and datasets with us. Lei Cui participated in this work while working as a postdoctoral researcher at the George Washington University from June 2017 to July 2018. This work was supported in part by DARPA under agreement number N66001-18-C-4033 and by National Science Foundation CAREER award 1350766 and grants 1618706 and 1717774. The views, opinions, and/or findings expressed in this material are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense, National Science Foundation, or the U.S. Government.

Author information

Correspondence to Yuede Ji.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Ji, Y., Cui, L., Huang, H.H. (2021). Vestige: Identifying Binary Code Provenance for Vulnerability Detection. In: Sako, K., Tippenhauer, N.O. (eds) Applied Cryptography and Network Security. ACNS 2021. Lecture Notes in Computer Science, vol 12727. Springer, Cham. https://doi.org/10.1007/978-3-030-78375-4_12

  • DOI: https://doi.org/10.1007/978-3-030-78375-4_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-78374-7

  • Online ISBN: 978-3-030-78375-4

  • eBook Packages: Computer Science, Computer Science (R0)
