COBRA-GCN: Contrastive Learning to Optimize Binary Representation Analysis with Graph Convolutional Networks

  • Conference paper

Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13358)

Abstract

The ability to quickly identify whether two binaries are similar is critical for many security applications, with use cases ranging from triaging millions of novel malware samples to identifying whether a binary contains a known exploitable bug. There have been many program analysis approaches to this problem. However, most machine learning approaches in the last five years have focused on function similarity, and no released technique can perform robust many-to-many comparisons of full programs. In this paper, we present the first machine learning approach capable of learning a robust representation of programs based on their similarity, using a combination of supervised natural language processing and graph learning. We name our prototype COBRA: Contrastive Learning to Optimize Binary Representation Analysis. We evaluate our model on several different criteria for program similarity, such as compiler optimizations, code obfuscations, and different pieces of semantically similar source code. Our approach outperforms current techniques for full binary diffing, achieving an F1 score and AUC that are 0.6 and 0.12 higher, respectively, than those of BinDiff, while also supporting many-to-many comparisons.
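
To make the two ideas named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each program has already been mapped to a fixed-length embedding vector (the abstract does not specify the encoder details, so the vectors below are hypothetical). It shows a standard pairwise contrastive loss, which pulls similar pairs together and pushes dissimilar pairs at least `margin` apart, and a many-to-many comparison computed as a pairwise distance matrix over all embeddings. The function names, the margin value, and the choice of Euclidean distance are illustrative assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, similar, margin=1.0):
    """Standard pairwise contrastive loss (assumed form, not taken from the paper).

    Similar pairs are penalized by their squared distance; dissimilar pairs
    are penalized only when they are closer than `margin`.
    """
    d = euclidean(u, v)
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def pairwise_distances(embeddings):
    """Many-to-many comparison: distance between every pair of embeddings."""
    return [[euclidean(u, v) for v in embeddings] for u in embeddings]


# Hypothetical program embeddings: the first two are near-duplicates.
embeddings = [[0.0, 0.0], [0.0, 0.1], [3.0, 4.0]]
dists = pairwise_distances(embeddings)
```

Because every program is embedded once and then compared by vector distance, comparing N programs against each other requires N encoder passes rather than N² pairwise diffing runs, which is the practical appeal of a many-to-many representation.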

DISTRIBUTION STATEMENT. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Department of Defense under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of Defense. Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

Author information

Corresponding author

Correspondence to Michael Wang.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Wang, M., Interrante-Grant, A., Whelan, R., Leek, T. (2022). COBRA-GCN: Contrastive Learning to Optimize Binary Representation Analysis with Graph Convolutional Networks. In: Cavallaro, L., Gruss, D., Pellegrino, G., Giacinto, G. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2022. Lecture Notes in Computer Science, vol 13358. Springer, Cham. https://doi.org/10.1007/978-3-031-09484-2_4

  • DOI: https://doi.org/10.1007/978-3-031-09484-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-09483-5

  • Online ISBN: 978-3-031-09484-2

  • eBook Packages: Computer Science, Computer Science (R0)
