COBRA-GCN: Contrastive Learning to Optimize Binary Representation Analysis with Graph Convolutional Networks

Wang, Michael; Interrante-Grant, Alexander; Whelan, Ryan; Leek, Tim

doi:10.1007/978-3-031-09484-2_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13358)

Included in the following conference series:

International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment

Abstract

The ability to quickly identify whether two binaries are similar is critical for many security applications, with use cases ranging from triaging millions of novel malware samples to identifying whether a binary contains a known exploitable bug. Many program analysis approaches have addressed this problem. However, most machine learning approaches of the last five years have focused on function similarity, and no released technique can perform robust many-to-many comparisons of full programs. In this paper, we present the first machine learning approach capable of learning a robust representation of programs based on their similarity, using a combination of supervised natural language processing and graph learning. We name our prototype COBRA: Contrastive Learning to Optimize Binary Representation Analysis. We evaluate our model on several different metrics for program similarity, such as compiler optimizations, code obfuscations, and different pieces of semantically similar source code. Our approach outperforms current techniques for full binary diffing, achieving an F1 score and AUC that are 0.6 and 0.12 higher, respectively, than those of BinDiff, while also supporting many-to-many comparisons.
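
To make the idea of contrastive learning over program representations concrete, here is a minimal sketch of a margin-based pairwise contrastive loss in the style of Hadsell, Chopra and LeCun. The abstract does not state which loss COBRA uses, so the loss form, the function names `euclidean` and `contrastive_loss`, the `margin` parameter, and the toy embedding vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, similar, margin=1.0):
    """Pairwise contrastive loss (assumed form, not taken from the paper).

    Similar pairs are penalised by their squared distance, which pulls
    them together. Dissimilar pairs are penalised only while they are
    closer than `margin`, which pushes them apart up to that distance.
    """
    d = euclidean(u, v)
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Hypothetical embeddings of three programs (illustrative values only).
prog_a = [0.0, 0.0]
prog_a_optimized = [0.3, 0.4]   # e.g. same source, different compiler flags
prog_b = [3.0, 4.0]             # unrelated program

loss_similar = contrastive_loss(prog_a, prog_a_optimized, similar=True)
loss_dissimilar = contrastive_loss(prog_a, prog_b, similar=False)
```

Under this assumed loss, a trained encoder places variants of the same program close together, so many-to-many comparison reduces to distance queries over precomputed embeddings rather than pairwise diffing.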

DISTRIBUTION STATEMENT. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Department of Defense under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of Defense. Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

Author information

Authors and Affiliations

MIT Lincoln Laboratory, Lexington, USA
Michael Wang, Alexander Interrante-Grant, Ryan Whelan & Tim Leek

Corresponding author

Correspondence to Michael Wang.

Editor information

Editors and Affiliations

University College London, London, UK
Lorenzo Cavallaro
Graz University of Technology, Graz, Austria
Daniel Gruss
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Giancarlo Pellegrino
University of Cagliari, Cagliari, Italy
Giorgio Giacinto

Cite this paper

Wang, M., Interrante-Grant, A., Whelan, R., Leek, T. (2022). COBRA-GCN: Contrastive Learning to Optimize Binary Representation Analysis with Graph Convolutional Networks. In: Cavallaro, L., Gruss, D., Pellegrino, G., Giacinto, G. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2022. Lecture Notes in Computer Science, vol 13358. Springer, Cham. https://doi.org/10.1007/978-3-031-09484-2_4

DOI: https://doi.org/10.1007/978-3-031-09484-2_4
Published: 24 June 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-09483-5
Online ISBN: 978-3-031-09484-2
eBook Packages: Computer Science, Computer Science (R0)
