Abstract
The ability to quickly identify whether two binaries are similar is critical for many security applications, with use cases ranging from triaging millions of novel malware samples to identifying whether a binary contains a known exploitable bug. Many program analysis approaches have been proposed to solve this problem. However, most machine learning approaches in the last five years have focused on function similarity, and no released technique is able to perform robust many-to-many comparisons of full programs. In this paper, we present the first machine learning approach capable of learning a robust representation of programs based on their similarity, using a combination of supervised natural language processing and graph learning. We name our prototype COBRA: Contrastive Learning to Optimize Binary Representation Analysis. We evaluate our model on several different metrics for program similarity, such as compiler optimizations, code obfuscations, and different pieces of semantically similar source code. Our approach outperforms current techniques for full binary diffing, achieving an F1 score and an AUC that are 0.6 and 0.12 higher, respectively, than those of BinDiff, while also supporting many-to-many comparisons.
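To illustrate what a many-to-many comparison over learned program representations can look like, the following minimal sketch scores every query program against every program in a corpus. It is not the COBRA implementation: the three-dimensional embeddings, the program names, and the choice of cosine similarity as the scoring function are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(queries, corpus):
    """Score every query embedding against every corpus embedding."""
    return [[cosine(q, c) for c in corpus] for q in queries]


# Hypothetical embeddings; a trained model would produce these vectors.
queries = {
    "sample_a": [1.0, 0.0, 0.0],
    "sample_b": [0.0, 1.0, 1.0],
}
corpus = {
    "known_1": [2.0, 0.0, 0.0],
    "known_2": [0.0, 3.0, 3.0],
    "known_3": [1.0, 1.0, 0.0],
}

# One row per query program, one column per corpus program.
scores = similarity_matrix(list(queries.values()), list(corpus.values()))

# For each query, pick the corpus program with the highest score.
best = {
    name: max(zip(corpus, row), key=lambda pair: pair[1])[0]
    for name, row in zip(queries, scores)
}
print(best)
```

Because each program is reduced to a single vector, comparing a new sample against a large corpus needs only one embedding pass per program plus cheap vector arithmetic, rather than a separate pairwise diff for every pair of binaries.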
DISTRIBUTION STATEMENT. Approved for public release. Distribution is unlimited. This material is based upon work supported by the Department of Defense under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of Defense. Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, M., Interrante-Grant, A., Whelan, R., Leek, T. (2022). COBRA-GCN: Contrastive Learning to Optimize Binary Representation Analysis with Graph Convolutional Networks. In: Cavallaro, L., Gruss, D., Pellegrino, G., Giacinto, G. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2022. Lecture Notes in Computer Science, vol 13358. Springer, Cham. https://doi.org/10.1007/978-3-031-09484-2_4
DOI: https://doi.org/10.1007/978-3-031-09484-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-09483-5
Online ISBN: 978-3-031-09484-2
eBook Packages: Computer Science (R0)