Binary Function Clone Search in the Presence of Code Obfuscation and Optimization over Multi-CPU Architectures

Authors: Abdullah Qasem, Mourad Debbabi, Marthe Kassouf

ASIA CCS '23: Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security

Pages 443 - 456

https://doi.org/10.1145/3579856.3582818

Published: 10 July 2023

Abstract

Binary function clone search is an essential capability that enables many applications and use cases, including reverse engineering, patch security inspection, threat analysis, and vulnerable function detection. As a result, there has been a surge of interest in designing and implementing techniques that address function similarity on binary executables and firmware images. Although existing approaches have merit in fingerprinting function clones, they are limited when the target binary code has undergone significant code transformation resulting from obfuscation, compiler optimization, and/or cross-compilation to multiple CPU architectures. To address this, we design and implement a system named BinFinder, which employs a neural network to learn binary function embeddings based on a set of extracted features that are resilient to both code obfuscation and compiler optimization techniques. Our experimental evaluation indicates that BinFinder outperforms state-of-the-art approaches for multi-CPU architectures by a large margin, with 46% higher Recall against Gemini, 55% higher Recall against SAFE, and 28% higher Recall against GMN. Among clone search approaches that target obfuscation and compiler optimization, BinFinder achieves higher Recall than both asm2vec (a single-CPU-architecture approach) and BinMatch (a multi-CPU-architecture approach). Finally, our work is the first to provide noteworthy results for binary clone search over Tigress, a well-established open-source obfuscator.
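
As a rough illustration of the embedding-based clone search and the Recall metric mentioned in the abstract, the following is a minimal sketch and not BinFinder's implementation. It assumes that each function has already been mapped to a fixed-length embedding vector (here, hand-made three-dimensional toy vectors), that similarity is measured with cosine similarity, and that Recall is computed as Recall@k in the usual information-retrieval sense; the paper's exact features, network, and metric definition may differ. All function names and data below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query_vec, repository, k):
    """Return the names of the k repository functions most similar to the query."""
    ranked = sorted(
        repository.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

def recall_at_k(queries, repository, ground_truth, k):
    """Fraction of queries whose true clone appears among the top-k results."""
    hits = sum(
        1
        for name, vec in queries.items()
        if ground_truth[name] in search(vec, repository, k)
    )
    return hits / len(queries)

# Toy embeddings (assumed, not taken from the paper): reference functions
# compiled without optimization, and transformed variants used as queries.
repository = {
    "memcpy_O0": [1.0, 0.1, 0.0],
    "strlen_O0": [0.0, 1.0, 0.2],
    "sha1_O0": [0.1, 0.0, 1.0],
}
queries = {
    "memcpy_O3_obf": [0.9, 0.3, 0.1],
    "strlen_O3_obf": [0.2, 0.8, 0.1],
    "sha1_O3_obf": [0.6, 0.5, 0.55],  # deliberately ambiguous embedding
}
ground_truth = {
    "memcpy_O3_obf": "memcpy_O0",
    "strlen_O3_obf": "strlen_O0",
    "sha1_O3_obf": "sha1_O0",
}

if __name__ == "__main__":
    for k in (1, 2):
        print(f"Recall@{k} = {recall_at_k(queries, repository, ground_truth, k):.2f}")
```

In this toy setup the third query is placed close to several repository entries, so its true clone is not ranked first but is recovered once k is raised to 2. This mirrors how heavy obfuscation or optimization can push a clone down the ranking without removing it from the candidate list.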

Index Terms

Security and privacy → Software and application security → Software reverse engineering

Published In

ASIA CCS '23: Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security

July 2023

1066 pages

ISBN:9798400700989

DOI:10.1145/3579856

Editors: Joseph Liu (Monash University, Australia), Yang Xiang (Swinburne University of Technology, Australia), Surya Nepal (Data61, Australia), Gene Tsudik (University of California Irvine, USA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ASIA CCS '23: ACM Asia Conference on Computer and Communications Security

July 10 - 14, 2023

Melbourne, VIC, Australia

Sponsor: SIGSAC

Acceptance Rates

Overall Acceptance Rate 160 of 921 submissions, 17%

Bibliometrics

Total Citations: 4

Total Downloads: 454

Downloads (Last 12 months): 192

Downloads (Last 6 weeks): 12

Reflects downloads up to 15 Feb 2025

Cited By

Du J, Wei Q, Wang Y, Bai X. (2025). DEGNN: A Deep Learning-Based Method for Unmanned Aerial Vehicle Software Security Analysis. Drones 9:2 (110). Online publication date: 2-Feb-2025. https://doi.org/10.3390/drones9020110

Gao Y, Liang L, Li Y, Li R, Wang Y. (2024). Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and Fusion. Electronics 13:9 (1692). Online publication date: 27-Apr-2024. https://doi.org/10.3390/electronics13091692

Wang H, Gao Z, Zhang C, Sun M, Zhou Y, Qiu H, Xiao X, Christakis M, Pradel M. (2024). CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (149-161). Online publication date: 11-Sep-2024. https://dl.acm.org/doi/10.1145/3650212.3652117

Luo Z, Wang P, Xie W, Zhou X, Wang B. (2023). BlockMatch: A Fine-Grained Binary Code Similarity Detection Approach Using Contrastive Learning for Basic Block Matching. Applied Sciences 13:23 (12751). Online publication date: 28-Nov-2023. https://doi.org/10.3390/app132312751
