ABSTRACT
Binary function clone search is an essential capability that enables multiple applications and use cases, including reverse engineering, patch security inspection, threat analysis, vulnerable function detection, etc. As such, a surge of interest has been expressed in designing and implementing techniques to address function similarity on binary executables and firmware images. Although existing approaches have merit in fingerprinting function clones, they present limitations when the target binary code has been subjected to significant code transformation resulting from obfuscation, compiler optimization, and/or cross-compilation to multiple-CPU architectures. In this regard, we design and implement a system named BinFinder, which employs a neural network to learn binary function embeddings based on a set of extracted features that are resilient to both code obfuscation and compiler optimization techniques. Our experimental evaluation indicates that BinFinder outperforms state-of-the-art approaches for multi-CPU architectures by a large margin, with 46% higher Recall against Gemini, 55% higher Recall against SAFE, and 28% higher Recall against GMN. With respect to obfuscation and compiler optimization clone search approaches, BinFinder outperforms the asm2vec (single CPU architecture approach) with higher Recall and BinMatch (multi-CPU architecture approach) with higher Recall. Finally, our work is the first to provide noteworthy results with respect to binary clone search over the tigress obfuscator, which is a well-established open-source obfuscator.
- Saed Alrabaee, Mourad Debbabi, and Lingyu Wang. 2022. A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features. ACM Computing Surveys (CSUR) 55, 1 (2022), 1–41.Google ScholarDigital Library
- Christopher M Bishop 1995. Neural Networks for Pattern Recognition. Oxford University Press.Google Scholar
- Steven H. H. Ding, Benjamin C. M. Fung, and Philippe Charland. 2019. Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. In 2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019. IEEE, 472–489. https://doi.org/10.1109/SP.2019.00003Google ScholarCross Ref
- Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. 2016. discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code.. In NDSS.Google Scholar
- Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. 2016. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 480–491.Google ScholarDigital Library
- FLIRT. 2020. FLIRT @ONLINE. https://hex-rays.com/products/ida/tech/flirt/.Google Scholar
- Yikun Hu, Hui Wang, Yuanyuan Zhang, Bodong Li, and Dawu Gu. 2019. A Semantics-Based Hybrid Approach on Binary Code Similarity Comparison. IEEE Transactions on Software Engineering 47 (2019), 1241–1258.Google ScholarCross Ref
- idapro. 2020. idapro @ONLINE. https://www.hex-rays.com/products/ida/index.shtml.Google Scholar
- Jianguo Jiang, Gengwang Li, Min Yu, Gang Li, Chao Liu, Zhiqiang Lv, Bin Lv, and Weiqing Huang. 2020. Similarity of binaries across optimization levels and obfuscation. In European Symposium on Research in Computer Security. Springer, 295–315.Google ScholarDigital Library
- Pascal Junod, Julien Rinaldini, Johan Wehrli, and Julie Michielin. 2015. Obfuscator-LLVM–software protection for the masses. In 2015 IEEE/ACM 1st International Workshop on Software Protection. IEEE, 3–9.Google ScholarDigital Library
- Dongkwan Kim, Eunsoo Kim, Sang Kil Cha, Sooel Son, and Yongdae Kim. 2022. Revisiting binary code similarity analysis using interpretable feature engineering and lessons learned. IEEE Transactions on Software Engineering (2022).Google ScholarDigital Library
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980Google Scholar
- Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. 2019. Graph matching networks for learning the similarity of graph structured objects. In International conference on machine learning. PMLR, 3835–3845.Google Scholar
- Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou. 2018. α diff: cross-version binary code similarity detection with dnn. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 667–678.Google ScholarDigital Library
- Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Evaluation in Information Retrieval. Cambridge University Press, 139–161. https://doi.org/10.1017/CBO9780511809071.009Google Scholar
- Andrea Marcelli, Mariano Graziano, Xabier Ugarte-Pedrero, Yanick Fratantonio, Mohamad Mansouri, and Davide Balzarotti. [n. d.]. How Machine Learning Is Solving the Binary Function Similarity Problem. ([n. d.]).Google Scholar
- Andrea Marcelli, Mariano Graziano, Xabier Ugarte-Pedrero, Yanick Fratantonio, Mohamad Mansouri, and Davide Balzarotti. 2022. How Machine Learning Is Solving the Binary Function Similarity Problem. In 31st USENIX Security Symposium (USENIX Security 22). 2099–2116.Google Scholar
- Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Roberto Baldoni, and Leonardo Querzoni. 2019. SAFE: Self-Attentive Function Embeddings for Binary Similarity. In Detection of Intrusions and Malware, and Vulnerability Assessment - 16th International Conference, DIMVA 2019, Gothenburg, Sweden, June 19-20, 2019, Proceedings(Lecture Notes in Computer Science, Vol. 11543), Roberto Perdisci, Clémentine Maurice, Giorgio Giacinto, and Magnus Almgren (Eds.). Springer, 309–329. https://doi.org/10.1007/978-3-030-22038-9_15Google Scholar
- Nicholas Nethercote and Julian Seward. 2007. Valgrind: a framework for heavyweight dynamic binary instrumentation. In ACM Sigplan notices, Vol. 42. ACM, 89–100.Google Scholar
- Andrew Y Ng, Michael I Jordan, and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems. 849–856.Google Scholar
- Kexin Pei, Zhou Xuan, Junfeng Yang, Suman Jana, and Baishakhi Ray. 2020. Trex: Learning execution semantics from micro-traces for binary similarity. arXiv preprint arXiv:2012.08680 (2020).Google Scholar
- Federico Scrinzi. 2015. Behavioral Analysis of Obfuscated Code. http://essay.utwente.nl/67522/Google Scholar
- Noam Shalev and Nimrod Partush. 2018. Binary similarity detection using machine learning. In Proceedings of the 13th Workshop on Programming Languages and Analysis for Security. 42–47.Google ScholarDigital Library
- Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Krügel, and Giovanni Vigna. 2016. SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016. IEEE Computer Society, 138–157. https://doi.org/10.1109/SP.2016.17Google ScholarCross Ref
- tigress. 2020. tigress @ONLINE. https://tigress.wtf/.Google Scholar
- Hao Wang, Wenjie Qu, Gilad Katz, Wenyu Zhu, Zeyu Gao, Han Qiu, Jianwei Zhuge, and Chao Zhang. 2022. jTrans: jump-aware transformer for binary code similarity detection. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 1–13.Google ScholarDigital Library
- Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 363–376.Google ScholarDigital Library
- Zeping Yu, Rui Cao, Qiyi Tang, Sen Nie, Junzhou Huang, and Shi Wu. 2020. Order matters: semantic-aware neural networks for binary code similarity detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 1145–1152.Google ScholarCross Ref
- Zeping Yu, Wenxin Zheng, Jiaqi Wang, Qiyi Tang, Sen Nie, and Shi Wu. 2020. Codecmr: Cross-modal retrieval for function-level binary source code matching. Advances in Neural Information Processing Systems 33 (2020), 3872–3883.Google Scholar
- Fei Zuo, Xiaopeng Li, Patrick Young, Lannan Luo, Qiang Zeng, and Zhexin Zhang. 2019. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. In 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019. The Internet Society. https://www.ndss-symposium.org/ndss-paper/neural-machine-translation-inspired-binary-code-similarity-comparison-beyond-function-pairs/Google ScholarCross Ref
Index Terms
- Binary Function Clone Search in the Presence of Code Obfuscation and Optimization over Multi-CPU Architectures
Recommendations
On the robustness of clone detection to code obfuscation
IWSC '13: Proceedings of the 7th International Workshop on Software ClonesCode clones are a common reuse mechanism in software development. While there is an ongoing discussion about harmfulness and advantages of code cloning, this discussion is mainly centered around aspects of software quality. However, recent research has ...
Obfuscation: The Hidden Malware
A cyberwar exists between malware writers and antimalware researchers. At this war's heart rages a weapons race that originated in the 80s with the first computer virus. Obfuscation is one of the latest strategies to camouflage the telltale signs of ...
The power of obfuscation techniques in malicious JavaScript code: A measurement study
MALWARE '12: Proceedings of the 2012 7th International Conference on Malicious and Unwanted Software (MALWARE)JavaScript based attacks have been reported as the top Internet security threats in recent years. Since most of the Internet users rely on anti-virus software to protect themselves from malicious JavaScript code, attackers exploit JavaScript obfuscation ...
Comments