Abstract
Binary code similarity detection (BCSD) is pivotal in system security tasks, including reverse engineering, vulnerability detection, and software component analysis. BCSD studies have proliferated in recent years, yet they perform poorly when confronted with semantic alterations (e.g., function inlining) caused by compiler optimization. To tackle this challenge, we present OpTrans, a framework that fuses binary code Optimization techniques with the Transformer model for BCSD. OpTrans uses an algorithm based on binary program analysis to determine which functions should be inlined, then applies binary rewriting techniques to re-optimize the binaries. This method significantly reduces false positives and enhances model performance in real-world BCSD tasks. We evaluated OpTrans on the BinaryCorp datasets, where it outperformed state-of-the-art BCSD solutions by 21.5% on average. Inline re-optimization improved all BCSD solutions by up to 32.1%. Our ablation study and vulnerability experiment demonstrate the practicality of inline re-optimization in real-world detection scenarios.
Data Availability
Our code and data are available at https://github.com/Sandspeare/optrans
Ethics declarations
Conflict of Interest Statement
The authors declare that they have no conflict of interest.
Additional information
Communicated by: Foutse Khomh and Bowen Xu
This article belongs to the Topical Collection: Special Issue on SEA4DQ.
About this article
Cite this article
Sha, Z., Lan, Y., Zhang, C. et al. OpTrans: enhancing binary code similarity detection with function inlining re-optimization. Empir Software Eng 30, 49 (2025). https://doi.org/10.1007/s10664-024-10605-x
DOI: https://doi.org/10.1007/s10664-024-10605-x