OpTrans: enhancing binary code similarity detection with function inlining re-optimization

Sha, Zihan; Lan, Yang; Zhang, Chao; Wang, Hao; Gao, Zeyu; Zhang, Bolun; Shu, Hui

doi:10.1007/s10664-024-10605-x

Published: 26 December 2024

Volume 30, article number 49, (2025)

Empirical Software Engineering

Zihan Sha ORCID: orcid.org/0000-0002-1020-9006¹,
Yang Lan¹,
Chao Zhang²,
Hao Wang²,
Zeyu Gao²,
Bolun Zhang³ &
Hui Shu²

Abstract

Binary code similarity detection (BCSD) is pivotal in system security, including reverse engineering, vulnerability detection, and software component analysis. Studies on BCSD have proliferated, yet they perform poorly when confronted with semantic alterations (e.g., function inlining) caused by compiler optimization. To tackle this challenge, we present OpTrans, a framework that fuses binary code Optimization techniques with the Transformer model for BCSD. OpTrans employs an algorithm based on binary program analysis to determine which functions should be inlined, followed by binary rewriting techniques to re-optimize the binaries. This method significantly reduces false positives and enhances model performance in real-world BCSD tasks. We evaluated OpTrans on the BinaryCorp datasets, and it outperformed the state-of-the-art BCSD solutions by 21.5% on average. The inline re-optimization improved all BCSD solutions by up to 32.1%. Our ablation study and vulnerability experiment demonstrate the practicality of inline re-optimization in real-world detection scenarios.
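The inlining decision mentioned in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not state the selection criteria, so the rule used here (inline a callee that is small, defined in the binary, and reached from exactly one call site) is an assumption chosen only for illustration. The names `Function`, `select_inline_candidates`, and `max_size` are likewise hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Function:
    """Toy stand-in for a function recovered from a binary (hypothetical)."""
    name: str
    size: int  # number of instructions
    callees: List[str] = field(default_factory=list)  # one entry per call site


def select_inline_candidates(
    functions: List[Function], max_size: int = 8
) -> List[Tuple[str, str]]:
    """Return sorted (caller, callee) pairs that the assumed rule would inline.

    Assumed rule (not taken from the paper): the callee is defined in the
    binary, is not the caller itself, has at most `max_size` instructions,
    and is reached from exactly one call site in the whole program.
    """
    by_name: Dict[str, Function] = {f.name: f for f in functions}

    # Count call sites per callee across the whole call graph.
    call_sites: Dict[str, int] = {}
    for f in functions:
        for callee in f.callees:
            call_sites[callee] = call_sites.get(callee, 0) + 1

    pairs = []
    for f in functions:
        for callee in f.callees:
            target = by_name.get(callee)
            if target is None:      # external or imported symbol: no body to inline
                continue
            if callee == f.name:    # direct recursion: skip
                continue
            if target.size <= max_size and call_sites[callee] == 1:
                pairs.append((f.name, callee))
    return sorted(pairs)


# Usage on a made-up call graph: `log` is called twice, so it is not selected.
demo = [
    Function("main", 40, ["parse", "log", "log"]),
    Function("parse", 6, ["helper"]),
    Function("helper", 3, []),
    Function("log", 5, []),
]
candidates = select_inline_candidates(demo)
```

A selection step like this would only produce the list of caller/callee pairs; the abstract states that binary rewriting is then used to apply the re-optimization, and that step is not sketched here.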

Data Availability

Our code and data are available at https://github.com/Sandspeare/optrans

Author information

Authors and Affiliations

Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou, China
Zihan Sha & Yang Lan
Tsinghua University, Beijing, China
Chao Zhang, Hao Wang, Zeyu Gao & Hui Shu
Institute of Information Engineering CAS, Beijing, China
Bolun Zhang

Corresponding author

Correspondence to Zihan Sha.

Ethics declarations

Conflict of Interest Statement

We declare that all authors have no conflict of interest.

Additional information

Communicated by: Foutse Khomh and Bowen Xu

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on SEA4DQ.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sha, Z., Lan, Y., Zhang, C. et al. OpTrans: enhancing binary code similarity detection with function inlining re-optimization. Empir Software Eng 30, 49 (2025). https://doi.org/10.1007/s10664-024-10605-x

Accepted: 10 December 2024
Published: 26 December 2024
DOI: https://doi.org/10.1007/s10664-024-10605-x
