Skip to main content

OpTrans: enhancing binary code similarity detection with function inlining re-optimization

  • Published:
Empirical Software Engineering Aims and scope Submit manuscript

Abstract

Binary code similarity detection (BCSD) is pivotal in system security including reverse engineering, vulnerability detection and software component analysis. Recent studies on BCSD have proliferated, yet they exhibit poor performance when confronting semantic alterations (e.g., function inlining) caused by compiler optimization. To tackle this challenge, we present OpTrans, an innovative framework that fuses binary code Optimization techniques with the Transformer model for BCSD. OpTrans employs an algorithm based on binary program analysis to determine which functions should be inlined, followed by binary rewriting techniques to effectuate re-optimization on binaries. This innovative method significantly reduces false positives and enhances model performance in real-world BCSD tasks. We evaluated OpTrans on the BinaryCorp datasets, and it outperformed the state-of-the-art BCSD solutions by 21.5% on average. The inline re-optimization improved all BCSD solutions by up to 32.1%. Our ablation study and vulnerability experiment demonstrate the practicality of inline re-optimization in real-world detection scenarios, showing the usefulness of our approach.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Algorithm 1
Algorithm 2
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

Data Availability

Our code and data are available at https://github.com/Sandspeare/optrans

References

  • Liu B, Huo W, Zhang C, Li W, Li F, Piao A, Zou W (2018) \(\alpha \)diff: Cross-version binary code similarity detection with dnn. In: Proceedings of the 33rd ACM/IEEE international conference on automated software engineering, ASE 2018, Montpellier, France, September 3-7, 2018, pp 667–678. ACM, New York, NY, USA

  • Zuo F, Li X, Zhang Z, Young P, Luo L, Zeng Q (2019) Neural machine translation inspired binary code similarity comparison beyond function pairs. In: 26th Annual network and distributed system security symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019

  • Ding SHH, Fung BCM, Charland P (2019) Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In: 2019 IEEE symposium on security and privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019, pp 472–489

  • Massarelli L, Luna GAD, Petroni F, Querzoni L, Baldoni R (2019) Safe: Self-attentive function embeddings for binary similarity. In: Detection of intrusions and malware, and vulnerability assessment - 16th international conference, DIMVA 2019, Gothenburg, Sweden, June 19-20, 2019, Proceedings. Lecture Notes in Computer Science, vol 11543, pp 309–329

  • Li X, Qu Y, Yin H (2021) Palmtree: Learning an assembly language model for instruction embedding. In: Proceedings of the 2021 ACM SIGSAC conference on computer and communications security, pp 3236–3251

  • Li Y, Gu C, Dullien T, Vinyals O, Kohli P (2019) Graph matching networks for learning the similarity of graph structured objects. In: International conference on machine learning, PMLR, pp 3835–3845

  • Wang H, Qu W, Katz G, Zhu W, Gao Z, Qiu H, Zhuge J, Zhang C (2022) jtrans: jump-aware transformer for binary code similarity detection. ISSTA ’22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. Virtual Event, South Korea, July 18–22, 2022. ACM, New York, NY, USA, pp 1–13

  • TensorFlow (2022) Word2vec skip-gram implementation in tensorflow. https://tensorflow.google.cn/tutorials/text/word2vec

  • Marhon SA, Cameron CJF, Kremer SC (2013) In: Bianchini M, Maggini M, Jain LC (eds) Recurrent Neural Networks, Springer, Berlin, Heidelberg, pp 29–65

  • Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  • Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2009) The graph neural network model. IEEE Trans Neural Networks 20(1):61–80. https://doi.org/10.1109/TNN.2008.2005605

    Article  Google Scholar 

  • Ji Y, Cui L, Huang HH (2021) Buggraph: Differentiating source-binary code similarity with graph triplet-loss network. ASIA CCS ’21: ACM Asia Conference on Computer and Communications Security. Virtual Event, Hong Kong, June 7–11, 2021. ACM, New York, NY, USA, pp 702–715

  • Xu X, Liu C, Feng Q, Yin H, Song L, Song DX (2017) Neural network-based graph embedding for cross-platform binary code similarity detection. In: Proceedings of the 2017 ACM SIGSAC conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017, pp 363–376

  • Li X, Yu Q, Yin H (2021) Palmtree: Learning an assembly language model for instruction embedding. CCS ’21: 2021 ACM SIGSAC Conference on Computer and Communications Security. Virtual Event, Republic of Korea, November 15–19, 2021. ACM, New York, NY, USA, pp 3236–3251

  • Project L (2024) Clang Documentation. Accessed on October 11, 2024. https://clang.llvm.org/docs/

  • Cesare S, Xiang Y (2011) Malware variant detection using similarity search over sets of control flow graphs. In: IEEE 10th International conference on trust, security and privacy in computing and communications, TrustCom 2011, Changsha, China, 16-18 November, 2011, pp 181–189

  • Cesare S, Xiang Y, Zhou W (2014) Control flow-based malware variantdetection. IEEE Trans Dependable Secure Comput 11:307–317

    Article  Google Scholar 

  • Tamás C, Papp D, Buttyán L (2021) Simbiota: Similarity-based malware detection on iot devices. In: Proceedings of the 6th International Conference on Internet of Things, Big Data and Security, IoTBDS 2021, Online Streaming, April 23-25, 2021, pp 58–69

  • Hu Y, Zhang Y, Li J, Gu D (2017) Binary code clone detection across architectures and compiling configurations. In: Proceedings of the 25th International Conference on Program Comprehension, ICPC 2017, Buenos Aires, Argentina, May 22-23, 2017, pp 88–98

  • Ding SHH, Fung BCM, Charland P (2016) Kam1n0: Mapreduce-based assembly clone search for reverse engineering. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp 461–470

  • Xu Z, Chen B, Chandramohan M, Liu Y, Song F (2017) Spain: Security patch analysis for binaries towards understanding the pain and pills. In: Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, pp 462–472

  • Gao D, Reiter MK, Song DX (2008) Binhunt: Automatically finding semantic differences in binary programs. In: Information and Communications Security, 10th International Conference, ICICS 2008, Birmingham, UK, October 20-22, 2008, Proceedings. Lecture Notes in Computer Science, vol 5308, pp 238–255

  • Chandramohan M, Xue Y, Xu Z, Liu Y, Cho CY, Tan HBK (2016) Bingo: cross-architecture cross-os binary search. In: Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016, pp 678–689

  • Pewny J, Garmany B, Gawlik R, Rossow C, Holz T (2015) Cross-architecture bug search in binary executables. Inf Technol 59:83–91

    Google Scholar 

  • Hex-rays (2022) Ida pro disassembler and debugger. https://www.hex-rays.com/products/ida/index.shtml

  • Dullien T, Rolles R (2005) Graph-based comparison of executable objects (english version). In: SSTIC, vol 5, p 3

  • Eschweiler S, Yakdan K, Gerhards-Padilla E (2016) discovre: Efficient cross-architecture identification of bugs in binary code. In: 23rd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016

  • Pewny J, Schuster F, Bernhard L, Holz T, Rossow C (2014) Leveraging semantic signatures for bug search in binary programs. In: Proceedings of the 30th Annual Computer Security Applications Conference, ACSAC 2014, New Orleans, LA, USA, December 8-12, 2014, pp 406–415

  • Feng Q, Zhou R, Xu C, Cheng Y, Testa B, Yin H (2016) Scalable graph-based bug search for firmware images. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, October 24-28, 2016, pp 480–491. ACM, New York, NY, USA

  • Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a Meeting Held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp 3111–3119

  • He H, Lin X, Weng Z, Zhao R, Gan S, Chen L, Ji Y, Wang J, Xue Z (2024) Code is not natural language: Unlock the power of Semantics-Oriented graph representation for binary code similarity detection. In: 33rd USENIX Security Symposium (USENIX Security 24), pp 1759–1776. USENIX Association, Philadelphia, PA. https://www.usenix.org/conference/usenixsecurity24/presentation/he-haojie

  • Luo Z, Wang P, Wang B, Tang Y, Xie W, Zhou X, Liu D, Lu K (2023) Vulhawk: Cross-architecture vulnerability detection with entropy-based binary code search. Proceedings 2023 Network and Distributed System Security Symposium

  • Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized bert pretraining approach. arXiv:1907.11692

  • Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (ICLR)

  • Yang S, Dong C, Xiao Y, Cheng Y, Shi Z, Li Z, Sun L (2023) Asteria-pro: enhancing deep-learning based binary code similarity detection by incorporating domain knowledge. ACM Trans Softw Eng Methodology

  • Jia A, Fan M, Jin W, Xu X, Zhou Z, Tang Q, Nie S, Wu S, Liu T (2023) 1-to-1 or 1-to-n? investigating the effect of function inlining on binary similarity analysis. ACM Trans Softw Eng Methodol 32(4). https://doi.org/10.1145/3561385

  • Jia A, Fan M, Xu X, Jin W, Wang H, Liu T (2024) Cross-inlining binary function similarity detection. In: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. ICSE ’24. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3597503.3639080

  • Jin X, Pei K, Won JY, Lin Z (2022) Symlm: Predicting function names in stripped binaries via context-sensitive execution-aware code embeddings. In: Proceedings of the 2022 ACM SIGSAC conference on computer and communications security, pp 1631–1645

  • Patrick-Evans J, Dannehl M, Kinder J (2023) Xfl: naming functions in binaries with extreme multi-label learning. In: 2023 IEEE Symposium on Security and Privacy (SP), IEEE, pp 2375–2390

  • Sha Z, Shu H, Xiong X, Kang F (2022) Model of execution trace obfuscation between threads. IEEE Trans Dependable Secure Comput 19(6):4156–4171. https://doi.org/10.1109/TDSC.2021.3123159

    Article  Google Scholar 

  • Hex-Rays. (2021) IDA Pro Disassembler and Debugger. Retrieved September 10, 2023 from http://www.hex-rays.com/products/ida/index.shtml

  • Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) Pytorch: An imperative style, high-performance deep learning library. Advan Neural Inf Process Syst 32

  • Wang H, Gao Z, Zhang C, Sha Z, Sun M, Zhou Y, Zhu W, Sun W, Qiu H, Xiao X (2024) CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision

  • Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. CoRR arXiv:1412.6980

  • Maaten L, Hinton G (2008) Visualizing data using t-sne. J Mach Learn Res 9(11)

  • Wang H, Qu W, Katz G, Zhu W, Gao Z, Qiu H, Zhuge J, Zhang C (2022) Jtrans: Jump-aware transformer for binary code similarity detection. In: Proceedings of the 31st ACM SIGSOFT international symposium on software testing and analysis, pp 1–13

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Zihan Sha.

Ethics declarations

Conflict of Interest Statement

We declare that all authors have no conflict of interest.

Additional information

Communicated by: Foutse Khomh and Bowen Xu

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on SEA4DQ.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Sha, Z., Lan, Y., Zhang, C. et al. OpTrans: enhancing binary code similarity detection with function inlining re-optimization. Empir Software Eng 30, 49 (2025). https://doi.org/10.1007/s10664-024-10605-x

Download citation

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10664-024-10605-x

Keywords