LEARNT: A Neural Machine Translation Framework for Accurate Binary Lifting to High-Level Representation

  • Conference paper
Advances in Intelligent Networking and Collaborative Systems (INCoS 2024)

Abstract

In binary analysis, performing static analyses on an architecture-agnostic intermediate representation is efficient and in high demand. Sound and accurate Low-Level Virtual Machine Intermediate Representation (LLVM IR) lifted from binaries would make it possible to reuse dozens of existing analysis programs from the LLVM ecosystem. However, current binary lifters lack the resources to improve their manually developed lifting rules and to develop more of them. This work treats the lifting of low-level code to a sound high-level intermediate representation (IR) as a formal language translation problem, which enables binary lifting to be learned automatically. To this end, we propose LEARNT, a binary lifting framework based on neural machine translation, together with a parallel corpus generation method that leverages a compiler. The evaluation results show that LEARNT's average translation accuracy is 93%, indicating that the translation rules automatically learned by LEARNT are sound.
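
The abstract mentions a parallel corpus generation method that leverages a compiler but does not describe it further. The following is only a minimal sketch of the general idea, not the authors' pipeline: it assumes `clang` is the compiler, that one C source file yields one (assembly, LLVM IR) pair, and that optimization is disabled. The function names `compiler_commands` and `generate_pair` are hypothetical.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def compiler_commands(source: Path, out_dir: Path,
                      compiler: str = "clang") -> dict[str, list[str]]:
    """Build the two compiler invocations that produce one parallel pair.

    Assumption: both sides are emitted from the same source file with the
    same flags, so the assembly and the LLVM IR describe the same program.
    """
    stem = source.stem
    return {
        # -S alone stops after code generation and writes target assembly.
        "asm": [compiler, "-O0", "-S", str(source),
                "-o", str(out_dir / f"{stem}.s")],
        # -S -emit-llvm writes textual LLVM IR instead of target assembly.
        "ir": [compiler, "-O0", "-S", "-emit-llvm", str(source),
               "-o", str(out_dir / f"{stem}.ll")],
    }


def generate_pair(source: Path, out_dir: Path,
                  compiler: str = "clang") -> tuple[str, str] | None:
    """Run both invocations and return (assembly_text, llvm_ir_text).

    Returns None when the compiler is not installed, so callers can skip
    the sample instead of failing.
    """
    if shutil.which(compiler) is None:
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = compiler_commands(source, out_dir, compiler)
    for command in commands.values():
        subprocess.run(command, check=True)
    asm_text = Path(commands["asm"][-1]).read_text()
    ir_text = Path(commands["ir"][-1]).read_text()
    return asm_text, ir_text
```

Calling `generate_pair(Path("sample.c"), Path("corpus"))` over a directory of source files would give a list of aligned text pairs that a sequence-to-sequence model could train on. A lifter that works on binaries would more likely disassemble the compiled object file than use compiler-emitted assembly; that step is left out here.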

Acknowledgements

This work was supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-01197, Convergence security core talent training business (Soon Chun Hyang University)).

Author information

Corresponding author

Correspondence to Duulga Baasantogtokh.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Baasantogtokh, D., Yoon, Y., Batzorig, M., Sahlabadi, M., Yim, K. (2024). LEARNT: A Neural Machine Translation Framework for Accurate Binary Lifting to High-Level Representation. In: Barolli, L. (eds) Advances in Intelligent Networking and Collaborative Systems. INCoS 2024. Lecture Notes on Data Engineering and Communications Technologies, vol 225. Springer, Cham. https://doi.org/10.1007/978-3-031-72322-3_4
