Abstract
In binary analysis, running static analyses on an architecture-agnostic intermediate representation is efficient and in strong demand. Sound and accurate Low-Level Virtual Machine Intermediate Representation (LLVM IR) lifted from binaries would allow dozens of existing analysis programs in the LLVM ecosystem to be reused. However, current binary lifters lack the resources to improve their manually developed lifting rules and to develop new ones. This work treats the lifting of low-level language to a sound high-level Intermediate Representation (IR) as a formal language translation problem, which enables binary lifting to be learned automatically. To this end, we propose LEARNT, a neural machine translation-based binary lifting framework, together with a parallel corpus generation method that leverages a compiler. The evaluation results show that LEARNT’s average translation accuracy is 93%, which we take as evidence that the translation rules automatically learned by LEARNT are sound.
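The abstract does not describe how the compiler-based parallel corpus is built, so the following is only a minimal sketch of one plausible approach: compile the same C source twice with clang, once to assembly and once to textual LLVM IR, and keep the two outputs as an aligned pair. The function names `corpus_commands` and `build_pair`, the choice of clang, the `-O0` default, and the file-level pairing are all assumptions made for illustration and are not taken from the paper.

```python
import shutil
import subprocess
from pathlib import Path


def corpus_commands(source: Path, out_dir: Path, opt: str = "-O0"):
    """Return the two clang invocations for one (assembly, LLVM IR) pair.

    Hypothetical helper: `-S` emits assembly, `-S -emit-llvm` emits textual IR.
    """
    asm = out_dir / (source.stem + ".s")
    ir = out_dir / (source.stem + ".ll")
    return {
        "asm": ["clang", "-S", opt, str(source), "-o", str(asm)],
        "ir": ["clang", "-S", "-emit-llvm", opt, str(source), "-o", str(ir)],
    }


def build_pair(source: Path, out_dir: Path, opt: str = "-O0"):
    """Compile `source` twice and return (assembly_text, ir_text).

    Returns None when clang is not installed, so callers can skip the sample.
    """
    if shutil.which("clang") is None:
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    cmds = corpus_commands(source, out_dir, opt)
    for cmd in cmds.values():
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    asm_text = Path(cmds["asm"][-1]).read_text()
    ir_text = Path(cmds["ir"][-1]).read_text()
    return asm_text, ir_text
```

Because both outputs come from the same source and the same optimization level, each pair is aligned by construction. A real pipeline would more likely disassemble a compiled binary and pair code at function or basic-block granularity, which this sketch does not attempt.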
Acknowledgements
This work was supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-01197, Convergence security core talent training business (Soon Chun Hyang University)).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Baasantogtokh, D., Yoon, Y., Batzorig, M., Sahlabadi, M., Yim, K. (2024). LEARNT: A Neural Machine Translation Framework for Accurate Binary Lifting to High-Level Representation. In: Barolli, L. (eds) Advances in Intelligent Networking and Collaborative Systems. INCoS 2024. Lecture Notes on Data Engineering and Communications Technologies, vol 225. Springer, Cham. https://doi.org/10.1007/978-3-031-72322-3_4
DOI: https://doi.org/10.1007/978-3-031-72322-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72321-6
Online ISBN: 978-3-031-72322-3
eBook Packages: Intelligent Technologies and Robotics (R0)