LEARNT: A Neural Machine Translation Framework for Accurate Binary Lifting to High-Level Representation

Baasantogtokh, Duulga; Yoon, Yoseob; Batzorig, Munkhdelgerekh; Sahlabadi, Mahdi; Yim, Kangbin

doi:10.1007/978-3-031-72322-3_4

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 225)

Included in the following conference series:

International Conference on Intelligent Networking and Collaborative Systems

Abstract

In binary analysis, running static analyses on an architecture-agnostic intermediate representation is efficient and in strong demand. Sound and accurate Low-Level Virtual Machine Intermediate Representation (LLVM IR) lifted from binaries would make it possible to reuse dozens of existing analysis programs from the LLVM ecosystem. However, current binary lifters lack the resources to improve their manually developed lifting rules and to develop more of them. This work treats lifting a low-level language to a sound high-level intermediate representation (IR) as a formal language translation problem, which enables binary lifting to be learned automatically. We therefore propose LEARNT, a neural machine translation-based binary lifting framework, together with a parallel corpus generation method that leverages a compiler. The evaluation results show that LEARNT’s average translation accuracy is 93%, indicating that the translation rules LEARNT learns automatically are sound.
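As a rough illustration of what compiler-driven parallel corpus generation can look like, the sketch below compiles one C source file twice, once to assembly and once to LLVM IR, so that the two outputs form a source/target pair. This is not the authors' pipeline: the abstract does not state the pairing granularity, the compiler, or the optimization level, so the function names `corpus_commands` and `build_pair`, the use of `clang`, and the `-O0` default are all assumptions made for illustration. Only the standard `clang` flags `-S` and `-emit-llvm` are used.

```python
import subprocess
from pathlib import Path


def corpus_commands(source: Path, out_dir: Path,
                    compiler: str = "clang", opt: str = "-O0"):
    """Build the two compiler invocations for one C file (hypothetical helper).

    Returns ((asm_path, asm_cmd), (ir_path, ir_cmd)). The assembly output is
    the low-level side of the pair, the LLVM IR output the high-level side.
    """
    asm = out_dir / (source.stem + ".s")
    ir = out_dir / (source.stem + ".ll")
    # -S stops after code generation and writes textual assembly.
    asm_cmd = [compiler, opt, "-S", str(source), "-o", str(asm)]
    # -S together with -emit-llvm writes textual LLVM IR instead.
    ir_cmd = [compiler, opt, "-S", "-emit-llvm", str(source), "-o", str(ir)]
    return (asm, asm_cmd), (ir, ir_cmd)


def build_pair(source: Path, out_dir: Path):
    """Run both compilations and return (assembly_text, llvm_ir_text).

    Assumes a clang binary is available on PATH.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    (asm, asm_cmd), (ir, ir_cmd) = corpus_commands(source, out_dir)
    subprocess.run(asm_cmd, check=True)
    subprocess.run(ir_cmd, check=True)
    return asm.read_text(), ir.read_text()
```

Calling `build_pair(Path("add.c"), Path("out"))` on a machine with `clang` installed would yield one file-level pair. A real corpus would likely need finer alignment (for example per function or per basic block) and would start from disassembled binaries rather than compiler-emitted assembly; both points are left out of this sketch.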

Acknowledgements

This work was supported by an Institute for Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-01197, Convergence security core talent training business (Soon Chun Hyang University)).

Author information

Authors and Affiliations

Department of Mobility Convergence Security, Soonchunhyang University, 22, Soonchunhyang-ro, Sinchang-Myeon, Asan-si, Chungcheongnam-do, South Korea
Duulga Baasantogtokh & Yoseob Yoon
Department of Information Security Engineering, Soonchunhyang University, 22, Soonchunhyang-ro, Sinchang-Myeon, Asan-si, Chungcheongnam-do, South Korea
Munkhdelgerekh Batzorig, Mahdi Sahlabadi & Kangbin Yim

Corresponding author

Correspondence to Duulga Baasantogtokh.

Editor information

Editors and Affiliations

Department of Information and Communication Engineering, Fukuoka Institute of Technology (FIT), Fukuoka, Japan
Leonard Barolli

Cite this paper

Baasantogtokh, D., Yoon, Y., Batzorig, M., Sahlabadi, M., Yim, K. (2024). LEARNT: A Neural Machine Translation Framework for Accurate Binary Lifting to High-Level Representation. In: Barolli, L. (eds) Advances in Intelligent Networking and Collaborative Systems. INCoS 2024. Lecture Notes on Data Engineering and Communications Technologies, vol 225. Springer, Cham. https://doi.org/10.1007/978-3-031-72322-3_4

DOI: https://doi.org/10.1007/978-3-031-72322-3_4
Published: 15 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72321-6
Online ISBN: 978-3-031-72322-3
eBook Packages: Intelligent Technologies and Robotics (R0)
