Abstract
Compiler optimization levels are important for binary analysis, but they are not available in commercial off-the-shelf (COTS) binaries. In this paper, we present HIMALIA, the first end-to-end system that recovers compiler optimization levels from disassembled binary code without any knowledge of the target instruction set semantics. We achieve this by formulating the problem as a deep learning task and training a two-layer recurrent neural network. Besides the recurrent neural network, HIMALIA relies on two other techniques: instruction embedding and a new function representation method. We implement HIMALIA and carry out comprehensive experiments on our dataset of 378,695 distinct functions from 5,828 binaries compiled by GCC. The results show that HIMALIA achieves an accuracy of around 89%. Moreover, we find that HIMALIA's learnt model is explicable: it automatically learns common compiler conventions and idioms that match our prior knowledge.
Notes
1. Different compilation units that make up a single executable can be optimized as a single module only if Link Time Optimization (LTO) is enabled by the compiler.
2. All instructions are processed as described in the input preprocessing section.
3. Stub functions are typically functions that have been defined but contain no real code. For example, most run-time library functions are stub functions.
Acknowledgment
This work was supported by the National Key Research and Development Program of China (2016YFB0800202); the National Natural Science Foundation of China under Grant No. U1636120; the Fundamental Theory and Cutting Edge Technology Research Program of the Institute of Information Engineering, CAS; SKLOIS (No. Y7Z0361104 and No. Y7Z0311104); and Science Foundation Ireland under Grant Number 13/SIRG/2178.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Chen, Y., Shi, Z., Li, H., Zhao, W., Liu, Y., Qiao, Y. (2019). HIMALIA: Recovering Compiler Optimization Levels from Binaries by Deep Learning. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868. Springer, Cham. https://doi.org/10.1007/978-3-030-01054-6_3
DOI: https://doi.org/10.1007/978-3-030-01054-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01053-9
Online ISBN: 978-3-030-01054-6
eBook Packages: Intelligent Technologies and Robotics (R0)