Abstract
This paper proposes a method to systematically extract the formal semantics of ARM instructions from their natural language specifications. Although ARM is based on RISC architecture and the number of instructions is relatively small, an abundance of variations diversely exist under various series including Cortex-A, Cortex-M, and Cortex-R. Thus, the semi-automatic semantics formalisation of rather simple instructions results in reducing tedious human efforts for tool developments e.g., the symbolic execution. We concentrate on six variations: M0, M0+, M3, M4, M7, and M33 of ARM Cortex-M series, aiming at covering IoT malware. Our systematic approach consists of the semantics interpretation by applying translation rules, augmented by the sentences similarity analysis to recognise the modification of flags. Among 1039 collected specifications, the formal semantics of 662 instructions have been successfully extracted by using only 228 manually prepared rules. They are utilised afterwards to preliminarily build a dynamic symbolic execution tool for Cortex-M called Corana. We experimentally observe that Corana is capable of effectively tracing IoT malware under the presence of obfuscation techniques like indirect jumps, as well as correctly detecting dead conditional branches, which are regarded as opaque predicates.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
King, J.C.: Symbolic execution and program testing. Commun. ACM 19(7), 385–394 (1976)
Thakur, A., et al.: Directed proof generation for machine code. In: Tayssir, T., Byron, C., Paul, J. (eds.) CAV 2010. LNCS, vol. 6174, pp. 288–305. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14295-6_27
Desclaux, F.: miasm: Framework de reverse engineering. In: Actes du SSTIC (2012)
Cha, S.K., Avgerinos, T., Rebert, A., Brumley, D.: Unleashing Mayhem on binary code. In: IEEE S and P 2012, pp. 380–394 (2012)
Anthony, R.: Methods for binary symbolic execution. In: Ph.D. Dissertation, Stanford University (December 2014)
Bonfante, G., Fernandez, J., Marion, J.Y., Rouxel, B., Sabatier, F., Thierry, A.: Codisasm: medium scale concatic disassembly of self-modifying binaries with overlapping instructions. In: CCS 2015, pp. 745–756 (2015)
Hai, N.M., Ogawa, M., Tho, Q.T.: Obfuscation code localization based on CFG generation of malware. In: Garcia-Alfaro, J., Kranakis, E., Bonfante, G. (eds.) FPS 2015. LNCS, vol. 9482, pp. 229–247. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30303-1_14
Shoshitaishvili, Y., et al.: (State of) the art of war: offensive techniques in binary analysis. In: IEEE S and P 2016, pp. 138–157 (2016)
Nethercote, N., Seward, J.: Valgrind: a framework for heavyweight dynamic binary instrumentation. In: ACM PLDI 2007, pp. 89–100 (2007)
Capstone Engine. http://capstone-engine.org. Accessed 9 July 2019
Ida. https://hex-rays.com/products/ida. Accessed 9 July 2019
Krishnamoorthy, N., Debray, S., Fligg, K.: Static detection of disassembly errors. In: IEEE WCRE 2009, pp. 259–268 (2009)
Dasgupta, S., Park, D., Kasampalis, T., Adve, V.S., Rosu, G.: A complete formal semantics of x86-64 user-level instruction set architecture. In: ACM PLDI 2019, pp. 1133–1148 (2019)
ARM Developer. https://developer.arm.com. Accessed 9 July 2019
The Corana Tool. https://anhvvcs.github.io/corana. Accessed 9 July 2019
Robeer, M., Lucassen, G., van der Werf, J.M.E., Dalpiaz, F., Brinkkemper, S.: Automated extraction of conceptual models from user stories via NLP. In: IEEE RE 2016, pp. 196–205 (2016)
Yue, T., Briand, L.C., Labiche, Y.: aToucan: an automated framework to derive UML analysis models from use case models. ACM TOSEM 24(3), 13:1–13:52 (2015)
Heule, S., Schkufza, E., Sharma, R., Aiken, A.: Stratified synthesis: automatically learning the x86-64 instruction set. In: ACM PLDI 2016, pp. 237–250 (2016)
Schkufza, E., Sharma, R., Aiken, A.: Stochastic superoptimization. In: ASPLOS 2013, pp. 305–316 (2013)
\(\mu \)Vision. http://keil.com/mdk5/uvision. Accessed 9 July 2019
Yen, N.L.H.: Automatic extraction of x86 formal semantics from its natural language description. In: Master’s Thesis, School of Information Science, JAIST (March 2018)
Anh, V.V.: Formal semantics extraction from natural language specifications for ARM. In: Master’s Thesis, School of Information Science, JAIST (December 2018)
Bonfante, G., Marion, J.Y., Reynaud-Plantey, D.: A computability perspective on self-modifying programs. In: SEFM 2009, pp. 231–239 (2009)
Degenbaev, U.: Formal specification of the x86 instruction set architecture. In: Ph.D. Dissertation, Universitat des Saarlandes (February 2012)
Aceto, L., Fokkink, W., Verhoef, C.: Structural operational semantics. Handbook of Process Algebra, pp. 197–292 (2001)
Loper, E., Bird, S.: NLTK: the natural language toolkit. In: ACL (2004)
Robertson, S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Documentation 60(5), 503–520 (2004)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Luckow, K., et al.: JDart: a dynamic symbolic analysis framework. In: Chechik, M., Raskin, J.-F. (eds.) TACAS 2016. LNCS, vol. 9636, pp. 442–459. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-49674-9_26
Visser, W., Havelund, K., Brat, G., Park, S., Lerda, F.: Model checking programs. Autom. Softw. Eng. 10(2), 203–232 (2003)
de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: Ramakrishnan, C.R., Rehof, J. (eds.) TACAS 2008. LNCS, vol. 4963, pp. 337–340. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78800-3_24
Kirat, D., Vigna, G., Kruegel, C.: barebox: efficient malware analysis on bare-metal. In: ACSAC 2011, pp. 403–412 (2011)
Brumley, D., Hartwig, C., Liang, Z., Newsome, J., Song, D., Yin, H.: Automatically identifying trigger-based behavior in malware. In: Wenke L., Cliff W., David D. (eds.) Botnet Detection 2008, ADIS, vol. 36, pp. 65–88. Springer, Heidelberg (2008). https://doi.org/10.1007/978-0-387-68768-14
Fleck, D., Tokhtabayev, A., Alarif, A., Stavrou, A., Nykodym, T.: PyTrigger: a system to trigger & extract user-activated malware behavior. In: AERES 2013, pp. 92–101 (2013)
Virus Total. https://www.virustotal.com. Accessed 9 July 2019
Virus Share. https://virusshare.com. Accessed 9 July 2019
Acknowledgments
We are grateful to Nao Hirokawa, Le Minh Nguyen, and the anonymous reviewers of FM’19 for their insightful feedback and invaluable comments. We sincerely thank Xuan Tung Vu, Thi Hai Yen Vuong, and Lam Hoang Yen Nguyen for their constructive discussions, as well as Thu Trang Hoang for her sharp comments on some grammatical issues. This study is partially supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (B) 19H04083.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Vu, A.V., Ogawa, M. (2019). Formal Semantics Extraction from Natural Language Specifications for ARM. In: ter Beek, M., McIver, A., Oliveira, J. (eds) Formal Methods – The Next 30 Years. FM 2019. Lecture Notes in Computer Science(), vol 11800. Springer, Cham. https://doi.org/10.1007/978-3-030-30942-8_28
Download citation
DOI: https://doi.org/10.1007/978-3-030-30942-8_28
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30941-1
Online ISBN: 978-3-030-30942-8
eBook Packages: Computer ScienceComputer Science (R0)