Formal Semantics Extraction from Natural Language Specifications for ARM

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11800)

Abstract

This paper proposes a method to systematically extract the formal semantics of ARM instructions from their natural language specifications. Although ARM is a RISC architecture with a relatively small number of instructions, many variations exist across its series, including Cortex-A, Cortex-M, and Cortex-R. Semi-automatic formalisation of the semantics of these rather simple instructions therefore reduces the tedious human effort needed to develop tools such as symbolic executors. We concentrate on six variations of the ARM Cortex-M series (M0, M0+, M3, M4, M7, and M33), aiming to cover IoT malware. Our systematic approach interprets the semantics by applying translation rules, augmented by sentence similarity analysis to recognise the modification of flags. Among 1039 collected specifications, the formal semantics of 662 instructions have been successfully extracted using only 228 manually prepared rules. These semantics are then used to build a preliminary dynamic symbolic execution tool for Cortex-M called Corana. We experimentally observe that Corana can effectively trace IoT malware in the presence of obfuscation techniques such as indirect jumps, and can correctly detect dead conditional branches, which are regarded as opaque predicates.
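
The two ingredients named in the abstract, translation rules and sentence similarity for flag updates, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two regex rules, the reference sentences, and the function names `translate` and `flags_modified` are hypothetical and are not taken from the paper or from Corana. Plain term-frequency cosine similarity stands in for the similarity measure, which the abstract does not specify.

```python
import math
import re
from collections import Counter

# Hypothetical translation rules (NOT the paper's 228 rules): each pairs a
# pattern over an English description with a template for a register update.
TRANSLATION_RULES = [
    (re.compile(r"adds (\w+) to (\w+)", re.IGNORECASE), "Rd := {1} + {0}"),
    (re.compile(r"subtracts (\w+) from (\w+)", re.IGNORECASE), "Rd := {1} - {0}"),
]

# Hypothetical reference sentences, each labelled with the flags it implies.
FLAG_EXAMPLES = {
    "updates the n, z, c and v flags according to the result": ("N", "Z", "C", "V"),
    "updates the n and z flags according to the result": ("N", "Z"),
    "does not change the flags": (),
}


def translate(description):
    """Return the semantics produced by the first matching rule, or None."""
    for pattern, template in TRANSLATION_RULES:
        match = pattern.search(description)
        if match:
            return template.format(*match.groups())
    return None


def _term_counts(sentence):
    """Lower-case the sentence and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between the term-count vectors of two sentences."""
    va, vb = _term_counts(a), _term_counts(b)
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flags_modified(sentence):
    """Return the flag set of the most similar reference sentence."""
    best = max(FLAG_EXAMPLES, key=lambda ref: cosine(sentence, ref))
    return FLAG_EXAMPLES[best]


if __name__ == "__main__":
    print(translate("Adds Operand2 to Rn"))
    print(flags_modified("Updates the N and Z flags based on the result"))
```

The split mirrors the division of labour the abstract describes: rigid patterns handle the regular part of a description, while a similarity score tolerates the freer wording of flag-update sentences. A description that no rule matches yields `None`, which corresponds to a specification that needs a new rule or manual treatment.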

Acknowledgments

We are grateful to Nao Hirokawa, Le Minh Nguyen, and the anonymous reviewers of FM’19 for their insightful feedback and invaluable comments. We sincerely thank Xuan Tung Vu, Thi Hai Yen Vuong, and Lam Hoang Yen Nguyen for their constructive discussions, as well as Thu Trang Hoang for her sharp comments on some grammatical issues. This study is partially supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (B) 19H04083.

Author information

Corresponding author

Correspondence to Anh V. Vu.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Vu, A.V., Ogawa, M. (2019). Formal Semantics Extraction from Natural Language Specifications for ARM. In: ter Beek, M., McIver, A., Oliveira, J. (eds) Formal Methods – The Next 30 Years. FM 2019. Lecture Notes in Computer Science, vol 11800. Springer, Cham. https://doi.org/10.1007/978-3-030-30942-8_28

  • DOI: https://doi.org/10.1007/978-3-030-30942-8_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30941-1

  • Online ISBN: 978-3-030-30942-8

  • eBook Packages: Computer Science, Computer Science (R0)
