A Model and Declarative Language for Specifying Binary Data Formats

Programming and Computer Software

Abstract

Tasks related to binary data formats include parsing, generation, and joint analysis of code and data. A key element of all these tasks is a universal data format model. An approach to modeling binary data formats is proposed. The described model is expressive enough to specify the majority of widespread data formats. Its distinctive features are flexibility in specifying field locations and the ability to describe external fields, whose structure cannot be determined by parsing. The implemented infrastructure makes it possible to create and modify the representation through application programming interfaces. An algorithm based on the concept of computability of fields is proposed for parsing binary data with the specified model. A domain-specific language for data format specification is also described. The formats specified so far and potential practical applications of the model for programmatic analysis of formatted data are discussed.




Notes

  1. Since different languages have different primitives, partial format descriptions with a similar structure were used.

Author information

Correspondence to A. A. Evgin, M. A. Solovev or V. A. Padaryan.

Additional information

Translated by A. Klimontovich

Data parsing algorithm in accordance with the proposed model (pseudocode)

INPUT: pointer to data (format instance and starting address)

OUTPUT: the result structure Data

1. For the set of unparsed relations

2. Take a new relation from the set of unparsed relations

3. Relation type:

 - internal or external:

  3.1. Determine the computability of location

   - Not computable: goto Step 2

   - Computable: calculate -> POS

  3.2. If POS = None, then mark the relation as parsed and goto Step 2

  3.3. Determine the computability of the format instance:

   - Not computable: goto Step 2

   - Computable: calculate -> FORMAT

  3.4. Relation type:

   - internal:

     3.4.1. For (POS, FORMAT) call Algorithm

     3.4.2. Add the parsing result to the structure Data

     3.4.3. Mark the relation as parsed

   - external:

     3.4.4. Add (POS, FORMAT) to the structure Data

     3.4.5. Mark the relation as parsed

 - value relation:

  3.5. Determine the computability of the value:

   - Not computable: goto Step 2

   - Computable:

     3.5.1. Calculate -> VALUE

     3.5.2. Add VALUE to the structure Data

     3.5.3. Mark the relation as parsed

4. If there are relations in the set of unparsed ones, then goto Step 2

5. If no relation was parsed, then return ERROR

6. Goto Step 1
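
The steps above can be read as a fixed-point loop: passes over the unparsed relations repeat until every relation is parsed or a whole pass makes no progress. The Python sketch below is one possible reading, not the authors' implementation. The names `Format`, `Relation`, and `parse` are hypothetical, as is the convention that a location, format, or value is "not computable" when its callable raises `KeyError` on a field that has not been parsed yet.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Format:
    """Hypothetical format: a set of relations, or a primitive with a reader."""
    relations: list = field(default_factory=list)
    read: Optional[Callable[[bytes, int], Any]] = None


@dataclass
class Relation:
    """Hypothetical relation; kind is "internal", "external" or "value"."""
    name: str
    kind: str
    location: Optional[Callable[[dict, int], Optional[int]]] = None
    fmt: Optional[Callable[[dict], Any]] = None
    value: Optional[Callable[[dict], Any]] = None


def parse(data: bytes, fmt: Format, base: int = 0) -> Any:
    if fmt.read is not None:               # primitive format: decode directly
        return fmt.read(data, base)
    result: dict = {}                      # the structure Data
    pending = list(fmt.relations)          # the set of unparsed relations
    while pending:                         # Step 1: start a new pass
        progressed = False
        for rel in list(pending):          # Steps 2 and 4: visit each relation
            try:
                if rel.kind == "value":
                    outcome = rel.value(result)            # Step 3.5
                else:
                    pos = rel.location(result, base)       # Step 3.1
                    # Step 3.2: POS = None means the field is absent
                    sub = None if pos is None else rel.fmt(result)  # Step 3.3
            except KeyError:               # assumed "not computable" signal
                continue
            if rel.kind == "value":
                result[rel.name] = outcome
            elif pos is not None and rel.kind == "internal":
                result[rel.name] = parse(data, sub, pos)   # Step 3.4.1
            elif pos is not None:
                result[rel.name] = (pos, sub)              # Step 3.4.4
            pending.remove(rel)            # mark the relation as parsed
            progressed = True
        if not progressed:                 # Step 5: nothing became computable
            raise ValueError("no relation could be parsed")
    return result


def u8() -> Format:
    return Format(read=lambda d, p: d[p])


def raw(n: int) -> Format:
    return Format(read=lambda d, p: d[p:p + n])


# A length-prefixed record, listed out of dependency order on purpose.
record = Format(relations=[
    Relation("payload", "internal",
             location=lambda r, b: b + 1, fmt=lambda r: raw(r["length"])),
    Relation("total", "value", value=lambda r: r["length"] + 1),
    Relation("length", "internal",
             location=lambda r, b: b, fmt=lambda r: u8()),
    Relation("blob", "external",
             location=lambda r, b: b + r["total"], fmt=lambda r: "opaque"),
])
result = parse(b"\x03abcXYZ", record)
```

In this example only `length` is computable during the first pass; `payload`, `total`, and `blob` are resolved in the second pass, and `blob` is recorded as a (position, format) pair without being parsed, as in Step 3.4.4.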

About this article

Cite this article

Evgin, A.A., Solovev, M.A. & Padaryan, V.A. A Model and Declarative Language for Specifying Binary Data Formats. Program Comput Soft 48, 469–483 (2022). https://doi.org/10.1134/S0361768822070040

  • DOI: https://doi.org/10.1134/S0361768822070040
