Arpeggio: A flexible PEG parser for Python
Introduction
A parser is a software component that takes input (usually textual) and produces a data structure. This transformation is often based on a formal description of the input language syntax - a grammar. A traditional way to define the syntax of a programming language is Chomsky’s generative system of grammars [1], in particular Context-Free Grammars and Regular Expressions. The main problem with this approach is that it was meant to describe natural languages, where the ability to express ambiguity is a desirable feature. However, this same feature is a source of serious problems when describing machine-oriented syntaxes.
Parsing Expression Grammars (PEGs) provide an alternative, recognition-based formal foundation for describing machine-oriented syntaxes, which solves the ambiguity problem by not introducing ambiguity in the first place [2].
Arpeggio is a PEG-based recursive descent parser with backtracking and memoization, implemented in the Python programming language. This class of parsers is known as packrat parsers [3]. Full backtracking enables unlimited lookahead, while memoization, which caches intermediate results, preserves linear parse time.
The main motivation for designing and implementing Arpeggio was to provide a parsing infrastructure for textX [5], a tool for developing Domain-Specific Languages (DSLs) [4]. Nevertheless, since parsers are important parts of many software tools and libraries (e.g. [6]), Arpeggio is built to be suitable for all sorts of general-purpose parsing. It is used for data extraction from various textual formats, parsing of different languages, analysis of legacy source code, etc.
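The interplay of backtracking and memoization can be illustrated with a minimal sketch. It is not Arpeggio's implementation: the grammar, the function names (make_recognizer, check) and the call counter are all assumptions made for this illustration. Each rule maps a start position to the end position of a match (or None), ordered choice tries alternatives in the order written, and a cache keyed by (rule, position) ensures that a rule is evaluated at most once per input position.

```python
def make_recognizer(text, memoize=True):
    """Build a recognizer for a toy arithmetic PEG over `text` (illustrative only).

    sum     <- product '+' sum / product
    product <- atom '*' product / atom
    atom    <- '(' sum ')' / [0-9]+
    """
    cache = {}
    stats = {"calls": 0}  # number of rule bodies actually executed

    def rule(fn):
        # Packrat memoization: remember the result of each (rule, position) pair.
        def wrapper(pos):
            key = (fn.__name__, pos)
            if memoize and key in cache:
                return cache[key]
            stats["calls"] += 1
            result = fn(pos)
            if memoize:
                cache[key] = result
            return result
        return wrapper

    def lit(ch, pos):
        # Match a single literal character; a failed position (None) propagates.
        if pos is not None and text.startswith(ch, pos):
            return pos + 1
        return None

    @rule
    def atom(pos):
        inner = lit("(", pos)
        if inner is not None:
            end = lit(")", sum_(inner))
            if end is not None:
                return end
        end = pos  # second alternative: one or more digits
        while end < len(text) and text[end].isdigit():
            end += 1
        return end if end > pos else None

    @rule
    def product(pos):
        rest = lit("*", atom(pos))
        if rest is not None:
            end = product(rest)
            if end is not None:
                return end
        return atom(pos)  # backtrack: re-parses atom at the same position

    @rule
    def sum_(pos):
        rest = lit("+", product(pos))
        if rest is not None:
            end = sum_(rest)
            if end is not None:
                return end
        return product(pos)  # backtrack: re-parses product at the same position

    def recognize():
        return sum_(0) == len(text)

    return recognize, stats


def check(text, memoize=True):
    """Return (matched, number_of_rule_evaluations) for `text`."""
    recognize, stats = make_recognizer(text, memoize)
    return recognize(), stats["calls"]


if __name__ == "__main__":
    nested = "(" * 6 + "7" + ")" * 6
    print("with memoization:   ", check(nested, memoize=True))
    print("without memoization:", check(nested, memoize=False))
```

On deeply nested input the second alternative of each rule re-parses the same sub-expression after the first alternative fails, so the version without the cache repeats work at every nesting level, while the memoized version evaluates each rule at most once per position.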
Section snippets
Problems and background
The development of DSLs usually requires a lot of experimentation through trial and error. Furthermore, DSLs are much more prone to change than General-Purpose Languages (GPLs). Thus, tools for DSL development should keep the grammar readable and simple to change and extend, and should enable a fast round-trip.
From the start, Arpeggio was designed to work as a grammar interpreter rather than a grammar compiler (i.e. a parser generator). Furthermore, various grammar syntaxes are
Software framework
From the given grammar, Arpeggio builds, at runtime, an instance of the parser, which is a graph of Python objects whose classes inherit from the ParsingExpression class (Fig. 1).
We call this graph of objects the parser model. The parser model for the simple grammar given in Fig. 3 is shown in Fig. 2.
A grammar may be specified using different syntaxes. A canonical form of the grammar specification is the internal DSL form [4], i.e. the grammar is defined using Python language elements (Fig. 3).
In this
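Since the figures are not reproduced here, the idea of a grammar written with Python language elements and turned into a graph of objects at runtime can be sketched as follows. This is a toy interpreter under stated assumptions, not Arpeggio's code: only the base class name ParsingExpression comes from the text above, while Match, Sequence, OrderedChoice, ZeroOrMore, build and recognize are hypothetical names chosen for the illustration. Rules are plain Python functions; in this sketch a tuple denotes a sequence, a list denotes an ordered choice, and a string denotes a literal.

```python
import re


class ParsingExpression:
    """Base class of every node in the toy parser model (hypothetical)."""

    def __init__(self, *nodes):
        self.nodes = list(nodes)

    def parse(self, text, pos):
        raise NotImplementedError


class Match(ParsingExpression):
    """Terminal: match a regular expression at the current position."""

    def __init__(self, pattern):
        super().__init__()
        self.regex = re.compile(pattern)

    def parse(self, text, pos):
        m = self.regex.match(text, pos)
        return m.end() if m else None


class Sequence(ParsingExpression):
    def parse(self, text, pos):
        for node in self.nodes:
            pos = node.parse(text, pos)
            if pos is None:
                return None
        return pos


class OrderedChoice(ParsingExpression):
    def parse(self, text, pos):
        for node in self.nodes:  # the first alternative that matches wins
            end = node.parse(text, pos)
            if end is not None:
                return end
        return None


class ZeroOrMore(ParsingExpression):
    def parse(self, text, pos):
        while True:
            end = self.nodes[0].parse(text, pos)
            if end is None or end == pos:
                return pos
            pos = end


def build(element, cache):
    """Turn a grammar element into a node of the parser model."""
    if isinstance(element, Match):
        return element
    if isinstance(element, ZeroOrMore):
        element.nodes = [build(tuple(element.nodes), cache)]
        return element
    if isinstance(element, str):
        return Match(re.escape(element))
    if isinstance(element, tuple):
        return Sequence(*[build(e, cache) for e in element])
    if isinstance(element, list):
        return OrderedChoice(*[build(e, cache) for e in element])
    if callable(element):  # a grammar rule written as a Python function
        if element not in cache:
            node = cache[element] = Sequence()  # placeholder allows recursive rules
            node.rule_name = element.__name__
            node.nodes = [build(element(), cache)]
        return cache[element]
    raise TypeError(f"unsupported grammar element: {element!r}")


# The grammar itself, expressed with Python language elements.
def number():     return Match(r"\d+")
def factor():     return [number, ("(", expression, ")")]
def term():       return factor, ZeroOrMore(["*", "/"], factor)
def expression(): return term, ZeroOrMore(["+", "-"], term)


def recognize(start_rule, text):
    """Build the parser model at runtime and check that it consumes all of `text`."""
    model = build(start_rule, {})
    return model.parse(text, 0) == len(text)


if __name__ == "__main__":
    print(recognize(expression, "2*(3+4)-5"))
    print(recognize(expression, "2*(3+"))
```

Because the model is built by interpreting the rule functions when recognize is called, changing the grammar requires no code generation step, which is the property a grammar interpreter offers over a parser generator. The sketch only recognizes input; it builds no parse tree, skips no whitespace and does no memoization.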
Implementation and empirical results
Arpeggio is written in pure Python and has no dependencies. It can be installed from PyPI using pip, the standard Python package installer. Details of the installation and usage can be found in the project documentation.
Arpeggio has been validated in various academic and industrial projects. It is covered with extensive unit tests. Our previous work
Illustrative examples
The Arpeggio code repository hosts 11 different examples in the examples directory. Each example comes with a README file that contains its description and instructions on how to run it. Additionally, we provide three full-length tutorials (CSV, BibTeX and Calc) in the documentation. Here we briefly describe each example. The BibTeX example demonstrates parsing of the BibTeX format
Conclusions
Arpeggio is an implementation of a packrat parser that brings unambiguous parsing with unlimited lookahead while still preserving linear parse times. Its performance has been tested, and our results show that its speed is comparable to that of other popular solutions and that in some cases it outperforms them. It has been used for several years in both academic and industrial environments and is covered by extensive unit tests. Because of its usage in the DSL course at the Faculty of Technical Sciences we
References (9)
Three models for the description of language. IRE Trans. Inf. Theory (1956)
Parsing expression grammars: a recognition-based syntactic foundation. ACM SIGPLAN Notices (2004)
Packrat parsing: simple, powerful, lazy, linear time. Proceedings of the Seventh ACM SIGPLAN International Conference on Functional Programming (2002)
Domain-Specific Languages (2010)