Elsevier

Knowledge-Based Systems

Volume 95, 1 March 2016, Pages 71-74
Arpeggio: A flexible PEG parser for Python

https://doi.org/10.1016/j.knosys.2015.12.004

Abstract

Arpeggio is a recursive descent parser with full backtracking and memoization based on Parsing Expression Grammars (PEGs). This category of parsers is known as packrat parsers. It is implemented in the Python programming language and works as a grammar interpreter.

Arpeggio has very good support for error reporting, debugging, and grammar and parse tree visualization. It is used in industrial environments and in teaching the Domain-Specific Languages course at the Faculty of Technical Sciences in Novi Sad. Arpeggio is the foundation of textX, a high-level DSL meta-language and tool.

It is free and open-source software available on GitHub under the MIT license.

Introduction

A parser is a software component that takes input (usually textual) and produces a data structure. This transformation is often based on a formal description of the input language syntax - a grammar. A traditional way to define the syntax of a programming language is Chomsky’s generative system of grammars [1], in particular, Context-Free Grammars and Regular Expressions. The main problem with this approach is that it was meant to describe natural languages, where the ability to express ambiguity is a desirable feature. However, that same feature is a source of serious problems when describing machine-oriented syntaxes.

Parsing Expression Grammars (PEGs) provide an alternative, recognition-based formal foundation for describing machine-oriented syntaxes, which solves the ambiguity problem by not introducing ambiguity in the first place [2].

Arpeggio is an implementation of a PEG-based recursive descent parser with backtracking and memoization, written in the Python programming language. This class of parsers is known as packrat parsers [3]. Full backtracking enables unlimited lookahead, while linear parse time is still preserved by memoization, a technique in which intermediate results are cached.
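The interplay of backtracking and memoization can be illustrated with a minimal sketch. This is not Arpeggio's implementation: the class `PackratParser`, the helper `recognize`, and the small arithmetic grammar are all hypothetical names chosen for illustration. Each rule result is cached per (rule, position) pair, so when an ordered choice backtracks and retries a rule at the same position, the cached result is reused instead of being recomputed.

```python
class PackratParser:
    """Toy recognizer: caches each rule result per (rule, position)."""

    def __init__(self, text):
        self.text = text
        self.memo = {}   # (rule name, position) -> end position or None
        self.calls = 0   # number of rule bodies actually executed

    def apply(self, rule, pos):
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.calls += 1
            self.memo[key] = rule(self, pos)
        return self.memo[key]

    def literal(self, s, pos):
        return pos + len(s) if self.text.startswith(s, pos) else None


def additive(p, pos):
    # additive <- multitive '+' additive / multitive
    end = p.apply(multitive, pos)
    if end is not None:
        plus = p.literal("+", end)
        if plus is not None:
            rest = p.apply(additive, plus)
            if rest is not None:
                return rest
    # Backtrack to the second alternative; this lookup is a cache hit.
    return p.apply(multitive, pos)


def multitive(p, pos):
    # multitive <- primary '*' multitive / primary
    end = p.apply(primary, pos)
    if end is not None:
        star = p.literal("*", end)
        if star is not None:
            rest = p.apply(multitive, star)
            if rest is not None:
                return rest
    return p.apply(primary, pos)


def primary(p, pos):
    # primary <- '(' additive ')' / [0-9]
    opened = p.literal("(", pos)
    if opened is not None:
        inner = p.apply(additive, opened)
        if inner is not None:
            closed = p.literal(")", inner)
            if closed is not None:
                return closed
    if pos < len(p.text) and "0" <= p.text[pos] <= "9":
        return pos + 1
    return None


def recognize(text):
    """Return (whole input matched?, number of rule bodies executed)."""
    p = PackratParser(text)
    return p.apply(additive, 0) == len(text), p.calls
```

Because every (rule, position) pair is evaluated at most once, the number of executed rule bodies in this sketch is bounded by the number of rules times the number of input positions, which is where the linear-time guarantee comes from. The price is the memory used by the cache.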

The main motivation to design and implement Arpeggio was to provide a parsing infrastructure for a Domain-Specific Languages (DSL) [4] development tool textX [5]. Nevertheless, as parsers are important parts of many software tools and libraries (e.g. [6]), Arpeggio is built to be suitable for all sorts of general purpose parsing. It is used in data extraction from various textual formats, parsing of different languages, analysis of legacy source code, etc.

Section snippets

Problems and background

The development of DSLs usually requires a lot of experimentation through trial and error. Furthermore, DSLs are much more prone to change than General-Purpose Languages (GPLs). Thus, tools for DSL development should be built in such a way that the grammar is readable and simple to change and extend, and that a fast round-trip is possible.

From the start, Arpeggio has been designed to work as a grammar interpreter, as opposed to a grammar compiler (i.e. a parser generator). Furthermore, various grammar syntaxes are

Software framework

From the given grammar Arpeggio builds, at runtime, an instance of the parser, which is a graph of Python objects whose classes inherit from the ParsingExpression class (Fig. 1).

We call this graph of objects the parser model. The parser model for the simple grammar given in Fig. 3 is given in Fig. 2.

A grammar may be specified using different syntaxes. A canonical form of the grammar specification is the internal DSL form [4], i.e. the grammar is defined using Python language elements (Fig. 3).
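The idea of turning a grammar written with Python language elements into a graph of parsing expression objects can be sketched as follows. This is a toy model, not Arpeggio's code: the classes below are simplified stand-ins, the `build` function is hypothetical, and the sketch assumes a convention in which a function denotes a rule, a tuple a sequence, a list an ordered choice, and a string a literal match. Memoization is omitted for brevity.

```python
class ParsingExpression:
    """Toy base class: a node tries to match text at a position."""

    def parse(self, text, pos):
        raise NotImplementedError


class StrMatch(ParsingExpression):
    def __init__(self, literal):
        self.literal = literal

    def parse(self, text, pos):
        if text.startswith(self.literal, pos):
            return pos + len(self.literal)
        return None


class Sequence(ParsingExpression):
    def __init__(self):
        self.nodes = []

    def parse(self, text, pos):
        for node in self.nodes:      # all children must match in order
            pos = node.parse(text, pos)
            if pos is None:
                return None
        return pos


class OrderedChoice(ParsingExpression):
    def __init__(self):
        self.nodes = []

    def parse(self, text, pos):
        for node in self.nodes:      # first matching alternative wins
            end = node.parse(text, pos)
            if end is not None:
                return end
        return None


def build(element, cache):
    """Interpret a grammar element and return its node in the object graph."""
    if isinstance(element, str):
        return StrMatch(element)
    if callable(element):            # a rule: resolve once, then reuse
        if element not in cache:
            body = element()
            if not isinstance(body, (tuple, list)):
                body = (body,)       # bare element -> one-item sequence
            node = Sequence() if isinstance(body, tuple) else OrderedChoice()
            node.rule_name = element.__name__
            cache[element] = node    # register first so recursion terminates
            node.nodes = [build(e, cache) for e in body]
        return cache[element]
    node = Sequence() if isinstance(element, tuple) else OrderedChoice()
    node.nodes = [build(e, cache) for e in element]
    return node


# Hypothetical grammar written with plain Python elements.
def digit(): return list("0123456789")
def value(): return [digit, ("(", expr, ")")]
def expr(): return [(value, ["+", "-"], expr), value]


model = build(expr, {})              # built at runtime, no code generation
print(model.parse("(1+2)-3", 0))     # 7 (the whole input is consumed)
```

Since the graph is built by interpreting the grammar at runtime, a change to a rule function takes effect on the next call to `build`, without a separate generation step. Recursive rules become cycles in the graph: here the node for `expr` appears again inside its own first alternative.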

In this

Implementation and empirical results

Arpeggio is written in pure Python without any dependencies. It can be installed from PyPI using the standard Python installer, pip. The details of the installation and usage can be found in the project documentation.

Arpeggio has been validated in various academic and industrial projects. It is covered with extensive unit tests. Our previous work

Illustrative examples

The Arpeggio code repository hosts 11 different examples in the examples directory. Each example comes with a README file which contains its description and instructions on how to run it. Additionally, we provide three full-length tutorials (CSV, BibTeX and Calc) in the documentation. Here we will briefly describe each example. The BibTeX example demonstrates parsing of the BibTeX format

Conclusions

Arpeggio is an implementation of a packrat parser that brings unambiguous parsing with unlimited lookahead while still preserving linear parse times. Its performance has been tested, and our results show that its speed is comparable to that of other popular solutions and that in some cases it outperforms them. It has been used for several years in both academic and industrial environments and is covered with extensive unit tests. Because of its usage in the DSL course at the Faculty of Technical Sciences we

References (9)

  • [1] N. Chomsky, Three models for the description of language, IRE Trans. Inf. Theory (1956)
  • [2] B. Ford, Parsing expression grammars: a recognition-based syntactic foundation, ACM SIGPLAN Notices (2004)
  • [3] B. Ford, Packrat parsing: simple, powerful, lazy, linear time, Proceedings of the Seventh ACM SIGPLAN International Conference on Functional Programming (2002)
  • [4] M. Fowler, Domain-Specific Languages (2010)
There are more references available in the full text version of this article.

Cited by (13)

  • Parglare: A LR/GLR parser for Python

    2022, Science of Computer Programming
    Citation Excerpt :

    The grammar is defined using Python language constructs by overloading of Python operators. Arpeggio [39] is another PEG library where grammar is defined by either using Python language or by a textual PEG syntax. TextX [40] is a library that builds on top of Arpeggio and provides additional facilities geared towards Domain-Specific Language development.

  • TextX: A Python tool for Domain-Specific Languages implementation

    2017, Knowledge-Based Systems
    Citation Excerpt :

    It enables fast specification of both concrete and abstract syntaxes (i.e. meta-model) but it’s light-weight and easy to use in different contexts. Its only dependency is the Arpeggio packrat parser which brings unlimited lookahead with linear parse times [10]. We provide a detailed comparison to other popular tools in the documentation.1

  • Object parsing grammars with composition

    2021, Proceedings of the 16th International Conference on Software Technologies, ICSOFT 2021