Re-examining regular expressions with backreferences

https://doi.org/10.1016/j.tcs.2022.10.041

Abstract

Most modern regular expression matching libraries (one of the rare exceptions being Google's RE2) allow backreferences, operations which bind a substring to a variable, allowing it to be matched again verbatim. However, both real-world implementations and definitions in the literature impose different syntactic restrictions and differ in the semantics of backreference matching. Our aim is to compare these various flavors by considering the classes of formal languages that each can describe, establishing, as a result, a hierarchy of language classes. Beyond the hierarchy itself, some complexity results are given, and, as part of the effort to compare language classes, new pumping lemmas are established, old classes are extended to new ones, and several incidental results on the nature of these language classes are given.

Introduction

Regular expressions as used and implemented in practice are vastly different from their traditional theoretical counterparts, both in semantics (driven by the features offered) and in expectations of performance. Even when the more complex features are not used, the performance profile of practical regular expression matching is a fairly deep subject, which has seen theoretical study only fairly recently, such as in [2] and [3]. In this paper we focus on regular expressions with backreferences (rewbr for short), an advanced feature which is available in most regular expression matching libraries. This subject has seen some study in the literature, and we will refer frequently to [4], [5], [6], and [7]. However, each paper has its own definition of a rewbr and its semantics ([5] in effect has two), and many implementations disagree both with all of these definitions and with each other, although the definition given by Aho in [4] is common. The differences may initially seem minor, but turn out to have a very real impact on the languages that can be matched.

A backreference is placed in a regular expression to indicate that the substring matched by some specified capturing group (where capturing group is synonymous with parenthesized subexpression) should be matched again at the position (or positions) where the backreference is placed. In the Java programming language, \i denotes that the substring most recently matched by the ith capturing group should be matched again by the backreference, where capturing groups are numbered from 1 onwards, based on the relative position of their left parenthesis when reading the regular expression from left to right. For example, [0-9]+\.\d*(\d+)\1+ can be used to match recurring decimal numbers, such as 0.33, 0.818181 and 0.04555, since the subexpression (\d+) captures some sequence of digits in the input string and the backreference \1+ instructs the matcher to match this sequence again, one or more times. Similarly, the regular expression (.+)\1 matches strings of the form ww, i.e. it describes a reduplication language, which is not context-free.
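As a minimal sketch of the two examples above, the following uses Python's re module, which shares the \1 syntax for numbered backreferences. The function names is_recurring and is_square are our own, introduced for illustration only, and fullmatch is used so that the whole input must be described by the expression.

```python
import re

# The recurring-decimal expression from the text, unchanged.
RECURRING = re.compile(r"[0-9]+\.\d*(\d+)\1+")

# Strings of the form ww, with w non-empty.
SQUARE = re.compile(r"(.+)\1")


def is_recurring(s: str) -> bool:
    """True if the entire string s is matched by the recurring-decimal expression."""
    return RECURRING.fullmatch(s) is not None


def is_square(s: str) -> bool:
    """True if the entire string s has the form ww for some non-empty w."""
    return SQUARE.fullmatch(s) is not None


if __name__ == "__main__":
    for s in ["0.33", "0.818181", "0.04555", "0.123"]:
        print(s, is_recurring(s))
    for s in ["abab", "aba"]:
        print(s, is_square(s))
```

The string 0.123 is rejected because no suffix of its digits consists of a block followed by at least one further copy of that block.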

Long and complicated regular expressions may be hard to read and maintain, as adding or removing capturing groups changes the numbers of all groups following the modification. The re module in Python was the first to offer a solution in terms of named capturing groups and backreferences: (?P<name>group) captures the match of the subexpression group into name, whereas a backreference to the contents of this capturing group is written (?P=name). In some implementations it is then possible to reuse the same label for different capturing groups (e.g. Python and .NET both allow naming of groups, but .NET allows reusing names where Python does not), which opens possibilities not available when simply numbering capturing groups from left to right. Regular expression matchers also use different conventions for how matching is defined when a backreference is encountered before any substring has been captured under the corresponding label. These subtle differences in the syntax and semantics of rewbr influence the classes of languages described, as well as the relative succinctness of the rewbr variants. A thorough comparison of rewbr variants is thus needed if further study is to be possible, and this comparison forms a big part of our contribution.
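The following sketch probes these conventions in one specific implementation, Python's re module. The helper names and the test expressions are our own illustrative choices and are not taken from any of the cited definitions. It checks a named backreference, whether a group name may be reused, and how a backreference to a group that captured nothing is treated.

```python
import re

# Named capturing group and named backreference, as described in the text.
DOUBLED = re.compile(r"(?P<word>[a-z]+) (?P=word)")


def has_doubled_word(s: str) -> bool:
    """True if s contains some lowercase block, a space, and the same block again."""
    return DOUBLED.search(s) is not None


def allows_name_reuse() -> bool:
    """True if this engine accepts two capturing groups with the same name."""
    try:
        re.compile(r"(?P<x>a)|(?P<x>b)")
        return True
    except re.error:
        return False


def unbound_backref_matches(s: str) -> bool:
    """Match s against (?:(a)|b)\\1, where group 1 is unset whenever 'b' is taken."""
    return re.fullmatch(r"(?:(a)|b)\1", s) is not None


if __name__ == "__main__":
    print(has_doubled_word("the the cat"))
    print(allows_name_reuse())
    print(unbound_backref_matches("aa"))
    print(unbound_backref_matches("b"))
```

In Python the input b is rejected, since a backreference to a group that has not captured anything fails to match. An engine in which such a backreference matches the empty string would accept b, which is the kind of semantic difference at issue here.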

This paper takes as its starting point the definitions and results on rewbr from [4], [5], [6], [7] and [8]. In particular, the structure of the definition of the matching semantics of rewbr is taken from [8], and the pumping lemma for rewbr from [5] (also treated in [6]) provides the intuition for our own pumping-style lemmas. As an illustrative example, consider [9], which demonstrates that bounding the nesting depth of capturing groups induces a strict language hierarchy (i.e., rewbr with a capturing group nesting depth of k+1 recognize a strictly larger class of languages than those bounded to k). It demonstrates this for the definitions from [4], but as we will see these yield the strongest class considered here, and as a result the difficult part of the proof is easily adapted to all the other classes.

The outline of the paper is as follows. After providing the necessary notation and definitions in the next section, we first illustrate the nature of the language classes induced by giving some bounds on the succinctness of strings generated, and some complexity results. Next, we develop various lemmas revealing properties of the language classes and then describe the relationships between the language classes obtained when considering the variants of rewbr as found in theory and practice.

Notation and definitions

We use Σ and Φ as finite input and backreference alphabets respectively, with these (possibly empty) alphabets being disjoint. Let ∅ and ε denote the empty set and the empty word respectively, let ℕ denote the set of natural numbers including 0, and for m, n ∈ ℕ with m ≤ n, let [m,n] denote the set {m, m+1, …, n}. To improve readability, we sometimes write v₁ = w₁, …, vₙ = wₙ as (v₁, …, vₙ) = (w₁, …, wₙ). For a string w over Σ (or any other alphabet), we denote by |w| the length of w, i.e. the number of occurrences of symbols

Succinctness and membership testing complexity

Before we delve into pumping lemmas and the relationships between the languages induced by these expressions, let us first bound the length of the strings that a rewbr may succinctly encode, a question already foreshadowed in Example 3. We then consider the computational complexity of membership and emptiness testing for the different classes.
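To give a feeling for why such a bound is of interest, the following sketch builds a well-known family of expressions in which each capturing group repeats the previous group twice, so that an expression of length linear in k matches exactly one string, of length 2^k − 1. This family and the helper names doubling_pattern and encoded_length are our own illustration. They are not claimed to be the construction of Example 3 or the bound established in this section.

```python
import re


def doubling_pattern(k: int) -> str:
    # Builds (a)(\1\1)(\2\2)...: group i+1 matches two copies of group i.
    parts = ["(a)"]
    for i in range(1, k):
        parts.append(f"(\\{i}\\{i})")
    return "".join(parts)


def encoded_length(k: int) -> int:
    # Group i matches 2**(i-1) symbols, so k groups match 2**k - 1 in total.
    return 2 ** k - 1


if __name__ == "__main__":
    for k in range(1, 6):
        pattern = doubling_pattern(k)
        word = "a" * encoded_length(k)
        matched = re.fullmatch(pattern, word) is not None
        print(k, len(pattern), len(word), matched)
```

For k = 4 the expression is (a)(\1\1)(\2\2)(\3\3), which has 21 symbols and matches only the string consisting of 15 occurrences of a.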

Pumping lemmas

The pumping lemma given in [5] is a useful tool for finding languages not in L[ε,!,↶̸]. It is used to show that L[ε,!,↶̸] ⊊ L[ε,!]. First we recall the pumping lemma for L[ε,!,↶̸], which is then considered in the context of the additional semantics treated here, to introduce more general pumping-style lemmas.

Lemma 3 (from [5])

For L ∈ L[ε,!,↶̸] (i.e. a language matched by some rewbr in rewbr[ε,!,↶̸]) there exists a constant k such that if w ∈ L with |w| > k, then there is a decomposition w = x₀ v x₁ v x₂ ⋯ v xₙ, for some n ≥ 1

Language hierarchies

Using the pumping results from the previous section, some straightforward containment relationships are obtained. In this section we consider these relationships in detail. We begin with Lemma 10, where we combine and summarize straightforward containment relationships based on what has already been established. This is followed by Theorem 3, which establishes the equivalence of rewbr[ε] and rewbr[∅]. Then the core of this section first establishes the relationships between L[∅,!], L[∅,!,↶̸],

Conclusions and future work

An important contribution of this article is the set of definitions providing a framework that summarizes several disparate takes on backreferences in regular expressions. Beyond that, we have given a variety of results describing properties of these languages, as well as the complexity of some decision problems on them. Finally, Theorem 6 exhaustively puts the language classes in relation to each other and other relevant language classes. Nonetheless, there are many avenues which call for further

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We would like to thank an anonymous reviewer for helpful and thorough comments on a previous version of this document.


This article is a revised and extended version of [1].