Re-examining regular expressions with backreferences
Introduction
Regular expressions as used and implemented in practice differ vastly from their traditional theoretical counterpart, both in semantics (driven by the features offered) and in expectations of performance. Even without the more complex features, the performance profile of practical regular expression matching is a fairly deep subject, which has received theoretical study only fairly recently, for example in [2] and [3]. In this paper we focus on regular expressions with backreferences (rewbr for short), an advanced feature available in most regular expression matching libraries. This subject has seen some study in the literature, and we will refer frequently to [4], [5], [6], and [7]. However, each paper has its own definition of a rewbr and its semantics ([5] in effect has two), and many implementations disagree both with all of these definitions (although the definition given by Aho in [4] is common) and with each other. The differences may initially seem minor, but they turn out to have a very real impact on the languages that can be matched.
A backreference is placed in a regular expression to indicate that the substring matched by some specified capturing group (where capturing group is synonymous with parenthesized subexpression) should be matched again at the position (or positions) where the backreference is placed. In the Java programming language, \i denotes that the substring most recently matched by the ith capturing group should be matched again by the backreference, where capturing groups are numbered from 1 onwards according to the position of their left parenthesis when reading the regular expression from left to right. For example, [0-9]+\.\d*(\d+)\1+ can be used to match recurring decimal numbers, such as 0.33, 0.818181 and 0.04555, since the subexpression (\d+) captures some sequence of digits in the input string and the backreference \1+ instructs the matcher to match this sequence again, one or more times. Similarly, the regular expression (.+)\1 matches strings of the form ww, i.e. it exhibits the reduplication property, which is not context-free.
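The two example expressions and the group-numbering rule can be checked mechanically. The following is a minimal sketch using Python's re module rather than Java (a choice made here purely for illustration; for these particular patterns the \i syntax and the left-to-right numbering coincide). The helper name matches, the constant names, and the test strings not mentioned in the text are hypothetical.

```python
import re

# Recurring decimals: integer part, a dot, an optional non-repeating prefix,
# then a block of digits (group 1) that must occur again at least once (\1+).
RECURRING = re.compile(r"[0-9]+\.\d*(\d+)\1+")

# Reduplication: a non-empty string w (group 1) followed by a second copy of w.
SQUARE = re.compile(r"(.+)\1")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole of `text` is matched by `pattern`."""
    return pattern.fullmatch(text) is not None


# Groups are numbered by the position of their left parenthesis:
# group 1 is the outer group, groups 2 and 3 are nested inside it.
numbering = re.fullmatch(r"((a)(b))", "ab")

if __name__ == "__main__":
    for s in ["0.33", "0.818181", "0.04555", "0.123"]:
        print(s, matches(RECURRING, s))
    for s in ["abab", "aa", "aba"]:
        print(s, matches(SQUARE, s))
    print(numbering.groups())
```

Whole-string matching (fullmatch) is used so that the outcome reflects membership in the language of the expression rather than the existence of a matching substring.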
Long and complicated regular expressions may be hard to read and maintain, as adding or removing capturing groups changes the numbers of all groups following the modification. The re module in Python was the first to offer a solution in terms of named capturing groups and backreferences: (?P<name>group) captures the match of the subexpression group into name, whereas a backreference to the contents of this capturing group is written (?P=name). In some implementations it is then possible to reuse the same label for different capturing groups (e.g. Python and .NET both allow naming of groups, but .NET allows reusing names where Python does not), which opens possibilities not available when simply numbering capturing groups from left to right. Regular expression matchers also use different conventions for how matching is defined when a backreference is encountered before any substring has been captured under the corresponding label. These subtle differences in the syntax and semantics allowed in rewbr influence the classes of languages described, as well as the relative succinctness of the rewbr variants. A thorough comparison of rewbr variants is thus needed before further study is possible, and this comparison forms a large part of our contribution.
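As a small illustration of these points, the sketch below exercises named groups in Python's re module, probes whether a group name may be reused, and shows what this particular matcher does with a backreference to a group that has not captured anything. The names NAMED_SQUARE, UNSET_REF and allows_duplicate_names are hypothetical, and the behaviour shown is that of Python only; as noted above, other matchers follow different conventions.

```python
import re

# Named group `w` and a named backreference to it: matches strings ww.
NAMED_SQUARE = re.compile(r"(?P<w>.+)(?P=w)")


def allows_duplicate_names() -> bool:
    """Report whether this matcher accepts two groups with the same name."""
    try:
        re.compile(r"(?P<x>a)|(?P<x>b)")
    except re.error:
        return False
    return True


# Group 1 only captures on the `a` branch. On input "b" the backreference
# refers to a group that never captured a substring.
UNSET_REF = re.compile(r"(?:(a)|b)\1")

if __name__ == "__main__":
    print(NAMED_SQUARE.fullmatch("abab").group("w"))
    print(allows_duplicate_names())
    print(UNSET_REF.fullmatch("aa") is not None)
    print(UNSET_REF.fullmatch("b") is not None)
```

In Python the backreference to the unset group simply fails to match, so "b" is rejected; a matcher that instead lets an unset backreference match the empty string would accept it, which is the kind of semantic difference the comparison has to account for.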
This paper takes as its starting point the definitions and results on rewbr from [4], [5], [6], [7] and [8]. In particular, the structure of the definition of the matching semantics of rewbr is taken from [8], and the pumping lemma for rewbr from [5] (also treated in [6]) provides the intuition for our own pumping-style lemmas. As an illustrative example, consider [9], which demonstrates that bounding the nesting depth of capturing groups induces a strict language hierarchy (i.e., rewbr with a capturing group nesting depth of k+1 recognize a strictly larger class of languages than those bounded to k). It demonstrates this for the definitions from [4], but as we will see this forms the strongest class considered here, and as a result the difficult part of the proof is easily adapted to all the other classes.
The outline of the paper is as follows. After providing the necessary notation and definitions in the next section, we first illustrate the nature of the language classes induced by giving some bounds on the succinctness of strings generated, and some complexity results. Next, we develop various lemmas revealing properties of the language classes and then describe the relationships between the language classes obtained when considering the variants of rewbr as found in theory and practice.
Section snippets
Notation and definitions
We use Σ and Φ as finite input and backreference alphabets respectively, with these (possibly empty) alphabets being disjoint. Let ∅ and ε denote the empty set and word respectively, ℕ denotes the set of natural numbers including 0, and for , with , denotes the set . To improve readability, we sometimes denote as . For a string w over Σ (or any other alphabet), we denote by |w| the length of w, i.e. the number of occurrences of symbols
Succinctness and membership testing complexity
Before we delve into pumping lemmas and the relationships between the languages induced by these expressions, let us first bound how long a string a rewbr may succinctly encode, as already foreshadowed in Example 3. We then consider the computational complexity of membership and emptiness testing for the different classes.
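To give a feel for how short a rewbr can be relative to the strings it describes, the following sketch builds a family of expressions in which each capturing group repeats the previous group twice, so that an expression with n groups accepts only the string of 2^n - 1 copies of a. This construction is an illustrative assumption and is not claimed to be the paper's Example 3 or to achieve its bounds; the function names doubling_pattern and accepted_lengths are hypothetical, and Python's re module is used as the matcher.

```python
import re


def doubling_pattern(n: int) -> str:
    """Build an expression with n groups: the first group matches a single a,
    and group k+1 consists of two backreferences to group k."""
    if n < 1:
        raise ValueError("n must be at least 1")
    parts = ["(a)"]
    for i in range(1, n):
        ref = f"\\{i}"  # backreference to group i, e.g. \1
        parts.append(f"({ref}{ref})")
    return "".join(parts)


def accepted_lengths(n: int, limit: int) -> list[int]:
    """Lengths m <= limit for which the string of m a's is fully matched."""
    pattern = re.compile(doubling_pattern(n))
    return [m for m in range(limit + 1) if pattern.fullmatch("a" * m)]


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, len(doubling_pattern(n)), accepted_lengths(n, 2 ** n + 5))
```

The expression grows by a constant number of symbols per group, while the length of the single accepted string roughly doubles with each added group.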
Pumping lemmas
The pumping lemma given in [5] is a useful tool for finding languages not in . It is used to show that . First we recall the pumping lemma for , which is then considered in the context of the additional semantics treated here, to introduce more general pumping-style lemmas.
Lemma 3 For (i.e. a language matched by some rewbr in ) there exists a constant k such that if with , then there is a decomposition , for some from [5]
Language hierarchies
Using the pumping results from the previous section, some straightforward containment relationships are obtained. In this section we consider these relationships in detail. We begin with Lemma 10, where we combine and summarize straightforward containment relationships based on what has already been established. This is followed by Theorem 3, which establishes the equivalence of and . Then the core of this section first establishes the relationships between , ,
Conclusions and future work
An important contribution in this article is the definitions providing a framework summarizing several disparate takes on backreferences in regular expressions. Beyond that, we have given a variety of results describing properties of these languages, as well as the complexity of some decision problems on them. Finally, Theorem 6 exhaustively puts the language classes in relation to each other and other relevant language classes. Nonetheless, there are many avenues which call for further
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
We would like to thank an anonymous reviewer for helpful and thorough comments on a previous version of this document.
References (18)
Algorithms for finding patterns in strings
Regular expressions with nested levels of back referencing form a hierarchy, Inf. Process. Lett. (1998)
et al., Pattern matching with variables: a multivariate complexity analysis, Inf. Comput. (2015)
et al., Regular expressions with backreferences re-examined
et al., Analyzing catastrophic backtracking behavior in practical regular expression matching
et al., Analyzing matching time behavior of backtracking regular expression matchers by using ambiguity of NFA
et al., A formal study of practical regular expressions, Int. J. Found. Comput. Sci. (2003)
et al., On extended regular expressions
et al., Deterministic regular expressions with back-references