Efficient Submatch Extraction for Practical Regular Expressions

Haber, Stuart; Horne, William; Manadhata, Pratyusa; Mowbray, Miranda; Rao, Prasad

doi:10.1007/978-3-642-37064-9_29

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7810)

Included in the following conference series:

International Conference on Language and Automata Theory and Applications

Abstract

A capturing group is a syntax used in modern regular expression implementations to specify a subexpression of a regular expression. Given a string that matches the regular expression, submatch extraction is the process of extracting the substrings corresponding to those subexpressions. Greedy and reluctant closures are variants of the standard closure operator that affect how submatches are extracted. The state of the art and practice in submatch extraction are automata-based approaches and backtracking algorithms. In theory, the number of states in an automata-based approach can be exponential in n, the size of the regular expression, and the running time of backtracking algorithms can be exponential in ℓ, the length of the string. In this paper, we present an O(ℓc) runtime automata-based algorithm for extracting submatches from a string that matches a regular expression, where c > 0 is the number of capturing groups. The previous fastest automata-based algorithm was O(nℓc). Both our approach and the previous fastest one require worst-case exponential compile time. In practice, however, the worst-case behavior rarely occurs, so achieving a practical speed-up over state-of-the-art methods is of significant interest. Our experimental results show that, for a large set of regular expressions used in practice, our algorithm is approximately twice as fast as Java’s backtracking-based regular expression library and approximately twenty times faster than the RE2 regular expression engine.
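
To make the terms in the abstract concrete, the following minimal sketch shows how a greedy and a reluctant closure can yield different submatches for the same input. It uses Python's standard `re` module, which is a backtracking engine, purely to illustrate the semantics; it is not the automata-based algorithm the paper presents. The helper name `extract_submatches`, the sample string, and the two patterns are assumptions chosen for illustration.

```python
import re


def extract_submatches(pattern: str, text: str) -> tuple:
    """Return the substrings captured by each group, or () if text does not match.

    Hypothetical helper for illustration; it delegates to Python's
    backtracking `re` engine, not to the algorithm described in the paper.
    """
    match = re.fullmatch(pattern, text)
    return match.groups() if match else ()


text = "<a><b>"

# Greedy closure: the first group takes as much as it can
# while still letting the whole string match.
greedy = extract_submatches(r"<(.*)>(.*)", text)

# Reluctant closure: the first group takes as little as it can.
reluctant = extract_submatches(r"<(.*?)>(.*)", text)

print(greedy)     # ('a><b', '')
print(reluctant)  # ('a', '<b>')
```

Both patterns accept the same string, so the difference lies only in which substrings are assigned to the c = 2 capturing groups.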

Author information

Authors and Affiliations

HP Labs Princeton, 5 Vaughn Drive, Suite 301, Princeton, NJ, 08540, USA
Stuart Haber, William Horne, Pratyusa Manadhata & Prasad Rao
HP Labs Bristol, Long Down Ave, Stoke Gifford, Bristol, BS34 8QT, UK
Miranda Mowbray

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistics, Universitat Rovira i Virgili, Avinguda Catalunya, 35, 43002, Tarragona, Spain
Adrian-Horia Dediu & Carlos Martín-Vide
Fakultät für Informatik, Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Bianca Truthe

About this paper

Cite this paper

Haber, S., Horne, W., Manadhata, P., Mowbray, M., Rao, P. (2013). Efficient Submatch Extraction for Practical Regular Expressions. In: Dediu, AH., Martín-Vide, C., Truthe, B. (eds) Language and Automata Theory and Applications. LATA 2013. Lecture Notes in Computer Science, vol 7810. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37064-9_29

DOI: https://doi.org/10.1007/978-3-642-37064-9_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37063-2
Online ISBN: 978-3-642-37064-9
eBook Packages: Computer Science, Computer Science (R0)
