Abstract
We present new algorithms for producing greedy parses for regular expressions (REs) in a semi-streaming fashion. Our lean-log algorithm executes in time \(O(mn)\) for REs of size \(m\) and input strings of size \(n\), and outputs a compact bit-coded parse tree representation. It improves on previous algorithms by operating in only 2 passes; using only \(O(m)\) words of random-access memory (independent of \(n\)); requiring only \(kn\) bits of sequentially written and read log storage, where \(k < \frac{1}{3} m\) is the number of alternatives and Kleene stars in the RE; and processing the input string as a symbol stream, without requiring it to be stored at all. Previous RE parsing algorithms either do not scale linearly with input size, or require substantially more log storage and employ 3 passes, the first of which reverses the input, or do not produce (or are not known to produce) a greedy parse. The performance of our unoptimized C-based prototype indicates that the lean-log algorithm also has superior performance in practice and is surprisingly competitive with RE tools that do not perform full parsing, such as Grep.
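To make the notions of "greedy parse" and "bit-coded parse tree" concrete, the following is a minimal sketch. It is not the lean-log algorithm: it is a naive backtracking parser that can take exponential time and keeps the whole input in memory. The tuple-based RE representation, the function names `parses` and `greedy_bits`, and the bit convention (0/1 for the left/right branch of an alternative, 0 to enter another star iteration, 1 to leave the star) are assumptions made for illustration. As a further simplification, star iterations that consume no input are skipped so that the search terminates.

```python
# Hypothetical RE representation (an assumption, not taken from the paper):
#   ("eps",) | ("sym", c) | ("cat", r1, r2) | ("alt", r1, r2) | ("star", r)

def parses(r, s, i):
    """Yield (end, bits) for every way r matches s[i:end], leftmost choices first."""
    tag = r[0]
    if tag == "eps":
        yield i, ""
    elif tag == "sym":
        if i < len(s) and s[i] == r[1]:
            yield i + 1, ""
    elif tag == "cat":
        for j, b1 in parses(r[1], s, i):
            for k, b2 in parses(r[2], s, j):
                yield k, b1 + b2
    elif tag == "alt":
        # Assumed convention: "0" selects the left branch, "1" the right.
        for j, b in parses(r[1], s, i):
            yield j, "0" + b
        for j, b in parses(r[2], s, i):
            yield j, "1" + b
    elif tag == "star":
        # Greedy: try one more iteration ("0") before leaving the star ("1").
        for j, b1 in parses(r[1], s, i):
            if j > i:  # simplification: skip empty iterations to ensure termination
                for k, b2 in parses(r, s, j):
                    yield k, "0" + b1 + b2
        yield i, "1"

def greedy_bits(r, s):
    """Return the bit code of the first full parse in backtracking order, or None."""
    for end, bits in parses(r, s, 0):
        if end == len(s):
            return bits
    return None

# (a|ab)* on "ab": the left alternative is tried first but leads to a dead end,
# so the parse uses one iteration with the right alternative.
star_re = ("star", ("alt", ("sym", "a"), ("cat", ("sym", "a"), ("sym", "b"))))
print(greedy_bits(star_re, "ab"))
```

Each alternative and each star decision contributes one bit, so the code records only the choices made during parsing. Under the stated convention the string is enough to rebuild the parse tree when read together with the RE.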
This work has been partially supported by The Danish Council for Independent Research under Project 11-106278, “Kleene Meets Church: Regular Expressions and Types”. The order of authors is insignificant.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grathwohl, N.B.B., Henglein, F., Nielsen, L., Rasmussen, U.T. (2013). Two-Pass Greedy Regular Expression Parsing. In: Konstantinidis, S. (eds) Implementation and Application of Automata. CIAA 2013. Lecture Notes in Computer Science, vol 7982. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-39274-0_7
DOI: https://doi.org/10.1007/978-3-642-39274-0_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-39273-3
Online ISBN: 978-3-642-39274-0
eBook Packages: Computer Science (R0)