ABSTRACT
The manipulation of raw string data is ubiquitous in security-critical software, and verification of such software relies on efficiently solving string and regular expression constraints via SMT. However, the typical case of Boolean combinations of regular expression constraints exposes blowup in existing techniques. To address solvability of such constraints, we propose a new theory of derivatives of symbolic extended regular expressions (extended meaning that complement and intersection are incorporated), and show how to apply this theory to obtain more efficient decision procedures. Our implementation of these ideas, built on top of Z3, matches or outperforms state-of-the-art solvers on standard and handwritten benchmarks, showing particular benefits on examples with Boolean combinations.
Our work is the first formalization of derivatives of regular expressions which both handles intersection and complement and works symbolically over an arbitrary character theory. It unifies existing approaches involving derivatives of extended regular expressions, alternating automata and Boolean automata by lifting them to a common symbolic platform. It relies on a parsimonious augmentation of regular expressions: a construct for symbolic conditionals is shown to be sufficient to obtain relevant closure properties for derivatives over extended regular expressions.
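To make the notion of a derivative over extended regular expressions concrete, the following is a minimal sketch of classical Brzozowski-style derivatives with intersection and complement added, where single characters are matched by predicates rather than literal symbols. It is an illustration only: the class names (`Pred`, `And`, `Not`, and so on), the helper `matches`, and the example `policy` are assumptions made here. It takes derivatives with respect to concrete characters, so it does not reproduce the symbolic conditional construct or the Z3-based decision procedure described above, and it performs no simplification of intermediate expressions.

```python
from dataclasses import dataclass
from typing import Callable


class Re:
    """Base class for extended regular expressions (illustrative names)."""


@dataclass(frozen=True)
class Empty(Re):          # matches no string
    pass

@dataclass(frozen=True)
class Eps(Re):            # matches only the empty string
    pass

@dataclass(frozen=True)
class Pred(Re):           # matches one character satisfying `test`
    test: Callable[[str], bool]

@dataclass(frozen=True)
class Cat(Re):            # concatenation
    l: Re
    r: Re

@dataclass(frozen=True)
class Or(Re):             # union
    l: Re
    r: Re

@dataclass(frozen=True)
class And(Re):            # intersection
    l: Re
    r: Re

@dataclass(frozen=True)
class Not(Re):            # complement
    r: Re

@dataclass(frozen=True)
class Star(Re):           # Kleene star
    r: Re


def nullable(r: Re) -> bool:
    """True iff r accepts the empty string."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Pred)):
        return False
    if isinstance(r, (Cat, And)):
        return nullable(r.l) and nullable(r.r)
    if isinstance(r, Or):
        return nullable(r.l) or nullable(r.r)
    if isinstance(r, Not):
        return not nullable(r.r)
    raise TypeError(f"unknown regex node: {r!r}")


def deriv(r: Re, c: str) -> Re:
    """Derivative of r with respect to the concrete character c."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Pred):
        return Eps() if r.test(c) else Empty()
    if isinstance(r, Cat):
        head = Cat(deriv(r.l, c), r.r)
        return Or(head, deriv(r.r, c)) if nullable(r.l) else head
    if isinstance(r, Or):
        return Or(deriv(r.l, c), deriv(r.r, c))
    if isinstance(r, And):      # derivatives distribute over intersection
        return And(deriv(r.l, c), deriv(r.r, c))
    if isinstance(r, Not):      # and commute with complement
        return Not(deriv(r.r, c))
    if isinstance(r, Star):
        return Cat(deriv(r.r, c), r)
    raise TypeError(f"unknown regex node: {r!r}")


def matches(r: Re, s: str) -> bool:
    """Membership test: take one derivative per character, then check nullability."""
    for c in s:
        r = deriv(r, c)
    return nullable(r)


def contains(p: Callable[[str], bool]) -> Re:
    """Regex for 'some character satisfies p', i.e. .* p .*"""
    any_star = Star(Pred(lambda _: True))
    return Cat(any_star, Cat(Pred(p), any_star))


# A Boolean combination: has a digit AND has a lowercase letter AND NOT has a space.
policy = And(contains(str.isdigit),
             And(contains(str.islower), Not(contains(str.isspace))))

print(matches(policy, "abc1"))   # True
print(matches(policy, "ab 1"))   # False: contains a space
print(matches(policy, "ABC1"))   # False: no lowercase letter
```

Because intersection and complement are handled directly by the derivative rules, the Boolean combination in `policy` is checked without first building a product or complemented automaton. Without simplification the intermediate expressions grow with the input, so this sketch is suitable only for short strings.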
Supplemental Material
New regular expression SMT benchmarks for the paper, provided as a separate archive from the artifact. Also available at https://github.com/cdstanford/regex-smt-benchmarks
Index Terms
- Symbolic Boolean derivatives for efficiently solving extended regular expression constraints