Abstract
Matching a regular expression (regex) against a text arises in many applications, such as entity matching, protein sequence matching, and shell commands. Classical methods for regex matching usually adopt a finite automaton, which has a high matching cost. Recent methods solve the regex matching problem with the positional q-gram inverted index, one of the most widely used index schemes, from which all matching results can be obtained directly. The efficiency of these methods depends critically on the query plan tree, which is built from the query using heuristic rules. However, these methods can become inefficient when an improper rule is used to build the query plan tree. To remedy this issue, this paper aims to build a good query plan tree with an efficiency guarantee. We propose a novel method that builds an optimal query plan tree with the minimal expected matching cost for the index-based regex matching method. Although computing an optimal query plan tree is NP-hard even under strong assumptions, we propose a pseudo-polynomial time algorithm to build one. Finally, extensive experiments on real-world data sets show that our method outperforms state-of-the-art methods.
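To make the indexing idea concrete, the following is a minimal sketch of a positional q-gram inverted index, written under stated assumptions rather than taken from the paper. The names `build_index` and `find_literal` are hypothetical, q is fixed at 2, and the lookup handles only a single literal substring. It does not implement full regex matching, the query plan tree, or the optimization algorithm described above.

```python
from collections import defaultdict


def build_index(text, q=2):
    """Map each q-gram of `text` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def find_literal(index, literal, q=2):
    """Return the start positions of `literal`, using only the index.

    The q-gram at offset k of the literal must start at position p + k
    in the text, so each posting list is shifted back by k and the
    shifted lists are intersected.
    """
    if len(literal) < q:
        raise ValueError("literal must contain at least q characters")
    candidates = None
    for k in range(len(literal) - q + 1):
        postings = index.get(literal[k:k + q], ())
        shifted = {p - k for p in postings}
        candidates = shifted if candidates is None else candidates & shifted
        if not candidates:
            return []  # some q-gram is missing or misaligned: no match
    return sorted(candidates)


index = build_index("abracadabra")
print(find_literal(index, "abra"))
print(find_literal(index, "cad"))
```

The order in which posting lists are intersected does not change the result, but it can change how much work is done. That is the kind of choice a query plan makes in an index-based matcher.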
This work is partly supported by the National Natural Science Foundation of China (Nos. 62002245, U22A2025, 62072088, 62232007, 61802268), Ten Thousand Talent Program (No. ZX20200035), Liaoning Distinguished Professor (No. XLYC1902057), and the Natural Science Foundation of Liaoning Province (Nos. 2022-BS-218, 2022-MS-303, 2022-MS-302).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Qiu, T., Yang, X., Wang, B., Zong, C., Zhu, R., Xia, X. (2023). Efficient Index-Based Regular Expression Matching with Optimal Query Plan Tree. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13943. Springer, Cham. https://doi.org/10.1007/978-3-031-30637-2_3
DOI: https://doi.org/10.1007/978-3-031-30637-2_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30636-5
Online ISBN: 978-3-031-30637-2
eBook Packages: Computer Science, Computer Science (R0)