Efficient Regular Expression Matching on Compressed Strings

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10178)

Abstract

Existing methods for regular expression matching on LZ78 compressed strings are inefficient. Moreover, LZ78 compression has shortcomings of its own, such as a poorer compression ratio and slower decompression than LZ77 (a related Lempel-Ziv scheme). In this paper, we study regular expression matching on LZ77 compressed strings. To address this problem, we propose an efficient algorithm, RELZ, which uses the positive factors of the regular expression (a prefix and a suffix) and its negative factors (substrings that cannot appear in any answer) to prune the candidates. To locate both kinds of factors quickly on the compressed string without decompressing it, we design a variant of the suffix trie index, called SSLZ. In addition, we construct bitmaps for the factors of the regular expression to detect potential matching regions, and we propose block filtering to reduce the number of candidates. Finally, we conduct a comprehensive performance evaluation on five real datasets to validate our ideas and the proposed algorithms. The experimental results show that our RELZ algorithm significantly outperforms the existing algorithms.
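The factor-based pruning described in the abstract can be illustrated on an ordinary, uncompressed string. The following is only a minimal sketch under stated assumptions, not the RELZ algorithm itself: it does not model LZ77 compression, the SSLZ index, the bitmaps, or block filtering. The function names `find_all` and `match_with_factors` are hypothetical, and the prefix, suffix, and negative factors are supplied by hand rather than derived from the regular expression.

```python
import re


def find_all(text, factor):
    """Yield every start position of `factor` in `text`, in ascending order."""
    pos = text.find(factor)
    while pos != -1:
        yield pos
        pos = text.find(factor, pos + 1)


def match_with_factors(text, pattern, prefix, suffix, negatives):
    """Return (start, end) spans of `text` that fully match `pattern`.

    Assumption: every answer starts with `prefix`, ends with `suffix`,
    and contains none of the strings in `negatives`. These factors are
    given by the caller; this sketch does not compute them.
    """
    regex = re.compile(pattern)
    min_len = max(len(prefix), len(suffix))
    suffix_ends = [p + len(suffix) for p in find_all(text, suffix)]
    results = []
    for start in find_all(text, prefix):
        for end in suffix_ends:
            if end - start < min_len:
                continue  # too short to hold both the prefix and the suffix
            candidate = text[start:end]
            if any(neg in candidate for neg in negatives):
                # Every longer candidate from this start contains this one,
                # so it also contains the negative factor: stop extending.
                break
            if regex.fullmatch(candidate):
                results.append((start, end))
    return results


text = "abcdef--abxef--abef"
spans = match_with_factors(text, r"ab[cd]*ef", "ab", "ef", ["x", "-"])
print(spans)  # [(0, 6), (15, 19)]
```

Only candidates delimited by a prefix occurrence and a suffix occurrence are verified against the full expression, and a negative factor cuts off all longer candidates from the same start position.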

This work is partially supported by the NSF of China for Outstanding Young Scholars under grant No. 61322208, the NSF of China under grant Nos. 61272178 and 61572122, and the NSF of China for Key Program under grant No. 61532021.


Notes

  1. \(select_{b}(B, i)\) is the position of the \(i\)-th occurrence of bit \(b\) in \(B\).

  2. http://www.dcc.uchile.cl/~gnavarro/pubcode/

  3. https://github.com/google/re2/
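The select operation defined in note 1 can be sketched as a simple linear scan. This is an illustrative version only: the function name `select`, the 1-based positions, and the return value of -1 when fewer than i matching bits exist are assumptions, and a practical index would answer the query from a precomputed structure rather than by scanning.

```python
def select(bits, b, i):
    """Return the 1-based position of the i-th occurrence of bit b in bits.

    Returns -1 if bits contains fewer than i occurrences of b
    (an assumed convention for this sketch).
    """
    count = 0
    for pos, bit in enumerate(bits, start=1):
        if bit == b:
            count += 1
            if count == i:
                return pos
    return -1


B = [1, 0, 0, 1, 1, 0]
print(select(B, 1, 2))  # 4: the second 1-bit is at position 4
print(select(B, 0, 3))  # 6: the third 0-bit is at position 6
```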


Author information

Corresponding author

Correspondence to Bin Wang.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Han, Y., Wang, B., Yang, X., Zhu, H. (2017). Efficient Regular Expression Matching on Compressed Strings. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10178. Springer, Cham. https://doi.org/10.1007/978-3-319-55699-4_14

  • DOI: https://doi.org/10.1007/978-3-319-55699-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-55698-7

  • Online ISBN: 978-3-319-55699-4

  • eBook Packages: Computer Science (R0)
