
Practical Variable Length Gap Pattern Matching

  • Conference paper
Experimental Algorithms (SEA 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9685)

Abstract

Efficiently reporting all occurrences of patterns containing variable length gaps in an input text T is important for applications in a broad range of domains, such as bioinformatics and natural language processing. In this paper we present an efficient solution for static inputs which utilizes the wavelet tree of the suffix array. The algorithm partially traverses the wavelet tree to find matches and can be easily adapted to several variants of the problem. We explore the practical properties of our solution in an experimental study in which we compare it to online and semi-indexed solutions using standard datasets. The experiments show that our approach is the best choice for searching patterns with many gaps in large texts.
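
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the wavelet-tree algorithm of the paper, only a reference baseline. It assumes a pattern is given as a list of subpatterns plus one (minimum, maximum) gap bound between each pair of consecutive subpatterns, and it reports every match as a tuple of start positions. The function names `occurrences` and `vlg_matches` and the input representation are illustrative choices, not taken from the paper.

```python
from bisect import bisect_left, bisect_right


def occurrences(text, sub):
    """Sorted start positions of every (possibly overlapping) occurrence of sub."""
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        start = text.find(sub, start + 1)
    return positions


def vlg_matches(text, subpatterns, gaps):
    """Return all tuples (i_0, ..., i_{k-1}) where subpatterns[j] occurs at i_j
    and the number of characters between subpatterns[j-1] and subpatterns[j]
    lies within the inclusive bounds gaps[j-1] = (min_gap, max_gap)."""
    assert len(gaps) == len(subpatterns) - 1
    occ = [occurrences(text, p) for p in subpatterns]
    results = []

    def extend(prefix, j):
        if j == len(subpatterns):
            results.append(tuple(prefix))
            return
        min_gap, max_gap = gaps[j - 1]
        end_prev = prefix[-1] + len(subpatterns[j - 1])
        # Binary search for the admissible window of start positions.
        lo = bisect_left(occ[j], end_prev + min_gap)
        hi = bisect_right(occ[j], end_prev + max_gap)
        for pos in occ[j][lo:hi]:
            extend(prefix + [pos], j + 1)

    for first in occ[0]:
        extend([first], 1)
    return results


if __name__ == "__main__":
    # "ab", then 0 to 2 arbitrary characters, then "cd".
    print(vlg_matches("abxxcdabcd", ["ab", "cd"], [(0, 2)]))
```

This baseline scans the whole text for every subpattern on each query, which is the cost an index over a static text is meant to avoid.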

Notes

  1. I.e., any two match tuples \({\langle }i_0\ldots i_{k-1}{\rangle }\) and \({\langle }i'_0\ldots i'_{k-1}{\rangle }\) spanning the intervals \([i_0,i_{k-1}+m_{k-1}-1]\) and \([i'_0,i'_{k-1}+m_{k-1}-1]\) do not overlap.

Acknowledgement

We are grateful to Timo Bingmann for profiling our initial implementation. This work was supported under the Australian Research Council’s Discovery Projects scheme (project DP140103256) and Deutsche Forschungsgemeinschaft.

Corresponding author

Correspondence to Simon Gog.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Bader, J., Gog, S., Petri, M. (2016). Practical Variable Length Gap Pattern Matching. In: Goldberg, A., Kulikov, A. (eds) Experimental Algorithms. SEA 2016. Lecture Notes in Computer Science, vol 9685. Springer, Cham. https://doi.org/10.1007/978-3-319-38851-9_1

  • DOI: https://doi.org/10.1007/978-3-319-38851-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-38850-2

  • Online ISBN: 978-3-319-38851-9

  • eBook Packages: Computer Science (R0)
