LZ77-Based Self-indexing with Faster Pattern Matching

  • Conference paper
LATIN 2014: Theoretical Informatics (LATIN 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8392)

Abstract

To store and search genomic databases efficiently, researchers have recently started building self-indexes based on LZ77. As the name suggests, a self-index for a string supports both random access and pattern matching queries. In this paper we show how, given a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in \(\mathcal{O}(z \log (n / z))\) space such that later, first, given a position i and a length ℓ, we can extract S[i..i + ℓ − 1] in \(\mathcal{O}(\ell + \log n)\) time; second, given a pattern P[1..m], we can list the occ occurrences of P in S in \(\mathcal{O}(m \log m + occ \log \log n)\) time.
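As a rough illustration of the objects named in the abstract, the following Python sketch computes a greedy LZ77-style parse and extracts a substring by following phrase sources. It is not the data structure of the paper and does not achieve the stated bounds: the parse is computed naively, extraction may follow long chains of sources, and the phrase definition (longest earlier-starting match, or a single literal character) is one common variant chosen here as an assumption. The names `lz77_parse`, `char_at`, and `extract` are hypothetical, and positions are 0-based, unlike the 1-based S[1..n] of the abstract.

```python
from bisect import bisect_right


def lz77_parse(s):
    """Greedy LZ77-style parse (assumed variant, naive quadratic-or-worse time).

    Returns a list of phrases (start, src, length, literal):
      - literal phrase: src is None, length is 1, literal is the character;
      - copy phrase: literal is None and s[start:start+length] equals the
        text beginning at the earlier position src (overlap is allowed).
    """
    phrases = []
    n = len(s)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # candidate sources start strictly before i
            l = 0
            while i + l < n and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        if best_len == 0:
            phrases.append((i, None, 1, s[i]))
            i += 1
        else:
            phrases.append((i, best_src, best_len, None))
            i += best_len
    return phrases


def char_at(phrases, starts, pos):
    """Recover the character at position pos using only the parse."""
    while True:
        k = bisect_right(starts, pos) - 1  # phrase containing pos
        start, src, _length, literal = phrases[k]
        if src is None:
            return literal
        # Jump to the corresponding position in the source; since
        # src < start, pos strictly decreases and the loop terminates.
        pos = src + (pos - start)


def extract(phrases, i, length):
    """Return the substring of the parsed string at [i, i + length)."""
    starts = [p[0] for p in phrases]
    return "".join(char_at(phrases, starts, p) for p in range(i, i + length))


if __name__ == "__main__":
    text = "abracadabra"
    parse = lz77_parse(text)
    print(len(parse), "phrases")
    print(extract(parse, 7, 4))
```

The number of phrases returned by `lz77_parse` plays the role of z in the space bound; on highly repetitive inputs such as collections of similar genomes, z is typically much smaller than n, which is what makes an index of size \(\mathcal{O}(z \log (n / z))\) attractive.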

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J. (2014). LZ77-Based Self-indexing with Faster Pattern Matching. In: Pardo, A., Viola, A. (eds) LATIN 2014: Theoretical Informatics. LATIN 2014. Lecture Notes in Computer Science, vol 8392. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54423-1_63

  • DOI: https://doi.org/10.1007/978-3-642-54423-1_63

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54422-4

  • Online ISBN: 978-3-642-54423-1

  • eBook Packages: Computer Science, Computer Science (R0)
