Faster Approximate Pattern Matching in Compressed Repetitive Texts

  • Conference paper
Algorithms and Computation (ISAAC 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7074)

Abstract

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching on the underlying string(s).

Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an \(\mathcal{O}(r)\)-word data structure that allows us to extract any substring s[i..j] in \(\mathcal{O}(\log n + j - i)\) time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in \(\mathcal{O}(r (\min(m k, k^4 + m) + \log n) + \mathsf{occ})\) time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with \(\mathcal{O}(z \log n)\) rules. In this paper we give a simple \(\mathcal{O}(z \log n)\)-word data structure that takes the same time for substring extraction but only \(\mathcal{O}(z \min(m k, k^4 + m) + \mathsf{occ})\) time for approximate pattern matching.
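
To make the objects in the abstract concrete, the following is a minimal sketch, not the data structure of this paper or of Bille et al. It assumes a toy straight-line program stored as a dictionary (the rule names and the example grammar are hypothetical), extracts s[i..j] by a plain top-down descent, and finds approximate matches with the classical uncompressed dynamic program (Sellers' algorithm), which runs in time proportional to n·m rather than in the compressed bounds stated above.

```python
from functools import lru_cache

# A toy straight-line program (SLP): each rule is either a single
# character (terminal) or a pair of earlier rule names (concatenation).
# The rule names and this example grammar are hypothetical.
RULES = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),    # expands to "ab"
    "X2": ("X1", "X1"),  # expands to "abab"
    "X3": ("X2", "X2"),  # expands to "abababab"
    "S": ("X3", "X1"),   # expands to "ababababab"
}


@lru_cache(maxsize=None)
def length(rule):
    """Length of the string that a rule expands to."""
    body = RULES[rule]
    if isinstance(body, str):
        return 1
    left, right = body
    return length(left) + length(right)


def extract(rule, i, j):
    """Characters i..j (0-based, inclusive) of the rule's expansion,
    descending only into subtrees that overlap [i, j]."""
    body = RULES[rule]
    if isinstance(body, str):
        return body
    left, right = body
    mid = length(left)
    if j < mid:
        return extract(left, i, j)
    if i >= mid:
        return extract(right, i - mid, j - mid)
    return extract(left, i, mid - 1) + extract(right, 0, j - mid)


def approximate_matches(text, pattern, k):
    """End positions e (0-based) such that some substring of text ending
    at e is within edit distance k of pattern (uncompressed baseline)."""
    m = len(pattern)
    col = list(range(m + 1))  # DP column for the empty text prefix
    ends = []
    for e, c in enumerate(text):
        prev_diag, col[0] = col[0], 0  # a match may start at any position
        for r in range(1, m + 1):
            cur = min(
                col[r] + 1,                         # extra text character
                col[r - 1] + 1,                     # unmatched pattern character
                prev_diag + (pattern[r - 1] != c),  # match or substitution
            )
            prev_diag, col[r] = col[r], cur
        if col[m] <= k:
            ends.append(e)
    return ends


if __name__ == "__main__":
    s = extract("S", 0, length("S") - 1)
    print(s, extract("S", 3, 7), approximate_matches(s, "aba", 0))
```

Here the six rules describe a string of length 10; on highly repetitive inputs the gap between the number of rules and n is what the compressed bounds exploit. The matching routine works on the fully expanded string, so it only illustrates the problem definition.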

References

  1. Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel-Ziv based compressed text indexing. Algorithmica (to appear)

  2. Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings. In: Proceedings of the 22nd Symposium on Discrete Algorithms, SODA (2011)

  3. Cole, R., Hariharan, R.: Approximate string matching: A simpler faster algorithm. SIAM Journal on Computing 31(6), 1761–1782 (2002)

  4. Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science 372(1), 115–121 (2007)

  5. Ferragina, P., Manzini, G.: On compressing the textual web. In: WSDM 2010: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 391–400. ACM, New York (2010)

  6. Gagie, T., Gawrychowski, P.: Grammar-Based Compression in a Streaming Model. In: Dediu, A.-H., Fernau, H., Martín-Vide, C. (eds.) LATA 2010. LNCS, vol. 6031, pp. 273–284. Springer, Heidelberg (2010)

  7. Genome 10K Community of Scientists: A proposal to obtain whole-genome sequence for 10,000 vertebrate species. Journal of Heredity 100, 659–674 (2009)

  8. González, R., Navarro, G.: Compressed Text Indexes with Fast Locate. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 216–227. Springer, Heidelberg (2007)

  9. Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of the Data Compression Conference, DCC (2010)

  10. Kreft, S., Navarro, G.: Self-Indexing Based on LZ77. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 41–54. Springer, Heidelberg (2011)

  11. Landau, G.M., Vishkin, U.: Fast parallel and serial approximate string matching. Journal of Algorithms 10(2), 157–169 (1989)

  12. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology 17(3), 281–308 (2010)

  13. Manzini, G.: An analysis of the Burrows-Wheeler transform. Journal of the ACM 48(3), 407–430 (2001)

  14. Durbin, R., et al.: 1000 genomes (2010), http://www.1000genomes.org/

  15. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science 302(1-3), 211–222 (2003)

  16. Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-Length Compressed Indexes are Superior for Highly Repetitive Sequence Collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008)

  17. Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM 29(4), 928–951 (1982)

  18. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977)

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gagie, T., Gawrychowski, P., Puglisi, S.J. (2011). Faster Approximate Pattern Matching in Compressed Repetitive Texts. In: Asano, T., Nakano, Si., Okamoto, Y., Watanabe, O. (eds) Algorithms and Computation. ISAAC 2011. Lecture Notes in Computer Science, vol 7074. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25591-5_67

  • DOI: https://doi.org/10.1007/978-3-642-25591-5_67

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25590-8

  • Online ISBN: 978-3-642-25591-5

  • eBook Packages: Computer Science, Computer Science (R0)
