Faster Approximate Pattern Matching in Compressed Repetitive Texts

Gagie, Travis; Gawrychowski, Paweł; Puglisi, Simon J.

doi:10.1007/978-3-642-25591-5_67

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7074)

Included in the following conference series: International Symposium on Algorithms and Computation

Abstract

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching on the underlying string(s).

Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an $\mathcal{O}(r)$-word data structure that allows us to extract any substring s[i..j] in $\mathcal{O}(\log n + j - i)$ time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in $\mathcal{O}(r (\min(m k, k^4 + m) + \log n) + \mathsf{occ})$ time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with $\mathcal{O}(z \log n)$ rules. In this paper we give a simple $\mathcal{O}(z \log n)$-word data structure that takes the same time for substring extraction but only $\mathcal{O}(z \min(m k, k^4 + m) + \mathsf{occ})$ time for approximate pattern matching.
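To make the two quantities in these bounds concrete, the following is a minimal sketch that computes them on an uncompressed string. It is not the data structure of the paper or of Bille et al.; it only illustrates what z and occ count. The function names `lz77_parse` and `approx_match_ends` are hypothetical, the LZ77 variant (longest earlier-starting match, overlap allowed, or a single fresh letter) is an assumption since several variants exist, and the matcher is the textbook column-by-column dynamic program, which counts matches by their end positions.

```python
def lz77_parse(s):
    """Greedy LZ77 factorisation (assumed variant: each phrase is the longest
    prefix of the remaining text that also starts at an earlier position,
    overlap allowed, or a single fresh letter). Naive, for illustration only."""
    phrases = []
    i, n = 0, len(s)
    while i < n:
        longest = 0
        for j in range(i):  # every earlier starting position
            length = 0
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            longest = max(longest, length)
        step = max(longest, 1)  # a letter not seen before is its own phrase
        phrases.append(s[i:i + step])
        i += step
    return phrases


def approx_match_ends(s, p, k):
    """End positions j (0-based) such that some substring of s ending at j
    is within edit distance k of p. Runs in O(len(s) * len(p)) time."""
    m = len(p)
    col = list(range(m + 1))  # distances against the empty text prefix
    ends = []
    for j, c in enumerate(s):
        prev_diag = col[0]
        col[0] = 0  # a match may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (p[i - 1] != c),  # match or substitute
                         old + 1,                      # skip text letter c
                         col[i - 1] + 1)               # skip pattern letter
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(lz77_parse("abababab"))                     # ['a', 'b', 'ababab']
    print(len(lz77_parse("abracadabra")))             # 8
    print(approx_match_ends("abracadabra", "abra", 0))  # [3, 10]
```

On a highly repetitive input such as `"abababab"` the parse has only z = 3 phrases although n = 8, which is the regime where a bound in terms of z rather than n pays off.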

References

Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel-Ziv based compressed text indexing. Algorithmica (to appear)
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings. In: Proceedings of the 22nd Symposium on Discrete Algorithms, SODA (2011)
Cole, R., Hariharan, R.: Approximate string matching: A simpler faster algorithm. SIAM Journal on Computing 31(6), 1761–1782 (2002)
Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science 372(1), 115–121 (2007)
Ferragina, P., Manzini, G.: On compressing the textual web. In: WSDM 2010: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 391–400. ACM, New York (2010)
Gagie, T., Gawrychowski, P.: Grammar-Based Compression in a Streaming Model. In: Dediu, A.-H., Fernau, H., Martín-Vide, C. (eds.) LATA 2010. LNCS, vol. 6031, pp. 273–284. Springer, Heidelberg (2010)
Genome 10K Community of Scientists: A proposal to obtain whole-genome sequence for 10,000 vertebrate species. Journal of Heredity 100, 659–674 (2009)
González, R., Navarro, G.: Compressed Text Indexes with Fast Locate. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 216–227. Springer, Heidelberg (2007)
Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of the Data Compression Conference, DCC (2010)
Kreft, S., Navarro, G.: Self-Indexing Based on LZ77. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 41–54. Springer, Heidelberg (2011)
Landau, G.M., Vishkin, U.: Fast parallel and serial approximate string matching. Journal of Algorithms 10(2), 157–169 (1989)
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology 17(3), 281–308 (2010)
Manzini, G.: An analysis of the Burrows-Wheeler transform. Journal of the ACM 48(3), 407–430 (2001)
Durbin, R., et al.: 1000 genomes (2010), http://www.1000genomes.org/
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science 302(1-3), 211–222 (2003)
Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-Length Compressed Indexes are Superior for Highly Repetitive Sequence Collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008)
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM 29(4), 928–951 (1982)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977)

Author information

Authors and Affiliations

Department of Computer Science, Aalto University, Espoo, Finland
Travis Gagie
Department of Computer Science, University of Wrocław, Wrocław, Poland
Paweł Gawrychowski
Department of Informatics, King’s College London, London, United Kingdom
Simon J. Puglisi

Editor information

Editors and Affiliations

Chuo University, Kasuga, Bunkyo-ku, 112-8551, Tokyo, Japan
Takao Asano
Gunma University, 1-5-1 Tenjin-Cho, 376-8515, Kiryu-Shi, Japan
Shin-ichi Nakano
Japan Advanced Institute of Science and Technology, 1-1 Asahidai, 923-1292, Nomi, Ishikawa, Japan
Yoshio Okamoto
Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, 152-8552, Tokyo, Japan
Osamu Watanabe

About this paper

Cite this paper

Gagie, T., Gawrychowski, P., Puglisi, S.J. (2011). Faster Approximate Pattern Matching in Compressed Repetitive Texts. In: Asano, T., Nakano, Si., Okamoto, Y., Watanabe, O. (eds) Algorithms and Computation. ISAAC 2011. Lecture Notes in Computer Science, vol 7074. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25591-5_67

DOI: https://doi.org/10.1007/978-3-642-25591-5_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25590-8
Online ISBN: 978-3-642-25591-5
eBook Packages: Computer Science, Computer Science (R0)
