Abstract
Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching on the underlying string(s).
Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an \(\mathcal{O}(r)\)-word data structure that allows us to extract any substring s[i..j] in \(\mathcal{O}(\log n + j - i)\) time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in \(\mathcal{O}(r (\min(m k, k^4 + m) + \log n) + \mathsf{occ})\) time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with \(\mathcal{O}(z \log n)\) rules. In this paper we give a simple \(\mathcal{O}(z \log n)\)-word data structure that takes the same time for substring extraction but only \(\mathcal{O}(z \min(m k, k^4 + m) + \mathsf{occ})\) time for approximate pattern matching.
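To make the notion of substring extraction from a straight-line program concrete, the following is a minimal sketch and not the data structure of Bille et al. or of this paper. It assumes a hypothetical representation in which each rule is either a single character or a pair of indices of earlier rules, and the string s is the expansion of the last rule. Storing the expansion length of every rule lets us descend directly to s[i..j] without expanding the rest of s. This naive descent takes time proportional to the height of the grammar plus j - i, whereas the structure cited above achieves \(\mathcal{O}(\log n + j - i)\) regardless of the grammar's height.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: a rule is a terminal character, or a pair
# (left, right) of indices of rules that appear earlier in the list.
Rule = Union[str, Tuple[int, int]]


def expansion_lengths(rules: List[Rule]) -> List[int]:
    """Length of the expansion of every rule, computed bottom-up."""
    lens: List[int] = []
    for rule in rules:
        if isinstance(rule, str):
            lens.append(1)
        else:
            left, right = rule
            lens.append(lens[left] + lens[right])
    return lens


def extract(rules: List[Rule], lens: List[int], i: int, j: int) -> str:
    """Return s[i..j] (0-based, inclusive), where s expands the last rule."""
    if not 0 <= i <= j < lens[-1]:
        raise IndexError("substring bounds out of range")
    pieces: List[str] = []
    # Each stack entry (x, lo, hi) means: emit characters lo..hi of rule x.
    stack = [(len(rules) - 1, i, j)]
    while stack:
        x, lo, hi = stack.pop()
        rule = rules[x]
        if isinstance(rule, str):
            pieces.append(rule)
            continue
        left, right = rule
        split = lens[left]  # the right child's expansion starts here
        # Push the right part first so the left part is emitted first.
        if hi >= split:
            stack.append((right, max(lo - split, 0), hi - split))
        if lo < split:
            stack.append((left, lo, min(hi, split - 1)))
    return "".join(pieces)


# Six rules generating the string "abaababa" of length 8.
rules: List[Rule] = ["a", "b", (0, 1), (2, 0), (3, 2), (4, 3)]
lens = expansion_lengths(rules)
substring = extract(rules, lens, 2, 5)
```

Here r = 6 rules describe a string of length n = 8; for highly repetitive inputs the gap between r and n can be far larger, which is what makes extraction without full decompression worthwhile.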
References
Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel-Ziv based compressed text indexing. Algorithmica (to appear)
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings. In: Proceedings of the 22nd Symposium on Discrete Algorithms, SODA (2011)
Cole, R., Hariharan, R.: Approximate string matching: A simpler faster algorithm. SIAM Journal on Computing 31(6), 1761–1782 (2002)
Ferragina, P., Venturini, R.: A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science 372(1), 115–121 (2007)
Ferragina, P., Manzini, G.: On compressing the textual web. In: WSDM 2010: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 391–400. ACM, New York (2010)
Gagie, T., Gawrychowski, P.: Grammar-Based Compression in a Streaming Model. In: Dediu, A.-H., Fernau, H., Martín-Vide, C. (eds.) LATA 2010. LNCS, vol. 6031, pp. 273–284. Springer, Heidelberg (2010)
Genome 10K Community of Scientists: A proposal to obtain whole-genome sequence for 10,000 vertebrate species. Journal of Heredity 100, 659–674 (2009)
González, R., Navarro, G.: Compressed Text Indexes with Fast Locate. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 216–227. Springer, Heidelberg (2007)
Kreft, S., Navarro, G.: LZ77-like compression with fast random access. In: Proceedings of the Data Compression Conference, DCC (2010)
Kreft, S., Navarro, G.: Self-Indexing Based on LZ77. In: Giancarlo, R., Manzini, G. (eds.) CPM 2011. LNCS, vol. 6661, pp. 41–54. Springer, Heidelberg (2011)
Landau, G.M., Vishkin, U.: Fast parallel and serial approximate string matching. Journal of Algorithms 10(2), 157–169 (1989)
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology 17(3), 281–308 (2010)
Manzini, G.: An analysis of the Burrows-Wheeler transform. Journal of the ACM 48(3), 407–430 (2001)
Durbin, R., et al.: 1000 genomes (2010), http://www.1000genomes.org/
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science 302(1-3), 211–222 (2003)
Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-Length Compressed Indexes are Superior for Highly Repetitive Sequence Collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008)
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM 29(4), 928–951 (1982)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Gagie, T., Gawrychowski, P., Puglisi, S.J. (2011). Faster Approximate Pattern Matching in Compressed Repetitive Texts. In: Asano, T., Nakano, Si., Okamoto, Y., Watanabe, O. (eds) Algorithms and Computation. ISAAC 2011. Lecture Notes in Computer Science, vol 7074. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25591-5_67
DOI: https://doi.org/10.1007/978-3-642-25591-5_67
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25590-8
Online ISBN: 978-3-642-25591-5
eBook Packages: Computer Science (R0)