Approximate String Matching with Lempel-Ziv Compressed Indexes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4726)

Abstract

A compressed full-text self-index for a text T is a data structure that requires reduced space and can search for patterns P in T. Furthermore, the structure can reproduce any substring of T, so it effectively replaces T. Despite the explosion of interest in self-indexes in recent years, there has been little progress on search functionality beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest in, for example, computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights into this type of index, which may be of independent interest, and then apply them to the Lempel-Ziv index. We show experimentally that our algorithm performs competitively and provides a useful space-time tradeoff compared to classical indexes.
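
For readers new to the two ingredients named in the abstract, the following Python sketch may help. It is not the algorithm of the paper: it only illustrates, under stated assumptions, (a) an LZ78-style parsing of a text into phrases, which is the kind of decomposition a Lempel-Ziv index is typically built on, and (b) the ASM problem itself, solved online with the classical dynamic-programming recurrence rather than with any index. The function names `lz78_parse` and `approx_occurrences` are hypothetical, and the choice of LZ78 and of edit distance as the error model are assumptions.

```python
def lz78_parse(text):
    """Split text into LZ78-style phrases.

    Each phrase is the longest previously seen phrase plus one new character
    (assumption: plain LZ78; the paper's index may use a different variant).
    """
    trie = {}       # (parent phrase id, char) -> phrase id; id 0 is the empty phrase
    phrases = []
    node = 0        # phrase id matched so far
    start = 0       # start of the phrase being built
    for i, ch in enumerate(text):
        key = (node, ch)
        if key in trie:
            node = trie[key]                  # keep extending the current phrase
        else:
            trie[key] = len(phrases) + 1      # register a new phrase
            phrases.append(text[start:i + 1])
            node = 0
            start = i + 1
    if start < len(text):                     # leftover that repeats an earlier phrase
        phrases.append(text[start:])
    return phrases


def approx_occurrences(pattern, text, k):
    """Return 0-based end positions in text where some substring is within
    edit distance k of pattern (online dynamic programming, no index)."""
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any suffix of the text read so far
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        prev_diag = col[0]                    # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,       # match or substitution
                      col[i] + 1,             # extra text character
                      col[i - 1] + 1)         # missing pattern character
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(lz78_parse("abababab"))
    print(approx_occurrences("abc", "xxabdxx", 1))
```

The online scan costs time proportional to the product of the pattern and text lengths; an indexed approach such as the one the abstract describes aims to avoid scanning the whole text, which this sketch does not attempt.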


Editor information

Nivio Ziviani, Ricardo Baeza-Yates

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Russo, L.M.S., Navarro, G., Oliveira, A.L. (2007). Approximate String Matching with Lempel-Ziv Compressed Indexes. In: Ziviani, N., Baeza-Yates, R. (eds) String Processing and Information Retrieval. SPIRE 2007. Lecture Notes in Computer Science, vol 4726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75530-2_24

  • DOI: https://doi.org/10.1007/978-3-540-75530-2_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75529-6

  • Online ISBN: 978-3-540-75530-2

  • eBook Packages: Computer Science, Computer Science (R0)
