A New Indexing Method for Approximate String Matching

  • Conference paper
Combinatorial Pattern Matching (CPM 1999)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1645))

Abstract

We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the retrieval time is \( O(n^{\lambda}) \), for \( 0 < \lambda < 1 \), whenever \( \alpha < 1 - e/\sqrt{\sigma} \), where α is the error level tolerated and σ is the alphabet size. We show experimentally that this index outperforms by far all other algorithms for indexed approximate searching; these are also the first experiments to compare the different existing schemes. Finally, we show how this index can be implemented using much less space.
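
To give a feel for the condition on the error level, the following minimal sketch evaluates the bound, taken here as \( 1 - e/\sqrt{\sigma} \), for a few alphabet sizes. The function name `max_error_level` and the sample alphabet sizes are illustrative assumptions, not part of the paper; the sketch also assumes that α denotes the fraction of the pattern length that may be in error.

```python
import math


def max_error_level(sigma: int) -> float:
    """Return 1 - e/sqrt(sigma), the bound on the error level alpha.

    Hypothetical helper for illustration; sigma is the alphabet size.
    A non-positive result means no positive alpha satisfies the condition.
    """
    if sigma <= 0:
        raise ValueError("alphabet size must be positive")
    return 1.0 - math.e / math.sqrt(sigma)


if __name__ == "__main__":
    # Sample alphabet sizes chosen for illustration only.
    for sigma in (4, 26, 64, 256):
        bound = max_error_level(sigma)
        if bound <= 0:
            note = "condition cannot be met for any positive alpha"
        else:
            note = f"condition holds for alpha < {bound:.3f}"
        print(f"sigma = {sigma:3d}: {note}")
```

Under these assumptions the bound is negative for very small alphabets (for example σ = 4) and approaches 1 as σ grows, so the sublinear retrieval time stated above applies to a wider range of error levels on larger alphabets.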

This work has been supported in part by Fondecyt grant 1-990627 and Fondef grant 96-1064.

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Navarro, G., Baeza-Yates, R. (1999). A New Indexing Method for Approximate String Matching. In: Crochemore, M., Paterson, M. (eds) Combinatorial Pattern Matching. CPM 1999. Lecture Notes in Computer Science, vol 1645. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48452-3_13

  • DOI: https://doi.org/10.1007/3-540-48452-3_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66278-5

  • Online ISBN: 978-3-540-48452-3

  • eBook Packages: Springer Book Archive
