
A faster algorithm for approximate string matching

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1075)

Abstract

We present a new algorithm for on-line approximate string matching. The algorithm simulates a non-deterministic finite automaton that is built from the pattern and takes the text as input. The simulation uses bit operations on a RAM machine with word length O(log n), where n is the maximum size of the text. The running time is O(n) for small patterns (i.e., of length m = O(√(log n))), independently of the maximum number of errors allowed, k. This algorithm is then used to design two general algorithms: one partitions the problem into subproblems, while the other partitions the automaton into sub-automata. These algorithms are combined into a hybrid algorithm which on average is O(n) for moderate k/m ratios, O(√(mk/log n) n) for medium ratios, and O((m−k)kn/log n) for large ratios. We show experimentally that this hybrid algorithm is faster than previous ones for moderate pattern sizes and error ratios, which is the case in text searching.
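To make the idea of simulating the automaton with bit operations concrete, here is a minimal sketch in Python. It is not the simulation described in this paper: it assumes the simpler row-wise packing known from Wu and Manber's "Fast text searching allowing errors", in which each of the k+1 rows of the automaton is kept in its own bit vector. The function name `search_k_errors` and all variable names are illustrative, and Python's unbounded integers stand in for the machine word.

```python
def search_k_errors(pattern, text, k):
    """Return the end positions in `text` where `pattern` occurs with
    at most k errors (insertions, deletions, substitutions).

    Illustrative row-wise bit-parallel NFA simulation; an assumption
    for this sketch, not the packing used by the paper's algorithm.
    """
    m = len(pattern)
    if m == 0:
        return []
    mask = (1 << m) - 1          # keep only the m state bits
    accept = 1 << (m - 1)        # bit of the final state in each row

    # char_mask[c] has bit i set iff pattern[i] == c
    char_mask = {}
    for i, c in enumerate(pattern):
        char_mask[c] = char_mask.get(c, 0) | (1 << i)

    # rows[d] bit i is set iff pattern[:i+1] matches some suffix of the
    # text read so far with at most d errors. Before reading any text,
    # the first d pattern characters can be accounted for by d deletions.
    rows = [((1 << d) - 1) & mask for d in range(k + 1)]

    hits = []
    for j, c in enumerate(text):
        bc = char_mask.get(c, 0)
        prev_old = rows[0]                       # old value of row d-1
        rows[0] = ((rows[0] << 1) | 1) & bc      # exact-match row
        for d in range(1, k + 1):
            old = rows[d]
            rows[d] = (
                (((old << 1) | 1) & bc)   # match the next pattern character
                | prev_old                # extra character in the text
                | (prev_old << 1)         # substitution
                | (rows[d - 1] << 1)      # pattern character skipped
                | 1                       # first state reachable with one error
            ) & mask
            prev_old = old
        if rows[k] & accept:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(search_k_errors("survey", "surgery", 2))
```

Each text character costs O(k) word operations in this sketch as long as m fits in a machine word, so its running time depends on k, unlike the O(n) bound stated above for small patterns. For the sample call, the reported end positions are 4, 5 and 6, since "surge", "surger" and "surgery" are each within two errors of "survey".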

This work has been supported in part by FONDECYT grant 1950622.



Editor information

Dan Hirschberg, Gene Myers

Copyright information

© 1996 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baeza-Yates, R., Navarro, G. (1996). A faster algorithm for approximate string matching. In: Hirschberg, D., Myers, G. (eds) Combinatorial Pattern Matching. CPM 1996. Lecture Notes in Computer Science, vol 1075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-61258-0_1

  • DOI: https://doi.org/10.1007/3-540-61258-0_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-61258-2

  • Online ISBN: 978-3-540-68390-2

  • eBook Packages: Springer Book Archive
