A dictionary matching algorithm fast on the average for terms of varying length

  • Session I
  • Conference paper
Combinatorial Pattern Matching (CPM 1998)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1448))

Abstract

We examine the exact dictionary matching problem with dynamic text and static terms and propose a simple but efficient algorithm with sublinear (in size of text) average performance for a wide range of practical problems. The algorithm is based on the Commentz-Walter-Horspool algorithm (CWH), presented by Baeza-Yates and Régnier [10]. Typically, our refinement will prune out more than 30% of characters scanned by CWH, when searching for all occurrences of tags, which are of varying lengths and members of a set of moderate size, in natural language text. This problem arises frequently in practice in scanning text downloaded from the internet, and accounts for a major portion of the preprocessing time associated with indexing such text for later retrieval. Our approach, which we refer to as layering, keeps track of an upper bound on the maximal length of potential term prefixes ending at each given position in the text. This information is then used to mask out some of the terms and filter out unnecessary character comparisons during the search. A practical implementation is described, which increases the size of the existing data structures as well as the preprocessing cost only by a factor of the size of the longest term in the set.
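
For orientation, the kind of baseline the abstract builds on can be sketched as a Horspool-style search over a set of terms: a trie of the reversed terms is walked backward from each candidate end position, and the text character at that position decides how far to skip ahead. This is only a minimal illustration under stated assumptions. It is not the layering refinement, whose details the abstract does not give, and it is not claimed to reproduce CWH exactly as defined in [10]. The names `build_reverse_trie`, `build_shift_table`, and `find_all` are hypothetical.

```python
def build_reverse_trie(terms):
    """Nested-dict trie over the reversed terms; the key None marks a complete term."""
    root = {}
    for term in terms:
        node = root
        for ch in reversed(term):
            node = node.setdefault(ch, {})
        node[None] = term
    return root


def build_shift_table(terms):
    """Horspool-style shifts for a term set (illustrative assumption, not from the paper).

    For each character, store the smallest distance from any non-final
    occurrence of it in any term to the end of that term. Shifts are capped
    at the length of the shortest term so that short terms are never skipped.
    """
    lmin = min(len(t) for t in terms)
    shift = {}
    for term in terms:
        for j, ch in enumerate(term[:-1]):
            dist = len(term) - 1 - j
            shift[ch] = min(shift.get(ch, lmin), dist)
    return lmin, shift


def find_all(text, terms):
    """Return sorted (start_index, term) pairs for every occurrence of every term."""
    terms = [t for t in terms if t]  # ignore empty terms
    if not terms:
        return []
    root = build_reverse_trie(terms)
    lmin, shift = build_shift_table(terms)
    matches = []
    pos = lmin - 1  # earliest position at which any term can end
    while pos < len(text):
        node, i = root, pos
        # Walk backward through the text along the reversed-term trie.
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            if None in node:
                matches.append((i, node[None]))
            i -= 1
        # Skip ahead based on the character under the current end position.
        pos += shift.get(text[pos], lmin)
    return sorted(matches)
```

For example, `find_all("the <b>bold</b> and <i>it</i>", ["<b>", "</b>", "<i>", "</i>"])` reports the four tags at offsets 4, 11, 20 and 25 while inspecting only some of the text positions. Note that the cap at the shortest term length limits the skips when term lengths vary widely, which is the situation the abstract targets.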

References

  1. A.V. Aho. Algorithms for Finding Patterns in Strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, pages 257–300. Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1990.

  2. A.V. Aho and M. Corasick. Efficient string matching: An aid to bibliographic search. Comm. of the ACM, 18(6):333–340, June 1975.

  3. A. Amir and M. Farach. Adaptive dictionary matching. Proc. 32nd IEEE FOCS, pages 760–766, 1991.

  4. A. Amir, M. Farach, R. Giancarlo, Z. Galil, and K. Park. Dynamic dictionary matching. Journal of Computer and System Sciences, 49(2):208–222, 1994.

  5. A. Amir, M. Farach, R.M. Idury, J.A. La Poutré, and A.A. Schaffer. Improved dynamic dictionary matching. Information and Computation, 119(2):258–282, 1995.

  6. A. Amir, M. Farach, and Y. Matias. Efficient randomized dictionary matching algorithms. Proc. of 3rd Combinatorial Pattern Matching Conference, pages 259–272, 1992. Tucson, Arizona.

  7. A. Apostolico and R. Giancarlo. The Boyer-Moore-Galil string searching strategies revisited. SIAM J. Comput., 15(1):98–105, 1986.

  8. R.S. Boyer and J.S. Moore. A fast string matching algorithm. Comm. of the ACM, 20:762–772, 1977.

  9. R. Baeza-Yates and G.H. Gonnet. On Boyer-Moore automata. Research report, University of Waterloo, 1989.

  10. R. Baeza-Yates and M. Régnier. Fast Algorithms for Two Dimensional and Multiple Pattern Matching (Preliminary version). In Proc. 2nd Scandinavian Workshop on Algorithm Theory (SWAT 90), number 447 in Lecture Notes in Computer Science, pages 332–347. Springer-Verlag, Bergen, Norway, July 1990.

  11. D. Breslauer. Dictionary-Matching on Unbounded Alphabets: Uniform-Length Dictionaries. Proc. of 5th Combinatorial Pattern Matching Conference, pages 184–197, 1994. Asilomar, CA, USA.

  12. V. Bruyère, R. Baeza-Yates, O. Delgrange, and R. Scheihing. On the size of Boyer-Moore automata. Proceedings of Third South American Workshop on String Processing, pages 31–46. Recife, Brazil, August 1996.

  13. M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Fast Practical Multi-Pattern Matching. Technical Report 93–3, Institut Gaspard Monge, Université de Marne la Vallée, Marne la Vallée, France, 1993.

  14. L. Colussi, Z. Galil, and R. Giancarlo. The exact complexity of string matching. 31st Symposium on Foundations of Computer Science I, pages 135–143, IEEE, October 22–24, 1990.

  15. R. Cole. Tight bounds on the complexity of the Boyer-Moore pattern matching algorithm. Technical Report 512, Computer Science Dept., New York University, June 1990.

  16. B. Commentz-Walter. A string matching algorithm fast on the average. Technical Report 79.09.007, IBM Wissenschaftliches Zentrum, Heidelberg, Germany, 1979.

  17. B. Commentz-Walter. A string matching algorithm fast on the average. Proc. 6th International Colloquium on Automata, Languages, and Programming, Lecture Notes in Computer Science, pages 118–132. Springer-Verlag, Berlin, Germany, 1979.

  18. J.J. Fan and K.Y. Su. An efficient algorithm for matching multiple patterns. IEEE Transactions on Knowledge and Data Engineering, 5(2):339–351, April 1993.

  19. Z. Galil. On improving the worst case running time of the Boyer-Moore string matching algorithm. Comm. of the ACM, 22(9):505–508, 1979.

  20. L.J. Guibas and A.M. Odlyzko. A new proof of the linearity of the Boyer-Moore string searching algorithm. SIAM J. Comput., 9:672–682, 1980.

  21. D. Gusfield. Algorithms on Strings, Trees and Sequences. Press Syndicate of the University of Cambridge, pages 157–164, 1997.

  22. A. Hume and D. Sunday. Fast String Searching. Software-Practice and Experience, 21(11):1221–1248, November 1991.

  23. T. Hagerup. On saving space in parallel computation. Information Processing Letters, 29:327–329, 1988.

  24. R.N. Horspool. Practical fast searching in strings. Software-Practice and Experience, 10:501–506, 1980.

  25. R.M. Idury and A.A. Schaffer. Dynamic dictionary matching with failure functions. Proc. 3rd Annual Symposium on Combinatorial Pattern Matching, pages 273–284, 1992.

  26. J.Y. Kim and J. Shawe-Taylor. Fast Multiple Keyword Searching. Proc. of 3rd Combinatorial Pattern Matching Conference, pages 41–51, 1992. Tucson, Arizona.

  27. G. Kowalski and A. Meltzer. New multi-term high speed text search algorithms. 1st Conference on Computers and Applications, IEEE, 1984.

  28. D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6:322–350, 1977.

  29. T. Lecroq. A variation on the Boyer-Moore algorithm. Theoretical Computer Science, 92:119–144, Elsevier.

Editor information

Martin Farach-Colton

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ziv-Ukelson, M., Kershenbaum, A. (1998). A dictionary matching algorithm fast on the average for terms of varying length. In: Farach-Colton, M. (eds) Combinatorial Pattern Matching. CPM 1998. Lecture Notes in Computer Science, vol 1448. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0030779

  • DOI: https://doi.org/10.1007/BFb0030779

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-64739-3

  • Online ISBN: 978-3-540-69054-2

  • eBook Packages: Springer Book Archive
