Abstract
We examine the exact dictionary matching problem with dynamic text and static terms, and propose a simple but efficient algorithm whose average performance is sublinear in the size of the text for a wide range of practical problems. The algorithm is based on the Commentz-Walter-Horspool algorithm (CWH), presented by Baeza-Yates and Régnier [10]. Typically, our refinement prunes more than 30% of the characters scanned by CWH when searching natural language text for all occurrences of tags that vary in length and belong to a set of moderate size. This problem arises frequently in practice when scanning text downloaded from the internet, and it accounts for a major portion of the preprocessing time associated with indexing such text for later retrieval. Our approach, which we refer to as layering, keeps track of an upper bound on the maximal length of potential term prefixes ending at each given position in the text. This information is then used to mask out some of the terms and to filter out unnecessary character comparisons during the search. A practical implementation is described, which increases the size of the existing data structures, as well as the preprocessing cost, only by a factor of the length of the longest term in the set.
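For orientation, the following is a minimal sketch of a Horspool-style multi-term search of the kind the abstract takes as its baseline. It is not the authors' implementation and does not include the layering refinement; it only illustrates how a trie of reversed terms combined with a bad-character shift table can skip text characters. The function names (`build_reverse_trie`, `build_shift_table`, `set_horspool_search`) and the dictionary-based trie are assumptions made for illustration.

```python
def build_reverse_trie(terms):
    """Build a trie of the reversed terms; the key None marks a complete term."""
    root = {}
    for term in terms:
        node = root
        for ch in reversed(term):
            node = node.setdefault(ch, {})
        node[None] = term
    return root


def build_shift_table(terms):
    """Horspool-style shifts, capped at the length of the shortest term."""
    lmin = min(len(t) for t in terms)
    shift = {}
    for term in terms:
        last = len(term) - 1
        # The final character of each term is excluded, as in Horspool's rule.
        for j, ch in enumerate(term[:-1]):
            shift[ch] = min(shift.get(ch, lmin), last - j, lmin)
    return shift, lmin


def set_horspool_search(text, terms):
    """Return (start, term) for every occurrence of every term in text."""
    trie = build_reverse_trie(terms)
    shift, lmin = build_shift_table(terms)
    hits = []
    pos = lmin - 1  # index of the text character aligned with the term ends
    while pos < len(text):
        node, i = trie, pos
        # Scan the text right to left while it follows a path in the trie.
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            i -= 1
            if None in node:
                hits.append((i + 1, node[None]))
        # Characters absent from the table allow the maximal shift lmin.
        pos += shift.get(text[pos], lmin)
    return hits
```

In this sketch the shift depends only on the text character under the right end of the window, so the shortest term limits how far the search can jump; the abstract's layering idea targets the comparisons that remain after such shifts.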
References
A.V. Aho. Algorithms for Finding Patterns in Strings. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, pages 257–300. Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1990.
A.V. Aho and M. Corasick. Efficient string matching: An aid to bibliographic search. Comm. of the ACM, 18(6):333–340, June 1975.
A. Amir and M. Farach. Adaptive dictionary matching. Proc. 32nd IEEE FOCS, pages 760–766, 1991.
A. Amir, M. Farach, R. Giancarlo, Z. Galil, and K. Park. Dynamic dictionary matching. Journal of Computer and System Sciences, 49(2):208–222, 1994.
A. Amir, M. Farach, R.M. Idury, J.A. La Poutré, and A.A. Schaffer. Improved dynamic dictionary matching. Information and Computation, 119(2):258–282, 1995.
A. Amir, M. Farach, and Y. Matias. Efficient randomized dictionary matching algorithms. Proc. of 3rd Combinatorial Pattern Matching Conference, pages 259–272, 1992. Tucson, Arizona.
A. Apostolico and R. Giancarlo. The Boyer-Moore-Galil string searching strategies revisited. SIAM J. Comput., 15(1):98–105, 1986.
R.S. Boyer and J.S. Moore. A fast string matching algorithm. Comm. of the ACM, 20:762–772, 1977.
R. Baeza-Yates and G.H. Gonnet. On Boyer-Moore automata. Research report, University of Waterloo, 1989.
R. Baeza-Yates and M. Régnier. Fast Algorithms for Two Dimensional and Multiple Pattern Matching (Preliminary version). SWAT 90, In Proc. 2nd Scandinavian Workshop on Algorithm Theory. Number 447 in Lecture Notes in Computer Science, pages 332–347, Springer-Verlag, Bergen, Norway, July 1990.
D. Breslauer. Dictionary-Matching on Unbounded Alphabets: Uniform-Length Dictionaries. Proc. of 5th Combinatorial Pattern Matching Conference, pages 184–197, 1994. Asilomar, CA, USA.
V. Bruyère, R. Baeza-Yates, O. Delgrange, and R. Scheihing. On the size of Boyer-Moore Automata. Proceedings of Third South American Workshop on String Processing, pages 31–46. Recife, Brazil, August 1996.
M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Fast Practical Multi-Pattern Matching. Technical Report 93–3, Institut Gaspard Monge, Université de Marne la Vallée, Marne la Vallée, France, 1993.
L. Colussi, Z. Galil, and R. Giancarlo. The exact complexity of string matching. 31st Symposium on Foundations of Computer Science, I, 135–143, IEEE (October 22–24, 1990).
R. Cole. Tight bounds on the complexity of the Boyer-Moore pattern matching algorithm. Technical Report 512, Computer Science Dept, New York University (June 1990).
B. Commentz-Walter. A string matching algorithm fast on the average. Technical Report 79.09.007, IBM Wissenschaftliches Zentrum, Heidelberg, Germany, 1979.
B. Commentz-Walter. A string matching algorithm fast on the average. Proc. 6th International Colloquium on Automata, Languages, and Programming, Lecture Notes in Computer Science, pages 118–132. Springer-Verlag, Berlin, Germany, 1979.
J.J. Fan and K.Y. Su. An efficient algorithm for matching multiple patterns. IEEE Transactions on Knowledge and Data Engineering, 5(2):339–351, April 1993.
Z. Galil. On improving the worst case running time of the Boyer-Moore string matching algorithm. Comm. of the ACM, 22(9):505–508, 1979.
L.J. Guibas and A.M. Odlyzko. A new proof of the linearity of the Boyer-Moore string searching algorithm. SIAM J. Comput., 9:672–682, 1980.
D. Gusfield. Algorithms on Strings, Trees and Sequences. Press Syndicate of the University of Cambridge, 1997, pages 157–164.
A. Hume and D. Sunday. Fast String Searching. Software-Practice and Experience, 21(11):1221–1248, November 1991.
T. Hagerup. On saving space in parallel computation. Information Processing Letters, 29:327–329, 1988.
R.N. Horspool. Practical fast searching in strings. Software-Practice and Experience, 10:501–506, 1980.
R.M. Idury and A.A. Schaffer. Dynamic dictionary matching with failure functions. Proc. 3rd Annual Symposium on Combinatorial Pattern Matching, pages 273–284, 1992.
J.Y. Kim and J. Shawe-Taylor. Fast Multiple Keyword Searching. Proc. of 3rd Combinatorial Pattern Matching Conference, pages 41–51, 1992. Tucson, Arizona.
G. Kowalski and A. Meltzer. New Multi-Term high speed text search algorithms. 1st Conference on Computers and Applications, IEEE (1984).
D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6:322–350, 1977.
T. Lecroq. A variation on the Boyer-Moore algorithm. Theoretical Computer Science, 92:119–144, Elsevier.
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ziv-Ukelson, M., Kershenbaum, A. (1998). A dictionary matching algorithm fast on the average for terms of varying length. In: Farach-Colton, M. (eds) Combinatorial Pattern Matching. CPM 1998. Lecture Notes in Computer Science, vol 1448. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0030779
DOI: https://doi.org/10.1007/BFb0030779
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64739-3
Online ISBN: 978-3-540-69054-2
eBook Packages: Springer Book Archive