Abstract
A popular and much-studied class of filters for approximate string matching is based on finding common q-grams, substrings of length q, between the pattern and the text. A variation of the basic idea uses gapped q-grams and has recently been shown to provide significant improvements in practice. A major difficulty with gapped q-gram filters is the computation of the so-called threshold, which defines the filter criterion. We describe the first general method for computing the threshold for q-gram filters. The method is based on a carefully chosen precise statement of the problem, which is then transformed into a constrained shortest path problem. In its generic form the method leaves certain parts open, but it is applicable to a large variety of q-gram filters and may be extensible even to other classes of filters. We also give a full algorithm for a specific subclass. For this subclass, the algorithm has been implemented and used successfully in an experimental comparison.
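To make the notion of a threshold concrete, the following is a minimal brute-force sketch, not the constrained shortest path method of the paper. It assumes Hamming distance (mismatches only) and assumes that the threshold is the minimum, over all placements of k mismatches in a pattern of length m, of the number of q-gram positions left untouched. The names `threshold` and `classical_threshold` and the encoding of a shape as a tuple of offsets are illustrative choices, not taken from the paper.

```python
from itertools import combinations


def threshold(shape, m, k):
    """Brute-force threshold of a (possibly gapped) q-gram shape.

    shape: tuple of offsets the shape samples, e.g. (0, 1, 3) for "##.#"
    m:     pattern length
    k:     number of mismatches (Hamming distance is an assumption here)
    """
    span = max(shape) + 1                # length covered by one q-gram
    starts = range(max(0, m - span + 1)) # start positions that fit in the pattern
    best = len(starts)
    # Try every way of placing k mismatches and count surviving q-grams.
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        surviving = sum(
            1 for s in starts if all(s + o not in bad for o in shape)
        )
        best = min(best, surviving)
    return best


def classical_threshold(q, m, k):
    """Well-known bound for contiguous q-grams: m - q + 1 - k*q, floored at 0."""
    return max(0, m - q + 1 - k * q)


if __name__ == "__main__":
    # Contiguous shape "###": brute force agrees with the classical formula.
    print(threshold((0, 1, 2), 13, 3), classical_threshold(3, 13, 3))
    # Gapped shape "##.#" on a very short pattern.
    print(threshold((0, 1, 3), 5, 1))
```

The enumeration visits every one of the C(m, k) mismatch placements, so it is only practical for small m and k. That cost is the kind of difficulty that motivates a dedicated method for computing the threshold.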
Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
© 2002 Springer-Verlag Berlin Heidelberg
Kärkkäinen, J. (2002). Computing the Threshold for q-Gram Filters. In: Penttonen, M., Schmidt, E.M. (eds) Algorithm Theory — SWAT 2002. SWAT 2002. Lecture Notes in Computer Science, vol 2368. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45471-3_36
DOI: https://doi.org/10.1007/3-540-45471-3_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43866-3
Online ISBN: 978-3-540-45471-7