Computing the Threshold for q-Gram Filters

  • Conference paper
Algorithm Theory — SWAT 2002 (SWAT 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2368)

Abstract

A popular and much-studied class of filters for approximate string matching is based on finding common q-grams, substrings of length q, between the pattern and the text. A variation of the basic idea uses gapped q-grams and has recently been shown to provide significant improvements in practice. A major difficulty with gapped q-gram filters is the computation of the so-called threshold, which defines the filter criterion. We describe the first general method for computing the threshold for q-gram filters. The method is based on a carefully chosen precise statement of the problem, which is then transformed into a constrained shortest path problem. In its generic form the method leaves certain parts open, but it is applicable to a large variety of q-gram filters and may be extensible even to other classes of filters. We also give a full algorithm for a specific subclass. For this subclass, the algorithm has been implemented and used successfully in an experimental comparison.
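
To make the notion of a threshold concrete, the following is a minimal sketch and not the constrained shortest path method of the paper. It assumes Hamming distance (mismatches only) and represents a q-gram shape as a tuple of offsets, so `(0, 1, 2)` is a contiguous 3-gram and `(0, 1, 3)` is a gapped one. The function names `gapped_qgrams` and `hamming_threshold` are hypothetical, chosen for illustration. Under these assumptions the threshold is the smallest number of q-gram placements in a pattern of length m that can survive any choice of k mismatch positions, and the sketch finds it by exhaustive search.

```python
from itertools import combinations


def gapped_qgrams(s, shape):
    """Return the q-grams of s for a shape given as sorted offsets.

    Example shape: (0, 1, 3) picks characters at i, i+1 and i+3.
    """
    span = shape[-1] + 1  # distance covered by one placement of the shape
    return ["".join(s[i + d] for d in shape) for i in range(len(s) - span + 1)]


def hamming_threshold(m, k, shape):
    """Brute-force threshold for pattern length m and k mismatches.

    Assumption: only mismatches occur (Hamming distance), so a placement
    of the shape survives exactly when none of its positions is an error.
    The result is the minimum number of surviving placements over all
    ways to choose k error positions.
    """
    span = shape[-1] + 1
    placements = [frozenset(i + d for d in shape) for i in range(m - span + 1)]
    best = len(placements)
    for errors in combinations(range(m), k):
        errs = set(errors)
        surviving = sum(1 for p in placements if p.isdisjoint(errs))
        best = min(best, surviving)
    return best


if __name__ == "__main__":
    print(gapped_qgrams("ACAGCT", (0, 1, 3)))
    print(hamming_threshold(11, 3, (0, 1, 2)))  # contiguous 3-grams
    print(hamming_threshold(11, 3, (0, 1, 3)))  # gapped shape
```

The exhaustive search is exponential in k and only practical for small inputs, which illustrates why an efficient general method for computing the threshold is valuable. For contiguous shapes under this model the result agrees with the classical bound m - q + 1 - kq whenever that bound is non-negative.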

Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT).

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kärkkäinen, J. (2002). Computing the Threshold for q-Gram Filters. In: Penttonen, M., Schmidt, E.M. (eds) Algorithm Theory — SWAT 2002. SWAT 2002. Lecture Notes in Computer Science, vol 2368. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45471-3_36

  • DOI: https://doi.org/10.1007/3-540-45471-3_36

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43866-3

  • Online ISBN: 978-3-540-45471-7

  • eBook Packages: Springer Book Archive
