Better Filtering with Gapped q-Grams

  • Conference paper
Combinatorial Pattern Matching (CPM 2001)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2089))

Abstract

The q-gram filter is a popular filtering method for approximate string matching. It compares substrings of length q (the q-grams) in the pattern and the text to identify the text areas that might contain a match. A generalization of the method is to use gapped q-grams, subsets of q characters in some fixed non-contiguous shape, instead of contiguous substrings. Although mentioned a few times in the literature, this generalization has never been studied in any depth. In this paper, we report the first results from a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. The performance, however, depends on the shape of the q-grams. The best shapes are rare and often possess no apparent regularity. We show how to recognize good shapes and demonstrate with experiments their advantage over both contiguous and average shapes. We concentrate here on the k mismatches problem, but also outline an approach for extending the results to the more common k differences problem.
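
To make the idea of a shape concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes a shape is written as a string in which `#` marks a sampled position and `.` marks a gap, and that the pattern and the text window have the same length m (the k mismatches setting). The function names `shape_offsets`, `shared_qgrams` and `threshold` are hypothetical. The threshold is found by brute force over all placements of k mismatches, which is only practical for small m and k.

```python
from itertools import combinations


def shape_offsets(shape):
    """Offsets of the sampled positions: '#' is sampled, '.' is a gap."""
    return [i for i, c in enumerate(shape) if c == "#"]


def gapped_qgram(s, start, offsets):
    """The q characters of s picked by the shape when it is placed at start."""
    return "".join(s[start + o] for o in offsets)


def shared_qgrams(pattern, window, shape):
    """Count aligned placements where pattern and window give the same q-gram.

    Assumes len(window) == len(pattern), as in the k mismatches problem.
    """
    offsets = shape_offsets(shape)
    span = len(shape)
    return sum(
        gapped_qgram(pattern, j, offsets) == gapped_qgram(window, j, offsets)
        for j in range(len(pattern) - span + 1)
    )


def threshold(shape, m, k):
    """Fewest q-grams that survive any placement of k mismatches.

    Brute force over all k-subsets of the m positions (illustration only).
    A window with at most k mismatches shares at least this many q-grams
    with the pattern, so windows below the threshold can be discarded.
    """
    offsets = shape_offsets(shape)
    starts = range(m - len(shape) + 1)
    worst = len(starts)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        surviving = sum(
            all(j + o not in bad for o in offsets) for j in starts
        )
        worst = min(worst, surviving)
    return worst


if __name__ == "__main__":
    for shape in ("###", "##.#"):
        print(shape, threshold(shape, m=11, k=3))
```

For m = 11 and k = 3, this brute force gives a threshold of 0 for the contiguous shape `###` and 1 for the gapped shape `##.#`. A threshold of 0 means the filter cannot discard any window, while a positive threshold lets it discard every window that shares fewer q-grams with the pattern.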

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Burkhardt, S., Kärkkäinen, J. (2001). Better Filtering with Gapped q-Grams. In: Amir, A. (eds) Combinatorial Pattern Matching. CPM 2001. Lecture Notes in Computer Science, vol 2089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48194-X_6

  • DOI: https://doi.org/10.1007/3-540-48194-X_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42271-6

  • Online ISBN: 978-3-540-48194-2

  • eBook Packages: Springer Book Archive
