Abstract
The q-gram filter is a popular filtering method for approximate string matching. It compares substrings of length q (the q-grams) in the pattern and the text to identify the text areas that might contain a match. A generalization of the method is to use gapped q-grams, subsets of q characters in some fixed non-contiguous shape, instead of contiguous substrings. Although mentioned a few times in the literature, this generalization has never been studied in any depth. In this paper, we report the first results from a study on gapped q-grams. We show that gapped q-grams can provide orders of magnitude faster and/or more efficient filtering than contiguous q-grams. The performance, however, depends on the shape of the q-grams. The best shapes are rare and often possess no apparent regularity. We show how to recognize good shapes and demonstrate with experiments their advantage over both contiguous and average shapes. We concentrate here on the k mismatches problem, but also outline an approach for extending the results to the more common k differences problem.
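To make the idea of a gapped q-gram filter for the k mismatches problem concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation studied in the paper. The shape notation (`#` for a sampled position, `.` for a gap) and the function names `shape_offsets`, `gapped_qgrams`, `threshold`, and `candidate_positions` are chosen for this example. The threshold is assumed to be the worst-case number of q-grams left intact by any placement of k mismatches, computed here by brute force, which is practical only for small pattern lengths and small k.

```python
from itertools import combinations


def shape_offsets(shape):
    """Offsets of the sampled positions in a shape string such as '##.#'."""
    return [i for i, c in enumerate(shape) if c == "#"]


def gapped_qgrams(s, shape):
    """All gapped q-grams of s for the given shape, in order of start position."""
    offs = shape_offsets(shape)
    span = len(shape)
    return ["".join(s[i + o] for o in offs) for i in range(len(s) - span + 1)]


def threshold(m, shape, k):
    """Minimum number of q-grams untouched by any k mismatches (brute force).

    Assumption of this sketch: a window with at most k mismatches shares
    at least this many q-grams with a pattern of length m.
    """
    offs = shape_offsets(shape)
    starts = range(m - len(shape) + 1)
    best = None
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        intact = sum(1 for i in starts if all(i + o not in bad for o in offs))
        best = intact if best is None else min(best, intact)
    return best


def candidate_positions(pattern, text, shape, k):
    """Text positions whose window shares at least `threshold` aligned q-grams."""
    m = len(pattern)
    t = threshold(m, shape, k)
    pattern_grams = gapped_qgrams(pattern, shape)
    candidates = []
    for j in range(len(text) - m + 1):
        window_grams = gapped_qgrams(text[j:j + m], shape)
        shared = sum(1 for a, b in zip(pattern_grams, window_grams) if a == b)
        if shared >= t:
            candidates.append(j)
    return candidates
```

For example, `candidate_positions("ACGTACGTAC", "TTTTACGTACCTACGGGG", "##.#", 1)` returns a list that includes position 4, where the pattern occurs with one mismatch. The filter may also report positions that are not true matches, and these must be checked in a separate verification step. If the threshold is zero, every position is reported and the shape filters nothing for that combination of pattern length and k. A practical filter would look up q-grams in an index rather than scan every window, which this sketch does only for clarity.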
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Burkhardt, S., Kärkkäinen, J. (2001). Better Filtering with Gapped q-Grams. In: Amir, A. (eds) Combinatorial Pattern Matching. CPM 2001. Lecture Notes in Computer Science, vol 2089. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48194-X_6
DOI: https://doi.org/10.1007/3-540-48194-X_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42271-6
Online ISBN: 978-3-540-48194-2
eBook Packages: Springer Book Archive