Abstract
We present a new indexing method for the approximate string matching problem. The method is based on a suffix tree combined with a partitioning of the pattern. We analyze the resulting algorithm and show that the retrieval time is \( O(n^{\lambda}) \), for \( 0 < \lambda < 1 \), whenever \( \alpha < 1 - e/\sqrt{\sigma} \), where α is the error level tolerated and σ is the alphabet size. We show experimentally that this index outperforms by far all other algorithms for indexed approximate searching; these are also the first experiments to compare the different existing schemes. Finally, we show how this index can be implemented using much less space.
This work has been supported in part by Fondecyt grant 1-990627 and Fondef grant 96-1064.
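As a rough illustration of the condition stated in the abstract, the sketch below evaluates the error-level limit for a few alphabet sizes, assuming the condition reads α < 1 − e/√σ. It also shows one possible way to split a pattern into contiguous pieces. The abstract does not say how the pattern is partitioned, so the function names `max_error_level` and `split_pattern` and the even-split strategy are hypothetical and are not taken from the paper.

```python
import math


def max_error_level(sigma):
    """Limit 1 - e/sqrt(sigma) on the error level alpha (assumed reading of the abstract)."""
    return 1.0 - math.e / math.sqrt(sigma)


def split_pattern(pattern, j):
    """Split a pattern into j contiguous pieces of near-equal length.

    Hypothetical helper: the even split is an assumption for illustration only.
    """
    m = len(pattern)
    bounds = [i * m // j for i in range(j + 1)]
    return [pattern[bounds[i]:bounds[i + 1]] for i in range(j)]


if __name__ == "__main__":
    # Alphabet sizes chosen only as examples (DNA, proteins, letters, bytes).
    for sigma in (4, 20, 26, 256):
        print(f"sigma={sigma:3d}  limit on alpha = {max_error_level(sigma):.3f}")

    # The pieces always concatenate back to the original pattern.
    print(split_pattern("approximate", 3))
```

Under this reading the limit is negative for σ = 4, so the sublinear bound would not cover any error level for very small alphabets, while larger alphabets tolerate proportionally more errors.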
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Navarro, G., Baeza-Yates, R. (1999). A New Indexing Method for Approximate String Matching. In: Crochemore, M., Paterson, M. (eds) Combinatorial Pattern Matching. CPM 1999. Lecture Notes in Computer Science, vol 1645. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48452-3_13
DOI: https://doi.org/10.1007/3-540-48452-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66278-5
Online ISBN: 978-3-540-48452-3
eBook Packages: Springer Book Archive