A novel look-ahead optimization strategy for trie-based approximate string matching

Badr, Ghada; Oommen, B. John

doi:10.1007/s10044-006-0036-8

Theoretical Advances
Published: 26 August 2006

Volume 9, pages 177–187, (2006)

Pattern Analysis and Applications

Ghada Badr¹ & B. John Oommen¹


Abstract

This paper deals with the problem of estimating a transmitted string X* by processing the corresponding string Y, which is a noisy version of X*. We assume that Y contains substitution, insertion, and deletion errors, and that X* is an element of a finite (but possibly large) dictionary, H. The best estimate X⁺ of X* is defined as the element X of H that minimizes the generalized Levenshtein distance D(X, Y) over all X ∈ H, subject to the total number of errors being at most K. The trie is a data structure that offers search costs that are independent of the document size. Tries also merge common prefixes, so by using tries in approximate string matching we can reuse the information obtained while evaluating any one D(X_i, Y) to compute any other D(X_j, Y), where X_i and X_j share a common prefix. In the artificial intelligence (AI) domain, branch and bound (BB) schemes are used to prune paths whose costs exceed a certain threshold. These techniques have been applied to prune, for example, game trees. In this paper, we present a new BB pruning strategy that can be applied to dictionary-based approximate string matching when the dictionary is stored as a trie. The new strategy attempts to look ahead at each node, c, before moving further, by merely evaluating a certain local criterion at c. Under this pruning strategy, the search algorithm will not traverse subtrie(c) unless there is a “hope” of finding a suitable string in it. In other words, as opposed to the reported trie-based methods (Kashyap and Oommen in Inf Sci 23(2):123–142, 1981; Shang and Merrett in IEEE Trans Knowledge Data Eng 8(4):540–547, 1996), the pruning is done a priori, before even embarking on the edit distance computations. The new strategy depends highly on the variance of the lengths of the strings in H. It combines the advantages of partitioning the dictionary according to the string lengths with the advantages gleaned by representing H using the trie data structure. The results demonstrate a marked improvement (up to 30% when costs are of a 0/1 form, and up to 47% when costs are general) with respect to the number of operations needed on three benchmark dictionaries.
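
The abstract does not state the local criterion evaluated at each node, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes unit (0/1) edit costs and an assumed length-based look-ahead: each trie node records the shortest and longest string in its subtrie, and because D(X, Y) is at least the difference in lengths of X and Y, a subtrie is skipped before any edit-distance row is computed when that bound alone exceeds K. The names `Node`, `build_trie`, `search`, and `use_lookahead` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Trie node annotated with the shortest/longest word length in its subtrie."""
    children: dict = field(default_factory=dict)
    is_word: bool = False
    min_len: int = 10**9  # shortest word stored at or below this node
    max_len: int = 0      # longest word stored at or below this node


def build_trie(words):
    """Insert every word and update the length bounds along its path."""
    root = Node()
    for w in words:
        node = root
        path = [root]
        for ch in w:
            node = node.children.setdefault(ch, Node())
            path.append(node)
        node.is_word = True
        for n in path:
            n.min_len = min(n.min_len, len(w))
            n.max_len = max(n.max_len, len(w))
    return root


def search(root, y, k, use_lookahead=True):
    """Return ({word: distance}, visited) for words within k unit-cost edits of y.

    `visited` counts the nodes for which a row of the edit-distance table
    was actually computed.
    """
    m = len(y)
    found = {}
    visited = 0

    def visit(node, prefix, prev_row):
        nonlocal visited
        for ch, child in node.children.items():
            # Assumed look-ahead criterion: every word X below `child` has
            # min_len <= len(X) <= max_len, and D(X, Y) >= |len(X) - len(Y)|,
            # so the whole subtrie can be skipped before computing any row.
            if use_lookahead and (m < child.min_len - k or m > child.max_len + k):
                continue
            visited += 1
            row = [prev_row[0] + 1]
            for j in range(1, m + 1):
                row.append(min(
                    row[j - 1] + 1,                      # extra symbol in Y (insertion)
                    prev_row[j] + 1,                     # symbol of X missing in Y (deletion)
                    prev_row[j - 1] + (y[j - 1] != ch),  # substitution or match
                ))
            if child.is_word and row[m] <= k:
                found[prefix + ch] = row[m]
            # Conventional cutoff: descend only if some prefix alignment is still within k.
            if min(row) <= k:
                visit(child, prefix + ch, row)

    visit(root, "", list(range(m + 1)))
    return found, visited
```

Calling `search(build_trie(words), y, k)` with and without `use_lookahead` should return the same matches, while the look-ahead variant computes rows for fewer nodes whenever the dictionary contains strings whose lengths differ from that of Y by more than K. This is consistent with the abstract's remark that the benefit depends on the variance of the string lengths in H.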

Notes

Observe that our method is quite distinct from the dictionary partitioning strategy which is also based on string lengths [11].
The basic addition operation involves adding the inter-symbol distance to the currently computed inter-string distance, and the minimization operation involves evaluating the minimum of the corresponding terms in the DP equation and in the LHBB condition.
This file is available at http://www.scs.carleton.ca/~oommen/papers/WordWldn.txt.
The actual dictionary can be downloaded from http://www.cs.princeton.edu/~rs/strings/dictwords.
It can be downloaded from http://www.scs.carleton.ca/~oommen/papers/QWERTY.doc.

References

1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) A basic local alignment search tool. J Mol Biol 215:403–410
2. Baeza-Yates RA, Gonnet GH (1982) A new approach to text searching. In: Annual ACM-SIGIR conference on information retrieval, Cambridge, MA, June 1982, pp 168–175
3. Bentley J, Sedgewick R (1997) Fast algorithms for sorting and searching strings. In: Eighth annual ACM-SIAM symposium on discrete algorithms, New Orleans, January 1997, pp 360–369
4. Bucher P, Hoffmann K (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In: Proceedings of the fourth international conference on intelligent systems for molecular biology, ISMB, vol 96, pp 44–51
5. Bunke H (1993) Structural and syntactic pattern recognition. In: Chen CH, Pau LF, Wang PSP (eds) Handbook of pattern recognition and computer vision. World Scientific, Singapore
6. Bunke H, Csirik J (1993) Parametric string edit distance and its application to pattern recognition. IEEE Trans Syst Man Cybern SMC-25(1):202–206
7. Chang W, Lawler E (1992) Approximate string matching in sublinear expected time. In: 13th annual symposium on foundations of computer science, St. Louis, Missouri, October 1992. IEEE Computer Society Press, pp 116–124
8. Clement J, Flajolet P, Vallee B (1998) The analysis of hybrid trie structures. In: Proceedings of the annual ACM–SIAM symposium on discrete algorithms, San Francisco, CA, pp 531–539
9. Crochemore M, Landau GM, Ziv-Ukelson M (2003) A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J Comput 32(6):1654–1673
10. Dewey G (1923) Relative frequency of English speech sounds. Harvard University Press, Cambridge, MA
11. Du M, Chang S (1994) An approach to designing very fast approximate string matching algorithms. IEEE Trans Knowledge Data Eng 6(4):620–633
12. Firebaugh M (1988) Artificial intelligence: a knowledge-based approach. Boyd and Fraser, Boston
13. Hirschberg DS (1975) A linear space algorithm for computing maximal common subsequence. Commun ACM 18(6):341–343
14. Hunt JW, Szymanski TG (1977) A fast algorithm for computing longest common subsequences. Commun Assoc Comput Mach 20:350–353
15. Kashyap RL, Oommen BJ (1981) An effective algorithm for string correction using generalized edit distances - I: description of the algorithm and its optimality. Inf Sci 23(2):123–142
16. Levenshtein A (1966) Binary codes capable of correcting deletions, insertions and reversals. Sov Phys Dokl 10:707–710
17. Masek WJ, Paterson MS (1980) A faster algorithm computing string edit distances. J Comput Syst Sci 20:18–31
18. Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
19. Oflazer K (1996) Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Comput Linguist 22(1):73–89
20. Oommen BJ (1987) Recognition of noisy subsequences using constrained edit distances. IEEE Trans Pattern Anal Mach Intel PAMI 9:676–685
21. Oommen BJ, Badr G (2004) Dictionary-based syntactic pattern recognition using tries. In: Proceedings of the joint IAPR international workshops SSPR 2004 and SPR 2004, Lisbon, August 2004
22. Oommen BJ, Kashyap RL (1998) A formal theory for optimal and information theoretic syntactic pattern recognition. Pattern Recognit 31:1159–1177
23. Oommen BJ, Loke RKS (1999) Designing syntactic pattern classifiers using vector quantization and parametric string editing. IEEE Trans Syst Man Cybern SMC-29:881–888
24. Oommen BJ, Loke RKS (2006) Syntactic pattern recognition involving traditional and generalized transposition errors: attaining the information theoretic bound (submitted)
25. Peterson JL (1980) Computer programs for detecting and correcting spelling errors. Commun Assoc Comput Mach 23:676–687
26. Sankoff D, Kruskal JB (1983) Time warps, string edits and macromolecules: the theory and practice of sequence comparison. Addison–Wesley, Reading, MA
27. Shang H, Merrett TH (1996) Tries for approximate string matching. IEEE Trans Knowledge Data Eng 8(4):540–547
28. Stephen GA (2000) String searching algorithms. Lecture notes series on computing, vol 6. World Scientific, Singapore, NJ
29. Ukkonen E (1985) Algorithm for approximate string matching. Inf Control 64:100–118
30. Wagner RA (1974) Order-n correction for regular languages. Commun ACM 17:265–268
31. Wagner R, Fischer A (1974) The string-to-string correction problem. J Assoc Comput Machinery (ACM) 21:168–173
32. Wu S, Manber U (1992) Fast text searching allowing errors. Commun ACM 35(10):83–91

Author information

Authors and Affiliations

School of Computer Science, Carleton University, 1125 Colonel By Dr., Ottawa, ON, Canada, K1S 5B6
Ghada Badr & B. John Oommen

Corresponding author

Correspondence to Ghada Badr.

Additional information

A preliminary version of some of the results of this paper was presented at CORES’05, the 4th international conference on computer recognition systems, Rydzyna Castle, Poland, May 2005.

About this article

Cite this article

Badr, G., Oommen, B.J. A novel look-ahead optimization strategy for trie-based approximate string matching. Pattern Anal Applic 9, 177–187 (2006). https://doi.org/10.1007/s10044-006-0036-8

Received: 17 November 2005
Accepted: 22 March 2006
Published: 26 August 2006
Issue Date: October 2006
DOI: https://doi.org/10.1007/s10044-006-0036-8
