Breadth-first search strategies for trie-based syntactic pattern recognition

Oommen, B. John; Badr, Ghada

doi:10.1007/s10044-006-0032-z

Theoretical Advances
Published: 06 October 2006

Volume 10, pages 1–13, (2007)
Pattern Analysis and Applications

B. John Oommen¹ &
Ghada Badr¹

Abstract

Dictionary-based syntactic pattern recognition of strings attempts to recognize a transmitted string X*, by processing its noisy version, Y, without sequentially comparing Y with every element X in the finite (but possibly large) dictionary, H. The best estimate X⁺ of X* is defined as that element of H which minimizes the generalized Levenshtein distance (GLD) D(X, Y) between X and Y, over all X ∈ H. The non-sequential PR computation of X⁺ involves a compact trie-based representation of H. In this paper, we show how this computation can be optimized by incorporating breadth-first search (BFS) schemes on the underlying graph structure. This heuristic emerges from the trie-based dynamic programming recursive equations, which can be effectively implemented using a new data structure called the linked list of prefixes, built either separately or “on top of” the trie representation of H. The new scheme does not restrict the number of errors in Y to a small constant, as most of the available methods do. The main contribution is that our new approach can be used for GLDs with general edit costs, and not merely for 0/1 costs. It is also applicable when all possible correct candidates need to be known, and not just the best match. These are the cases in which the “cutoffs” cannot be used in the depth-first search (DFS) trie-based technique (Shang and Merrett in IEEE Trans Knowl Data Eng 8(4):540–547, 1996). The new technique is compared with the DFS trie-based technique (Risvik in United States Patent 6377945 B1, 23 April 2002; Shang and Merrett in IEEE Trans Knowl Data Eng 8(4):540–547, 1996) using three benchmark dictionaries, both large and small, with different errors. In each case, we demonstrate marked improvements of up to 21% in the number of operations needed, while maintaining the same accuracy. Further improvements can be obtained by introducing knowledge of the maximum number or percentage of errors in Y.
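
As a minimal illustration of computing distances over a trie level by level, the following Python sketch keeps one dynamic-programming row per trie node and visits the nodes in breadth-first order. It is not the linked list of prefixes described above, and it applies no pruning. The unit-cost functions `sub_cost`, `ins_cost` and `del_cost`, the nested-dict trie, the `END` marker, and the name `best_match_bfs` are all assumptions made for this example.

```python
from collections import deque

END = ""  # hypothetical end-of-word key; never collides with a one-character key


# Hypothetical unit edit costs (assumptions, not values from the paper).
def sub_cost(a, b):
    return 0.0 if a == b else 1.0


def ins_cost(b):
    return 1.0


def del_cost(a):
    return 1.0


def build_trie(words):
    """Build a nested-dict trie; node[END] stores the word ending at that node."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = w
    return root


def best_match_bfs(words, y):
    """Return (best word, distance) by visiting the trie level by level."""
    root = build_trie(words)
    # Row for the empty prefix: entry j is the cost of inserting y[:j].
    first = [0.0]
    for b in y:
        first.append(first[-1] + ins_cost(b))
    best_word, best_dist = None, float("inf")
    queue = deque([(root, first)])
    while queue:
        node, row = queue.popleft()
        for ch, child in node.items():
            if ch == END:
                # row[-1] is the distance between this whole word and y.
                if row[-1] < best_dist:
                    best_word, best_dist = child, row[-1]
                continue
            # Extend the parent's row by one dictionary symbol.
            new = [row[0] + del_cost(ch)]
            for j, b in enumerate(y, start=1):
                new.append(min(
                    row[j - 1] + sub_cost(ch, b),  # substitute ch by b
                    row[j] + del_cost(ch),         # delete ch
                    new[j - 1] + ins_cost(b),      # insert b
                ))
            queue.append((child, new))
    return best_word, best_dist
```

For example, `best_match_bfs(["cat", "car", "cart", "dog"], "cxt")` returns `("cat", 1.0)` under these unit costs. Because every word sharing a prefix reuses that prefix's row, the work grows with the number of trie nodes times the length of Y, rather than with the total length of all words in H.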

Notes

In terms of notation, A is a finite alphabet, H is a finite (but possibly large) dictionary, and μ is the null string, distinct from λ, the null symbol. The left derivative of order one of any string Z = z₁z₂...z_k is the string Z_p = z₁z₂...z_{k-1}. The left derivative of order two of Z is the left derivative of order one of Z_p, and so on.
This file is available at http://www.scs.carleton.ca/~oommen/papers/WordWldn.txt
The actual dictionary can be downloaded from http://www.cs.princeton.edu/~rs/strings/dictwords
It can be downloaded from http://www.scs.carleton.ca/~oommen/papers/QWERTY.doc
A BFS technique was earlier shown in [29] to be far superior to a method which computes X⁺ by sequentially comparing every X ∈ H with Y. What we demonstrate here is that it is also uniformly superior to a DFS method.
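
The left-derivative notation in the first note above amounts to dropping trailing symbols from a string. A small Python sketch follows; the function name `left_derivative` is hypothetical, and returning the empty string when the order exceeds the string length is an assumption, since the note does not cover that case.

```python
def left_derivative(z, order=1):
    """Return z with its last `order` symbols removed."""
    if order < 0:
        raise ValueError("order must be non-negative")
    # Assumption: an order larger than len(z) yields the empty string.
    return z[:max(len(z) - order, 0)]
```

With Z = "trie", `left_derivative("trie")` gives "tri" and `left_derivative("trie", 2)` gives "tr".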

References

Acharya A, Zhu H, Shen K (1999) Adaptive algorithms for cache-efficient trie search. In: ACM and SIAM workshop on algorithm engineering and experimentation, January 1999, pp 296–311
Amengual JC, Vidal E (1998) Efficient error-correcting Viterbi parsing. IEEE Trans Commun 20(10):1109–1116
Amengual JC, Vidal E (1998) The Viterbi algorithm. IEEE Trans Pattern Anal Mach Intell 20(10):268–278
Baeza-Yates R, Navarro G (1998) Fast approximate string matching in a dictionary. In: Proceedings of the 5th South American symposium on string processing and information retrieval (SPIRE’98), IEEE CS Press, pp 14–22
Bentley J, Sedgewick R (1997) Fast algorithms for sorting and searching strings. In: 8th annual ACM-SIAM symposium on discrete algorithms, New Orleans, January 1997, pp 360–369
Bouloutas A, Hart GW, Schwartz M (1991) Two extensions of the Viterbi algorithm. IEEE Trans Inf Theory 37(2):430–436
Bucher P, Hoffmann K (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In: Proceedings of the 4th international conference on intelligent systems for molecular biology, ISMB, vol 96. AAAI Press, Menlo Park, pp 44–51
Bunke H (1993) Structural and syntactic pattern recognition. In: Chen CH, Pau LF, Wang PSP (eds) Handbook of pattern recognition and computer vision. World Scientific, Singapore
Bunke H (1995) Fast approximate matching of words against a dictionary. Computing 55(1):75–89
Bunke H, Csirik J (1993) Parametric string edit distance and its application to pattern recognition. IEEE Trans Syst Man Cybern 25(1):202–206
Clement J, Flajolet P, Vallee B (1998) The analysis of hybrid trie structures. In: Proceedings of the annual ACM-SIAM symposium on discrete algorithms, San Francisco, California, 1998, pp 531–539
Cole R, Gottlieb L, Lewenstein M (2004) Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th annual ACM symposium on theory of computing, Chicago, IL, USA, June 2004, pp 91–100
Cormen TH, Leiserson CE, Rivest RL (1990) Introduction to algorithms. The MIT Press, Cambridge
Dewey G (1923) Relative frequency of English speech sounds. Harvard University Press, MA
Du M, Chang S (1994) An approach to designing very fast approximate string matching algorithms. IEEE Trans Knowl Data Eng 6(4):620–633
Forney GD (1973) The Viterbi algorithm. Proc IEEE 61(3):268–278
Fuketa M, Sumitomo T, Shishibori M, Aoe J (1999) A suffix compression algorithm of tries. In: ICCPOL’99: 18th international conference on computer processing of oriental languages, vol 18, pp 345–348
Kashyap RL, Oommen BJ (1981) An effective algorithm for string correction using generalized edit distances. I. Description of the algorithm and its optimality. Inf Sci 23(2):123–142
Kashyap RL, Oommen BJ (1984) String correction using probabilistic methods. Pattern Recognit Lett pp 147–154
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Phys Dokl 10:707–710
Masek WJ, Paterson MS (1980) A faster algorithm computing string edit distances. J Comput Syst Sci 20:18–31
Mihov S, Schulz K (2002) Fast approximate string matching in large dictionaries. Available: www.cis.uni-muenchen.de//people//schulz//pub//fastapproxsearch.pdf
Miclet L (1990) Grammatical inference. Syntactic Struct Pattern Recognit Appl 237–290
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surveys 33(1):31–88
Oflazer K (1996) Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Comput Linguist 22(1):73–89
Okuda T, Tanaka E, Kasai T (1976) A method of correction of garbled words based on the Levenshtein metric. IEEE Trans Comput 25:172–177
Oommen BJ (1987) Constrained string editing. Inf Sci 40(3):267–284
Oommen BJ (1987) Recognition of noisy subsequences using constrained edit distances. IEEE Trans Pattern Anal Mach Intell 9:676–685
Oommen BJ, Badr G (2004) Dictionary-based syntactic pattern recognition using tries. In: Proceedings of the joint IAPR international workshops SSPR 2004 and SPR 2004, Lisbon, Portugal, August 2004, pp 251–259
Oommen BJ, Kashyap RL (1998) A formal theory for optimal and information theoretic syntactic pattern recognition. Pattern Recognit 31:1159–1177
Oommen BJ, Loke RKS. Syntactic pattern recognition involving traditional and generalized transposition errors: Attaining the information theoretic bound. (Submitted)
Oommen BJ, Loke RKS (1997) Pattern recognition of strings with substitutions, insertions, deletions and generalized transposition. Pattern Recognit 30:789–800
Oommen BJ, Loke RKS (1999) Designing syntactic pattern classifiers using vector quantization and parametric string editing. IEEE Trans Syst Man Cybern 29:881–888
Perez-Cortes JC, Amengual JC, Arlandis J, Llobet R (2000) Stochastic error correcting parsing for OCR post-processing. In: International conference on pattern recognition ICPR-2000, Barcelona, 2000, pp 4405–4408
Peterson JL (1980) Computer programs for detecting and correcting spelling errors. Commun Assoc Comput Mach 23:676–687
Risvik KM (2002) Search system and method for retrieval of data, and the use thereof in a search engine. United States Patent 6377945 B1, April 23 2002
Sankoff D, Kruskal JB (1983) Time warps, string edits and macromolecules: the theory and practice of sequence comparison. Addison-Wesley, Reading
Schulz K, Mihov S (2002) Fast string correction with Levenshtein-automata. Int J Doc Anal Recognit 5(1):67–85
Shang H, Merrett TH (1996) Tries for approximate string matching. IEEE Trans Knowl Data Eng 8(4):540–547
Stephen GA (1989) String searching. Prentice-Hall, Englewood Cliffs
Stephen GA (2000) String searching algorithms, vol 6. Lecture Notes Series on Computing, World Scientific, Singapore
Ukkonen E (1985) Algorithm for approximate string matching. Inf Control 64:100–118
Viterbi AJ (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13:260–269
Wagner RA, Fischer MJ (1974) The string-to-string correction problem. J Assoc Comput Mach 21:168–173
Wagner RA (1974) Order-n correction for regular languages. Commun ACM 17:265–268

Author information

Authors and Affiliations

School of Computer Science, Carleton University, 1125 Colonel By Dr., Ottawa, ON, Canada, K1S 5B6
B. John Oommen & Ghada Badr

Corresponding author

Correspondence to Ghada Badr.

Additional information

B. John Oommen dedicates this paper to his good friends Rama Chellappa, Horst Bunke and Alberto Sanfeliu, with whom he spent many hours in the “Pattern Recognition Lab” at Purdue between 1978 and 1981. “Thanks, my friends”. B. John Oommen was partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.

About this article

Cite this article

Oommen, B.J., Badr, G. Breadth-first search strategies for trie-based syntactic pattern recognition. Pattern Anal Applic 10, 1–13 (2007). https://doi.org/10.1007/s10044-006-0032-z

Received: 26 April 2005
Accepted: 24 March 2006
Published: 06 October 2006
Issue Date: February 2007
DOI: https://doi.org/10.1007/s10044-006-0032-z
