
Pattern Recognition

Volume 34, Issue 1, January 2001, Pages 37-46

Syntactic methodology of pruning large lexicons in cursive script recognition

https://doi.org/10.1016/S0031-3203(99)00201-0

Abstract

In this paper, we present a holistic technique for pruning large lexicons in the recognition of off-line cursive script words. The technique involves extracting and representing downward pen-strokes from the off-line cursive word to obtain a descriptor that provides a coarse characterization of word shape. Elastic matching is used to match the image descriptor with “ideal” descriptors corresponding to lexicon entries organized as a trie of stroke classes. On a set of 23,335 real cursive word images the reduction is about 70% with accuracy above 75%.

Introduction

Research in offline handwritten word recognition has traditionally concentrated on relatively small lexicons of ten to a thousand words. Several real-world applications, such as the recognition of English prose, involve large lexicons of 10,000–50,000 words. Recognition with such lexicons may be made efficient by initially eliminating lexicon entries that are unlikely to match the given image. This process is called lexicon reduction or lexicon pruning, and has the desirable side effect of improving recognition accuracy by reducing classifier confusion [1].

The approach to lexicon reduction described in this paper is inspired by the approach for online cursive words taken by Seni et al. [2]. A cursive word may be characterized as a sequence of alternating upstrokes and downstrokes. It has been suggested that downstrokes are more important than upstrokes because they are usually part of a letter, whereas upstrokes are often ligatures used to connect letters [3]. A descriptor of word shape may be built from a coarse characterization of the shapes of downstrokes.

While the stroke sequence may be readily extracted from online cursive script, extracting the same from offline script words is complex and computationally expensive. In this effort, we are concerned with coarse features of downstrokes, rather than the exact trace of downstrokes. We therefore adopt a heuristic strategy that detects downstrokes by identifying spatial configurations of local contour extrema.

Seni et al. classify the extracted strokes into a small set of “hard” stroke classes such as medium, ascender, descender, retrograde, and unknown strokes in order to compose a string descriptor, and match the descriptor with the lexicon entries using production rules. In this paper, an alternative “soft” representation of downstrokes is proposed. Each stroke is represented by its normalized extensions into the upper and lower zones of the word. The stroke sequence extracted from the image is matched with the ideal strokes of lexicon words by an elastic matching scheme.

Given a set of n word images and corresponding lexicons, let us denote the lexicon corresponding to image xi by Li. A lexicon reduction algorithm takes as input xi and Li and determines a reduced lexicon Qi⊆Li. We denote the event that the truth ti is contained in the reduced lexicon by a random variable A, defined as A=1 if ti∈Qi, and A=0 otherwise. The extent of reduction is captured by a random variable R, defined as R=(|Li|−|Qi|)/|Li|.

Three measures of lexicon reduction performance are defined:

  • Accuracy of reduction: α=E(A),

  • Degree of reduction: ρ=E(R), and

  • Reduction efficacy: η=α^k·ρ.

Note that α,ρ,η∈[0,1]. The accuracy and degree of reduction are usually related inversely to each other. The accuracy α can often be made arbitrarily close to unity at the expense of ρ. The two measures are combined into one overall measure η. The emphasis placed on accuracy relative to the degree of reduction is expressed as a constant k, which in turn may be determined by the particular application.
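As a concrete illustration, the three measures can be estimated over a test set as sample means. The following sketch is not from the paper; the function name, list-based inputs, and the default k=2 are illustrative assumptions.

```python
# Illustrative sketch (not from the paper): estimate accuracy of reduction,
# degree of reduction, and reduction efficacy as sample means over a test set.

def reduction_metrics(lexicons, reduced, truths, k=2.0):
    """lexicons : full lexicons L_i (lists of words)
    reduced  : reduced lexicons Q_i (subsets of the corresponding L_i)
    truths   : truth words t_i
    k        : weight on accuracy relative to degree of reduction (assumed value)
    """
    n = len(lexicons)
    A = [1.0 if t in Q else 0.0 for Q, t in zip(reduced, truths)]        # A_i
    R = [(len(L) - len(Q)) / len(L) for L, Q in zip(lexicons, reduced)]  # R_i
    alpha = sum(A) / n           # accuracy of reduction, E(A)
    rho = sum(R) / n             # degree of reduction, E(R)
    eta = (alpha ** k) * rho     # reduction efficacy, alpha^k * rho
    return alpha, rho, eta
```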

Section snippets

Extraction of downstrokes from offline cursive script

The extraction of temporal information (stroke sequence or trace of stylus) from offline script is an interesting area for research and methods have been proposed for both binary and gray-scale images [4], [5]. However, the analysis of the offline word necessary to reconstruct the stylus trace is complex and computationally expensive.

Given that we are only concerned with coarse features of downstrokes, rather than their exact trace, a computationally efficient heuristic approach may be adopted.
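Purely as a hedged illustration (the paper's actual spatial-configuration rules are not reproduced in this snippet), a detector of this flavour might pair each upper-contour extremum with the nearest lower-contour extremum to its right and take the pair as the endpoints of a candidate downstroke.

```python
# Illustration only: this is not the authors' heuristic, merely one simple way
# of pairing contour extrema into candidate downstrokes. Coordinates are image
# coordinates with y increasing downward.

def detect_downstrokes(upper_extrema, lower_extrema):
    """upper_extrema : (x, y) points where the upper contour peaks (stroke tops)
    lower_extrema : (x, y) points where the lower contour dips (stroke bottoms)
    Returns a left-to-right list of (top, bottom) endpoint pairs."""
    strokes = []
    for ux, uy in sorted(upper_extrema):
        # candidate bottoms at or to the right of the top extremum
        right = [(lx, ly) for lx, ly in lower_extrema if lx >= ux]
        if right:
            bottom = min(right, key=lambda p: p[0] - ux)  # horizontally nearest
            strokes.append(((ux, uy), bottom))
    return strokes
```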

Hard stroke classes and syntactic matching

In the lexicon reduction strategy for online cursive words developed by Seni et al., downward pen strokes extracted from the online trace of the word are classified into a small number of “hard” stroke classes (ascender, descender, medium, retrograde, and unknown) in order to compose shape descriptors. The lexicon is organized as a trie [7], and the descriptor extracted from the image is used to syntactically match lexicon entries using a set of production rules which encode valid

Generalized strokes

Given a cursive word (offline or online), let us assume that the reference lines have been detected and are of the form y=mx+c, where m is the baseline skew of the word. Each downstroke in the cursive word may be represented coarsely by the Cartesian coordinates of its endpoints. The endpoints of downstrokes are easily extracted from online data. In the offline case, where the stroke has nonzero thickness, the limiting contour extrema may be used to approximate the endpoints.

Let us assume that
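A minimal sketch of this “soft” representation follows. It assumes parallel reference lines (halfline and baseline) of the form y=mx+c with y increasing downward, and normalizes the extensions by the height of the middle zone; the normalization constant and the names used are assumptions, not taken from the snippet above.

```python
# Hedged sketch of the soft stroke representation: each downstroke becomes a
# pair (u, l) of normalized extensions into the upper and lower zones.
# Normalizing by the middle-zone height is an assumption.

def soft_stroke(top, bottom, halfline, baseline):
    """top, bottom : (x, y) endpoints of the downstroke (y increases downward)
    halfline, baseline : (m, c) pairs defining reference lines y = m*x + c"""
    (tx, ty), (bx, by) = top, bottom
    y_half = halfline[0] * tx + halfline[1]      # halfline height at the top endpoint
    y_base = baseline[0] * bx + baseline[1]      # baseline height at the bottom endpoint
    zone = max(baseline[1] - halfline[1], 1e-6)  # middle-zone height (parallel lines)
    u = max(0.0, (y_half - ty) / zone)           # extension above the halfline
    l = max(0.0, (by - y_base) / zone)           # extension below the baseline
    return (u, l)
```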

Distance computation

Elastic matching is used to compute the distance between the descriptor extracted from the image and the “ideal” descriptor corresponding to a given ASCII string.
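The details of the distance are not given in this snippet; a standard dynamic-programming formulation of elastic matching over two sequences of soft strokes, with an assumed per-stroke substitution cost and gap penalty, would look roughly as follows.

```python
# Sketch of elastic matching between an image descriptor and an "ideal"
# lexicon descriptor, both sequences of (u, l) pairs. The L1 substitution
# cost and the constant gap penalty are assumptions.

def stroke_cost(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def elastic_distance(image_desc, lexicon_desc, gap=1.0):
    m, n = len(image_desc), len(lexicon_desc)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap                      # unmatched image strokes
    for j in range(1, n + 1):
        D[0][j] = j * gap                      # unmatched lexicon strokes
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + stroke_cost(image_desc[i - 1], lexicon_desc[j - 1]),
                D[i - 1][j] + gap,
                D[i][j - 1] + gap,
            )
    return D[m][n]
```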

Trie implementation

Elastic matching implemented naïvely is O(mn) in complexity, where m and n are the lengths of the image and lexicon descriptors. The computational cost of matching the image descriptor with each of the lexicon descriptors sequentially is O(mns), where s is the size of the lexicon and n the mean length of a lexicon descriptor.

  1. Trie of stroke classes: Let I be the descriptor extracted from the image. Let P=X.Y and Q=X.Z be two lexicon descriptors having the common prefix string X of length |X|=l
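The point of the trie organization is that descriptors sharing a prefix also share the corresponding portion of the dynamic-programming computation. The sketch below illustrates this prefix sharing; the node structure and the reuse of one DP column per trie edge follow the standard trie/edit-distance construction and are not necessarily the authors' exact trie_match( ) procedure.

```python
# Sketch of prefix sharing over a trie of stroke classes: each trie edge
# carries one "ideal" stroke, and one DP column is computed per edge and
# reused by every lexicon entry below it. Illustrative only.

class TrieNode:
    def __init__(self):
        self.children = {}   # ideal stroke (hashable, e.g. a quantized (u, l) pair) -> TrieNode
        self.words = []      # lexicon entries whose descriptor ends here

def trie_insert(root, descriptor, word):
    node = root
    for s in descriptor:
        node = node.children.setdefault(s, TrieNode())
    node.words.append(word)

def trie_match(root, image_desc, gap=1.0):
    def cost(a, b):                          # assumed per-stroke cost
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    m = len(image_desc)
    scores = {}

    def dfs(node, col):
        # col[i] = elastic distance between the first i image strokes and the
        # stroke sequence on the path from the root to this node
        for word in node.words:
            scores[word] = col[m]
        for s, child in node.children.items():
            nxt = [col[0] + gap]
            for i in range(1, m + 1):
                nxt.append(min(
                    col[i - 1] + cost(image_desc[i - 1], s),  # match stroke s
                    nxt[i - 1] + gap,                         # unmatched image stroke
                    col[i] + gap,                             # unmatched lexicon stroke
                ))
            dfs(child, nxt)

    dfs(root, [i * gap for i in range(m + 1)])
    return scores
```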

Pruning strategy

The trie_match( ) procedure computes for each lexicon entry a distance score. The pruning strategy we have used for evaluation of reduction performance is essentially the same as that described in the context of reduction using perceptual features, and is as follows:

Given a scored and ranked lexicon, and a rank threshold Tr, the bounds on trie levels searched are employed implicitly and scores are computed for the lexicon words in the permissible trie-levels. These words are ordered by score,
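In code, this final ranking step amounts to keeping the Tr best-scoring entries as the reduced lexicon. A minimal sketch follows; the function name is illustrative, and the scores are assumed to come from a trie_match( )-style procedure.

```python
# Minimal sketch of rank-threshold pruning: retain the T_r best-scoring
# lexicon entries as the reduced lexicon Q_i. Illustrative only.

def prune_lexicon(scores, rank_threshold):
    """scores : dict mapping lexicon word -> elastic-matching distance
    rank_threshold : T_r, the number of top-ranked entries to keep"""
    ranked = sorted(scores, key=scores.get)   # smallest distance first
    return ranked[:rank_threshold]
```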

Empirical evaluation

Tests were conducted using a lexicon of 23,665 words on a set of 760 real cursive word images (Fig. 9). Images were cursive but could have up to one break. The α obtained was 75.504% at 10,000. Table 1 shows the results at various values of α.

The image descriptor is matched with the ideal descriptors of a lexicon using the trie_match( ) procedure described earlier. The search is restricted to two trie levels above and below the extracted length of the image. Thus scores are computed for only a

Summary

Most of the information in cursive script words is widely held to be carried by the downstrokes, with the upstrokes serving mainly to connect characters. Extraction of downstrokes from offline cursive script is difficult without complex analysis. However, since our interest is in coarse features of the strokes, the system described in this paper uses an efficient heuristic procedure that detects downstrokes from spatial configurations of local contour extrema.


References (7)

  • S. Madhvanath, V. Govindaraju, Serial classifier combination for handwritten word recognition, Proceedings of the Third...
  • G. Seni, N. Nasrabadi, R. Srihari, An online cursive word recognition system, Proceedings of the IEEE CVPR-94, Seattle,...
  • E.R. Brocklehurst, P.D. Kenward, Preprocessing for cursive script recognition, NPL Report, November...

Cited by (25)

  • Arabic word descriptor for handwritten word indexing and lexicon reduction

    2014, Pattern Recognition
    Citation Excerpt:

    The most common feature extracted from a word image is a sequence of ascenders and descenders [11]. The sequence is matched against features extracted from synthetic images of words in the lexicon, using regular expressions [12] or the string edit distance [13]. Then, lexicon reduction is performed by discarding the unmatched lexicon entries.

  • Effect of delayed strokes on the recognition of online Farsi handwriting

    2013, Pattern Recognition Letters
    Citation Excerpt:

    In (Kaufmann et al., 1997), the lexicon reduction was directly based on the feature vectors used as the input for HMMs. Normally, presence of some topological features such as t-crossing, i-dots (Carbonnel and Anquetil, 2003), ascenders and descenders (Madhvanath et al., 2001), have been exploited to restrict the number of candidates in a Latin lexicon. Many Farsi characters are characterized by the number of dots and small signs.

  • W-TSV: Weighted topological signature vector for lexicon reduction in handwritten Arabic documents

    2012, Pattern Recognition
    Citation Excerpt:

    Several other holistic approaches for lexicon reduction extract a string-based descriptor for each shape, which is further matched using dynamic programming, the lexicon entries with the smallest edit distances being considered part of the reduced lexicon. Madhvanath et al. [20] holistic approach is based on using downward pen-strokes descriptors. These pen strokes are extracted from the word shape using a set of heuristic rules, and categorized according to their positions relative to the baseline.

  • Two-stage lexicon reduction for offline arabic handwritten word recognition

    2008, International Journal of Pattern Recognition and Artificial Intelligence
  • Recent advances of ML and DL approaches for Arabic handwriting recognition: A review

    2023, International Journal of Hybrid Intelligent Systems

About the Author—SRIGANESH MADHVANATH received his B.Tech. in Computer Science from the Indian Institute of Technology, Bombay, India in 1990. He obtained his M.S. in Computer Science in 1993 and Ph.D. in 1997 from the State University of New York at Buffalo. He worked as a Research Assistant at the Center of Excellence for Document Analysis and Recognition (CEDAR) from 1991 to 1996, and is presently with the Document Analysis and Recognition group at IBM Almaden Research Center, San Jose, USA. His research interests include Document Analysis and Understanding, Handwritten Word Recognition, and the specification and effective use of context for document understanding systems.

About the Author—V. KRPASUNDAR received his B.Tech. in Computer Science from the University of Kerala (Thiruvananthapuram, India) in 1988. He was awarded a Woodburn fellowship in 1990 to attend the State University of New York at Buffalo. He obtained his M.S. in Computer Science in 1993 and Ph.D. in 1998. His dissertation research is on the use of linguistic information to resolve ambiguities in document understanding applications. He is currently employed with Viewlogic Systems, Inc.

About the Author—VENU GOVINDARAJU received his Ph.D. in Computer Science from the State University of New York at Buffalo in 1992. He has coauthored more than 90 technical papers in international journals and conferences and has one US patent. He is currently the associate director of CEDAR and concurrently holds a research associate professorship in the Department of Computer Science and Engineering, State University of New York at Buffalo. Dr. Govindaraju has been a co-principal investigator on several federally sponsored and industry-sponsored projects. He is a senior member of the IEEE.
