W-TSV: Weighted topological signature vector for lexicon reduction in handwritten Arabic documents

doi:10.1016/j.patcog.2012.02.030

Pattern Recognition

Volume 45, Issue 9, September 2012, Pages 3277-3287

Abstract

This paper proposes a holistic lexicon-reduction method for ancient and modern handwritten Arabic documents. The word shape is represented by the weighted topological signature vector (W-TSV), which encodes graph data into a low-dimensional vector space. Three directed acyclic graph (DAG) representations are proposed for Arabic word shapes, based on topological and geometrical features. Lexicon reduction is achieved by a nearest neighbors search in the W-TSV space. The proposed framework has been tested on the IFN/ENIT and the Ibn Sina databases, achieving respectively a degree of reduction of 83.5% and 92.9% for an accuracy of reduction of 90%.

Highlights

► A shape-based approach for lexicon reduction of Arabic documents is introduced.
► The topological signature vector formulation is extended to weighted graphs.
► Three graphical representations for Arabic word shape are proposed.
► Experiments were performed on the IFN/ENIT and Ibn Sina databases.
► Topological and geometrical information of word shapes improves the performance.

Introduction

Handwritten word recognition systems have improved in a number of ways in recent decades, across many applications, from the recognition of the legal amount on bank checks and of postal addresses [1], [2], [3], [4], [5] to the automated transcription of ancient documents [6], [7], [8], [9], [10]. While the vocabulary for a bank check application is small (fewer than 30 words), it is large for postal applications (1000 words) and unconstrained for historical documents (several thousand words). A vocabulary of valid words that are expected to be recognized by the system is called a lexicon [11]. A large lexicon generates a high computational complexity, as all the word hypotheses must be tested, and recognition performance decreases as the number of allowed hypotheses grows. To address this problem, lexicon-reduction methods are used. When a query word shape is submitted for recognition, the lexicon is pruned by keeping only the shapes that are most likely to correspond to the query word class [12], or by using application-dependent knowledge [13]. Then, the recognition system considers the word hypotheses remaining in the pruned lexicon. The performance of a lexicon-reduction method is classically evaluated based on its accuracy of reduction $α$ (the probability that the query word class was included in the pruned lexicon), the degree of reduction $ρ$ (the decrease in the size of the lexicon after pruning), and the reduction efficacy $η$ , which is a combination of the two previous criteria. Computational complexity is also a major factor in lexicon reduction, as one of its goals is to speed up the recognition process. In this paper, we propose a lexicon-reduction method for handwritten Arabic documents, both ancient and modern.
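To make the first two evaluation criteria concrete, the following minimal sketch estimates them on toy data. It assumes that the degree of reduction is measured as one minus the ratio of the mean pruned-lexicon size to the full lexicon size, which is a common convention but is not spelled out above. The function names and the labels `w1` to `w5` are hypothetical. The reduction efficacy is left out because its exact combination rule is not given in this text.

```python
from typing import Sequence, Set


def accuracy_of_reduction(true_labels: Sequence[str],
                          pruned: Sequence[Set[str]]) -> float:
    """Empirical alpha: fraction of queries whose true class survived pruning."""
    hits = sum(1 for label, kept in zip(true_labels, pruned) if label in kept)
    return hits / len(true_labels)


def degree_of_reduction(lexicon_size: int,
                        pruned: Sequence[Set[str]]) -> float:
    """Empirical rho, ASSUMED here to be 1 - (mean pruned size / full size)."""
    mean_kept = sum(len(kept) for kept in pruned) / len(pruned)
    return 1.0 - mean_kept / lexicon_size


# Toy example (hypothetical labels): four queries against a 10-word lexicon.
true_labels = ["w1", "w2", "w3", "w4"]
pruned = [{"w1", "w5"}, {"w2"}, {"w4", "w5"}, {"w4"}]

alpha = accuracy_of_reduction(true_labels, pruned)   # 3 of 4 queries are kept
rho = degree_of_reduction(10, pruned)                # mean pruned size is 1.5
```

The two criteria pull in opposite directions: keeping more candidates raises the accuracy of reduction but lowers the degree of reduction, which is why a combined measure is also reported.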

The Arabic language has an alphabet of 28 letters. The script is cursive and written from right to left. One important feature of Arabic letters is that their shapes are context-dependent, which means that a letter shape is usually determined by its position in a word, i.e. initial, medial or final. The letters have no cases and many share the same base shape. They are distinguishable by the addition of diacritical marks. The diacritics used in Arabic for this purpose are dots, one, two, or three of them appearing below or above the base shape. If we ignore the dots, we obtain the archigraphemes (Fig. 1), where a single grapheme (letter shape) can represent many letters. Four archigrapheme letter shapes (‘A’, ‘D’, ‘R’, ‘W’) can be connected only if they are in the final position. If they appear in the middle of a word, the word is divided into subwords, also known as pieces of Arabic word (PAW).

The goal of this paper is to provide a lexicon-reduction strategy for Arabic documents, based on the structure of Arabic subword shapes, which is described by their topology and geometry. First, the topological and geometrical properties of the subword shapes are extracted from the shape skeleton. Then these properties are encoded in a directed acyclic graph (DAG) in order to preserve information about their relationship in the skeleton. Finally, the subword DAG is transformed into a vector using the weighted topological signature vector (W-TSV), which is an extension of the TSV [14] for weighted DAGs. Like the classical TSV, the W-TSV is a powerful tool for encoding structured data, such as a DAG, mapping the DAG to a low-dimensional vector space for fast matching. Also, it has good discriminatory power for DAGs with different topologies, because it preserves their topological properties to some extent. Unlike the TSV, the W-TSV can also discriminate between DAGs sharing the same topology, but with different weights, and it is more robust to topological perturbation than the TSV under small weight perturbation. In this work, lexicon reduction is performed by pruning the reference database of subword/word shapes. This is achieved by selecting the i nearest shapes in the database to a query shape in the W-TSV space. First, the database is indexed by ordering its shapes in ascending order, based on their distance from the query shape; next, the lexicon is reduced by selecting the first i elements of the indexed lexicon as candidates. The value of i is evaluated during a training phase in order to reach the accuracy of reduction level selected for the application. The same i value is then applied for all the query shapes during the lexicon reduction process. From the reduced database of shapes, it is then possible to build a reduced lexicon of subwords/words from the labels of the selected shapes (Fig. 2).
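The indexing and pruning steps described above can be sketched as follows. This is only an illustration under stated assumptions: the shapes are taken to be already mapped to fixed-length vectors, the Euclidean distance is assumed (the metric used in the W-TSV space is not specified here), and the helper names `rank_database`, `reduced_lexicon` and `choose_i` are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple

Vector = Sequence[float]
Entry = Tuple[Vector, str]  # (vector of a shape, its subword/word label)


def rank_database(query: Vector, database: Sequence[Entry]) -> List[str]:
    """Index the database: labels sorted by ascending distance to the query.

    Euclidean distance is an ASSUMPTION made for this sketch.
    """
    ordered = sorted(database, key=lambda entry: math.dist(query, entry[0]))
    return [label for _, label in ordered]


def reduced_lexicon(query: Vector, database: Sequence[Entry], i: int) -> Set[str]:
    """Labels of the i nearest database shapes."""
    return set(rank_database(query, database)[:i])


def choose_i(train: Sequence[Entry], database: Sequence[Entry],
             target_accuracy: float) -> int:
    """Smallest i whose accuracy of reduction on the training queries
    reaches the target; falls back to the full database size."""
    ranks = []
    for vector, label in train:
        ranked = rank_database(vector, database)
        ranks.append(ranked.index(label) + 1 if label in ranked else math.inf)
    for i in range(1, len(database) + 1):
        if sum(1 for r in ranks if r <= i) / len(ranks) >= target_accuracy:
            return i
    return len(database)


# Toy data: 2-D vectors stand in for the real shape vectors.
database = [((0.0, 0.0), "a"), ((1.0, 0.0), "b"),
            ((0.0, 2.0), "c"), ((3.0, 3.0), "d")]
train = [((0.1, 0.0), "a"), ((0.9, 0.1), "a"),
         ((0.0, 1.9), "c"), ((2.5, 2.5), "d")]

i = choose_i(train, database, target_accuracy=0.9)
candidates = reduced_lexicon((0.9, 0.1), database, i)
```

Because a single value of i is fixed during training, every query yields at most i candidate labels, so the cost of the subsequent recognition stage is bounded in advance.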

This paper is organized as follows. The features of lexicon reduction for ancient and modern Arabic documents are described in Section 2. Related work on lexicon reduction is reviewed in Section 3. The details of the W-TSV scheme and of the formation of the Arabic subword DAG are provided in Sections 4 and 5, respectively. Finally, the details of our experiments and our results are given in Section 6, followed by the conclusion in Section 7.

This paper is an extension of the work published in [15]. The underlying methodology, as well as the experimental evaluation, has been significantly improved.

Section snippets

Features of ancient and modern Arabic documents for lexicon reduction

The nature of ancient Arabic documents is different from that of the Arabic documents used in modern applications. The study of ancient documents is motivated by their cultural significance, and a vast number of them have been scanned as digital images in order to protect them from aging. Pre-modern Arabic documents were written during the medieval period. They can be written in a variety of calligraphic styles, depending on when and where they were copied. The appearance of a written text

Related works

Lexicon reduction can be performed by comparing the optical shapes of the lexicon words to improve recognition speed. When the word's optical shape is used, the simplest criterion for lexicon reduction, which is nevertheless effective, is word length, as it makes it easy to discriminate between long and short words. More refined knowledge about the word's shape can also be used. Zimmermann et al. [16] propose the concept of key characters, which are characters that can be accurately identified

Background

The classical topological signature vector (TSV) is an efficient encoding of the topology of structured data, such as a directed acyclic graph (DAG). The topology of a given DAG $G$ can be represented by its adjacency matrix $A$, where $A(i, j) = 1$ if an edge goes from vertex $v_{i}$ to vertex $v_{j}$, $A(i, j) = -1$ if an edge goes from vertex $v_{j}$ to vertex $v_{i}$, and $A(i, j) = 0$ in all other cases. The adjacency matrix is therefore antisymmetric. From the adjacency matrix, a signature $S_G$ for the graph $G$ can be extracted
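As an illustration of the adjacency matrix defined above, the following minimal sketch builds it for a small DAG and checks that it is antisymmetric. The function names and the example edge list are hypothetical, and the sketch stops at the matrix itself: the extraction of the signature is not reproduced here.

```python
from typing import List, Sequence, Tuple


def dag_adjacency(num_vertices: int,
                  edges: Sequence[Tuple[int, int]]) -> List[List[int]]:
    """Adjacency matrix with A[i][j] = 1 for an edge v_i -> v_j,
    A[j][i] = -1 for the same edge seen from v_j, and 0 elsewhere."""
    A = [[0] * num_vertices for _ in range(num_vertices)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = -1
    return A


def is_antisymmetric(A: List[List[int]]) -> bool:
    """True if A equals minus its transpose (so the diagonal is zero)."""
    n = len(A)
    return all(A[i][j] == -A[j][i] for i in range(n) for j in range(n))


# Toy DAG (hypothetical): v0 -> v1, v0 -> v2, v1 -> v3.
A = dag_adjacency(4, [(0, 1), (0, 2), (1, 3)])
```

A real antisymmetric matrix has purely imaginary eigenvalues, so any spectral quantity derived from it is naturally expressed through eigenvalue magnitudes.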

Proposed Arabic subword graph representation

In this section, our holistic method for encoding the structure of Arabic subword shapes into a DAG is presented. We chose the DAG representation because it is more expressive than the vector representation, thanks to the relational information it contains. The saliency of an Arabic subword derives from its topology and its geometry, which are highlighted by the shape skeleton. Therefore, relevant pieces of information are extracted from the shape skeleton, giving rise to three DAG

Databases

We evaluated this approach on the Ibn Sina database [33] for ancient Arabic documents and the IFN/ENIT database [34] for modern Arabic documents. The Ibn Sina database is based on a commentary on an important philosophical work by the famous Persian scholar Ibn Sina. This database consists of 60 pages and approximately 25,000 Arabic subword shapes written in the Naskh style (Fig. 3b). The document images were binarized with a dedicated algorithm [35] to preserve the shape's topology. Each page

Conclusion

In this paper, we proposed the W-TSV representation, a generalization of the TSV for weighted DAG indexing. The stability of the W-TSV and its robustness to small weight perturbations have been studied. The W-TSV has been applied to holistic lexicon reduction of handwritten Arabic words/subwords. The topology and the geometry of the word/subword shape are first converted into a DAG and then transformed into a low-dimensional vector using the W-TSV representation. Three different DAG representations

Acknowledgments

The authors thank the NSERC and SSHRC of Canada for their financial support.

Youssouf Chherawala received his M.Sc. degree in Electrical Engineering from the École de Technologie Supérieure (University of Québec) in 2007. In 2009, he joined the Synchromedia Laboratory for Multimedia Communication in Telepresence, where he is pursuing his Ph.D. research under the supervision of Professor Mohamed Cheriet. His research interests include pattern recognition, shape analysis and handwriting recognition.

References (39)

S.N. Srihari, Recognition of handwritten and machine-printed text for postal address interpretation, Pattern Recognition Letters (1993)

M. Zimmermann et al., Lexicon reduction using key characters in cursive handwritten words, Pattern Recognition Letters (1999)

S. Madhvanath et al., Syntactic methodology of pruning large lexicons in cursive script recognition, Pattern Recognition (2001)

S. Mozaffari et al., Lexicon reduction using dots for off-line Farsi/Arabic handwritten word recognition, Pattern Recognition Letters (2008)

A. Neumaier, The second largest eigenvalue of a tree, Linear Algebra and its Applications (1982)

R. Farrahi Moghaddam et al., A multi-scale framework for adaptive binarization of degraded document images, Pattern Recognition (2010)

G. Kaufmann et al., Automated reading of cheque amounts, Pattern Analysis & Applications (2000)

K.K. Kim, J.H. Kim, Y.K. Chung, C. Suen, Legal amount recognition based on the segmentation hypotheses for bank check...

C.-L. Liu et al., Lexicon-driven segmentation and recognition of handwritten character strings for Japanese address reading, IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)

R. Al-Hajj Mohamad et al., Combining slanted-frame classifiers for improved HMM-based arabic handwriting recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2009)

V. Lavrenko, T. Rath, R. Manmatha, Holistic word recognition for handwritten historical documents, in: Proceedings of...

S. Feng, R. Manmatha, A. McCallum, Exploring the use of conditional random field models and HMMs for historical...

G. Vamvakas, B. Gatos, N. Stamatopoulos, S. Perantonis, A complete optical character recognition methodology for...

M. Wuthrich, M. Liwicki, A. Fischer, E. Indermuhle, H. Bunke, G. Viehhauser, M. Stolz, Language model integration for...

A. Fischer, K. Riesen, H. Bunke, Graph similarity features for HMM-based handwriting recognition in historical...

A.L. Koerich et al., Large vocabulary off-line handwriting recognition: a survey, Pattern Analysis & Applications (2003)

A.L. Koerich et al., Recognition and verification of unconstrained handwritten words, IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)

C. Tomai, B. Zhang, V. Govindaraju, Transcript mapping for historic handwritten document images, in: Proceedings of the...

A. Shokoufandeh et al., Indexing hierarchical structures using graph spectra, IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)

Mohamed Cheriet was born in Algiers (Algeria) in 1960. He received his B.Eng. from USTHB University (Algiers) in 1984 and his M.Sc. and Ph.D. degrees in Computer Science from the University of Pierre et Marie Curie (Paris VI) in 1985 and 1988 respectively. Since 1992, he has been a professor in the Automation Engineering department at the École de Technologie Supérieure (University of Quebec), Montreal, and was appointed full professor there in 1998. He co-founded the Laboratory for Imagery, Vision and Artificial Intelligence (LIVIA) at the University of Quebec, and was its director from 2000 to 2006. He also founded the SYNCHROMEDIA Consortium (Multimedia Communication in Telepresence) there, and has been its director since 1998. His interests include document image analysis, OCR, mathematical models for image processing, pattern classification models and learning algorithms, as well as perception in computer vision. Dr. Cheriet has published more than 250 technical papers in the field, and has served as chair or co-chair of the following international conferences: VI'1998, VI'2000, IWFHR'2002, and ICFHR'2008. He currently serves on the editorial board and is associate editor of several international journals: IJPRAI, IJDAR, and Pattern Recognition. He co-authored a book entitled, “Character Recognition Systems: A guide for Students and Practitioners,” John Wiley and Sons, Spring 2007. Dr. Cheriet is a senior member of the IEEE and the chapter chair of IEEE Montreal Computational Intelligent Systems (CIS).

¹: Tel.: +1 514 3968972; fax: +1 514 3968595.
