Elsevier

Pattern Recognition

Volume 45, Issue 9, September 2012, Pages 3277-3287

W-TSV: Weighted topological signature vector for lexicon reduction in handwritten Arabic documents

https://doi.org/10.1016/j.patcog.2012.02.030

Abstract

This paper proposes a holistic lexicon-reduction method for ancient and modern handwritten Arabic documents. The word shape is represented by the weighted topological signature vector (W-TSV), which encodes graph data into a low-dimensional vector space. Three directed acyclic graph (DAG) representations are proposed for Arabic word shapes, based on topological and geometrical features. Lexicon reduction is achieved by a nearest-neighbor search in the W-TSV space. The proposed framework has been tested on the IFN/ENIT and the Ibn Sina databases, achieving degrees of reduction of 83.5% and 92.9%, respectively, at an accuracy of reduction of 90%.

Highlights

► A shape-based approach for lexicon reduction of Arabic documents is introduced.
► The topological signature vector formulation is extended to weighted graphs.
► Three graphical representations for Arabic word shape are proposed.
► Experiments were performed on the IFN/ENIT and Ibn Sina databases.
► Topological and geometrical information of word shapes improves the performance.

Introduction

Handwritten word recognition systems have improved in a number of ways in recent decades, across many applications, from the recognition of the legal amount on bank checks and of postal addresses [1], [2], [3], [4], [5] to the automated transcription of ancient documents [6], [7], [8], [9], [10]. While the vocabulary for a bank check application is small (fewer than 30 words), it is large for postal applications (1000 words) and unconstrained for historical documents (several thousand words). A vocabulary of valid words that are expected to be recognized by the system is called a lexicon [11]. A large lexicon generates a high computational complexity, as all the word hypotheses must be tested, and recognition performance decreases as the number of allowed hypotheses grows. To address this problem, lexicon-reduction methods are used. When a query word shape is submitted for recognition, the lexicon is pruned by keeping only the shapes that are most likely to correspond to the query word class [12], or by using application-dependent knowledge [13]. Then, the recognition system considers the word hypotheses remaining in the pruned lexicon. The performance of a lexicon-reduction method is classically evaluated based on its accuracy of reduction α (the probability that the query word class was included in the pruned lexicon), the degree of reduction ρ (the decrease in the size of the lexicon after pruning), and the reduction efficacy η, which is a combination of the two previous criteria. Computational complexity is also a major factor in lexicon reduction, as one of its goals is to speed up the recognition process. In this paper, we propose a lexicon-reduction method for handwritten Arabic documents, both ancient and modern.
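The two basic criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the degree of reduction is assumed here to be the average relative decrease in lexicon size, which is one common reading of "the decrease in the size of the lexicon after pruning". The combination rule for the reduction efficacy η is not given in this text, so it is left out.

```python
from typing import Sequence, Set


def accuracy_of_reduction(true_labels: Sequence[str],
                          pruned: Sequence[Set[str]]) -> float:
    """Alpha: fraction of queries whose true class survives the pruning."""
    hits = sum(1 for label, kept in zip(true_labels, pruned) if label in kept)
    return hits / len(true_labels)


def degree_of_reduction(lexicon_size: int,
                        pruned: Sequence[Set[str]]) -> float:
    """Rho (assumed form): average relative decrease in lexicon size."""
    mean_kept = sum(len(kept) for kept in pruned) / len(pruned)
    return 1.0 - mean_kept / lexicon_size


# Toy data: three queries against a 10-word lexicon (made-up labels).
true_labels = ["a", "b", "c"]
pruned_lexicons = [{"a", "x"}, {"y", "z"}, {"c"}]
alpha = accuracy_of_reduction(true_labels, pruned_lexicons)
rho = degree_of_reduction(10, pruned_lexicons)
```

In the toy data, the true class survives for two of the three queries, and the pruned lexicons keep 5/3 words on average out of 10.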

The Arabic language has an alphabet of 28 letters. The script is cursive and written from right to left. One important feature of Arabic letters is that their shapes are context-dependent: the shape of a letter is usually determined by its position in a word, i.e. initial, medial, or final. The letters have no case, and many share the same base shape. Such letters are distinguished by diacritical marks, namely one, two, or three dots placed above or below the base shape. If we ignore the dots, we obtain the archigraphemes (Fig. 1), where a single grapheme (letter shape) can represent many letters. Four archigrapheme letter shapes (‘A’, ‘D’, ‘R’, ‘W’) can be connected only if they are in the final position. If they appear in the middle of a word, the word is divided into subwords, also known as pieces of Arabic word (PAW).
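The division of a word into PAWs can be illustrated on text input with a minimal Python sketch. It assumes that the four archigraphemes ‘A’, ‘D’, ‘R’, ‘W’ correspond to the six basic letters alef, dal, thal, ra, zay, and waw (hamza-carrying variants are ignored), and the function name is hypothetical. The paper itself works on word images, not on text, so this only mirrors the orthographic rule.

```python
from typing import List

# Letters that never join to the following letter (assumed mapping of the
# archigraphemes 'A', 'D', 'R', 'W' to their basic dotted/undotted letters).
NON_CONNECTING = {
    "\u0627",  # alef
    "\u062f",  # dal
    "\u0630",  # thal
    "\u0631",  # ra
    "\u0632",  # zay
    "\u0648",  # waw
}


def split_into_paws(word: str) -> List[str]:
    """Split a word (in logical letter order) into pieces of Arabic word."""
    paws: List[str] = []
    current = ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTING:
            # A non-connecting letter closes the current subword.
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws
```

For example, the four-letter word kaf, ta, alef, ba splits into two PAWs: the first three letters, closed by the alef, and the final ba on its own.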

The goal of this paper is to provide a lexicon-reduction strategy for Arabic documents, based on the structure of Arabic subword shapes, which is described by their topology and geometry. First, the topological and geometrical properties of the subword shapes are extracted from the shape skeleton. Then these properties are encoded in a directed acyclic graph (DAG) in order to preserve information about their relationship in the skeleton. Finally, the subword DAG is transformed into a vector using the weighted topological signature vector (W-TSV), which is an extension of the TSV [14] for weighted DAGs. Like the classical TSV, the W-TSV is a powerful tool for encoding structured data, such as a DAG, mapping the DAG to a low-dimensional vector space for fast matching. Also, it has good discriminatory power for DAGs with different topologies, because it preserves their topological properties to some extent. Unlike the TSV, the W-TSV can also discriminate between DAGs sharing the same topology, but with different weights, and it is more robust to topological perturbation than the TSV under small weight perturbation. In this work, lexicon reduction is performed by pruning the reference database of subword/word shapes. This is achieved by selecting the i nearest shapes in the database to a query shape in the W-TSV space. First, the database is indexed by ordering its shapes in ascending order, based on their distance from the query shape; next, the lexicon is reduced by selecting the first i elements of the indexed lexicon as candidates. The value of i is evaluated during a training phase in order to reach the accuracy of reduction level selected for the application. The same i value is then applied for all the query shapes during the lexicon reduction process. From the reduced database of shapes, it is then possible to build a reduced lexicon of subwords/words from the labels of the selected shapes (Fig. 2).
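The indexing and pruning steps described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the Euclidean distance, the function names, and the toy vectors are all assumptions, and real inputs would be W-TSV vectors computed from subword DAGs.

```python
import numpy as np


def rank_database(query, db_vectors):
    """Indices of database shapes sorted by ascending distance to the query."""
    dists = np.linalg.norm(db_vectors - query, axis=1)  # Euclidean (assumed)
    return np.argsort(dists, kind="stable")


def reduce_lexicon(query, db_vectors, db_labels, i):
    """Labels of the i nearest database shapes form the reduced lexicon."""
    nearest = rank_database(query, db_vectors)[:i]
    return {db_labels[j] for j in nearest}


def choose_i(train_queries, train_labels, db_vectors, db_labels, target):
    """Smallest i whose accuracy of reduction on training queries >= target."""
    ranks = []
    for q, label in zip(train_queries, train_labels):
        order = rank_database(q, db_vectors)
        hits = [pos for pos, j in enumerate(order) if db_labels[j] == label]
        # Rank of the first correct shape; beyond the database if absent.
        ranks.append(hits[0] + 1 if hits else len(db_labels) + 1)
    ranks = np.sort(np.array(ranks))
    needed = max(1, int(np.ceil(target * len(ranks))))
    return int(ranks[needed - 1])


# Toy database of four shapes in a 2-D vector space (made-up values).
db_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
db_labels = ["a", "b", "c", "d"]
train_queries = np.array([[0.9, 0.0], [5.2, 5.0]])
train_labels = ["a", "c"]

i = choose_i(train_queries, train_labels, db_vectors, db_labels, target=1.0)
reduced = reduce_lexicon(np.array([0.9, 0.0]), db_vectors, db_labels, i)
```

A single global value of i, fixed on training data, keeps the pruning cost predictable, at the price of retaining more candidates than necessary for easy queries.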

This paper is organized as follows. The features of lexicon reduction for ancient and modern Arabic documents are described in Section 2. Related work on lexicon reduction is reviewed in Section 3. The details of the W-TSV scheme and of the formation of the Arabic subword DAG are provided in Sections 4 and 5, respectively. Finally, the details of our experiments and our results are given in Section 6, followed by the conclusion in Section 7.

This paper is an extension of the work published in [15]. The underlying methodology, as well as the experimental evaluation, has been significantly improved.

Section snippets

Features of ancient and modern Arabic documents for lexicon reduction

The nature of ancient Arabic documents is different from that of the Arabic documents used in modern applications. The study of ancient documents is motivated by their cultural significance, and a vast number of them have been scanned as digital images in order to protect them from aging. Pre-modern Arabic documents were written during the medieval period. They can be written in a variety of calligraphic styles, depending on when and where they were copied. The appearance of a written text

Related works

Lexicon reduction can be performed by comparing the optical shapes of the lexicon words to improve recognition speed. When the word's optical shape is used, the simplest, yet still effective, criterion for lexicon reduction is word length, as this makes it easy to discriminate between long words and short words. More refined knowledge about the word's shape can also be used. Zimmermann et al. [16] propose the concept of key characters, which are characters that can be accurately identified

Background

The classical topological signature vector (TSV) is an efficient encoding of the topology of structured data, such as a directed acyclic graph (DAG). The topology of a given DAG G can be represented by its adjacency matrix A, where A(i,j)=1 if an edge goes from vertex v_i to vertex v_j, A(i,j)=−1 if an edge goes from vertex v_j to vertex v_i, and A(i,j)=0 in all other cases. The adjacency matrix is therefore antisymmetric. From the adjacency matrix, a signature SG for the graph G can be extracted
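As an illustration of the antisymmetric adjacency matrix, the following Python sketch builds A for a small DAG, setting A[i, j] = 1 and A[j, i] = -1 for each edge from vertex i to vertex j. The text above is cut off before the signature is defined, so the second function is only an assumption in the spirit of spectral graph indexing: it sums the k largest eigenvalue magnitudes of A. Function names and the example graph are hypothetical.

```python
import numpy as np


def dag_adjacency(n_vertices, edges):
    """Antisymmetric adjacency matrix of a DAG given as (source, target) pairs."""
    A = np.zeros((n_vertices, n_vertices))
    for i, j in edges:
        A[i, j] = 1.0   # edge goes from v_i to v_j
        A[j, i] = -1.0  # same edge seen from v_j
    return A


def eigen_magnitude_sum(A, k):
    """Sum of the k largest eigenvalue magnitudes of A (assumed signature)."""
    magnitudes = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    return float(magnitudes[:k].sum())


# A root vertex 0 with two children, 1 and 2.
A = dag_adjacency(3, [(0, 1), (0, 2)])
signature = eigen_magnitude_sum(A, k=2)
```

Because A is real and antisymmetric, its eigenvalues are purely imaginary and come in conjugate pairs, which is why magnitudes rather than raw eigenvalues are used here. For this example the nonzero eigenvalues are ±i√2, so the sum of the two largest magnitudes is 2√2.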

Proposed Arabic subword graph representation

In this section, our holistic method for encoding the structure of Arabic subword shapes into a DAG is presented. We chose the DAG representation because it is more expressive than the vector representation, thanks to the relational information it contains. The saliency of an Arabic subword derives from its topology and its geometry, which are highlighted by the shape skeleton. Therefore, relevant pieces of information are extracted from the shape skeleton, giving rise to three DAG

Databases

We evaluated this approach on the Ibn Sina database [33] for ancient Arabic documents and the IFN/ENIT database [34] for modern Arabic documents. The Ibn Sina database is based on a commentary on an important philosophical work by the famous Persian scholar Ibn Sina. This database consists of 60 pages and approximately 25,000 Arabic subword shapes written in the Naskh style (Fig. 3b). The document images were binarized with a dedicated algorithm [35] to preserve the shape's topology. Each page

Conclusion

In this paper, we proposed the W-TSV representation, a generalization of the TSV for weighted DAG indexing. The stability of the W-TSV and its robustness to small weight perturbations have been studied. The W-TSV has been applied to holistic lexicon reduction of handwritten Arabic words/subwords. The topology and the geometry of the word/subword shape are first converted into a DAG and then transformed into a low-dimensional vector using the W-TSV representation. Three different DAG representations

Acknowledgments

The authors thank the NSERC and SSHRC of Canada for their financial support.

Youssouf Chherawala received his M.Sc. degree in Electrical Engineering from the École de Technologie Supérieure (University of Québec) in 2007. In 2009, he joined the Synchromedia Laboratory for Multimedia Communication in Telepresence, where he is pursuing his Ph.D. research under the supervision of Professor Mohamed Cheriet. His research interests include Pattern Recognition, Shape Analysis and Handwriting Recognition.

References (39)

  • V. Lavrenko, T. Rath, R. Manmatha, Holistic word recognition for handwritten historical documents, in: Proceedings of...
  • S. Feng, R. Manmatha, A. McCallum, Exploring the use of conditional random field models and HMMs for historical...
  • G. Vamvakas, B. Gatos, N. Stamatopoulos, S. Perantonis, A complete optical character recognition methodology for...
  • M. Wuthrich, M. Liwicki, A. Fischer, E. Indermuhle, H. Bunke, G. Viehhauser, M. Stolz, Language model integration for...
  • A. Fischer, K. Riesen, H. Bunke, Graph similarity features for HMM-based handwriting recognition in historical...
  • A.L. Koerich et al., Large vocabulary off-line handwriting recognition: a survey, Pattern Analysis & Applications (2003)
  • A.L. Koerich et al., Recognition and verification of unconstrained handwritten words, IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)
  • C. Tomai, B. Zhang, V. Govindaraju, Transcript mapping for historic handwritten document images, in: Proceedings of the...
  • A. Shokoufandeh et al., Indexing hierarchical structures using graph spectra, IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)

    Mohamed Cheriet was born in Algiers (Algeria) in 1960. He received his B.Eng. from USTHB University (Algiers) in 1984 and his M.Sc. and Ph.D. degrees in Computer Science from the University of Pierre et Marie Curie (Paris VI) in 1985 and 1988 respectively. Since 1992, he has been a professor in the Automation Engineering department at the École de Technologie Supérieure (University of Quebec), Montreal, and was appointed full professor there in 1998. He co-founded the Laboratory for Imagery, Vision and Artificial Intelligence (LIVIA) at the University of Quebec, and was its director from 2000 to 2006. He also founded the SYNCHROMEDIA Consortium (Multimedia Communication in Telepresence) there, and has been its director since 1998. His interests include document image analysis, OCR, mathematical models for image processing, pattern classification models and learning algorithms, as well as perception in computer vision. Dr. Cheriet has published more than 250 technical papers in the field, and has served as chair or co-chair of the following international conferences: VI'1998, VI'2000, IWFHR'2002, and ICFHR'2008. He currently serves on the editorial board and is associate editor of several international journals: IJPRAI, IJDAR, and Pattern Recognition. He co-authored a book entitled, “Character Recognition Systems: A guide for Students and Practitioners,” John Wiley and Sons, Spring 2007. Dr. Cheriet is a senior member of the IEEE and the chapter chair of IEEE Montreal Computational Intelligent Systems (CIS).

    1 Tel.: +1 514 3968972; fax: +1 514 3968595.