W-TSV: Weighted topological signature vector for lexicon reduction in handwritten Arabic documents
Highlights
► A shape-based approach for lexicon reduction of Arabic documents is introduced.
► The topological signature vector formulation is extended to weighted graphs.
► Three graphical representations for Arabic word shape are proposed.
► Experiments were performed on the IFN/ENIT and Ibn Sina databases.
► Topological and geometrical information of word shapes improves the performance.
Introduction
Handwritten word recognition systems have improved in a number of ways in recent decades, across many applications, from the recognition of legal amounts on bank checks and of postal addresses [1], [2], [3], [4], [5] to the automated transcription of ancient documents [6], [7], [8], [9], [10]. While the vocabulary for a bank check application is small (fewer than 30 words), it is large for postal applications (1000 words) and unconstrained for historical documents (several thousand words). A vocabulary of valid words that the system is expected to recognize is called a lexicon [11]. A large lexicon incurs a high computational cost, as all the word hypotheses must be tested, and recognition performance decreases as the number of allowed hypotheses grows. To address this problem, lexicon-reduction methods are used. When a query word shape is submitted for recognition, the lexicon is pruned by keeping only the shapes that are most likely to correspond to the query word class [12], or by using application-dependent knowledge [13]. The recognition system then considers only the word hypotheses remaining in the pruned lexicon. The performance of a lexicon-reduction method is classically evaluated based on its accuracy of reduction (the probability that the query word class is included in the pruned lexicon), its degree of reduction (the decrease in the size of the lexicon after pruning), and its reduction efficacy, which combines the two previous criteria. Computational complexity is also a major factor in lexicon reduction, as one of its goals is to speed up the recognition process. In this paper, we propose a lexicon-reduction method for handwritten Arabic documents, both ancient and modern.
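As an illustration of these evaluation criteria, the following minimal sketch computes the accuracy and degree of reduction over a set of queries. The function name, its arguments and the toy data are hypothetical. The text only states that the reduction efficacy combines the two other criteria, so the product form used here (accuracy raised to a power k, times the degree of reduction) is an assumption made for illustration and may differ from the definition used in the experiments.

```python
def reduction_metrics(pruned_lexicons, true_labels, lexicon_size, k=1.0):
    """Return (accuracy, degree, efficacy) of a lexicon reduction.

    pruned_lexicons: one set of retained labels per query.
    true_labels:     the true word class of each query.
    lexicon_size:    size of the full lexicon before pruning.
    k:               hypothetical exponent weighting accuracy in the efficacy.
    """
    if len(pruned_lexicons) != len(true_labels) or not true_labels:
        raise ValueError("need one pruned lexicon per query, and at least one query")
    n = len(true_labels)

    # Accuracy of reduction: fraction of queries whose true class survives pruning.
    hits = sum(1 for pruned, label in zip(pruned_lexicons, true_labels)
               if label in pruned)
    accuracy = hits / n

    # Degree of reduction: average relative decrease in lexicon size.
    mean_pruned_size = sum(len(pruned) for pruned in pruned_lexicons) / n
    degree = 1.0 - mean_pruned_size / lexicon_size

    # ASSUMPTION: one possible combination of the two criteria, not taken from the text.
    efficacy = (accuracy ** k) * degree
    return accuracy, degree, efficacy


if __name__ == "__main__":
    pruned = [{"a", "b"}, {"c", "d"}, {"a", "e"}, {"f", "g"}]
    labels = ["a", "c", "b", "f"]
    print(reduction_metrics(pruned, labels, lexicon_size=10))
```

A higher accuracy of reduction is usually obtained at the price of a lower degree of reduction, which is why a single combined score is convenient when comparing methods.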
The Arabic language has an alphabet of 28 letters. The script is cursive and written from right to left. One important feature of Arabic letters is that their shapes are context-dependent, which means that a letter shape is usually determined by its position in a word, i.e. initial, medial or final. The letters have no cases and many share the same base shape. They are distinguished by the addition of diacritical marks. The diacritics used in Arabic for this purpose are dots: one, two, or three of them appear below or above the base shape. If we ignore the dots, we obtain the archigraphemes (Fig. 1), where a single grapheme (letter shape) can represent many letters. Four archigrapheme letter shapes (‘A’, ‘D’, ‘R’, ‘W’) can be connected only if they are in the final position. If they appear in the middle of a word, the word is divided into subwords, also known as pieces of Arabic word (PAW).
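The division of a word into subwords can be sketched as follows. This is a simplified illustration that operates on the word's text rather than on its image: the function name is hypothetical, and the set of non-connecting letters is an assumption limited to the basic letters behind the ‘A’, ‘D’, ‘R’ and ‘W’ archigraphemes (alef, dal, thal, ra, zay, waw), ignoring hamza-bearing variants and other special cases.

```python
# ASSUMPTION: basic letters that do not connect to the following letter,
# i.e. the members of the 'A', 'D', 'R', 'W' archigraphemes.
NON_CONNECTING = {
    "\u0627",  # alef  ('A')
    "\u062f",  # dal   ('D')
    "\u0630",  # thal  ('D' with one dot)
    "\u0631",  # ra    ('R')
    "\u0632",  # zay   ('R' with one dot)
    "\u0648",  # waw   ('W')
}


def split_into_paws(word):
    """Split a word (in logical letter order) into pieces of Arabic word."""
    paws, current = [], ""
    for letter in word:
        current += letter
        # A non-connecting letter always closes the current subword.
        if letter in NON_CONNECTING:
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


if __name__ == "__main__":
    kitab = "\u0643\u062a\u0627\u0628"  # kaf, ta, alef, ba
    dar = "\u062f\u0627\u0631"          # dal, alef, ra
    print(len(split_into_paws(kitab)), len(split_into_paws(dar)))
```

In the first example the alef closes a three-letter subword and the final ba forms a second one; in the second example every letter is non-connecting, so each one is its own subword.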
The goal of this paper is to provide a lexicon-reduction strategy for Arabic documents, based on the structure of Arabic subword shapes, which is described by their topology and geometry. First, the topological and geometrical properties of the subword shapes are extracted from the shape skeleton. Then these properties are encoded in a directed acyclic graph (DAG) in order to preserve information about their relationship in the skeleton. Finally, the subword DAG is transformed into a vector using the weighted topological signature vector (W-TSV), which is an extension of the TSV [14] to weighted DAGs. Like the classical TSV, the W-TSV is a powerful tool for encoding structured data, such as a DAG, mapping the DAG to a low-dimensional vector space for fast matching. It also has good discriminatory power for DAGs with different topologies, because it preserves their topological properties to some extent. Unlike the TSV, the W-TSV can also discriminate between DAGs sharing the same topology but with different weights, and it is more robust to topological perturbation than the TSV under small weight perturbation. In this work, lexicon reduction is performed by pruning the reference database of subword/word shapes. This is achieved by selecting the i database shapes nearest to a query shape in the W-TSV space. First, the database is indexed by sorting its shapes in ascending order of their distance from the query shape; next, the lexicon is reduced by selecting the first i elements of the indexed lexicon as candidates. The value of i is determined during a training phase so as to reach the accuracy of reduction level selected for the application. The same value of i is then applied to all the query shapes during the lexicon reduction process. From the reduced database of shapes, it is then possible to build a reduced lexicon of subwords/words from the labels of the selected shapes (Fig. 2).
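The indexing and pruning steps described above can be sketched as follows, assuming each shape has already been mapped to a fixed-length vector. The function names and the toy data are hypothetical, and the Euclidean distance is an assumption: the text only states that the nearest shapes in the W-TSV space are selected.

```python
import math


def rank_database(query, database):
    """Sort (vector, label) entries by ascending distance from the query."""
    # ASSUMPTION: Euclidean distance in the vector space.
    return sorted(database, key=lambda entry: math.dist(query, entry[0]))


def reduce_lexicon(query, database, i):
    """Labels of the i database shapes nearest to the query."""
    return {label for _, label in rank_database(query, database)[:i]}


def choose_i(training_queries, database, target_accuracy):
    """Smallest i whose accuracy of reduction on the training queries
    reaches target_accuracy (falls back to the full database size)."""
    ranks = []
    for vector, label in training_queries:
        ranked = rank_database(vector, database)
        # Rank (1-based) of the first database shape carrying the true label.
        rank = next((r for r, (_, lab) in enumerate(ranked, start=1)
                     if lab == label), None)
        ranks.append(rank)
    for i in range(1, len(database) + 1):
        covered = sum(1 for r in ranks if r is not None and r <= i)
        if covered / len(ranks) >= target_accuracy:
            return i
    return len(database)


if __name__ == "__main__":
    database = [((0.0, 0.0), "a"), ((1.0, 0.0), "b"),
                ((5.0, 5.0), "c"), ((0.9, 0.1), "b")]
    training = [((0.1, 0.0), "a"), ((0.8, 0.0), "b"), ((0.2, 0.1), "b")]
    i = choose_i(training, database, target_accuracy=1.0)
    print(i, reduce_lexicon((0.1, 0.0), database, i))
```

Sorting the whole database for every query keeps the sketch short; with a large reference database, a partial selection of the i smallest distances would avoid the full sort.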
This paper is organized as follows. The features of lexicon reduction for ancient and modern Arabic documents are described in Section 2. Related work on lexicon reduction is reviewed in Section 3. The details of the W-TSV scheme and of the formation of the Arabic subword DAG are provided in Sections 4 and 5, respectively. Finally, the details of our experiments and our results are given in Section 6, followed by the conclusion in Section 7.
This paper is an extension of the work published in [15]. The underlying methodology, as well as the experimental evaluation, has been significantly improved.
Section snippets
Features of ancient and modern Arabic documents for lexicon reduction
The nature of ancient Arabic documents is different from that of the Arabic documents used in modern applications. The study of ancient documents is motivated by their cultural significance, and a vast number of them have been scanned as digital images in order to protect them from aging. Pre-modern Arabic documents were written during the medieval period. They can be written in a variety of calligraphic styles, depending on when and where they were copied. The appearance of a written text
Related works
Lexicon reduction can be performed by comparing the optical shapes of the lexicon words to improve recognition speed. When the word's optical shape is used, the simplest, yet still effective, criterion for lexicon reduction is word length, as this makes it easy to discriminate between long words and short words. More refined knowledge about the word's shape can also be used. Zimmermann et al. [16] propose the concept of key characters, which are characters that can be accurately identified
Background
The classical topological signature vector (TSV) is an efficient encoding of the topology of structured data, such as a directed acyclic graph (DAG). The topology of a given DAG $G$ can be represented by its adjacency matrix $A$, where $A_{ij} = 1$ if an edge goes from vertex $i$ to vertex $j$, $A_{ij} = -1$ if an edge goes from vertex $j$ to vertex $i$, and $A_{ij} = 0$ in all other cases. The adjacency matrix is therefore antisymmetric. From the adjacency matrix, a signature $S_G$ for the graph $G$ can be extracted
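The antisymmetric adjacency matrix of a DAG can be illustrated with a small sketch. The entry convention used here (+1 for an edge from vertex i to vertex j, -1 for the reverse direction, 0 otherwise) is an assumption inferred from the antisymmetry stated in the text; the function names and the example graph are hypothetical, and the extraction of the signature itself is not covered.

```python
def antisymmetric_adjacency(num_vertices, edges):
    """Build the matrix A of a DAG given as a list of (source, target) edges.

    ASSUMED convention: A[i][j] = 1 and A[j][i] = -1 for an edge i -> j,
    and 0 wherever two vertices are not connected.
    """
    A = [[0] * num_vertices for _ in range(num_vertices)]
    for source, target in edges:
        A[source][target] = 1
        A[target][source] = -1
    return A


def is_antisymmetric(A):
    """Check that the transpose of A equals -A."""
    n = len(A)
    return all(A[i][j] == -A[j][i] for i in range(n) for j in range(n))


if __name__ == "__main__":
    # A small rooted DAG: 0 -> 1, 0 -> 2, 1 -> 3.
    A = antisymmetric_adjacency(4, [(0, 1), (0, 2), (1, 3)])
    for row in A:
        print(row)
    print(is_antisymmetric(A))
```

Because a DAG has no self-loops and no pair of opposite edges, the two assignments made for each edge never conflict and the diagonal stays at zero.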
Proposed Arabic subword graph representation
In this section, our holistic method for encoding the structure of Arabic subword shapes into a DAG is presented. We chose the DAG representation because it is more expressive than the vector representation, thanks to the relational information it contains. The saliency of an Arabic subword derives from its topology and its geometry, which are highlighted by the shape skeleton. Therefore, relevant pieces of information are extracted from the shape skeleton, giving rise to three DAG
Databases
We evaluated this approach on the Ibn Sina database [33] for ancient Arabic documents and the IFN/ENIT database [34] for modern Arabic documents. The Ibn Sina database is based on a commentary on an important philosophical work by the famous Persian scholar Ibn Sina. This database consists of 60 pages and approximately 25,000 Arabic subword shapes written in the Naskh style (Fig. 3b). The document images were binarized with a dedicated algorithm [35] to preserve the shape's topology. Each page
Conclusion
In this paper, we proposed the W-TSV representation, a generalization of the TSV for weighted DAG indexing. The stability of the W-TSV and its robustness to small weight perturbations have been studied. The W-TSV has been applied to the holistic lexicon reduction of handwritten Arabic words/subwords. The topology and the geometry of the word/subword shape are first converted into a DAG and then transformed into a low-dimensional vector using the W-TSV representation. Three different DAG representations
Acknowledgments
The authors thank the NSERC and SSHRC of Canada for their financial support.
Youssouf Chherawala received his M.Sc. degree in Electrical Engineering from the École de Technologie Supérieure (University of Québec) in 2007. In 2009, he joined the Synchromedia Laboratory for Multimedia Communication in Telepresence, where he is pursuing his Ph.D. research under the supervision of Professor Mohamed Cheriet. His research interests include Pattern Recognition, Shape Analysis and Handwriting Recognition.
References (39)
- Recognition of handwritten and machine-printed text for postal address interpretation, Pattern Recognition Letters (1993)
- Lexicon reduction using key characters in cursive handwritten words, Pattern Recognition Letters (1999)
- Syntactic methodology of pruning large lexicons in cursive script recognition, Pattern Recognition (2001)
- Lexicon reduction using dots for off-line Farsi/Arabic handwritten word recognition, Pattern Recognition Letters (2008)
- The second largest eigenvalue of a tree, Linear Algebra and its Applications (1982)
- A multi-scale framework for adaptive binarization of degraded document images, Pattern Recognition (2010)
- Automated reading of cheque amounts, Pattern Analysis & Applications (2000)
- K.K. Kim, J.H. Kim, Y.K. Chung, C. Suen, Legal amount recognition based on the segmentation hypotheses for bank check...
- Lexicon-driven segmentation and recognition of handwritten character strings for Japanese address reading, IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)
- Combining slanted-frame classifiers for improved HMM-based arabic handwriting recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2009)
- Large vocabulary off-line handwriting recognition: a survey, Pattern Analysis & Applications
- Recognition and verification of unconstrained handwritten words, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Indexing hierarchical structures using graph spectra, IEEE Transactions on Pattern Analysis and Machine Intelligence
Cited by (15)
Holistic word descriptor for lexicon reduction in handwritten arabic documents
2021, Pattern Recognition
Citation Excerpt: Wshah et al. [17] introduced a method to improve diacritic categorization by optimize the estimation of diacritics’ positions and types, using convolutional neural network. Advanced shape features such as loops, ascenders, descenders and curves have been also investigated in several methods [3,18,19]. Chherawala and Cheriet [20] proposed an Arabic word descriptor (AWD) for lexicon reduction based on word shape indexing.
Arabic word descriptor for handwritten word indexing and lexicon reduction
2014, Pattern Recognition
Citation Excerpt: The other group of methods considers the subword shape, and is based on the skeleton image. Chherawala and Cheriet [19] propose a spectral method for indexing skeleton shapes, where the skeleton is modeled as a weighted graph using topological and geometrical features. Lexicon reduction is then performed by indexing a reference database of subword shapes and selecting the labels of the top ranked database entries.
An efficient post processing algorithm for online handwriting Gurmukhi character recognition using set theory
2013, International Journal of Pattern Recognition and Artificial Intelligence
Subword Recognition in Historical Arabic Documents using C-GRUs
2021, TEM Journal
Distribution, directional, structural and concavity features for historical Arabic handwritten recognition: A comparative study
2017, ACM International Conference Proceeding Series
Lexicon reduction of handwritten Arabic subwords based on the prominent shape regions
2016, International Journal on Document Analysis and Recognition
Mohamed Cheriet was born in Algiers (Algeria) in 1960. He received his B.Eng. from USTHB University (Algiers) in 1984 and his M.Sc. and Ph.D. degrees in Computer Science from the University of Pierre et Marie Curie (Paris VI) in 1985 and 1988 respectively. Since 1992, he has been a professor in the Automation Engineering department at the École de Technologie Supérieure (University of Quebec), Montreal, and was appointed full professor there in 1998. He co-founded the Laboratory for Imagery, Vision and Artificial Intelligence (LIVIA) at the University of Quebec, and was its director from 2000 to 2006. He also founded the SYNCHROMEDIA Consortium (Multimedia Communication in Telepresence) there, and has been its director since 1998. His interests include document image analysis, OCR, mathematical models for image processing, pattern classification models and learning algorithms, as well as perception in computer vision. Dr. Cheriet has published more than 250 technical papers in the field, and has served as chair or co-chair of the following international conferences: VI'1998, VI'2000, IWFHR'2002, and ICFHR'2008. He currently serves on the editorial board and is associate editor of several international journals: IJPRAI, IJDAR, and Pattern Recognition. He co-authored a book entitled, “Character Recognition Systems: A guide for Students and Practitioners,” John Wiley and Sons, Spring 2007. Dr. Cheriet is a senior member of the IEEE and the chapter chair of IEEE Montreal Computational Intelligent Systems (CIS).
1 Tel.: +1 514 3968972; fax: +1 514 3968595.