Elsevier

Pattern Recognition

Volume 47, Issue 3, March 2014, Pages 1021-1030

Learning-based word spotting system for Arabic handwritten documents

https://doi.org/10.1016/j.patcog.2013.08.014

Highlights

  • We propose a learning-based word spotting system for Arabic handwritten documents.

  • The system is designed to address the lack of clear word boundaries in Arabic script.

  • Pieces of Arabic Words (PAWs) are extracted from text lines.

  • Language models are incorporated into the system to reconstruct words from PAWs.

  • The system is tested on a variety of documents with promising results.

Abstract

The retrieval of information from scanned handwritten documents is becoming vital as the number of digitized documents grows rapidly, and word spotting systems have been developed to search for words within such documents. These systems are based either on template matching or on learning. This paper presents a coherent learning-based Arabic handwritten word spotting system that adapts to the nature of Arabic handwriting, in which words may have no clear boundaries between them. Accordingly, the system recognizes Pieces of Arabic Words (PAWs), then reconstructs and spots words using language models. The proposed system produced promising results for Arabic handwritten word spotting when tested on the CENPARMI Arabic documents database.

Introduction

As the number of handwritten documents being digitized continues to increase, indexing them becomes vital. Handwritten word spotting allows the user to search for keywords in handwritten texts. Early indexing work applied conventional Optical Character Recognition (OCR) techniques and passed the results to special search engines to search for words. Manmatha et al. [1] designed the first handwritten word spotting system in 1996 because they found that applying traditional OCR techniques to search for words was inadequate.

Word spotting methods are based on two main approaches: template matching and learning based. Manmatha et al. [1] proposed the first indexing or word spotting system for single writer historical documents. The proposed method was based on matching word pixels. Zhang et al. [2] proposed a template matching approach based on extracting features from word images. Dynamic Time Warping (DTW) [3], [4], [5] was successfully applied as an efficient template matching algorithm.
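As an illustration of the template matching idea mentioned above, the following is a minimal sketch of classic DTW between two one-dimensional feature sequences. It is not taken from any of the cited systems: the function name `dtw_distance`, the absolute-difference local cost, and the toy sequences are all assumptions made for the example.

```python
from math import inf


def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D sequences.

    The local cost is the absolute difference (an assumption for this sketch).
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical column-wise features of a query word and two candidates
query = [0, 1, 3, 1, 0]
stretched = [0, 1, 1, 3, 3, 1, 0]  # same shape, written wider
other = [2, 2, 2, 2, 2]            # a different shape

print(dtw_distance(query, stretched))  # 0.0
print(dtw_distance(query, other))      # 7.0
```

The stretched candidate matches the query at zero cost because DTW tolerates local differences in writing width, which is what makes it attractive for comparing handwritten word images.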

Learning based word spotting systems were introduced to handle multiple writers, with promising results. However, sufficiently large databases are needed to train such systems. The Hidden Markov Model (HMM) is the classifier most commonly applied to word spotting systems [6], [7]. Other approaches have also been developed; for example, Frinken et al. [8] proposed a word spotting system that uses a bidirectional Long Short-Term Memory (LSTM) Neural Network together with the Connectionist Temporal Classification (CTC) Token Passing algorithm to spot words, and this system has shown high performance.

Word spotting has been widely implemented for Latin-based and Chinese documents, while few word spotting systems have been implemented for Arabic handwritten documents. Arabic script is cursive by nature, and in Arabic writing words have no clear boundaries; these facts make the implementation of word spotting for Arabic handwritten documents a significant challenge.

In this paper we propose a learning based system for multi-writer Arabic word spotting. The system aims to overcome the problem of not having clear boundaries between words in Arabic handwriting. This paper is organized as follows: Section 2 explains the preprocessing and the extraction of features. Section 3 presents the partial segmentation algorithm to segment words into PAWs. Section 4 explains the technical details of our word spotting system. Section 5 introduces the databases used in this experiment and shows the experimental results. Finally, we conclude our work in Section 6.

Arabic script is always cursive, even when printed, and it is written horizontally from right to left. In Arabic writing, letter shapes change depending on their locations in the word, a characteristic which distinguishes Arabic writing from many other scripts. Arabic words consist of a number of connected components or sub-words, and some researchers call these sub-words Pieces of Arabic Words (PAWs) [9]. In Arabic script there is no difference between the within-word space (i.e. the white space between the PAWs) and the between-word space, as illustrated in Fig. 1. Therefore, the naturally cursive structure of Arabic writing is more unconstrained than in other languages. This, combined with the fact that the boundaries between words are arbitrary and often nonexistent, makes word spotting in the Arabic language challenging and in need of further research.

Attempts have been made to construct a language independent word spotting system, but these have encountered problems when handling Arabic script. Srihari and Ball [10] proposed a language independent word spotting system. They extracted gradient features from words since these features are language-independent. However, for Arabic handwritten word spotting, they found it necessary to apply manual word segmentation (clustering). In this way, they circumvented a main problem of the Arabic language: that there are no clear boundaries between words. Leydier et al. [11] proposed a segmentation free language independent word spotting system which may overcome this problem. However, they faced difficulties with words from the same root. Even though the system was validated for Arabic using only one simple query consisting of a single PAW, the precision rate of 80.00% for Arabic was lower than that of the two Latin databases that were tested. Wshah et al. [12] proposed a script independent segmentation free word spotting system based on HMMs, and this system was compared to a concurrent word spotting system [7] also utilizing HMMs. Both systems obtained their lowest results on the Arabic language.

DTW has been extensively used for word matching in Arabic handwritten word spotting. Moghaddam and Cheriet [13] applied Euclidean distance enhanced by rotation, together with DTW, to measure the similarity between two connected components or PAWs of historical documents. Moreover, Self-Organizing Maps were used to initially cluster PAWs depending on the shape complexity of each PAW. Rodriguez-Serrano and Perronnin [14] proposed a model-based similarity measure between vector sequences. Each sequence is mapped to a semicontinuous Hidden Markov Model, and then a measure of similarity is computed between the HMMs. This computation of similarity was simplified using DTW. They applied the measure to handwritten word retrieval in three different datasets including the IFN/ENIT database of Arabic handwritten words, and concluded that their proposed similarity outperforms DTW and ordinary continuous HMMs. Saabni and Bronstein [15] implemented an Arabic word matching approach by extracting contour features from PAWs, then embedding each PAW into a Euclidean space to reduce the complexity; finally they used Active-DTW [16] to determine the final matching result of a PAW.

Attempting to segment Arabic documents into candidate words may not be an appropriate approach for Arabic word spotting systems, since Arabic words are composed of PAWs. A line of Arabic text can be viewed as a sequence of PAWs instead of a set of words, because there are no differences between the spaces separating the PAWs and the words. Srihari et al. [17] tried to cluster words by segmenting the line into connected components and merging each main component with its diacritics. Nine features were extracted from each pair of clusters and the features were passed to a neural network to decide whether the gap between the pairs is a word gap. However, with ten writers each writing ten documents, the overall performance was only 60% when the word segmentations were correct, and this significantly affected the spotting results.

Many studies favored segmenting documents into PAWs rather than words because words have no clear boundaries. Sari and Kefali [18] segmented the document into major connected components to circumvent the problem of word segmentation in Arabic documents, and thus processed Arabic PAWs instead of words. They converted each PAW into Word Shape Tokens (WSTs), representing it by global structural features such as loops, ascenders and descenders. Input queries were coded in the same way, and then a string matching technique was applied. They validated their word spotting system using both printed and handwritten Arabic manuscripts and historical documents. This approach is promising because it uses open lexicons and avoids pre-clustering. Similarly, Saabni and El-Sana [19] segmented the documents into PAWs; they used DTW and HMM for matching in two different systems, and then additional strokes were used by means of a rule-based system to determine the final match.
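To make the idea of coding PAWs as shape tokens and then applying string matching more concrete, here is a small sketch that uses Levenshtein edit distance. The token alphabet (`L` for loop, `A` for ascender, `D` for descender, `x` for none) and the choice of edit distance are assumptions for illustration; the matching technique actually used in [18] is not specified in this text.

```python
def edit_distance(s, t):
    """Levenshtein distance between two token strings (row-by-row DP)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete a token from s
                curr[j - 1] + 1,           # insert a token into s
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical shape-token codes: one character per structural feature
query_code = "LAD"
candidates = {"paw_a": "LAD", "paw_b": "LxD", "paw_c": "DDxxA"}

# Rank candidates by distance to the coded query (smaller is closer)
ranked = sorted(candidates, key=lambda name: edit_distance(query_code, candidates[name]))
print(ranked)
```

Coding both the document PAWs and the query in the same alphabet is what allows an open lexicon: any query that can be coded can be searched for, without training on it first.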

Content-based retrieval using a codebook has been used for Arabic word spotting [20], [21], [18]. In these systems, meaningful features are extracted to represent codes of symbols, characters, or PAWs. Then similarity matching or distance measure algorithms between the codes and the codebook are applied to perform the final match.

In this paper we propose a learning-based word spotting system. This system is based on a hierarchical classifier that integrates a partial segmentation of the lexicon words into PAWs with language models to spot or reject a word. A pruning model, which includes a change in the internal structure of the classifier, is also proposed and compared with the default internal structure of the classifier.
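The following sketch shows one way that per-position PAW confidences could be combined with a simple word prior to spot or reject lexicon words in a line of PAWs. It is only an illustration under stated assumptions: the data layout `paw_confidences[i][k]`, the unigram `prior`, the product scoring rule, and the fixed `threshold` are hypothetical and are not the language models or classifier structure described in the paper.

```python
def spot_words(paw_confidences, lexicon, prior, threshold=0.5):
    """Scan a line of PAWs and report lexicon words whose PAW sequence fits.

    paw_confidences[i][k] maps a PAW label to the confidence assigned to the
    i-th PAW of the line by the level-k classifier (k = position in the word).
    """
    hits = []
    for start in range(len(paw_confidences)):
        for word, labels in lexicon.items():
            if start + len(labels) > len(paw_confidences):
                continue  # the word would run past the end of the line
            score = prior.get(word, 0.0)  # assumed unigram word prior
            for k, label in enumerate(labels):
                conf = paw_confidences[start + k][k].get(label, 0.0)
                if conf < threshold:
                    score = 0.0  # reject: one PAW is not confident enough
                    break
                score *= conf
            if score > 0.0:
                hits.append((start, word, score))
    return hits


# Hypothetical lexicon: each word is a sequence of PAW labels
lexicon = {"w1": ("p1", "p2"), "w2": ("p3",)}
prior = {"w1": 0.5, "w2": 0.5}

# Three PAWs in the line, each scored by the level-0 and level-1 classifiers
paw_confidences = [
    [{"p1": 0.9, "p3": 0.2}, {}],
    [{"p3": 0.1}, {"p2": 0.8}],
    [{"p3": 0.7}, {}],
]

hits = spot_words(paw_confidences, lexicon, prior)
print(hits)
```

Because the line is treated as a sequence of PAWs, no word segmentation is needed: a word is reported wherever its PAW sequence is recognized with enough confidence at consecutive positions.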

We also propose a two-pass partial segmentation algorithm. This algorithm first segments text line images and word images into PAWs, based on segmenting the document into connected components. These components are then divided into major and minor connected components according to the characteristics of Arabic handwriting.
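A rough sketch of the connected component idea is given below. The details of the two-pass algorithm are not given in this text, so the rules used here are assumptions: components are split into major and minor by a pixel-area threshold (`min_major_area`, hypothetical), and each minor component (for example a dot or diacritic) is attached to the PAW whose horizontal center is nearest.

```python
from collections import deque


def connected_components(img):
    """Return the 8-connected foreground components of a binary image."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                comps.append(pixels)
    return comps


def center_x(pixels):
    return sum(x for _, x in pixels) / len(pixels)


def segment_into_paws(img, min_major_area=4):
    """Pass 1: split components into major and minor by area (assumed rule).
    Pass 2: merge each minor component into the horizontally nearest PAW."""
    comps = connected_components(img)
    paws = [list(c) for c in comps if len(c) >= min_major_area]
    minors = [c for c in comps if len(c) < min_major_area]
    for minor in minors:
        if not paws:
            break
        nearest = min(paws, key=lambda p: abs(center_x(p) - center_x(minor)))
        nearest.extend(minor)
    # Arabic is read right to left, so order PAWs by decreasing x position
    paws.sort(key=center_x, reverse=True)
    return paws


# Toy binary image: two strokes and one dot above the right-hand stroke
img = [
    [0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 1, 0, 1],
]
paws = segment_into_paws(img)
print(len(paws))  # 2
```

In the toy image the isolated dot is merged into the right-hand stroke, so the line yields two PAWs ordered from right to left.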

Support Vector Machines (SVMs) and Regularized Discriminant Analysis (RDA) are used in experiments as internal classifiers of the hierarchical classifier. These classifiers are designed to recognize PAWs rather than words. A comparative study of the implementation results for these two methods is also included in this paper.

Finally, we improve the confidence transformation equation which is used to transform discriminative scores into confidence values, so that the confidence value can be used as a measure in the decision to accept or reject PAWs at the first level of the hierarchical classifier.
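For orientation, a common way to turn a raw discriminative score into a confidence value is a logistic (sigmoid) mapping followed by a threshold test. The sketch below shows that generic form only; the improved transformation equation proposed in the paper is not reproduced in this text, and the parameters `a`, `b` and `threshold` are hypothetical.

```python
import math


def confidence(score, a=1.0, b=0.0):
    """Map a raw classifier score to a value in (0, 1) with a logistic function.

    a and b are scale and shift parameters, typically fitted on validation data.
    """
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


def accept_paw(score, threshold=0.5, a=1.0, b=0.0):
    """Accept the PAW hypothesis only if its confidence reaches the threshold."""
    return confidence(score, a, b) >= threshold


print(confidence(0.0))   # 0.5
print(accept_paw(2.0))   # True
print(accept_paw(-2.0))  # False
```

Mapping scores onto a common (0, 1) scale makes the outputs of different classifiers comparable, so a single rejection threshold can be applied at the first level of a hierarchy.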

Section snippets

Preprocessing and feature extraction

The words database and the documents database are first binarized and smoothed, and then the documents are segmented into text lines, after which both text lines and word images are partially segmented into PAWs. Gradient features [22] are extracted from each PAW image and passed to the classifiers for training or testing. Since the RDA classifier performs better with lower dimensionality, the dimensionality of the feature vectors was reduced accordingly. The following sections present the

Partial segmentation

Many word spotting systems start by segmenting text lines into characters or words, to which word spotting algorithms are applied. In Latin based languages, words are delimited by space which makes it easier to extract a word as a basic unit [10], [26], [5]. In the Chinese language words consist of one or more characters, so text lines are usually segmented into Chinese characters, after which words are reconstructed [27], [28], [29]. In the Arabic language many studies favored segmenting a

Proposed word spotting system

This paper introduces a word spotting system for Arabic handwritten documents. The system is trained on a lexicon of Arabic handwritten words, and tested on Arabic handwritten documents. The PAWs of the words lexicon are re-grouped according to their locations within the word, and each group is used to train a classifier. This will result in a sequence of classifiers that will form a hierarchical classifier. The PAWs of the document text lines are passed to the hierarchical classifier. Graphs

Experiments and evaluation

The proposed method was evaluated using the CENPARMI Arabic words database (described in Section 5.1), which is processed by the hierarchical classifier and the words classifier as well. In addition, the CENPARMI documents database is used to test and validate the system.

The performance of the segmentation algorithm was evaluated on the documents database, by considering the total number of PAWs that resulted in segmentation errors. These errors were divided into three categories: touching PAWs

Conclusion

This paper presents a coherent, learning based and multi-writer Arabic word spotting system. The system is based on a PAW or sub-word model, in which words are spotted based on PAWs. Word spotting is implemented using a hierarchical classifier consisting of a sequence of classifiers, each of which recognizes PAWs rather than words. Language models are proposed to integrate contextual information with the confidence values given to the PAWs by the classifier sequence.

We also proposed a

Conflict of interest

None declared.

Muna Khayyat is a Ph.D. student in the Department of Computer Science and Software Engineering at Concordia University (Montreal, Canada), and she is a researcher at CENPARMI. Her research interests are handwriting recognition and word spotting.

References (40)

  • T. Adamek et al., Word matching using single closed contours for indexing handwritten historical documents, International Journal on Document Analysis and Recognition (2007)
  • J.A. Rodríguez-Serrano, F. Perronnin, Local gradient histogram features for word-spotting in unconstrained handwritten...
  • V. Frinken et al., A novel word spotting method based on recurrent neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence (2012)
  • N. Aouadi, A. Kacem, Word spotting for Arabic handwritten historical document retrieval using generalized Hough...
  • S.N. Srihari, G.R. Ball, Language independent word spotting in scanned documents, in: Lecture Notes in Computer Science...
  • S. Wshah, G. Kumar, V. Govindaraju, Script independent word spotting in offline handwritten documents based on Hidden...
  • R. Moghaddam, M. Cheriet, Application of multi-level classifiers and clustering for automatic word spotting in...
  • J.A. Rodríguez-Serrano et al., A model-based sequence similarity with application to handwritten word-spotting, IEEE Transactions on Pattern Analysis and Machine Intelligence (2012)
  • R. Saabni, A. Bronstein, Fast key-word searching via embedding and Active-DTW, in: Proceedings of the 11th...
  • M. Sridhar, D. Mandalapu, M. Patel, Active-DTW: a generative classifier that combines elastic matching with active shape...

    Louisa Lam received the B.A. degree from Wellesley College in Massachusetts, USA, and M.Sc. and Ph.D. degrees from the University of Toronto, Canada. Her research interests include character recognition, combination of classifiers, skeletonization, multilingual document processing, and applications.

    Ching Y. Suen is the Director of CENPARMI and the Concordia Chair on AI and Pattern Recognition. He received his Ph.D. degree from UBC (Vancouver) and his Master's degree from the University of Hong Kong. He has served as the Chairman of the Department of Computer Science and as the Associate Dean (Research) of the Faculty of Engineering and Computer Science of Concordia University. He has served at numerous national and international professional societies as President, Vice-President, Governor, and Director. He has given 40 invited/keynote papers at conferences and 180 invited talks at various industries and academic institutions around the world. He has been the Principal Investigator or Consultant of 30 industrial projects. His research projects have been funded by the ENCS Faculty and the Distinguished Chair Programs at Concordia University, FCAR (Quebec), NSERC (Canada), the National Networks of Centres of Excellence (Canada), the Canadian Foundation for Innovation, and the industrial sectors in various countries, including Canada, France, Japan, Italy, and the United States.
