Pattern Recognition

Volume 60, December 2016, Pages 596-612

Flexible Sequence Matching technique: An effective learning-free approach for word spotting

https://doi.org/10.1016/j.patcog.2016.05.011

Highlights

  • Flexible Sequence Matching (FSM) is introduced here and applied to word spotting.

  • FSM can perform partial matching, skip outliers, and form one-to-one, one-to-many, and many-to-one correspondences.

  • FSM is able to spot words inside segmented lines or from improperly segmented words.

  • FSM can handle word derivatives and spelling variations.

  • FSM can behave as other sequence matching techniques (DTW, MVM, CDP).

Abstract

In this paper, a robust method is presented to perform word spotting in degraded handwritten and printed document images. A new sequence matching technique, called the Flexible Sequence Matching (FSM) algorithm, is introduced for this word spotting task. FSM was specially designed to incorporate crucial characteristics of other sequence matching algorithms, especially Dynamic Time Warping (DTW), Subsequence DTW (SSDTW), Minimal Variance Matching (MVM) and Continuous Dynamic Programming (CDP). Along with multiple matching (many-to-one and one-to-many), FSM can skip outliers or noisy elements, regardless of their positions in the target signal. More precisely, in the domain of word spotting, FSM can retrieve complete words as well as words that contain only a part of the query. Furthermore, thanks to its adaptable skipping capability, FSM is less sensitive to local variations in the spelling of words and to local degradation within the word image. Its multiple matching capability (many-to-one, one-to-many) helps it address stretching effects in query and/or target images. Moreover, FSM is designed in such a way that, with little modification, its architecture can be changed into that of DTW, MVM, SSDTW, and CDP-like techniques. To illustrate these possibilities for FSM applied to specific cases of word spotting, such as incorrect word segmentation and word-level local variations, we performed experiments on historical handwritten documents as well as on historical printed document images. To demonstrate the sub-sequence matching and noise-skipping capabilities, as well as the ability to work in a multilingual setting with local spelling variations, we considered properly segmented lines of historical handwritten documents in different languages, along with both properly and improperly segmented words in printed and handwritten historical documents.
From the comparative experimental results presented in this paper, it can clearly be seen that FSM performs as well as or better than most DTW-based word spotting techniques in the literature, while providing more meaningful correspondences between elements.

Introduction

Today's high-quality document digitization offers a compelling alternative for preserving precious ancient manuscripts, providing easy, hassle-free access to them for historians and researchers. Retrieving information from these knowledge resources is useful for interpreting and understanding history in various domains and for knowing our cultural as well as societal heritage. However, digitization alone is of limited help until these collections of manuscripts can be indexed and made searchable. The performance of available OCR engines is highly dependent on a burdensome learning process. Moreover, writing and font style variability, language and script dependencies, and poor document quality caused by strong degradation are bottlenecks of such systems. Manually or semi-automatically transcribing the entire text of handwritten or printed documents in order to search for any specific word is a tedious and costly job. For this reason, research has increasingly focused on word spotting. This technique can be defined as the "localization of words of interest in the dataset without actually interpreting the content" [1], and the result of such a search could look like the one shown in Fig. 1 (without transcription). Fig. 1 gives a layman's view of the word spotting outcome of the system.

A popular way to categorize word spotting techniques is to distinguish those based on query-by-example from those based on query-by-string. In the former category, a region of a document is selected by the user, and the system should return all regions that contain the same text as the query region. This is often achieved by learning-free, image-matching-based approaches. In the query-by-string category, queries of arbitrary character combinations can be searched. These approaches require a model for every character and are consequently often learning-based, using, e.g., HMMs [2], [3] or a Bidirectional Long Short-Term Memory (BLSTM) neural network [4]. They achieve very good performance when the learning set is representative of the writing/font styles found in the documents to be searched. The well-known drawback of learning-based approaches is the requirement of a (most often enormous) set of transcribed text line images for training, which can be costly to obtain for some historical datasets. Only very few approaches appear to be able to work with little training data [5], [6]. Moreover, the training (transcription of the learning set and learning of models) may have to be re-performed for new documents, depending on the variability of the writing/font styles. Thus, if neither the language nor the alphabet of a historical document is known, or if creating a new learning set and retraining the system is necessary but not possible, a learning-free approach to word spotting might be the only available option. Consequently, a fair comparison between the two families is difficult to perform without including these criteria, and we decided in this study to focus on learning-free approaches. These can be further categorized depending on the level of segmentation.

The concept of word spotting as the task of detecting words in document images without actually understanding or transcribing the content was initially the subject of experimentation by Manmatha et al. [1]. This approach relies on the segmentation of full document images into word images. A general and highly applicable approach for comparing word images is to represent them by a sequence of features extracted with a sliding window. These word images can be thought of as 2D signals, which can be matched using dynamic-programming-based approaches [1], [7]. Methods oriented toward bitwise comparison of images were also investigated [8], as well as holistic approaches that describe a full word image [9], [10]. An approach based on low-dimensional, fixed-length representations of word images, which is fast to compute and fast to compare, is proposed in [11]. Based on the topological and morphological information of handwriting, a skeleton-based graph matching technique is used in [12] for performing word spotting in handwritten historical documents. There have also been attempts to spot words on segmented lines to avoid the problems of word segmentation; indeed, depending on the document quality, line segmentation can be comparatively easier than word segmentation. The partial sequence matching property of CDP [13] is one possibility. Over-segmentation is another alternative, as in [14], where sequences of primitives obtained by segmentation and clustering (corresponding to similar characters or pieces of characters) are compared.
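The sliding-window representation mentioned above can be illustrated with a small sketch. The exact features differ from paper to paper; the three per-column features below (projection profile plus upper and lower profiles) are a common choice in the literature, assumed here purely for illustration.

```python
import numpy as np

def column_features(img):
    """Turn a binarized word image (2-D array, ink = 1, background = 0)
    into a feature sequence with one vector per pixel column.

    Per-column features (an illustrative choice, not any paper's exact set):
      1. ink-pixel count (vertical projection profile)
      2. upper profile: row index of the topmost ink pixel
      3. lower profile: row index of the bottommost ink pixel
    """
    h, w = img.shape
    feats = []
    for col in range(w):
        ink = np.flatnonzero(img[:, col])  # row indices of ink pixels
        if ink.size:
            feats.append((int(ink.size), int(ink[0]), int(ink[-1])))
        else:
            feats.append((0, h, -1))  # marker for an empty (background) column
    return feats
```

Two word images of different widths then become two sequences of different lengths, which is exactly the situation that dynamic-programming matchers such as DTW are designed for.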

The necessity of proper word (or, in some cases, line) segmentation and the high computational complexity of matching are critical bottlenecks for most techniques in this category. Moreover, these techniques are sensitive to the degradation noise typically found in historical document images, and most of them cannot spot out-of-vocabulary words.

In [15], the authors proposed another type of matching technique based on differential features, matching only the informative parts of the words, which are detected by patches. A common approach for segmentation-free word spotting is to treat the task as image retrieval for an input shape representing the query image [16]. For example, a sliding window based on HOG descriptors is used in [17] to locate the document regions that are most similar to the query. In [18], by treating the query as a compact shape, a pixel-based dissimilarity (using the Squared Sum Distance) is computed between the query image and the full document page to locate words. A heat kernel signature (HKS) based technique is proposed in [19]: SIFT key points are detected on the document pages and the query image, HKS descriptors are extracted from local patches centered at these key points, and local zones containing a sufficient number of key points matching the query image are then located. Bag of visual words (BoVW) approaches have also been used to identify zones of the image that share common characteristics with the query word. In [20], a Longest Weighted Profile based zone-filtering technique over BoVW representations identifies the locations of the query words in the document image. In [21], local patches powered by SIFT descriptors are described by a BoVW; by projecting the patch descriptors onto a topic space with latent semantic analysis and compressing them with product quantization, the approach can efficiently index the document information in terms of both memory and time. Overall, segmentation-free approaches avoid the curse of segmentation, but they have comparatively low accuracy (compared with segmentation-based and learning-based approaches) and a high computational burden, since every region of the full image is a candidate for matching.

Table 1 summarizes several existing word spotting techniques according to the criteria mentioned above: learning or not, and level of segmentation. From this literature review, we noticed that some important problems in this domain remain unresolved; e.g., word spotting that is independent of script and language variability has not been properly addressed by the research community. Additionally, excluding learning-based approaches, few research studies can handle noise and degradation effects in historical documents [22], [3]. In most languages, there are several variations or derivatives of words that could be interesting for the user. For example, the French word cheval (horse) has derivatives such as "chevalerie", "chevaux", and "chevalier". In old French, lexical variations also exist, e.g., "cheual", "chevaus", and these variations do not change the meaning of the word; being able to retrieve such derivatives of the searched word could be very useful. However, very little work is available in this direction, or in the direction of skipping the prefix and suffix of segmented words [13], [7].

For this reason, in this paper, we propose a robust learning-free word spotting method based on a novel sequence matching technique, called Flexible Sequence Matching (FSM). The proposed FSM technique is designed to overcome the bottlenecks of the other sequence matching techniques that have been applied in the domain of word spotting, e.g., DTW [1], CDP [13], and modified versions of classical DTW [7]. The proposed algorithm can handle (to some extent) the local degradations and noise present in the image by skipping them if necessary. To an extent, it can also handle lexical variations or derivatives by skipping unnecessary portions such as prefixes or suffixes. FSM is flexible with regard to the word segmentation problem and does not truly depend on good segmentation: it can work on pieces of lines and/or on improperly segmented words (see Table 1). The approach can also handle (local) spelling variations and word derivatives. These properties allow the algorithm to find meaningful correspondences between elements of query and targets, which is not often the case for classical sequence matching techniques. Finally, the architecture of FSM is designed in such a generalized manner that it can easily be modified into the architecture of other sequence matching techniques, e.g., Minimal Variance Matching (MVM) [23], DTW [1], Subsequence DTW (SSDTW) [24], and Continuous Dynamic Programming (CDP) [13]. To show the usefulness of FSM in the word spotting domain, FSM is compared with similar sequence matching techniques such as DTW, SSDTW, MVM, Optimal Subsequence Bijection (OSB) [25], and CDP.

The remainder of this paper is organized as follows. In Section 2, a comparative discussion of state-of-the-art sequence matching techniques is given. The proposed word spotting framework, along with descriptions of the features used, is briefly presented in Section 4.1. The core idea of the paper, the Flexible Sequence Matching technique, is explained in Section 3, including its theoretical description, pseudo-code and generalization properties (see Appendix E). The experimental evaluation is described in Section 4, and conclusions and future work are given in Section 5.

Section snippets

Bird's eye view of sequence matching techniques

Many studies in the literature [1], [7] follow an architecture for word image matching similar to one based on classical DTW [24] (see also Section 4.1). The main idea of DTW is to calculate the distance between two time series by summing up the distances of their corresponding elements. DTW yields an optimal (order-preserving) relationship R between all of the elements of sequence x = {x1, x2, …, xp} and all of the elements of sequence y = {y1, y2, …, yq}
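The classical DTW recurrence described above can be sketched in a few lines. The absolute-difference local distance used here is an illustrative assumption; in a real word spotting system the local distance would be computed between per-column feature vectors (e.g., with the Euclidean distance).

```python
import numpy as np

def dtw_distance(x, y):
    """Classical DTW between two 1-D sequences: the accumulated cost of
    the cheapest order-preserving alignment in which every element of x
    corresponds to at least one element of y and vice versa."""
    p, q = len(x), len(y)
    D = np.full((p + 1, q + 1), np.inf)   # D[i, j]: best cost aligning x[:i] with y[:j]
    D[0, 0] = 0.0
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            cost = abs(x[i - 1] - y[j - 1])        # local distance d(x_i, y_j)
            D[i, j] = cost + min(D[i - 1, j - 1],  # one-to-one step
                                 D[i - 1, j],      # many-to-one (x stretched)
                                 D[i, j - 1])      # one-to-many (y stretched)
    return D[p, q]
```

Note that DTW must consume both sequences entirely: it cannot skip a noisy target element or ignore a prefix/suffix, which is precisely the limitation that MVM-, CDP- and FSM-style formulations relax.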

Flexible sequence matching (FSM)

In this section, the complete mathematical structure of FSM is described. The main properties of FSM are summarized in Table A1, Table A2. FSM creates a relation R from two finite sequences x (query) and y (target) of different lengths p and q: x = (x1, x2, …, xp) and y = (y1, y2, …, yq), with p ≤ q.
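To make the notion of skipping concrete, here is a deliberately simplified, MVM-style sketch: it matches every query element one-to-one to an order-preserving subsequence of the target and skips the unmatched target elements at no cost. This is not the FSM recurrence of the paper (which additionally supports many-to-one/one-to-many matches and controlled skipping); it is only a minimal illustration of the skip idea.

```python
def skip_match(x, y):
    """Cheapest order-preserving one-to-one matching of the full query x
    into the target y (requires len(x) <= len(y)); unmatched target
    elements are skipped for free.  A simplified MVM-style illustration,
    not the paper's FSM recurrence."""
    p, q = len(x), len(y)
    INF = float("inf")
    # D[i][j]: best cost of matching x[0..i] with x[i] paired to y[j]
    D = [[INF] * q for _ in range(p)]
    for j in range(q):
        D[0][j] = abs(x[0] - y[j])
    for i in range(1, p):
        for j in range(i, q):          # need j >= i to fit x[0..i-1] before y[j]
            prev = min(D[i - 1][k] for k in range(i - 1, j))
            D[i][j] = abs(x[i] - y[j]) + prev
    # the last query element may be matched anywhere late enough in y
    return min(D[p - 1][j] for j in range(p - 1, q))
```

For instance, matching the query (1, 5) against the noisy target (9, 1, 9, 5, 9) costs 0 because the three 9s are skipped, whereas DTW would be forced to absorb them into the alignment.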

Experimental evaluation

Experiments on the word spotting application were performed on: i) correctly segmented words; ii) incorrectly segmented words, i.e., pseudo-words; iii) correctly segmented lines. Indeed, depending on the characteristics of the documents, it can be comparatively easier to perform word segmentation or line segmentation. For example, if inter-word gaps are not large enough or are variable (which is often the case in old historical documents), word segmentation can be very difficult. However, for

Conclusion and future work

In this paper, we presented a new robust sequence-matching algorithm, called FSM, which can easily be modified into the architecture of other sequence matching techniques, e.g., DTW, SSDTW, MVM, and CDP. In particular, FSM can skip outliers at any position of the target sequence. This, coupled with its many-to-one and one-to-many matching facilities, makes the proposed FSM algorithm robust, general (it can easily be modified to behave differently) and

Conflict of interest

None declared.

Acknowledgements

This work has been supported by the Indo-French Center for Promotion of Advanced Research (IFCPAR/CEFIPRA) under project no 4700-IT1. We would like to thank Dr. Kengo Terasawa (Department of Media Architecture, Future University-Hakodate, Japan) for providing us the segmented lines, ground truth and feature vectors [13] of the George Washington and Japanese datasets.

References (39)

  • V. Frinken et al., A novel word spotting method based on recurrent neural networks, IEEE TPAMI (2012)
  • N.R. Howe, Part-structured inkball models for one-shot handwritten word spotting, ICDAR (2013)
  • L. Rothacker et al., Bag-of-features HMMs for segmentation-free word spotting in handwritten documents, ICDAR (2013)
  • M. Meshesha et al., Matching word images for content-based retrieval from printed document images, IJDAR (2008)
  • S. Srihari, H. Srinivasan, P. Babu, C. Bhole, Spotting words in handwritten Arabic documents, in: Proc. SPIE 6067, ...
  • T. Adamek et al., Word matching using single closed contours for indexing handwritten historical documents, IJDAR (2006)
  • H. Cao et al., Template-free word spotting in low-quality manuscripts, ICPR (2007)
  • J. Almazan et al., Handwritten word spotting with corrected attributes, ICCV (2013)
  • P. Wang, V. Eglin, C. Garcia, C. Largeron, J. Llados, A. Fornes, A novel learning-free word spotting approach based on ...

    Tanmoy Mondal: Tanmoy Mondal received B.Tech degree in information technology from West Bengal University of Technology, Kolkata, India, in 2007 and the M.Tech. degree in mechatronics from Bengal Engineering and Science University, Kolkata, in 2009. Before joining as a PhD student in Laboratoiré d'Informatique, Université François Rabelais, Tours, France in 2012, he has worked at several industries and scientific laboratories as a researcher. He has received his PhD degree in computer science in 2015. His research interests include pattern recognition, image processing and analysis, and computer vision. His current research is mainly related to time series matching techniques and document image processing.

    Nicolas Ragot: Nicolas Ragot received his Ph.D. degree in computer science in 2003 from IRISA lab, Rennes University (France). Since 2005, he joined the Computer Science Lab (LI EA 6300) in the RFAI group of Université François Rabelais, Tours (France) where he is an associate professor at Polytech Tours (French engineering school). His main research area is Pattern Recognition applied to Document Analysis. During the past 10 years, he worked mainly on online signature recognition, robust and adaptive OCR systems based on HMM, OCR control and defects detection (with French National Library-BnF). More recently he and Indian Statistical Institute-Kolkata received a 3 years grant from IFCPAR for project collaboration on robust and multilingual word spotting. He and is group were also involved in several National projects funded by government (ANR NAVIDOMAS, DIGIDOC) as well as companies (ATOS Worldline, Nexter). His group also received during 2 years a Google Digital Humanities award to work on interactive layout analysis and the use of pattern redundancy for transcription and retrieval of old printed books.

    Jean-Yves Ramel: Jean-Yves RAMEL received his PhD degree in Computer Sciences in 1996 from the National Institute of Applied Sciences of Lyon (INSA Lyon France). After being an Assistant Professor at the Industrial Engineering Department of the National Institute of Applied Sciences of Lyon from 1998 to 2002; Jean-Yves Ramel is currently a Full Professor at the Computer Science Department of Polytech Tours (French engineering school). He is also a researcher of the Computer Science Laboratory of Tours (EA 6300) where he is managing the Image Analysis and Pattern Recognition Group. His current research fields are interactive methods for image analysis and classification and structural pattern recognition. Jean-Yves RAMEL is an active member of the Pattern Recognition and Image Analysis French and International communities (AFRIF, GRCE, IAPR).

    Umapada Pal: Umapada Pal received his Ph.D. in 1997 from Indian Statistical Institute. He did his Post Doctoral research at INRIA (Institut National de Recherche en Informatique et en Automatique), France. From January 1997, he is a Faculty member of Computer Vision and Pattern Recognition Unit of the Indian Statistical Institute, Kolkata and at present he is a Professor. His fields of research interest include Digital Document Processing, Optical Character Recognition, Biometrics, Word spotting etc. He has published 300 research papers in various international journals, conference proceedings and edited volumes. Because of his significant impact in the Document Analysis research, in 2003 he received ICDAR Outstanding Young Researcher Award from International Association for Pattern Recognition (IAPR). In 2005–2006 Dr. Pal has received JSPS fellowship from Japan government. In 2008, 2011 and 2012, Dr. Pal received Visiting fellowship from Spain, France and Australia government, respectively. Dr. Pal has been serving as General/Program/Organizing Chair of many conferences including International Conference on Document Analysis and Recognition (ICDAR), International Conference on Frontiers of Handwritten Recognition (ICFHR), International Workshop on Document Analysis and Systems (DAS), Asian Conference on Pattern recognition (ACPR) etc. Also he has served as a program committee member of more than 50 international events. He has many international research collaborations and supervising Ph. D. students of many foreign universities. He is an associate Editor of the journal of ACM Transactions of Asian Language Information Processing (ACM-TALIP), Pattern recognition Letters (PRL), Electronic Letters on Computer Vision and Image Analysis (ELCVIA) etc. Also he has served as a guest editor of several special issues. He is a Fellow of IAPR (International Association of Pattern Recognition).
