Elsevier

Image and Vision Computing

Volume 44, December 2015, Pages 15-28
Word spotting in historical documents using primitive codebook and dynamic programming

https://doi.org/10.1016/j.imavis.2015.09.006

Abstract

Word searching and indexing in historical document collections is a challenging problem because text characters are often touching or broken due to degradation or aging. In this paper, we present a novel approach to word spotting that decomposes text lines into character primitives and applies string matching. The text lines are first separated by a segmentation process. Each text line is then described as a sequence of primitive labels, each corresponding to a single character or a part of a character. These representative primitives are drawn from a codebook of shapes generated from training pages taken from the collection. During indexation, the text lines are transcribed into strings of primitives in an off-line stage and stored in files. For this purpose, an efficient multi-label indexation strategy combines a two-level analysis of the primitives: coarse and fine. During retrieval, the query word image is encoded into strings of coarse and fine primitives chosen according to the codebook. Finally, a dynamic programming method based on approximate string matching finds similar primitive sequences in the text lines of the collection at runtime. We present an experimental evaluation on datasets of real-life document images gathered from historical books in different scripts. Experimental results show that the method is robust when searching text in noisy documents.

Introduction

Text searching in historical documents is attracting growing interest in the Document Image Analysis (DIA) research community because of its complexity and the growing need to access the content of digitized books. In recent years, libraries and museums have been digitizing historical documents on a massive scale, and this digital information is made available to users through web portals. Through these portals, users can only view the pages that were digitized. Searching by content (e.g. a word) is available only if the corresponding pages have been transcribed. In historical documents, degradation caused by aging, strains, repetitive use, etc. makes character recognition a difficult task. Properly extracting characters from such documents for recognition is difficult. Incorrect segmentation of severely touching or broken characters is still one of the main causes of errors in segmentation-based recognition approaches [1]. Most word segmentation methods analyze the spacing between characters [2]. Because the spacing between characters and words is sometimes non-uniform, it is difficult to segment words perfectly. Some pages of a historical book may also contain text in different fonts. Thus, the recognition method needs to be robust to word segmentation problems and able to handle different fonts. Fig. 1 shows two examples of document images from our collection that illustrate some of the issues described above. Automatic transcription of these books by the available commercial OCR systems has not been satisfactory so far. Manual transcription of the archive is also not feasible because of the large volume of data.

When processing such degraded documents, word spotting techniques [3], [4], [5], [6], [7], an alternative to OCR, are useful for finding possible instances of specific query words. These approaches do not require recognizing every letter of the query word or the target words, and can therefore retrieve similar words in the presence of small distortions. The features are generally computed from the whole word, and the methods look for similar features in the target images. One of the bottlenecks of these word spotting methods is that most of them require a word segmentation step prior to matching. If the words are not segmented properly, the features in the target image do not match, and these words cannot be retrieved. To overcome this problem, some segmentation-free methods [8], [9] have recently been proposed, but their computational cost is too high for use in a real search application.

The goal of this work is to propose an efficient indexing scheme that can search text information in historical archives more effectively and faster. To overcome the limitations of OCR, we propose to use the Query By Example (QBE) principle so that a user's query image can be searched efficiently in a large volume of historical documents. Retrieval of text information is fast, and it helps the user browse relevant information by sidestepping the problems that restrict OCR processing of historical books. Our approach tries to overcome the difficulties of segmentation-based word spotting methods by not requiring complete word segmentation beforehand. It requires only text line segmentation, which is a relatively easy layout segmentation task for printed documents. In addition, the heavy search cost of segmentation-free word spotting methods is avoided by encoding the text line image information as strings and matching those strings.

Shape coding has been used efficiently to encode words in printed documents [10]. Inspired by this idea, the proposed approach uses text primitive segmentation for word retrieval. We describe the text content (each text line) of historical books by basic feature shapes called primitives. A primitive consists of a single character or a part of a character. Primitive segmentation is performed using background information of the text image, which is handled using the water reservoir concept [11]. After primitive extraction, similar primitives are grouped using a shape matching algorithm and a codebook of primitives is built. During indexation, the text content of the book is encoded using the previously generated codebook of primitives. During retrieval, a query word is also encoded as a string of primitives from the codebook. Next, a sub-string matching algorithm is applied to each encoded string in the documents to retrieve the query word. To make the retrieval process efficient, the encoding uses two different codebooks of primitives: a coarse one corresponding to connected components and a fine one corresponding to glyphs (explained in Fig. 5). During the querying step, the coarse and fine signatures are likewise generated from the query image. The sub-string matching is performed by a dynamic programming based approximate string matching algorithm. A bi-level matching is done to find similar words: the coarse approach is used first, and the fine approach is applied at the predetermined hypothetical locations only if necessary. This work is motivated by our preliminary work presented in [12], [13]. The current work is an improved version with more details and exhaustive experiments carried out to understand the different aspects of the methodology.
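The sub-string matching step described above can be illustrated with a minimal sketch. It assumes unit costs for substitution, insertion and deletion, and it represents primitive labels as plain integers; the cost function actually used in the paper is not given in this excerpt, and the function name `approx_substring_matches` is hypothetical. The only change from ordinary edit distance is that the first row of the table is zero, so a match may begin at any position in the text line.

```python
from typing import List, Sequence, Tuple


def approx_substring_matches(
    query: Sequence[int], line: Sequence[int], max_cost: int
) -> List[Tuple[int, int]]:
    """Return (end_index, cost) for every position in `line` where some
    substring ending there matches `query` within `max_cost` edits.

    Unit edit costs are an assumption made for this sketch.
    """
    m = len(query)
    # prev[i]: cost of aligning query[:i] with a substring ending just
    # before the current line position. Before any label is read,
    # query[:i] can only be matched by deleting its i labels.
    prev = list(range(m + 1))
    hits: List[Tuple[int, int]] = []
    for j, label in enumerate(line):
        # curr[0] stays 0: a match is free to start at any line position.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            substitute = prev[i - 1] + (0 if query[i - 1] == label else 1)
            skip_line_label = prev[i] + 1
            skip_query_label = curr[i - 1] + 1
            curr[i] = min(substitute, skip_line_label, skip_query_label)
        if curr[m] <= max_cost:
            hits.append((j, curr[m]))
        prev = curr
    return hits


# Exact occurrence of the query inside a longer line.
print(approx_substring_matches([1, 2, 3], [9, 1, 2, 3, 9], max_cost=0))
# One primitive mislabelled (7 instead of 2), tolerated with max_cost=1.
print(approx_substring_matches([1, 2, 3], [1, 7, 3], max_cost=1))
```

Each text line costs time proportional to the query length times the line length, which is why matching on short label strings is much cheaper than comparing image features at every position.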

Our approach takes each text line as input for word spotting because the segmentation of text lines in printed historical documents is relatively easy. Thus, we avoid the most difficult task, exact word segmentation in a document. The main contributions of this paper are the use of coarse and fine level text portions (primitives) instead of the whole word, and the encoding of the text using these primitives for indexing. One advantage is that the method searches for possible words efficiently using coarse level primitive shapes (i.e. connected components) first. Then, if necessary, it uses fine primitives to detect strings of touching and broken characters. This two-level search is robust to degradation such as touching or broken characters because fine level matching is used when coarse level matching fails. As the method searches for the query word at the string level, using string matching in terms of primitives, the response time is short. The proposed approach can be applied to different scripts because it uses a dynamic codebook vocabulary for text encoding. Fig. 2 shows some retrieval examples from our word spotting system on a set of text lines where word segmentation would have been a major issue.
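The coarse-then-fine strategy can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes each text line is stored as a pair of label strings (coarse and fine), uses unit-cost edit distance, and falls back to fine matching over the whole line, whereas the paper restricts fine matching to hypothesized locations. The names `best_substring_cost` and `spot`, and both tolerance parameters, are hypothetical.

```python
from typing import List, Sequence, Tuple


def best_substring_cost(query: Sequence[str], line: Sequence[str]) -> int:
    """Lowest unit-cost edit distance between `query` and any substring of `line`."""
    prev = list(range(len(query) + 1))
    best = prev[-1]  # cost of matching the empty substring
    for label in line:
        curr = [0]  # a match may start at any line position
        for i in range(1, len(query) + 1):
            substitute = prev[i - 1] + (query[i - 1] != label)
            curr.append(min(substitute, prev[i] + 1, curr[i - 1] + 1))
        best = min(best, curr[-1])
        prev = curr
    return best


def spot(
    query_coarse: str,
    query_fine: str,
    lines: List[Tuple[str, str]],
    coarse_tol: int,
    fine_tol: int,
) -> List[Tuple[int, str]]:
    """Return (line_index, level) for lines that match the query.

    The coarse strings are tried first; the fine strings are consulted
    only when the coarse match fails.
    """
    results = []
    for idx, (coarse, fine) in enumerate(lines):
        if best_substring_cost(query_coarse, coarse) <= coarse_tol:
            results.append((idx, "coarse"))
        elif best_substring_cost(query_fine, fine) <= fine_tol:
            results.append((idx, "fine"))
    return results


# Toy data: one character per primitive label.
lines = [
    ("XABY", "xyabcdzw"),  # clean line: components A and B are intact
    ("XMY", "xyabcdzw"),   # touching characters merged A and B into M
    ("XYZ", "xyzw"),       # query word absent
]
print(spot("AB", "abcd", lines, coarse_tol=0, fine_tol=1))
# [(0, 'coarse'), (1, 'fine')]
```

In the toy data the second line is missed at the coarse level because two components have merged, but its glyph-level string is unchanged, so the fine pass recovers it.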

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 explains in detail the proposed coarse-to-fine indexing approach for text encoding. Section 4 discusses the word retrieval process when a query is provided. Section 5 presents the experimental results on datasets of different scripts. Finally, conclusions are given in Section 6.

Section snippets

Related works

Text searching without OCR provides an alternative approach for indexing and retrieving text information in degraded images. Query word spotting is a content-based retrieval approach that ranks a list of target word images by their similarity to a query word image. It treats each word as a whole entity and thus avoids the difficulty of character segmentation and recognition. Broadly, there are two different approaches to word spotting: the segmentation-based

Preprocessing and layout analysis

The historical documents in our collections have complex layouts and degradation effects. In some of the documents, graphical decorations appear outside the text part (see Fig. 1(a)). Dropcaps, figures, etc. often appear within the text portion, which makes layout analysis difficult. To perform the coarse text line segmentation, we have incorporated the AGORA [25] tool in our system because of its good performance in layout analysis. The superiority of AGORA is mainly due to the

Text retrieval from query image

To search for a query text image Q in a collection of document images, Q is first segmented into primitives using the bi-level approach (CC and glyph) explained in Section 3.2. Next, for each primitive, we find the most similar codebook models of μ using a similarity matching criterion. Q is then encoded into a sequence of labels $L_{q_1} L_{q_2} \ldots L_{q_t}$, where $L_{q_i} \in L_m$ and $t$ is the number of primitives in Q. To handle noise, the $N_c$ nearest codebook models are chosen for each query primitive. Finding text
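The multi-label query encoding described in this snippet can be sketched as below. The sketch assumes that each primitive and each codebook model is a fixed-length feature vector and that similarity is Euclidean distance; the actual shape features and matching criterion are not given in this excerpt. The function name `encode_primitives` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def encode_primitives(
    primitives: Sequence[Sequence[float]],
    codebook: Dict[int, Sequence[float]],
    n_c: int,
) -> List[List[int]]:
    """Map each primitive to the labels of its n_c nearest codebook models.

    Labels are ordered from nearest to farthest. Euclidean distance is an
    assumption standing in for the paper's shape similarity measure.
    """
    encoded = []
    for features in primitives:
        ranked = sorted(
            codebook, key=lambda label: math.dist(features, codebook[label])
        )
        encoded.append(ranked[:n_c])
    return encoded


# Toy codebook with three models and a query made of two primitives.
codebook = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (5.0, 5.0)}
query_primitives = [(0.1, 0.0), (4.0, 4.0)]
print(encode_primitives(query_primitives, codebook, n_c=2))
# [[0, 1], [2, 1]]
```

Keeping several candidate labels per primitive lets the later string matching accept a line label that agrees with any of them, which is one way to tolerate noisy primitives.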

Dataset description

In 2002, the “Centre d'Études Supérieures de la Renaissance” (CESR) of Tours created the Humanistic Virtual Library [34]. At present, this virtual library contains bitmap versions of several books. The collection of precious historical books currently numbers around 3000 copies dating from the middle of the XIV century to the beginning of the XVII century. Some of them have already been scanned or photographed and made available. Latin and French are the most frequent languages used in

Conclusion

We have presented a robust and fast word spotting system for historical documents. A two-level, coarse-to-fine approach is proposed to increase the robustness and speed of the retrieval process. We use connected components as well as glyph primitives for indexing. During the querying step, the query primitives are used to look up their possible locations through the indexed locations. Finally, an error-tolerant string matching algorithm is used to retrieve the similar words from

Acknowledgments

This work has been supported by the AAP program of Université François Rabelais, Tours, France (2010–2011) (AAP-UFRT-2010-06) and by the Google Digital Humanities Research Awards (2010) given to the Computer Science Laboratory of Tours (RFAI team). Thanks to CESR for providing datasets and valuable discussions which helped us to improve our system.

References (37)

  • Y. Leydier et al., Towards an omnilingual word retrieval system for ancient manuscripts, Pattern Recogn. (2009)
  • U. Pal et al., Touching numeral segmentation using water reservoir concept, Pattern Recogn. Lett. (2003)
  • R.F. Moghaddam et al., Application of multi-level classifiers and clustering for automatic word spotting in historical document images
  • V. Frinken et al., A novel word spotting method based on recurrent neural networks, IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • L. Huang et al., Keyword spotting in unconstrained handwritten Chinese documents using contextual word model, Image Vis. Comput. (2013)
  • T. Rath et al., Word image matching using dynamic time warping
  • K. Terasawa et al., Slit style hog feature for document image word spotting
  • M.C. Fairhurst et al., A synthesised word approach to word retrieval in handwritten documents, Pattern Recogn. (2012)
  • B. Gatos et al., Segmentation-free word spotting in historical printed documents
  • M. Rusiñol et al., Browsing heterogeneous document collections by a segmentation-free word spotting method
  • S. Lu et al., Document image retrieval through word shape coding, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
  • P.P. Roy et al., Word retrieval in historical document using character-primitives
  • P.P. Roy et al., An efficient coarse-to-fine indexing technique for fast text retrieval in historical documents
  • T. Nakayama, Modeling content identification from document images
  • W.J. Williams et al., Word spotting in bitmapped fax documents, Inf. Retr. (2000)
  • T. Rath et al., Word spotting for historical documents, Int. J. Doc. Anal. Recognit. (2005)
  • B. Zhang et al., Word image retrieval using binary features
  • T. Adamek et al., Word matching using single closed contours for indexing handwritten historical documents, Int. J. Doc. Anal. Recognit. (2007)

This paper has been recommended for acceptance by Seong-Whan Lee.

1 Tel.: +33 247 361 432.
