Word spotting in historical documents using primitive codebook and dynamic programming
Introduction
Text search in historical documents is attracting growing attention in the Document Image Analysis (DIA) research community because of its complexity and the increasing need to access the content of digitized books. In recent years, libraries and museums have been digitizing historical documents on a large scale and making this digital information available to users through web portals. Through these portals, users can only view the pages that were digitized. Searching by content (e.g. by word) is available only if the corresponding pages have been transcribed. In historical documents, degradation caused by aging, stains, repetitive use, etc. makes character recognition a hard task, and proper extraction of characters for recognition is difficult. Incorrect segmentation of severely touching or broken characters is still one of the main causes of errors in segmentation-based recognition approaches [1]. Most word segmentation methods analyze the space between characters [2]. Because the spacing between characters and words is sometimes non-uniform, it is difficult to segment words perfectly. Some pages of a historical book may also contain text in different fonts. Thus, the recognition method needs to be robust to word segmentation errors and able to handle different fonts. Fig. 1 shows two document images from our collection that illustrate some of the issues described above. Automatic transcription of these books by the available commercial OCR systems is not yet satisfactory, and manual transcription of the archive is not feasible because of the large volume of data.
When processing such degraded documents, word spotting techniques [3], [4], [5], [6], [7], an alternative to OCR, are useful for finding the possible instances of a specific query word. These approaches do not require the recognition of every letter of the query word or the target words and can therefore retrieve similar words in the presence of small distortions. The features are generally computed from the whole word, and the methods look for similar features in the target images. One bottleneck of these word spotting methods is that most of them require a word segmentation step prior to matching. If the words are not segmented properly, the features in the target image do not match, and these words cannot be retrieved. To overcome this problem, some segmentation-free methods [8], [9] have recently been proposed, but their computational cost is too high for use in a real search application.
The goal of this work is to propose an efficient indexing scheme that can search the text information in historical archives better and faster. To overcome the limitations of OCR, we propose to use the Query By Example (QBE) principle in such a way that the user's query image can be searched efficiently in a large volume of historical documents. Retrieval of text information is fast, and it helps the user browse relevant information by overcoming the problems that restrict OCR processing of historical books. Our approach avoids the difficulties of segmentation-based word spotting methods by not requiring complete word segmentation beforehand. It relies only on text line segmentation, which is a relatively easy part of layout segmentation for printed documents. The heavy search cost of segmentation-free word spotting methods is also avoided by encoding the text line image information as strings and matching these strings.
Shape coding has been used efficiently to encode the words in printed documents [10]. Inspired by this idea, the proposed approach uses text primitive segmentation for word retrieval. We describe the text content (each text line) of historical books by basic feature shapes called primitives. A primitive consists of a single character or a part of a character. Primitive segmentation is performed using background information of the text image, which is handled with the water reservoir concept [11]. After primitive extraction, similar primitives are grouped using a shape matching algorithm and a codebook of primitives is built. During indexing, the text content of the book is encoded using the previously generated codebook of primitives. During retrieval, a query word is also encoded as a string of primitives from the codebook. Next, a sub-string matching algorithm is applied to each encoded string in the documents to retrieve the query word. To make the retrieval process efficient, the encoding uses two different codebooks of primitives: a coarse one corresponding to connected components and a fine one corresponding to glyphs (explained in Fig. 5). Similarly, during the querying step, the coarse and the fine signatures are generated from the query image. Sub-string matching is performed by a dynamic programming based approximate string matching algorithm. A bi-level matching is done to find similar words: the coarse approach is used first, and the fine approach is applied at the predetermined hypothetical locations only if necessary. This work is motivated by our preliminary work presented in [12], [13]. The current work is an improved version with more details and extensive experiments that examine the different aspects of the methodology.
Our approach takes each text line as input for word spotting because the segmentation of text lines in printed historical documents is relatively easy. Thus, we avoid the most difficult task of exact word segmentation in a document. The main contributions of this paper are the use of coarse and fine level text portions (primitives) instead of the whole word, and the encoding of the text with these primitives for indexing. One advantage is that the method searches for possible words efficiently using the coarse level of primitive shapes (i.e. connected components) first. Then, if necessary, it uses fine primitives to detect strings of touching and broken characters. This two-level search is robust to degradation such as touching or broken characters because fine level matching is used when coarse level matching fails. As the method searches for the query word at the string level, using string matching in terms of primitives, the response time is short. The proposed approach can be applied to different scripts because it uses a dynamic codebook vocabulary for text encoding. Fig. 2 shows some retrieval examples of our word spotting system on a set of text lines where word segmentation would have been a major issue.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 explains in detail the proposed coarse-to-fine indexing approach for text encoding. Section 4 discusses the word retrieval process when a query is provided. Section 5 presents the experimental results on datasets of different scripts. Finally, Section 6 concludes the paper.
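The sub-string matching step described above can be illustrated with a minimal sketch of dynamic programming based approximate sub-string matching. It assumes that a text line and a query are already encoded as sequences of codebook labels and that insertions, deletions and substitutions all cost 1; the function name `approx_substring_matches`, the parameter `max_cost` and the unit costs are illustrative assumptions, not the cost model used in the paper.

```python
def approx_substring_matches(query, line, max_cost):
    """Find where `query` approximately occurs inside `line`.

    Both arguments are sequences of codebook labels (hypothetical encoding).
    Returns a list of (end_index, cost) pairs: `end_index` is the position in
    `line` where a match ends, `cost` is the minimal edit cost of aligning the
    whole query with some sub-string of `line` ending there.
    Unit edit costs are an assumption made for this sketch.
    """
    m = len(query)
    # prev[i]: cost of aligning query[:i] with a sub-string ending just
    # before the current position. Before any symbol is read, the only
    # option is to delete i query symbols.
    prev = list(range(m + 1))
    hits = []
    for j, symbol in enumerate(line):
        # curr[0] = 0 lets a match start at any position of the line,
        # which is what turns edit distance into sub-string matching.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            substitute = prev[i - 1] + (query[i - 1] != symbol)
            skip_line_symbol = prev[i] + 1
            skip_query_symbol = curr[i - 1] + 1
            curr[i] = min(substitute, skip_line_symbol, skip_query_symbol)
        if curr[m] <= max_cost:
            hits.append((j, curr[m]))
        prev = curr
    return hits


if __name__ == "__main__":
    encoded_line = [7, 7, 3, 1, 4, 7, 3, 9, 4]   # hypothetical label string
    encoded_query = [3, 1, 4]
    for end, cost in approx_substring_matches(encoded_query, encoded_line, 1):
        print(f"match ending at position {end} with cost {cost}")
```

In a coarse-to-fine setting, a routine of this kind could be run first on the coarse (connected component) label strings and then on the fine (glyph) label strings only around the candidate positions it returns.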
Section snippets
Related works
Text searching without OCR provides an alternative approach for indexing and retrieval of text information in degraded images. Query word spotting is a content-based retrieval approach that ranks a list of target word images by their similarity to a query word image. It treats each word as a whole entity and thus avoids the difficulty of character segmentation and recognition. Broadly, there are two approaches to word spotting: the segmentation-based
Preprocessing and layout analysis
The historical documents in our collections have complex layouts and show degradation effects. In some of the documents, graphical decorations may appear outside the text part (see Fig. 1(a)). Drop caps, figures, etc. often appear between text portions, which makes layout analysis difficult. To perform the coarse text line segmentation, we have incorporated the AGORA [25] tool in our system because of its good performance in layout analysis. The superiority of AGORA is mainly due to the
Text retrieval from query image
To search for a query text image Q in a collection of document images, Q is first segmented into primitives using the bi-level approach (CC and glyph) explained in Section 3.2. Next, for each primitive, we search for the most similar codebook models of μ using a similarity matching criterion. Q is then encoded into a sequence of labels L_q1 L_q2 … L_qt, where L_qi ∈ L_m and t is the number of primitives in Q. To handle noise, the Nc nearest codebook models are chosen for each query primitive. Finding text
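As a rough illustration of this encoding step, the sketch below assigns to each query primitive the labels of its Nc nearest codebook models. It assumes that every primitive and every codebook model is represented by a fixed-length feature vector compared with the Euclidean distance; this representation, the function names `nearest_labels` and `encode_query`, and the toy codebook are assumptions made for the example, since the similarity criterion actually used is not given in this excerpt.

```python
import math


def nearest_labels(feature, codebook, n_c):
    """Return the labels of the n_c codebook models closest to `feature`.

    `codebook` maps a label to a feature vector (hypothetical representation).
    Euclidean distance stands in for the paper's similarity criterion.
    """
    ranked = sorted(codebook, key=lambda label: math.dist(feature, codebook[label]))
    return ranked[:n_c]


def encode_query(primitives, codebook, n_c=2):
    """Encode a query as one list of candidate labels per primitive.

    Keeping n_c candidates per primitive, instead of a single label,
    makes the encoding more tolerant to noisy primitives.
    """
    return [nearest_labels(p, codebook, n_c) for p in primitives]


if __name__ == "__main__":
    # Toy codebook with two-dimensional features, for illustration only.
    toy_codebook = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 5.0)}
    query_primitives = [(0.1, 0.0), (4.0, 4.0)]
    print(encode_query(query_primitives, toy_codebook, n_c=2))
```

With several candidate labels per position, a subsequent string matching step can treat a position as matched when the text label equals any of the candidates.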
Dataset description
In 2002, the “Centre d'Études Supérieures de la Renaissance” (CESR) of Tours created the Humanistic Virtual Library [34]. At present, this virtual library contains bitmap versions of several books. It holds a collection of precious historical books, currently numbering around 3000 copies, dating from the middle of the XIV century to the beginning of the XVII century. Some of them have already been scanned or photographed and made available. Latin and French are the most frequent languages used in
Conclusion
We have presented a robust and fast word spotting system for historical documents. A two-level, coarse-to-fine approach is proposed to increase the robustness and speed of the retrieval process. We use connected components as well as glyph primitives for indexing. During the querying step, the query primitives are used to find the possible locations of matching primitives through the indexed locations. Finally, an error tolerant string matching algorithm is used to retrieve the similar words from
Acknowledgments
This work has been supported by the AAP program of Université François Rabelais, Tours, France (2010–2011) (AAP-UFRT-2010-06) and by the Google Digital Humanities Research Awards (2010) given to the Computer Science Laboratory of Tours (RFAI team). Thanks to CESR for providing datasets and valuable discussions which helped us to improve our system.
References (37)
- et al. Towards an omnilingual word retrieval system for ancient manuscripts. Pattern Recogn. (2009)
- et al. Touching numeral segmentation using water reservoir concept. Pattern Recogn. Lett. (2003)
- et al. Application of multi-level classifiers and clustering for automatic word spotting in historical document images.
- et al. A novel word spotting method based on recurrent neural networks. IEEE Trans. Pattern Anal. Mach. Intell. (2012)
- et al. Keyword spotting in unconstrained handwritten Chinese documents using contextual word model. Image Vis. Comput. (2013)
- et al. Word image matching using dynamic time warping.
- et al. Slit style hog feature for document image word spotting.
- et al. A synthesised word approach to word retrieval in handwritten documents. Pattern Recogn. (2012)
- et al. Segmentation-free word spotting in historical printed documents.
- et al. Browsing heterogeneous document collections by a segmentation-free word spotting method.
- Document image retrieval through word shape coding. PAMI
- Word retrieval in historical document using character-primitives.
- An efficient coarse-to-fine indexing technique for fast text retrieval in historical documents.
- Modeling content identification from document images.
- Word spotting in bitmapped fax documents. Inf. Retr.
- Word spotting for historical documents. Int. J. Doc. Anal. Recognit.
- Word image retrieval using binary features.
- Word matching using single closed contours for indexing handwritten historical documents. Int. J. Doc. Anal. Recognit.