Elsevier

Pattern Recognition

Volume 45, Issue 7, July 2012, Pages 2598-2609
Word spotting in historical printed documents using shape and sequence comparisons

https://doi.org/10.1016/j.patcog.2011.10.013

Abstract

Information spotting in scanned historical document images is a very challenging task. The joint use of the mechanical press and of human-controlled inking introduced great variability in ink level within a book or even within a page. Consequently, characters are often broken or merged together and thus become difficult to segment and recognize. The limitations of commercial OCR engines for information retrieval in historical document images have inspired alternative means of identifying given words in such documents. We present a word spotting method for scanned documents that finds the word images similar to a query word without assuming a correct segmentation of the words into characters. The connected components are first processed to transform a word pattern into a sequence of sub-patterns. Each sub-pattern is represented by a sequence of feature vectors. A modified Edit distance is proposed to perform a segmentation-driven string matching and to compute the Segmentation Driven Edit (SDE) distance between the words to be compared. The set of SDE operations is defined to obtain the word segmentations that are the most appropriate to evaluate their similarity. These operations cope efficiently with broken and touching characters in words. The distortion of character shapes is handled by coupling the string matching process with local shape comparisons that are achieved by Dynamic Time Warping (DTW). The costs of the SDE operations are provided by the DTW distances. A sub-optimal version of the SDE string matching is also proposed to reduce the computation time; it did not lead to a great decrease in performance. A query can be entered either by example or as text typed on the keyboard. Textual queries can be used to directly spot the word without the need to synthesize its image, provided that character prototype images are available. Results are presented for different documents and compared with other methods, showing the efficiency of our method.

Highlights

► Word spotting enables information retrieval in historical digital libraries.
► The matching of word images tolerates inaccurate segmentation of words into characters.
► Word segmentation is performed in the course of the matching process.
► The method is based on coupling local shape comparisons with the comparison of shape sequences.
► A sub-optimal version of the method speeds up word spotting with only a slight performance decrease.

Introduction

The importance of digital libraries for information retrieval cannot be denied. Historical collections are of interest to many people, such as historians, students and scholars, who need to study the historical originals. These documents contain invaluable knowledge that is made widely accessible thanks to digital libraries. Unfortunately, digitization alone is not enough to satisfy the users who try to retrieve information from these documents. Finding particular regions of interest in a digital document is made easier by the possibility of searching for keywords in huge sets of page images. Character recognition is necessary to facilitate this task. Professional OCR engines, designed for different languages, especially those using Latin alphabets, give excellent recognition results on scanned images of contemporary good-quality documents. However, when used with ancient documents that have undergone degradations, discussed in detail in [1], [2], recognition results drop significantly. The use of the mechanical press and the imperfect control of the ink level have introduced specific difficulties in historical documents, such as broken and touching characters that may prevent the correct segmentation of the words into characters from being found automatically.

Word spotting is a relatively new alternative for information retrieval in ancient document images. It makes it possible to retrieve all the document images or passages that contain words similar to a query word by matching the image of a given query word with the word images of the documents. Research in this field has been ongoing for some time, and different methods, discussed in detail in the next section, have already been proposed for efficient word spotting, but there is still room for improvement. Most of the methods were developed for handwritten words for which recognition, in the case of unconstrained vocabulary, is much more difficult than the matching of word images. For printed texts, the main advantage of matching over recognition is the possibility to search for words written with the unconventional fonts that are often encountered in historical documents and are not recognized by current OCR systems.

Our work in this domain aims to facilitate the information search by spotting the different instances of a given query word in documents. Word matching has to handle local shape distortions as well as inexact segmentation of the words to compare. We present a novel method for word spotting that can work efficiently for printed document images, with word image or textual queries, to search the required information. A segmentation-driven matching is proposed to transform the words into sequences of sub-patterns that are the most appropriate to evaluate the similarity of the words without the need to find the correct segmentation of the words into characters. This is achieved by coupling local shape comparisons at sub-pattern level and string comparison at word level. This two-level representation copes efficiently with inexact word segmentation and shape distortions during the matching phase. It also makes it possible to construct word queries from ASCII entries without the need to create the corresponding word images.

This paper is organized as follows: Section 2 presents word-spotting methods that include the principle of our own proposal. The document image processing performed prior to word spotting in our system is then described in Section 3. Section 4 describes the segmentation-driven algorithms used for word comparison. The way word indexing is achieved and used in querying is the subject of Section 5. The results of word-spotting experiments are given in Section 6. Several tests for evaluating intermediate processing stages are described throughout this paper.

Section snippets

Related works

Word spotting is an alternative to text recognition of the whole document. It is valuable to search for words of interest in a document when the text format is not available or when the recognized text contains too many errors. Most of the work in the field of word spotting has been done on handwritten documents [3], [4], [5], [6], [7], [8], [9], [10], [11]. The reason for that mainly lies in the difficulty of automating the recognition of irregular writing styles. Printed document images are

Document image processing

The aim of the process is to build an index file in order to help information retrieval in the document image. In this section we focus on the content of this index file. The process is broken down into sub-tasks, including binarization, word/graphic discrimination and word segmentation into a sequence of S-characters. The data stored to index the words are associated with the local level of word representation where the shape information is captured. Before presenting the shape

Word comparison

For word spotting, we propose a multi-step comparison process to retrieve the words similar to the query. The aim of the first step is to reduce the number of words to be compared with the query. A coarse criterion rapidly eliminates a large number of words from the candidates to be compared without eliminating the relevant words. For two words to be considered eligible for matching, we have set bounds on the ratio of their lengths. If this ratio does not lie within a specific interval
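As an illustration only, the coarse length-ratio filter mentioned in this snippet could look like the sketch below. The snippet does not state the interval that is actually used, so the bounds `low` and `high`, the function names, and the use of pixel widths as word lengths are all assumptions made for the example.

```python
from typing import List, Tuple


def is_candidate(query_len: float, word_len: float,
                 low: float = 0.8, high: float = 1.25) -> bool:
    """Return True when word_len / query_len lies within [low, high].

    The default bounds are hypothetical placeholders, not values
    taken from the paper.
    """
    if query_len <= 0 or word_len <= 0:
        return False  # degenerate lengths are never eligible
    ratio = word_len / query_len
    return low <= ratio <= high


def prefilter(query_len: float,
              candidates: List[Tuple[str, float]],
              low: float = 0.8, high: float = 1.25) -> List[Tuple[str, float]]:
    """Keep only the (word_id, length) pairs that pass the ratio test."""
    return [(word_id, length) for word_id, length in candidates
            if is_candidate(query_len, length, low, high)]


if __name__ == "__main__":
    # Hypothetical word widths in pixels.
    words = [("w1", 90.0), ("w2", 200.0), ("w3", 120.0), ("w4", 50.0)]
    print(prefilter(100.0, words))
```

Only the words that survive this cheap test would then be passed to the more expensive matching steps.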

Document indexing and querying

In this section, we give an overview of the indexing process that creates an index data file for each document image, and of the different modes that are offered to formulate a word query. The first step of the indexing process is the computation of the word representation as described in Section 3. As it is a time-consuming process, document image indexing is done beforehand to allow a rapid information search. A file is associated with each document image and contains the coordinates of each word in

Experimental results

To analyze how the proposed methods perform in comparison with the state of the art, we implemented and tested the method of [5], in which four feature sequences are extracted from word images and two words are compared by matching these features using the DTW algorithm at word level. We also compared the results with the classic Edit distance based method presented in [39]. In addition, we compared our method with the commercial OCR software ABBYY FineReader [40] on the same dataset.
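For readers unfamiliar with the two baseline techniques named above, the following is a minimal sketch of a textbook DTW distance between two sequences of feature vectors and of the classic Edit (Levenshtein) distance between two symbol strings. It is not the implementation of [5] or [39]: the Euclidean local cost, the absence of a warping window or path-length normalization, and the unit edit costs are assumptions chosen for simplicity.

```python
import math
from typing import List, Sequence


def dtw_distance(a: Sequence[Sequence[float]],
                 b: Sequence[Sequence[float]]) -> float:
    """Textbook DTW between two sequences of equal-dimension feature vectors.

    Local cost: Euclidean distance (an assumption). No warping-window
    constraint and no normalization by path length are applied.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return math.inf  # no alignment exists for an empty sequence
    d: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def edit_distance(s: Sequence, t: Sequence) -> int:
    """Classic Edit distance with unit insertion, deletion, substitution costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical one-dimensional column features for two word images.
    x = [[0.0], [1.0], [2.0]]
    y = [[0.0], [1.0], [1.0], [2.0]]
    print(dtw_distance(x, y))
    print(edit_distance("kitten", "sitting"))
```

DTW absorbs local stretching of a shape (the repeated column in `y` costs nothing), whereas the Edit distance counts discrete insertions, deletions and substitutions of whole symbols.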

Conclusion

This work provides a thorough examination of segmentation-based retrieval techniques for historical document images. Our system allows queries either in the form of a word image or as ASCII text. The main contribution is a segmentation-based method that is not dependent on perfect character segmentation. The proposed approach for word spotting is based on two-level processing. This is achieved by coupling string comparison at word level with local comparison using a DTW distance. When we

References (41)

  • A. Kolcz et al., A line-oriented approach to word spotting in handwritten documents, Pattern Analysis and Applications (2000)
  • K. Terasawa, Y. Tanaka, Slit style HOG feature for document image word spotting, in: Proceedings of the Tenth...
  • J.A. Rodriguez-Serrano, F. Perronnin, Handwritten word-image retrieval with synthesized typed queries, in: Proceedings...
  • Bin Zhang et al., Word image retrieval using binary features
  • B. Gatos, I. Pratikakis, Segmentation-free word spotting in historical printed documents, in: Proceedings of the Tenth...
  • T. Konidaris et al., Keyword-guided word spotting in historical printed documents using synthetic data and user feedback, International Journal on Document Analysis and Recognition (2007)
  • A. Andreev, N. Kirov, Word image matching based on Hausdorff distances, in: Proceedings of the Tenth International...
  • S. Marinai, S. Faini, E. Marino, G. Soda, Efficient word retrieval by means of SOM clustering and PCA, in: Workshop on...
  • T. Konidaris, B. Gatos, S. Perantonis, A. Kesidis, Keyword matching in historical machine-printed documents using...
  • G. Vamvakas, B. Gatos, N. Stamatopoulos, S.J. Perantonis, A complete optical character recognition methodology for...
  • Cited by (38)

    • Using keyword spotting systems as tools for the transcription of historical handwritten documents: Models and procedures for performance evaluation

      2020, Pattern Recognition Letters
      Citation Excerpt :

      The region of the document image to label can be either produced by a preliminary segmentation step (segmentation-based) [32,33,34,35] or provided as a result of the keyword spotting (segmentation-free) [36,37,38]. Another important distinction is between lexicon-based KWS approaches, that rely on the presence of a predefined keyword list usually fixed during the training phase [6,9,26,27,28,29,30,31] and lexicon-free KWS, that do not rely on a predefined keyword list [19,20,21,22,23,24,25], or that can find new keywords to add to the keyword list, as it has been recently proposed [39,40]. The main purpose of the paper is not to refer at different approaches and techniques for keyword spotting but to focus on a performance model of a generic KWS that can be useful to understand and evaluate the convenience of its use for assisted transcription with respect to the manual one.

    • Comparative study of conventional time series matching techniques for word spotting

      2018, Pattern Recognition
      Citation Excerpt :

      This technique can be defined as the “localization of words of interest in the dataset without actually interpreting the content” and it allows to index or search inside a document using queries. For spotting words in handwritten manuscripts and historical printed document images, word images can be thought as 2D signals, that can be matched by sequence matching algorithms like DTW [14,17,32]. In other application domains, DTW’s variants have been intensively evaluated to demonstrate their interest [7,34], but they have not been clearly studied and compared in the case of word spotting.

    • A survey of document image word spotting techniques

      2017, Pattern Recognition
      Citation Excerpt :

      For instance, Sauvola’s technique [112] calculates a local threshold which is adapted to the neighborhood of each pixel according to the local mean value and the local standard deviation inside the neighborhood which is defined by a sliding window. Methods based on local thresholding can be found in [40,44,113–115]. Some methods also include an image enhancement step.

    Khurram Kurshid received his Master's degree in Image Processing in 2006 from the University Paris Descartes, France. He is currently working as a Ph.D. candidate on word spotting. His research interests include document analysis and pattern recognition and their applications.

    Claudie Faure received the degree in Physics from the University of Nice, France. She then studied Computer Science and Signal Processing at the University of Paris XI. She received the Doctor of Sciences degree in 1982. From 1976 to 1985 she worked in Pattern Recognition at the University of Compiègne, France. Since 1985, she has been with the Information Processing and Communication Laboratory (LTCI) of Telecom-ParisTech. She has been a CNRS researcher since 1975. Her research interests are pattern recognition systems, gesture-based human-computer interaction, document image analysis and visual perception.

    Nicole Vincent has been a full Professor since 1996. She presently heads the research group Systèmes Intelligents de Perception (SIP) at the Laboratoire d'Informatique Paris Descartes (LIPADE) in the University Paris Descartes (Paris 5). After studying at the Ecole Normale Supérieure and graduating in Mathematics, Nicole Vincent received a Ph.D. in Computer Science in 1988 from INSA Lyon. She has been involved with several projects in pattern recognition, signal and image processing and video analysis. Her research interests concern document image analysis, image retrieval and video sequence analysis.