Word spotting in historical printed documents using shape and sequence comparisons
Highlights
► Word spotting enables information retrieval in historical digital libraries.
► The matching of word images tolerates inaccurate segmentation of words into characters.
► Word segmentation is performed in the course of the matching process.
► The method is based on coupling local shape comparisons with the comparison of shape sequences.
► A sub-optimal version of the method speeds up word spotting with only a slight decrease in performance.
Introduction
Digital libraries have become essential for information retrieval. Historical collections are of interest to many people, such as historians, students and scholars, who need to study the historical originals. These documents contain invaluable knowledge that digital libraries make widely accessible. Unfortunately, digitization alone is not enough to satisfy users who try to retrieve information from these documents. Finding particular regions of interest in a digital document becomes easy only when key words can be searched for in huge sets of page images, and character recognition is necessary to make this possible. Professional OCR engines, designed for different languages and especially for Latin alphabets, give excellent recognition results on scanned images of good-quality contemporary documents. However, when they are applied to ancient documents that have undergone the degradations discussed in detail in [1], [2], recognition rates drop significantly. The use of the mechanical press and the imperfect control of the ink level introduced specific difficulties in historical documents, such as broken and touching characters, which may prevent the words from being correctly segmented into characters automatically.
Word spotting is a relatively new alternative for information retrieval in ancient document images. It retrieves all the document images or passages that contain words similar to a query word by matching the image of the query word with the word images of the documents. Several methods have already been proposed for efficient word spotting; they are discussed in detail in the next section, and their performance can still be improved. Most of these methods were developed for handwritten words, for which recognition with an unconstrained vocabulary is much more difficult than the matching of word images. For printed texts, the main advantage of matching over recognition is the possibility of searching for words printed in the unconventional fonts that are often encountered in historical documents and are not recognized by current OCR systems.
Our work in this domain aims to facilitate information search by spotting the different instances of a given query word in documents. Word matching has to handle local shape distortions as well as inexact segmentation of the words to be compared. We present a novel word-spotting method that works efficiently on printed document images and accepts either word-image or textual queries. A segmentation-driven matching is proposed that transforms the words into the sequences of sub-patterns most appropriate for evaluating their similarity, without the need to find the correct segmentation of the words into characters. This is achieved by coupling local shape comparisons at sub-pattern level with string comparison at word level. This two-level representation copes efficiently with inexact word segmentation and shape distortions during the matching phase. It also makes it possible to construct word queries from ASCII entries without creating the corresponding word images.
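The two-level idea described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that each sub-pattern is already described by a fixed-length feature vector, it uses a hypothetical local cost (`shape_distance`) and a hypothetical insertion/deletion cost (`gap_cost`), and it omits the merging of sub-patterns that a segmentation-driven matching would need. It only shows how a string comparison at word level can take its substitution costs from a local shape comparison.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def shape_distance(a: Vector, b: Vector) -> float:
    """Hypothetical local shape cost: mean absolute difference
    between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def word_distance(
    query: Sequence[Vector],
    candidate: Sequence[Vector],
    local_cost: Callable[[Vector, Vector], float] = shape_distance,
    gap_cost: float = 1.0,  # assumed cost of inserting or deleting a sub-pattern
) -> float:
    """Edit distance between two sequences of sub-patterns, where the
    substitution cost is given by the local shape comparison."""
    n, m = len(query), len(candidate)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap_cost
    for j in range(1, m + 1):
        d[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + gap_cost,  # delete a query sub-pattern
                d[i][j - 1] + gap_cost,  # insert a candidate sub-pattern
                d[i - 1][j - 1] + local_cost(query[i - 1], candidate[j - 1]),
            )
    return d[n][m]
```

In this sketch a small `word_distance` means that the candidate is a likely instance of the query; the threshold used to accept a match would have to be tuned on real data.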
This paper is organized as follows: Section 2 presents word-spotting methods and introduces the principle of our own proposal. The document image processing performed prior to word spotting in our system is described in Section 3. Section 4 describes the segmentation-driven algorithms used for word comparison. Section 5 explains how word indexing is achieved and used in querying. The results of word-spotting experiments are given in Section 6. Several tests that evaluate intermediate processing stages are described throughout the paper.
Section snippets
Related works
Word spotting is an alternative to text recognition of the whole document. It is valuable for finding words of interest in a document when the text format is not available or when the recognized text contains too many errors. Most of the work in the field of word spotting has been done on handwritten documents [3], [4], [5], [6], [7], [8], [9], [10], [11]. The main reason is the difficulty of automating the recognition of irregular writing styles. Printed document images are
Document image processing
The aim of the process is to build an index file in order to help information retrieval in the document image. In this section we focus on the content of this index file. The process is broken down into sub-tasks, including binarization, word/graphic discrimination and word segmentation into a sequence of S-characters. The data stored to index the words are associated with the local level of word representation where the information of shape is captured. Before presenting the shape
Word comparison
For word spotting, we propose a multi-step comparison process to retrieve the words similar to the query. The aim of the first step is to reduce the number of words to be compared with the query. A coarse criterion rapidly eliminates a large number of candidate words without eliminating the relevant ones. For two words to be considered eligible for matching, we have set bounds on the ratio of their lengths. If this ratio does not lie within a specific interval
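The coarse filtering step can be sketched as follows. The interval bounds (`low`, `high`) are hypothetical values chosen for illustration, since the excerpt does not state them, and the length is assumed to be any positive measure that is comparable between words (for instance a width in pixels).

```python
from typing import List, Sequence


def is_eligible(query_length: float, candidate_length: float,
                low: float = 0.8, high: float = 1.25) -> bool:
    """Return True when the length ratio lies inside [low, high].

    The default bounds are assumptions, not values from the paper.
    """
    if query_length <= 0 or candidate_length <= 0:
        return False
    ratio = candidate_length / query_length
    return low <= ratio <= high


def prefilter(query_length: float, candidate_lengths: Sequence[float],
              low: float = 0.8, high: float = 1.25) -> List[int]:
    """Return the indices of the candidates that remain eligible for
    the more expensive matching steps."""
    return [
        index
        for index, length in enumerate(candidate_lengths)
        if is_eligible(query_length, length, low, high)
    ]
```

Because the test involves a single division per word, it can discard most of the candidates before any shape comparison is computed.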
Document indexing and querying
In this section, we give an overview of the indexing process that creates an index data file for each document image, and of the different modes offered to formulate a word query. The first step of the indexing process is the computation of the word representation described in Section 3. As this is a time-consuming process, document image indexing is done beforehand to allow a rapid information search. A file is associated with each document image and contains the coordinates of each word in
Experimental results
To analyze how the proposed methods perform in comparison with the state of the art, we implemented and tested the method of [5], in which four feature sequences are extracted from each word image and two words are compared by matching these sequences with the DTW algorithm at word level. We also compared our results with the classic edit-distance-based method presented in [39]. In addition, we compared our method with the commercial OCR software ABBYY FineReader [40] on the same dataset.
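For readers unfamiliar with DTW, the following minimal sketch shows how two sequences of feature vectors can be aligned at word level. It is a generic textbook formulation, not the exact implementation of [5]: the squared Euclidean local cost and the normalization by the sum of the sequence lengths are assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dtw_distance(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Dynamic time warping distance between two sequences of feature
    vectors (for example, one vector of features per image column)."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: squared Euclidean distance.
            cost = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1]))
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    # Assumed normalization, so that words of different widths are comparable.
    return d[n][m] / (n + m)
```

Two sequences that differ only by local stretching, such as `[(0,), (1,), (2,)]` and `[(0,), (1,), (1,), (2,)]`, obtain a distance of zero, which is the property that makes DTW tolerant to variations in character width.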
Conclusion
This work provides a thorough examination of segmentation-based retrieval techniques for historical document images. Our system allows queries either in the form of a word image or as ASCII text. The main contribution is a segmentation-based method that does not depend on perfect character segmentation. The proposed approach to word spotting is based on two-level processing, achieved by coupling string comparison at word level with local comparison using a DTW distance. When we
References (41)
- et al., Text search for medieval manuscript images, Pattern Recognition (2007)
- et al., Character segmentation in handwritten words: an overview, Pattern Recognition (1996)
- et al., Shape recognition using attributed string matching with polygon vertices as the primitives, Pattern Recognition Letters (2002)
- An optimised minimal edit distance for hand-written word recognition, Pattern Recognition Letters (1995)
- H.S. Baird, Difficult and urgent open problems in document image analysis for libraries, in: Proceedings of First...
- A. Antonacopoulos, D. Karatzas, H. Krawczyk, B. Wiszniewski, The life-cycle of a digital historical document:...
- et al., Word matching using single closed contours for indexing handwritten historical documents, International Journal on Document Analysis and Recognition (2007)
- et al., The role of holistic paradigms in hand-written word recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001)
- et al., Word spotting for historical documents, International Journal on Document Analysis and Recognition (2007)
- J.L. Rothfeder, S. Feng, T.M. Rath, Using corner features correspondences to rank word images by similarity, in:...
- A line-oriented approach to word spotting in handwritten documents, Pattern Analysis and Applications
- Word image retrieval using binary features
- Keyword-guided word spotting in historical printed documents using synthetic data and user feedback, International Journal on Document Analysis and Recognition
Khurram Kurshid received his Master's degree in Image Processing in 2006 from the university Paris Descartes, France. He is currently a Ph.D. candidate working on word spotting. His research interests include document analysis and pattern recognition and their applications.
Claudie Faure received a degree in Physics from the University of Nice, France. She then studied Computer Science and Signal Processing at the University of Paris XI. She received the Doctor of Sciences degree in 1982. From 1976 to 1985 she worked in Pattern Recognition at the University of Compiègne, France. Since 1985, she has been with the Information Processing and Communication Laboratory (LTCI) of Telecom-ParisTech. She has been a CNRS researcher since 1975. Her research interests are pattern recognition systems, gesture-based human-computer interaction, document image analysis and visual perception.
Nicole Vincent has been a full Professor since 1996. She presently heads the research group Systèmes Intelligents de Perception (SIP) at the Laboratoire d'Informatique Paris Descartes (LIPADE) of the university Paris Descartes, Paris 5. After studying at the Ecole Normale Supérieure and graduating in Mathematics, Nicole Vincent received a Ph.D. in Computer Science in 1988 from Lyon Insa. She has been involved in several projects in pattern recognition, signal and image processing and video analysis. Her research interests concern document image analysis, image retrieval and video sequence analysis.