A survey of document image word spotting techniques
Introduction
A great amount of information held in libraries and cultural institutions all over the world needs to be digitized so as to preserve it and protect it from frequent handling. Among others, Google has made an effort to digitize books on a large scale [1], [2], thereby providing support to the document understanding research community. In order to create digital libraries which allow efficient searching and browsing for future users, thousands of digitized documents have to be transcribed or at least indexed to a certain degree. However, the automatic recognition of poor-quality printed text and, especially, handwritten text is not feasible with traditional OCR approaches, which mainly suffice for modern printed documents with simple layouts and known fonts. Most of the constraints encountered by recognition systems stem from difficulties in segmenting characters or words, the variability of handwriting and the open vocabulary. For this reason, more flexible information retrieval and image analysis techniques are required.
The actual problem behind building digital libraries lies in the retrieval of digitized documents, that is, in the reliable extraction of and access to specific information. While a document image processing system analyzes different text regions so as to convert them to machine-readable text using OCR, a document image retrieval system determines whether a document image contains particular words of interest without the need for correct character recognition, by directly characterizing image features at character, word, line or even document level.
On the one hand, recognition-based retrieval relies on the complete recognition of documents either at character level using OCR, or at word level using word recognition methods. In the latter case, the goal is to correctly classify a query word into a labeled class, that is, to obtain its transcription. Most methods of this type require prior transcription of text lines, words or characters to train character or word models. During the search phase, a text dictionary or lexicon is used and only words from that lexicon can be used as candidate transcriptions in the recognition task. These methods usually rely on hidden Markov models (HMMs) [3], [4], conditional random fields (CRFs) [5] or neural networks (NNs) [6], [7], or they might follow a hybrid approach by combining different classifiers, such as support vector machines (SVMs) with HMMs [8], [9] or HMMs with NNs [10]. An obvious drawback of these approaches is that they have to deal with the inherent handwriting variability and handle a large number of word and character models. Nevertheless, recognition-based retrieval methods are outside the scope of this work and thus we only briefly refer to them.
On the other hand, recognition-free retrieval, also known in the literature as word spotting or keyword spotting, is the main subject of this study. The goal here is to retrieve all instances of user queries in a set of document images which may be segmented into text lines or words. In practice, the user formulates a query, and the system evaluates its similarity with the stored documents and returns as output a ranked list of the results which are most similar to the query. The process is based entirely on matching between common representations of features, such as color, texture, geometric shape or textual features, while conversion of whole documents into machine-readable format and recognition do not take place at all. Therefore, the selection and use of proper features and robust matching techniques are the most important aspects of a word spotting system.
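The retrieval loop described above (describe the query, score it against every stored word image, return a ranked list) can be illustrated with a minimal sketch. It assumes that every word image has already been reduced to a fixed-length feature vector and that cosine similarity is the matching function; the function names, the word-image identifiers and the toy three-dimensional descriptors are hypothetical and do not come from any of the surveyed methods.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no shape information
    return dot / (norm_a * norm_b)


def rank_word_images(query_vec, indexed_words):
    """Return (word_id, score) pairs sorted from most to least similar.

    indexed_words maps a (hypothetical) word-image id to its feature vector.
    """
    scored = [(word_id, cosine_similarity(query_vec, vec))
              for word_id, vec in indexed_words.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy descriptors (assumption); real systems use far richer features.
index = {
    "page1_word7": [1.0, 0.0, 0.0],
    "page2_word3": [0.9, 0.1, 0.0],
    "page5_word1": [0.0, 1.0, 0.0],
}
ranking = rank_word_images([1.0, 0.0, 0.0], index)
```

No transcription is produced at any point: the output is only an ordering of word images by similarity, which is what separates this paradigm from recognition-based retrieval.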
Word spotting methods may be divided into multiple categories according to various factors. Depending on how the input is specified by the user, we can distinguish query-by-example (QBE) from query-by-string (QBS) methods. In the QBE scenario, the user selects an image of the word to be searched for in the document collection, whereas in the QBS paradigm, the user provides an arbitrary text string as input to the system. Another way to categorize word spotting methods depends on whether training data are used offline, either to learn character and word models or to tune the parameters of the system. In this way we can distinguish learning-based from learning-free approaches. Finally, word spotting methods which can be directly applied to whole document pages are considered segmentation-free, in contrast with segmentation-based methods, where a segmentation step has to be applied at line or word level during preprocessing.
Word spotting was initially proposed in the speech recognition community [11]. Its application was adopted later on for printed [12], [13] and handwritten [14] document indexing. While early approaches were based on raw features extracted directly from image pixels [14], [15], the state of the art is to characterize document images with more complex features based on gradient information, shape structure, texture, etc. (see Section 4.1).
There are a variety of applications of word spotting for document indexing and retrieval including the following:
- retrieval of documents with a given word in company files,
- searching online in cultural heritage collections stored in libraries all over the world,
- automatic sorting of handwritten mail containing significant words (e.g. “urgent”, “cancelation”, “complain”) [16],
- identification of figures and their corresponding captions [17],
- keyword retrieval in pre-hospital care reports (PCR forms) [18],
- word spotting in graphical documents such as maps [19],
- retrieval of cuneiform structures from ancient clay tablets [20],
- assisting human transcribers in identifying words in degraded documents, especially those appearing for the first time.
Although word spotting and word recognition belong to two separate retrieval paradigms, they sometimes interact by assisting one another. For instance, the authors in [21] propose a keyword spotting approach relying on an NN-based recognition system. Conversely, in [22], word spotting serves as a means of bootstrapping a handwriting recognition system by selecting new elements from the retrieved results. These elements can be used to augment the training set through a semi-supervised procedure, thus increasing the final recognition accuracy while at the same time avoiding the costly manual annotation process.
In order to track the recent literature, we present some statistics related to the evolution of word spotting methods over the last decade. The research community concentrates on indexing historical documents on a grand scale using word spotting and thus, we consider that the whole process remains an open problem. To the best of our knowledge, Fig. 1 provides a concise view of the various word spotting approaches for offline, handwritten or printed documents which have been published in conferences and journals since 2007. As can be seen in Fig. 1, the number of papers has increased over the past few years, which confirms the growing interest of the community in word spotting.
Apart from the proposed methods, there also exist a number of surveys for word spotting, either for a specific script, or a particular domain (machine-printed, handwritten), or even for a variety of applications. Murugappan et al. [23] present a study for word spotting in printed documents. The authors divide the word spotting methods according to a character-based and a word-based representation depending on the features used in each case. Their work implies that character-based approaches provide satisfactory results if character segmentation is easy to obtain, whereas word-based approaches can deal with touching characters efficiently and analyze the shapes of the words without explicit character recognition. In addition, a comparative study for segmentation and word spotting methods is presented in [24] for handwritten and printed text in Arabic documents. The segmentation techniques rely on horizontal and vertical profile features and scale space segmentation. The features under comparison are geometrical moments and word profiles, whereas the similarity computation is carried out using the cosine metric and dynamic time warping (DTW). An explicit view of the various aspects of a word spotting system is presented by Marinai et al. [25]. Therein, the different features used for each technique are categorized according to the layer at which the similarity computation is applied (pixel/column features, connected components, word level features etc.). Image representations (i.e. feature vectors) with respect to the specific feature types are also analyzed along with the respective similarity measures. Finally, the work of Tan et al. [26] underlines the necessity for content-based image retrieval as an economical alternative to OCR, relying on proper selection of features, representation and similarity measures. Word spotting is defined under a framework of categories with respect to the word image representation.
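Since dynamic time warping is named above as one of the similarity computations compared in [24], a minimal sketch of the textbook formulation may help. It assumes each word image is described as a left-to-right sequence of per-column feature vectors (for example profile values) and uses the Euclidean distance as local cost; the function name and the toy one-dimensional profiles are illustrative assumptions, and the cited work may use a different variant (e.g. with path constraints or length normalization).

```python
def dtw_distance(seq_a, seq_b):
    """Classic DTW cost between two sequences of equal-dimension vectors."""

    def local_cost(u, v):
        # Euclidean distance between two column feature vectors.
        return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_cost(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_b element repeated
                                 cost[i][j - 1],      # seq_a element repeated
                                 cost[i - 1][j - 1])  # one-to-one step
    return cost[n][m]


# Toy one-dimensional column profiles (assumption, not real data).
profile_a = [[0.0], [1.0], [2.0], [1.0], [0.0]]
profile_b = [[0.0], [1.0], [1.0], [2.0], [1.0], [0.0]]  # stretched copy of a
profile_c = [[2.0], [2.0], [0.0], [0.0], [2.0]]

same_shape = dtw_distance(profile_a, profile_b)
different_shape = dtw_distance(profile_a, profile_c)
```

Because the warping path may repeat elements, the horizontally stretched copy `profile_b` aligns with `profile_a` at zero cost, whereas a fixed-length measure such as the cosine metric would first require both sequences to be resampled to the same length.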
Nevertheless, a considerable number of word spotting approaches proposed over the last years, as well as several techniques used to improve performance, remain unexplored. This survey aims to review the recently proposed methods and complete the missing parts of other studies in the word spotting literature. To this end, we analyze the nature of text sources along with the inherent difficulties addressed by word spotting methods. In addition to the main steps of the word spotting system, namely, feature extraction, representation and similarity computation, we also investigate the preprocessing stage with respect to binarization, segmentation and normalization techniques. Furthermore, we present the benefits accrued from relevance feedback methods employed in the retrieval phase of a word spotting task, either by involving the user to select true query instances or in a completely unsupervised way. Subsequently, we examine whether direct comparison among different methods is straightforward or not, since the evaluation measures and protocols applied for assessing the performance may differ substantially. Finally, we present the most commonly used datasets along with the experimental results published by the state-of-the-art methods and discuss the performance obtained in each case.
The rest of our work is structured as follows. In Section 2, we describe the challenges involved in document image word spotting. In Sections 3 and 4, we present the core steps of the word spotting pipeline with respect to the preprocessing and feature extraction, the input of the spotting system and the different similarity metrics applied among common representations upon the extracted features. Section 5 describes a number of techniques which enhance the retrieved results from the image matching step based on relevance feedback, data fusion and re-ranking. In Section 6, we present the most common datasets along with some distinct measures used to evaluate word spotting systems and examine the results achieved by the state-of-the-art methods. Finally, conclusions are drawn in Section 7.
Section snippets
Challenges in document image word spotting
Keyword spotting in document images presents several challenges which are related to the nature of the original documents. In this section, we first investigate the various text sources used by word spotting methods and subsequently overview the corresponding challenges.
Basic document image analysis technologies involved
Although the intermediate stages of a word spotting system may vary across different methods, we can distinguish some common steps. Document images are initially preprocessed in order to enhance the subsequent feature extraction step. After appropriate features have been extracted, a common representation is selected to describe both the documents at a specific level (word, line or page) and the query, which in most cases is a single word provided either as an image or a text string. The next
Keyword spotting system architecture
In this section, we examine the main steps of the word spotting pipeline. Fig. 2 illustrates a general purpose word spotting system where the whole procedure is divided in an offline and an online phase. In the offline stage, features are extracted from word images, text lines or whole pages which are then represented by feature vectors. In the case where training data are used, feature vectors are usually modelled with statistical models (e.g. HMMs). In the online phase, a user formulates a
Retrieval enhancement
In this section, we present a number of methods which are used to improve the retrieved results of a word spotting system by incorporating the information of the ranked lists obtained from user queries. This is done either by involving the user to select positive query instances in a supervised process, or in a purely unsupervised manner.
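As a rough illustration of the supervised case, the sketch below refines a query vector with the results a user has marked as correct. It uses a Rocchio-style update, a classic relevance feedback scheme from text retrieval that is swapped in here for illustration and is not named in this section; the function name, the weights `alpha` and `beta` and the toy vectors are assumptions.

```python
def refine_query(query_vec, positive_vecs, alpha=1.0, beta=0.75):
    """Move the query towards the centroid of user-confirmed results.

    alpha weights the original query, beta weights the centroid of the
    positives; both default values are illustrative assumptions.
    """
    if not positive_vecs:
        return list(query_vec)  # no feedback given: query stays unchanged
    dim = len(query_vec)
    centroid = [sum(vec[k] for vec in positive_vecs) / len(positive_vecs)
                for k in range(dim)]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]


# Toy two-dimensional descriptors (assumption).
query = [1.0, 0.0]
confirmed = [[0.8, 0.2], [0.6, 0.4]]  # results the user marked as correct
refined = refine_query(query, confirmed)
```

The refined vector would then be matched against the collection again to produce a new ranked list. An unsupervised variant could, under the same assumptions, treat the top few results of the first ranking as positives instead of asking the user.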
Evaluation
The ranked list of results obtained from a word spotting system for a number of different queries is finally used to evaluate its accuracy. In this section, we introduce the databases which are publicly available and most widely used for word spotting. After describing the importance of having a common evaluation scheme for direct comparison between methods, we present the distinct measures used for assessing the performance. Finally, we present and discuss the results achieved by the state of
Conclusion
In this survey, we presented a comprehensive study on word spotting for indexing documents available all over the world, written in various scripts or fonts. After examining the nature of the documents used by the research community, we described the intermediate steps of a word spotting system, namely, preprocessing, feature extraction, representation and similarity measures which are used to retrieve instances of user inserted queries. Subsequently, we overviewed a number of boosting
Acknowledgment
This work has been supported by the OldDocPro project (ID 4717) funded by the GSRT as well as the European Union’s H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943), https://read.transkribus.eu/.
References (189)
- et al. Handwritten word-spotting using hidden Markov models and universal vocabularies. Pattern Recognit. (2009)
- et al. A probabilistic method for keyword retrieval in handwritten document images. J. Pattern Recognit. (2009)
- et al. Keyword spotting for self-training of BLSTM NN-based handwriting recognition systems. Pattern Recognit. (2014)
- et al. A synthesised word approach to word retrieval in handwritten documents. Pattern Recognit. (2012)
- et al. Segmentation-free word spotting with exemplar SVMs. Pattern Recognit. (2014)
- et al. Towards an omnilingual word retrieval system for ancient manuscripts. Pattern Recognit. (2009)
- et al. Learning-based word spotting system for Arabic handwritten documents. Pattern Recognit. (2014)
- et al. Statistical script independent word spotting in offline handwritten documents. Pattern Recognit. (2014)
- et al. Keyword spotting in unconstrained handwritten Chinese documents using contextual word model. Image Vis. Comput. (2013)
- et al. A line-based representation for matching words in historical manuscripts. Pattern Recognit. Lett. (2011)
- A document image retrieval system. Eng. Appl. Artif. Intell.
- Google book search: document understanding on a massive scale. Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR)
- Combining on-line and off-line systems for handwriting recognition. Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR)
- Dynamic and contextual information in HMM modeling for handwritten word recognition. IEEE Trans. Pattern Anal. Mach. Intell.
- Handwritten word recognition using conditional random fields. Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR)
- A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell.
- Unconstrained handwritten word recognition based on trigrams using BLSTM. Proceedings of the 22nd International Conference on Pattern Recognition (ICPR)
- Lexicon-based word recognition using support vector machine and hidden Markov model. Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR)
- Cursive on-line handwriting word recognition using a bi-character model for large lexicon applications. Proceedings of the 12th International Conference on Frontiers for Handwriting Recognition (ICFHR)
- Improving offline handwritten text recognition with hybrid HMM/ANN models. IEEE Trans. Pattern Anal. Mach. Intell.
- Continuous hidden Markov modeling for speaker-independent word spotting. Proceedings of the 14th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
- Keyword location in noisy document images. Proceedings of the 2nd Annual Symposium on Document Analysis and Information Retrieval
- Word spotting in scanned images using hidden Markov models. Proceedings of the 18th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
- Word spotting: a new approach to indexing handwriting. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)
- Keyword spotting for cursive document retrieval. Proceedings of the 1st Workshop on Document Image Analysis (DIA)
- Fusion of word spotting and spatial information for figure caption retrieval in historical document images. Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR)
- Word spotting in Bangla and English graphical documents. Proceedings of the 22nd International Conference on Pattern Recognition (ICPR)
- Retrieving cuneiform structures in a segmentation-free word spotting framework. Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing (HIP)
- A novel word spotting method based on recurrent neural networks. IEEE Trans. Pattern Anal. Mach. Intell.
- A survey of keyword spotting techniques for printed document images. Artif. Intell. Rev.
- Segmentation and word spotting methods for printed and handwritten Arabic texts: a comparative study. Proceedings of the 13th International Conference on Frontiers in Handwriting Recognition (ICFHR)
- Digital libraries and document image retrieval techniques: a survey
- Image based retrieval and keyword spotting in documents
- Integrating visual and textual cues for query-by-string word spotting. Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)
- A study of bag-of-visual-words representations for handwritten keyword spotting. Int. J. Doc. Anal. Recognit.
- Segmentation-based historical handwritten word spotting using document-specific local features. Proceedings of the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR)
- Segmentation-free keyword spotting for handwritten documents based on heat kernel signature. Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)
- A keyword spotting approach using blurred shape model-based descriptors. Proceedings of the 10th Workshop on Historical Document Imaging and Processing
- Character n-gram spotting on handwritten documents using weakly-supervised segmentation. Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)
- Bag-of-features HMMs for segmentation-free word spotting in handwritten documents. Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)
- Segmentation-free query-by-string word spotting with bag-of-features HMMs. Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR)
- A fast word retrieval technique based on kernelized locality sensitive hashing. Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)
- Spot it! Finding words and patterns in historical documents. Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)
- Word spotting for historical documents. Int. J. Doc. Anal. Recognit.
- SpottingNet: learning the similarity of word images with convolutional neural network for word spotting in handwritten historical documents. Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)
- Word spotting in Alice’s adventures underground using multi scale integral orientation features. Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (DAS)
- Keyword spotting framework using dynamic background model. Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition (ICFHR)
- Keyword spotting in handwritten documents using projections of oriented gradients. Proceedings of the 12th IAPR Workshop on Document Analysis Systems (DAS)
- Deep feature embedding for accurate recognition and retrieval of handwritten text. Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)
Angelos P. Giotis received his B.Sc. and M.Sc. degrees in Computer Science from the Department of Computer Science and Engineering, University of Ioannina, Greece in 2010 and 2012, respectively. He is a Ph.D. student at the same department. He is currently working as a Research Associate at the Institute of Informatics and Telecommunications of the National Center for Scientific Research “Demokritos” in Athens, Greece. His research interests lie in Text Understanding, Information Retrieval and Object Detection.
Giorgos Sfikas received his B.Sc. and M.Sc. degrees in Computer Science from the Department of Computer Science, University of Ioannina, Greece in 2004 and 2007, respectively, and his Ph.D. degree in Image Processing and Computer Vision from the University of Strasbourg, France in 2012. His research interests include statistical image processing, medical imaging, document image processing, machine learning and computer vision. He is currently working as a Research Associate at the Institute of Informatics and Telecommunications of the National Center for Scientific Research “Demokritos” in Athens, Greece.
Basilis Gatos received his Electrical Engineering Diploma in 1992 and his Ph.D. degree in 1998, both from the Electrical and Computer Engineering Department of Democritus University of Thrace, Xanthi, Greece. He worked as Director of the Research Division in the field of digital preservation of old newspapers at Lambrakis Press Archives and as Managing Director of the R&D Division in the field of document management and recognition at BSI S.A. in Greece. He is currently working as a Researcher at the Institute of Informatics and Telecommunications of the National Center for Scientific Research “Demokritos” in Athens, Greece. His main research interests are in Image Processing and Document Image Analysis, OCR and Pattern Recognition. He has more than 150 publications in journals and international conference proceedings and has participated in several research programs funded by the European community. He is a member of the Editorial Board of the International Journal on Document Analysis and Recognition (IJDAR) and a program committee member of several international conferences and workshops. He is co-organizer of the International Conference on Frontiers in Handwriting Recognition (ICFHR) in 2014 and of the International Workshop on Document Analysis Systems (DAS 2016).
Christophoros Nikou received the Diploma in electrical engineering from the Aristotle University of Thessaloniki, Greece, in 1994 and the DEA and Ph.D. degrees in image processing and computer vision from Louis Pasteur University, Strasbourg, France, in 1995 and 1999, respectively. He was a Senior Researcher with the Department of Informatics, Aristotle University of Thessaloniki in 2001. From 2002 to 2004, he was a Research Engineer and Project Manager with Compucon S.A., Thessaloniki, Greece. He was a Lecturer (2004–2009) and an Assistant Professor (2009–2013) with the Department of Computer Science and Engineering, University of Ioannina, Ioannina, Greece, where he has been an Associate Professor since 2013. During the academic year 2015–2016 he was a visiting Associate Professor at the Department of Computer Science, University of Houston, USA. His research interests mainly include image processing and analysis, computer vision and pattern recognition and their application to medical imaging. He is a member of EURASIP and an IEEE Senior Member.