Elsevier

Pattern Recognition

Volume 68, August 2017, Pages 310-332

A survey of document image word spotting techniques

https://doi.org/10.1016/j.patcog.2017.02.023

Highlights

  • This work reviews the word spotting methods for document indexing.

  • The nature of texts addressed by word spotting techniques is analyzed.

  • The core steps that compose a word spotting system are thoroughly explored.

  • Several boosting mechanisms which enhance the retrieved results are examined.

  • Results achieved by the state of the art imply that there are still goals to be reached.

Abstract

Vast collections of documents available in image format need to be indexed for information retrieval purposes. In this framework, word spotting is an alternative solution to optical character recognition (OCR), which is rather inefficient for recognizing text of degraded quality and unknown fonts usually appearing in printed text, or writing style variations in handwritten documents. Over the past decade there has been a growing interest in addressing document indexing using word spotting which is reflected by the continuously increasing number of approaches. However, there exist very few comprehensive studies which analyze the various aspects of a word spotting system. This work aims to review the recent approaches as well as fill the gaps in several topics with respect to the related works. The nature of texts and inherent challenges addressed by word spotting methods are thoroughly examined. After presenting the core steps which compose a word spotting system, we investigate the use of retrieval enhancement techniques based on relevance feedback which improve the retrieved results. Finally, we present the datasets which are widely used for word spotting, we describe the evaluation standards and measures applied for performance assessment and discuss the results achieved by the state of the art.

Introduction

A great amount of information held in libraries and cultural institutions all over the world needs to be digitized so as to preserve it and protect it from frequent handling. Among others, Google has made an effort to digitize books on a large scale [1], [2], thereby providing support to the document understanding research community. In order to create digital libraries which allow efficient searching and browsing for future users, thousands of digitized documents have to be transcribed or at least indexed to a certain degree. However, the automatic recognition of poor quality printed text and, especially, handwritten text is not feasible with traditional OCR approaches, which mainly suffice for modern printed documents with simple layouts and known fonts. Most of the constraints encountered by recognition systems stem from difficulties in segmenting characters or words, the variability of the handwriting and the open vocabulary. For this reason, more flexible information retrieval and image analysis techniques are required.

The actual problem behind building digital libraries lies in the retrieval of digitized documents in terms of reliable extraction and access to specific information. While a document image processing system analyzes different text regions so as to convert them to machine-readable text using OCR, a document image retrieval system searches whether a document image contains particular words of interest, without the need for correct character recognition, but by directly characterizing image features at character, word, line or even document level.

On the one hand, recognition-based retrieval relies on the complete recognition of documents either at character level using OCR, or at word level using word recognition methods. In the latter case, the goal is to correctly classify a query word into a labeled class, or else, obtain its transcription. Most methods of this type require prior transcription of text-lines, words or characters to train character or word models. During the search phase, a text dictionary or lexicon is used and only words from that lexicon can be used as candidate transcriptions in the recognition task. These methods usually rely on hidden Markov models (HMMs) [3], [4], conditional random fields (CRFs) [5], neural networks (NNs) [6], [7] or they might follow a hybrid approach by combining different classifiers, such as support vector machines (SVMs) with HMMs [8], [9] or HMMs with NNs [10]. An obvious drawback of these approaches is that they have to deal with the inherent handwriting variability and handle a large number of word and character models. Nevertheless, this work does not focus on recognition-based retrieval methods and thus, we only briefly refer to them.

On the other hand, recognition-free retrieval, which is also known in the literature as word spotting or keyword spotting, is the main subject of this study. The goal here is to retrieve all instances of user queries in a set of document images which may be segmented into text lines or words. In practice, the user formulates a query and the system evaluates its similarity with the stored documents and returns as output a ranked list of results which are most similar to the query. The process is totally based on matching between common representations of features, such as color, texture, geometric shape or textual features, while conversion of whole documents into machine readable format and recognition do not take place at all. Therefore, the selection and use of proper features and robust matching techniques are the most important aspects of a word spotting system.
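
The retrieval loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method taken from the surveyed literature: the word images are assumed to be already described by fixed-length feature vectors, cosine similarity is used merely as one possible matching function, and the names `cosine_similarity` and `rank_by_similarity` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero descriptor matches nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, index):
    """Return (word_id, score) pairs ordered from most to least similar."""
    scored = [(word_id, cosine_similarity(query, vec))
              for word_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy index: three stored word images with made-up 3-dimensional descriptors.
index = {
    "w1": [1.0, 0.0, 0.0],
    "w2": [0.9, 0.1, 0.0],
    "w3": [0.0, 1.0, 0.0],
}
ranking = rank_by_similarity([1.0, 0.0, 0.0], index)
print([word_id for word_id, _ in ranking])  # ['w1', 'w2', 'w3']
```

No recognition step appears anywhere in the sketch: the output is a ranked list of candidate word images, and the quality of that list depends entirely on the descriptors and on the matching function.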

Word spotting methods may be divided into multiple categories according to various factors. Depending on how the input is specified by the user we can distinguish query-by-example (QBE) from query-by-string (QBS) methods. In the QBE scenario, the user selects an image of the word to be searched in the document collection, whereas in the QBS paradigm, the user provides an arbitrary text string as input to the system. Another way to categorize word spotting methods depends on whether training data are used offline, either to learn character and word models or to tune the parameters of the system. This way we can distinguish learning-based from learning-free approaches. Finally, word spotting methods which can be directly applied to whole document pages are considered segmentation-free, in contrast with segmentation-based methods, where a segmentation step has to be applied at line or word level during preprocessing.

Word spotting was initially proposed in the speech recognition community [11]. Its application was adopted later on for printed [12], [13] and handwritten [14] document indexing. While early approaches were based on raw features extracted directly from image pixels [14], [15], the state of the art is to characterize document images with more complex features based on gradient information, shape structure, texture, etc. (see Section 4.1).

There are a variety of applications of word spotting for document indexing and retrieval including the following:

  • retrieval of documents with a given word in company files,

  • searching online in cultural heritage collections stored in libraries all over the world,

  • automatic sorting of handwritten mail containing significant words (e.g. “urgent”, “cancelation”, “complain”) [16],

  • identification of figures and their corresponding captions [17],

  • keyword retrieval in pre-hospital care reports (PCR forms) [18],

  • word spotting in graphical documents such as maps [19],

  • retrieval of cuneiform structures from ancient clay tablets [20],

  • assisting human transcribers in identifying words in degraded documents, especially those appearing for the first time.

Although word spotting and word recognition belong to two separate retrieval paradigms, they sometimes interact by assisting one another. For instance, the authors in [21] propose a keyword spotting approach relying on an NN-based recognition system. Conversely, in [22], word spotting serves as a means of bootstrapping a handwriting recognition system by selecting new elements from the retrieved results. These elements can be used to augment the training set through a semi-supervised procedure, thus increasing the final recognition accuracy while at the same time avoiding the costly manual annotation process.

In order to track the recent literature, we present some statistics related to the evolution of word spotting methods over the last decade. The research community concentrates on indexing historical documents on a grand scale using word spotting and thus, we consider that the whole process remains an open problem. Fig. 1 provides a concise view of the word spotting approaches for offline, handwritten or printed documents that, to the best of our knowledge, have been published in conferences and journals since 2007. As can be seen in Fig. 1, the number of papers has increased over the past few years, which confirms the growing interest of the community in word spotting.

Apart from the proposed methods, there also exist a number of surveys for word spotting, either for a specific script, or a particular domain (machine-printed, handwritten), or even for a variety of applications. Murugappan et al. [23] present a study for word spotting in printed documents. The authors divide the word spotting methods according to a character-based and a word-based representation depending on the features used in each case. Their work implies that character-based approaches provide satisfactory results if character segmentation is easy to obtain, whereas word-based approaches can deal with touching characters efficiently and analyze the shapes of the words without explicit character recognition. In addition, a comparative study for segmentation and word spotting methods is presented in [24] for handwritten and printed text in Arabic documents. The segmentation techniques rely on horizontal and vertical profile features and scale space segmentation. The features under comparison are geometrical moments and word profiles, whereas the similarity computation is carried out using the cosine metric and dynamic time warping (DTW). An explicit view of the various aspects of a word spotting system is presented by Marinai et al. [25]. Therein, the different features used for each technique are categorized according to the layer at which the similarity computation is applied (pixel/column features, connected components, word level features etc.). Image representations (i.e. feature vectors) with respect to the specific feature types are also analyzed along with the respective similarity measures. Finally, the work of Tan et al. [26] underlines the necessity for content-based image retrieval as an economical alternative to OCR, relying on proper selection of features, representation and similarity measures. Word spotting is defined under a framework of categories with respect to the word image representation.
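
Since dynamic time warping (DTW) recurs throughout the word spotting literature as a way to compare word profiles of different lengths, a minimal sketch may help. It assumes each word image is reduced to a sequence of scalar per-column features (a hypothetical profile), uses the absolute difference as local cost, and omits refinements such as band constraints or path-length normalization; the function name `dtw_distance` is illustrative only.

```python
def dtw_distance(seq_a, seq_b):
    """Classic DTW cost between two sequences of scalar features."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch seq_b
                cost[i][j - 1],      # stretch seq_a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


# The same profile written slightly wider aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([0, 0], [1, 1]))           # 2.0
```

The first call shows why DTW suits handwriting: a stretched version of the same profile is matched without penalty, whereas a fixed-length metric such as the cosine measure would first require resampling both sequences to a common length.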

Nevertheless, a considerable number of word spotting approaches proposed over the last years, as well as several techniques for improving performance, remain unexplored. This survey aims to review the recently proposed methods and complete the missing parts of other studies in the word spotting literature. To this end, we analyze the nature of text sources along with the inherent difficulties addressed by word spotting methods. Besides the main steps of the word spotting system, namely, feature extraction, representation and similarity computation, we also investigate the preprocessing stage with respect to binarization, segmentation and normalization techniques. Furthermore, we present the benefits accrued from relevance feedback methods employed in the retrieval phase of a word spotting task, either by involving the user to select true query instances or in a completely unsupervised way. Subsequently, we examine whether direct comparison among different methods is straightforward or not, since the evaluation measures and protocols applied for assessing the performance may differ substantially. Finally, we present the most commonly used datasets along with the experimental results published by the state-of-the-art methods and discuss the performance obtained in each case.

The rest of our work is structured as follows. In Section 2, we describe the challenges involved in document image word spotting. In Sections 3 and 4, we present the core steps of the word spotting pipeline with respect to the preprocessing and feature extraction, the input of the spotting system and the different similarity metrics applied among common representations upon the extracted features. Section 5 describes a number of techniques which enhance the retrieved results from the image matching step based on relevance feedback, data fusion and re-ranking. In Section 6, we present the most common datasets along with some distinct measures used to evaluate word spotting systems and examine the results achieved by the state-of-the-art methods. Finally, conclusions are drawn in Section 7.

Section snippets

Challenges in document image word spotting

Keyword spotting in document images presents several challenges which are related to the nature of the original documents. In this section, we first investigate the various text sources used by word spotting methods and subsequently overview the corresponding challenges.

Basic document image analysis technologies involved

Although the intermediate stages of a word spotting system may vary across different methods, we can distinguish some common steps. Document images are initially preprocessed in order to enhance the subsequent feature extraction step. After appropriate features have been extracted, a common representation is selected to describe both the documents at a specific level (word, line or page) and the query, which in most cases is a single word provided either as an image or a text string. The next

Keyword spotting system architecture

In this section, we examine the main steps of the word spotting pipeline. Fig. 2 illustrates a general purpose word spotting system where the whole procedure is divided in an offline and an online phase. In the offline stage, features are extracted from word images, text lines or whole pages which are then represented by feature vectors. In the case where training data are used, feature vectors are usually modelled with statistical models (e.g. HMMs). In the online phase, a user formulates a

Retrieval enhancement

In this section, we present a number of methods which are used to improve the retrieved results of a word spotting system by incorporating the information of the ranked lists obtained from user queries. This is done either by involving the user to select positive query instances in a supervised process, or in a purely unsupervised manner.
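
As a rough illustration of the supervised case, the sketch below refines a query by averaging it with the results a user marked as relevant and then re-ranks the collection. This is one simple refinement strategy chosen for illustration, not necessarily one of the surveyed methods; the descriptors, the Euclidean distance and the names `rank` and `refine_query` are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank(query, index):
    """Word ids ordered from nearest to farthest from the query."""
    return sorted(index, key=lambda word_id: euclidean(query, index[word_id]))


def refine_query(query, positives):
    """Average the query with the feature vectors marked as relevant."""
    vectors = [query] + positives
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up 2-dimensional descriptors for four stored word images.
index = {
    "w1": [0.9, 0.1],
    "w2": [0.5, -0.5],
    "w3": [0.6, 0.9],
    "w4": [0.0, 1.0],
}
query = [1.0, 0.0]

first_pass = rank(query, index)
print(first_pass)   # ['w1', 'w2', 'w3', 'w4']

# Suppose the user marks w1 and w3 as true instances of the query word.
refined = refine_query(query, [index["w1"], index["w3"]])
second_pass = rank(refined, index)
print(second_pass)  # ['w1', 'w3', 'w2', 'w4']
```

After feedback, the confirmed instance `w3` moves ahead of the unconfirmed `w2`. An unsupervised variant could, for instance, treat the top-ranked results of the first pass as positives; that choice is likewise an assumption here.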

Evaluation

The ranked list of results obtained from a word spotting system for a number of different queries is finally used to evaluate its accuracy. In this section, we introduce the databases which are publicly available and most widely used for word spotting. After describing the importance of having a common evaluation scheme for direct comparison between methods, we present the distinct measures used for assessing the performance. Finally, we present and discuss the results achieved by the state of

Conclusion

In this survey, we presented a comprehensive study on word spotting for indexing documents available all over the world, written in various scripts or fonts. After examining the nature of the documents used by the research community, we described the intermediate steps of a word spotting system, namely, preprocessing, feature extraction, representation and similarity measures which are used to retrieve instances of user inserted queries. Subsequently, we overviewed a number of boosting

Acknowledgment

This work has been supported by the OldDocPro project (ID 4717) funded by the GSRT as well as the European Union’s H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943), https://read.transkribus.eu/.

References (189)

  • K. Zagoris et al.

    A document image retrieval system

    Eng. Appl. Artif. Intell.

    (2010)
  • S. Levy, Google’s two revolutions, 2004, (Newsweek)...
  • L. Vincent

    Google book search: document understanding on a massive scale

    Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR)

    (2007)
  • M. Liwicki et al.

    Combining on-line and off-line systems for handwriting recognition

    Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR)

    (2007)
  • A.L. Bianne-Bernard et al.

    Dynamic and contextual information in HMM modeling for handwritten word recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • S. Shetty et al.

    Handwritten word recognition using conditional random fields

    Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR)

    (2007)
  • A. Graves et al.

    A novel connectionist system for unconstrained handwriting recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • X. Zhang et al.

    Unconstrained handwritten word recognition based on trigrams using BLSTM

    Proceedings of the 22nd International Conference on Pattern Recognition (ICPR)

    (2014)
  • A. Ahmad et al.

    Lexicon-based word recognition using support vector machine and hidden Markov model

    Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR)

    (2009)
  • S. Prum et al.

    Cursive on-line handwriting word recognition using a bi-character model for large lexicon applications

    Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition (ICFHR)

    (2010)
  • S. España-Boquera et al.

    Improving offline handwritten text recognition with hybrid HMM/ANN models

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2011)
  • J. Rohlicek et al.

    Continuous hidden Markov modeling for speaker-independent word spotting

    Proceedings of the 14th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

    (1989)
  • S. Khoubyari et al.

    Keyword location in noisy document images

    Proceedings of the 2nd Annual Symposium on Document Analysis and Information Retrieval

    (1993)
  • F. Chen et al.

    Word spotting in scanned images using hidden Markov models

    Proceedings of the 18th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

    (1993)
  • R. Manmatha et al.

    Word spotting: a new approach to indexing handwriting

    Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)

    (1996)
  • P. Keaton et al.

    Keyword spotting for cursive document retrieval

    Proceedings of the 1st Workshop on Document Image Analysis (DIA)

    (1997)
  • K. Khurshid et al.

    Fusion of word spotting and spatial information for figure caption retrieval in historical document images

    Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR)

    (2009)
  • A. Tarafdar et al.

    Word spotting in Bangla and English graphical documents

    Proceedings of the 22nd International Conference on Pattern Recognition (ICPR)

    (2014)
  • L. Rothacker et al.

    Retrieving cuneiform structures in a segmentation-free word spotting framework

    Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing (HIP)

    (2015)
  • V. Frinken et al.

    A novel word spotting method based on recurrent neural networks

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2012)
  • A. Murugappan et al.

    A survey of keyword spotting techniques for printed document images

    Artif. Intell. Rev.

    (2011)
  • M. Kchaou et al.

    Segmentation and word spotting methods for printed and handwritten Arabic texts: a comparative study

    Proceedings of the 13th International Conference on Frontiers in Handwriting Recognition (ICFHR)

    (2012)
  • S. Marinai et al.

    Digital libraries and document image retrieval techniques: a survey

  • C. Tan et al.

    Image based retrieval and keyword spotting in documents

  • D. Aldavert et al.

    Integrating visual and textual cues for query-by-string word spotting

    Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)

    (2013)
  • D. Aldavert et al.

    A study of bag-of-visual-words representations for handwritten keyword spotting

    Int. J. Doc. Anal. Recognit.

    (2015)
  • K. Zagoris et al.

    Segmentation-based historical handwritten word spotting using document-specific local features

    Proceedings of the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR)

    (2014)
  • X. Zhang et al.

    Segmentation-free keyword spotting for handwritten documents based on heat kernel signature

    Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)

    (2013)
  • A. Fornés et al.

    A keyword spotting approach using blurred shape model-based descriptors

    Proceedings of the 10th Workshop on Historical Document Imaging and Processing

    (2011)
  • U. Roy et al.

    Character n-gram spotting on handwritten documents using weakly-supervised segmentation

    Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)

    (2013)
  • L. Rothacker et al.

    Bag-of-features HMMs for segmentation-free word spotting in handwritten documents

    Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)

    (2013)
  • L. Rothacker et al.

    Segmentation-free query-by-string word spotting with bag-of-features HMMs

    Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR)

    (2015)
  • T. Mondal et al.

    A fast word retrieval technique based on kernelized locality sensitive hashing

    Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)

    (2013)
  • V. Dovgalecs et al.

    Spot it! Finding words and patterns in historical documents

    Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR)

    (2013)
  • T.M. Rath et al.

    Word spotting for historical documents

    Int. J. Doc. Anal. Recognit.

    (2007)
  • Z. Zhong et al.

    SpottingNet: learning the similarity of word images with convolutional neural network for word spotting in handwritten historical documents

    Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)

    (2016)
  • A.I. Wagan et al.

    Word spotting in Alice’s adventures underground using multi scale integral orientation features

    Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (DAS)

    (2010)
  • G. Kumar et al.

    Keyword spotting framework using dynamic background model

    Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition (ICFHR)

    (2012)
  • G. Retsinas et al.

    Keyword spotting in handwritten documents using projections of oriented gradients

    Proceedings of the 12th IAPR Workshop on Document Analysis Systems (DAS)

    (2016)
  • P. Krishnan et al.

    Deep feature embedding for accurate recognition and retrieval of handwritten text

    Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)

    (2016)

    Angelos P. Giotis Received his B.Sc. and M.Sc. degrees in Computer Science from the Department of Computer Science and Engineering, University of Ioannina, Greece in 2010 and 2012, respectively. He is a Ph.D. student at the same department. He is currently working as a Research Associate at the Institute of Informatics and Telecommunications of the National Center for Scientific Research “Demokritos” in Athens, Greece. His research interests lie in Text Understanding, Information Retrieval and Object Detection.

    Giorgos Sfikas Received his B.Sc. and M.Sc. degrees in Computer Science from the Department of Computer Science, University of Ioannina, Greece in 2004 and 2007, respectively, and his Ph.D. degree in Image Processing and Computer Vision from the University of Strasbourg, France in 2012. His research interests include statistical image processing, medical imaging, document image processing, machine learning and computer vision. He is currently working as a Research Associate at the Institute of Informatics and Telecommunications of the National Center for Scientific Research “Demokritos” in Athens, Greece.

    Basilis Gatos Received his Electrical Engineering Diploma in 1992 and his Ph.D. degree in 1998, both from the Electrical and Computer Engineering Department of Democritus University of Thrace, Xanthi, Greece. He worked as Director of the Research Division in the field of digital preservation of old newspapers at Lambrakis Press Archives and as Managing Director of R&D Division in the field of document management and recognition at BSI S.A. in Greece. He is currently working as a Researcher at the Institute of Informatics and Telecommunications of the National Center for Scientific Research “Demokritos” in Athens, Greece. His main research interests are in Image Processing and Document Image Analysis, OCR and Pattern Recognition. He has more than 150 publications in journals and international conference proceedings and has participated in several research programs funded by the European community. He is a member of the Editorial Board of the International Journal on Document Analysis and Recognition (IJDAR) and program committee member of several international Conferences and Workshops. He was co-organizer of the International Conference on Frontiers in Handwriting Recognition (ICFHR) in 2014 and of the International Workshop on Document Analysis Systems (DAS 2016).

    Christophoros Nikou Received the Diploma in electrical engineering from the Aristotle University of Thessaloniki, Greece, in 1994 and the DEA and Ph.D. degrees in image processing and computer vision from Louis Pasteur University, Strasbourg, France, in 1995 and 1999, respectively. He was a Senior Researcher with the Department of Informatics, Aristotle University of Thessaloniki in 2001. From 2002 to 2004, he was a Research Engineer and Project Manager with Compucon S.A., Thessaloniki, Greece. He was a Lecturer (2004–2009) and an Assistant Professor (2009–2013) with the Department of Computer Science and Engineering, University of Ioannina, Ioannina, Greece, where he has been an Associate Professor since 2013. During the academic year 2015–2016 he was a visiting Associate Professor at the Department of Computer Science, University of Houston, USA. His research interests mainly include image processing and analysis, computer vision and pattern recognition and their application to medical imaging. He is a member of EURASIP and an IEEE Senior Member.
