Keyword spotting in handwritten Chinese documents using semi-Markov conditional random fields
Introduction
With the increasing use of pen-based input devices and user interfaces, growing research attention has been paid to document analysis techniques, including text segmentation, recognition and retrieval. Despite great progress in handwritten text recognition (Plamondon and Srihari, 2000, Graves et al., 2009), the remaining recognition errors can still prevent keywords from being located. Keyword spotting (Manmatha et al., 1996, Frinken et al., 2012, Fischer et al., 2012) aims to locate words or phrases in a document without requiring accurate handwriting recognition. A similarity measure is computed between the query word and each segmented candidate in the document, and the user can adjust the decision threshold to trade off recall against precision according to need. Typical applications of keyword spotting include the retrieval of handwritten pages such as notes, bank checks, government files and historical documents.
For fast retrieval of documents from large databases, it is necessary to build and store an index file beforehand; the spotting algorithm is then run on this index to return spotting results for a query word (Zhang et al., 2014). The state of the art for English handwriting recognition, i.e., the long short-term memory recurrent neural network (LSTM-RNN), has been applied to Chinese off-line handwritten text recognition (Messina and Louradour, 2015), but its performance is still inferior to that of methods based on character over-segmentation and classification (Wang et al., 2012, Wang et al., 2014). Therefore, for keyword spotting in handwritten Chinese documents, we build the index file based on the segmentation-recognition framework (Zhang et al., 2014, Zhang et al., 2013a, Huang et al., 2013). In the document, each text line is first over-segmented into a sequence of components according to the overlap between strokes, in the hope that each component is a character or part of a character. Subject to constraints on character width, consecutive components are combined to generate candidate characters, which constitute the segmentation candidate lattice. By assigning each candidate character a number of candidate classes using a character classifier, the segmentation-recognition candidate lattice (referred to as the lattice for brevity) is constructed. Each path in the lattice corresponds to a segmentation-recognition hypothesis of the text line. Each character-label pair (a candidate character coupled with one of its candidate labels) in the lattice is referred to as an edge. The character similarity scores between each candidate character and its candidate classes, also referred to as edge scores, are calculated and stored. In text search, the query word is matched with sequences of candidate characters (partial paths in the candidate lattice) starting from each component in the lattice (Zhang et al., 2013a).
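The lattice construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Edge` and `build_lattice`, the limit of three components per candidate character, the width bound, and the `classify` callback are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edge:
    """A character-label pair: components [start, end) with one candidate label."""
    start: int
    end: int
    label: str
    score: float


def build_lattice(
    widths: List[float],
    classify: Callable[[int, int], List[Tuple[str, float]]],
    max_components: int = 3,   # assumed limit, not taken from the paper
    max_width: float = 1.5,    # assumed width bound in arbitrary units
) -> List[Edge]:
    """Combine consecutive components into candidate characters and label them.

    widths[i] is the width of over-segmented component i; classify(start, end)
    is a hypothetical classifier returning (label, score) pairs for the
    candidate character spanning components start..end-1.
    """
    edges: List[Edge] = []
    n = len(widths)
    for start in range(n):
        total = 0.0
        for end in range(start + 1, min(start + max_components, n) + 1):
            total += widths[end - 1]
            if total > max_width:
                break  # wider spans from this start violate the width constraint
            for label, score in classify(start, end):
                edges.append(Edge(start, end, label, score))
    return edges
```

Every path of edges that runs from component 0 to the last component without gaps or overlaps is then one segmentation-recognition hypothesis of the text line.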
The word similarity is obtained by combining the character similarity scores (edge scores). When the word similarity exceeds a threshold, a word instance is located in the document. The similarity measure is therefore critical to keyword spotting. For keyword spotting in large databases of multi-writer or writer-independent documents, edge scores are usually given by a character classifier to alleviate the effects of character shape variation (Oda et al., 2004, Zhang et al., 2013a, Cheng et al., 2013). Ideally, the score of a true character should be higher than that of any impostor. To improve discriminability, additional information (Huang et al., 2013), such as geometric and linguistic context, is incorporated when calculating edge scores.
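As a small illustration of thresholded word scoring, the sketch below combines edge probabilities by averaging their logarithms. The averaging rule and the names `word_similarity` and `spot` are assumptions made for the example; the text above only states that character scores are combined.

```python
import math
from typing import List, Sequence, Tuple


def word_similarity(edge_probs: Sequence[float]) -> float:
    """Average log-probability of the matched edges (assumed combination rule)."""
    return sum(math.log(p) for p in edge_probs) / len(edge_probs)


def spot(candidates: List[Tuple[int, Sequence[float]]], threshold: float) -> List[int]:
    """Return positions of candidate words whose similarity exceeds the threshold.

    candidates holds (position, edge probabilities of the matched characters).
    """
    return [pos for pos, probs in candidates if word_similarity(probs) > threshold]
```

Lowering `threshold` admits more candidates (higher recall, lower precision), while raising it does the opposite.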
In this paper, we propose an indexing method for keyword spotting using semi-CRFs (Zhou et al., 2013), which are probabilistic graphical models defined on the candidate lattice and provide a theoretical framework for fusing information from different contexts. With this model, we first augment the original lattice to supplement candidate character classes and then reduce the lattice complexity by the forward-backward pruning procedure used in Zhou et al. (2013), which avoids breaking high-probability paths; edge scores are derived from the marginal probabilities of edges. The semi-CRF model is derived from the Chinese handwriting recognition system in Zhou et al. (2013), but the proposed method differs from text line recognition. Handwritten text recognition retains only the 1-best recognition result, which is limiting for retrieval. Keyword spotting is performed on the lattice, which contains alternative candidate characters/words, so that more instances can be located than from the 1-best result and the user can adjust the threshold to balance recall and precision. The relations between keyword spotting and handwriting recognition have been discussed in Frinken et al. (2012). Unlike handwriting recognition, keyword spotting is a binary classification of edges in the lattice, which requires high similarity scores for target characters and low scores for all others. Hence, we propose a binary classification objective, i.e., the cross-entropy (CE), to optimize the semi-CRF parameters. In addition, to enhance the recognition performance of the semi-CRF model, we propose a proxy-character driven search algorithm to locate mis-recognized character instances in the lattice. Confusable similar characters can be used as proxies when searching the index file, so that a candidate character mis-recognized as one of its proxies can still be matched.
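To make the idea of edge marginals and pruning concrete, here is a simplified forward-backward sketch on a lattice whose edges carry a single positive weight each. It is an assumption-laden toy: a real semi-CRF also scores transitions between neighbouring characters, and the pruning rule in Zhou et al. (2013) may differ from the plain marginal threshold used here. The function names are hypothetical.

```python
from typing import List, Tuple

# (start node, end node, label, positive weight); start < end is assumed
LatticeEdge = Tuple[int, int, str, float]


def edge_marginals(n_nodes: int, edges: List[LatticeEdge]) -> List[float]:
    """Marginal probability of each edge over all full paths 0 -> n_nodes-1."""
    forward = [0.0] * n_nodes
    backward = [0.0] * n_nodes
    forward[0] = 1.0
    backward[n_nodes - 1] = 1.0
    # Forward pass: visiting edges by increasing start node is a topological order.
    for s, e, _, w in sorted(edges, key=lambda x: x[0]):
        forward[e] += forward[s] * w
    # Backward pass: visit edges by decreasing start node.
    for s, e, _, w in sorted(edges, key=lambda x: x[0], reverse=True):
        backward[s] += w * backward[e]
    z = forward[n_nodes - 1]  # total weight of all full paths
    return [forward[s] * w * backward[e] / z for s, e, _, w in edges]


def prune(n_nodes: int, edges: List[LatticeEdge], min_marginal: float) -> List[LatticeEdge]:
    """Keep only edges whose marginal reaches the (assumed) threshold."""
    marginals = edge_marginals(n_nodes, edges)
    return [edge for edge, m in zip(edges, marginals) if m >= min_marginal]
```

For a three-node lattice with edges `(0, 1, "A", 3.0)`, `(0, 1, "B", 1.0)`, `(1, 2, "C", 2.0)` and `(0, 2, "D", 2.0)`, the marginals are 0.6, 0.2, 0.8 and 0.2; the retained marginals can be stored as edge scores in the index.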
Compared with the traditional character-synchronous dynamic search, the use of proxy characters can improve keyword spotting performance significantly. Keyword spotting in Japanese handwritten documents (Cheng et al., 2013, Oda et al., 2004) is also performed by scoring candidate characters in the lattice and locating instances by character-synchronous search. Here, we pay more attention to character scoring and to the search method in order to improve word matching accuracy and search efficiency.
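The proxy-character idea can be illustrated with a small search over lattice edges. This is only a sketch under stated assumptions: the function `find_word`, the `proxies` table, and the averaged log-probability score are hypothetical, and the way the actual algorithm scores a proxy match is not given in this excerpt.

```python
from typing import Dict, List, Sequence, Tuple

# (start node, end node, label, log-probability of the edge)
ScoredEdge = Tuple[int, int, str, float]


def find_word(
    edges: List[ScoredEdge],
    query: str,
    proxies: Dict[str, Sequence[str]],
) -> List[Tuple[int, int, float]]:
    """Find partial paths spelling the query, accepting proxy labels.

    Returns (start node, end node, average log-probability) for each hit.
    proxies maps a query character to confusable characters that may stand in for it.
    """
    by_start: Dict[int, List[ScoredEdge]] = {}
    for edge in edges:
        by_start.setdefault(edge[0], []).append(edge)

    hits: List[Tuple[int, int, float]] = []

    def extend(node: int, i: int, score: float, begin: int) -> None:
        if i == len(query):
            hits.append((begin, node, score / len(query)))
            return
        accepted = {query[i]} | set(proxies.get(query[i], ()))
        for _, end, label, log_prob in by_start.get(node, []):
            if label in accepted:
                extend(end, i + 1, score + log_prob, begin)

    for start in sorted(by_start):
        extend(start, 0, 0.0, start)
    return hits
```

With an empty `proxies` table this reduces to exact label matching, so a query character that was recognized as a look-alike is missed; adding the look-alike as a proxy lets the same instance be found.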
The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 gives an overview of our keyword spotting system. Section 4 details the index generation based on the semi-CRF model. Section 5 introduces the proxy-character driven search algorithm. Section 6 presents our experimental results on the online handwriting database CASIA-OLHWDB, and Section 7 draws concluding remarks.
Related work
Keyword spotting was originally formulated as detecting words or phrases in speech (Myers et al., 1981), and extended to locate words in printed text documents (Kuo and Agazzi, 1994) a decade later. This technique was first performed on online handwriting in Lopresti and Tomkins (1994) for annotation retrieval and on handwritten document images in Manmatha et al. (1996). With regard to word similarity scoring techniques, keyword spotting methods can be categorized into two groups: word shape
System overview
For fast keyword spotting from a large collection of documents, the proposed system of online handwritten Chinese document retrieval consists of two stages: indexing and keyword search, as shown in Fig. 1. The indexing is done offline to generate the pruned candidate lattice and compute character confidence measures (edge probabilities/scores), while the keyword search is performed online to locate instances matched with the query.
To build the index file, the document is first segmented into
Building the index file using Semi-CRFs
In this section, we first describe the candidate character augmentation technique and then briefly introduce the semi-CRF model for lattice pruning and edge (character-label pair) score computation. Finally, we propose the model training method, i.e., the CE criterion, which views keyword spotting as a binary classification of candidate characters. The compact lattices together with their edge scores (for the text lines in a document) are concatenated and stored in the index file for retrieval.
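A binary cross-entropy over lattice edges can be written, as a sketch, in the following form. The symbols are assumptions introduced for illustration: $p_e(\Lambda)$ is the marginal probability of edge $e$ under the semi-CRF with parameters $\Lambda$, $t_e \in \{0, 1\}$ indicates whether $e$ lies on the ground-truth path, and $E$ is the set of edges in the lattice. The exact criterion used in the paper may differ.

```latex
\mathcal{L}_{\mathrm{CE}}(\Lambda)
  = -\sum_{e \in E}
    \Bigl[\, t_e \log p_e(\Lambda) + (1 - t_e) \log\bigl(1 - p_e(\Lambda)\bigr) \Bigr]
```

Minimizing this quantity pushes the marginals of true edges toward 1 and those of all other edges toward 0, which matches the binary view of keyword spotting described above.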
Proxy-character driven search algorithm
Previously, we used a character-synchronous dynamic search algorithm for locating keywords in the lattice (Zhang et al., 2013a). The search algorithm contains two key steps: candidate character scoring and word search. If the query is matched with the candidate character/word, the similarity score is the logarithm of the edge probability; otherwise, it is −∞. Consequently, a query that is mis-matched with the candidate character/word cannot be located, and the recall rate is limited. In
Experiments
We evaluated the performance of the proposed keyword spotting method on a database of online Chinese handwriting: CASIA-OLHWDB (Liu et al., 2011). This database is divided into six data sets, three of isolated characters and three of handwritten texts. There are 3,912,017 isolated character samples and 5,092 handwritten pages (52,220 text lines) in total. Both the isolated character data and the handwritten text data have been divided into a standard training set (816 writers) and a testing set (204 writers).
Conclusion
In this paper, we present an indexing method for keyword spotting in online handwritten Chinese documents based on semi-CRFs, which provide a theoretical framework for fusing the information of character recognition, geometric and linguistic contexts. By candidate character augmentation and lattice pruning, we obtain a compact index file for keyword spotting. In the pruned candidate segmentation-recognition lattice, the candidate character scores are estimated based on the semi-CRF model. The
Acknowledgements
This work is supported by the National Natural Science Foundation of China (NSFC) under grants no. 61403385 and no. 61273269.
References (56)
- et al. A probabilistic method for keyword retrieval in handwritten document images. Pattern Recognit. (2009)
- et al. Lexicon-free handwritten word spotting using character HMMs. Pattern Recognit. Lett. (2012)
- et al. Finding words in alphabet soup: inference on freeform character recognition for historical scripts. Pattern Recognit. (2009)
- et al. Retrieval of online handwriting by synthesis and matching. Pattern Recognit. (2009)
- et al. A word graph algorithm for large vocabulary continuous speech recognition. Comput. Speech Lang. (1997)
- et al. Handwritten word-spotting using hidden Markov models and universal vocabularies. Pattern Recognit. (2009)
- et al. Off-line recognition of realistic Chinese handwriting using segmentation-free strategy. Pattern Recognit. (2009)
- et al. An approach for real-time recognition of online Chinese handwritten sentences. Pattern Recognit. (2012)
- et al. Unsupervised language model adaptation for handwritten Chinese text recognition. Pattern Recognit. (2014)
- et al. Transcript mapping for handwritten Chinese documents by integrating character recognition model and geometric context. Pattern Recognit. (2013)
- Character confidence based on N-best list for keyword spotting in online Chinese handwritten documents. Pattern Recognit.
- A robust approach to text line grouping in online handwritten Japanese documents. Pattern Recognit.
- Digital ink search based on character-recognition candidates compared with feature-matching-based approach. IEICE Trans. Inf. Syst.
- A novel word spotting method based on recurrent neural networks. IEEE Trans. Pattern. Anal. Mach. Intell.
- Introduction to Statistical Pattern Recognition
- A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern. Anal. Mach. Intell.
- Keyword spotting in unconstrained handwritten Chinese documents using contextual word model. Image Vis. Comput.
- Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Trans. Pattern. Anal. Mach. Intell.
- Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov models. IEEE Trans. Pattern. Anal. Mach. Intell.
- Lexicon-driven segmentation and recognition of handwritten character strings for Japanese address reading. IEEE Trans. Pattern. Anal. Mach. Intell.