Label Embedding: A Frugal Baseline for Text Recognition

International Journal of Computer Vision

Abstract

The standard approach to recognizing text in images consists of first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by enforcing that matching label-image pairs be closer than non-matching pairs. This method presents several advantages: it does not require ad hoc or costly pre-/post-processing operations; it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case); it allows for the recognition of never-seen-before words (zero-shot recognition); and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results that are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple-to-compute baseline for text recognition.
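The retrieval view described above can be illustrated with a minimal sketch: an image descriptor is scored against every label in a lexicon through a bilinear compatibility function, and the highest-scoring label is returned. Everything in this sketch is an assumption made for illustration. The character-histogram label embedding `embed_label`, the identity matrix standing in for a learned projection `W`, and the synthetic image descriptor are hypothetical and are not the Fisher-vector features or the learned Structured SVM model used in the paper.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # assumed alphabet


def embed_label(word):
    """Hypothetical label embedding: L2-normalized character histogram."""
    vec = np.zeros(len(ALPHABET))
    for ch in word.lower():
        idx = ALPHABET.find(ch)
        if idx >= 0:          # characters outside the alphabet are ignored
            vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def recognize(image_desc, lexicon, W):
    """Return the lexicon word with the highest score x^T W phi(y), plus all scores."""
    labels = np.stack([embed_label(w) for w in lexicon])  # (n_words, E)
    projected = W.T @ image_desc                          # image mapped to label space, (E,)
    scores = labels @ projected                           # one score per lexicon word
    return lexicon[int(np.argmax(scores))], scores


# Toy demo: W is the identity (a stand-in for a learned matrix) and the
# "image descriptor" is simply the embedding of the true word.
lexicon = ["hotel", "motel", "house", "plate"]
W = np.eye(len(ALPHABET))
image_desc = embed_label("hotel")
best, scores = recognize(image_desc, lexicon, W)
print(best)  # hotel
```

Because a plain character histogram ignores character order, it cannot separate anagrams; a position-aware label embedding would be needed in practice. Words absent from training can still be scored, since only their label embedding is required, which is the sense in which such a scheme supports zero-shot recognition.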

Notes

  1. An alternative upper bound is the slack-rescaled hinge loss \(\max _{y \in \mathcal {Y}} \Delta (y_n,y) (1 - F(x_n,y_n;w) + F(x_n,y;w))\). Note that in the 0/1 loss case, the margin-rescaled and slack-rescaled losses are equivalent. See Nowozin and Lampert (2011, p. 120) for more details.

  2. Marginalization can be done “early”, by constructing a string representation that includes all possible symbols in that position (weighted by the size of the symbols’ alphabet), or “late”, by explicitly generating a new set of queries that match the query with the wildcard and averaging the similarities of those queries with the image. This is equivalent to generating the new set of queries, averaging them, and then computing the similarity between that average query and the image. The subtle differences between “early” and “late” marginalization are only due to the way the string representation is normalized. We focus on late marginalization since it obtained slightly better results than early marginalization.
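The equivalence stated in note 2 follows from the linearity of the dot product, and a small sketch can make it concrete. The code below is illustrative only: the toy alphabet, the position-wise one-hot embedding `embed`, and the random vector standing in for a projected image descriptor are all assumptions, not the string representation used in the paper.

```python
import numpy as np

ALPHABET = "ABC123"   # toy alphabet (assumption)
WILDCARD = "?"


def embed(word):
    """Hypothetical linear embedding: one one-hot block per character position."""
    vec = np.zeros(len(word) * len(ALPHABET))
    for pos, ch in enumerate(word):
        vec[pos * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return vec


def expand(query):
    """All concrete strings matching a query that contains a single wildcard."""
    return [query.replace(WILDCARD, s) for s in ALPHABET]


def embed_early(query):
    """Early marginalization: spread weight 1/|alphabet| over the wildcard slot."""
    vec = np.zeros(len(query) * len(ALPHABET))
    for pos, ch in enumerate(query):
        if ch == WILDCARD:
            vec[pos * len(ALPHABET):(pos + 1) * len(ALPHABET)] = 1.0 / len(ALPHABET)
        else:
            vec[pos * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return vec


def l2(v):
    """L2-normalize a vector."""
    return v / np.linalg.norm(v)


rng = np.random.default_rng(0)
query = "A?3"
image = rng.normal(size=len(query) * len(ALPHABET))  # stand-in for a projected image

expanded = np.stack([embed(q) for q in expand(query)])

late_score = np.mean(expanded @ image)              # average the similarities
mean_query_score = expanded.mean(axis=0) @ image    # similarity with the average query
early_unnormalized = embed_early(query)

# With per-query L2 normalization, late and early representations differ in scale.
late_normalized = np.stack([l2(e) for e in expanded]).mean(axis=0)
early_normalized = l2(embed_early(query))

print(np.isclose(late_score, mean_query_score))             # True
print(np.allclose(late_normalized, early_normalized))       # False
```

In this toy setting the unnormalized early representation coincides with the average of the expanded queries, and the two variants diverge only once each representation is normalized, which matches the note's remark that the differences come from normalization.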

References

  • Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2013). Handwritten word spotting with corrected attributes. In ICCV.

  • Almazán, J., Gordo, A., Fornés, A., & Valveny, E. (2014). Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12), 2552–2566.

  • Bai, B., Weston, J., Grangier, D., Collobert, R., Chapelle, O., & Weinberger, K. (2009). Supervised semantic indexing. In CIKM.

  • Bazzi, I., Schwartz, R., & Makhoul, J. (1999). An omnifont open-vocabulary OCR system for English and Arabic. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(6), 495–504.

  • Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation.

  • Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

  • Bissacco, A., Cummins, M., Netzer, Y., & Neven, H. (2013). PhotoOCR: Reading text in uncontrolled conditions. In ICCV.

  • Brakensiek, A., & Rigoll, G. (2004). Handwritten address recognition using hidden Markov models. In Reading and Learning (pp. 103–122). Berlin: Springer.

  • Brakensiek, A., Rottland, J., Kosmala, A., & Rigoll, G. (2000). Off-line handwriting recognition using various hybrid modeling techniques and character n-grams. In ICFHR.

  • Breuel, T. M. (2001). Segmentation of handprinted letter strings using a dynamic programming algorithm. In ICDAR.

  • Bunke, H., Roth, M., & Schukat-Talamazzini, E. G. (1995). Off-line cursive handwriting recognition using hidden Markov models. Pattern Recognition, 28(9), 1399–1413.

  • Cash, G. L., & Hatamian, M. (1987). Optical character recognition by the method of moments. Computer Vision, Graphics, and Image Processing, 39(3), 291–310.

  • Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: an evaluation of recent feature encoding methods. In BMVC.

  • Chen, M. Y., Kundu, A., & Zhou, J. (1994). Off-line handwritten word recognition using a hidden Markov model type stochastic network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 481–496. doi:10.1109/34.291449.

  • Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV SLCV workshop.

  • Dutta, S., Sankaran, N., Sankar, K. P., & Jawahar, C. V. (2012). Robust recognition of degraded documents using character n-grams. In DAS.

  • El-Yacoubi, A., Sabourin, R., Suen, C. Y., & Gilloux, M. (1999). An HMM-based approach for off-line unconstrained handwritten word modeling and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8), 752–760.

  • Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv.

  • Jain, R., & Jawahar, C. (2010). Towards more effective distance functions for word image matching. In DAS (pp. 363–370). ACM.

  • Jégou, H., Perronnin, F., Douze, M., Sánchez, J., Pérez, P., & Schmid, C. (2012). Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1704–1716.

  • Joachims, T. (2002). Optimizing search engines using clickthrough data. In SIGKDD.

  • Kedem, D., Tyree, S., Sha, F., Lanckriet, G. R., & Weinberger, K. Q. (2012). Non-linear metric learning. In NIPS.

  • Knerr, S., Augustin, E., Baret, O., & Price, D. (1998). Hidden Markov model based word recognition and its application to legal amount reading on French checks. Computer Vision and Image Understanding, 70(3), 404–419.

  • Koerich, A. L., Sabourin, R., & Suen, C. Y. (2003). Large vocabulary off-line handwriting recognition: A survey. Pattern Analysis and Applications, 6(2), 97–121.

  • Larochelle, H., Erhan, D., & Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

  • LeCun, Y., Bottou, L., Orr, G., & Muller, K. (1998). Efficient backprop. In G. Orr & K. Muller (Eds.), Neural networks: Tricks of the trade. New York: Springer.

  • Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419–444.

  • Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Madhvanath, S., & Govindaraju, V. (2001). The role of holistic paradigms in handwritten word recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 149–164.

  • Marti, U. V., & Bunke, H. (2001). Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system. International Journal of Pattern Recognition and Artificial Intelligence, 15, 65–90.

  • Mensink, T., Verbeek, J., Perronnin, F., & Csurka, G. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In NIPS.

  • Mishra, A., Alahari, K., & Jawahar, C. V. (2012). Scene text recognition using higher order language priors. In BMVC.

  • Mishra, A., Alahari, K., & Jawahar, C. V. (2012). Top-down and bottom-up cues for scene text recognition. In CVPR.

  • Mohamed, M. A., & Gader, P. D. (1996). Handwritten word recognition using segmentation-free hidden Markov modeling and segmentation-based dynamic programming techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(5), 548–554. doi:10.1109/34.494644.

  • Mori, S., Nishida, H., & Yamada, H. (1999). Optical character recognition. New York: Wiley.

  • Nagy, G. (2000). Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 38–62.

  • Neumann, L., & Matas, J. (2012). Real-time scene text localization and recognition. In CVPR.

  • Novikova, T., Barinova, O., Kohli, P., & Lempitsky, V. (2012). Large-lexicon attribute-consistent text recognition in natural images. In ECCV.

  • Nowozin, S., & Lampert, C. (2011). Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision.

  • Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.

  • Perronnin, F., Liu, Y., Sánchez, J., & Poirier, H. (2010). Large-scale image retrieval with compressed Fisher vectors. In CVPR.

  • Perronnin, F., Sánchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In CVPR.

  • Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In ECCV.

  • Rath, T. M., & Manmatha, R. (2003). Word image matching using dynamic time warping. In CVPR.

  • Rodríguez-Serrano, J. A., & Perronnin, F. (2012). A model-based sequence similarity with application to handwritten word spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2108–2120.

  • Rodriguez-Serrano, J. A., & Perronnin, F. (2013). Label embedding for text recognition. In BMVC.

  • Rodríguez-Serrano, J. A., Sandhawalia, H., Bala, R., Perronnin, F., & Saunders, C. (2012). Data-driven vehicle identification by image matching. In ECCV Workshop on Computer Vision for Vehicle Technology.

  • Sankar, K., Jawahar, C. V., & Manmatha, R. (2010). Nearest neighbor based collection OCR. In DAS.

  • Schölkopf, B., Smola, A., & Müller, K. R. (1998). Non-linear component analysis as a kernel eigenvalue problem. Neural Computation.

  • Senior, A. W., & Robinson, A. J. (1998). An off-line cursive handwriting recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 309–321. doi:10.1109/34.667887.

  • Vinciarelli, A., Bengio, S., & Bunke, H. (2004). Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 709–720.

  • Wang, K., Babenko, B., & Belongie, S. (2011). End-to-end scene text recognition. In ICCV.

  • Wang, K., & Belongie, S. (2010). Word spotting in the wild. In ECCV.

  • Weston, J., Bengio, S., & Usunier, N. (2010). Large scale image annotation: Learning to rank with joint word-image embeddings. In ECML.

  • Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In NIPS.

  • Yao, C., Bai, X., Shi, B., & Liu, W. (2014). Strokelets: A learned multi-scale representation for scene text recognition. In CVPR.

  • Zimmermann, M., Chappelier, J. C., & Bunke, H. (2006). Offline grammar-based recognition of handwritten sentences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), 818–821.

Acknowledgments

This work was partially funded by the French ANR project FIRE-ID.

Author information

Corresponding author

Correspondence to Jose A. Rodriguez-Serrano.

Additional information

Communicated by Tilo Burghardt, Majid Mirmehdi, Walterio Mayol-Cuevas and Dima Damen.

About this article

Cite this article

Rodriguez-Serrano, J.A., Gordo, A. & Perronnin, F. Label Embedding: A Frugal Baseline for Text Recognition. Int J Comput Vis 113, 193–207 (2015). https://doi.org/10.1007/s11263-014-0793-6

  • DOI: https://doi.org/10.1007/s11263-014-0793-6