Skip to main content

Hybrid CNN-HMM Model for Street View House Number Recognition

  • Conference paper
  • First Online:
Computer Vision - ACCV 2014 Workshops (ACCV 2014)

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 9008))

Included in the following conference series:

Abstract

We present an integrated model for using deep neural networks to solve street view number recognition problem. We didn’t follow the traditional way of first doing segmentation then perform recognition on isolated digits, but formulate the problem as a sequence recognition problem under probabilistic treatment. Our model leverage a deep Convolutional Neural Network(CNN) to represent the highly variable appearance of digits in natural images. Meanwhile, hidden Markov model(HMM) is used to deal with the dynamics of the sequence. They are combined in a hybrid fashion to form the hybrid CNN-HMM architecture. By using this model we can perform the training and recognition procedure both at word level. There is no explicit segmentation operation at all which save lots of labour of sophisticated segmentation algorithm design or finegrained character labeling. To the best of our knowledge, this is the first time using hybrid CNN-HMM model directly on the whole scene text images. Experiments show that deep CNN can dramaticly boost the performance compared with shallow Gausian Mixture Model(GMM)-HMM model. We obtaied competitive results on the street view house number(SVHN) dataset.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Nagy, G.: Twenty years of document image analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell. 22, 38–62 (2000)

    Article  Google Scholar 

  2. Cheriet, M., El Yacoubi, M., Fujisawa, H., Lopresti, D., Lorette, G.: Handwriting recognition research: twenty years of achievement and beyond. Pattern Recogn. 42, 3131–3135 (2009)

    Article  Google Scholar 

  3. Ohya, J., Shio, A., Akamatsu, S.: Recognizing characters in scene images. IEEE Trans. Pattern Anal. Mach. Intell. 16, 214–220 (1994)

    Article  Google Scholar 

  4. Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, pp. 1457–1464 (2011)

    Google Scholar 

  5. Wang, T., Wu, D.J., Coates, A., Ng, A.Y.: End-to-end text recognition with convolutional neural networks. In: 2012 21st International Conference on Pattern Recognition (ICPR), IEEE, pp. 3304–3308 (2012)

    Google Scholar 

  6. Neumann, L., Matas, J.: A method for text localization and recognition in real-world images. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 770–783. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  7. Neumann, L., Matas, J.: Real-time scene text localization and recognition. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 3538–3545 (2012)

    Google Scholar 

  8. Alsharif, O., Pineau, J.: End-to-end text recognition with hybrid HMM maxout models (2013). arXiv preprint arXiv:1310.1811

  9. Bissacco, A., Cummins, M., Netzer, Y., Neven, H.: PhotoOCR: reading text in uncontrolled conditions. In: ICCV (2013)

    Google Scholar 

  10. Neumann, L., Matas, J.: Scene text localization and recognition with oriented stroke detection. In: ICCV (2013)

    Google Scholar 

  11. Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Convolutional neural network committees for handwritten character classification. In: ICDAR, pp. 1250–1254 (2011)

    Google Scholar 

  12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS, vol. 1, p. 4 (2012)

    Google Scholar 

  13. Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks (2013). arXiv preprint arXiv:1302.4389

  14. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. CoRR abs/1311.2901 (2013)

    Google Scholar 

  15. Goodfellow, I.J., Bulatov, Y., Ibarz, J., Arnoud, S., Shet, V.: Multi-digit number recognition from street view imagery using deep convolutional neural networks (2014). arXiv preprint arXiv:1312.6082

  16. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998)

    Article  Google Scholar 

  17. Matan, O., Burges, C.J.C., LeCun, Y., Denker, J.S.: Multi-digit recognition using a space displacement neural network. In: NIPS, pp. 488–495 (1991)

    Google Scholar 

  18. Bourlard, H.A., Morgan, N.: Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, Norwell (1993). ISBN: 0792393961

    Google Scholar 

  19. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20, 30–42 (2012)

    Article  Google Scholar 

  20. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Sig. Process. Mag. 29, 82–97 (2012)

    Article  Google Scholar 

  21. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 6645–6649 (2013)

    Google Scholar 

  22. Sainath, T.N., Kingsbury, B., Ramabhadran, B., Fousek, P., Novak, P., Mohamed, A.R.: Making deep belief networks effective for large vocabulary continuous speech recognition. In: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE, pp. 30–35 (2011)

    Google Scholar 

  23. Forney, G.D.J.: The viterbi algorithm. Proc. IEEE 61, 268–278 (1973)

    Article  MathSciNet  Google Scholar 

  24. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: ICCV, pp. 2146–2153 (2009)

    Google Scholar 

  25. Baum, L.E., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41, 164–171 (1970)

    Article  MATH  MathSciNet  Google Scholar 

  26. Baum, L.E.: An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. Inequalities 3, 1–18 (1972)

    Google Scholar 

  27. Richard, M.D., Lippmann, R.P.: Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput. 3, 461–483 (1991)

    Article  Google Scholar 

  28. Morgan, N., Bourlard, H.: Continuous speech recognition. IEEE Sig. Process. Mag. 12, 24–42 (1995)

    Article  Google Scholar 

  29. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning, vol. 2011 (2011)

    Google Scholar 

  30. Kapadia, S., Valtchev, V., Young, S.: Mmi training for continuous phoneme recognition on the timit database. In: 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 1993, vol. 2, pp. 491–494 (1993)

    Google Scholar 

  31. Juang, B.H., Hou, W., Lee, C.H.: Minimum classification error rate methods for speech recognition. IEEE Trans. Speech Audio Process. 5, 257–265 (1997)

    Article  Google Scholar 

  32. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001, pp. 282–289 (2001)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Qiang Guo .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Guo, Q., Tu, D., Lei, J., Li, G. (2015). Hybrid CNN-HMM Model for Street View House Number Recognition. In: Jawahar, C., Shan, S. (eds) Computer Vision - ACCV 2014 Workshops. ACCV 2014. Lecture Notes in Computer Science(), vol 9008. Springer, Cham. https://doi.org/10.1007/978-3-319-16628-5_22

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-16628-5_22

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16627-8

  • Online ISBN: 978-3-319-16628-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics