Comparing Hybrid NN-HMM and RNN for Temporal Modeling in Gesture Recognition

Granger, Nicolas; el Yacoubi, Mounîm A.

doi:10.1007/978-3-319-70096-0_16


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10635)

Included in the following conference series:

International Conference on Neural Information Processing


Abstract

This paper provides an extended comparison of two temporal models for gesture recognition: hybrid Neural Network-Hidden Markov Models (NN-HMM) and Recurrent Neural Networks (RNN), both of which have recently achieved state-of-the-art performance. Both models were evaluated within the same body of work, with similar representation-learning capacity and comparable computational cost. For both, we integrated recent contributions to model architectures and training techniques. We show that, for this task, hybrid NN-HMM models remain competitive with RNNs in a standard setting. For both models, we analyze the influence of the training objective function on the final evaluation metric. Finally, we evaluate temporal convolution as a means of improving context modeling, a technique recently reported to improve gesture recognition accuracy.
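To make the hybrid NN-HMM idea concrete, here is a minimal plain-Python sketch of the classic hybrid decoding step. It assumes a network that outputs per-frame posteriors over HMM states: those posteriors are divided by the state priors to obtain scaled likelihoods, which an HMM then decodes with the Viterbi algorithm. The function names (safe_log, scaled_log_likelihoods, viterbi), the two-state toy model and all numbers are hypothetical illustrations, not the authors' implementation or data.

```python
import math
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def safe_log(x: float) -> float:
    """Natural log that maps 0 to -inf instead of raising ValueError."""
    return math.log(x) if x > 0 else NEG_INF


def scaled_log_likelihoods(
    posteriors: Sequence[Sequence[float]], priors: Sequence[float]
) -> List[List[float]]:
    """Hybrid trick (Bayes' rule): p(x_t | s) / p(x_t) = p(s | x_t) / p(s).
    Log posterior minus log prior gives a scaled emission log-likelihood.
    Priors are assumed to be strictly positive."""
    return [
        [safe_log(p) - safe_log(q) for p, q in zip(frame, priors)]
        for frame in posteriors
    ]


def viterbi(
    log_emissions: Sequence[Sequence[float]],
    log_trans: Sequence[Sequence[float]],
    log_init: Sequence[float],
) -> Tuple[List[int], float]:
    """Most likely HMM state sequence given per-frame emission log-scores."""
    n = len(log_init)
    delta = [log_init[s] + log_emissions[0][s] for s in range(n)]
    backptrs: List[List[int]] = []
    for frame in log_emissions[1:]:
        new_delta, ptr = [], []
        for j in range(n):
            # Score of reaching state j from each predecessor state i.
            scores = [delta[i] + log_trans[i][j] for i in range(n)]
            best_i = max(range(n), key=scores.__getitem__)
            new_delta.append(scores[best_i] + frame[j])
            ptr.append(best_i)
        delta = new_delta
        backptrs.append(ptr)
    # Backtrack from the best-scoring final state.
    state = max(range(n), key=delta.__getitem__)
    best_score = delta[state]
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best_score


# Hypothetical two-state toy example (0 = "rest", 1 = "gesture").
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
priors = [0.5, 0.5]
log_trans = [[safe_log(0.7), safe_log(0.3)], [safe_log(0.3), safe_log(0.7)]]
log_init = [safe_log(0.5), safe_log(0.5)]
path, score = viterbi(scaled_log_likelihoods(posteriors, priors), log_trans, log_init)
print(path)  # [0, 0, 1, 1]
```

Dividing by the priors converts the network's posteriors p(s | x_t) into likelihoods scaled by p(x_t), which is the emission quantity the HMM expects; because p(x_t) is the same for every state at a given frame, it does not change which path Viterbi selects.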



Author information

Authors and Affiliations

SAMOVAR, Télécom SudParis, CNRS, University of Paris-Saclay, 9 rue Charles Fourier, 91000, Évry, France
Nicolas Granger & Mounîm A. el Yacoubi


Corresponding author

Correspondence to Nicolas Granger.

Editor information

Editors and Affiliations

Guangdong University of Technology, Guangzhou, China
Derong Liu
Guangdong University of Technology, Guangzhou, China
Shengli Xie
South China University of Technology, Guangzhou, China
Yuanqing Li
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Dongbin Zhao
King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
El-Sayed M. El-Alfy


About this paper

Cite this paper

Granger, N., el Yacoubi, M.A. (2017). Comparing Hybrid NN-HMM and RNN for Temporal Modeling in Gesture Recognition. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10635. Springer, Cham. https://doi.org/10.1007/978-3-319-70096-0_16


DOI: https://doi.org/10.1007/978-3-319-70096-0_16
Published: 26 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70095-3
Online ISBN: 978-3-319-70096-0
eBook Packages: Computer Science, Computer Science (R0)
