Abstract
We describe the architecture of Time-Varying Long Short-Term Memory recurrent neural networks (TV-LSTMs) for human action recognition. The main innovation of this architecture is its use of hybrid weights: shared weights combined with non-shared weights, which we refer to as varying weights. The varying weights enhance the ability of LSTMs to represent videos and other sequential data. We evaluate TV-LSTMs on the UCF-11, HMDB-51, and UCF-101 human action datasets, achieving top-1 accuracies of 99.64%, 57.52%, and 85.06%, respectively. The model performs competitively against models that use both RGB and other features, such as optical flow and improved Dense Trajectories. We also propose and analyze methods for selecting the varying weights.
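The hybrid-weight idea described above can be sketched as a standard LSTM cell whose input projection mixes a shared weight matrix with a per-timestep "varying" matrix. The paper's exact formulation is not given in this excerpt, so the additive combination, the function name `tv_lstm_step`, and all shapes below are illustrative assumptions, not the authors' method:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tv_lstm_step(x, h, c, W_shared, W_vary_t, U, b):
    """One hypothetical TV-LSTM step: the input projection combines a
    shared matrix with a timestep-specific varying matrix.
    Shapes: W_shared, W_vary_t: (4H, D); U: (4H, H); b: (4H,)."""
    z = (W_shared + W_vary_t) @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2 * H])     # forget gate
    o = sigmoid(z[2 * H:3 * H]) # output gate
    g = np.tanh(z[3 * H:])      # candidate cell state
    c_new = f * c + i * g       # update cell state
    h_new = o * np.tanh(c_new)  # emit hidden state
    return h_new, c_new

# Run over a short random sequence with one varying matrix per timestep.
rng = np.random.default_rng(0)
D, H, T = 8, 4, 5
W_shared = 0.1 * rng.standard_normal((4 * H, D))
W_vary = 0.1 * rng.standard_normal((T, 4 * H, D))  # varying weights
U = 0.1 * rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(T):
    x_t = rng.standard_normal(D)
    h, c = tv_lstm_step(x_t, h, c, W_shared, W_vary[t], U, b)
print(h.shape)  # (4,)
```

In a real model the varying matrices would be learned (and the paper additionally analyzes how to select which weights vary); here they are random only to keep the sketch self-contained.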
Notes
The pre-trained ResNet-152 model can be downloaded from http://data.mxnet.io/models/imagenet-11k/
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 61672299). We would like to thank Songle Chen for his valuable advice.
Cite this article
Ma, Z., Sun, Z. Time-varying LSTM networks for action recognition. Multimed Tools Appl 77, 32275–32285 (2018). https://doi.org/10.1007/s11042-018-6260-6