
Time-varying LSTM networks for action recognition


Abstract

We describe a Time-Varying Long Short-Term Memory (TV-LSTM) recurrent neural network architecture for human action recognition. The main innovation of this architecture is its use of hybrid weights: shared weights and non-shared weights, which we refer to as varying weights. The varying weights enhance the ability of LSTMs to represent videos and other sequential data. We evaluate TV-LSTMs on the UCF-11, HMDB-51, and UCF-101 human action datasets and achieve top-1 accuracies of 99.64%, 57.52%, and 85.06%, respectively. The model performs competitively against models that use both RGB and other features, such as optical flow and improved Dense Trajectories. We also propose and analyze methods for selecting the varying weights.
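To make the hybrid-weight idea concrete, the sketch below shows one plausible reading of a time-varying LSTM cell in PyTorch: the LSTM gate parameters are shared across all timesteps, while a per-timestep linear projection supplies the non-shared (varying) weights. The class name TVLSTM, the additive mixing of the two paths, and the decision to vary the input projection are illustrative assumptions, not the exact formulation from the paper.

import torch
import torch.nn as nn

class TVLSTM(nn.Module):
    """One plausible reading of a time-varying LSTM: shared gate
    weights plus non-shared (per-timestep) input projections."""
    def __init__(self, input_size, hidden_size, seq_len):
        super().__init__()
        self.hidden_size = hidden_size
        # Shared weights: one set of LSTM gate parameters reused at every step.
        self.cell = nn.LSTMCell(input_size, hidden_size)
        # Varying weights: a separate projection for each timestep.
        self.varying = nn.ModuleList(
            [nn.Linear(input_size, input_size, bias=False) for _ in range(seq_len)]
        )

    def forward(self, x):
        # x: (batch, seq_len, input_size), e.g. per-frame CNN features.
        batch = x.size(0)
        h = x.new_zeros(batch, self.hidden_size)
        c = x.new_zeros(batch, self.hidden_size)
        for t in range(x.size(1)):
            # Mix the shared path with the timestep-specific (varying) path.
            x_t = x[:, t] + self.varying[t](x[:, t])
            h, c = self.cell(x_t, (h, c))
        return h  # final hidden state, fed to an action classifier

In this reading, each timestep owns its own projection, so the sequence length must be fixed and the parameter count grows with it; which weights to vary, and how many, is exactly the kind of trade-off that methods for selecting varying weights would have to manage.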


Notes

  1. The pre-trained ResNet-152 model can be downloaded from http://data.mxnet.io/models/imagenet-11k/ (a loading sketch follows below).
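A minimal loading sketch using MXNet's module API, assuming the checkpoint files resnet-152-symbol.json and resnet-152-0000.params have been downloaded from the URL above into the working directory (these file names follow MXNet's usual checkpoint convention and are an assumption here):

import mxnet as mx

# load_checkpoint('resnet-152', 0) reads resnet-152-symbol.json
# and resnet-152-0000.params from the current directory.
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-152', 0)

# Bind the network for inference on single 224x224 RGB frames.
mod = mx.mod.Module(symbol=sym, context=mx.cpu(), label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)
# mod can now produce per-frame features to feed the recurrent model.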


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61672299). We would like to thank Songle Chen for his valuable advice.

Author information


Corresponding author

Correspondence to Zichao Ma.


Cite this article

Ma, Z., Sun, Z. Time-varying LSTM networks for action recognition. Multimed Tools Appl 77, 32275–32285 (2018). https://doi.org/10.1007/s11042-018-6260-6
