Abstract
We describe the architecture of Time-Varying Long Short-Term Memory recurrent neural networks (TV-LSTMs) for human action recognition. The main innovation of this architecture is its use of hybrid weights: shared weights combined with non-shared weights, which we refer to as varying weights. The varying weights enhance the ability of LSTMs to represent videos and other sequential data. We evaluate TV-LSTMs on the UCF-11, HMDB-51, and UCF-101 human action datasets, achieving top-1 accuracies of 99.64%, 57.52%, and 85.06%, respectively. The model performs competitively against models that use both RGB and other features, such as optical flow and improved Dense Trajectories. We also propose and analyze methods for selecting the varying weights.
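The hybrid-weight idea described above can be sketched as a standard LSTM cell whose input projection mixes a shared weight matrix with a per-timestep "varying" matrix. The paper's exact formulation is not given in this excerpt, so the additive combination, the function name `tv_lstm_step`, and all shapes below are illustrative assumptions, not the authors' method:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tv_lstm_step(x, h, c, W_shared, W_vary_t, U, b):
    """One hypothetical TV-LSTM step: the input projection combines a
    shared matrix with a timestep-specific varying matrix.
    Shapes: W_shared, W_vary_t: (4H, D); U: (4H, H); b: (4H,)."""
    z = (W_shared + W_vary_t) @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[:H])          # input gate
    f = sigmoid(z[H:2 * H])     # forget gate
    o = sigmoid(z[2 * H:3 * H]) # output gate
    g = np.tanh(z[3 * H:])      # candidate cell state
    c_new = f * c + i * g       # update cell state
    h_new = o * np.tanh(c_new)  # emit hidden state
    return h_new, c_new

# Run over a short random sequence with one varying matrix per timestep.
rng = np.random.default_rng(0)
D, H, T = 8, 4, 5
W_shared = 0.1 * rng.standard_normal((4 * H, D))
W_vary = 0.1 * rng.standard_normal((T, 4 * H, D))  # varying weights
U = 0.1 * rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(T):
    x_t = rng.standard_normal(D)
    h, c = tv_lstm_step(x_t, h, c, W_shared, W_vary[t], U, b)
print(h.shape)  # (4,)
```

In a real model the varying matrices would be learned (and the paper additionally analyzes how to select which weights vary); here they are random only to keep the sketch self-contained.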
Notes
The pre-trained ResNet-152 model can be downloaded from http://data.mxnet.io/models/imagenet-11k/
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 61672299). We would like to thank Songle Chen for his valuable advice.
Cite this article
Ma, Z., Sun, Z. Time-varying LSTM networks for action recognition. Multimed Tools Appl 77, 32275–32285 (2018). https://doi.org/10.1007/s11042-018-6260-6