research-article

Modeling Long-Term Dependencies from Videos Using Deep Multiplicative Neural Networks

Published: 14 July 2020

Abstract

Understanding the temporal dependencies in video is fundamental to vision problems, but deep learning-based models remain inadequate for this task. In this article, we propose a novel deep multiplicative neural network (DMNN) for learning hierarchical long-term representations from video. The DMNN is built on a multiplicative block that remembers the pairwise transformations between consecutive frames using multiplicative interactions rather than the usual weighted sums. The block is slid over the timesteps to update the network's memory on successive frame pairs. A deep architecture is obtained by stacking multiple layers of these sliding blocks. The multiplicative interactions lead to exact, rather than approximate, modeling of temporal dependencies. The memory mechanism can retain temporal dependencies for an arbitrary length of time. The stacked layers output representations at multiple levels that reflect the multi-timescale structure of video. Moreover, to address the difficulty of training DMNNs, we derive a theoretically sound training method that yields fast and stable convergence. We demonstrate new state-of-the-art classification performance with the proposed networks on the UCF101 dataset, and show that they capture complicated temporal dependencies on a variety of synthetic datasets.
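The abstract does not give the block's equations, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes a factored multiplicative interaction in the style of gated autoencoders (two frames are projected, the projections are multiplied elementwise, and the product is mapped to a transformation code) and a simple running-average memory. The function names, the matrices `U`, `V`, `W`, and the `decay` parameter are all hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def make_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian weight matrix (illustrative initialisation only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def multiplicative_block(x_prev, x_next, U, V, W):
    """Encode the transformation between two consecutive frames.

    Assumption: a factored multiplicative interaction. The factor
    responses of the two frames are multiplied elementwise instead of
    being added, as they would be in a regular weighted-sum layer.
    """
    f_prev = matvec(U, x_prev)
    f_next = matvec(V, x_next)
    product = [a * b for a, b in zip(f_prev, f_next)]
    return matvec(W, product)


def slide_block(frames, U, V, W, decay=0.5):
    """Slide the block over consecutive frame pairs and update a memory.

    Assumption: the memory is a running average of the pairwise codes;
    the paper's actual memory update is not stated in the abstract.
    """
    memory = [0.0] * len(W)
    for x_prev, x_next in zip(frames, frames[1:]):
        code = multiplicative_block(x_prev, x_next, U, V, W)
        memory = [decay * old + (1.0 - decay) * new
                  for old, new in zip(memory, code)]
    return memory


if __name__ == "__main__":
    rng = random.Random(0)
    U = make_matrix(4, 3, rng)   # 3-dim frames -> 4 factors
    V = make_matrix(4, 3, rng)
    W = make_matrix(2, 4, rng)   # 4 factors -> 2 memory units
    video = [[rng.random() for _ in range(3)] for _ in range(5)]
    print(slide_block(video, U, V, W))
```

Under these assumptions the block is bilinear: doubling either frame doubles the code, which is what distinguishes it from a layer that sums the two inputs. A deeper variant could feed the per-pair codes of one layer as the "frames" of the next, but how the DMNN actually stacks its layers is not specified here.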


Cited By

  • (2022) Improving the Use of Blockchain Technology in Stroke Care Information Management Systems. Computational and Mathematical Methods in Medicine 2022 (1-9). DOI: 10.1155/2022/2642841. Online publication date: 26-Sep-2022
  • (2021) [Retracted] Design and Application of Electronic Rehabilitation Medical Record (ERMR) Sharing Scheme Based on Blockchain Technology. BioMed Research International 2021:1. DOI: 10.1155/2021/3540830. Online publication date: 29-Aug-2021


Information & Contributors

Information

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2s
Special Issue on Smart Communications and Networking for Future Video Surveillance and Special Section on Extended MMSYS-NOSSDAV 2019 Best Papers
April 2020
291 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3407689
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 July 2020
Online AM: 07 May 2020
Accepted: 01 August 2019
Revised: 01 July 2019
Received: 01 May 2019
Published in TOMM Volume 16, Issue 2s


Author Tags

  1. Deep learning
  2. temporal dependencies
  3. video recognition

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Natural Science Foundation of Ningbo
  • Natural Science Foundation of Shanghai
  • Science Foundation of Department of Education of Zhejiang


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 7
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 28 Feb 2025
