Modeling Long-Term Dependencies from Videos Using Deep Multiplicative Neural Networks

Author: Meijing Shan

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Volume 16, Issue 2s

Article No.: 63, Pages 1 - 19

https://doi.org/10.1145/3357797

Published: 14 July 2020

Abstract

Understanding the temporal dependencies of videos is fundamental for vision problems, but deep learning–based models are still insufficient in this field. In this article, we propose a novel deep multiplicative neural network (DMNN) for learning hierarchical long-term representations from video. The DMNN is built upon a multiplicative block that remembers the pairwise transformations between consecutive frames using multiplicative interactions rather than the regular weighted-sum ones. The block is slid over the timesteps to update the memory of the network on the frame pairs. A deep architecture can be implemented by stacking multiple layers of the sliding blocks. The multiplicative interactions lead to exact, rather than approximate, modeling of temporal dependencies. The memory mechanism can remember the temporal dependencies for an arbitrary length of time. The multiple layers output multiple-level representations that reflect the multi-timescale structure of video. Moreover, to address the difficulty of training DMNNs, we derive a theoretically sound convergent method, which leads to fast and stable convergence. We demonstrate new state-of-the-art classification performance with the proposed networks on the UCF101 dataset, and their effectiveness in capturing complicated temporal dependencies on a variety of synthetic datasets.
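The abstract does not give the block's equations, so the following is only a rough sketch of the general idea it describes: a block that combines two consecutive frames through an elementwise product of projections (a multiplicative interaction) instead of a weighted sum, is slid over the frame pairs, and is stacked in layers. The function names, the factored form, the `tanh` nonlinearity, and the exponential-average memory update with `decay` are all assumptions made for illustration. They are not the DMNN formulation or its training method.

```python
import math
import random


def init_params(rng, in_dim, n_factors, out_dim):
    """Draw small random weights for one block (hypothetical initialisation)."""
    def matrix(rows, cols):
        scale = 1.0 / math.sqrt(cols)
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    return matrix(n_factors, in_dim), matrix(n_factors, in_dim), matrix(out_dim, n_factors)


def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def multiplicative_block(x_prev, x_next, params):
    """Encode the relation between two frames via a product of projections."""
    U, V, W = params
    # The elementwise product is the multiplicative interaction; a regular
    # weighted-sum unit would add the two projections instead.
    factors = [a * b for a, b in zip(matvec(U, x_prev), matvec(V, x_next))]
    return [math.tanh(z) for z in matvec(W, factors)]


def slide_block(frames, params, decay=0.5):
    """Slide the block over consecutive frame pairs, updating a running memory."""
    memory = [0.0] * len(params[2])
    states = []
    for x_prev, x_next in zip(frames, frames[1:]):
        code = multiplicative_block(x_prev, x_next, params)
        # Assumed memory update: an exponential moving average of pair codes.
        memory = [decay * m + (1.0 - decay) * c for m, c in zip(memory, code)]
        states.append(memory)
    return states


rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(6)]  # 6 toy "frames"
layer1 = init_params(rng, in_dim=16, n_factors=32, out_dim=8)
layer2 = init_params(rng, in_dim=8, n_factors=16, out_dim=4)

h1 = slide_block(frames, layer1)  # one state per consecutive frame pair
h2 = slide_block(h1, layer2)      # second layer slides over first-layer states
print(len(h1), len(h1[0]), len(h2), len(h2[0]))  # 5 8 4 4
```

Each layer consumes pairs of its input sequence, so it emits one fewer state than it receives. In this toy setup the second layer therefore summarizes changes over a slightly longer window than the first, which loosely mirrors the multi-timescale structure mentioned in the abstract.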

Cited By

Yang Y, Song A, Chang Q, Zhao H, Kong W, Xue Q, Xue Q. (2022). Improving the Use of Blockchain Technology in Stroke Care Information Management Systems. Computational and Mathematical Methods in Medicine 2022, 1-9. Online publication date: 26-Sep-2022.
https://doi.org/10.1155/2022/2642841
Zhang J, Li Z, Tan R, Liu C. (2021). [Retracted] Design and Application of Electronic Rehabilitation Medical Record (ERMR) Sharing Scheme Based on Blockchain Technology. BioMed Research International 2021:1. Online publication date: 29-Aug-2021.
https://doi.org/10.1155/2021/3540830

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Hierarchical representations
      2. Computer vision tasks
        Activity recognition and understanding

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 16, Issue 2s

Special Issue on Smart Communications and Networking for Future Video Surveillance and Special Section on Extended MMSYS-NOSSDAV 2019 Best Papers

April 2020

291 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3407689

Editor:
Alberto Del Bimbo
University of Firenze, Italy

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 July 2020

Online AM: 07 May 2020

Accepted: 01 August 2019

Revised: 01 July 2019

Received: 01 May 2019

Published in TOMM Volume 16, Issue 2s

Qualifiers

Research-article
Research
Refereed

Funding Sources

Natural Science Foundation of Ningbo
Natural Science Foundation of Shanghai
Science Foundation of Department of Education of Zhejiang
