research-article

Rethinking the Combined and Individual Orders of Derivative of States for Differential Recurrent Neural Networks: Deep Differential Recurrent Neural Networks

Authors:
Naifan Zhuang, University of Central Florida, Orlando, FL
Guo-Jun Qi, University of Central Florida, Orlando, FL
The Duc Kieu, University of the West Indies, Trinidad and Tobago
Kien A. Hua, University of Central Florida, Orlando, FL
ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 3, Article No. 83, pp. 1–21. https://doi.org/10.1145/3337928

Published: 12 September 2019

Abstract

Owing to its special gating schemes, Long Short-Term Memory (LSTM) has shown greater potential than the traditional Recurrent Neural Network (RNN) for processing complex sequential information. The conventional LSTM, however, does not account for the salient spatio-temporal dynamics present in sequential input data. This problem was first addressed by the differential Recurrent Neural Network (dRNN), which uses a differential gating scheme known as Derivative of States (DoS). DoS uses higher orders of internal state derivatives to analyze the change in information gain caused by salient motions between successive frames. dRNN then modulates its gates with a weighted combination of several orders of DoS. While each individual order of DoS is good at modeling a certain level of salient spatio-temporal sequences, the sum of all the orders of DoS could distort the detected motion patterns. To address this problem, we propose to control the LSTM gates via individual orders of DoS. To fully utilize the different orders of DoS, we further propose to stack multiple levels of LSTM cells in increasing order of state derivatives. The proposed model progressively builds up the ability of the LSTM gates to detect salient dynamical patterns, with deeper stacked layers modeling higher orders of DoS; the proposed LSTM model is therefore termed the deep differential Recurrent Neural Network (d²RNN). The effectiveness of the proposed model is demonstrated on three publicly available human activity datasets: NUS-HGA, Violent-Flows, and UCF101. The proposed model outperforms both LSTM-based and non-LSTM-based state-of-the-art algorithms.
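The contrast between combined and individual orders of DoS can be illustrated with a small sketch. It is not the authors' implementation: it assumes that the n-th order DoS is approximated by an n-th order backward finite difference of the internal state, it reduces each gate to a single scalar, and it omits the input and hidden-state terms a real LSTM gate would include. The function names (`backward_differences`, `combined_gate`, `individual_gate`) and the weights are hypothetical.

```python
import math


def backward_differences(states, max_order):
    """Return [d^0 s, d^1 s, ..., d^max_order s] at the newest time step.

    `states` lists state vectors from oldest to newest. Entry n is the
    n-th order backward finite difference, used here as an assumed
    discrete stand-in for the n-th order Derivative of States (DoS).
    """
    if len(states) < max_order + 1:
        raise ValueError("need at least max_order + 1 states")
    current = [list(s) for s in states]
    result = [current[-1]]
    for _ in range(max_order):
        # Differencing neighbouring time steps raises the order by one.
        current = [
            [b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(current, current[1:])
        ]
        result.append(current[-1])
    return result


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def combined_gate(dos, weights, bias=0.0):
    """Scalar gate driven by a weighted sum over DoS orders 1..N."""
    total = bias
    for order in range(1, len(dos)):
        total += sum(w * d for w, d in zip(weights[order - 1], dos[order]))
    return sigmoid(total)


def individual_gate(dos, order, weight, bias=0.0):
    """Scalar gate driven by a single DoS order, as one stacked layer might use."""
    return sigmoid(bias + sum(w * d for w, d in zip(weight, dos[order])))


# Three consecutive two-dimensional states, oldest first (illustrative values).
states = [[0.0, 1.0], [1.0, 3.0], [4.0, 4.0]]
dos = backward_differences(states, max_order=2)

# Hypothetical weights chosen so the two orders cancel in the combined sum.
combined = combined_gate(dos, weights=[[1.0, 0.0], [-1.5, 0.0]])
first_only = individual_gate(dos, order=1, weight=[1.0, 0.0])
second_only = individual_gate(dos, order=2, weight=[-1.5, 0.0])
print(dos, combined, first_only, second_only)
```

With these illustrative weights the first-order and second-order terms cancel, so the combined gate sits at its neutral value even though each order on its own drives its gate away from neutral. This is one simple way a sum over orders could mask motion that an individual order detects; stacking layers so that layer n is gated by order n keeps each signal separate.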

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Activity recognition and understanding
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks

Published in
ACM Transactions on Multimedia Computing, Communications, and Applications Volume 15, Issue 3
August 2019
331 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3352586
Editor: Alberto Del Bimbo, University of Firenze, Italy
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 September 2019
- Accepted: 1 May 2019
- Revised: 1 April 2019
- Received: 1 January 2019
Author Tags
Deep differential recurrent neural networks
activity recognition
derivative of state
Qualifiers
- research-article
- Research
- Refereed