research-article

Rethinking the Combined and Individual Orders of Derivative of States for Differential Recurrent Neural Networks: Deep Differential Recurrent Neural Networks

Published: 12 September 2019

Abstract

Owing to its special gating scheme, Long Short-Term Memory (LSTM) has shown greater potential for processing complex sequential information than the traditional Recurrent Neural Network (RNN). The conventional LSTM, however, does not take into account the salient spatio-temporal dynamics present in sequential input data. This problem was first addressed by the differential Recurrent Neural Network (dRNN), which uses a differential gating scheme known as Derivative of States (DoS). DoS uses higher-order derivatives of the internal state to analyze the change in information gain caused by salient motions between successive frames. A weighted combination of several orders of DoS is then used to modulate the gates in dRNN. While each individual order of DoS is good at modeling a certain level of salient spatio-temporal sequences, the sum of all orders of DoS could distort the detected motion patterns. To address this problem, we propose to control the LSTM gates via individual orders of DoS. To fully utilize the different orders of DoS, we further propose to stack multiple levels of LSTM cells in increasing order of state derivatives. The proposed model progressively builds up the ability of the LSTM gates to detect salient dynamical patterns, with deeper stacked layers modeling higher orders of DoS; the proposed LSTM model is therefore termed the deep differential Recurrent Neural Network (d2RNN). The effectiveness of the proposed model is demonstrated on three publicly available human activity datasets: NUS-HGA, Violent-Flows, and UCF101. The proposed model outperforms both LSTM-based and non-LSTM-based state-of-the-art algorithms.
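
As a rough illustration of what an "order" of DoS means, the sketch below approximates the n-th time derivative of a sequence of internal states with repeated backward differences. This is an assumption made for illustration only: the abstract does not specify the discretization, and the function name `derivative_of_states` and the list-of-vectors state layout are hypothetical, not taken from the paper.

```python
from typing import List


def derivative_of_states(states: List[List[float]], order: int) -> List[List[float]]:
    """Approximate the order-th time derivative of a state sequence.

    `states` holds one state vector per time step. Each pass replaces the
    sequence with differences of consecutive steps (s_t - s_{t-1}), so the
    result has `order` fewer time steps than the input.
    Assumption: a unit time step and plain backward differences.
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    current = [list(s) for s in states]  # copy so the input is not modified
    for _ in range(order):
        current = [
            [later - earlier for earlier, later in zip(prev, nxt)]
            for prev, nxt in zip(current, current[1:])
        ]
    return current


# A one-dimensional state following s_t = t^2 for t = 0..4.
trajectory = [[0.0], [1.0], [4.0], [9.0], [16.0]]
first_order = derivative_of_states(trajectory, 1)   # differences of consecutive states
second_order = derivative_of_states(trajectory, 2)  # differences of the first-order values
```

In this toy trajectory the first-order values grow over time while the second-order values stay constant, which shows how separate orders can carry different information about the same motion. How each order enters the LSTM gates is left out here because the abstract does not give the gate equations.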



      • Published in

        ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 3
        August 2019
        331 pages
        ISSN: 1551-6857
        EISSN: 1551-6865
        DOI: 10.1145/3352586

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 12 September 2019
        • Accepted: 1 May 2019
        • Revised: 1 April 2019
        • Received: 1 January 2019


        Qualifiers

        • research-article
        • Research
        • Refereed