Abstract
Owing to its gating scheme, Long Short-Term Memory (LSTM) has shown greater potential for processing complex sequential information than the traditional Recurrent Neural Network (RNN). The conventional LSTM, however, does not take into account the salient spatio-temporal dynamics present in the sequential input data. This problem was first addressed by the differential Recurrent Neural Network (dRNN), which uses a differential gating scheme known as Derivative of States (DoS). DoS uses higher-order derivatives of the internal state to analyze the change in information gain originating from salient motions between successive frames. In dRNN, a weighted combination of several orders of DoS is then used to modulate the gates. While each individual order of DoS is good at modeling a certain level of salient spatio-temporal sequences, the sum of all the orders of DoS could distort the detected motion patterns. To address this problem, we propose to control the LSTM gates via individual orders of DoS. To fully utilize the different orders of DoS, we further propose to stack multiple levels of LSTM cells in increasing order of state derivatives. The proposed model progressively builds up the ability of the LSTM gates to detect salient dynamical patterns, with deeper stacked layers modeling higher orders of DoS; the proposed LSTM model is therefore termed the deep differential Recurrent Neural Network (d2RNN). The effectiveness of the proposed model is demonstrated on three publicly available human activity datasets: NUS-HGA, Violent-Flows, and UCF101. On these datasets, the proposed model outperforms both LSTM-based and non-LSTM-based state-of-the-art algorithms.
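The abstract gives no equations, so the following is only a rough sketch of the idea under stated assumptions, not the authors' formulation. It assumes that an order-n DoS can be approximated in discrete time by an order-n backward finite difference of a cell's internal state, and that a gate receives that single order through an extra weight matrix. The names `derivative_of_states`, `dos_gate`, and `W_d` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def derivative_of_states(state_history, order):
    """Approximate the order-n DoS at the latest step by a backward finite difference.

    state_history has shape (T, d): one row per time step of a cell's internal state.
    Returns a zero vector while fewer than order + 1 states have been observed.
    """
    state_history = np.asarray(state_history, dtype=float)
    if state_history.shape[0] <= order:
        return np.zeros(state_history.shape[1])
    return np.diff(state_history, n=order, axis=0)[-1]


def dos_gate(x, h_prev, dos, W_x, W_h, W_d, b):
    """Hypothetical gate: a sigmoid over the input, previous output, and one DoS order."""
    return sigmoid(W_x @ x + W_h @ h_prev + W_d @ dos + b)


# Toy internal state s_t = t^2 (one dimension, five time steps).
states = np.array([[0.0], [1.0], [4.0], [9.0], [16.0]])

# In a stacked model, layer n would read only the order-n DoS of its own state.
# For brevity this demo reuses one state history for every order.
dos_per_layer = {n: derivative_of_states(states, n) for n in (1, 2, 3)}

# Arbitrary illustrative weights; a trained model would learn these.
rng = np.random.default_rng(0)
x, h_prev = rng.normal(size=3), rng.normal(size=1)
W_x, W_h = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))
W_d, b = np.ones((1, 1)), np.zeros(1)
gates = {n: dos_gate(x, h_prev, d, W_x, W_h, W_d, b) for n, d in dos_per_layer.items()}
```

For the quadratic toy state, the first-, second-, and third-order differences at the last step are 7, 2, and 0, so each order reports a different aspect of how the state is changing. Feeding each order to its own layer, rather than summing them into one signal, keeps those aspects separate, which is the motivation the abstract gives for stacking.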
Index Terms
- Rethinking the Combined and Individual Orders of Derivative of States for Differential Recurrent Neural Networks: Deep Differential Recurrent Neural Networks