Abstract
Most recent work leverages the two-stream framework to model spatiotemporal information for video action recognition and achieves remarkable performance. In this paper, we propose a novel convolutional architecture, called the Residual Gating Fusion Network (RGFN), that improves on these methods by fully exploiting the spatiotemporal information in residual signals. To further exploit the local details of low-level layers, we introduce Multi-Scale Convolution Fusion (MSCF), which performs spatiotemporal fusion at multiple levels. Since RGFN is an end-to-end network, it can be trained on a wide range of video datasets and applied to other video analysis tasks. We evaluate RGFN on two standard benchmarks, UCF101 and HMDB51, and analyze the design of the network. Experimental results demonstrate the advantages of RGFN, which achieves state-of-the-art performance.
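The abstract names gated residual fusion of the two streams as the core mechanism but does not give the exact formulation. The following is only a minimal NumPy sketch of the general idea of gated residual fusion, under the assumption that a learned gate modulates, per element, how much of the temporal (motion) stream's signal is added residually to the spatial stream; the 1×1 channel-mixing matrix `w_gate` is a hypothetical stand-in for the learned gating parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_fusion(spatial, temporal, w_gate):
    """Fuse a temporal feature map into a spatial stream as a gated
    residual signal: the gate decides, per element, how much of the
    temporal information is added to the spatial features.

    spatial, temporal : feature maps of shape (C, H, W)
    w_gate            : 1x1-conv-style channel-mixing matrix, shape (C, C)
    """
    c, h, w = temporal.shape
    # 1x1 convolution over channels produces the gate logits
    logits = (w_gate @ temporal.reshape(c, -1)).reshape(c, h, w)
    gate = sigmoid(logits)            # values in (0, 1)
    return spatial + gate * temporal  # gated residual fusion

rng = np.random.default_rng(0)
spatial = rng.standard_normal((8, 4, 4))
temporal = rng.standard_normal((8, 4, 4))
w_gate = rng.standard_normal((8, 8)) * 0.1
fused = gated_residual_fusion(spatial, temporal, w_gate)
print(fused.shape)  # (8, 4, 4)
```

Because the gate lies in (0, 1), the fused output never deviates from the spatial stream by more than the magnitude of the temporal signal, so the spatial pathway remains an identity-like shortcut in the spirit of residual learning.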
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grant 61673402, Grant 61273270, and Grant 60802069, in part by the Natural Science Foundation of Guangdong under Grant 2017A030311029, Grant 2016B010109002, Grant 2015B090912001, Grant 2016B010123005, and Grant 2017B090909005, in part by the Science and Technology Program of Guangzhou under Grant 201704020180 and Grant 201604020024, and in part by the Fundamental Research Funds for the Central Universities of China.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Zhang, J., Hu, H. (2018). Residual Gating Fusion Network for Human Action Recognition. In: Zhou, J., et al. Biometric Recognition. CCBR 2018. Lecture Notes in Computer Science, vol. 10996. Springer, Cham. https://doi.org/10.1007/978-3-319-97909-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-97908-3
Online ISBN: 978-3-319-97909-0