Abstract
Video action recognition has been a challenging task over the years. The challenge arises not only from the growing volume of information in videos but also from the need for an efficient method that retains information over the longer time span a human action takes to perform. This paper proposes a novel framework, named long-term video action recognition (LVAR), to perform generic action classification in continuous video. The idea of LVAR is to introduce a partial recurrence connection that propagates information within every layer of a spatial-temporal network, such as the well-known C3D. Empirically, we show that this addition allows the C3D network to access long-term information and subsequently improves action recognition performance on videos of different lengths selected from both the UCF101 and miniKinetics datasets. Our approach is further confirmed by experiments on untrimmed videos from the THUMOS14 dataset.
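To make the abstract's central idea concrete, the following is a minimal PyTorch sketch of a C3D-style layer with a partial recurrence connection: a hidden state carried across consecutive clips is fused back into a subset of the layer's channels. This excerpt does not spell out the paper's exact formulation, so the class name PartiallyRecurrentConv3d, the rec_ch channel split, and the 1x1x1 fusion convolution are illustrative assumptions, not the authors' definitive design.

```python
import torch
import torch.nn as nn

class PartiallyRecurrentConv3d(nn.Module):
    """A C3D-style 3D convolution whose first `rec_ch` output channels are
    mixed with a hidden state carried over from the previous clip, so the
    layer can retain information beyond a single clip (hypothetical sketch)."""

    def __init__(self, in_ch: int, out_ch: int, rec_ch: int):
        super().__init__()
        assert rec_ch <= out_ch
        self.rec_ch = rec_ch                      # only these channels recur ("partial")
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.mix = nn.Conv3d(2 * rec_ch, rec_ch, kernel_size=1)  # fuse state + features

    def forward(self, x, state=None):
        y = torch.relu(self.conv(x))              # (B, out_ch, T, H, W)
        rec, rest = y[:, :self.rec_ch], y[:, self.rec_ch:]
        if state is not None:                     # inject context from the previous clip
            rec = torch.relu(self.mix(torch.cat([state, rec], dim=1)))
        # Keep the last time step as context for the next clip (see the Notes below).
        new_state = rec[:, :, -1:].expand_as(rec).detach()
        return torch.cat([rec, rest], dim=1), new_state

# Usage: process a long video as a stream of clips, threading the state through.
layer = PartiallyRecurrentConv3d(in_ch=3, out_ch=64, rec_ch=16)
state = None
for clip in torch.randn(2, 3, 3, 8, 56, 56).unbind(dim=1):  # 2 videos x 3 clips
    out, state = layer(clip, state)                          # out: (2, 64, 8, 56, 56)
```

Threading such a state through every layer of the network is what, per the abstract, lets a clip-based model like C3D access information well beyond a single clip.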
Notes
Segmenting out the last section along the temporal axis ensures that the contextual information transferred is relevant to the current event.
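As a concrete reading of this note, the context handed to the next clip can simply be the last segment of the feature tensor along the temporal axis, rather than features pooled over the whole clip. The segment length k below is an assumed hyperparameter, not a value given in the paper.

```python
import torch

k = 4                                     # assumed length of the "last section"
features = torch.randn(2, 64, 16, 7, 7)  # (B, C, T, H, W) activations of one clip
context = features[:, :, -k:]             # keep only the last k frames: (2, 64, 4, 7, 7)
```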
References
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR
Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: CVPR, pp 2625–2634
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: CVPR, pp 1933–1941
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV, pp 1026–1034
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Jiang X, Sun J, Li C, Ding H (2018) Video image defogging recognition based on recurrent neural network. TII 14(7):3281–3288
Jiang Y-G, Liu J, Zamir AR, Toderici G, Laptev I, Shah M, Sukthankar R (2014) THUMOS challenge: action recognition with a large number of classes. Retrieved from https://www.crcv.ucf.edu/THUMOS14/results.html
Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. In: NIPS, pp 1097–1105
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. JMLR 9(Nov):2579–2605
Muhammad K, Hamza R, Ahmad J, Lloret J, Wang H, Baik SW (2018) Secure surveillance framework for IoT systems using probabilistic image encryption. TII 14(8):3679–3689
Ng JYH, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: CVPR, pp 4694–4702
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) ImageNet large scale visual recognition challenge. IJCV 115(3):211–252
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: NIPS, pp 568–576
Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. Technical report CRCV-TR-12-01
Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI, vol 4, p 12
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: ICCV, pp 4489–4497
Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: CVPR, pp 6450–6459
Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. TPAMI 40(6):1510–1517
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: ICCV, pp 3551–3558
Wang L, Xiong Y, Wang Z, Qiao Y (2015) Towards good practices for very deep two-stream convnets. arXiv:1507.02159
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: ECCV. Springer, pp 20–36
Wang P, Cao Y, Shen C, Liu L, Shen HT (2017) Temporal pyramid pooling-based convolutional neural network for action recognition. TCSVT 27(12):2613–2622
Wu CY, Feichtenhofer C, Fan H, He K, Krähenbühl P, Girshick R (2018) Long-term feature banks for detailed video understanding. arXiv:1812.05038
Zeng Z, Li Z, Cheng D, Zhang H, Zhan K, Yang Y (2018) Two-stream multirate recurrent neural network for video-based pedestrian reidentification. TII 14(7):3179–3186
Acknowledgements
This research is supported by the Fundamental Research Grant Scheme (FRGS) MoHE Grant FP021-2018A, from the Ministry of Education Malaysia, and the Postgraduate Research Grant (PPP) Grant PG006-2016A, from the University of Malaya, Malaysia. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Cite this article
Chang, Y.L., Chan, C.S. & Remagnino, P. Action recognition on continuous video. Neural Comput & Applic 33, 1233–1243 (2021). https://doi.org/10.1007/s00521-020-04982-9