Action recognition on continuous video

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Video action recognition has been a challenging task over the years. The challenge lies not only in the ever-increasing amount of information in videos, but also in the need for an efficient method that retains information over the longer term a human action takes to perform. This paper proposes a novel framework, named long-term video action recognition (LVAR), to perform generic action classification in continuous video. The idea of LVAR is to introduce a partial recurrence connection that propagates information within every layer of a spatial-temporal network, such as the well-known C3D. Empirically, we show that this addition allows the C3D network to access long-term information and subsequently improves action recognition performance on videos of different lengths selected from both the UCF101 and miniKinetics datasets. Our approach is further validated with experiments on untrimmed videos from the THUMOS14 dataset.
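The abstract describes the core mechanism only at a high level: a partial recurrence connection added to every layer of a spatial-temporal network such as C3D. The PyTorch sketch below illustrates one plausible reading of that idea, in which each 3D-convolutional layer carries the last temporal slice of its activations forward as context for the next clip. The class name, the 1x1x1 projection, and the tanh mixing are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class RecurrentConv3d(nn.Module):
        """A 3D convolution with a partial recurrence connection.

        Hypothetical sketch of the idea in the abstract, not the
        authors' code: the last temporal slice of the previous clip's
        activations is carried over and mixed into the current clip's
        output, so information can persist beyond the fixed clip
        length of a C3D-style network.
        """

        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, padding=padding)
            # 1x1x1 projection of the carried-over context (assumed form).
            self.recur = nn.Conv3d(out_ch, out_ch, kernel_size=1)
            self.state = None  # context from the previous clip

        def forward(self, x):
            # x: (batch, channels, time, height, width)
            y = self.conv(x)
            if self.state is not None:
                # Broadcast the stored last-slice context over the time axis.
                y = y + torch.tanh(self.recur(self.state))
            # Keep only the final temporal slice as the next clip's context
            # (cf. the paper's note on segmenting the last section of the
            # temporal axis); detach() truncates backpropagation across clips.
            self.state = y[:, :, -1:].detach()
            return y

        def reset(self):
            # Clear the carried context at video boundaries.
            self.state = None

Replacing the plain 3D convolutions of a C3D-style network with such layers would, under this reading, let every layer propagate information across clip boundaries, which is the behavior LVAR attributes to its partial recurrence connection.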

Notes

  1. Segmenting the last section from the temporal axis can ensure that the contextual information transferred is relevant to the current event (see the sketch below).
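As an illustration of this note, an untrimmed video can be processed as a stream of fixed-length clips, with the carried last-slice state linking consecutive clips. This hypothetical usage example reuses the RecurrentConv3d sketch given after the abstract:

    import torch

    # Hypothetical usage of the RecurrentConv3d sketch above.
    layer = RecurrentConv3d(in_ch=3, out_ch=64)

    # Simulate an untrimmed video as eight consecutive 16-frame clips.
    video_clips = [torch.randn(1, 3, 16, 112, 112) for _ in range(8)]
    for clip in video_clips:
        features = layer(clip)  # mixes in context from the previous clip
    layer.reset()               # clear the carried context at the video boundary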

References

  1. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR

  2. Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: CVPR, pp 2625–2634

  3. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: CVPR, pp 1933–1941

  4. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV, pp 1026–1034

  5. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778

  6. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  7. Jiang X, Sun J, Li C, Ding H (2018) Video image defogging recognition based on recurrent neural network. TII 14(7):3281–3288

  8. Jiang Y-G, Liu J, Zamir AR, Toderici G, Laptev I, Shah M, Sukthankar R (2014) THUMOS challenge: action recognition with a large number of classes. Retrieved from https://www.crcv.ucf.edu/THUMOS14/results.html

  9. Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. In: NIPS, pp 1097–1105

  10. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. JMLR 9(Nov):2579–2605

  11. Muhammad K, Hamza R, Ahmad J, Lloret J, Wang H, Baik SW (2018) Secure surveillance framework for IoT systems using probabilistic image encryption. TII 14(8):3679–3689

  12. Ng JYH, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: CVPR, pp 4694–4702

  13. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) ImageNet large scale visual recognition challenge. IJCV 115(3):211–252

  14. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: NIPS, pp 568–576

  15. Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. In: CRCV-TR-12-01

  16. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI, vol 4, p 12

  17. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: ICCV, pp 4489–4497

  18. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: CVPR, pp 6450–6459

  19. Varol G, Laptev I, Schmid C (2018) Long-term temporal convolutions for action recognition. TPAMI 40(6):1510–1517

  20. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: ICCV, pp 3551–3558

  21. Wang L, Xiong Y, Wang Z, Qiao Y (2015) Towards good practices for very deep two-stream convnets. arXiv:1507.02159

  22. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: Towards good practices for deep action recognition. In: ECCV. Springer, pp 20–36

  23. Wang P, Cao Y, Shen C, Liu L, Shen HT (2017) Temporal pyramid pooling-based convolutional neural network for action recognition. TCSVT 27(12):2613–2622

  24. Wu CY, Feichtenhofer C, Fan H, He K, Krähenbühl P, Girshick R (2018) Long-term feature banks for detailed video understanding. arXiv:1812.05038

  25. Zeng Z, Li Z, Cheng D, Zhang H, Zhan K, Yang Y (2018) Two-stream multirate recurrent neural network for video-based pedestrian reidentification. TII 14(7):3179–3186

Acknowledgements

This research is supported by the Fundamental Research Grant Scheme (FRGS) MoHE Grant FP021-2018A, from the Ministry of Education Malaysia, and Postgraduate Research Grant (PPP) Grant PG006-2016A, from University of Malaya, Malaysia. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Author information

Corresponding author

Correspondence to C. S. Chan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chang, Y.L., Chan, C.S. & Remagnino, P. Action recognition on continuous video. Neural Comput & Applic 33, 1233–1243 (2021). https://doi.org/10.1007/s00521-020-04982-9
