Early-stopped learning for action prediction in videos

  • Regular Paper
  • International Journal of Multimedia Information Retrieval

Abstract

Action prediction, also called early action recognition, is the task of recognizing an action in a video from a partial observation. Various methods, including deep learning approaches, have been developed to tackle offline and early action recognition. In one family of deep learning methods, video frames or optical-flow images are processed sequentially by the network. In this paper, we present a learning framework that can be applied to such methods to make them better suited to early recognition. We propose encouraging the learner to learn from the earlier parts of the video and to stop learning at some point onward. By focusing on the earlier parts, the model can be expected to take full advantage of the information they contain. To this end, a stopping point must be found up to which enough information has been observed. We measure the amount of observed information using the loss function. We applied our framework to Temporal Segment Networks and experimented on the UCF11 and HMDB51 datasets. The results show that our method improves on Temporal Segment Networks and outperforms other baseline methods.
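
To make the idea concrete, below is a minimal PyTorch sketch of what such early-stopped learning could look like. Everything here is an illustrative assumption rather than the authors' published implementation: `SegmentModel` is a toy stand-in for a TSN-style backbone, and `find_stop_segment` uses a simple loss-plateau heuristic as a proxy for the loss-based information measure the abstract describes.

```python
# Hypothetical sketch of "early-stopped learning" over temporal segments.
# NOT the paper's implementation: the plateau test in find_stop_segment is
# an assumed stand-in for the loss-based measure described in the abstract.
import torch
import torch.nn as nn

torch.manual_seed(0)
NUM_SEGMENTS, FEAT_DIM, NUM_CLASSES = 8, 128, 51  # HMDB51 has 51 classes

class SegmentModel(nn.Module):
    """Toy per-segment classifier standing in for a TSN-style backbone."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(FEAT_DIM, NUM_CLASSES)

    def forward(self, x):  # x: (batch, FEAT_DIM) segment features
        return self.fc(x)

def per_segment_losses(model, feats, labels, criterion):
    """Mean loss of each temporal segment over a batch; used here as the
    'amount of information' measure that the abstract alludes to."""
    with torch.no_grad():
        return [criterion(model(feats[:, k]), labels).item()
                for k in range(feats.shape[1])]

def find_stop_segment(losses, tol=0.05):
    """First segment after which the loss stops improving by more than
    `tol` -- an assumed proxy for 'enough information observed'."""
    for k in range(1, len(losses)):
        if losses[k - 1] - losses[k] < tol:
            return k
    return len(losses)

# Synthetic stand-in data: per-video segment features and action labels.
feats = torch.randn(16, NUM_SEGMENTS, FEAT_DIM)   # (batch, segments, dim)
labels = torch.randint(0, NUM_CLASSES, (16,))

model = SegmentModel()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

stop = find_stop_segment(per_segment_losses(model, feats, labels, criterion))
print(f"training restricted to the first {stop} segment(s)")

# Learn only from the early segments, up to the chosen stopping point.
for _ in range(5):
    optimizer.zero_grad()
    loss = sum(criterion(model(feats[:, k]), labels) for k in range(stop))
    loss.backward()
    optimizer.step()
```

In this sketch, per-segment losses serve as the information measure: once adding a later segment no longer reduces the loss appreciably, training is restricted to the segments before that point, so gradient updates come only from the early parts of each video.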

References

  1. Cao Y, Barrett D, Barbu A, Narayanaswamy S, Yu H, Michaux A, Lin Y, Dickinson S, Siskind JM, Wang S (2013) Recognize human activities from partially observed videos. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 2658–2665. https://doi.org/10.1109/CVPR.2013.343

  2. Chakraborty B, Holte MB, Moeslund TB, Gonzàlez J (2012) Selective spatio-temporal interest points. Comput Vis Image Underst 116(3):396–410. https://doi.org/10.1016/j.cviu.2011.09.010

  3. Cui R, Hua G, Wu J (2020) AP-GAN: predicting skeletal activity to improve early activity recognition. J Vis Commun Image Represent 73:102923. https://doi.org/10.1016/j.jvcir.2020.102923

  4. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255

  5. Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: Proceedings - 2nd Joint IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, VS-PETS, vol 2005, pp 65–72. https://doi.org/10.1109/VSPETS.2005.1570899

  6. Furnari A, Farinella G (2020) Rolling-unrolling LSTMs for action anticipation from first-person video. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/tpami.2020.2992889

  7. Harris C, Stephens M (1988) A combined corner and edge detector. In: Proceedings of the Alvey vision conference, pp 147–151

  8. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  9. Hu JF, Zheng WS, Ma L, Wang G, Lai JH, Zhang J (2018) Early action prediction by soft regression. IEEE Trans Pattern Anal Mach Intell 41(11):2568–2583. https://doi.org/10.1109/TPAMI.2018.2863279

  10. Kantorov V, Laptev I (2014) Efficient feature extraction, encoding, and classification for action recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 2593–2600. https://doi.org/10.1109/CVPR.2014.332

  11. Kong Y, Fu Y (2016) Max-margin action prediction machine. IEEE Trans Pattern Anal Mach Intell 38(9):1844–1858. https://doi.org/10.1109/TPAMI.2015.2491928

  12. Kong Y, Kit D, Fu Y (2014) A discriminative model with multiple temporal scales for action prediction. In: Fleet D et al (eds) ECCV 2014, Part V, LNCS 8693. Springer, pp 596–611. https://doi.org/10.1007/978-3-319-10602-1_39

  13. Kong Y, Tao Z, Fu Y (2017) Deep sequential context networks for action prediction. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 3662–3670. https://doi.org/10.1109/CVPR.2017.390

  14. Kong Y, Tao Z, Fu Y (2018) Adversarial action prediction networks. IEEE Trans Pattern Anal Mach Intell 42(3):539–553

  15. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: Proceedings of the IEEE international conference on computer vision, pp 2556–2563. https://doi.org/10.1109/ICCV.2011.6126543

  16. Lai S, Zheng WS, Hu JF, Zhang J (2017) Global-local temporal saliency action prediction. IEEE Trans Image Process 27(5):2272–2285. https://doi.org/10.1109/TIP.2017.2751145

  17. Laptev I, Lindeberg T (2003) Space–time interest points. In: Proceedings ninth IEEE international conference on computer vision, pp 432–439. https://doi.org/10.1109/ICCV.2003.1238378

  18. Liu J, Luo J, Shah M (2009) Recognizing realistic actions from videos "in the wild". In: 2009 IEEE computer society conference on computer vision and pattern recognition workshops, CVPR workshops 2009, pp 1996–2003. https://doi.org/10.1109/CVPRW.2009.5206744

  19. Liu J, Shahroudy A, Wang G, Duan LY, Kot AC (2018) SSNet: scale selection network for online 3D action prediction. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8349–8358

  20. Ma S, Sigal L, Sclaroff S (2016) Learning activity progression in LSTMs for activity detection and early detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 1942–1950. https://doi.org/10.1109/CVPR.2016.214

  21. Peng X, Schmid C (2016) Multi-region two-stream R-CNN for action detection. In: Lecture notes in computer science, pp 744–759. https://doi.org/10.1007/978-3-319-46493-0_45

  22. Qiao R, Liu L, Shen C, van den Hengel A (2017) Learning discriminative trajectorylet detector sets for accurate skeleton-based action recognition. Pattern Recogn 66:202–212. https://doi.org/10.1016/j.patcog.2017.01.015

  23. Ramezani M, Yaghmaee F (2016) A review on human action analysis in videos for retrieval applications. Artif Intell Rev 46(4):485–514. https://doi.org/10.1007/s10462-016-9473-y

  24. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 2164–2173

  25. Ryoo MS (2011) Human activity prediction: early recognition of ongoing activities from streaming videos. In: Proceedings of the IEEE international conference on computer vision, pp 1036–1043. https://doi.org/10.1109/ICCV.2011.6126349

  26. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576

  27. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: 3rd international conference on learning representations, ICLR 2015 - Conference Track Proceedings

  28. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 6450–6459. https://doi.org/10.1109/CVPR.2018.00675

  29. Wang H, Kläser A, Schmid C, Liu CL (2011) Action recognition by dense trajectories. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 3169–3176. https://doi.org/10.1109/CVPR.2011.5995407

  30. Wang H, Kläser A, Schmid C, Liu CL (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103(1):60–79. https://doi.org/10.1007/s11263-012-0594-8

  31. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558. https://doi.org/10.1109/ICCV.2013.441

  32. Wang H, Yuan C, Shen J, Yang W, Ling H (2018) Action unit detection and key frame selection for human activity prediction. Neurocomputing 318:109–119. https://doi.org/10.1016/j.neucom.2018.08.037

  33. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: Lecture notes in computer science, vol 9912. Springer, pp 20–36. https://doi.org/10.1007/978-3-319-46484-8_2

  34. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2018) Temporal segment networks for action recognition in videos. IEEE Trans Pattern Anal Mach Intell 41(11):2740–2755

  35. Wang Y, Song J, Wang L, Van Gool L, Hilliges O (2016) Two-stream SR-CNNs for action recognition in videos. In: Proceedings of the British machine vision conference (BMVC), pp 108.1–108.12. https://doi.org/10.5244/c.30.108

  36. Weng J, Jiang X, Zheng WL, Yuan J (2020) Early action recognition with category exclusion using policy-based reinforcement learning. IEEE Trans Circuits Syst Video Technol. https://doi.org/10.1109/tcsvt.2020.2976789

  37. Zanfir M, Leordeanu M, Sminchisescu C (2013) The moving pose: an efficient 3D kinematics descriptor for low-latency action recognition and detection. In: Proceedings of the IEEE international conference on computer vision, pp 2752–2759. https://doi.org/10.1109/ICCV.2013.342

  38. Zhang HB, Zhang YX, Zhong B, Lei Q, Yang L, Du JX, Chen DS (2019) A comprehensive survey of vision-based human action recognition methods. Sensors 19(5):1005. https://doi.org/10.3390/s19051005

Acknowledgements

The authors would like to thank Dr. Mohsen Ramezani for reviewing the manuscript, and for his valuable comments.

Author information

Corresponding author

Correspondence to Farzin Yaghmaee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Saremi, M., Yaghmaee, F. Early-stopped learning for action prediction in videos. Int J Multimed Info Retr 10, 219–226 (2021). https://doi.org/10.1007/s13735-021-00216-3
