Abstract
Action prediction, also called early action recognition, is about recognizing an action in a video with partial observation. Various methods have been developed to tackle either offline or early action recognition, including deep learning approaches. In a family of deep learning methods, video frames or optical flow images are processed sequentially by the network. In this paper, we present a learning framework that can be applied to such methods to make them more appropriate for early recognition. We propose encouraging the learner to learn from earlier parts of the video and stop learning from some point on. By focusing on the earlier parts, we can expect the model to take full advantage of the information lying in these early parts. To this end, it is necessary to find a stopping point up to which enough information has been observed. We measure the amount of information with the help of the loss function. We applied our framework to Temporal Segment Networks and experimented on UCF11 and HMDB51 datasets. The results show that our method improves on Temporal Segment Networks and outperforms other baseline methods.
Similar content being viewed by others
References
Cao Y, Barrett D, Barbu A, Narayanaswamy S, Yu H, Michaux A, Lin Y, Dickinson S, Siskind JM, Wang S (2013) Recognize human activities from partially observed videos. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 2658–2665. https://doi.org/10.1109/CVPR.2013.343
Chakraborty B, Holte MB, Moeslund TB, Gonzàlez J (2012) Selective spatio-temporal interest points. Comput Vis Image Underst 116(3):396–410. https://doi.org/10.1016/j.cviu.2011.09.010
Cui R, Hua G, Wu J (2020) AP-GAN: predicting skeletal activity to improve early activity recognition. J Vis Commun Image Represent 73:102923. https://doi.org/10.1016/j.jvcir.2020.102923
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255
Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: Proceedings - 2nd Joint IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, VS-PETS, vol 2005, pp 65–72. https://doi.org/10.1109/VSPETS.2005.1570899
Furnari A, Farinella G (2020) Rolling-unrolling LSTMs for action anticipation from first-person video. IEEE Transactions on Pattern Analysis and Machine Intelligence, p 1. https://doi.org/10.1109/tpami.2020.2992889
Harris CG, Stephens (1988) A combined corner and edge detector. In: Alvey vision conference, vol 15, pp 189–192
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778. www.image-net.org
Hu JF, Zheng WS, Ma L, Wang G, Lai JH, Zhang J (2018) Early action prediction by soft regression. IEEE Trans Pattern Anal Mach Intell 41(11):2568–2583. https://doi.org/10.1109/TPAMI.2018.2863279
Kantorov V, Laptev I (2014) Efficient feature extraction, encoding, and classification for action recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 2593–2600. https://doi.org/10.1109/CVPR.2014.332
Kong Y, Fu Y (2016) Max-margin action prediction machine. IEEE Trans Pattern Anal Mach Intell 38(9):1844–1858. https://doi.org/10.1109/TPAMI.2015.2491928
Kong Y, Kit D, Fu Y (2014) A discriminative model with multiple temporal scales for action prediction. In: Fleet D et al (eds) ECCV 2014, Part V, LNCS 8693, Springer. pp. 596–611. https://doi.org/10.1007/978-3-319-10602-1_39
Kong Y, Tao Z, Fu Y (2017) Deep sequential context networks for action prediction. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 3662–3670. https://doi.org/10.1109/CVPR.2017.390. http://ieeexplore.ieee.org/document/8099873/
Kong Y, Tao Z, Fu Y (2018) Adversarial action prediction networks. IEEE Trans Pattern Anal Mach Intell 42(3):539–553
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: Proceedings of the IEEE international conference on computer vision, pp 2556–2563. https://doi.org/10.1109/ICCV.2011.6126543
Lai S, Zheng WS, Hu JF, Zhang J (2017) Global-local temporal saliency action prediction. IEEE Trans Image Process 27(5):2272–2285. https://doi.org/10.1109/TIP.2017.2751145
Laptev Li (2003) Space–time interest points. In: Proceedings ninth IEEE international conference on computer vision, pp 432–439. https://doi.org/10.1109/ICCV.2003.1238378
Liu J, Luo J, Shah M (2009) Recognizing realistic actions from videos in the Wild. In: 2009 IEEE computer society conference on computer vision and pattern recognition workshops, CVPR workshops 2009, pp 1996–2003. https://doi.org/10.1109/CVPRW.2009.5206744
Liu J, Shahroudy A, Wang G, Duan LY, Kot AC (2018) Ssnet: scale selection network for online 3d action prediction. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8349–8358
Ma S, Sigal L, Sclaroff S (2016) Learning activity progression in LSTMs for activity detection and early detection. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 1942–1950. https://doi.org/10.1109/CVPR.2016.214. http://ieeexplore.ieee.org/document/7780583/
Peng X, Schmid C (2016) Multi-region two-stream R-CNN for action detection. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), pp 744–759. https://doi.org/10.1007/978-3-319-46493-0_45
Qiao R, Liu L, Shen C, van den Hengel A (2017) Learning discriminative trajectorylet detector sets for accurate skeleton-based action recognition. Pattern Recogn 66:202–212. https://doi.org/10.1016/j.patcog.2017.01.015
Ramezani M, Yaghmaee F (2016) A review on human action analysis in videos for retrieval applications. Artif Intell Rev 46(4):485–514. https://doi.org/10.1007/s10462-016-9473-y
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 2164–2173
Ryoo MS (2011) Human activity prediction: early recognition of ongoing activities from streaming videos. In: Proceedings of the IEEE international conference on computer vision, pp 1036–1043. https://doi.org/10.1109/ICCV.2011.6126349
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, vol 1. Neural information processing systems foundation, pp 568–576
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: 3rd international conference on learning representations, ICLR 2015 - Conference Track Proceedings
Tran D, Wang H, Torresani L, Ray J, Lecun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 6450–6459. https://doi.org/10.1109/CVPR.2018.00675. http://openaccess.thecvf.com/content_cvpr_2018/html/Tran_A_Closer_Look_CVPR_2018_paper.html
Wang H, Kläser A, Schmid C, Liu CL (2011) Action recognition by dense trajectories. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 3169–3176. https://doi.org/10.1109/CVPR.2011.5995407
Wang H, Kläser A, Schmid C, Liu CL (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103(1):60–79. https://doi.org/10.1007/s11263-012-0594-8
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558. https://doi.org/10.1109/ICCV.2013.441
Wang H, Yuan C, Shen J, Yang W, Ling H (2018) Action unit detection and key frame selection for human activity prediction. Neurocomputing 318:109–119. https://doi.org/10.1016/j.neucom.2018.08.037
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics), vol 9912 LNCS, pp 20–36. https://doi.org/10.1007/978-3-319-46484-8_2
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2018) Temporal segment networks for action recognition in videos. IEEE Trans Pattern Anal Mach Intell 41(11):2740–2755
Wang Y, Song J, Wang L, Gool L, Hilliges O (2016) Two-stream SR-CNNs for action recognition in videos. In: Proceedings of the British machine vision conference (BMVC), pp 108.1–108.12. https://doi.org/10.5244/c.30.108
Weng J, Jiang X, Zheng WL, Yuan J (2020) Early action recognition with category exclusion using policy-based reinforcement learning. IEEE Trans Circuits Syst Video Technol, p 1. https://doi.org/10.1109/tcsvt.2020.2976789
Zanfir M, Leordeanu M, Sminchisescu C (2013) The moving pose: an efficient 3D kinematics descriptor for low-latency action recognition and detection. In: Proceedings of the IEEE international conference on computer vision, pp 2752–2759. https://doi.org/10.1109/ICCV.2013.342
Zhang HB, Zhang YX, Zhong B, Lei Q, Yang L, Du JX, Chen DS (2019) A comprehensive survey of vision-based human action recognition methods. Sensors 19(5):1005. https://doi.org/10.3390/s19051005
Acknowledgements
The authors would like to thank Dr. Mohsen Ramezani for reviewing the manuscript, and for his valuable comments.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Saremi, M., Yaghmaee, F. Early-stopped learning for action prediction in videos. Int J Multimed Info Retr 10, 219–226 (2021). https://doi.org/10.1007/s13735-021-00216-3
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13735-021-00216-3