Abstract
Early action recognition seeks to recognize a human action in a video when only part of the video has been observed. In this paper, we introduce an approach to this recognition task. Several offline (non-early) recognition works sample the frames of a video uniformly and use them to train the model. However, there is no reason that uniform sampling should be optimal for early recognition, so we propose a non-uniform sampling scheme tailored to it. The proposed method samples frames in such a way that earlier frames are more likely to be chosen; these frames are then used to train a deep network architecture. We compare our sampling approach with uniform sampling on the HMDB51 benchmark and further compare our method with other state-of-the-art early recognition works. The experimental results suggest that our sampling process yields better recognition accuracy than uniform sampling at the early stages of a video and that the proposed algorithm outperforms the state of the art.
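To make the idea of a sampling distribution biased toward earlier frames concrete, the following minimal Python sketch draws frame indices with exponentially decaying weights. The exponential weighting, the decay parameter, and the helper name sample_frames are illustrative assumptions for this sketch only, not the actual distribution used in the paper.

import numpy as np

def sample_frames(num_frames, num_samples, decay=0.05, seed=None):
    # Assign each frame index a weight that decays with its position,
    # so earlier frames are more likely to be chosen (illustrative choice).
    rng = np.random.default_rng(seed)
    idx = np.arange(num_frames)
    weights = np.exp(-decay * idx)
    probs = weights / weights.sum()          # normalize to a probability distribution
    chosen = rng.choice(idx, size=num_samples, replace=False, p=probs)
    return np.sort(chosen)                   # restore temporal order for the network input

# Example: pick 8 training frames from a 120-frame clip
print(sample_frames(120, 8, decay=0.05, seed=0))

Sampling without replacement and re-sorting the chosen indices keeps the temporal order that a frame-based deep network would expect, while the decaying weights concentrate the selected frames near the start of the video.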


