Abstract
Most state-of-the-art spatio-temporal (S-T) action localization methods explicitly use optical flow as auxiliary motion information. Although combining optical flow with RGB significantly improves performance, optical flow estimation adds substantial computational cost and prevents the whole network from being trained end to end. These shortcomings hinder interactive fusion between motion and RGB information and greatly limit the real-world applications of such methods. In this paper, we explore better ways to use motion information in a unified, end-to-end trainable network architecture. First, we use knowledge distillation to enable the 3D-convolutional branch to learn motion information from RGB inputs. Second, we propose a novel short-range-motion (SRM) module that enhances the 2D-convolutional branch so that it learns both RGB information and dynamic motion information. With this strategy, flow computation at test time is avoided. Finally, we apply our methods to learn powerful RGB-motion representations for action classification and localization. Experimental results show that our method significantly outperforms state-of-the-art methods on the J-HMDB-21 and UCF101-24 benchmarks, with improvements of \(\sim\)8% and \(\sim\)3%, respectively.
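As a rough illustration of the distillation idea mentioned in the abstract, the following Python sketch combines a classification loss with a feature-matching term that pulls the features of an RGB-only student toward those of a flow-based teacher. The abstract does not specify the loss, so the mean-squared-error matching term, the weight `alpha`, and all function names here are assumptions made for illustration and should not be read as the authors' implementation.

```python
import math


def mse(student_feat, teacher_feat):
    """Mean squared error between two equal-length feature vectors."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    diffs = ((s - t) ** 2 for s, t in zip(student_feat, teacher_feat))
    return sum(diffs) / len(student_feat)


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def distillation_loss(student_feat, teacher_feat, logits, label, alpha=1.0):
    """Classification loss plus a feature-matching (distillation) term.

    student_feat: features from the RGB-only student branch.
    teacher_feat: features from a frozen flow-based teacher (hypothetical).
    alpha: assumed weight of the matching term; not taken from the paper.
    """
    return cross_entropy(logits, label) + alpha * mse(student_feat, teacher_feat)
```

In such a setup the teacher is needed only during training; at test time the student runs on RGB frames alone, which is consistent with the abstract's statement that flow computation at test time is avoided.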
Acknowledgments
This work is supported by the National Key Research and Development Program of China (No. 2018YFB1600600).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, Y., Tu, Z., Lin, L., Xie, X., Qin, Q. (2021). Real-Time Spatio-Temporal Action Localization via Learning Motion Representation. In: Sato, I., Han, B. (eds) Computer Vision – ACCV 2020 Workshops. ACCV 2020. Lecture Notes in Computer Science, vol 12628. Springer, Cham. https://doi.org/10.1007/978-3-030-69756-3_13
DOI: https://doi.org/10.1007/978-3-030-69756-3_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-69755-6
Online ISBN: 978-3-030-69756-3
eBook Packages: Computer Science, Computer Science (R0)