Abstract
Video-based egocentric activity recognition involves both spatio-temporal reasoning and human-object interaction. With the great success of deep learning in image recognition, human activity recognition in videos has received increasing attention in multimedia understanding. Comprehensive visual understanding requires detecting individual visual features and modeling the interactions between them. Current popular human action recognition approaches rely on visual features extracted from 2D images, and therefore often suffer from unreliable salient-feature detection and inaccurate modeling of the interaction context between individual features. In this paper, we show that these problems can be addressed by combining image and skeleton data. First, we propose a pose-based two-stream network for action recognition that effectively fuses information from skeletons and images at multiple levels of the video processing pipeline. In our network, one stream models the temporal dynamics of the action-related objects in the video frames, while the other models the temporal dynamics of the 2D human pose sequences extracted from the raw video. Moreover, we demonstrate that a ConvNet trained on RGB data can achieve good performance despite limited training data. Our architecture is trained and evaluated on the standard video action benchmarks UCF101-24 and JHMDB, where it is competitive with the state of the art. In particular, we obtain the current best result on JHMDB, with an mAP of 90.6%.
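Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of the two-stream idea it outlines: one ConvNet branch over RGB clips modeling action-related objects, one recurrent branch over 2D pose sequences, and a fusion step before classification. The layer choices, feature dimensions, joint count, and single late-fusion point are illustrative assumptions for this sketch, not the authors' configuration; the paper itself fuses skeleton and image information at multiple levels of the pipeline.

import torch
import torch.nn as nn

class RGBStream(nn.Module):
    # Appearance stream: a small 3D ConvNet over a clip of RGB frames
    # (an assumed stand-in for the paper's image stream).
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global average over T, H, W
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, clip):                          # clip: (B, 3, T, H, W)
        return self.fc(self.conv(clip).flatten(1))    # (B, feat_dim)

class PoseStream(nn.Module):
    # Pose stream: an LSTM over per-frame 2D joint coordinates
    # (one assumed way to model the temporal dynamics of pose sequences).
    def __init__(self, num_joints=15, feat_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 2,
                            hidden_size=feat_dim, batch_first=True)

    def forward(self, poses):                # poses: (B, T, num_joints * 2)
        _, (h, _) = self.lstm(poses)
        return h[-1]                         # last hidden state: (B, feat_dim)

class TwoStreamNet(nn.Module):
    # Late concatenation fusion; this single fusion point is an assumption,
    # whereas the paper fuses at multiple levels of the pipeline.
    def __init__(self, num_classes=21, feat_dim=256):
        super().__init__()
        self.rgb = RGBStream(feat_dim)
        self.pose = PoseStream(feat_dim=feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, clip, poses):
        fused = torch.cat([self.rgb(clip), self.pose(poses)], dim=1)
        return self.classifier(fused)

# Example: a batch of two 16-frame 112x112 clips with 15 tracked joints
# per frame (JHMDB annotates 15 joints and has 21 action classes).
model = TwoStreamNet()
logits = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 16, 30))
print(logits.shape)  # torch.Size([2, 21])

Under this reading, late concatenation is the simplest fusion choice; a multi-level fusion, as the abstract describes, would additionally exchange features between the two branches at intermediate layers.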
Cite this paper
Li, Y., Shen, J., Xiong, X., He, W., Li, P., Yan, W.: A Multimode Two-Stream Network for Egocentric Action Recognition. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2021. LNCS, vol. 12891. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86362-3_29