Abstract
Video-based egocentric activity recognition involves both spatio-temporal reasoning and human-object interaction. With the great success of deep learning in image recognition, human activity recognition in videos has received increasing attention in multimedia understanding. Comprehensive visual understanding requires detecting individual visual features and modeling the interactions between them. Current popular human action recognition approaches rely on visual features extracted from 2D images, and therefore often suffer from unreliable salient-feature detection and inaccurate modeling of the interaction context between individual features. In this paper, we show that these problems can be addressed by combining image and skeleton data. First, we propose a pose-based two-stream network for action recognition that effectively fuses information from skeletons and images at multiple levels of the video processing pipeline. In our network, one stream models the temporal dynamics of the action-related objects in the video frames, while the other models the temporal dynamics of the 2D human pose sequences extracted from the raw video. Moreover, we demonstrate that a ConvNet trained on RGB data can achieve good performance despite limited training data. Our architecture is trained and evaluated on the standard video action benchmarks UCF101-24 and JHMDB, where it is competitive with the state of the art. In particular, we obtain the current best result on JHMDB, with an mAP of 90.6%.
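Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of the two-stream idea it outlines: one ConvNet branch over RGB clips modeling action-related objects, one recurrent branch over 2D pose sequences, and a fusion step before classification. The layer choices, feature dimensions, joint count, and single late-fusion point are illustrative assumptions for this sketch, not the authors' configuration; the paper itself fuses skeleton and image information at multiple levels of the pipeline.

import torch
import torch.nn as nn

class RGBStream(nn.Module):
    # Appearance stream: a small 3D ConvNet over a clip of RGB frames
    # (an assumed stand-in for the paper's image stream).
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global average over T, H, W
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, clip):                          # clip: (B, 3, T, H, W)
        return self.fc(self.conv(clip).flatten(1))    # (B, feat_dim)

class PoseStream(nn.Module):
    # Pose stream: an LSTM over per-frame 2D joint coordinates
    # (one assumed way to model the temporal dynamics of pose sequences).
    def __init__(self, num_joints=15, feat_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 2,
                            hidden_size=feat_dim, batch_first=True)

    def forward(self, poses):                # poses: (B, T, num_joints * 2)
        _, (h, _) = self.lstm(poses)
        return h[-1]                         # last hidden state: (B, feat_dim)

class TwoStreamNet(nn.Module):
    # Late concatenation fusion; this single fusion point is an assumption,
    # whereas the paper fuses at multiple levels of the pipeline.
    def __init__(self, num_classes=21, feat_dim=256):
        super().__init__()
        self.rgb = RGBStream(feat_dim)
        self.pose = PoseStream(feat_dim=feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, clip, poses):
        fused = torch.cat([self.rgb(clip), self.pose(poses)], dim=1)
        return self.classifier(fused)

# Example: a batch of two 16-frame 112x112 clips with 15 tracked joints
# per frame (JHMDB annotates 15 joints and has 21 action classes).
model = TwoStreamNet()
logits = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 16, 30))
print(logits.shape)  # torch.Size([2, 21])

Under this reading, late concatenation is the simplest fusion choice; a multi-level fusion, as the abstract describes, would additionally exchange features between the two branches at intermediate layers.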
Cite this paper
Li, Y., Shen, J., Xiong, X., He, W., Li, P., Yan, W.: A Multimode Two-Stream Network for Egocentric Action Recognition. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2021. LNCS, vol. 12891. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86362-3_29