A Multimode Two-Stream Network for Egocentric Action Recognition

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12891)

Abstract

Video-based egocentric activity recognition involves both spatio-temporal dynamics and human-object interaction. With the great success of deep learning in image recognition, human activity recognition in videos has received increasing attention in multimedia understanding. Comprehensive visual understanding requires detecting individual visual features and modeling the interactions between them. Current popular human action recognition approaches are based on visual features extracted from 2-D images and therefore often suffer from unreliable salient-feature detection and inaccurate modeling of the interaction context between individual features. In this paper, we show that these problems can be addressed by combining data from images and skeletons. First, we propose a pose-based two-stream network for action recognition that effectively fuses information from skeleton and image at multiple levels of the video processing pipeline. In our network, one stream models the temporal dynamics of action-related objects in the video frames, and the other models the temporal dynamics of 2D human pose sequences extracted from the raw video. Moreover, we demonstrate that a ConvNet trained on RGB data can achieve good performance despite limited training data. Our architecture is trained and evaluated on the standard video action benchmarks UCF101-24 and JHMDB, where it is competitive with the state of the art; in particular, it achieves the current best result on JHMDB, with an mAP of 90.6%.
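To make the two-stream design concrete, the sketch below shows a minimal PyTorch illustration of the idea summarized above: an appearance stream that pools per-frame CNN features over time, and a pose stream that runs an LSTM over 2D joint sequences, fused before classification. All module names, feature dimensions, the joint count, and the single late-fusion point are illustrative assumptions, not the authors' implementation; the paper fuses skeleton and image information at multiple levels of the pipeline, and its actual backbone and fusion details are not reproduced here.

import torch
import torch.nn as nn

class RGBStream(nn.Module):
    """Appearance stream: per-frame CNN features averaged over time (toy backbone)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1))     # (B*T, 64, 1, 1)
        x = self.proj(x.flatten(1)).view(b, t, -1)  # (B, T, feat_dim)
        return x.mean(dim=1)                        # temporal average pooling

class PoseStream(nn.Module):
    """Pose stream: LSTM over flattened 2D joint coordinates."""
    def __init__(self, num_joints=18, feat_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(num_joints * 2, feat_dim, batch_first=True)

    def forward(self, poses):                       # poses: (B, T, J, 2)
        b, t = poses.shape[:2]
        out, _ = self.lstm(poses.reshape(b, t, -1)) # (B, T, feat_dim)
        return out[:, -1]                           # last step summarizes the sequence

class TwoStreamNet(nn.Module):
    """Late fusion of the two streams; JHMDB has 21 action classes."""
    def __init__(self, num_classes=21, feat_dim=256):
        super().__init__()
        self.rgb = RGBStream(feat_dim)
        self.pose = PoseStream(feat_dim=feat_dim)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, frames, poses):
        fused = torch.cat([self.rgb(frames), self.pose(poses)], dim=-1)
        return self.classifier(fused)

# Shape check with dummy inputs: 2 clips, 8 frames, 18 joints.
model = TwoStreamNet()
logits = model(torch.randn(2, 8, 3, 112, 112), torch.randn(2, 8, 18, 2))
print(logits.shape)  # torch.Size([2, 21])

A deeper pretrained backbone and additional fusion connections between the streams would be needed to approach the reported results; this sketch only fixes the overall data flow.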



Author information

Correspondence to Ying Li.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Y., Shen, J., Xiong, X., He, W., Li, P., Yan, W. (2021). A Multimode Two-Stream Network for Egocentric Action Recognition. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2021. Lecture Notes in Computer Science, vol. 12891. Springer, Cham. https://doi.org/10.1007/978-3-030-86362-3_29

  • DOI: https://doi.org/10.1007/978-3-030-86362-3_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86361-6

  • Online ISBN: 978-3-030-86362-3
