Cross Fusion for Egocentric Interactive Action Recognition

  • Conference paper
  • In: MultiMedia Modeling (MMM 2020)
  • Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11961)

Abstract

Egocentric interactive videos exhibit heavy ego-motion, frequent viewpoint changes, and multiple types of activities, characteristics that prevent action recognition methods designed for third-person vision from achieving satisfactory results. In this paper, we introduce an effective two-branch architecture with a cross fusion method for action recognition in egocentric interactive vision. The two branches model the information from the observer and the inter-actor, respectively, and each branch is built on multimodal multi-stream C3D networks. Cross fusion establishes effective links between the two branches in order to reduce redundant information and fuse complementary features. In addition, we propose variable sampling to obtain discriminative snippets for training. Experimental results demonstrate that the proposed architecture outperforms several state-of-the-art methods on two benchmarks.
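The abstract does not spell out the cross-fusion operator, so the following is only a minimal PyTorch sketch of the two-branch idea: two C3D-style streams (observer and inter-actor) that exchange features after each stage through a 1x1x1 convolutional mixing step. The class and module names, the residual feedback of the fused features, and the channel sizes are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class C3DBlock(nn.Module):
    """One C3D-style stage: Conv3d -> BatchNorm3d -> ReLU -> MaxPool3d."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
        )

    def forward(self, x):
        return self.net(x)

class CrossFusionTwoBranch(nn.Module):
    """Hypothetical two-branch network: an observer branch and an
    inter-actor branch with a cross-fusion exchange after each stage."""
    def __init__(self, num_classes, channels=(3, 64, 128, 256)):
        super().__init__()
        n = len(channels) - 1
        self.obs = nn.ModuleList([C3DBlock(channels[i], channels[i + 1]) for i in range(n)])
        self.act = nn.ModuleList([C3DBlock(channels[i], channels[i + 1]) for i in range(n)])
        # Assumed fusion operator: a 1x1x1 conv that mixes both branches'
        # features, intended to drop redundancy and share complementary cues.
        self.fuse = nn.ModuleList(
            [nn.Conv3d(2 * channels[i + 1], channels[i + 1], kernel_size=1) for i in range(n)])
        self.classifier = nn.Linear(2 * channels[-1], num_classes)

    def forward(self, obs_clip, act_clip):
        x, y = obs_clip, act_clip
        for obs_stage, act_stage, fuse in zip(self.obs, self.act, self.fuse):
            x, y = obs_stage(x), act_stage(y)
            mixed = fuse(torch.cat([x, y], dim=1))  # fused complementary features
            x, y = x + mixed, y + mixed             # feed the fusion back to both branches
        x = x.mean(dim=(2, 3, 4))                   # global average pool over (T, H, W)
        y = y.mean(dim=(2, 3, 4))
        return self.classifier(torch.cat([x, y], dim=1))

# Example: 16-frame RGB snippets for both branches.
model = CrossFusionTwoBranch(num_classes=9)
obs = torch.randn(2, 3, 16, 112, 112)
act = torch.randn(2, 3, 16, 112, 112)
print(model(obs, act).shape)  # torch.Size([2, 9])
```

A multimodal multi-stream variant in the spirit of the abstract would run one such pair of streams per modality (e.g. RGB and optical flow) and combine their predictions.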
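Likewise, the abstract leaves the variable sampling scheme unspecified. One plausible reading, sketched below purely as an assumption, is to split each video into equal segments and draw a fixed-length snippet from a randomly varying offset inside each segment, so that training sees different candidate snippets of the same video across epochs rather than a fixed uniform grid.

```python
import numpy as np

def variable_sample_snippets(num_frames, num_segments=3, snippet_len=16, rng=None):
    """Hypothetical variable sampling: one snippet per segment, with the
    start offset drawn at random inside that segment (the paper's exact
    scheme may differ). Returns one frame-index array per snippet."""
    rng = rng or np.random.default_rng()
    seg_len = num_frames // num_segments
    snippets = []
    for s in range(num_segments):
        lo = s * seg_len
        hi = max(lo, lo + seg_len - snippet_len)  # last valid start in this segment
        start = int(rng.integers(lo, hi + 1))
        idx = np.arange(start, start + snippet_len)
        idx = np.clip(idx, 0, num_frames - 1)     # pad short videos by repeating the last frame
        snippets.append(idx)
    return snippets

# Example: three 16-frame snippets from a 120-frame video.
for idx in variable_sample_snippets(120):
    print(idx[0], "...", idx[-1])
```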


Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 61672285, U1611461, 61732007, and 61702265, and by the Natural Science Foundation of Jiangsu Province under Grant BK20170856.

Author information

Corresponding author: Yan Song.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Jiang, H., Song, Y., He, J., Shu, X. (2020). Cross Fusion for Egocentric Interactive Action Recognition. In: Ro, Y., et al. (eds.) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol. 11961. Springer, Cham. https://doi.org/10.1007/978-3-030-37731-1_58

  • DOI: https://doi.org/10.1007/978-3-030-37731-1_58

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-37730-4

  • Online ISBN: 978-3-030-37731-1
