Abstract
The characteristics of egocentric interactive videos, including heavy ego-motion, frequent viewpoint changes, and multiple types of activities, prevent third-person action recognition methods from achieving satisfactory results. In this paper, we introduce an effective two-branch architecture with a cross fusion method for action recognition in egocentric interactive vision. The two branches model the information from observers and inter-actors respectively, and each branch is built on multimodal multi-stream C3D networks. We leverage cross fusion to establish effective linkages between the two branches, aiming to reduce redundant information and fuse complementary features. In addition, we propose variable sampling to obtain discriminative snippets for training. Experimental results demonstrate that the proposed architecture outperforms several state-of-the-art methods on two benchmarks.
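The two ideas named above can be illustrated with a minimal, self-contained sketch: variable sampling picks the most discriminative snippets from a video by some per-frame motion score, and cross fusion combines observer and inter-actor branch features. The scoring rule (mean motion per window) and the fusion scheme (concatenating both branch features with their element-wise product, then normalising) are illustrative assumptions, not the paper's exact formulation:

```python
import math
import random

def variable_sampling(motion_scores, snippet_len=16, n_snippets=3):
    """Greedily pick n_snippets non-overlapping windows with the
    highest mean motion score (a stand-in for 'discriminative')."""
    n = len(motion_scores)
    # Mean score of every candidate window of length snippet_len.
    window = [sum(motion_scores[s:s + snippet_len]) / snippet_len
              for s in range(n - snippet_len + 1)]
    chosen = []
    for start in sorted(range(len(window)), key=lambda s: -window[s]):
        # Enforce non-overlap with already chosen snippets.
        if all(abs(start - c) >= snippet_len for c in chosen):
            chosen.append(start)
        if len(chosen) == n_snippets:
            break
    return sorted(chosen)

def cross_fusion(obs_feat, inter_feat):
    """Fuse observer and inter-actor features: keep both branch-specific
    parts plus their element-wise product (a shared/agreement term),
    then L2-normalise the concatenation."""
    shared = [a * b for a, b in zip(obs_feat, inter_feat)]
    fused = obs_feat + inter_feat + shared
    norm = math.sqrt(sum(x * x for x in fused)) or 1.0
    return [x / norm for x in fused]

random.seed(0)
scores = [random.random() for _ in range(64)]   # fake per-frame motion
starts = variable_sampling(scores)              # three snippet starts
fused = cross_fusion([random.random() for _ in range(8)],
                     [random.random() for _ in range(8)])
print(starts, len(fused))                       # fused vector has 3*8 dims
```

In the actual architecture each branch is a multimodal multi-stream C3D network, so the fused vector would be a learned intermediate feature rather than a raw concatenation; the sketch only conveys the data flow.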
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grants 61672285, U1611461, 61732007, and 61702265, and by the Natural Science Foundation of Jiangsu Province (Grant No. BK20170856).
© 2020 Springer Nature Switzerland AG
Cite this paper
Jiang, H., Song, Y., He, J., Shu, X. (2020). Cross Fusion for Egocentric Interactive Action Recognition. In: Ro, Y., et al. (eds.) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol 11961. Springer, Cham. https://doi.org/10.1007/978-3-030-37731-1_58
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37730-4
Online ISBN: 978-3-030-37731-1