Abstract
The characteristics of egocentric interactive videos, including heavy ego-motion, frequent viewpoint changes, and multiple types of activities, prevent third-person action recognition methods from achieving satisfactory results. In this paper, we introduce an effective two-branch architecture with a cross fusion method for action recognition in egocentric interactive vision. The two branches model the information from observers and inter-actors respectively, and each branch is built on multimodal multi-stream C3D networks. We leverage cross fusion to establish effective linkages between the two branches, aiming to reduce redundant information and fuse complementary features. In addition, we propose variable sampling to obtain discriminative snippets for training. Experimental results demonstrate that the proposed architecture outperforms several state-of-the-art methods on two benchmarks.
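The two ideas named above can be illustrated with a minimal, self-contained sketch: variable sampling picks the most discriminative snippets from a video by some per-frame motion score, and cross fusion combines observer and inter-actor branch features. The scoring rule (mean motion per window) and the fusion scheme (concatenating both branch features with their element-wise product, then normalising) are illustrative assumptions, not the paper's exact formulation:

```python
import math
import random

def variable_sampling(motion_scores, snippet_len=16, n_snippets=3):
    """Greedily pick n_snippets non-overlapping windows with the
    highest mean motion score (a stand-in for 'discriminative')."""
    n = len(motion_scores)
    # Mean score of every candidate window of length snippet_len.
    window = [sum(motion_scores[s:s + snippet_len]) / snippet_len
              for s in range(n - snippet_len + 1)]
    chosen = []
    for start in sorted(range(len(window)), key=lambda s: -window[s]):
        # Enforce non-overlap with already chosen snippets.
        if all(abs(start - c) >= snippet_len for c in chosen):
            chosen.append(start)
        if len(chosen) == n_snippets:
            break
    return sorted(chosen)

def cross_fusion(obs_feat, inter_feat):
    """Fuse observer and inter-actor features: keep both branch-specific
    parts plus their element-wise product (a shared/agreement term),
    then L2-normalise the concatenation."""
    shared = [a * b for a, b in zip(obs_feat, inter_feat)]
    fused = obs_feat + inter_feat + shared
    norm = math.sqrt(sum(x * x for x in fused)) or 1.0
    return [x / norm for x in fused]

random.seed(0)
scores = [random.random() for _ in range(64)]   # fake per-frame motion
starts = variable_sampling(scores)              # three snippet starts
fused = cross_fusion([random.random() for _ in range(8)],
                     [random.random() for _ in range(8)])
print(starts, len(fused))                       # fused vector has 3*8 dims
```

In the actual architecture each branch is a multimodal multi-stream C3D network, so the fused vector would be a learned intermediate feature rather than a raw concatenation; the sketch only conveys the data flow.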
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grants 61672285, U1611461, 61732007, and 61702265, and by the Natural Science Foundation of Jiangsu Province (Grant No. BK20170856).
© 2020 Springer Nature Switzerland AG
Cite this paper
Jiang, H., Song, Y., He, J., Shu, X. (2020). Cross Fusion for Egocentric Interactive Action Recognition. In: Ro, Y., et al. (eds.) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol 11961. Springer, Cham. https://doi.org/10.1007/978-3-030-37731-1_58
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37730-4
Online ISBN: 978-3-030-37731-1