Abstract
Action recognition in first-person videos differs from that in third-person videos. In this paper, we aim to recognize interactive actions in first-person videos. First-person interactive actions contain two kinds of motion: the ego-motion of the observer and the motion of the actor. To enable an observer to understand "what activity others are doing to me", we propose a twin-stream network architecture based on 3D convolutional networks (C3D). The global-action C3D learns interactions together with the ego-motion, while the local salient-motion C3D analyzes the actor's motion within a salient region, which matters especially when the action happens at a distance from the observer. We also propose a sampling method to extract clips as input to the C3D models, and we investigate different C3D architectures to improve performance. Experiments on the JPL first-person interaction benchmark dataset show that the ensemble of the global and local networks improves accuracy over state-of-the-art methods by 3.26%.
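To make the twin-stream design concrete, below is a minimal PyTorch sketch of a global/local C3D ensemble with late score fusion. It is illustrative only: the backbone here is far shallower than the C3D of Tran et al., the uniform clip sampler stands in for the paper's sampling method (which this abstract does not specify), and the names (C3DStream, TwinStreamEnsemble) and the assumption of 7 output classes (the JPL dataset's interaction categories) are ours, not taken from the paper.

import torch
import torch.nn as nn

def sample_clip_starts(num_frames: int, clip_len: int = 16, num_clips: int = 4):
    # Uniformly spaced clip start indices; a stand-in for the paper's
    # sampling method, which is not described in this abstract.
    last_start = max(num_frames - clip_len, 0)
    return torch.linspace(0, last_start, num_clips).long().tolist()

class C3DStream(nn.Module):
    # A small C3D-style 3D-convolutional backbone. Illustrative only:
    # the C3D of Tran et al. uses eight conv layers plus two FC layers.
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space only at first
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),          # pool space and time
            nn.AdaptiveAvgPool3d(1),              # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, clip):                      # clip: (N, 3, T, H, W)
        return self.classifier(self.features(clip).flatten(1))

class TwinStreamEnsemble(nn.Module):
    # Late fusion of the global stream (whole frame, captures ego-motion)
    # and the local stream (cropped salient-motion region).
    def __init__(self, num_classes: int):
        super().__init__()
        self.global_stream = C3DStream(num_classes)
        self.local_stream = C3DStream(num_classes)

    def forward(self, global_clip, local_clip):
        g = self.global_stream(global_clip).softmax(dim=1)
        l = self.local_stream(local_clip).softmax(dim=1)
        return (g + l) / 2                        # averaged class scores

# Usage: two 16-frame 112x112 clips per sample, C3D's usual input size.
model = TwinStreamEnsemble(num_classes=7)         # assuming JPL's 7 classes
global_clip = torch.randn(2, 3, 16, 112, 112)
local_clip = torch.randn(2, 3, 16, 112, 112)      # cropped salient region
scores = model(global_clip, local_clip)           # shape (2, 7)

Averaging softmax scores is the simplest late-fusion choice; a weighted average or a classifier over concatenated stream features would slot into the same structure without changing either backbone.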
Acknowledgement
This work was supported in part by the 973 Program (Project No. 2014CB347600), the Natural Science Foundation of Jiangsu Province (Grant No. BK20170856), the National Natural Science Foundation of China (Grant Nos. 61672285 and 61702265) and the CCF-Tencent Open Research Fund.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Fa, L., Song, Y., Shu, X. (2018). Global and Local C3D Ensemble System for First Person Interactive Action Recognition. In: Schoeffmann, K., et al. (eds.) MultiMedia Modeling. MMM 2018. Lecture Notes in Computer Science, vol. 10705. Springer, Cham. https://doi.org/10.1007/978-3-319-73600-6_14
DOI: https://doi.org/10.1007/978-3-319-73600-6_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73599-3
Online ISBN: 978-3-319-73600-6