Abstract
Egocentric Human Activity Recognition (ego-HAR) has received attention in fields where human intentions must be estimated from video. The performance of existing methods, however, is limited by the insufficient information about the subject's motion available in egocentric videos alone. We consider that a dataset pairing egocentric videos with data from two inertial sensors, one attached to each wrist of the subject, provides richer information about the subject's motion and is therefore useful for studying this problem in depth. This paper thus provides a publicly available dataset, EvIs-Kitchen, which contains well-synchronized egocentric videos and two-hand inertial sensor data, together with interaction-highlighted annotations. We also present a baseline multimodal activity recognition method with a two-stream architecture and score fusion, to validate that such multimodal learning on egocentric videos and inertial sensor data is more effective for tackling the problem. Experiments show that our multimodal method outperforms single-modal methods on EvIs-Kitchen.
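To make the two-stream design concrete, below is a minimal sketch (not the authors' released code) of score-level fusion between a video stream and a wrist-IMU stream. The backbone modules, the fusion weight, and all names are assumptions for illustration only.

    # Minimal sketch of two-stream score fusion for ego-HAR (hypothetical
    # module names; not the EvIs-Kitchen baseline's actual implementation).
    import torch
    import torch.nn as nn

    class TwoStreamScoreFusion(nn.Module):
        def __init__(self, video_backbone: nn.Module, imu_backbone: nn.Module,
                     video_weight: float = 0.5):
            super().__init__()
            self.video_stream = video_backbone  # e.g. a 3D CNN over RGB clips
            self.imu_stream = imu_backbone      # e.g. a 1D CNN over two-wrist IMU sequences
            self.video_weight = video_weight    # relative trust in the video stream

        def forward(self, video_clip: torch.Tensor, imu_seq: torch.Tensor) -> torch.Tensor:
            # Each stream independently produces per-class scores (logits).
            video_scores = self.video_stream(video_clip)  # shape (B, num_classes)
            imu_scores = self.imu_stream(imu_seq)         # shape (B, num_classes)
            # Late (score-level) fusion: a weighted sum of the two predictions.
            w = self.video_weight
            return w * video_scores + (1.0 - w) * imu_scores

At inference, an argmax over the fused scores yields the predicted activity label; the fusion weight would typically be tuned on a validation split.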
Acknowledgement
This work is an outcome of a research project, Development of Quality Foundation for Machine-Learning Applications, supported by the DENSO IT LAB Recognition and Learning Algorithm Collaborative Research Chair (Tokyo Tech.). It was also supported by JST CREST Grant JPMJCR1687.