Abstract
We consider the problem of detecting Egocentric Human-Object Interactions (EHOIs) in industrial contexts. Since collecting and labeling large amounts of real images is challenging, we propose a pipeline and a tool to automatically generate photo-realistic synthetic First Person Vision (FPV) images labeled for EHOI detection in a specific industrial scenario. To tackle the problem of EHOI detection, we propose a method that detects hands and objects in the scene and determines which objects are currently involved in an interaction. We compare the performance of our method with a set of state-of-the-art baselines. Results show that using a synthetic dataset improves the performance of an EHOI detection system, especially when little real data is available. To encourage research on this topic, we publicly release the proposed dataset at the following URL: https://iplab.dmi.unict.it/EHOI_SYNTH/.
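The abstract summarizes a pipeline that detects hands and objects and then decides which objects are involved in an interaction. The snippet below is a purely illustrative sketch, not the authors' model: it shows one simple way such an association step could sit on top of any off-the-shelf hand/object detector, linking each detected hand to the nearest object box within an assumed pixel threshold. All class names, boxes, and the threshold are hypothetical.

```python
# Illustrative sketch (not the method proposed in the paper): given hand and
# object detections from any off-the-shelf detector (e.g. Faster R-CNN),
# associate each hand with the closest object box and treat that object as
# the one currently involved in an interaction. Names and thresholds are
# assumptions made for this example only.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    score: float
    label: str


def _center(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def associate_active_objects(
    hands: List[Detection],
    objects: List[Detection],
    max_dist: float = 150.0,  # assumed distance threshold in pixels
) -> List[Tuple[Detection, Optional[Detection]]]:
    """For each hand, return the nearest object within max_dist
    (interpreted as the active object), or None if no object is close enough."""
    pairs = []
    for hand in hands:
        hx, hy = _center(hand.box)
        best, best_d = None, max_dist
        for obj in objects:
            ox, oy = _center(obj.box)
            d = ((hx - ox) ** 2 + (hy - oy) ** 2) ** 0.5
            if d < best_d:
                best, best_d = obj, d
        pairs.append((hand, best))
    return pairs


if __name__ == "__main__":
    hands = [Detection((100, 100, 160, 170), 0.95, "hand")]
    objects = [
        Detection((150, 120, 220, 200), 0.90, "screwdriver"),
        Detection((400, 300, 480, 360), 0.85, "socket"),
    ]
    for hand, obj in associate_active_objects(hands, objects):
        print(hand.label, "->", obj.label if obj else "no interaction")
```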
Notes
- 1. Ego4D website: https://ego4d-data.org/.
- 2. See the supplementary material for more details.
- 3.
- 4.
- 5. We used the following implementation: https://github.com/cocodataset/cocoapi (see the evaluation sketch after this list).
- 6.
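Note 5 refers to the official COCO API. Below is a minimal sketch of how a COCO-style detection evaluation is typically run with pycocotools; the file names are placeholder assumptions, not artifacts released with the paper.

```python
# Minimal sketch of a COCO-style detection evaluation with pycocotools
# (https://github.com/cocodataset/cocoapi). Assumes a ground-truth annotation
# file and a detection results file in standard COCO JSON format; both paths
# below are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_test.json")     # ground-truth boxes
coco_dt = coco_gt.loadRes("detections/results.json")  # predicted boxes

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP/AR at the standard COCO IoU thresholds
```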
Acknowledgements
This research has been supported by Next Vision (https://www.nextvisionlab.it/) s.r.l., by the project MISE - PON I&C 2014–2020 - Progetto ENIGMA - Prog n. F/190050/02/X44 - CUP: B61B19000520008, and by Research Program Pia.ce.ri. 2020/2022 Linea 2 - University of Catania.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Leonardi, R., Ragusa, F., Furnari, A., Farinella, G.M. (2022). Egocentric Human-Object Interaction Detection Exploiting Synthetic Data. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13232. Springer, Cham. https://doi.org/10.1007/978-3-031-06430-2_20
DOI: https://doi.org/10.1007/978-3-031-06430-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06429-6
Online ISBN: 978-3-031-06430-2
eBook Packages: Computer Science, Computer Science (R0)