Abstract
In this study, we investigate the effectiveness of synthetic data in enhancing egocentric hand-object interaction (HOI) detection. Through extensive experiments and comparative analyses on three egocentric datasets, VISOR, EgoHOS, and ENIGMA-51, our findings reveal how to exploit synthetic data for the HOI detection task when real labeled data are scarce or unavailable. Specifically, by leveraging only \(10\%\) of the real labeled data, we achieve the following improvements in Overall AP over baselines trained exclusively on real data: \(+5.67\%\) on EPIC-KITCHENS VISOR, \(+8.24\%\) on EgoHOS, and \(+11.69\%\) on ENIGMA-51. Our analysis is supported by a novel data generation pipeline and the newly introduced HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interactions automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. Data, code, and data generation tools to support future research are released at https://fpv-iplab.github.io/HOI-Synth/.
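To make the setting concrete, the sketch below shows one plausible way to subsample a \(10\%\) real labeled split and merge it with automatically labeled synthetic images before detector training. All function and field names here (`subsample_real`, `merge`, `hand_bbox`, `contact_state`) are hypothetical illustrations under COCO-style conventions, not the released HOI-Synth format; consult the project page for the actual data layout.

```python
# Minimal sketch (hypothetical field/function names, not the released
# HOI-Synth format): subsample 10% of a real labeled set and merge it
# with automatically labeled synthetic images before detector training.
import random

def subsample_real(coco, fraction=0.10, seed=0):
    """Keep a random `fraction` of the real labeled images and their annotations."""
    rng = random.Random(seed)
    kept = rng.sample(coco["images"], max(1, int(len(coco["images"]) * fraction)))
    kept_ids = {img["id"] for img in kept}
    return {
        "images": kept,
        "annotations": [a for a in coco["annotations"] if a["image_id"] in kept_ids],
        "categories": coco["categories"],
    }

def merge(real, synth):
    """Concatenate the two sets, offsetting synthetic image ids to avoid clashes."""
    offset = 1 + max((img["id"] for img in real["images"]), default=0)
    images = real["images"] + [dict(img, id=img["id"] + offset) for img in synth["images"]]
    anns = real["annotations"] + [
        dict(a, image_id=a["image_id"] + offset) for a in synth["annotations"]
    ]
    return {"images": images, "annotations": anns, "categories": real["categories"]}

# Illustrative synthetic annotation carrying the three label types named in the
# abstract: hand/object boxes, a contact state, and a pixel-wise mask (RLE).
synth = {
    "images": [{"id": 0, "file_name": "synth_000000.png"}],
    "annotations": [{
        "image_id": 0,
        "hand_bbox": [120.0, 80.0, 60.0, 60.0],    # COCO-style [x, y, w, h]
        "object_bbox": [150.0, 100.0, 90.0, 70.0],
        "contact_state": 1,                         # 1 = hand in contact with object
        "segmentation": {"size": [480, 640], "counts": ""},  # RLE placeholder
    }],
    "categories": [{"id": 1, "name": "hand"}, {"id": 2, "name": "active_object"}],
}
real = {"images": [{"id": i, "file_name": f"real_{i:06d}.jpg"} for i in range(100)],
        "annotations": [], "categories": synth["categories"]}

train_set = merge(subsample_real(real, fraction=0.10), synth)  # 10 real + 1 synthetic image
```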
Notes
1. See the supplementary material for examples of in-domain and out-of-domain generated images and for additional details about architectures and training setups.
2. Note that, in our implementation, the results of the HOS model differ from those reported in [10] because, for fair comparison, we adopted a batch size of 4, the largest batch size achievable with domain adaptation models in our configuration.
3. Additional qualitative examples are reported in the supplementary material.
References
Besari, A.R.A., Saputra, A.A., Chin, W.H., Kubota, N., et al.: Hand–object interaction recognition based on visual attention using multiscopic cyber-physical-social system. Int. J. Adv. Intell. Inform. 9(2) (2023)
Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised pixel-level domain adaptation with generative adversarial networks. In: CVPR, pp. 3722–3731 (2017)
Cai, Q., Pan, Y., Ngo, C.W., Tian, X., Duan, L., Yao, T.: Exploring object relation in mean teacher for cross-domain detection. In: CVPR, pp. 11457–11466 (2019)
Carfì, A., et al.: Hand-object interaction: from human demonstrations to robot manipulation. Front. Robot. AI 8, 714023 (2021)
Cheng, T., Shan, D., Hassen, A.S., Higgins, R.E.L., Fouhey, D.: Towards a richer 2D understanding of hands at scale. In: NeurIPS (2023)
Choudhary, A., Mishra, D., Karmakar, A.: Domain adaptive egocentric person re-identification. In: Computer Vision and Image Processing (CVIP), pp. 81–92 (2021)
Csurka, G.: Domain adaptation for visual applications: a comprehensive survey (2017). https://arxiv.org/abs/1702.05374
Damen, D., et al.: Rescaling egocentric vision: collection, pipeline and challenges for EPIC-KITCHENS-100. IJCV, 1–23 (2021)
Damen, D., et al.: Scaling egocentric vision: the EPIC-KITCHENS dataset. In: ECCV, pp. 720–736 (2018)
Darkhalil, A., et al.: EPIC-KITCHENS VISOR benchmark: video segmentations and object relations. In: NeurIPS, pp. 13745–13758 (2022)
Deng, J., Li, W., Chen, Y., Duan, L.: Unbiased mean teacher for cross-domain object detection. In: CVPR, pp. 4091–4101 (2021)
Di Benedetto, M., Carrara, F., Meloni, E., Amato, G., Falchi, F., Gennaro, C.: Learning accurate personal protective equipment detection from virtual worlds. Multimedia Tools Appl. 80, 23241–23253 (2021)
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: an open urban driving simulator. In: Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16 (2017)
Edsinger, A., Kemp, C.C.: Human-robot interaction for cooperative manipulation: handing objects to one another. In: RO-MAN 2007-The 16th IEEE International Symposium on Robot and Human Interactive Communication, pp. 1167–1172. IEEE (2007)
Fabbri, M., et al.: MOTSynth: how can synthetic data help pedestrian detection and tracking? In: ICCV (2021)
Fu, Q., Liu, X., Kitani, K.M.: Sequential voting with relational box fields for active object detection. In: CVPR, pp. 2374–2383 (2022)
Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning, pp. 1180–1189. PMLR (2015)
Grauman, K., et al.: Ego4D: around the world in 3,000 hours of egocentric video. In: CVPR, pp. 18995–19012 (2022)
Hasson, Y., et al.: Learning joint reconstruction of hands and manipulated objects. In: CVPR (2019)
Jian, J., Liu, X., Li, M., Hu, R., Liu, J.: AffordPose: a large-scale dataset of hand-object interactions with affordance-driven hand pose. In: ICCV, pp. 14713–14724 (2023)
Kirillov, A., Wu, Y., He, K., Girshick, R.: PointRend: image segmentation as rendering. In: CVPR, pp. 9799–9808 (2020)
Kolve, E., et al.: AI2-THOR: an interactive 3D environment for visual AI (2017). https://arxiv.org/abs/1712.05474
Leonardi, R., Ragusa, F., Furnari, A., Farinella, G.M.: Egocentric human-object interaction detection exploiting synthetic data. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds.) ICIAP 2022. LNCS, vol. 13232, pp. 237–248. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-06430-2_20
Li, C., et al.: iGibson 2.0: object-centric simulation for robot learning of everyday household tasks. In: Faust, A., Hsu, D., Neumann, G. (eds.) Proceedings of the 5th Conference on Robot Learning. Proceedings of Machine Learning Research, vol. 164, pp. 455–465. PMLR (2022). https://proceedings.mlr.press/v164/li22b.html
Li, Y., Nagarajan, T., Xiong, B., Grauman, K.: Ego-exo: transferring visual representations from third-person to first-person videos. In: CVPR, pp. 6943–6953 (2021)
Li, Y.J., et al.: Cross-domain adaptive teacher for object detection. In: CVPR, pp. 7581–7590 (2022)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, S., Tripathi, S., Majumdar, S., Wang, X.: Joint hand motion and interaction hotspots prediction from egocentric videos. In: CVPR, pp. 3282–3292 (2022)
Liu, Y.C., et al.: Unbiased teacher for semi-supervised object detection. In: ICLR (2021)
Lu, Y., Mayol-Cuevas, W.W.: Egocentric hand-object interaction detection and application (2021). https://arxiv.org/abs/2109.14734
Lv, Z., Poiesi, F., Dong, Q., Lloret, J., Song, H.: Deep learning for intelligent human-computer interaction. Appl. Sci. 12(22), 11457 (2022)
Munro, J., Damen, D.: Multi-modal domain adaptation for fine-grained action recognition. In: CVPR, pp. 122–132 (2020)
Munro, J., Wray, M., Larlus, D., Csurka, G., Damen, D.: Domain adaptation in multi-view embedding for cross-modal video retrieval (2021). https://arxiv.org/abs/2110.12812
NVIDIA: NVIDIA Omniverse (2020). https://www.nvidia.com/en-us/omniverse/synthetic-data/
NVIDIA: NVIDIA Isaac Sim (2021). https://developer.nvidia.com/isaac-sim
Orlando, S., Furnari, A., Farinella, G.M.: Egocentric visitor localization and artwork detection in cultural sites using synthetic data. Pattern Recognition Letters - Special Issue on Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage (2020). https://iplab.dmi.unict.it/SimulatedEgocentricNavigations/
Pasqualino, G., Furnari, A., Signorello, G., Farinella, G.M.: An unsupervised domain adaptation scheme for single-stage artwork recognition in cultural sites. Image Vis. Comput. 107, 104098 (2021)
Plizzari, C., Perrett, T., Caputo, B., Damen, D.: What can a cook in Italy teach a mechanic in India? Action recognition generalisation over scenarios and locations. In: ICCV (2023)
Quattrocchi, C., Mauro, D.D., Furnari, A., Lopes, A., Moltisanti, M., Farinella, G.M.: Put your PPE on: a tool for synthetic data generation and related benchmark in construction site scenarios. In: International Conference on Computer Vision Theory and Applications, pp. 656–663 (2023)
Ragusa, F., Furnari, A., Livatino, S., Farinella, G.M.: The MECCANO dataset: understanding human-object interactions from egocentric videos in an industrial-like domain. In: WACV, pp. 1569–1578 (2021)
Ragusa, F., et al.: ENIGMA-51: towards a fine-grained understanding of human behavior in industrial scenarios. In: WACV, pp. 4549–4559 (2024)
Ramakrishnan, S.K., et al.: Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In: NeurIPS (2021)
Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: CVPR, pp. 3723–3732 (2018)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV, pp. 9339–9347 (2019)
Sener, F., et al.: Assembly101: a large-scale multi-view video dataset for understanding procedural activities. In: CVPR, pp. 21096–21106 (2022)
Shadow Robot: ShadowHand (2005). https://www.shadowrobot.com/dexterous-hand-series/
Shan, D., Geng, J., Shu, M., Fouhey, D.F.: Understanding human hands in contact at internet scale. In: CVPR, pp. 9869–9878 (2020)
Szot, A., et al.: Habitat 2.0: training home assistants to rearrange their habitat. In: Advances in Neural Information Processing Systems, vol. 34, pp. 251–266 (2021)
Tang, Y., Tian, Y., Lu, J., Feng, J., Zhou, J.: Action recognition in RGB-D egocentric videos. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3410–3414. IEEE (2017)
Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS 30 (2017)
Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR, pp. 7167–7176 (2017)
Unity: SyntheticHumans package (Unity Computer Vision) (2022). https://github.com/Unity-Technologies/com.unity.cv.synthetichumans
Wang, R., et al.: DexGraspNet: a large-scale robotic dexterous grasp dataset for general objects based on simulation. In: CVPR, pp. 11359–11366 (2023)
Xia, F., Zamir, A.R., He, Z.Y., Sax, A., Malik, J., Savarese, S.: Gibson Env: real-world perception for embodied agents. In: CVPR (2018)
Xia, F., et al.: Interactive Gibson benchmark: a benchmark for interactive navigation in cluttered environments. IEEE Robot. Autom. Lett. 5(2), 713–720 (2020)
Ye, Y., et al.: Affordance diffusion: synthesizing hand-object interactions. In: CVPR, pp. 22479–22489 (2023)
Zhang, L., Zhou, S., Stent, S., Shi, J.: Fine-grained egocentric hand-object segmentation: dataset, model, and applications. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13689, pp. 127–145. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19818-2_8
Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2020)
Acknowledgments
This research has been supported by the project Future Artificial Intelligence Research (FAIR) - PNRR MUR Cod. PE0000013 - CUP E63C22001940006, and partially supported by the project EXTRA-EYE - PRIN 2022 - CUP E53D23008280006, funded by the European Union - Next Generation EU.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Leonardi, R., Furnari, A., Ragusa, F., Farinella, G.M. (2025). Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection? In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15129. Springer, Cham. https://doi.org/10.1007/978-3-031-73209-6_3
DOI: https://doi.org/10.1007/978-3-031-73209-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73208-9
Online ISBN: 978-3-031-73209-6
eBook Packages: Computer Science, Computer Science (R0)