Abstract
In this study, we investigate the effectiveness of synthetic data in enhancing egocentric hand-object interaction (HOI) detection. Through extensive experiments and comparative analyses on three egocentric datasets, VISOR, EgoHOS, and ENIGMA-51, our findings reveal how to exploit synthetic data for the HOI detection task when real labeled data are scarce or unavailable. Specifically, by leveraging only \(10\%\) of the real labeled data, we achieve the following improvements in Overall AP over baselines trained exclusively on real data: \(+5.67\%\) on EPIC-KITCHENS VISOR, \(+8.24\%\) on EgoHOS, and \(+11.69\%\) on ENIGMA-51. Our analysis is supported by a novel data generation pipeline and the newly introduced HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interactions automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. Data, code, and data generation tools to support future research are released at https://fpv-iplab.github.io/HOI-Synth/.
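To make the setting concrete, the sketch below shows one plausible way to subsample a \(10\%\) real labeled split and merge it with automatically labeled synthetic images before detector training. All function and field names here (`subsample_real`, `merge`, `hand_bbox`, `contact_state`) are hypothetical illustrations under COCO-style conventions, not the released HOI-Synth format; consult the project page for the actual data layout.

```python
# Minimal sketch (hypothetical field/function names, not the released
# HOI-Synth format): subsample 10% of a real labeled set and merge it
# with automatically labeled synthetic images before detector training.
import random

def subsample_real(coco, fraction=0.10, seed=0):
    """Keep a random `fraction` of the real labeled images and their annotations."""
    rng = random.Random(seed)
    kept = rng.sample(coco["images"], max(1, int(len(coco["images"]) * fraction)))
    kept_ids = {img["id"] for img in kept}
    return {
        "images": kept,
        "annotations": [a for a in coco["annotations"] if a["image_id"] in kept_ids],
        "categories": coco["categories"],
    }

def merge(real, synth):
    """Concatenate the two sets, offsetting synthetic image ids to avoid clashes."""
    offset = 1 + max((img["id"] for img in real["images"]), default=0)
    images = real["images"] + [dict(img, id=img["id"] + offset) for img in synth["images"]]
    anns = real["annotations"] + [
        dict(a, image_id=a["image_id"] + offset) for a in synth["annotations"]
    ]
    return {"images": images, "annotations": anns, "categories": real["categories"]}

# Illustrative synthetic annotation carrying the three label types named in the
# abstract: hand/object boxes, a contact state, and a pixel-wise mask (RLE).
synth = {
    "images": [{"id": 0, "file_name": "synth_000000.png"}],
    "annotations": [{
        "image_id": 0,
        "hand_bbox": [120.0, 80.0, 60.0, 60.0],    # COCO-style [x, y, w, h]
        "object_bbox": [150.0, 100.0, 90.0, 70.0],
        "contact_state": 1,                         # 1 = hand in contact with object
        "segmentation": {"size": [480, 640], "counts": ""},  # RLE placeholder
    }],
    "categories": [{"id": 1, "name": "hand"}, {"id": 2, "name": "active_object"}],
}
real = {"images": [{"id": i, "file_name": f"real_{i:06d}.jpg"} for i in range(100)],
        "annotations": [], "categories": synth["categories"]}

train_set = merge(subsample_real(real, fraction=0.10), synth)  # 10 real + 1 synthetic image
```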
Notes
1. See the supplementary material for examples of in-domain and out-of-domain generated images and for additional details about architectures and training setups.
2. Note that, in our implementation, the results of the HOS model differ from those reported in [10] because, for fair comparison, we adopted a batch size of 4, the largest batch size achievable with domain adaptation models in our configuration.
3. Additional qualitative examples are reported in the supplementary material.
References
Besari, A.R.A., Saputra, A.A., Chin, W.H., Kubota, N., et al.: Hand–object interaction recognition based on visual attention using multiscopic cyber-physical-social system. Int. J. Adv. Intell. Inform. 9(2) (2023)
Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised pixel-level domain adaptation with generative adversarial networks. In: CVPR, pp. 3722–3731 (2017)
Cai, Q., Pan, Y., Ngo, C.W., Tian, X., Duan, L., Yao, T.: Exploring object relation in mean teacher for cross-domain detection. In: CVPR, pp. 11457–11466 (2019)
Carfì, A., et al.: Hand-object interaction: from human demonstrations to robot manipulation. Front. Robot. AI 8, 714023 (2021)
Cheng, T., Shan, D., Hassen, A.S., Higgins, R.E.L., Fouhey, D.: Towards a richer 2D understanding of hands at scale. In: NeurIPS (2023)
Choudhary, A., Mishra, D., Karmakar, A.: Domain adaptive egocentric person re-identification. In: Computer Vision and Image Processing (CVIP), pp. 81–92 (2021)
Csurka, G.: Domain adaptation for visual applications: a comprehensive survey (2017). https://arxiv.org/abs/1702.05374
Damen, D., et al.: Rescaling egocentric vision: collection, pipeline and challenges for EPIC-KITCHENS-100. IJCV, 1–23 (2021)
Damen, D., et al.: Scaling egocentric vision: the EPIC-KITCHENS dataset. In: ECCV, pp. 720–736 (2018)
Darkhalil, A., et al.: EPIC-KITCHENS VISOR benchmark: video segmentations and object relations. In: NeurIPS, pp. 13745–13758 (2022)
Deng, J., Li, W., Chen, Y., Duan, L.: Unbiased mean teacher for cross-domain object detection. In: CVPR, pp. 4091–4101 (2021)
Di Benedetto, M., Carrara, F., Meloni, E., Amato, G., Falchi, F., Gennaro, C.: Learning accurate personal protective equipment detection from virtual worlds. Multimedia Tools Appl. 80, 23241–23253 (2021)
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: an open urban driving simulator. In: Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16 (2017)
Edsinger, A., Kemp, C.C.: Human-robot interaction for cooperative manipulation: handing objects to one another. In: RO-MAN 2007-The 16th IEEE International Symposium on Robot and Human Interactive Communication, pp. 1167–1172. IEEE (2007)
Fabbri, M., et al.: MOTSynth: how can synthetic data help pedestrian detection and tracking? In: ICCV (2021)
Fu, Q., Liu, X., Kitani, K.M.: Sequential voting with relational box fields for active object detection. In: CVPR, pp. 2374–2383 (2022)
Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning, pp. 1180–1189. PMLR (2015)
Grauman, K., et al.: Ego4D: around the world in 3,000 hours of egocentric video. In: CVPR, pp. 18995–19012 (2022)
Hasson, Y., et al.: Learning joint reconstruction of hands and manipulated objects. In: CVPR (2019)
Jian, J., Liu, X., Li, M., Hu, R., Liu, J.: AffordPose: a large-scale dataset of hand-object interactions with affordance-driven hand pose. In: ICCV, pp. 14713–14724 (2023)
Kirillov, A., Wu, Y., He, K., Girshick, R.: PointRend: image segmentation as rendering. In: CVPR, pp. 9799–9808 (2020)
Kolve, E., et al.: AI2-THOR: an interactive 3D environment for visual AI (2017). https://arxiv.org/abs/1712.05474
Leonardi, R., Ragusa, F., Furnari, A., Farinella, G.M.: Egocentric human-object interaction detection exploiting synthetic data. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds.) ICIAP 2022. LNCS, vol. 13232, pp. 237–248. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-06430-2_20
Li, C., et al.: iGibson 2.0: object-centric simulation for robot learning of everyday household tasks. In: Faust, A., Hsu, D., Neumann, G. (eds.) Proceedings of the 5th Conference on Robot Learning. Proceedings of Machine Learning Research, vol. 164, pp. 455–465. PMLR (2022). https://proceedings.mlr.press/v164/li22b.html
Li, Y., Nagarajan, T., Xiong, B., Grauman, K.: Ego-exo: transferring visual representations from third-person to first-person videos. In: CVPR, pp. 6943–6953 (2021)
Li, Y.J., et al.: Cross-domain adaptive teacher for object detection. In: CVPR, pp. 7581–7590 (2022)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, S., Tripathi, S., Majumdar, S., Wang, X.: Joint hand motion and interaction hotspots prediction from egocentric videos. In: CVPR, pp. 3282–3292 (2022)
Liu, Y.C., et al.: Unbiased teacher for semi-supervised object detection. In: ICLR (2021)
Lu, Y., Mayol-Cuevas, W.W.: Egocentric hand-object interaction detection and application (2021). https://arxiv.org/abs/2109.14734
Lv, Z., Poiesi, F., Dong, Q., Lloret, J., Song, H.: Deep learning for intelligent human-computer interaction. Appl. Sci. 12(22), 11457 (2022)
Munro, J., Damen, D.: Multi-modal domain adaptation for fine-grained action recognition. In: CVPR, pp. 122–132 (2020)
Munro, J., Wray, M., Larlus, D., Csurka, G., Damen, D.: Domain adaptation in multi-view embedding for cross-modal video retrieval (2021). https://arxiv.org/abs/2110.12812
NVIDIA: NVIDIA Omniverse (2020). https://www.nvidia.com/en-us/omniverse/synthetic-data/
NVIDIA: NVIDIA Isaac Sim (2021). https://developer.nvidia.com/isaac-sim
Orlando, S., Furnari, A., Farinella, G.M.: Egocentric visitor localization and artwork detection in cultural sites using synthetic data. Pattern Recognition Letters - Special Issue on Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage (2020). https://iplab.dmi.unict.it/SimulatedEgocentricNavigations/
Pasqualino, G., Furnari, A., Signorello, G., Farinella, G.M.: An unsupervised domain adaptation scheme for single-stage artwork recognition in cultural sites. Image Vis. Comput. 107, 104098 (2021)
Plizzari, C., Perrett, T., Caputo, B., Damen, D.: What can a cook in Italy teach a mechanic in India? Action recognition generalisation over scenarios and locations. In: ICCV (2023)
Quattrocchi, C., Mauro, D.D., Furnari, A., Lopes, A., Moltisanti, M., Farinella, G.M.: Put your PPE on: a tool for synthetic data generation and related benchmark in construction site scenarios. In: International Conference on Computer Vision Theory and Applications, pp. 656–663 (2023)
Ragusa, F., Furnari, A., Livatino, S., Farinella, G.M.: The MECCANO dataset: understanding human-object interactions from egocentric videos in an industrial-like domain. In: WACV, pp. 1569–1578 (2021)
Ragusa, F., et al.: ENIGMA-51: towards a fine-grained understanding of human behavior in industrial scenarios. In: WACV, pp. 4549–4559 (2024)
Ramakrishnan, S.K., et al.: Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In: NeurIPS (2021)
Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: CVPR, pp. 3723–3732 (2018)
Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV, pp. 9339–9347 (2019)
Sener, F., et al.: Assembly101: a large-scale multi-view video dataset for understanding procedural activities. In: CVPR, pp. 21096–21106 (2022)
Shadow Robot: ShadowHand (2005). https://www.shadowrobot.com/dexterous-hand-series/
Shan, D., Geng, J., Shu, M., Fouhey, D.F.: Understanding human hands in contact at internet scale. In: CVPR, pp. 9869–9878 (2020)
Szot, A., et al.: Habitat 2.0: training home assistants to rearrange their habitat. In: Advances in Neural Information Processing Systems, vol. 34, pp. 251–266 (2021)
Tang, Y., Tian, Y., Lu, J., Feng, J., Zhou, J.: Action recognition in RGB-D egocentric videos. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3410–3414. IEEE (2017)
Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS 30 (2017)
Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR, pp. 7167–7176 (2017)
Unity: SyntheticHumans package (Unity Computer Vision) (2022). https://github.com/Unity-Technologies/com.unity.cv.synthetichumans
Wang, R., et al.: DexGraspNet: a large-scale robotic dexterous grasp dataset for general objects based on simulation. In: CVPR, pp. 11359–11366 (2023)
Xia, F., Zamir, A.R., He, Z.Y., Sax, A., Malik, J., Savarese, S.: Gibson Env: real-world perception for embodied agents. In: CVPR (2018)
Xia, F., et al.: Interactive Gibson benchmark: a benchmark for interactive navigation in cluttered environments. IEEE Robot. Autom. Lett. 5(2), 713–720 (2020)
Ye, Y., et al.: Affordance diffusion: synthesizing hand-object interactions. In: CVPR, pp. 22479–22489 (2023)
Zhang, L., Zhou, S., Stent, S., Shi, J.: Fine-grained egocentric hand-object segmentation: dataset, model, and applications. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13689, pp. 127–145. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19818-2_8
Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2020)
Acknowledgments
This research has been supported by the project Future Artificial Intelligence Research (FAIR) - PNRR MUR Cod. PE0000013 - CUP E63C22001940006, and partially supported by the project EXTRA-EYE - PRIN 2022 - CUP E53D23008280006, funded by the European Union - Next Generation EU.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Leonardi, R., Furnari, A., Ragusa, F., Farinella, G.M. (2025). Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection? In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15129. Springer, Cham. https://doi.org/10.1007/978-3-031-73209-6_3
DOI: https://doi.org/10.1007/978-3-031-73209-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73208-9
Online ISBN: 978-3-031-73209-6
eBook Packages: Computer Science, Computer Science (R0)