
Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?

  • Conference paper
  • Part of: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

In this study, we investigate the effectiveness of synthetic data in enhancing egocentric hand-object interaction (HOI) detection. Through extensive experiments and comparative analyses on three egocentric datasets, VISOR, EgoHOS, and ENIGMA-51, our findings reveal how to exploit synthetic data for the HOI detection task when real labeled data are scarce or unavailable. Specifically, by leveraging only 10% of the real labeled data, we achieve improvements in Overall AP over baselines trained exclusively on real data of +5.67% on EPIC-KITCHENS VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Our analysis is supported by a novel data generation pipeline and the newly introduced HOI-Synth benchmark, which augments existing datasets with synthetic images of hand-object interactions automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. Data, code, and data generation tools to support future research are released at: https://fpv-iplab.github.io/HOI-Synth/.
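
To make the label-scarce setting above concrete, the following is a minimal Python sketch, assuming a hypothetical annotation schema and mixing strategy (not the authors' released code or data format): it shows how annotations carrying hand-object contact states, bounding boxes, and segmentation masks could be represented, and how a 10% subset of real labels could be pooled with automatically labeled synthetic images before training a detector. All class and field names are illustrative assumptions.

```python
# A minimal sketch, assuming a hypothetical annotation schema (NOT the released
# HOI-Synth format): hand-object annotations with contact states, bounding boxes,
# and segmentation masks, plus pooling of a 10% real subset with synthetic data.
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HOIAnnotation:
    image_path: str
    hand_box: List[float]                # [x1, y1, x2, y2] hand bounding box
    object_box: Optional[List[float]]    # active-object box; None if the hand is not in contact
    contact_state: str                   # e.g. "contact" or "no_contact"
    hand_mask_rle: Optional[str] = None  # pixel-wise hand mask (run-length encoded), if available
    is_synthetic: bool = False           # True for automatically labeled synthetic images

def mix_real_and_synthetic(real: List[HOIAnnotation],
                           synthetic: List[HOIAnnotation],
                           real_fraction: float = 0.10,
                           seed: int = 0) -> List[HOIAnnotation]:
    """Keep only `real_fraction` of the real labels (label-scarce setting)
    and pool them with all synthetic annotations for detector training."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(real) * real_fraction))  # assumes `real` is non-empty
    kept_real = rng.sample(real, n_keep)
    return kept_real + synthetic

# Usage (illustrative): train the HOI detector on `mixed` instead of the full real set.
# mixed = mix_real_and_synthetic(real_annotations, synthetic_annotations, real_fraction=0.10)
```

The actual data formats, training configurations, and generation tools are defined by the release available at the project page linked above.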

Notes

  1. See the supplementary material for examples of in-domain and out-of-domain generated images and for additional details about architectures and training setups.

  2. Note that, in our implementation, the results of the HOS model differ from those reported in [10] because, for fair comparison, we adopted a batch size of 4, the largest batch size achievable with the domain adaptation models in our configuration.

  3. Additional qualitative examples are reported in the supplementary material.

References

  1. Besari, A.R.A., Saputra, A.A., Chin, W.H., Kubota, N., et al.: Hand–object interaction recognition based on visual attention using multiscopic cyber-physical-social system. Int. J. Adv. Intell. Inform. 9(2) (2023)

  2. Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised pixel-level domain adaptation with generative adversarial networks. In: CVPR, pp. 3722–3731 (2017)

  3. Cai, Q., Pan, Y., Ngo, C.W., Tian, X., Duan, L., Yao, T.: Exploring object relation in mean teacher for cross-domain detection. In: CVPR, pp. 11457–11466 (2019)

  4. Carfì, A., et al.: Hand-object interaction: from human demonstrations to robot manipulation. Front. Robot. AI 8, 714023 (2021)

  5. Cheng, T., Shan, D., Hassen, A.S., Higgins, R.E.L., Fouhey, D.: Towards a richer 2D understanding of hands at scale. In: Thirty-Seventh Conference on Neural Information Processing Systems (2023)

  6. Choudhary, A., Mishra, D., Karmakar, A.: Domain adaptive egocentric person re-identification. In: Computer Vision and Image Processing (CVIP), pp. 81–92 (2021)

  7. Csurka, G.: Domain adaptation for visual applications: a comprehensive survey (2017). https://arxiv.org/abs/1702.05374

  8. Damen, D., et al.: Rescaling egocentric vision: collection, pipeline and challenges for epic-kitchens-100. IJCV, 1–23 (2021)

  9. Damen, D., et al.: Scaling egocentric vision: the epic-kitchens dataset. In: ECCV, pp. 720–736 (2018)

  10. Darkhalil, A., et al.: Epic-kitchens visor benchmark: video segmentations and object relations. In: NeurIPS, pp. 13745–13758 (2022)

  11. Deng, J., Li, W., Chen, Y., Duan, L.: Unbiased mean teacher for cross-domain object detection. In: CVPR, pp. 4091–4101 (2021)

  12. Di Benedetto, M., Carrara, F., Meloni, E., Amato, G., Falchi, F., Gennaro, C.: Learning accurate personal protective equipment detection from virtual worlds. Multimedia Tools Appl. 80, 23241–23253 (2021)

  13. Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: an open urban driving simulator. In: Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16 (2017)

  14. Edsinger, A., Kemp, C.C.: Human-robot interaction for cooperative manipulation: handing objects to one another. In: RO-MAN 2007-The 16th IEEE International Symposium on Robot and Human Interactive Communication, pp. 1167–1172. IEEE (2007)

  15. Fabbri, M., et al.: Motsynth: how can synthetic data help pedestrian detection and tracking? In: ICCV (2021)

  16. Fu, Q., Liu, X., Kitani, K.M.: Sequential voting with relational box fields for active object detection. In: CVPR, pp. 2374–2383 (2022)

  17. Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: International Conference on Machine Learning, pp. 1180–1189. PMLR (2015)

  18. Grauman, K., et al.: Ego4d: around the world in 3,000 hours of egocentric video. In: CVPR, pp. 18995–19012 (2021)

  19. Hasson, Y., et al.: Learning joint reconstruction of hands and manipulated objects. In: CVPR (2019)

  20. Jian, J., Liu, X., Li, M., Hu, R., Liu, J.: Affordpose: a large-scale dataset of hand-object interactions with affordance-driven hand pose. In: ICCV, pp. 14713–14724 (2023)

  21. Kirillov, A., Wu, Y., He, K., Girshick, R.: Pointrend: image segmentation as rendering. In: CVPR, pp. 9799–9808 (2020)

  22. Kolve, E., et al.: Ai2-thor: an interactive 3d environment for visual AI (2017). https://arxiv.org/abs/1712.05474

  23. Kolve, E., et al.: AI2-THOR: an interactive 3D environment for visual AI. arXiv (2017)

  24. Leonardi, R., Ragusa, F., Furnari, A., Farinella, G.M.: Egocentric human-object interaction detection exploiting synthetic data. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds.) ICIAP 2022. LNCS, vol. 13232, pp. 237–248. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-06430-2_20

  25. Li, C., et al.: igibson 2.0: object-centric simulation for robot learning of everyday household tasks. In: Faust, A., Hsu, D., Neumann, G. (eds.) Proceedings of the 5th Conference on Robot Learning. Proceedings of Machine Learning Research, vol. 164, pp. 455–465. PMLR (2022). https://proceedings.mlr.press/v164/li22b.html

  26. Li, Y., Nagarajan, T., Xiong, B., Grauman, K.: Ego-exo: transferring visual representations from third-person to first-person videos. In: CVPR, pp. 6943–6953 (2021)

  27. Li, Y.J., et al.: Cross-domain adaptive teacher for object detection. In: CVPR, pp. 7581–7590 (2022)

  28. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  29. Liu, S., Tripathi, S., Majumdar, S., Wang, X.: Joint hand motion and interaction hotspots prediction from egocentric videos. In: CVPR, pp. 3282–3292 (2022)

  30. Liu, Y.C., et al.: Unbiased teacher for semi-supervised object detection. In: ICLR (2021)

  31. Lu, Y., Mayol-Cuevas, W.W.: Egocentric hand-object interaction detection and application (2021). https://arxiv.org/abs/2109.14734

  32. Lv, Z., Poiesi, F., Dong, Q., Lloret, J., Song, H.: Deep learning for intelligent human-computer interaction. Appl. Sci. 12(22), 11457 (2022)

  33. Savva, M., et al.: Habitat: A platform for embodied AI research. In: ICCV (2019)

  34. Munro, J., Damen, D.: Multi-modal domain adaptation for fine-grained action recognition. In: CVPR, pp. 122–132 (2020)

  35. Munro, J., Wray, M., Larlus, D., Csurka, G., Damen, D.: Domain adaptation in multi-view embedding for cross-modal video retrieval. ArXiv abs/2110.12812 (2021). https://api.semanticscholar.org/CorpusID:239768993

  36. NVIDIA: Nvidia omniverse (2020). https://www.nvidia.com/en-us/omniverse/synthetic-data/

  37. NVIDIA: Nvidia isaac sim (2021). https://developer.nvidia.com/isaac-sim

  38. Orlando, S., Furnari, A., Farinella, G.M.: Egocentric visitor localization and artwork detection in cultural sites using synthetic data. Pattern Recognition Letters - Special Issue on Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage (2020). https://iplab.dmi.unict.it/SimulatedEgocentricNavigations/

  39. Pasqualino, G., Furnari, A., Signorello, G., Farinella, G.M.: An unsupervised domain adaptation scheme for single-stage artwork recognition in cultural sites. Image Vis. Comput. 107, 104098 (2021)

  40. Plizzari, C., Perrett, T., Caputo, B., Damen, D.: What can a cook in Italy teach a mechanic in India? action recognition generalisation over scenarios and locations. In: ICCV2023 (2023)

  41. Quattrocchi, C., Mauro, D.D., Furnari, A., Lopes, A., Moltisanti, M., Farinella, G.M.: Put your PPE on: a tool for synthetic data generation and related benchmark in construction site scenarios. In: International Conference on Computer Vision Theory and Applications, pp. 656–663 (2023)

  42. Ragusa, F., Furnari, A., Livatino, S., Farinella, G.M.: The meccano dataset: understanding human-object interactions from egocentric videos in an industrial-like domain. In: Winter Conference on Applications of Computer Vision, pp. 1569–1578 (2021)

  43. Ragusa, F., et al.: Enigma-51: towards a fine-grained understanding of human behavior in industrial scenarios. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4549–4559 (2024)

  44. Ramakrishnan, S.K., et al.: Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied AI. In: NeurIPS (2021)

  45. Saito, K., Watanabe, K., Ushiku, Y., Harada, T.: Maximum classifier discrepancy for unsupervised domain adaptation. In: CVPR, pp. 3723–3732 (2018)

  46. Savva, M., et al.: Habitat: a platform for embodied AI research. In: ICCV, pp. 9339–9347 (2019)

  47. Sener, F., et al.: Assembly101: a large-scale multi-view video dataset for understanding procedural activities. In: CVPR, pp. 21096–21106 (2022)

  48. shadowrobot: Shadowhand (2005). https://www.shadowrobot.com/dexterous-hand-series/

  49. Shan, D., Geng, J., Shu, M., Fouhey, D.F.: Understanding human hands in contact at internet scale. In: CVPR, pp. 9869–9878 (2020)

  50. Szot, A., et al.: Habitat 2.0: training home assistants to rearrange their habitat. In: Advances in Neural Information Processing Systems, vol. 34, pp. 251–266 (2021)

  51. Tang, Y., Tian, Y., Lu, J., Feng, J., Zhou, J.: Action recognition in RGB-D egocentric videos. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3410–3414. IEEE (2017)

  52. Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS 30 (2017)

  53. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR, pp. 7167–7176 (2017)

  54. Unity: Synthetichumans package (unity computer vision) (2022). https://github.com/Unity-Technologies/com.unity.cv.synthetichumans

  55. Wang, R., et al.: Dexgraspnet: a large-scale robotic dexterous grasp dataset for general objects based on simulation. In: CVPR, pp. 11359–11366 (2023)

  56. Xia, F., Zamir, A.R., He, Z.Y., Sax, A., Malik, J., Savarese, S.: Gibson Env: real-world perception for embodied agents. In: CVPR (2018)

  57. Xia, F., et al.: Interactive Gibson benchmark: a benchmark for interactive navigation in cluttered environments. IEEE Robot. Autom. Lett. 5(2), 713–720 (2020)

  58. Ye, Y., et al.: Affordance diffusion: synthesizing hand-object interactions. In: CVPR, pp. 22479–22489 (2023)

  59. Zhang, L., Zhou, S., Stent, S., Shi, J.: Fine-grained egocentric hand-object segmentation: dataset, model, and applications. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13689, pp. 127–145. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19818-2_8

  60. Zhuang, F., et al.: A comprehensive survey on transfer learning. Proc. IEEE 109(1), 43–76 (2020)

Acknowledgments

This research has been supported by the project Future Artificial Intelligence Research (FAIR) - PNRR MUR Cod. PE0000013 - CUP: E63C22001940006. This research has been partially supported by the project EXTRA-EYE - PRIN 2022 - CUP E53D23008280006 - Funded by the European Union - Next Generation EU.

Author information

Corresponding author

Correspondence to Rosario Leonardi.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 14593 KB)

Supplementary material 2 (mp4 10779 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Leonardi, R., Furnari, A., Ragusa, F., Farinella, G.M. (2025). Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection? In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15129. Springer, Cham. https://doi.org/10.1007/978-3-031-73209-6_3

  • DOI: https://doi.org/10.1007/978-3-031-73209-6_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73208-9

  • Online ISBN: 978-3-031-73209-6

  • eBook Packages: Computer Science, Computer Science (R0)
