Egocentric Human-Object Interaction Detection Exploiting Synthetic Data

  • Conference paper
  • First Online:
Image Analysis and Processing – ICIAP 2022 (ICIAP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13232)

Abstract

We consider the problem of detecting Egocentric Human-Object Interactions (EHOIs) in industrial contexts. Since collecting and labeling large amounts of real images is challenging, we propose a pipeline and a tool to generate photo-realistic synthetic First Person Vision (FPV) images automatically labeled for EHOI detection in a specific industrial scenario. To tackle the problem of EHOI detection, we propose a method that detects the hands and the objects in the scene, and determines which objects are currently involved in an interaction. We compare the performance of our method with a set of state-of-the-art baselines. Results show that using a synthetic dataset improves the performance of an EHOI detection system, especially when few real data are available. To encourage research on this topic, we publicly release the proposed dataset at the following URL: https://iplab.dmi.unict.it/EHOI_SYNTH/.
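
The abstract describes the approach only at a high level: detect the hands, detect the objects, and decide which objects are involved in an interaction. Purely as an illustrative sketch, and not the authors' actual model, the final association step could be approximated by linking each detected hand to overlapping object boxes; the `Detection` container, the IoU threshold, and the `associate_active_objects` helper below are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Hypothetical container for one detector output."""
    box: List[float]  # [x1, y1, x2, y2] in pixels
    score: float      # detector confidence
    label: str        # class name, e.g. "hand" or an object category


def iou(a: List[float], b: List[float]) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate_active_objects(hands: List[Detection],
                             objects: List[Detection],
                             iou_thr: float = 0.1) -> List[Detection]:
    """Mark as 'active' every object whose box overlaps a detected hand.

    This fixed-overlap rule is only a stand-in for whatever learned
    association the actual method uses.
    """
    return [obj for obj in objects
            if any(iou(hand.box, obj.box) > iou_thr for hand in hands)]


if __name__ == "__main__":
    hands = [Detection([100, 120, 180, 200], 0.95, "hand")]
    objects = [Detection([150, 150, 260, 240], 0.88, "screwdriver"),
               Detection([400, 60, 480, 140], 0.91, "socket")]
    print([o.label for o in associate_active_objects(hands, objects)])  # ['screwdriver']
```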

Notes

  1. Ego4D Website: https://ego4d-data.org/.

  2. See supplementary material for more details.

  3. https://www.artec3d.com/portable-3d-scanners/artec-eva-v2.

  4. https://matterport.com/.

  5. We used the following implementation: https://github.com/cocodataset/cocoapi (a minimal evaluation sketch is given after this list).

  6. YOLOv5: https://github.com/ultralytics/yolov5.
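
Note 5 above points to the official COCO API, which is the standard way to compute detection mAP from COCO-format annotations. The sketch below shows how that implementation is typically driven for box mAP; the file names `gt_annotations.json` and `detections.json` are placeholders, not files released with the paper.

```python
# Minimal pycocotools evaluation sketch. The two JSON file names are
# placeholders: ground truth must be in COCO annotation format and the
# detections in the COCO results format.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("gt_annotations.json")         # ground-truth boxes
coco_dt = coco_gt.loadRes("detections.json")  # detector outputs

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75 and size-stratified AP/AR
```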

Acknowledgements

This research has been supported by Next Vision (https://www.nextvisionlab.it/) s.r.l., by the project MISE - PON I&C 2014–2020 - Progetto ENIGMA - Prog n. F/190050/02/X44 - CUP: B61B19000520008, and by Research Program Pia.ce.ri. 2020/2022 Linea 2 - University of Catania.

Author information

Corresponding author

Correspondence to Rosario Leonardi.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 11,758 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Leonardi, R., Ragusa, F., Furnari, A., Farinella, G.M. (2022). Egocentric Human-Object Interaction Detection Exploiting Synthetic Data. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13232. Springer, Cham. https://doi.org/10.1007/978-3-031-06430-2_20

  • DOI: https://doi.org/10.1007/978-3-031-06430-2_20

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06429-6

  • Online ISBN: 978-3-031-06430-2

  • eBook Packages: Computer Science, Computer Science (R0)
