
Interaction Estimation in Egocentric Videos via Simultaneous Hand-Object Recognition

  • Conference paper
  • 16th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2021)

Abstract

We propose an action estimation pipeline based on the simultaneous recognition of hands and objects in the scene from an egocentric perspective. A review of recent approaches leads us to conclude that the hands are a key element from this point of view. Since an action consists of the interactions of the hands with the different objects in the scene, the 2D positions of hands and objects are used to determine the object that is most likely being manipulated. The architecture used to achieve this goal is YOLO, as its prediction speed allows us to estimate actions fluently with good accuracy on the detected objects and hands. After reviewing the available datasets and generators for hand and object detection, different experiments have been conducted, and the best results, as determined by the Pascal VOC metric, have been used in the proposed pipeline.
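Although the abstract does not include code, the interaction-estimation step it describes can be illustrated with a short sketch. The snippet below is a hypothetical, minimal example (the names `Detection` and `estimate_interaction` are our own illustration, not the authors' implementation): given YOLO-style 2D detections for hands and objects, it pairs each hand with the object whose bounding box overlaps it the most, falling back to the smallest center distance, as a proxy for the object most likely being interacted with.

```python
# Minimal sketch of hand-object interaction estimation from 2D detections.
# All names are illustrative; this is not the paper's implementation.
from dataclasses import dataclass
from typing import List, Optional, Tuple
import math


@dataclass
class Detection:
    label: str                              # e.g. "hand", "cup", "knife"
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    score: float                            # detector confidence

    @property
    def center(self) -> Tuple[float, float]:
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def box_iou(a: Tuple[float, float, float, float],
            b: Tuple[float, float, float, float]) -> float:
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def estimate_interaction(detections: List[Detection]) -> Optional[Tuple[Detection, Detection]]:
    """Return the (hand, object) pair most likely to be interacting.

    Heuristic: prefer the object with the largest overlap with a hand box;
    if nothing overlaps, choose the smallest hand-object center distance.
    """
    hands = [d for d in detections if d.label == "hand"]
    objects = [d for d in detections if d.label != "hand"]
    if not hands or not objects:
        return None

    best_pair, best_key = None, None
    for hand in hands:
        for obj in objects:
            overlap = box_iou(hand.box, obj.box)
            dist = math.dist(hand.center, obj.center)
            key = (-overlap, dist)  # higher overlap wins, then shorter distance
            if best_key is None or key < best_key:
                best_key, best_pair = key, (hand, obj)
    return best_pair


if __name__ == "__main__":
    frame = [
        Detection("hand", (100, 200, 180, 280), 0.93),
        Detection("cup", (160, 210, 230, 290), 0.88),
        Detection("plate", (400, 300, 520, 380), 0.91),
    ]
    pair = estimate_interaction(frame)
    if pair:
        hand, obj = pair
        print(f"Likely interaction: hand -> {obj.label}")  # hand -> cup
```

In a full pipeline this heuristic would run per frame on the detector output, so its cost is negligible compared to the YOLO forward pass.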


Notes

  1. https://github.com/cansik/yolo-hand-detection
  2. https://github.com/3dperceptionlab/tfg_mbenavent


Acknowledgement

This work has been funded by the Spanish Government grant PID2019-104818RB-I00 for the MoDeaAS project, supported with FEDER funds. This work has also been supported by the Spanish national grants for PhD studies FPU17/00166, ACIF/2018/197 and UAFPU2019-13. Experiments were made possible by a generous hardware donation from NVIDIA.

Author information

Correspondence to Jose Garcia-Rodriguez.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Benavent-Lledó, M., Oprea, S., Castro-Vargas, J.A., Martinez-Gonzalez, P., Garcia-Rodriguez, J. (2022). Interaction Estimation in Egocentric Videos via Simultaneous Hand-Object Recognition. In: Sanjurjo González, H., Pastor López, I., García Bringas, P., Quintián, H., Corchado, E. (eds) 16th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2021). SOCO 2021. Advances in Intelligent Systems and Computing, vol 1401. Springer, Cham. https://doi.org/10.1007/978-3-030-87869-6_42
