
Interaction Estimation in Egocentric Videos via Simultaneous Hand-Object Recognition

  • Conference paper
  • 16th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2021)

Abstract

We propose an action estimation pipeline based on the simultaneous recognition of hands and objects in the scene from an egocentric perspective. A review of recent approaches leads us to conclude that the hands are a key element from this point of view. Since an action consists of the interactions of the hands with the different objects in the scene, the 2D positions of hands and objects are used to determine the object that is most likely being manipulated. The architecture used to achieve this goal is YOLO, as its prediction speed allows us to estimate actions fluently with good accuracy on the detected objects and hands. After reviewing the available datasets and generators for hand and object detection, different experiments have been conducted, and the best results, as determined by the Pascal VOC metric, have been used in the proposed pipeline.
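Although the abstract does not include code, the interaction-estimation step it describes can be illustrated with a short sketch. The snippet below is a hypothetical, minimal example (the names `Detection` and `estimate_interaction` are our own illustration, not the authors' implementation): given YOLO-style 2D detections for hands and objects, it pairs each hand with the object whose bounding box overlaps it the most, falling back to the smallest center distance, as a proxy for the object most likely being interacted with.

```python
# Minimal sketch of hand-object interaction estimation from 2D detections.
# All names are illustrative; this is not the paper's implementation.
from dataclasses import dataclass
from typing import List, Optional, Tuple
import math


@dataclass
class Detection:
    label: str                              # e.g. "hand", "cup", "knife"
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    score: float                            # detector confidence

    @property
    def center(self) -> Tuple[float, float]:
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def box_iou(a: Tuple[float, float, float, float],
            b: Tuple[float, float, float, float]) -> float:
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def estimate_interaction(detections: List[Detection]) -> Optional[Tuple[Detection, Detection]]:
    """Return the (hand, object) pair most likely to be interacting.

    Heuristic: prefer the object with the largest overlap with a hand box;
    if nothing overlaps, choose the smallest hand-object center distance.
    """
    hands = [d for d in detections if d.label == "hand"]
    objects = [d for d in detections if d.label != "hand"]
    if not hands or not objects:
        return None

    best_pair, best_key = None, None
    for hand in hands:
        for obj in objects:
            overlap = box_iou(hand.box, obj.box)
            dist = math.dist(hand.center, obj.center)
            key = (-overlap, dist)  # higher overlap wins, then shorter distance
            if best_key is None or key < best_key:
                best_key, best_pair = key, (hand, obj)
    return best_pair


if __name__ == "__main__":
    frame = [
        Detection("hand", (100, 200, 180, 280), 0.93),
        Detection("cup", (160, 210, 230, 290), 0.88),
        Detection("plate", (400, 300, 520, 380), 0.91),
    ]
    pair = estimate_interaction(frame)
    if pair:
        hand, obj = pair
        print(f"Likely interaction: hand -> {obj.label}")  # hand -> cup
```

In a full pipeline this heuristic would run per frame on the detector output, so its cost is negligible compared to the YOLO forward pass.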


Notes

  1. https://github.com/cansik/yolo-hand-detection
  2. https://github.com/3dperceptionlab/tfg_mbenavent


Acknowledgement

This work has been funded by the Spanish Government grant PID2019-104818RB-I00 for the MoDeaAS project, supported with FEDER funds. This work has also been supported by the Spanish national grants for PhD studies FPU17/00166, ACIF/2018/197 and UAFPU2019-13. Experiments were made possible by a generous hardware donation from NVIDIA.

Author information

Correspondence to Jose Garcia-Rodriguez.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Benavent-Lledó, M., Oprea, S., Castro-Vargas, J.A., Martinez-Gonzalez, P., Garcia-Rodriguez, J. (2022). Interaction Estimation in Egocentric Videos via Simultaneous Hand-Object Recognition. In: Sanjurjo González, H., Pastor López, I., García Bringas, P., Quintián, H., Corchado, E. (eds) 16th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2021). SOCO 2021. Advances in Intelligent Systems and Computing, vol 1401. Springer, Cham. https://doi.org/10.1007/978-3-030-87869-6_42
