Abstract
We propose an action estimation pipeline based on the simultaneous recognition of hands and objects in the scene from an egocentric perspective. A review of recent approaches led us to conclude that hands are a key element from this point of view: an action consists of the interactions of the hands with the different objects in the scene. Therefore, the 2D positions of the detected hands and objects are used to determine which object is most likely being manipulated. The architecture chosen for this task is YOLO, whose inference speed allows actions to be estimated fluently while maintaining good accuracy on the detected objects and hands. After reviewing the available datasets and synthetic data generators for hand and object detection, several experiments were conducted, and the configuration with the best results under the Pascal VOC metric was used in the proposed pipeline.
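As a minimal, hypothetical illustration of how 2D detections could be combined in such a pipeline, the sketch below picks the object whose bounding-box center lies closest to a detected hand center. The box format, the Euclidean-distance criterion and the function names are all our own assumptions for illustration; the paper's actual selection rule may differ.

# Sketch (not the authors' exact method): given 2D detections of hands and
# objects, pick the object most likely being interacted with. Boxes are
# assumed as (x_min, y_min, x_max, y_max), e.g. as produced by a YOLO detector.

def box_center(box):
    # Return the (x, y) center of an (x_min, y_min, x_max, y_max) box.
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def likely_interacted_object(hand_boxes, object_detections):
    # hand_boxes:        list of boxes for detected hands
    # object_detections: list of (label, box) pairs for detected objects
    # Returns the label of the object nearest to any hand, or None.
    best_label, best_dist = None, float("inf")
    for hand in hand_boxes:
        hx, hy = box_center(hand)
        for label, box in object_detections:
            ox, oy = box_center(box)
            dist = ((hx - ox) ** 2 + (hy - oy) ** 2) ** 0.5  # pixel distance
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label

# Example: a hand next to a cup, with a knife far away -> "cup"
hands = [(100, 120, 160, 200)]
objects = [("cup", (150, 130, 210, 190)), ("knife", (400, 50, 460, 90))]
print(likely_interacted_object(hands, objects))  # cup

A per-frame rule like this is deliberately simple; in practice one might instead score hand-object box overlap or smooth the decision over several frames to avoid flicker, but those choices are likewise assumptions rather than details given in the abstract.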
Acknowledgements
This work has been funded by the Spanish Government grant PID2019-104818RB-I00 for the MoDeaAS project, supported with FEDER funds. It has also been supported by the Spanish national grants for PhD studies FPU17/00166, ACIF/2018/197 and UAFPU2019-13. Experiments were made possible by a generous hardware donation from NVIDIA.