Weakly Supervised Attended Object Detection Using Gaze Data as Annotations

  • Conference paper
Image Analysis and Processing – ICIAP 2022 (ICIAP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13232)


Abstract

We consider the problem of detecting and recognizing the objects observed by visitors (i.e., attended objects) in cultural sites from egocentric vision. A standard approach to the problem involves detecting all objects and selecting the one which best overlaps with the gaze of the visitor, measured through a gaze tracker. Since labeling large amounts of data to train a standard object detector is expensive in terms of costs and time, we propose a weakly supervised version of the task which relies only on gaze data and a frame-level label indicating the class of the attended object. To study the problem, we present a new dataset composed of egocentric videos and gaze coordinates of subjects visiting a museum. We then compare three baselines for weakly supervised attended object detection on the collected data. Results show that the considered approaches achieve satisfactory performance in a weakly supervised manner, which allows for significant time savings with respect to a fully supervised detector based on Faster R-CNN. To encourage research on the topic, we publicly release the code and the dataset at the following URL: https://iplab.dmi.unict.it/WS_OBJ_DET/.
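
The gaze-based selection step described in the abstract (detect all objects, then keep the detection that best overlaps the visitor's gaze) can be illustrated with a short sketch. This is a minimal illustration under assumed conventions, not the authors' released code: the (x1, y1, x2, y2) box format, the fallback to center distance when no box contains the gaze, and the function name select_attended_box are all assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in image coordinates.
Box = Tuple[float, float, float, float]


def select_attended_box(
    boxes: List[Box],
    scores: List[float],
    gaze: Tuple[float, float],
) -> Optional[int]:
    """Return the index of the detection that best matches the gaze point.

    Among the boxes that contain the gaze point, pick the highest-scoring
    one; if no box contains it, fall back to the box whose center is
    closest to the gaze (an assumed tie-breaking policy, not the paper's).
    """
    gx, gy = gaze
    containing = [
        i for i, (x1, y1, x2, y2) in enumerate(boxes)
        if x1 <= gx <= x2 and y1 <= gy <= y2
    ]
    if containing:
        return max(containing, key=lambda i: scores[i])
    if not boxes:
        return None  # no detections in this frame

    def center_dist(i: int) -> float:
        # Squared distance from the box center to the gaze point.
        x1, y1, x2, y2 = boxes[i]
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        return (cx - gx) ** 2 + (cy - gy) ** 2

    return min(range(len(boxes)), key=center_dist)
```

In the weakly supervised setting the paper studies, a region selected this way would be paired only with the frame-level class label of the attended object, rather than with manually annotated bounding boxes.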


Notes

  1. https://www.microsoft.com/it-it/hololens.

  2. See supplementary material for more details.

Acknowledgements

This research has been supported by Next Vision (https://www.nextvisionlab.it/) s.r.l., by the project VALUE (N. 08CT6209090207 - CUP G69J18001060007) - PO FESR 2014/2020 - Azione 1.1.5., and by Research Program Pia.ce.ri. 2020/2022 Linea 2 - University of Catania.

Author information

Correspondence to Michele Mazzamuto.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 16849 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mazzamuto, M., Ragusa, F., Furnari, A., Signorello, G., Farinella, G.M. (2022). Weakly Supervised Attended Object Detection Using Gaze Data as Annotations. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13232. Springer, Cham. https://doi.org/10.1007/978-3-031-06430-2_22

  • DOI: https://doi.org/10.1007/978-3-031-06430-2_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06429-6

  • Online ISBN: 978-3-031-06430-2

  • eBook Packages: Computer Science, Computer Science (R0)
