Abstract
We consider the problem of detecting and recognizing the objects observed by visitors (i.e., attended objects) in cultural sites from egocentric vision. A standard approach to the problem involves detecting all objects and selecting the one which best overlaps with the gaze of the visitor, measured through a gaze tracker. Since labeling large amounts of data to train a standard object detector is expensive in terms of costs and time, we propose a weakly supervised version of the task which relies only on gaze data and a frame-level label indicating the class of the attended object. To study the problem, we present a new dataset composed of egocentric videos and gaze coordinates of subjects visiting a museum. We then compare three different baselines for weakly supervised attended object detection on the collected data. Results show that the considered approaches achieve satisfactory performance in a weakly supervised manner, which allows for significant time savings with respect to a fully supervised detector based on Faster R-CNN. To encourage research on the topic, we publicly release the code and the dataset at the following url: https://iplab.dmi.unict.it/WS_OBJ_DET/.
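The standard approach mentioned above (detect all objects, then pick the one best overlapping the gaze point) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Detection` container, the score threshold, and the smallest-box tie-breaking rule are all assumptions made for illustration; in practice the detections would come from a detector such as Faster R-CNN and the gaze point from the gaze tracker.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box: tuple   # (x1, y1, x2, y2) in image pixels
    score: float # detector confidence in [0, 1]

def attended_object(detections, gaze, min_score=0.5):
    """Return the detection whose box contains the gaze point.

    When several boxes contain the gaze (e.g., nested objects),
    the smallest box is preferred, on the assumption that the most
    specific object is the one being attended. Returns None if the
    gaze falls outside every sufficiently confident box.
    """
    gx, gy = gaze
    hits = [
        d for d in detections
        if d.score >= min_score
        and d.box[0] <= gx <= d.box[2]
        and d.box[1] <= gy <= d.box[3]
    ]
    if not hits:
        return None
    area = lambda d: (d.box[2] - d.box[0]) * (d.box[3] - d.box[1])
    return min(hits, key=area)
```

For example, with a large "statue" box and a smaller "plaque" box nested inside it, a gaze point inside both would select the plaque; a gaze point outside every box yields `None`, i.e., no attended object in that frame.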
Notes
See supplementary material for more details.
Acknowledgements
This research has been supported by Next Vision (https://www.nextvisionlab.it/) s.r.l., by the project VALUE (N. 08CT6209090207 - CUP G69J18001060007) - PO FESR 2014/2020 - Azione 1.1.5., and by Research Program Pia.ce.ri. 2020/2022 Linea 2 - University of Catania.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mazzamuto, M., Ragusa, F., Furnari, A., Signorello, G., Farinella, G.M. (2022). Weakly Supervised Attended Object Detection Using Gaze Data as Annotations. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13232. Springer, Cham. https://doi.org/10.1007/978-3-031-06430-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06429-6
Online ISBN: 978-3-031-06430-2
eBook Packages: Computer Science (R0)