Weakly Supervised Attended Object Detection Using Gaze Data as Annotations

  • Conference paper
Image Analysis and Processing – ICIAP 2022 (ICIAP 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13232)


Abstract

We consider the problem of detecting and recognizing the objects observed by visitors (i.e., attended objects) in cultural sites from egocentric vision. A standard approach to the problem involves detecting all objects and selecting the one which best overlaps with the gaze of the visitor, measured through a gaze tracker. Since labeling large amounts of data to train a standard object detector is expensive in terms of costs and time, we propose a weakly supervised version of the task which relies only on gaze data and a frame-level label indicating the class of the attended object. To study the problem, we present a new dataset composed of egocentric videos and gaze coordinates of subjects visiting a museum. We then compare three baselines for weakly supervised attended object detection on the collected data. Results show that the considered approaches achieve satisfactory performance in a weakly supervised manner, which allows for significant time savings with respect to a fully supervised detector based on Faster R-CNN. To encourage research on the topic, we publicly release the code and the dataset at the following URL: https://iplab.dmi.unict.it/WS_OBJ_DET/.
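
The gaze-based selection step described in the abstract (detect all objects, then keep the detection that best overlaps the visitor's gaze) can be illustrated with a short sketch. This is a minimal illustration under assumed conventions, not the authors' released code: the (x1, y1, x2, y2) box format, the fallback to center distance when no box contains the gaze, and the function name select_attended_box are all assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in image coordinates.
Box = Tuple[float, float, float, float]


def select_attended_box(
    boxes: List[Box],
    scores: List[float],
    gaze: Tuple[float, float],
) -> Optional[int]:
    """Return the index of the detection that best matches the gaze point.

    Among the boxes that contain the gaze point, pick the highest-scoring
    one; if no box contains it, fall back to the box whose center is
    closest to the gaze (an assumed tie-breaking policy, not the paper's).
    """
    gx, gy = gaze
    containing = [
        i for i, (x1, y1, x2, y2) in enumerate(boxes)
        if x1 <= gx <= x2 and y1 <= gy <= y2
    ]
    if containing:
        return max(containing, key=lambda i: scores[i])
    if not boxes:
        return None  # no detections in this frame

    def center_dist(i: int) -> float:
        # Squared distance from the box center to the gaze point.
        x1, y1, x2, y2 = boxes[i]
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        return (cx - gx) ** 2 + (cy - gy) ** 2

    return min(range(len(boxes)), key=center_dist)
```

In the weakly supervised setting the paper studies, a region selected this way would be paired only with the frame-level class label of the attended object, rather than with manually annotated bounding boxes.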


Notes

  1. https://www.microsoft.com/it-it/hololens.

  2. See supplementary material for more details.

Acknowledgements

This research has been supported by Next Vision (https://www.nextvisionlab.it/) s.r.l., by the project VALUE (N. 08CT6209090207 - CUP G69J18001060007) - PO FESR 2014/2020 - Azione 1.1.5., and by Research Program Pia.ce.ri. 2020/2022 Linea 2 - University of Catania.

Author information

Correspondence to Michele Mazzamuto.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 16849 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mazzamuto, M., Ragusa, F., Furnari, A., Signorello, G., Farinella, G.M. (2022). Weakly Supervised Attended Object Detection Using Gaze Data as Annotations. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13232. Springer, Cham. https://doi.org/10.1007/978-3-031-06430-2_22

  • DOI: https://doi.org/10.1007/978-3-031-06430-2_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06429-6

  • Online ISBN: 978-3-031-06430-2

  • eBook Packages: Computer Science, Computer Science (R0)
