Visual search target inference subsumes methods for predicting the target object through eye tracking. A person intends to find an object in a visual scene, and we predict that object from the person's fixation behavior. Knowing the search target can improve intelligent user interaction. In this work, we implement a new feature encoding, the Bag of Deep Visual Words, for search target inference using a pre-trained convolutional neural network (CNN). Our work is based on a recent approach from the literature that uses Bag of Visual Words, an encoding common in computer vision applications. We evaluate our method using a gold standard dataset.
The results show that our new feature encoding outperforms the baseline from the literature, in particular when fixations on the target are excluded.
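The general bag-of-words idea behind such an encoding can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: random vectors stand in for CNN activations of fixated image patches, the codebook is learned with a plain k-means, and all names (`fake_cnn_features`, `build_codebook`, `encode_fixations`), the feature dimension, and the codebook size are hypothetical choices made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(vector, codebook):
    """Index of the codebook entry (visual word) closest to the vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def build_codebook(features, k, iterations=10, seed=0):
    """Plain k-means (Lloyd's algorithm); returns k centroids as the codebook."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(features, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for vector in features:
            clusters[nearest_word(vector, centroids)].append(vector)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[i] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids


def encode_fixations(fixation_features, codebook):
    """Normalized histogram of visual words over the fixated patches."""
    histogram = [0.0] * len(codebook)
    for vector in fixation_features:
        histogram[nearest_word(vector, codebook)] += 1.0
    total = sum(histogram)
    return [count / total for count in histogram] if total else histogram


def fake_cnn_features(n, dim, center, rng):
    """Hypothetical stand-in for CNN activations of n image patches."""
    return [[rng.gauss(center, 0.1) for _ in range(dim)] for _ in range(n)]


# Assumption: two well-separated groups of synthetic "deep" features.
rng = random.Random(42)
training = fake_cnn_features(30, 8, 0.0, rng) + fake_cnn_features(30, 8, 5.0, rng)
codebook = build_codebook(training, k=2)

# One search trial: six fixated patches, all drawn from the second group.
trial = fake_cnn_features(6, 8, 5.0, rng)
histogram = encode_fixations(trial, codebook)
print(histogram)
```

In this toy setup the six fixated patches should all map to the same visual word, so the histogram concentrates on one bin. In a real system the histogram would serve as the input feature vector for a classifier that predicts the search target, and the synthetic vectors would be replaced by activations from a pre-trained CNN.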
1. The Amazon book cover dataset from Sattar et al. [15].
Akkil, D., Isokoski, P.: Gaze augmentation in egocentric video improves awareness of intention. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 1573–1584. ACM Press (2016). http://dl.acm.org/citation.cfm?doid=2858036.2858127
Bader, T., Beyerer, J.: Natural gaze behavior as input modality for human-computer interaction. In: Nakano, Y., Conati, C., Bader, T. (eds.) Eye Gaze in Intelligent User Interfaces, pp. 161–183. Springer, London (2013). https://doi.org/10.1007/978-1-4471-4784-8_9
Borji, A., Lennartz, A., Pomplun, M.: What do eyes reveal about the mind? Algorithmic inference of search targets from fixations. Neurocomputing 149(PB), 788–799 (2015). https://doi.org/10.1016/j.neucom.2014.07.055
DeAngelus, M., Pelz, J.B.: Top-down control of eye movements: Yarbus revisited. Vis. Cognit. 17(6–7), 790–811 (2009). https://doi.org/10.1080/13506280902793843
Donahue, J., et al.: DeCAF: A deep convolutional activation feature for generic visual recognition. In: ICML, vol. 32, pp. 647–655 (2014). http://arxiv.org/abs/1310.1531
Flanagan, J.R., Johansson, R.S.: Action plans used in action observation. Nature 424(6950), 769–771 (2003). http://www.nature.com/doifinder/10.1038/nature01861
Goldberg, Y.: Neural network methods for natural language processing. Synth. Lect. Hum. Lang. Technol. 10(1), 1–309 (2017)
Gredebäck, G., Falck-Ytter, T.: Eye movements during action observation. Perspect. Psychol. Sci. 10(5), 591–598 (2015). http://pps.sagepub.com/lookup/doi/10.1177/1745691615589103
Huang, C.M., Mutlu, B.: Anticipatory robot control for efficient human-robot collaboration. In: 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 83–90. IEEE, March 2016. https://doi.org/10.1109/HRI.2016.7451737, http://ieeexplore.ieee.org/document/7451737/
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS 2012, pp. 1097–1105. Curran Associates Inc., USA (2012). http://dl.acm.org/citation.cfm?id=2999134.2999257
Lowe, D.: Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999). https://doi.org/10.1109/ICCV.1999.790410, http://ieeexplore.ieee.org/document/790410/
Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 512–519 (2014). https://doi.org/10.1109/CVPRW.2014.131, http://arxiv.org/abs/1403.6382
Rotman, G., Troje, N.F., Johansson, R.S., Flanagan, J.R.: Eye movements when observing predictable and unpredictable actions. J. Neurophysiol. 96(3), 1358–1369 (2006). https://doi.org/10.1152/jn.00227.2006. http://www.ncbi.nlm.nih.gov/pubmed/16687620
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Sattar, H., Müller, S., Fritz, M., Bulling, A.: Prediction of search targets from fixations in open-world settings. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 981–990, June 2015. https://doi.org/10.1109/CVPR.2015.7298700
Sattar, H., Bulling, A., Fritz, M.: Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling (2016). http://arxiv.org/abs/1611.10162
Sonntag, D.: Kognit: intelligent cognitive enhancement technology by cognitive models and mixed reality for dementia patients. In: AAAI Fall Symposium Series (2015). https://www.aaai.org/ocs/index.php/FSS/FSS15/paper/view/11702
Sonntag, D.: Intelligent user interfaces - A tutorial. CoRR abs/1702.05250 (2017). http://arxiv.org/abs/1702.05250
Toyama, T., Sonntag, D.: Towards episodic memory support for dementia patients by recognizing objects, faces and text in eye gaze. In: Hölldobler, S., Krötzsch, M., Peñaloza, R., Rudolph, S. (eds.) KI 2015. LNCS (LNAI), vol. 9324, pp. 316–323. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24489-1_29
Wolfe, J.M.: Guided search 2.0 a revised model of visual search. Psychon. Bull. Rev. 1(2), 202–238 (1994). https://doi.org/10.3758/BF03200774
Yang, J., Jiang, Y.G., Hauptmann, A.G., Ngo, C.W.: Evaluating bag-of-visual-words representations in scene classification. In: Proceedings of the International Workshop on Workshop on Multimedia Information Retrieval, MIR 2007, pp. 197–206. ACM, New York (2007). http://doi.acm.org/10.1145/1290082.1290111
Yarbus, A.L.: Eye movements and vision. Neuropsychologia 6(4), 222 (1967). https://doi.org/10.1016/0028-3932(68)90012-2
Zelinsky, G.J., Peng, Y., Samaras, D.: Eye can read your mind: decoding gaze fixations to reveal categorical search targets. J. Vis. 13(14), 10 (2013). https://doi.org/10.1167/13.14.10. http://www.ncbi.nlm.nih.gov/pubmed/24338446
This work was funded by the Federal Ministry of Education and Research (BMBF) under grant number 16SV7768 in the Interakt project.
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Stauden, S., Barz, M., Sonntag, D. (2018). Visual Search Target Inference Using Bag of Deep Visual Words. In: Trollmann, F., Turhan, AY. (eds) KI 2018: Advances in Artificial Intelligence. KI 2018. Lecture Notes in Computer Science(), vol 11117. Springer, Cham. https://doi.org/10.1007/978-3-030-00111-7_25
DOI: https://doi.org/10.1007/978-3-030-00111-7_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00110-0
Online ISBN: 978-3-030-00111-7
eBook Packages: Computer Science (R0)