Abstract
Predicting human visual attention can not only deepen our understanding of the underlying biological mechanisms, but also bring new insights to other computer vision tasks such as autonomous driving and human–computer interaction. Current deep learning-based methods often emphasize high-level semantic features for prediction. However, high-level semantic features lack fine-scale spatial information. Ideally, a saliency prediction model should combine both spatial and semantic features. In this paper, we propose a multi-scale attention gated network (referred to as MSAGNet) that fuses semantic features of different spatial resolutions for visual saliency prediction. Specifically, we adopt the high-resolution net (HRNet) as the backbone to extract multi-scale semantic features. A multi-scale attention gating module is designed to adaptively fuse these multi-scale features in a hierarchical way. Unlike the conventional approach of concatenating features from multiple layers or multi-scale inputs, this module computes a spatial attention map from the high-level semantic feature and fuses it with the low-level spatial feature through a gating operation. Through this hierarchical gated fusion, the final saliency prediction is produced at the finest scale. Extensive experiments on three benchmark datasets demonstrate the superior performance of the proposed method.
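To make the gating idea concrete, the following is a minimal NumPy sketch of one fusion step: a spatial attention map is derived from a coarse, semantically rich feature map and used to gate a fine-scale feature map. This is an illustrative assumption, not the authors' implementation; the channel-collapsing weights `w` (standing in for a learned 1×1 convolution) and the nearest-neighbour upsampling are simplifications.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(low_feat, high_feat, w):
    """Gate a fine-scale feature map with an attention map computed
    from a coarser, semantically richer feature map.

    low_feat:  (C, H, W)     fine-scale spatial features
    high_feat: (C, H/2, W/2) coarse high-level semantic features
    w:         (C,)          hypothetical 1x1-conv weights collapsing channels
    """
    # Collapse channels (a stand-in for a learned 1x1 conv), then squash
    # to (0, 1) with a sigmoid to obtain a spatial attention map.
    attn = sigmoid(np.tensordot(w, high_feat, axes=([0], [0])))  # (H/2, W/2)

    # Nearest-neighbour upsampling to the fine resolution.
    attn_up = attn.repeat(2, axis=0).repeat(2, axis=1)           # (H, W)

    # Gating: element-wise reweighting of the fine-scale features.
    return low_feat * attn_up[None, :, :]
```

In a hierarchical scheme such as the one described in the abstract, this step would be applied repeatedly from the coarsest pair of scales up to the finest, with the gated output of one level serving as the low-level input of the next.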
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China under Grants U2001211, 61672292, and 61532009, and partially supported by the Basic Research Program of Jiangsu Province (Frontier Leading Technology) under Grant BK20192004.
Additional information
Communicated by B.-K. Bao.
Cite this article
Sun, Y., Zhao, M., Hu, K. et al. Visual saliency prediction using multi-scale attention gated network. Multimedia Systems 28, 131–139 (2022). https://doi.org/10.1007/s00530-021-00796-4