Abstract
Existing recognition methods based on deep learning have achieved impressive performance. However, most of these algorithms do not fully utilize contextual information or discriminative parts, which limits recognition performance. In this paper, we propose a context-aware attention network that imitates the human visual attention mechanism. The proposed network consists mainly of a context learning module and an attention transfer module. First, we design the context learning module, which propagates contextual information along four directions (left, right, top, and bottom) to capture valuable contexts. Second, the attention transfer module is proposed to generate attention maps that highlight different attention regions, which benefits the extraction of discriminative features. Specifically, the attention maps are generated through multiple glimpses: in each glimpse, we generate the corresponding attention map and feed it into the next glimpse. Our attention therefore shifts constantly, and the shift is not random but closely tied to the previous attention. Finally, we aggregate all located attention regions to achieve accurate image recognition. Experimental results show that our method achieves state-of-the-art performance, with accuracies of 97.68%, 82.42%, 80.32%, and 86.12% on CIFAR-10, CIFAR-100, Caltech-256, and CUB-200, respectively.
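To make the two modules concrete, below is a minimal PyTorch sketch of how they could be wired. All names (ContextLearningModule, AttentionTransferModule), layer choices (GRUs for the directional sweeps, a 1x1 convolution for the attention head), and hyperparameters are illustrative assumptions of ours; the paper does not publish reference code, so treat this as a sketch of the idea rather than the authors' architecture.

```python
# Illustrative sketch only; module names, sizes, and wiring are assumptions,
# not the paper's published implementation.
import torch
import torch.nn as nn


class ContextLearningModule(nn.Module):
    """Propagates contextual information across the feature map in four
    directions (left-to-right, right-to-left, top-to-bottom, bottom-to-top)
    via bidirectional row/column GRUs, then fuses the context maps."""

    def __init__(self, channels, hidden=64):
        super().__init__()
        self.row_rnn = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.col_rnn = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.fuse = nn.Conv2d(4 * hidden, channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Horizontal sweeps: each row is a sequence of length W.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        row_ctx, _ = self.row_rnn(rows)        # (B*H, W, 2*hidden)
        row_ctx = row_ctx.reshape(b, h, w, -1).permute(0, 3, 1, 2)
        # Vertical sweeps: each column is a sequence of length H.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        col_ctx, _ = self.col_rnn(cols)        # (B*W, H, 2*hidden)
        col_ctx = col_ctx.reshape(b, w, h, -1).permute(0, 3, 2, 1)
        return self.fuse(torch.cat([row_ctx, col_ctx], dim=1))


class AttentionTransferModule(nn.Module):
    """Generates a sequence of attention maps ("glimpses"); each glimpse is
    conditioned on the previous attention map, so attention shifts rather
    than being sampled independently at each step."""

    def __init__(self, channels, num_glimpses=3):
        super().__init__()
        self.num_glimpses = num_glimpses
        self.attend = nn.Conv2d(channels + 1, 1, kernel_size=1)

    def forward(self, feats):                  # feats: (B, C, H, W)
        b, _, h, w = feats.shape
        prev = feats.new_zeros(b, 1, h, w)     # no attention before glimpse 1
        glimpses = []
        for _ in range(self.num_glimpses):
            att = torch.sigmoid(self.attend(torch.cat([feats, prev], dim=1)))
            glimpses.append((feats * att).mean(dim=(2, 3)))  # pooled region feature
            prev = att                         # next glimpse sees the last attention
        # Aggregate all located attention regions for the final prediction.
        return torch.stack(glimpses, dim=1).mean(dim=1)      # (B, C)
```

In a full model, these modules would presumably sit between a convolutional backbone (e.g., VGG or ResNet features) and a linear classifier over the aggregated glimpse features.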
Acknowledgements
This project was partially supported by grants from the National Natural Science Foundation of China (71671178 and 91546201). It was also supported by University of Chinese Academy of Sciences Project Y954016XX2 and by Guangdong Provincial Science and Technology Project 2016B010127004.
Ethics declarations
Conflict of interest
We declare that we have no commercial or associative interest that represents a conflict of interest in connection with the submitted work.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Leng, J., Liu, Y. & Chen, S. Context-aware attention network for image recognition. Neural Comput & Applic 31, 9295–9305 (2019). https://doi.org/10.1007/s00521-019-04281-y