
A Simple and Light-Weight Attention Module for Convolutional Neural Networks

Published in: International Journal of Computer Vision

Abstract

Many aspects of deep neural networks, such as depth, width, or cardinality, have been studied to strengthen their representational power. In this work, we study the effect of attention in convolutional neural networks and present our idea in a simple, self-contained module called the Bottleneck Attention Module (BAM). Given an intermediate feature map, BAM efficiently produces an attention map along two factorized axes, channel and spatial, with negligible overhead. BAM is placed at the bottlenecks of various models, where the downsampling of feature maps occurs, and is jointly trained in an end-to-end manner. Ablation studies and extensive experiments are conducted on CIFAR-100/ImageNet classification, VOC 2007/MS COCO detection, super-resolution, and scene parsing with various architectures, including mobile-oriented networks. BAM shows consistent improvements across all experiments, demonstrating its wide applicability. The code and models are available at https://github.com/Jongchan/attention-module.
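To make the mechanism described above concrete, the following is a minimal PyTorch sketch (PyTorch per footnote 1) of a bottleneck attention module: a channel branch and a spatial branch are computed separately along the two factorized axes, summed, passed through a sigmoid, and applied to the input residually. The class name, the reduction ratio `r = 16`, and the dilation `d = 4` are illustrative assumptions for this sketch, not the authors' released implementation; consult the linked repository for the official code.

```python
import torch
import torch.nn as nn

class BAM(nn.Module):
    """Minimal sketch of a bottleneck attention module.

    Channel and spatial attention maps are computed along two
    factorized axes, combined additively, and used to refine the
    input feature map residually: F' = F + F * sigmoid(Mc(F) + Ms(F)).
    The reduction ratio `r` and dilation `d` are illustrative defaults.
    """

    def __init__(self, channels: int, r: int = 16, d: int = 4):
        super().__init__()
        # Channel branch: global average pooling -> bottleneck MLP,
        # producing one attention weight per channel.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        # Spatial branch: 1x1 channel reduction, dilated 3x3 convs to
        # enlarge the receptive field, then a 1x1 projection to a
        # single-channel spatial attention map.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.BatchNorm2d(channels // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels // r, kernel_size=3,
                      padding=d, dilation=d),
            nn.BatchNorm2d(channels // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels // r, kernel_size=3,
                      padding=d, dilation=d),
            nn.BatchNorm2d(channels // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, 1, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        mc = self.channel_att(x).view(n, c, 1, 1)  # broadcast over H, W
        ms = self.spatial_att(x)                   # broadcast over C
        att = torch.sigmoid(mc + ms)               # combined 3D attention map
        return x + x * att                         # residual refinement


# Usage: insert at a network bottleneck, i.e. where downsampling occurs.
if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)
    print(BAM(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```

Because the two branches broadcast against each other, the module adds only a small MLP and a few reduced-channel convolutions per bottleneck, which is consistent with the "negligible overhead" claim in the abstract.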


Notes

  1. https://pytorch.org/.


Author information


Corresponding author

Correspondence to Jongchan Park.

Additional information

Communicated by Ling Shao, Hubert P. H. Shum, Timothy Hospedales.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Park, J., Woo, S., Lee, JY. et al. A Simple and Light-Weight Attention Module for Convolutional Neural Networks. Int J Comput Vis 128, 783–798 (2020). https://doi.org/10.1007/s11263-019-01283-0
