Abstract
Recent work has extensively explored contextual relevance to enhance scene understanding. However, most approaches model relationships only between local regions because of the limited extent of the convolution kernel, and rarely explore long-range dependencies. In this paper, we propose the Dual Context Aggregation Module (DCM) to capture such long-range information effectively. DCM comprises two attention modules that obtain dense contextual information by modeling relations between positions and between channels. The spatial attention module generates large attention maps by constructing pairwise relationships between positions in the same row or column. The channel attention module applies the Weight Calibrate Block to generate a weight for every channel, capturing the correlation between different channels. We integrate the feature maps of the two modules by element-wise addition. Moreover, we design a two-step decoder module to improve the feature representation. Building on these components, we construct the Dual Context Aggregation Network (DCNet). Extensive experiments on benchmark datasets show that our model yields robust feature representations. Our method achieves competitive performance compared with state-of-the-art models, with mIoU scores of 81.9% on Cityscapes and 45.54% on ADE20K.
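The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the learned projections of the spatial module or the internals of the Weight Calibrate Block, so the sketch uses the raw features as queries, keys, and values, and substitutes a squeeze-and-excitation-style gate for the channel weights. The function names, the weight matrices `w1` and `w2`, and the tensor layout `(C, H, W)` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def row_column_attention(x):
    """Each position attends to the positions in its own row and column.

    x has shape (C, H, W). Assumption: x itself serves as query, key and
    value; a real module would use learned projections.
    """
    _, h, w = x.shape
    # e_row[i, j, v]: similarity of position (i, j) with position (i, v).
    e_row = np.einsum("chw,chv->hwv", x, x)  # (H, W, W)
    # e_col[i, j, g]: similarity of position (i, j) with position (g, j).
    e_col = np.einsum("chw,cgw->hwg", x, x)  # (H, W, H)
    # The position itself is already covered by the row term; mask it in
    # the column term so that it is counted only once.
    idx = np.arange(h)
    e_col[idx, :, idx] = -np.inf
    # Joint softmax over the H + W - 1 positions on the same row or column.
    attn = softmax(np.concatenate([e_row, e_col], axis=-1), axis=-1)
    a_row, a_col = attn[..., :w], attn[..., w:]
    out = np.einsum("hwv,chv->chw", a_row, x)
    out += np.einsum("hwg,cgw->chw", a_col, x)
    return out


def channel_attention(x, w1, w2):
    """One weight per channel (stand-in for the Weight Calibrate Block).

    Assumption: global average pooling followed by a two-layer gate.
    w1 has shape (C_hidden, C) and w2 has shape (C, C_hidden).
    """
    pooled = x.mean(axis=(1, 2))                 # (C,)
    hidden = np.maximum(w1 @ pooled, 0.0)        # ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid, in (0, 1)
    return x * weights[:, None, None]


def dual_context_aggregation(x, w1, w2):
    # The two branches are fused by element-wise addition.
    return row_column_attention(x) + channel_attention(x, w1, w2)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 5, 6))            # (C, H, W)
w1 = rng.normal(size=(2, 4))
w2 = rng.normal(size=(4, 2))
fused = dual_context_aggregation(features, w1, w2)
print(fused.shape)
```

Restricting attention to one row and one column means each position stores H + W - 1 weights instead of H × W, which is the motivation usually given for this family of spatial attention designs.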
Ethics declarations
Conflict of interest
There are no conflicts of interest in this work.
Cite this article
Zhao, D., Qi, Z., Yang, R. et al. Attention-based dual context aggregation for image semantic segmentation. Multimed Tools Appl 80, 28201–28216 (2021). https://doi.org/10.1007/s11042-021-11094-6
DOI: https://doi.org/10.1007/s11042-021-11094-6