
Attention-based dual context aggregation for image semantic segmentation

Multimedia Tools and Applications

Abstract

Recent works have extensively explored contextual information to enhance scene understanding. However, because of the limited receptive field of convolution kernels, most approaches model relationships only between local regions and rarely capture long-range dependencies. In this paper, we propose the Dual Context Aggregation Module (DCM) to capture such information effectively. DCM consists of two attention modules that obtain dense contextual information by modeling relations between positions and between channels. The spatial attention module generates large attention maps by constructing pairwise relationships between positions in the same row or column. The channel attention module uses the Weight Calibrate Block to generate a weight for every channel, thereby capturing the correlations between different channels. The feature maps of the two modules are fused by element-wise addition. Moreover, we design a two-step decoder module to improve the feature representation. Building on these components, we construct the Dual Context Aggregation Network (DCNet). Extensive experiments on benchmark datasets show that our model yields robust feature representations. Our method achieves performance competitive with state-of-the-art models, with mIoU scores of 81.9% on Cityscapes and 45.54% on ADE20K.
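The abstract describes DCM only at a high level, so the NumPy sketch below illustrates one plausible reading of it and is not the authors' implementation. Assumptions, all hypothetical: the spatial branch is a criss-cross-style attention in which each position attends jointly to every position in its row and column (counting itself once); the query/key/value projections `wq`, `wk`, `wv` stand in for 1x1 convolutions; a residual connection is added to the spatial output; and the Weight Calibrate Block is approximated by a squeeze-and-excitation-style sigmoid gate with weights `w1`, `w2`. Only the element-wise addition used to fuse the two branches is taken directly from the text.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive zero weight.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def row_column_attention(x, wq, wk, wv):
    """Hypothetical spatial branch on a (C, H, W) feature map.

    Each position (i, j) attends to all positions in row i and column j.
    wq, wk: (C_qk, C) and wv: (C, C) play the role of 1x1 convolutions.
    """
    C, H, W = x.shape
    q = np.einsum('dc,chw->dhw', wq, x)
    k = np.einsum('dc,chw->dhw', wk, x)
    v = np.einsum('dc,chw->dhw', wv, x)
    e_row = np.einsum('dij,dik->ijk', q, k)  # (H, W, W): (i, j) vs (i, k)
    e_col = np.einsum('dij,dkj->ijk', q, k)  # (H, W, H): (i, j) vs (k, j)
    # Mask (i, j) vs itself in the column part so it is counted only once.
    self_mask = np.eye(H, dtype=bool)[:, None, :]  # (H, 1, H)
    e_col = np.where(self_mask, -np.inf, e_col)
    # Joint softmax over the H + W - 1 row/column neighbours of each position.
    attn = softmax(np.concatenate([e_row, e_col], axis=-1))
    a_row, a_col = attn[..., :W], attn[..., W:]
    out = (np.einsum('ijk,dik->dij', a_row, v)
           + np.einsum('ijk,dkj->dij', a_col, v))
    return x + out  # assumed residual connection

def weight_calibrate(x, w1, w2):
    """SE-style channel gate, an assumed stand-in for the Weight Calibrate Block.

    w1: (C // r, C), w2: (C, C // r) for some reduction ratio r.
    """
    s = x.mean(axis=(1, 2))                # squeeze: one value per channel
    h = np.maximum(w1 @ s, 0.0)            # ReLU
    g = 1.0 / (1.0 + np.exp(-(w2 @ h)))    # sigmoid weight per channel
    return x * g[:, None, None]            # rescale each channel

def dual_context(x, params):
    # Fuse the two branches by element-wise addition, as the abstract states.
    return (row_column_attention(x, *params['spatial'])
            + weight_calibrate(x, *params['channel']))

rng = np.random.default_rng(0)
C, H, W = 8, 5, 6
x = rng.standard_normal((C, H, W))
params = {
    'spatial': (rng.standard_normal((2, C)), rng.standard_normal((2, C)),
                rng.standard_normal((C, C))),
    'channel': (rng.standard_normal((2, C)), rng.standard_normal((C, 2))),
}
y = dual_context(x, params)
print(y.shape)  # (8, 5, 6)
```

Because each position's attention weights sum to one over its row and column, a spatially constant feature map passed through the spatial branch with `wv` set to the identity comes back doubled (input plus an unchanged residual). This is a convenient sanity check on the masking and normalization.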


Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7



Author information

Correspondence to Zhiyang Qi.

Ethics declarations

Conflict of interest

There are no conflicts of interest in this work.

About this article

Cite this article

Zhao, D., Qi, Z., Yang, R. et al. Attention-based dual context aggregation for image semantic segmentation. Multimed Tools Appl 80, 28201–28216 (2021). https://doi.org/10.1007/s11042-021-11094-6

  • DOI: https://doi.org/10.1007/s11042-021-11094-6
