Augmented FCN: rethinking context modeling for semantic segmentation

  • Research Paper
  • Science China Information Sciences

Abstract

The effectiveness of modeling contextual information has been empirically shown in numerous computer vision tasks. In this paper, we propose a simple yet efficient augmented fully convolutional network (AugFCN) that aggregates content- and position-based object contexts for semantic segmentation. Specifically, motivated by the observation that each deep feature map is a global, class-wise representation of the input, we first propose an augmented nonlocal interaction (AugNI) that aggregates the global content-based contexts through interactions among all feature maps. Compared to classical position-wise approaches, AugNI is more efficient. Moreover, to eliminate permutation equivariance while maintaining translation equivariance, a learnable relative position embedding branch is then added to AugNI to capture the global position-based contexts. AugFCN is built on a fully convolutional network backbone by deploying AugNI before the segmentation head. Experimental results on two challenging benchmarks verify that AugFCN achieves a competitive mean intersection over union (mIoU) of 45.38% on the ADE20K val set and 81.9% on the Cityscapes test set, with little computational overhead. Additionally, results from combining AugNI with existing context modeling schemes show that AugFCN yields consistent segmentation improvements over state-of-the-art context modeling methods. We finally achieve a top performance of 45.43% mIoU on the ADE20K val set and 83.0% mIoU on the Cityscapes test set.
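As a rough illustration of what an interaction among feature maps can look like, the following NumPy sketch computes a channel-to-channel affinity and uses it to re-weight the feature maps. It is not the authors' AugNI: the function name `channel_interaction`, the softmax normalization, the scaling by the square root of the number of positions, and the residual sum are all assumptions made for this example, and the relative position embedding branch is omitted entirely. The point it shows is only that the affinity matrix has shape C × C rather than (H·W) × (H·W), which is smaller whenever C < H·W.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def channel_interaction(feats):
    """Hypothetical content-based interaction among feature maps.

    feats: array of shape (C, H, W), one row per feature map after flattening.
    Returns the refined features (C, H, W) and the affinity matrix (C, C).
    """
    c, h, w = feats.shape
    flat = feats.reshape(c, h * w)  # (C, H*W)

    # Pairwise similarity between feature maps; the scaling is an assumption.
    affinity = softmax(flat @ flat.T / np.sqrt(h * w), axis=-1)  # (C, C)

    # Each output map is a weighted mix of all input maps.
    mixed = affinity @ flat  # (C, H*W)

    # Residual connection (assumed) keeps the original features.
    return feats + mixed.reshape(c, h, w), affinity


rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 16))
out, affinity = channel_interaction(feats)
print(out.shape, affinity.shape)  # (8, 16, 16) (8, 8)
```

With these toy sizes, a position-wise affinity over the same input would have 256 × 256 entries, compared with 8 × 8 here.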

Acknowledgements

This work was partially supported by the National Key Research and Development Program of China (Grant No. 2018AAA0102002) and the National Natural Science Foundation of China (Grant Nos. 61925204, 62172212). The authors would like to thank all the anonymous reviewers for their constructive comments and suggestions.

Author information

Corresponding author

Correspondence to Jinhui Tang.

About this article

Cite this article

Zhang, D., Zhang, L. & Tang, J. Augmented FCN: rethinking context modeling for semantic segmentation. Sci. China Inf. Sci. 66, 142105 (2023). https://doi.org/10.1007/s11432-021-3590-1

  • DOI: https://doi.org/10.1007/s11432-021-3590-1
