Abstract
Recent studies show that combining contextual and spatial information significantly improves the performance of segmentation networks. Existing methods differ mainly in how they extract contextual and spatial information. To jointly exploit spatial details from shallow layers, semantic information from deeper layers, and attention mechanisms built on specialized pooling, we propose an Anisotropic Non-local Attention Network (ANANet) that acquires contextual and spatial information in a flexible and efficient way. We first present a spatial contextual module with anisotropic pooling (SCMA), which encodes contextual features by integrating traditional square pooling, anisotropic pooling, and attention mechanisms. SCMA adopts adaptive spatial pooling to extract multi-scale features and includes an anisotropic pooling attention module (APAM) to compensate for the shortcomings of square pooling. APAM first applies horizontal and vertical pooling and then multiplies the two pooling results to generate attention maps for elongated and anisotropic objects. We then propose a non-local channel contextual module (CCM) that reuses shallow features from the backbone network to emphasize channel interdependency. CCM encodes category differences to reduce erroneous segmentation of ambiguous boundary pixels. Finally, we concatenate the outputs of SCMA and CCM to further improve the feature representation. Experiments on public datasets show that our method achieves clearly better results than existing state-of-the-art methods.
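To make the pooling-and-multiply idea behind APAM concrete, the following is a minimal, framework-free sketch on a single-channel feature map. The abstract states only that horizontal and vertical pooling results are multiplied to form an attention map; the use of average pooling, the outer-product form of the multiplication, the sigmoid normalization, and all function names here are illustrative assumptions, and the learned convolutions a real module would contain are omitted.

```python
import math


def horizontal_pool(feature):
    """Average each row across the width axis: H x W -> H values."""
    return [sum(row) / len(row) for row in feature]


def vertical_pool(feature):
    """Average each column across the height axis: H x W -> W values."""
    height = len(feature)
    return [sum(col) / height for col in zip(*feature)]


def sigmoid(x):
    """Squash a score into (0, 1); assumed normalization, not from the paper."""
    return 1.0 / (1.0 + math.exp(-x))


def anisotropic_attention(feature):
    """Build an H x W attention map from the product of the two strip poolings.

    Assumption: the "multiplication" is an outer product of the row and
    column descriptors, followed by a sigmoid.
    """
    row_desc = horizontal_pool(feature)   # one value per row
    col_desc = vertical_pool(feature)     # one value per column
    return [[sigmoid(r * c) for c in col_desc] for r in row_desc]


def apply_attention(feature, attention):
    """Reweight the feature map element-wise by the attention map."""
    return [
        [f * a for f, a in zip(f_row, a_row)]
        for f_row, a_row in zip(feature, attention)
    ]


# Toy feature map containing a thin horizontal bar (an elongated object).
feature = [
    [0.0, 0.0, 0.0, 0.0],
    [4.0, 4.0, 4.0, 4.0],
    [0.0, 0.0, 0.0, 0.0],
]
attention = anisotropic_attention(feature)
weighted = apply_attention(feature, attention)
```

In this toy input the row containing the bar receives a higher attention weight than the empty rows, which is the kind of response to elongated structures that strip-shaped pooling is intended to provide and that a square pooling window would dilute.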
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (61862029, 62062038), and the Natural Science Foundation of Jiangxi Province (20202BABL212007, 20192BAB207011, 20212BAB202012).
Cite this article
Yuan, F., Zhu, Y., Li, K. et al. An anisotropic non-local attention network for image segmentation. Machine Vision and Applications 33, 23 (2022). https://doi.org/10.1007/s00138-021-01265-8
DOI: https://doi.org/10.1007/s00138-021-01265-8