Deep gated attention networks for large-scale street-level scene segmentation
Introduction
Street-level scene segmentation is the application of semantic segmentation to street-view images [1]. It aims to label each pixel of a street-view image with one of a set of predefined semantic categories, e.g., car, person, road, vegetation, building and sky. Recently, street-level scene segmentation has attracted growing interest due to its many real-world applications, especially in autonomous driving, where it helps self-driving cars detect drivable areas and avoid potential hazards.
Over the past five years, deep learning based techniques have achieved breakthrough performance in various computer vision tasks, including image classification and object detection. They have also been applied to pixel-wise labeling tasks such as saliency detection and semantic segmentation. When deep learning is employed for street-level scene segmentation, the recognition of major objects in the image, such as persons or vehicles, is realized at the high-level layers of a deep Convolutional Neural Network (CNN). High-level layers work at a coarser scale and are translation invariant, so that minor variations at a pixel do not influence recognition. However, scene segmentation requires pixel-exact classification of fine details, which are typically only found in low-level layers. This trade-off in resolution is typically addressed with skip-connections from lower layers to the output [2]. Most existing approaches differ mainly in how they encode object-level information and how they decode the corresponding prediction into pixel-exact labels. For example, the original Fully Convolutional Network (FCN) architecture [2] has been improved by alternative ways of connecting to the low-level layers: (1) accessing the lower pooling layers [3], (2) using enhanced methods to integrate lower-level information [4], or (3) forgoing pooling operations in favor of dilated convolution [5], [6]. Many recent systems also apply Conditional Random Field (CRF)-based refinement to the output produced by the FCN.
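The skip-connection idea described above can be illustrated with a minimal NumPy sketch. The array shapes and the nearest-neighbour upsampling are simplifying assumptions for illustration; FCN itself learns a bilinear deconvolution, and the function names are not from the paper.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) score map;
    # a simplification of FCN's learned bilinear deconvolution.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_fuse(coarse_scores, pool_scores):
    # FCN-16s-style skip connection: upsample the coarse, high-level
    # score map and sum it with scores predicted from a lower pooling
    # layer, recovering finer spatial detail in the output.
    return upsample2x(coarse_scores) + pool_scores

# Coarse scores (19 classes, 32x64) fused with a finer 64x128 map
fused = skip_fuse(np.zeros((19, 32, 64)), np.ones((19, 64, 128)))
```

The fused map has the resolution of the lower layer while still carrying the high-level semantics, which is the trade-off the skip connection resolves.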
Although effective, existing approaches mainly focus on enriching feature representations or enlarging the effective receptive field, and cannot adequately capture the spatial structure of street scenes, which is crucial for scene understanding [7]. In this work, we argue that both the spatial layout of street-level scenes and multi-level features play important roles in accurate scene segmentation, as shown in Fig. 1. Motivated by this observation, we propose a novel deep gated attention network, termed Gated Attention Network (GANet), that performs multi-scale spatial feature recalibration for street-level scene segmentation. It can leverage state-of-the-art FCNs to enhance spatial features. More specifically, to efficiently encode different visual regions, we propose a self-gated attention module that adaptively models and computes attentive features in FCNs. The proposed module takes as input the multi-scale feature maps of an FCN and outputs an attention mask for each feature map. The learned attention masks neatly highlight regions of interest while suppressing background clutter. In addition, to enrich the feature representation, we propose an efficient multi-scale feature interaction mechanism that adaptively aggregates hierarchical features. Under this mechanism, features at different levels are adaptively re-weighted according to the local spatial structure and the surrounding contextual information. Thus, both the original input features and the attention information can be fully exploited by FCNs in a unified framework, leading to a comprehensive and effective feature representation. Extensive experiments on three large-scale benchmarks, i.e., Cityscapes [1], Mapillary Vistas [8] and ADE20K [9], demonstrate that our approach performs favorably against other state-of-the-art methods.
In summary, our main contributions are threefold:
- We propose a novel spatial gated attention mechanism for pixel-wise labeling tasks. The proposed mechanism can be incorporated into any existing deep network and provides effective attentive features of regions of interest. We apply it to street-level scene segmentation and show its superior performance over baseline approaches.
- We propose an efficient multi-scale feature interaction mechanism that adaptively aggregates hierarchical features to enrich the feature representation. Under this mechanism, features at different levels are re-weighted at each spatial location according to the corresponding local structure and surrounding contextual information.
- Extensive experiments on three large-scale benchmarks validate the effectiveness of the proposed modules and show that our approach performs favorably against other state-of-the-art methods.
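The per-location re-weighting of feature levels in the second contribution can be sketched as a soft weighting over levels. This is only an illustrative assumption: the per-pixel softmax over learned level scores stands in for the paper's actual aggregation, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_levels(feats, scores):
    """Adaptively fuse L same-shaped feature maps.

    feats:  (L, C, H, W) features from L network levels
    scores: (L, H, W) learned per-location level scores
    """
    # Per-pixel softmax over levels: each spatial location picks its
    # own blend of low-level detail and high-level semantics
    w = softmax(scores, axis=0)                     # (L, H, W)
    return (feats * w[:, None, :, :]).sum(axis=0)   # (C, H, W)
```

With equal scores at a location, the fused feature is the plain average of the levels; skewed scores let one level dominate where its information matters most.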
Section snippets
Scene segmentation
Over the past two decades, scene segmentation methods relied on hand-crafted features (e.g., color histograms and textons [10]) together with shallow classifiers such as boosting [11], random forests [12] and support vector machines [13]. Due to the limited discriminative power of hand-crafted features, considerable effort has been devoted to developing graphical models [14], [15]. However, graphical models increase segmentation accuracy at the cost of additional computation.
Recently, deep learning
Deep gated attention networks
In this section, we first describe in detail the proposed Spatial Gated Attention (SGA) module and the Attentive Feature Interaction (AFI) module. We then introduce the complete Gated Attention Network (GANet), which is specifically designed for the street-level scene segmentation task.
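The core gating idea behind an SGA-style module — a learned mask in [0, 1] that spatially re-weights a feature map — can be sketched as follows. The single-channel 1x1-convolution gate and the simple multiplicative re-weighting are illustrative assumptions; the paper's exact SGA design is richer than this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_gate(feat, w, b):
    """Gate a (C, H, W) feature map with a learned spatial mask.

    w: (C,) weights of a 1x1 conv producing a one-channel gate
    b: scalar bias
    """
    # (H, W) attention mask in [0, 1]: high where the gate fires
    mask = sigmoid(np.tensordot(w, feat, axes=([0], [0])) + b)
    # Broadcast the mask over channels: highlighted regions are
    # kept, background activations are suppressed toward zero
    return feat * mask[None, :, :]
```

Because the mask is produced from the features themselves, the gating is "self-gated": no external supervision on the attention map is needed; it is learned end to end with the segmentation loss.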
Structural training
We are given a training dataset with N training image pairs {(X_n, Y_n)}_{n=1}^N, where X_n and Y_n are the input street-view image and the corresponding ground-truth segmentation image with T pixels, respectively, and y_j denotes the label of the j-th class. For notational simplicity, we drop the subscript n and consider each image independently. Most existing segmentation methods [2], [4], [6] train the network with the softmax Cross-Entropy (CE) loss: L_CE = -(1/T) Σ_{i=1}^{T} Σ_{j} y_{i,j} log p_{i,j}, where p_{i,j} is the predicted softmax probability of pixel i belonging to class j.
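The per-pixel softmax CE loss named above can be sketched in pure Python. Here `probs` holds already-softmaxed class probabilities per pixel; the function name and signature are illustrative, not the paper's.

```python
import math

def pixelwise_ce_loss(probs, labels):
    """Mean softmax cross-entropy over T pixels.

    probs:  list of per-pixel class-probability lists (softmaxed)
    labels: list of ground-truth class indices, one per pixel
    """
    T = len(labels)
    # -(1/T) * sum_i log p_{i, y_i}: with one-hot labels, only the
    # ground-truth class term of the inner sum over classes survives
    return -sum(math.log(probs[i][labels[i]]) for i in range(T)) / T
```

For a single pixel predicted as [0.5, 0.5] with label 0, the loss is log 2 ≈ 0.693; a confident correct prediction drives it toward zero.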
Street-level scene datasets
We report results on Cityscapes [1], Mapillary Vistas [8] and ADE20K [9], since these datasets have complementary properties in terms of image content, size, number of class labels and annotation quality. The Cityscapes dataset contains street-level images captured in central Europe and comprises a total of 5000 densely annotated images (19 object categories + 1 void class, all sized 2048 × 1024), split into 2975/500/1525 images for training, validation and testing, respectively.
Results and discussion
In this section, we report results on the street-level scene segmentation task. For a fair comparison with other methods, we use the source code with the suggested parameters or the segmentation results provided by the corresponding authors. For methods that do not provide results on the adopted test datasets, we re-implement them and report the best results for comparison.
Conclusion and future work
In this paper, we propose a novel end-to-end Gated Attention Network (GANet) architecture for street-level scene segmentation. More specifically, we introduce the Spatial Gated Attention (SGA) module and an effective Attentive Feature Interaction (AFI) module. The SGA module provides pixel-level attention information and highlights regions of interest for semantic pixel localization. The AFI module exploits multi-level feature maps to enrich feature representations and increases the receptive field.
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China (NSFC), No. 61502070, No. 61528101, No. 61403265 and No. 61471371. Pingping Zhang and Wei Liu are currently visiting the University of Adelaide, supported by the China Scholarship Council (CSC) program. This work is also supported by the Science and Technology Plan of Sichuan Province under Grant Number 2015SZ0226.
References (64)
- et al., Semantic segmentation of images exploiting DCT based features and random forest, Pattern Recognit. (PR), 2016.
- et al., Scalable image segmentation via decoupled sub-graph compression, Pattern Recognit. (PR), 2018.
- et al., Binary partition tree construction from multiple features for image segmentation, Pattern Recognit. (PR), 2018.
- et al., A multiscale image segmentation method, Pattern Recognit. (PR), 2016.
- et al., The functional architecture of human visual motion perception, Vis. Res., 1995.
- et al., Bridging the gap between monkey neurophysiology and human perception: an ambiguity resolution theory of visual selective attention, Cogn. Psychol., 1997.
- et al., MoE-SPNet: a mixture-of-experts scene parsing network, Pattern Recognit. (PR), 2018.
- et al., The Cityscapes dataset for semantic urban scene understanding, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- et al., Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- et al., SegNet: a deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 2017.
- Pyramid scene parsing network, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Multi-scale context aggregation by dilated convolutions, Proceedings of the International Conference on Learning Representations (ICLR).
- DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI).
- The Mapillary Vistas dataset for semantic understanding of street scenes, Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Scene parsing through ADE20K dataset, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Joint motion estimation and segmentation of complex scenes with label costs and occlusion modeling, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context, Int. J. Comput. Vis. (IJCV).
- Class segmentation and object localization with superpixel neighborhoods, Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Very deep convolutional networks for large-scale image recognition, Proceedings of the International Conference on Learning Representations (ICLR).
- Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Hypercolumns for object segmentation and fine-grained localization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- RefineNet: multi-path refinement networks for high-resolution semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Laplacian pyramid reconstruction and refinement for semantic segmentation, Proceedings of the European Conference on Computer Vision (ECCV).
- Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI).
- Efficient piecewise training of deep structured models for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Understanding convolution for semantic segmentation, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV).
- End-to-end instance segmentation with recurrent attention, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Squeeze-and-excitation networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Context encoding for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Identity mappings in deep residual networks, Proceedings of the European Conference on Computer Vision (ECCV).
Pingping Zhang received his B.E. degree in mathematics and applied mathematics, Henan Normal University (HNU), Xinxiang, China, in 2012. He is currently a Ph.D. candidate in the School of Information and Communication Engineering, Dalian University of Technology (DUT), Dalian, China. His research interests include deep learning, saliency detection, object tracking and semantic segmentation.
Wei Liu received the B.Eng. degree from the Department of Automation, Xi’an Jiaotong University, in 2012. He is currently pursuing the Ph.D. degree with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University. His current research interests mainly focus on low-level computer vision and graphics.
Hongyu Wang received the B.S. degree from Jilin University of Technology, Changchun, China, in 1990 and the M.S. degree from the Graduate School of the Chinese Academy of Sciences, Beijing, China, in 1993, both in electronic engineering. He received the Ph.D. degree in precision instrument and optoelectronics engineering from Tianjin University, Tianjin, China, in 1997. He is currently a professor with Dalian University of Technology, Dalian, China. His research interests include algorithm design, optimization, and performance issues in wireless ad hoc, mesh, and sensor networks.
Yinjie Lei received his M.S. degree from Sichuan University (SCU), China, in the area of image processing in 2009, and his Ph.D. degree in computer vision from the University of Western Australia (UWA), Australia, in 2013. He is currently an associate professor with the College of Electronics and Information Engineering at SCU, where he has served as vice dean since 2017. His research interests mainly include deep learning, 3D biometrics, object recognition and semantic segmentation.
Huchuan Lu received the M.S. degree in signal and information processing and the Ph.D. degree in system engineering from Dalian University of Technology (DUT), China, in 1998 and 2008, respectively. He has been a faculty member since 1998 and a professor since 2012 in the School of Information and Communication Engineering of DUT. His research interests are in the areas of computer vision and pattern recognition. In recent years, he has focused on visual tracking, saliency detection and semantic segmentation. He serves as an associate editor of the IEEE Transactions on Systems, Man, and Cybernetics: Part B.