Deep gated attention networks for large-scale street-level scene segmentation
Introduction
Street-level scene segmentation is the application of semantic segmentation to street-view images [1]. It aims to label each pixel of a street-view image with one of a set of predefined semantic categories, e.g., car, person, road, vegetation, building and sky. Recently, street-level scene segmentation has attracted growing interest due to its many real-world applications, especially in autonomous driving, where it helps self-driving cars detect drivable areas and avoid potential hazards.
Over the past five years, deep learning based techniques have achieved breakthrough performance in various computer vision tasks, including image classification and object detection. They have also been applied to pixel-wise labeling tasks such as saliency detection and semantic segmentation. When deep learning is employed for street-level scene segmentation, the recognition of major objects in the image, such as persons or vehicles, is realized at the high-level layers of a deep Convolutional Neural Network (CNN). High-level layers work at a coarser scale and are translation invariant, so that minor variations at a pixel do not influence recognition. However, scene segmentation requires pixel-exact classification of fine details, which are typically only found in low-level layers. This trade-off in resolution is typically addressed with skip-connections from lower layers to the output [2]. Most existing approaches differ mainly in how they encode object-level information and how they decode the corresponding prediction into pixel-exact labels. For example, the original Fully Convolutional Network (FCN) architecture [2] has been improved by alternative ways of connecting to the low-level layers: (1) accessing the lower pooling layers [3], (2) using enhanced methods to integrate lower-level information [4], or (3) forgoing pooling operations in favor of dilated convolution [5], [6]. Many recent systems also apply Conditional Random Field (CRF)-based refinement to the output produced by the FCN.
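The skip-connection idea described above can be illustrated with a minimal NumPy sketch. The array shapes and the nearest-neighbour upsampling are simplifying assumptions for illustration; FCN itself learns a bilinear deconvolution, and the function names are not from the paper.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (C, H, W) score map;
    # a simplification of FCN's learned bilinear deconvolution.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def skip_fuse(coarse_scores, pool_scores):
    # FCN-16s-style skip connection: upsample the coarse, high-level
    # score map and sum it with scores predicted from a lower pooling
    # layer, recovering finer spatial detail in the output.
    return upsample2x(coarse_scores) + pool_scores

# Coarse scores (19 classes, 32x64) fused with a finer 64x128 map
fused = skip_fuse(np.zeros((19, 32, 64)), np.ones((19, 64, 128)))
```

The fused map has the resolution of the lower layer while still carrying the high-level semantics, which is the trade-off the skip connection resolves.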
Although effective, existing approaches mainly focus on enriching feature representations or enlarging the effective receptive field, and cannot adequately capture the spatial structure of street scenes, which is crucial for scene understanding [7]. In this work, we argue that both the spatial layout of street-level scenes and multi-level features play important roles in accurate scene segmentation, as shown in Fig. 1. Motivated by this observation, we propose a novel deep gated attention network, termed Gated Attention Network (GANet), that performs multi-scale spatial feature recalibration for street-level scene segmentation. It can leverage state-of-the-art FCNs to enhance spatial features. More specifically, to efficiently encode different visual regions, we propose a self-gated attention module that adaptively models and computes attentive features in FCNs. The proposed module takes as input the multi-scale feature maps of an FCN and outputs an attention mask for each feature map. The learned attention masks neatly highlight regions of interest while suppressing background clutter. In addition, to enrich the feature representation, we propose an efficient multi-scale feature interaction mechanism that adaptively aggregates hierarchical features. Under this mechanism, features at different levels are adaptively re-weighted according to the local spatial structure and the surrounding contextual information. Thus, both the original input features and the attention information can be fully exploited by FCNs in a unified framework, leading to a comprehensive and effective feature representation. Extensive experiments on three large-scale benchmarks, i.e., Cityscapes [1], Mapillary Vistas [8] and ADE20K [9], demonstrate that our approach performs favorably against other state-of-the-art methods.
In summary, our main contributions are threefold:
- We propose a novel spatial gated attention mechanism for pixel-wise labeling tasks. The proposed mechanism can be incorporated into any existing deep network and provides effective attentive features of regions of interest. We apply it to street-level scene segmentation and show its superior performance over baseline approaches.
- We propose an efficient multi-scale feature interaction mechanism that adaptively aggregates hierarchical features to enrich the feature representation. Under this mechanism, features at different levels are re-weighted at each spatial location according to the corresponding local structure and surrounding contextual information.
- Extensive experiments on three large-scale benchmarks validate the effectiveness of the proposed modules and show that our approach performs favorably against other state-of-the-art methods.
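The per-location re-weighting of feature levels in the second contribution can be sketched as a soft weighting over levels. This is only an illustrative assumption: the per-pixel softmax over learned level scores stands in for the paper's actual aggregation, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_levels(feats, scores):
    """Adaptively fuse L same-shaped feature maps.

    feats:  (L, C, H, W) features from L network levels
    scores: (L, H, W) learned per-location level scores
    """
    # Per-pixel softmax over levels: each spatial location picks its
    # own blend of low-level detail and high-level semantics
    w = softmax(scores, axis=0)                     # (L, H, W)
    return (feats * w[:, None, :, :]).sum(axis=0)   # (C, H, W)
```

With equal scores at a location, the fused feature is the plain average of the levels; skewed scores let one level dominate where its information matters most.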
Section snippets
Scene segmentation
Over the past two decades, scene segmentation methods relied on hand-crafted features (e.g., color histograms and textons [10]) together with shallow classifiers such as boosting [11], random forests [12] and support vector machines [13]. Due to the limited discriminative power of hand-crafted features, considerable effort has been devoted to developing graphical models [14], [15]. However, graphical models increase segmentation accuracy at the cost of additional computation.
Recently, deep learning
Deep gated attention networks
In this section, we first describe in detail the proposed Spatial Gated Attention (SGA) module and the Attentive Feature Interaction (AFI) module. We then introduce the complete Gated Attention Network (GANet), which is specifically designed for the street-level scene segmentation task.
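The core gating idea behind an SGA-style module — a learned mask in [0, 1] that spatially re-weights a feature map — can be sketched as follows. The single-channel 1x1-convolution gate and the simple multiplicative re-weighting are illustrative assumptions; the paper's exact SGA design is richer than this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_gate(feat, w, b):
    """Gate a (C, H, W) feature map with a learned spatial mask.

    w: (C,) weights of a 1x1 conv producing a one-channel gate
    b: scalar bias
    """
    # (H, W) attention mask in [0, 1]: high where the gate fires
    mask = sigmoid(np.tensordot(w, feat, axes=([0], [0])) + b)
    # Broadcast the mask over channels: highlighted regions are
    # kept, background activations are suppressed toward zero
    return feat * mask[None, :, :]
```

Because the mask is produced from the features themselves, the gating is "self-gated": no external supervision on the attention map is needed; it is learned end to end with the segmentation loss.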
Structural training
We are given a training dataset with N training image pairs {(X_n, Y_n)}_{n=1}^N, where X_n and Y_n are the input street-view image and the corresponding ground-truth segmentation image with T pixels, respectively, and y_j denotes the label of the j-th class. For notational simplicity, we drop the subscript n and consider each image independently. Most existing segmentation methods [2], [4], [6] train the network with the softmax Cross-Entropy (CE) loss: L_CE = -(1/T) Σ_{i=1}^{T} Σ_{j} y_{i,j} log p_{i,j}, where p_{i,j} is the predicted softmax probability of pixel i belonging to class j.
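The per-pixel softmax CE loss named above can be sketched in pure Python. Here `probs` holds already-softmaxed class probabilities per pixel; the function name and signature are illustrative, not the paper's.

```python
import math

def pixelwise_ce_loss(probs, labels):
    """Mean softmax cross-entropy over T pixels.

    probs:  list of per-pixel class-probability lists (softmaxed)
    labels: list of ground-truth class indices, one per pixel
    """
    T = len(labels)
    # -(1/T) * sum_i log p_{i, y_i}: with one-hot labels, only the
    # ground-truth class term of the inner sum over classes survives
    return -sum(math.log(probs[i][labels[i]]) for i in range(T)) / T
```

For a single pixel predicted as [0.5, 0.5] with label 0, the loss is log 2 ≈ 0.693; a confident correct prediction drives it toward zero.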
Street-level scene datasets
We report results on Cityscapes [1], Mapillary Vistas [8] and ADE20K [9], since these datasets have complementary properties in terms of image content, size, number of class labels and annotation quality. The Cityscapes dataset contains street-level images captured in central Europe and comprises a total of 5000 densely annotated images (19 object categories + 1 void class, all sized 2048 × 1024), split into 2975/500/1525 images for training, validation and testing, respectively.
Results and discussion
In this section, we report results on the street-level scene segmentation task. For a fair comparison with other methods, we use the source code with the suggested parameters or the segmentation results provided by the corresponding authors. For methods that do not provide results on the adopted test datasets, we re-implement them and report the best results for comparison.
Conclusion and future work
In this paper, we propose a novel end-to-end Gated Attention Network (GANet) architecture for street-level scene segmentation. More specifically, we introduce the Spatial Gated Attention (SGA) module and an effective Attentive Feature Interaction (AFI) module. The SGA module provides pixel-level attention information and highlights regions of interest for semantic pixel localization. The AFI module exploits multi-level feature maps to enrich feature representations and increases the receptive field.
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China (NSFC), No. 61502070, No. 61528101, No. 61403265 and No. 61471371. Pingping Zhang and Wei Liu are currently visiting the University of Adelaide, supported by the China Scholarship Council (CSC) program. This work is also supported by the Science and Technology Plan of Sichuan Province under Grant Number 2015SZ0226.
References (64)
- et al., Semantic segmentation of images exploiting DCT based features and random forest, Pattern Recognit. (PR), 2016.
- et al., Scalable image segmentation via decoupled sub-graph compression, Pattern Recognit. (PR), 2018.
- et al., Binary partition tree construction from multiple features for image segmentation, Pattern Recognit. (PR), 2018.
- et al., A multiscale image segmentation method, Pattern Recognit. (PR), 2016.
- et al., The functional architecture of human visual motion perception, Vis. Res., 1995.
- et al., Bridging the gap between monkey neurophysiology and human perception: an ambiguity resolution theory of visual selective attention, Cogn. Psychol., 1997.
- et al., MoE-SPNet: a mixture-of-experts scene parsing network, Pattern Recognit. (PR), 2018.
- et al., The Cityscapes dataset for semantic urban scene understanding, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- et al., Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- et al., SegNet: a deep convolutional encoder-decoder architecture for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI), 2017.
- Pyramid scene parsing network, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Multi-scale context aggregation by dilated convolutions, Proceedings of the International Conference on Learning Representations (ICLR).
- DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI).
- The Mapillary Vistas dataset for semantic understanding of street scenes, Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Scene parsing through ADE20K dataset, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Joint motion estimation and segmentation of complex scenes with label costs and occlusion modeling, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context, Int. J. Comput. Vis. (IJCV).
- Class segmentation and object localization with superpixel neighborhoods, Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Very deep convolutional networks for large-scale image recognition, Proceedings of the International Conference on Learning Representations (ICLR).
- Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Hypercolumns for object segmentation and fine-grained localization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- RefineNet: multi-path refinement networks for high-resolution semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Laplacian pyramid reconstruction and refinement for semantic segmentation, Proceedings of the European Conference on Computer Vision (ECCV).
- Learning hierarchical features for scene labeling, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI).
- Efficient piecewise training of deep structured models for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Understanding convolution for semantic segmentation, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV).
- End-to-end instance segmentation with recurrent attention, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Squeeze-and-excitation networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Context encoding for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Identity mappings in deep residual networks, Proceedings of the European Conference on Computer Vision (ECCV).
Pingping Zhang received his B.E. degree in mathematics and applied mathematics, Henan Normal University (HNU), Xinxiang, China, in 2012. He is currently a Ph.D. candidate in the School of Information and Communication Engineering, Dalian University of Technology (DUT), Dalian, China. His research interests include deep learning, saliency detection, object tracking and semantic segmentation.
Wei Liu received the B.Eng. degree from the Department of Automation, Xi’an Jiaotong University, in 2012. He is currently pursuing the Ph.D. degree with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University. His current research interests mainly focus on low-level computer vision and graphics.
Hongyu Wang received the B.S. degree from Jilin University of Technology, Changchun, China, in 1990 and the M.S. degree from the Graduate School of the Chinese Academy of Sciences, Beijing, China, in 1993, both in electronic engineering. He received the Ph.D. degree in precision instrument and optoelectronics engineering from Tianjin University, Tianjin, China, in 1997. He is currently a professor with Dalian University of Technology, Dalian, China. His research interests include algorithm design, optimization, and performance issues in wireless ad hoc, mesh, and sensor networks.
Yinjie Lei received his M.S. degree from Sichuan University (SCU), China, in the area of image processing in 2009, and his Ph.D. degree in computer vision from the University of Western Australia (UWA), Australia, in 2013. He is currently an associate professor with the College of Electronics and Information Engineering at SCU, where he has served as vice dean since 2017. His research interests mainly include deep learning, 3D biometrics, object recognition and semantic segmentation.
Huchuan Lu received the M.S. degree in signal and information processing and the Ph.D. degree in system engineering from Dalian University of Technology (DUT), China, in 1998 and 2008, respectively. He has been a faculty member since 1998 and a professor since 2012 in the School of Information and Communication Engineering of DUT. His research interests are in the areas of computer vision and pattern recognition. In recent years, he has focused on visual tracking, saliency detection and semantic segmentation. He serves as an associate editor of the IEEE Transactions on Systems, Man, and Cybernetics: Part B.