1 Introduction

Semantic segmentation can be described as a classification task in which each pixel of the input is recognized by analyzing its global and local context. Traditional methods [17, 20] extract features based on texture and color information, cluster pixels into blobs and then analyze the semantics of the blobs.

With the success of deep learning, deep neural networks are widely adopted in the semantic segmentation task [18, 19]. In neural networks, semantic segmentation is treated as pixel-level classification, where a feature map is computed and then classified pixel by pixel. In other words, pixels in the feature map are recognized individually, without the clustering process of traditional methods. Although the neural nodes in the high-level layers of a deep network have wide receptive fields and rich local semantics, training the pixels individually causes the output to lose the correlation between neighboring pixels that is present in the pixel features. As a result, methods based on deep learning often perform poorly on edges and details, while the results of traditional segmentation methods often have clear edges.

One remedy for segmentation networks is to add a post-processing step that adjusts the results based on the texture and color of the inputs, such as Conditional Random Fields (CRF) [2]. CRF is a graphical model that helps approximate the posterior distribution of the results given the network inputs and outputs. However, CRF involves heavy computation and adds latency.
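For illustration only (the paper does not provide CRF code), a typical dense-CRF post-processing step with the pydensecrf package is sketched below; the pairwise parameters and the random placeholder inputs are our assumptions.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

# Placeholder inputs (assumed): the softmax output of a segmentation
# network, shape (n_labels, H, W), and the matching uint8 RGB image.
H, W, n_labels = 360, 480, 11
probs = np.random.rand(n_labels, H, W).astype(np.float32)
probs /= probs.sum(axis=0, keepdims=True)
image = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)

d = dcrf.DenseCRF2D(W, H, n_labels)            # width, height, labels
d.setUnaryEnergy(unary_from_softmax(probs))    # -log(probs) as unary term
d.addPairwiseGaussian(sxy=3, compat=3)         # location-based smoothing
d.addPairwiseBilateral(sxy=80, srgb=13, rgbim=image, compat=10)  # color-aware
Q = np.array(d.inference(5)).reshape(n_labels, H, W)
refined = Q.argmax(axis=0)                     # adjusted label map
```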

In order to obtain fine features for semantic segmentation in neural networks, the input passes through a backbone network that encodes its high-level semantics and then through an upsampling network that recovers the spatial details. This encoder-decoder structure helps the feature map capture both high-level semantics and low-level details. Each pixel in the feature map is computed from a number of neighboring neural nodes in previous layers, so the feature of each pixel contains not only the semantics but also the neighboring correlation. Through the encoder network, the spatial resolution of the feature map decreases while the number of channels increases, which means the local context information, or neighboring correlation, is encoded into the pixel features of the feature maps.

As semantic segmentation is treated as a classification task, the loss function in segmentation networks is often designed as a pixel-wise cross entropy loss [15]. Although the pixel features have the capacity to capture neighboring correlation, a pixel-wise loss function does not guide the network to learn it. If the neighboring correlation is taken into account in the loss function, the potential of the network can be explored further.
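As a minimal illustration of this point (our sketch, not the paper's code), the standard pixel-wise cross entropy loss in PyTorch scores every spatial location independently:

```python
import torch
import torch.nn as nn

# Assumed shapes for illustration: 11 classes (as in CamVid), 64x64 logits.
logits = torch.randn(2, 11, 64, 64)          # (N, C, H, W) decoder output
labels = torch.randint(0, 11, (2, 64, 64))   # (N, H, W) per-pixel class ids

# CrossEntropyLoss averages over all pixels independently: no term in the
# loss couples a pixel to its neighbors, which is the limitation noted above.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)
```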

Methods based on the encoder-decoder structure have an encoder network and a decoder network. The training difficulty and the optimization state are affected by the complexity of the network, such as its depth and the number of convolution kernels. ResNet [1] was proposed to ease the training process by accumulating residuals to approximate the target. With skip layers, the training target is converted to sparse residuals, which makes it possible to train ResNets with up to 1000 layers. ResNet alleviates network degradation and gradient vanishing, and the skip layer has become one of the most commonly used layers. Optimizing the training strategy can also help the network converge. Relay backpropagation was proposed for effective learning of deep convolutional neural networks; by training sub-networks separately, it helps the network converge to a better state.

This paper proposes the neighborhood encoding network (NENet), which extracts more neighboring correlation for semantic segmentation by training the network to encode neighboring correlation into the pixel features of the feature map. A new relay loss is also designed for level-wise training. The proposed NENet is evaluated on the CamVid [3] and Cityscapes [4] datasets and achieves impressive results.

2 Related Work

Semantic segmentation is widely used in various fields, and the segmentation results also serve as masks in other tasks, such as pedestrian detection and landmark localization. Efforts have been made to improve performance and training efficiency.

2.1 Context Encoding for Semantic Segmentation

In order to enhance the classification capability, different types of layers have been proposed to generate a fine feature map. In addition to adopting more powerful backbone networks, recent methods also improve the context encoding ability by combining features with different encoded semantics.

In order to enlarge the receptive field of the convolution layer, DeepLab proposes atrous convolution [5], which covers a larger receptive field with the same number of parameters as a conventional convolution.

Spatial Pyramid Pooling (SPP) [8] was first proposed to deal with the multi-scale problem in object detection. SPP uses pooling operators to compute a spatial pyramid over the sampled features and then combines them into features containing multi-scale semantics for multi-scale detection.

Combining atrous convolution and SPP to enhance context encoding has become a popular practice. DeepLab V2 and V3 construct the Atrous Spatial Pyramid Pooling (ASPP) module, which enriches feature semantics by combining features computed by convolutions with various atrous rates. DenseASPP [6] takes advantage of DenseNet [7] to further enhance the ASPP module. PSPNet [9] designs the PSP module to combine features at different scales for segmentation and achieves state-of-the-art results.

2.2 Pixel Level Correlation Extraction in Pixel Level Tasks

As encoder networks extract feature maps with rich global and local context, some methods attempt to obtain pixel-level correlation within the network.

Adaptive Affinity Fields (AAF) [10] are proposed to analyze and match the correlation of neighboring pixels in semantic segmentation; an extra affinity field matching loss learns optimal affinity fields and improves the segmentation of spatial structures and small details.

EncNet [11] studies the impact of global context in semantic segmentation by capturing the semantic context of scenes and selectively highlighting class-dependent feature maps. With this semantic context, EncNet significantly improves performance.

2.3 Training Strategy in Segmentation Networks

In pixel-level tasks, methods based on deep neural networks often append an upsampling network to the backbone encoder network to recover the resolution. The extra network inevitably increases the depth of the network and the difficulty of training. Different methods have been proposed to ease the training process and help the network converge to a better state.

The skip layer [21] is proposed in ResNet; it adds the identity of the input to the output of a ResBlock. This practice converts the training target of the layers to the residual of the labels, which makes the network easier to train.
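A minimal PyTorch sketch of such a block (our illustration; actual ResNet blocks differ in details such as bottlenecks and downsampling) makes the identity skip explicit:

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """The skip connection adds the input identity to the block output,
    so the layers only have to learn the residual."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity skip
```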

Batch Normalization [16] is proposed to alleviate the internal covariate shift problem by normalizing layer inputs. Batch Normalization allows the use of higher learning rates and reduces the sensitivity to weight initialization.

Auxiliary loss functions are also used for network training in many methods. PSPNet [9] and BiSeNet [12] append extra auxiliary loss functions to hidden layers and train the network with a weighted loss. The relay backpropagation strategy divides the network into several subnetworks and trains them separately with their own loss functions.

3 Proposed Method

The neighborhood encoding network (NENet) attempts to encode the pixel-level correlation in the network and utilize this correlation for better segmentation. The following subsections describe the network structure, the NPM block and the training strategy.

3.1 Neighborhood Encoding Network

We design our neighborhood encoding network (NENet) based on the encoder-decoder structure. The overall design refers to the lightweight semantic segmentation network ENet [13]. NENet is set up with an encoder network and a decoder network, as shown in Fig. 1. The encoder network consists of an initialization block, two downsampling blocks and 20 ResBlocks interleaved among the initialization and downsampling blocks; the decoder network consists of two upsampling blocks, a neighborhood prediction module (NPM) and 3 ResBlocks interleaved among the upsampling blocks and the NPM. Compared with the original ENet, we add skip layers in NENet and replace the last deconvolution layer with our NPM. We train NENet with our level-wise relay strategy.

Fig. 1. Structure of NENet.

3.2 Neighborhood Prediction Module

Review of Deconvolution

In order to decode the spatial semantics from the low-resolution, high-level feature map, the common practice is either to interpolate the feature map and approximate the target with a convolution, or to apply a deconvolution directly. Since the interpolation operator is a special case of deconvolution, the two practices are essentially the same: at testing phase, every target pixel is approximated with the same convolution kernel in the final layer, as illustrated in Fig. 2.

Fig. 2. Illustration of deconvolution operators. (Color figure online)

The lower blue grid is the input with dashed-line fillers, while the upper green grid is the output. The actual operation of deconvolution is just convolution after upsampling, so each pixel in the output is computed with the same weight \( W \), regardless of its position or its relationship to its neighborhood. Although the feature map has a large receptive field and rich neighboring correlation, the deconvolution predictor computes the result from the individual feature of each pixel. In order to utilize the neighboring correlation for semantic segmentation, we put forward the neighborhood prediction module in the following subsection.
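This view can be checked numerically. The sketch below (our illustration, single channel for simplicity) verifies that a stride-2 deconvolution equals zero-insertion upsampling followed by an ordinary convolution with the flipped kernel:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 4, 4)    # low-resolution input feature
w = torch.randn(1, 1, 2, 2)    # the single shared kernel W

# Deconvolution (transposed convolution) with stride 2.
y1 = F.conv_transpose2d(x, w, stride=2)                  # (1, 1, 8, 8)

# Same result by hand: insert zeros between pixels (the dashed fillers
# of Fig. 2), pad, then run a plain convolution with the flipped kernel.
up = torch.zeros(1, 1, 7, 7)
up[:, :, ::2, ::2] = x
y2 = F.conv2d(F.pad(up, (1, 1, 1, 1)), w.flip([-1, -2]))
print(torch.allclose(y1, y2, atol=1e-6))                 # expected: True
```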

Design of Neighborhood Prediction Module

In order to extract neighboring correlation, we design the neighborhood prediction module to predict the four neighboring pixels of the target from the feature at the corresponding position of the feature map.

As shown in Fig. 3, the neighboring pixels of the target (the upper-left, lower-left, upper-right and lower-right pixels) are approximated separately by four convolution kernels \( W_{1} , W_{2} , W_{3} , W_{4} \) based on the feature. Unlike the deconvolution operator, the NPM uses four directional kernels to predict four neighboring target maps, which makes good use of the neighboring correlation extracted by the encoder. After approximation by the four kernels, the four directional maps can be combined to recover the complete target.

Fig. 3. Illustration of NPM in NENet.

At the training phase, because the output is computed by the NPM, the layers before the NPM are trained to extract more neighboring correlation so that the NPM can recover results closer to the labels. The NPM therefore helps extract more neighboring correlation at training phase and makes better use of it at testing phase.
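Under our reading of Fig. 3, such a module could look like the minimal PyTorch sketch below; the 3 × 3 kernel size and the interleaving that assembles the double-resolution output are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class NPM(nn.Module):
    """Four directional kernels W1..W4 predict the upper-left, upper-right,
    lower-left and lower-right neighbors of each target position; the four
    directional maps are interleaved to recover the full-resolution result."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.dirs = nn.ModuleList(
            [nn.Conv2d(in_ch, num_classes, 3, padding=1) for _ in range(4)]
        )

    def forward(self, x):
        n, _, h, w = x.shape
        ul, ur, ll, lr = [conv(x) for conv in self.dirs]
        out = x.new_zeros(n, ul.shape[1], 2 * h, 2 * w)
        out[:, :, 0::2, 0::2] = ul   # upper-left neighbors
        out[:, :, 0::2, 1::2] = ur   # upper-right neighbors
        out[:, :, 1::2, 0::2] = ll   # lower-left neighbors
        out[:, :, 1::2, 1::2] = lr   # lower-right neighbors
        return out

# e.g. NPM(64, 11)(torch.randn(1, 64, 32, 32)).shape -> (1, 11, 64, 64)
```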

3.3 Level-Wise Relay Training for NENet

The encoder-decoder network has two subnetworks, an encoder and a decoder. Although skip layers are used in the ResBlocks, the gradient vanishing problem still exists to some degree. In this respect, the gradient values in the lower layers are smaller, so the lower layers are less able to refine the information, including the neighboring correlation, that propagates to the top.

In order to help the network extract more semantics and neighboring correlation, we propose level-wise relay training. We append a neighborhood prediction module (NPM) to each ResBlock before an upsampling block to approximate the target at different levels and compare the approximations with the ground truth, as shown in Fig. 1. The scaled ground truths are obtained from the original ground truth by nearest-neighbor interpolation. Each level of loss is computed between the corresponding level of output and its scaled ground truth (Fig. 4).

Fig. 4. Level-wise relay training for NENet.

The training process proceeds from the \( {\text{Level}}1 \) loss to the \( {\text{Level}}3 \) loss. Because the information stream propagates from bottom to top, and the details of the input decrease along the propagation path, we insert the relay training process to help the network maintain and extract more semantics. The NPM further helps extract the neighboring correlation information.

By adding the relay NPMs, the approximation of the target is computed step by step, from coarse to fine.
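The level-wise losses can be sketched as follows (our illustration; the number of levels, the per-level cross entropy and the hypothetical `model` returning one output per level are assumptions):

```python
import torch
import torch.nn.functional as F

def level_losses(level_outputs, gt):
    """One cross entropy loss per level.
    level_outputs: list of (N, C, h_i, w_i) NPM outputs, coarse to fine.
    gt: (N, H, W) full-resolution integer ground truth."""
    losses = []
    for out in level_outputs:
        # Scale the ground truth to this level by nearest interpolation.
        scaled = F.interpolate(gt[:, None].float(),
                               size=out.shape[-2:], mode="nearest")
        losses.append(F.cross_entropy(out, scaled[:, 0].long()))
    return losses

# Relay schedule (sketch): train on one level's loss at a time,
# from the Level1 loss upward, instead of a single combined loss.
# for level in range(num_levels):
#     for images, gt in loader:
#         loss = level_losses(model(images), gt)[level]
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
```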

4 Experiment

We set up experiments on the CamVid and Cityscapes datasets to evaluate our NENet. The computing platform is an NVIDIA RTX 2080 Ti, and NENet is implemented with the PyTorch toolkit.

CamVid.

The Cambridge-driving Labeled Video Database (CamVid) is a street scene dataset captured from the perspective of a driving automobile. CamVid consists of 367 images for training and 233 images for testing, covering 11 classes, at a resolution of 480 × 360.

Cityscapes.

Cityscapes is also a street scene dataset; it consists of 2975 images for training, 500 for validation and 1525 for testing, at a resolution of 2048 × 1024. In order to speed up the training and inference process, we downsample the images and ground truth to a resolution of 1024 × 512.

The PASCAL VOC intersection-over-union metric (IoU) [14] is used to evaluate the methods on CamVid and Cityscapes, and mean IoU (mIoU) describes the performance on the whole dataset. The definition of IoU is

$$ \text{IoU} = \frac{TP}{TP + FP + FN}. $$
(1)

where \( TP \), \( FP \) and \( FN \) are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set.
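A straightforward NumPy implementation of Eq. (1), accumulated per class and averaged into the mean IoU (our sketch), is:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore=255):
    """IoU = TP / (TP + FP + FN) per class, per Eq. (1), with pred and gt
    as integer label arrays accumulated over the whole test set."""
    pred, gt = pred[gt != ignore], gt[gt != ignore]
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fp + fn > 0:          # skip classes absent from both
            ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))
```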

4.1 Ablation Study

In order to evaluate the different parts of our NENet, we design two comparison experiments to verify the effectiveness of the network structure and of the training strategy.

We first evaluate our baseline ENet, then add the skip layers and the neighborhood prediction module (NPM) separately, constructing NENet step by step. The results are shown in Table 1.

Table 1. Comparison of different settings of network.

We also set up a comparison experiment to evaluate our level-wise relay loss [21]. "Level1 loss only" means the training process contains only one step, with the Level1 loss. In the weighted-loss setting, we assign a weight to each level of loss and train the network in one step; the weight values are \( 8, 4, 2, 1 \) from the Level1 loss to the final loss. The summed loss is also one-step training, but the final loss is the sum of the four levels of loss (Table 2).

Table 2. Comparison of different types of training strategy based on NENet.
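For reference, the two one-step baselines of Table 2 combine the level losses as sketched below (placeholder loss values; the relay schedule instead optimizes one level at a time):

```python
import torch

losses = [torch.rand(()) for _ in range(4)]   # placeholder level losses
weights = [8, 4, 2, 1]                        # Level1 loss ... final loss
weighted_loss = sum(w * l for w, l in zip(weights, losses))  # "weighted"
summed_loss = sum(losses)                                    # "summed"
```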

4.2 Result on CamVid Dataset

In order to compare our NENet with the benchmark network ENet and other popular lightweight segmentation networks such as FCN and SegNet, we run experiments on the CamVid dataset and report per-class segmentation results. As shown in Table 3, owing to the NPM and the new training strategy, NENet performs better on small targets and details; for example, the IoU is much higher for sign, fence and cyclist. NENet also performs better on the overall metric, reaching an mIoU of 64.01, which is 9.61 higher than the original ENet. Fig. 5 shows result samples of NENet.

Table 3. Results on CamVid.
Fig. 5. Result samples of CamVid. From left to right: (a) Input, (b) NENet, (c) Ground truth.

4.3 Result on Cityscapes Dataset

We also conduct a comparative experiment on Cityscapes. To speed up training and reduce the computational load, we resize the training data so that each training image is 1024 × 512. As shown in Table 4, NENet achieves better results than SegNet, ENet and ESPNet. Fig. 6 shows segmentation results of NENet.

Table 4. Results on Cityscapes dataset.
Fig. 6. Result samples on Cityscapes. From left to right: (a) Input, (b) NENet, (c) Ground truth.

5 Conclusion

This paper proposes NENet for semantic segmentation. The neighborhood prediction module (NPM) in the decoder trains the network to encode more neighboring correlation into the pixel features and makes use of this correlation to enhance segmentation performance. A level-wise relay training strategy is designed to ensure training efficiency with the NPM. NENet achieves impressive results on the CamVid and Cityscapes datasets and has a bright prospect.