1 Introduction

Semantic segmentation can be described as a classification task in which each pixel of the input is recognized by analyzing its global and local context. Traditional methods [17, 20] extract features based on texture and color information, cluster pixels into blobs and then analyze the semantics of the blobs.

With the success of deep learning, deep neural networks are widely adopted in the semantic segmentation task [18, 19]. In neural networks, semantic segmentation is treated as pixel-level classification, where a feature map is computed and then classified pixel by pixel. In other words, pixels in the feature map are recognized individually, without the clustering process of traditional methods. Although the neural nodes in the high-level layers of a deep network have wide receptive fields and rich local semantics, training the pixels individually causes the output to lose the correlation between neighboring pixels that is present in the pixel features. As a result, methods based on deep learning often perform poorly on edges and details, while the results of traditional segmentation methods often have clear edges.

One remedy for segmentation networks is to add a post-processing step that adjusts the results based on the texture and color of the inputs, such as Conditional Random Fields (CRF) [2]. CRF is a graphical model that helps approximate the posterior distribution of the results given the network inputs and outputs. However, CRF involves heavy computation and adds latency.
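For illustration only (the paper does not provide CRF code), a typical dense-CRF post-processing step with the pydensecrf package is sketched below; the pairwise parameters and the random placeholder inputs are our assumptions.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

# Placeholder inputs (assumed): the softmax output of a segmentation
# network, shape (n_labels, H, W), and the matching uint8 RGB image.
H, W, n_labels = 360, 480, 11
probs = np.random.rand(n_labels, H, W).astype(np.float32)
probs /= probs.sum(axis=0, keepdims=True)
image = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)

d = dcrf.DenseCRF2D(W, H, n_labels)            # width, height, labels
d.setUnaryEnergy(unary_from_softmax(probs))    # -log(probs) as unary term
d.addPairwiseGaussian(sxy=3, compat=3)         # location-based smoothing
d.addPairwiseBilateral(sxy=80, srgb=13, rgbim=image, compat=10)  # color-aware
Q = np.array(d.inference(5)).reshape(n_labels, H, W)
refined = Q.argmax(axis=0)                     # adjusted label map
```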

In order to obtain fine features for semantic segmentation in neural networks, the input passes through a backbone network that encodes its high-level semantics and then through an upsampling network that recovers the spatial details. This encoder-decoder structure helps the feature map capture both high-level semantics and low-level details. Each pixel in the feature map is computed from a number of neighboring neural nodes in previous layers, so the feature of each pixel contains not only the semantics but also the neighboring correlation. Through the encoder network, the spatial resolution of the feature map decreases while the number of channels increases, which means the local context information, or neighboring correlation, is encoded into the pixel features of the feature maps.

As semantic segmentation is treated as a classification task, the loss function in segmentation networks is often designed as a pixel-wise cross entropy loss [15]. Although the pixel features have the capacity to capture neighboring correlation, a pixel-wise loss function does not guide the network to learn it. If the neighboring correlation is taken into account in the loss function, the potential of the network can be explored further.
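As a minimal illustration of this point (our sketch, not the paper's code), the standard pixel-wise cross entropy loss in PyTorch scores every spatial location independently:

```python
import torch
import torch.nn as nn

# Assumed shapes for illustration: 11 classes (as in CamVid), 64x64 logits.
logits = torch.randn(2, 11, 64, 64)          # (N, C, H, W) decoder output
labels = torch.randint(0, 11, (2, 64, 64))   # (N, H, W) per-pixel class ids

# CrossEntropyLoss averages over all pixels independently: no term in the
# loss couples a pixel to its neighbors, which is the limitation noted above.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)
```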

Methods based on the encoder-decoder structure have an encoder network and a decoder network. The training difficulty and the optimization state are affected by the complexity of the network, such as its depth and the number of convolution kernels. ResNet [1] was proposed to ease the training process by accumulating residuals to approximate the target. With skip layers, the training target is converted to sparse residuals, which makes it possible to train ResNets with up to 1000 layers. ResNet alleviates network degradation and gradient vanishing, and the skip layer has become one of the most commonly used layers. Optimizing the training strategy can also help the network converge. Relay backpropagation was proposed for effective learning of deep convolutional neural networks; by training sub-networks separately, it helps the network converge to a better state.

This paper proposes the neighborhood encoding network (NENet), which extracts more neighboring correlation for semantic segmentation by training the network to encode neighboring correlation into the pixel features of the feature map. A new relay loss is also designed for level-wise training. The proposed NENet is evaluated on the CamVid [3] and Cityscapes [4] datasets and achieves impressive results.

2 Related Work

Semantic segmentation is widely used in various fields, and the segmentation results also serve as masks in other tasks, such as pedestrian detection and landmark localization. Efforts have been made to improve performance and training efficiency.

2.1 Context Encoding for Semantic Segmentation

In order to enhance the classification capability, different types of layers have been proposed to generate a fine feature map. In addition to adopting more powerful backbone networks, recent methods also improve the context encoding ability by combining features with different encoded semantics.

In order to enlarge the receptive field of the convolution layer, DeepLab proposes atrous convolution [5], which covers a larger receptive field with the same number of parameters as a conventional convolution.

Spatial Pyramid Pooling (SPP) [8] was first proposed to deal with the multi-scale problem in object detection. SPP uses pooling operators to compute a spatial pyramid over the sampled features and then combines them into features containing multi-scale semantics for multi-scale detection.

Combining atrous convolution and SPP to enhance context encoding has become a popular practice. DeepLab V2 and V3 construct the Atrous Spatial Pyramid Pooling (ASPP) module, which enriches feature semantics by combining features computed by convolutions with various atrous rates. DenseASPP [6] takes advantage of DenseNet [7] to further enhance the ASPP module. PSPNet [9] designs the PSP module to combine features at different scales for segmentation and achieves state-of-the-art results.

2.2 Pixel Level Correlation Extraction in Pixel Level Tasks

As encoder networks extract feature maps with rich global and local context, some methods attempt to obtain pixel-level correlation within the network.

Adaptive Affinity Fields (AAF) [10] are proposed to analyze and match the correlation of neighboring pixels in semantic segmentation; an extra affinity field matching loss learns optimal affinity fields and improves the segmentation of spatial structures and small details.

EncNet [11] studies the impact of global context in semantic segmentation by capturing the semantic context of scenes and selectively highlighting class-dependent feature maps. With this semantic context, EncNet significantly improves performance.

2.3 Training Strategy in Segmentation Networks

In pixel-level tasks, methods based on deep neural networks often append an upsampling network to the backbone encoder network to recover the resolution. The extra network inevitably increases the depth of the network and the difficulty of training. Different methods have been proposed to ease the training process and help the network converge to a better state.

The skip layer [21] is proposed in ResNet; it adds the identity of the input to the output of a ResBlock. This practice converts the training target of the layers to the residual of the labels, which makes the network easier to train.
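A minimal PyTorch sketch of such a block (our illustration; actual ResNet blocks differ in details such as bottlenecks and downsampling) makes the identity skip explicit:

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """The skip connection adds the input identity to the block output,
    so the layers only have to learn the residual."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity skip
```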

Batch Normalization [16] is proposed to alleviate the internal covariate shift problem by normalizing layer inputs. Batch Normalization allows the use of higher learning rates and reduces the sensitivity to weight initialization.

Auxiliary loss functions are also used for network training in many methods. PSPNet [9] and BiSeNet [12] append extra auxiliary loss functions to hidden layers and train the network with a weighted loss. The relay backpropagation strategy divides the network into several subnetworks and trains them separately with their own loss functions.

3 Proposed Method

The neighborhood encoding network (NENet) attempts to encode the pixel-level correlation in the network and utilize this correlation for better segmentation. The following subsections describe the network structure, the NPM block and the training strategy.

3.1 Neighborhood Encoding Network

We design our neighborhood encoding network (NENet) based on the encoder-decoder structure. The overall design refers to the lightweight semantic segmentation network ENet [13]. NENet is set up with an encoder network and a decoder network, as shown in Fig. 1. The encoder network consists of an initialization block, two downsampling blocks and 20 ResBlocks interleaved among the initialization and downsampling blocks; the decoder network consists of two upsampling blocks, a neighborhood prediction module (NPM) and 3 ResBlocks interleaved among the upsampling blocks and the NPM. Compared with the original ENet, we add skip layers in NENet and replace the last deconvolution layer with our NPM. We train NENet with our level-wise relay strategy.

Fig. 1. Structure of NENet.

3.2 Neighborhood Prediction Module

Review of Deconvolution

In order to decode the spatial semantics from the low-resolution, high-level feature map, the common practice is either to interpolate the feature map and approximate the target with a convolution, or to apply a deconvolution directly. Since the interpolation operator is a special case of deconvolution, the two practices are essentially the same: at testing phase, every target pixel is approximated with the same convolution kernel in the final layer, as illustrated in Fig. 2.

Fig. 2. Illustration of deconvolution operators. (Color figure online)

The lower blue grid is the input with dashed-line fillers, while the upper green grid is the output. The actual operation of deconvolution is just convolution after upsampling, so each pixel in the output is computed with the same weight \( W \), regardless of its position or its relationship to its neighborhood. Although the feature map has a large receptive field and rich neighboring correlation, the deconvolution predictor computes the result from the individual feature of each pixel. In order to utilize the neighboring correlation for semantic segmentation, we put forward the neighborhood prediction module in the following subsection.
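This view can be checked numerically. The sketch below (our illustration, single channel for simplicity) verifies that a stride-2 deconvolution equals zero-insertion upsampling followed by an ordinary convolution with the flipped kernel:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 4, 4)    # low-resolution input feature
w = torch.randn(1, 1, 2, 2)    # the single shared kernel W

# Deconvolution (transposed convolution) with stride 2.
y1 = F.conv_transpose2d(x, w, stride=2)                  # (1, 1, 8, 8)

# Same result by hand: insert zeros between pixels (the dashed fillers
# of Fig. 2), pad, then run a plain convolution with the flipped kernel.
up = torch.zeros(1, 1, 7, 7)
up[:, :, ::2, ::2] = x
y2 = F.conv2d(F.pad(up, (1, 1, 1, 1)), w.flip([-1, -2]))
print(torch.allclose(y1, y2, atol=1e-6))                 # expected: True
```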

Design of Neighborhood Prediction Module

In order to extract neighboring correlation, we design the neighborhood prediction module to predict the four neighboring pixels of the target from the feature at the corresponding position of the feature map.

As shown in Fig. 3, the neighboring pixels of the target (the upper-left, lower-left, upper-right and lower-right pixels) are approximated separately by four convolution kernels \( W_{1} , W_{2} , W_{3} , W_{4} \) based on the feature. Unlike the deconvolution operator, the NPM uses four directional kernels to predict four neighboring target maps, which makes good use of the neighboring correlation extracted by the encoder. After approximation by the four kernels, the four directional maps can be combined to recover the complete target.

Fig. 3. Illustration of NPM in NENet.

At the training phase, because the output is computed by the NPM, the layers before the NPM are trained to extract more neighboring correlation so that the NPM can recover results closer to the labels. The NPM therefore helps extract more neighboring correlation at training phase and makes better use of it at testing phase.
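Under our reading of Fig. 3, such a module could look like the minimal PyTorch sketch below; the 3 × 3 kernel size and the interleaving that assembles the double-resolution output are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class NPM(nn.Module):
    """Four directional kernels W1..W4 predict the upper-left, upper-right,
    lower-left and lower-right neighbors of each target position; the four
    directional maps are interleaved to recover the full-resolution result."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.dirs = nn.ModuleList(
            [nn.Conv2d(in_ch, num_classes, 3, padding=1) for _ in range(4)]
        )

    def forward(self, x):
        n, _, h, w = x.shape
        ul, ur, ll, lr = [conv(x) for conv in self.dirs]
        out = x.new_zeros(n, ul.shape[1], 2 * h, 2 * w)
        out[:, :, 0::2, 0::2] = ul   # upper-left neighbors
        out[:, :, 0::2, 1::2] = ur   # upper-right neighbors
        out[:, :, 1::2, 0::2] = ll   # lower-left neighbors
        out[:, :, 1::2, 1::2] = lr   # lower-right neighbors
        return out

# e.g. NPM(64, 11)(torch.randn(1, 64, 32, 32)).shape -> (1, 11, 64, 64)
```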

3.3 Level-Wise Relay Training for NENet

The encoder-decoder network has two subnetworks, an encoder and a decoder. Although skip layers are used in the ResBlocks, the gradient vanishing problem still exists to some degree. In this respect, the gradient values in the lower layers are smaller, so the lower layers are less able to refine the information, including the neighboring correlation, that propagates to the top.

In order to help the network extract more semantics and neighboring correlation, we propose level-wise relay training. We append a neighborhood prediction module (NPM) to each ResBlock before an upsampling block to approximate the target at different levels and compare the approximations with the ground truth, as shown in Fig. 1. The scaled ground truths are obtained from the original ground truth by nearest-neighbor interpolation. Each level of loss is computed between the corresponding level of output and its scaled ground truth (Fig. 4).

Fig. 4. Level-wise relay training for NENet.

The training process proceeds from the \( {\text{Level}}1 \) loss to the \( {\text{Level}}3 \) loss. Because the information stream propagates from bottom to top, and the details of the input decrease along the propagation path, we insert the relay training process to help the network maintain and extract more semantics. The NPM further helps extract the neighboring correlation information.

By adding the relay NPMs, the approximation of the target is computed step by step, from coarse to fine.
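The level-wise losses can be sketched as follows (our illustration; the number of levels, the per-level cross entropy and the hypothetical `model` returning one output per level are assumptions):

```python
import torch
import torch.nn.functional as F

def level_losses(level_outputs, gt):
    """One cross entropy loss per level.
    level_outputs: list of (N, C, h_i, w_i) NPM outputs, coarse to fine.
    gt: (N, H, W) full-resolution integer ground truth."""
    losses = []
    for out in level_outputs:
        # Scale the ground truth to this level by nearest interpolation.
        scaled = F.interpolate(gt[:, None].float(),
                               size=out.shape[-2:], mode="nearest")
        losses.append(F.cross_entropy(out, scaled[:, 0].long()))
    return losses

# Relay schedule (sketch): train on one level's loss at a time,
# from the Level1 loss upward, instead of a single combined loss.
# for level in range(num_levels):
#     for images, gt in loader:
#         loss = level_losses(model(images), gt)[level]
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
```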

4 Experiment

We set up experiments on the CamVid and Cityscapes datasets to evaluate our NENet. The computing platform is an NVIDIA RTX 2080 Ti, and NENet is implemented with the PyTorch toolkit.

CamVid.

The Cambridge-driving Labeled Video Database (CamVid) is a street scene dataset captured from the perspective of a driving automobile. CamVid consists of 367 images for training and 233 images for testing, covering 11 classes, at a resolution of 480 × 360.

Cityscapes.

Cityscapes is also a street scene dataset; it consists of 2975 images for training, 500 for validation and 1525 for testing, at a resolution of 2048 × 1024. In order to speed up the training and inference process, we downsample the images and ground truth to a resolution of 1024 × 512.

The PASCAL VOC intersection-over-union metric (IoU) [14] is used to evaluate the methods on CamVid and Cityscapes, and mean IoU (mIoU) describes the performance on the whole dataset. The definition of IoU is

$$ \text{IoU} = \frac{TP}{TP + FP + FN}. $$
(1)

where \( TP \), \( FP \) and \( FN \) are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set.
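A straightforward NumPy implementation of Eq. (1), accumulated per class and averaged into the mean IoU (our sketch), is:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore=255):
    """IoU = TP / (TP + FP + FN) per class, per Eq. (1), with pred and gt
    as integer label arrays accumulated over the whole test set."""
    pred, gt = pred[gt != ignore], gt[gt != ignore]
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        if tp + fp + fn > 0:          # skip classes absent from both
            ious.append(tp / (tp + fp + fn))
    return float(np.mean(ious))
```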

4.1 Ablation Study

In order to evaluate the different parts of our NENet, we design two comparison experiments to verify the effectiveness of the network structure and of the training strategy.

We first evaluate our baseline ENet, then add the skip layers and the neighborhood prediction module (NPM) separately, constructing NENet step by step. The results are shown in Table 1.

Table 1. Comparison of different settings of network.

We also set up a comparison experiment to evaluate our level-wise relay loss [21]. "Level1 loss only" means the training process contains only one step, with the Level1 loss. In the weighted-loss setting, we assign a weight to each level of loss and train the network in one step; the weight values are \( 8, 4, 2, 1 \) from the Level1 loss to the final loss. The summed loss is also one-step training, but the final loss is the sum of the four levels of loss (Table 2).

Table 2. Comparison of different types of training strategy based on NENet.
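For reference, the two one-step baselines of Table 2 combine the level losses as sketched below (placeholder loss values; the relay schedule instead optimizes one level at a time):

```python
import torch

losses = [torch.rand(()) for _ in range(4)]   # placeholder level losses
weights = [8, 4, 2, 1]                        # Level1 loss ... final loss
weighted_loss = sum(w * l for w, l in zip(weights, losses))  # "weighted"
summed_loss = sum(losses)                                    # "summed"
```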

4.2 Result on CamVid Dataset

In order to compare our NENet with the benchmark network ENet and other popular lightweight segmentation networks such as FCN and SegNet, we run experiments on the CamVid dataset and report per-class segmentation results. As shown in Table 3, owing to the NPM and the new training strategy, NENet performs better on small targets and details; for example, the IoU is much higher for sign, fence and cyclist. NENet also performs better on the overall metric, reaching an mIoU of 64.01, which is 9.61 higher than the original ENet. Fig. 5 shows result samples of NENet.

Table 3. Results on CamVid.
Fig. 5. Result samples of CamVid. From left to right: (a) Input, (b) NENet, (c) Ground truth.

4.3 Result on Cityscapes Dataset

We also conduct a comparative experiment on Cityscapes. To speed up training and reduce the computational load, we resize the training data so that each training image is 1024 × 512. As shown in Table 4, NENet achieves better results than SegNet, ENet and ESPNet. Fig. 6 shows segmentation results of NENet.

Table 4. Results on Cityscapes dataset.
Fig. 6. Result samples on Cityscapes. From left to right: (a) Input, (b) NENet, (c) Ground truth.

5 Conclusion

This paper proposes NENet for semantic segmentation. The neighborhood prediction module (NPM) in the decoder trains the network to encode more neighboring correlation into the pixel features and makes use of this correlation to enhance segmentation performance. A level-wise relay training strategy is designed to ensure training efficiency with the NPM. NENet achieves impressive results on the CamVid and Cityscapes datasets and has a bright prospect.