
1 Introduction

Over the past few years, the performance of semantic image segmentation, a per-pixel classification problem, has been dramatically advanced by fully convolutional network (FCN) based approaches [1]. Generally, an FCN can be converted from a classification model [2,3,4] pre-trained on ImageNet [5] by replacing the fully connected layers with corresponding convolutional ones. However, due to the strided layers in an FCN, the result is usually poor where small objects or object boundaries exist, namely in segmentation detail. To address this problem, atrous convolution [6] and many other novel modules [7, 8] have been introduced, attempting to preserve or recover spatial information when performing strided operations. However, most of this work seems to give the model the capability to handle spatial information rather than making it learn to do so.

The current mainstream loss function for semantic image segmentation is cross entropy, which treats all pixels equally. However, in the context of semantic segmentation, there are some pixels for which the model finds it more difficult to make the right prediction. This also matches intuition: a human can easily draw a rough outline around an object in an image, but segmenting it precisely requires careful consideration. This motivates us to develop a method that pays more attention to these pixels.

In this paper, we introduce two region-based metrics to analyze the performance bottleneck of the model and, based on this analysis, we propose a simple yet effective loss function \(\mathcal {L}_\mathrm{{cehe}}\) by combining cross entropy with hard examples [9], which alleviates the problem revealed by the region-based metrics. The proposed \(\mathcal {L}_\mathrm{{cehe}}\) can be implemented as the cross entropy loss \(\mathcal {L}_\mathrm{{ce}}\) with pixel-wise weights, so it can replace \(\mathcal {L}_\mathrm{{ce}}\) without degrading training speed. Experiments show that the model trained with \(\mathcal {L}_\mathrm{{cehe}}\) outperforms its counterpart trained with \(\mathcal {L}_\mathrm{{ce}}\) by \(1.12\%\) in terms of MIoU on the Cityscapes validation set, and by \(4.15\%\) in terms of the region-based metric MIoUiER proposed in this paper, indicating that our proposed method performs better in segmentation detail.

In summary, our contributions are:

  • We propose two region-based metrics which can quantitatively evaluate the performance of segmentation detail.

  • By analyzing the model with the region-based metrics, we identify the key factor that limits its performance, which can provide insights for future research.

  • We propose a simple yet effective loss function \(\mathcal {L}_\mathrm{{cehe}}\), which outperforms the widely used cross entropy \(\mathcal {L}_\mathrm{{ce}}\).

2 Related Work

Approaches based on FCN have made remarkable progress in the field of semantic image segmentation. However, some properties, such as spatial invariance, which make deep convolutional networks successful in image classification, are precisely the factors that prevent the model from producing fine-grained segmentation. A considerable part of the research focuses on preserving or recovering spatial information [6, 8, 10, 11]. Besides, compared with classification, this dense prediction task has its own properties which we need to take into consideration. Current methods for the problems in semantic image segmentation can be divided into three categories: approaches addressing the intra-class inconsistency problem, the inter-class indistinction problem, or both simultaneously [12]. In this paper, we focus on the inter-class indistinction problem.

Atrous convolution [6, 13] is a solution for preserving spatial information while keeping the receptive field size. Moreover, it introduces no extra computation, since it sparsely samples the input feature map. Currently, nearly all segmentation models replace the conventional convolutions in the deep layers with atrous ones, considering the trade-off between performance and memory usage. However, simply stacking atrous convolutions may cause the gridding issue described in [14], which proposes Hybrid Dilated Convolution (HDC) to alleviate this problem.

As for recovering spatial information, there is no general solution. The most straightforward method is to combine low-level spatial information with high-level semantic information by simply adding or stacking them together. This idea produces the encoder-decoder family exemplified by UNet [15], which shows good performance in the field of medical image segmentation. However, in the general semantic segmentation task, due to the complex content of the input image, the above method tends to cause confusion when fusing features of different levels. Thus, various methods have been proposed to alleviate this problem. For example, SegNet [7], an instance of the encoder-decoder family, memorizes the indices of the maximal responses during max-pooling and then reuses them in the decoder stage. RefineNet [8] presents a multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections.

The concept of hard examples [9], proposed in the field of object detection, can be generalized to the semantic segmentation task by regarding each pixel as an example. Based on this, [16] only back-propagates the gradients of hard examples determined by the predicted probability and a threshold, which is a fairly direct extension of the idea stated in [9]. [17] divides a deep model into several cascaded sub-models, and each sub-model only operates on the hard examples selected by the previous one; here hard examples are also defined by a pre-defined threshold and the predicted probability. These works define the term hard exactly as in [9]. Different from them, we define hard examples based on our analysis of the model, and then integrate this information into the loss function via a multi-task learning scheme.

3 Method

3.1 Region Partition

Given an image, we divide it into two parts: the edge region and the object region. Formally, assuming (x, y) and the set \(I = \{(x,y)\}\) represent the coordinate of a pixel and all pixels in an image respectively, we define the object region \(I_\mathrm{{object}} = I - I_\mathrm{{edge}}\), where \(I_\mathrm{{edge}}\) denotes the edge region. From the definition of these two regions, the key step of the region partition is to obtain \(I_\mathrm{{edge}}\). The following describes how to obtain the edge map of an image and how to compute \(I_\mathrm{{edge}}\) efficiently and quantitatively.

Fig. 1. Using the Canny algorithm to extract the edge map from the ground truth in Cityscapes. Best viewed in color.

In datasets for the semantic image segmentation task, there are usually two types of images: the original image and the ground truth. The value of a pixel in the ground truth represents the target class of the corresponding pixel in the original image. Because of this property of the ground truth, we can utilize the Canny [18] algorithm to extract the edge map of an image by setting both thresholds \(t_1\) and \(t_2\) of Canny to 0. Some examples are illustrated in Fig. 1.
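
To make this step concrete, the following is a minimal sketch of the edge-map extraction using OpenCV's cv2.Canny; the file path and variable names are illustrative, and the ground truth is assumed to be stored as a single-channel label-id image.

```python
import cv2
import numpy as np

# Load a ground-truth label map (single channel, one class id per pixel).
# The path is illustrative only.
gt = cv2.imread("ground_truth_labelIds.png", cv2.IMREAD_GRAYSCALE)

# With both thresholds t1 and t2 set to 0, any change of label value between
# neighboring pixels is kept, so the output marks the class boundaries.
edge_map = cv2.Canny(gt, threshold1=0, threshold2=0)

# cv2.Canny outputs 255 on edges and 0 elsewhere; binarize it to obtain
# the edge map E used below.
E = (edge_map > 0).astype(np.uint8)
```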

In order to obtain \(I_\mathrm{{edge}}\) efficiently and quantitatively, we use the chessboard distance as the distance between pixels in an image. Assuming that two pixels \(q_i, q_j \in I\) have distance \(d_{ij}\), the edge region \(I_\mathrm{{edge}}\) is quantitatively defined based on \(d_{ij}\). Let \(I_\mathrm{{canny}}\) denote the set of edge pixels obtained by the Canny algorithm described above, as shown in Fig. 1(b); then we define \(I_\mathrm{{edge}}^{(r)} = \{q\,|\,q \in I \ \mathrm{{and}} \ \exists \, q_\mathrm{{canny}} \in I_\mathrm{{canny}}: d(q, q_\mathrm{{canny}}) < r\}\), where r is called the radius of the edge region. By using the chessboard distance, the computation of \(I_\mathrm{{edge}}^{(r)}\) can be efficiently implemented with a convolution operation in a modern deep learning framework.

Assume \(E_{\mathrm{{H}} \times \mathrm{{W}}}\) represents the edge map obtained by the Canny algorithm, where a pixel's value is 1 if it is an edge pixel and 0 otherwise, as shown in Fig. 1(b). The method for efficiently computing \(I_\mathrm{{edge}}^{(r)}\) is summarized in Algorithm 1.

Algorithm 1. Efficient computation of \(I_\mathrm{{edge}}^{(r)}\) from the edge map E.
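
Since Algorithm 1 is not reproduced here, the snippet below gives one possible implementation of the idea under the stated assumptions: the edge map E is a binary H x W tensor, and the dilation by chessboard distance r is written as a convolution with an all-ones kernel, as suggested above.

```python
import torch
import torch.nn.functional as F

def edge_region(E: torch.Tensor, r: int) -> torch.Tensor:
    """Return the boolean mask of I_edge^(r) given a binary edge map E (H x W).

    A pixel belongs to the edge region if its chessboard distance to some
    edge pixel is smaller than r, i.e. it lies in a (2r-1) x (2r-1) square
    window centered on an edge pixel. This is a morphological dilation,
    expressed here with a convolution.
    """
    k = 2 * r - 1                                   # side of the square window
    kernel = torch.ones(1, 1, k, k, dtype=torch.float32)
    e = E.float().unsqueeze(0).unsqueeze(0)         # shape 1 x 1 x H x W
    hits = F.conv2d(e, kernel, padding=r - 1)       # edge pixels per window
    return hits.squeeze(0).squeeze(0) > 0           # True inside I_edge^(r)
```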

3.2 Loss Function

Cross Entropy. Cross entropy is widely used as the loss function for semantic segmentation and can be formulated as Eq. 1.

$$\begin{aligned} \mathcal {L}_\mathrm{{ce}}=-\frac{1}{N}\sum _{i=1}^N \sum _{j=1}^K \mathcal {I}\{y_i=j\}\log {p_{ij}} \end{aligned}$$
(1)

where N and K represent the number of pixels and classes, respectively. \(y_i\) is the target class of pixel i, and \(p_{ij}\) is the probability of pixel i being assigned to class j. \(\mathcal {I}\{\cdot \}\) is the indicator function whose value is 1 if the condition is satisfied and 0 otherwise.

Combining Cross Entropy with Hard Examples. \(\mathcal {L}_\mathrm{{ce}}\) implies that all pixels contribute equally to the total loss; however, some pixels are more difficult to predict correctly, as detailed in [16, 19]. Different from these works, we combine cross entropy with manually defined hard examples via a multi-task learning scheme, as shown in Eq. 2.

$$\begin{aligned} \mathcal {L}_\mathrm{{cehe}} = \mathcal {L}_\mathrm{{ce}} + \lambda \mathcal {L}_\mathrm{{he}} \end{aligned}$$
(2)

where \(\mathcal {L}_\mathrm{{he}}\) is the loss function for hard examples and \(\lambda \) is a weight factor balancing the two losses.

We manually define the pixels in the edge region as hard examples for the semantic segmentation task, with the radius r of the edge region as a hyper-parameter. The reason for this definition is discussed in the experiment section. The function m(i) indicates whether pixel i is a hard example and is defined as follows:

$$\begin{aligned} m(i) = \left\{ \begin{aligned} 1,&\quad \text {pixel}\ i \in I_{\text {edge}}^{(r)}\\ 0,&\quad \text {otherwise} \end{aligned}\right. \end{aligned}$$
(3)

Then we can formulate \(\mathcal {L}_\mathrm{{he}}\) as:

$$\begin{aligned} \mathcal {L}_\mathrm{{he}} = -\frac{1}{N}\sum _{i=1}^N m(i)\sum _{j=1}^K \mathcal {I}\{y_i=j\}\log {p_{ij}} \end{aligned}$$
(4)

Different from conventional multi-task learning, here we compute \(\mathcal {L}_\mathrm{{ce}}\) and \(\mathcal {L}_\mathrm{{he}}\) on the same logits output by the model, so they can be merged into a single loss function, shown below:

$$\begin{aligned} \mathcal {L}_\mathrm{{cehe}} =-\frac{1}{N}\sum _{i=1}^N \sum _{j=1}^K \mathcal {I}\{y_i=j\}(1 + \lambda m(i))\log {p_{ij}} \end{aligned}$$
(5)
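
In practice, Eq. 5 amounts to a pixel-wise weighted cross entropy. The following is a minimal PyTorch sketch; the function name, the ignore_index value and the default \(\lambda \) are illustrative.

```python
import torch
import torch.nn.functional as F

def cehe_loss(logits, target, m_map, lam=2.0, ignore_index=255):
    """Pixel-wise weighted cross entropy corresponding to Eq. (5).

    logits:  B x K x H x W raw scores produced by the model
    target:  B x H x W ground-truth class indices
    m_map:   B x H x W values of m(i) (binary for a single-level edge region,
             or the hierarchical weights of Eq. (6))
    lam:     the weight factor lambda
    """
    # Per-pixel cross entropy, -log p_{i, y_i}, kept unreduced.
    ce = F.cross_entropy(logits, target, reduction="none",
                         ignore_index=ignore_index)
    # Weight every pixel by (1 + lambda * m(i)) and average over all pixels.
    weight = 1.0 + lam * m_map.float()
    return (weight * ce).mean()
```
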
Fig. 2. Visualization of a hierarchical edge region with 3 levels. The colored part represents the edge region, and different colors indicate different levels. Best viewed in color.

3.3 Hierarchical Edge Region

In our experiments, the performance of \(\mathcal {L}_\mathrm{{cehe}}\) largely depends on the choice of \(\lambda \) and r. In order to alleviate this problem, we further divide the edge region into different levels according to the shortest distance between a pixel in the edge region and the edge pixels obtained by the Canny algorithm. Formally, \(I_\mathrm{{edge}}^{(r_1, r_2, \cdots , r_n)}\) represents an edge region of n levels, and the ith level equals the set \(I_\mathrm{{edge}}^{(r_i)} - I_\mathrm{{edge}}^{(r_{i - 1})}\) when \(i > 1\) or \(I_\mathrm{{edge}}^{(r_1)}\) when \(i = 1\). Figure 2 shows some examples.

We redefine m(i) according to the number of levels, as shown below:

$$\begin{aligned} m(i) = \left\{ \begin{aligned} n - l(i) + 1,&\quad \text {pixel}\ i \in I_{\text {edge}}^{(r_1, \cdots , r_n)}\\ 0,&\quad \text {otherwise} \end{aligned}\right. \end{aligned}$$
(6)

where \(l(i) \in \{1, 2, \cdots , n\}\) is the level index of pixel i.
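
A possible way to build the hierarchical weights m(i) of Eq. 6 is sketched below; it reuses the edge_region helper sketched in Sect. 3.1, and the default radii follow the setting in Sect. 4.3.

```python
import torch

def hierarchical_weight(E: torch.Tensor, radii=(7, 9, 11)) -> torch.Tensor:
    """Per-pixel weights m(i) of Eq. (6) for n = len(radii) levels.

    E is the binary edge map (H x W); edge_region() is the helper sketched
    in Sect. 3.1. Pixels in the innermost level receive weight n, the next
    level n - 1, and so on; pixels outside the edge region receive 0.
    """
    n = len(radii)
    m = torch.zeros(E.shape, dtype=torch.float32)
    prev = torch.zeros(E.shape, dtype=torch.bool)
    for level, r in enumerate(radii, start=1):      # levels 1 .. n
        region = edge_region(E, r)                  # I_edge^(r_level)
        ring = region & ~prev                       # pixels newly added at this level
        m[ring] = n - level + 1
        prev = region
    return m
```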

4 Experiment

The purpose of this paper is to improve the performance of segmentation detail rather than to push the state of the art. All experiments are conducted on a TITAN X (Pascal) GPU with 12 GB of RAM, and the training parameters are detailed in the following part so that the results are easy to reproduce.

4.1 Region-Based Metric

A conventional metric like MIoU cannot quantitatively evaluate the performance of segmentation detail. In order to solve this problem, we introduce two region-based metrics: MIoUiER and MIoUiOR, which are defined on the edge region and the object region, respectively. They are computed in the same way as MIoU, except that MIoUiER only considers pixels belonging to the set \(I_\mathrm{{edge}}\) and MIoUiOR those belonging to \(I_\mathrm{{object}}\). The former quantitatively evaluates the performance of segmentation detail. Note that both are functions of the radius r.
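
As an illustration of how the region-based metrics restrict evaluation to one region, the per-image sketch below computes mean IoU only over a given mask. In practice, MIoU on Cityscapes is accumulated over the whole validation set; the function and parameter names here are illustrative.

```python
import numpy as np

def miou_in_region(pred, gt, region_mask, num_classes=19, ignore_index=255):
    """Mean IoU computed only over the pixels inside region_mask.

    With region_mask covering I_edge^(r) this corresponds to MIoUiER;
    with its complement, to MIoUiOR. pred and gt are H x W index arrays,
    region_mask is a boolean H x W array.
    """
    valid = region_mask & (gt != ignore_index)
    p, g = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.sum((p == c) & (g == c))
        union = np.sum((p == c) | (g == c))
        if union > 0:                       # skip classes absent in this region
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")
```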

4.2 Dataset

We adopt Cityscapes [20] as the evaluation dataset. It involves 19 semantic labels for the segmentation task, which belong to 7 groups: flat, human, vehicle, construction, object, nature and sky. The dataset focuses on semantic understanding of urban street scenes and has 5,000 fine and 20,000 coarse annotations. The fine set contains 2,975 (train), 500 (val) and 1,525 (test) pixel-level labeled images for training, validation and testing, respectively. Previous work shows that a model pre-trained on the coarse annotations achieves superior performance. Since our purpose is to study the effect of the proposed method rather than to push the state of the art, we do not use the coarse annotations, for simplicity of the training process. Performance is measured by MIoU, MIoUiER and MIoUiOR over the 19 classes.

4.3 Implementation Details

Model. We use the PyTorch framework for implementation, and we adopt DeepLab V3 Plus [11] with ResNet-50 [3] as the backbone. The output stride is set to 16. Note that the proposed method can be applied to any model that uses cross entropy as the loss function.

Data Preprocessing. Data augmentation is a powerful way to expand a dataset and makes the learned model robust to input variations. Similar to previous work [11], we first scale the image by a factor randomly chosen from a pre-defined array (0.5, 0.75, 1, 1.25, 1.5, 1.75, 2.0), then randomly horizontally flip it and crop it to the size \(513 \times 513\) for training. In order to make full use of the memory, the batch size is set to 12.

Learning Rate Policy. We adopt the poly learning rate policy with an initial learning rate of 0.007 and a power of 0.9, the same as [11]. We train the model for 30,000 steps, considering the trade-off between accuracy and training time.
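
For completeness, the poly policy scales the learning rate as base_lr \(\cdot\) (1 - step/max_iter)\(^{\mathrm{power}}\); a minimal sketch with the settings above is given below (the optimizer handling is only indicative).

```python
# Poly learning rate schedule used for training (base_lr = 0.007,
# power = 0.9, 30,000 steps, as stated above).
base_lr, power, max_iter = 0.007, 0.9, 30000

def poly_lr(step: int) -> float:
    return base_lr * (1.0 - step / max_iter) ** power

# Indicative usage before each optimization step:
# for group in optimizer.param_groups:
#     group["lr"] = poly_lr(step)
```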

Inference. During the inference stage, we do not use any data augmentation techniques. All metrics are obtained by a single-scale test on the Cityscapes validation set.

Loss Function. As shown in Eq. 5, the proposed loss function \(\mathcal {L}_\mathrm{{cehe}}\) has several hyper-parameters to be specified. In all our experiments, the number of levels n is set to 3; correspondingly, the radii of the levels are \(r_1=7\), \(r_2=9\) and \(r_3=11\), and the factor \(\lambda \) is set to 2. These parameters were not carefully chosen, and performance gains might be obtained by a grid search over them.

4.4 Evaluation

Metric Analysis. The proposed method helps to improve the performance of segmentation detail, thus leading to an improvement in overall performance. To evaluate \(\mathcal {L}_\mathrm{{cehe}}\), we train the DeepLab model with different loss functions while keeping all other parameter settings and the training pipeline unchanged. As listed in Table 1, our proposed \(\mathcal {L}_\mathrm{{cehe}}\) yields \(72.94\%\) MIoU, outperforming \(\mathcal {L}_\mathrm{{ce}}\) by \(1.12\%\), which demonstrates the effectiveness of the proposed method.

Table 1. Performance (MIoU) on Cityscapes validation set.
Fig. 3. Performance of models trained with different loss functions. All metrics are obtained on the Cityscapes validation set. Best viewed in color.

Fig. 4. Performance (region-based metrics) of models trained with different loss functions. All metrics are obtained on the Cityscapes validation set. Best viewed in color.

Region-Based Metric Analysis. The analysis based on MIoU only gives an overall evaluation of the segmentation results and cannot verify the purpose of the proposed \(\mathcal {L}_\mathrm{{cehe}}\): improving segmentation detail. The following part utilizes the two region-based metrics to (1) analyze the performance bottleneck of the model and (2) show that the loss function \(\mathcal {L}_\mathrm{{cehe}}\) enhances segmentation detail.

As shown in Fig. 3, the metric curves of the two loss functions generally have the same trend. Both perform well in the object region (MIoUiOR is larger than \(80\%\)) even when the radius r is only slightly larger than 10. However, both have inferior performance in the edge region compared with their MIoUiOR. For example, when \(r=11\), MIoUiER is \(45.99\%\) and \(50.13\%\) for \(\mathcal {L}_\mathrm{{ce}}\) and \(\mathcal {L}_\mathrm{{cehe}}\) respectively, which is \(36.68\%\) and \(31.84\%\) lower than the corresponding MIoUiOR. This indicates that the factor limiting the performance of the model is the prediction of the pixels in the edge region. Since the architecture of DeepLab, an FCN-based feature extractor with a decoder, is commonly used in semantic segmentation, we argue that this conclusion applies to most models.

Table 2. Performance on the Cityscapes validation set obtained under different r settings.
Fig. 5. Visualization of segmentation results obtained by different loss functions on the Cityscapes validation set. Results in the red boxes indicate that the proposed \(\mathcal {L}_\mathrm{{cehe}}\) has better performance in segmentation detail. Best viewed in color.

Figure 4 illustrates the metric curves of the two loss functions, and detailed statistics can be found in Table 2. In terms of MIoUiER, our proposed \(\mathcal {L}_\mathrm{{cehe}}\) outperforms the widely used \(\mathcal {L}_\mathrm{{ce}}\) by a large margin (between \(3\%\) and \(4\%\)) under almost all r settings, indicating that \(\mathcal {L}_\mathrm{{cehe}}\) performs better in the edge region, namely in segmentation detail. Some visualized examples are shown in Fig. 5. In terms of MIoUiOR, \(\mathcal {L}_\mathrm{{cehe}}\) performs slightly worse than \(\mathcal {L}_\mathrm{{ce}}\); however, given that \(\mathcal {L}_\mathrm{{cehe}}\) is superior to \(\mathcal {L}_\mathrm{{ce}}\) in overall MIoU (Table 1), this further confirms that the gain of \(\mathcal {L}_\mathrm{{cehe}}\) comes from improved segmentation detail. We conjecture that the gradients of hard examples are emphasized by the product of m(i) and \(\lambda \) in Eq. 5, so they play a leading role in the update direction of some model parameters. Further gains may be obtained by carefully choosing m(i) and \(\lambda \), or by fusing models trained with different loss functions. We leave this for future research.

5 Conclusion

In this paper, we introduce two region-based metrics to quantitatively evaluate the performance of segmentation detail, which we use to analyze the performance bottleneck of the model. Moreover, we combine cross entropy with manually defined hard examples into a loss function named \(\mathcal {L}_\mathrm{{cehe}}\), which outperforms the widely used cross entropy \(\mathcal {L}_\mathrm{{ce}}\) by \(1.12\%\) in terms of MIoU, and by \(4.15\%\) in terms of MIoUiER when the radius \(r=13\), indicating that the proposed \(\mathcal {L}_\mathrm{{cehe}}\) performs better in segmentation detail.