1 Introduction

Semantic segmentation is a critical and challenging task in computer vision that aims to predict the class label of each pixel in an image. Over the past years, deep convolutional networks have achieved great advances in semantic segmentation [1, 9, 18]. However, pixel-level annotation is extremely labor-intensive: annotating a single image in the Cityscapes dataset takes more than an hour [3]. Training models on synthetic images is a promising way to relieve this annotation burden, since their pixel-level labels can be generated automatically. Unfortunately, the domain shift between synthetic images and real-world scenes degrades prediction quality on real images. Domain adaptation is therefore needed to adapt a segmentation network trained on labeled source (synthetic) data to unlabeled target (real) data. Although recently proposed feature adaptation methods can bridge the source and target domains by learning domain-invariant features with an adversarial mechanism [2, 7, 13], they cannot ensure that these features encode the structure information of the target images, which matters because semantic segmentation is a highly structured prediction task.

In this paper, we propose to improve the domain adaptation performance of segmentation networks by enhancing the structure information of the target images at both the feature level and the output level. The main contribution of our work is two-fold: (1) enforcing an intermediate feature to reconstruct the training images; (2) adversarially aligning the structured outputs of the source and target images. Specifically, the reconstruction branch forces the encoded representation to preserve the visual cues of the target images, which benefit their structured prediction. The output-level structure enhancement, in turn, directly regularizes the target images' structured predictions, since both domains should share similar spatial layouts and local context. We evaluate our method on "GTA5 to Cityscapes", a standard domain adaptation benchmark for semantic segmentation. The experimental results clearly demonstrate that our method effectively bridges the two domains and achieves better adaptation results than existing state-of-the-art methods.

2 Related Work

Over the past years, domain adaptation in computer vision has been explored primarily for the classification task, where the main idea is to learn a deep representation that is domain invariant [4, 5, 10, 15, 16]. Unsupervised domain adaptation for semantic segmentation, by contrast, has not been widely explored. In [7], Hoffman et al. first proposed to adapt segmentation networks through domain adversarial learning in the feature space. In [2], Chen et al. further proposed a class-specific domain adversarial learning framework that reduces the domain divergence within each class. In [11], Murez et al. proposed to learn domain adaptive segmentation networks by directly translating source images into target ones at the pixel level. In [14], Tsai et al. proposed to align the two domains in the structured output space. In short, previous works mainly focused on aligning the source and target domains by applying adversarial learning at different levels, from intermediate features to final predictions. Our main idea, instead, is to enhance the structure information of the target images, which provides a reasonable regularization on their structured prediction.

3 Our Method

In this paper, we focus on unsupervised domain adaptation for semantic segmentation. Our goal is to learn a segmentation network that achieves good prediction results on the target domain, given source images \(I_S\) with pixel-level labels \(L_{S}\) and unlabeled target images \(I_T\).

Fig. 1. The overall architecture of our model (best viewed in color).

Overall, our adaptation method contains two major components: reconstructing the training images, and aligning the target images' structured predictions through adversarial training. Figure 1 shows an overview of our method.

Image Reconstruction: Our main idea is to adapt the segmentation network trained on the source images by learning a representation that encodes the visual cues of the target images. This is achieved by enforcing an intermediate layer to reconstruct the training images. As shown in Fig. 1, the encoding network is shared by the segmentation branch and the reconstruction branch, so the reconstruction branch regularizes the encoding network to enhance the target images' structure information.

Throughout this paper, we denote the encoding network and the decoding network as E and G, respectively, and the segmentation branch as S. We define our reconstruction-augmented training objective as

$$\begin{aligned} \begin{aligned} \min _{E, G, S}&\,\, \mathcal L(E,G,S) \\ s.t.&\,\, \mathcal L(E,G,S) = \lambda _{rec}\mathcal L_{rec} + \mathcal L_{seg}\\&\qquad \,\, = \lambda _{rec} (L_{1}(G \circ E(I_S), I_S)+ L_{1}( G \circ E(I_T), I_T )) \\&\qquad \,\, + L_{sup}( S \circ E(I_S), L_S), \end{aligned} \end{aligned}$$
(1)

where the first term is the reconstruction loss for the training images and \(L_{sup}\) is the segmentation supervision term for the source images. The image reconstruction is implemented with an \(L_1\) loss. Although ideally only the target images need to be reconstructed, reconstructing the source images as well helps train the decoding network.
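To make Eq. (1) concrete, the following is a minimal PyTorch sketch of the combined loss. The per-pixel cross-entropy for \(L_{sup}\) and the ignored void label are assumptions (the text does not pin them down), and the segmentation logits are assumed to be already resized to the label resolution; E, G, and S denote the encoder, decoder, and segmentation branch defined above.

```python
import torch.nn.functional as F

LAMBDA_REC = 1e-5  # trade-off weight lambda_rec, as set in Sect. 4.2

def objective_eq1(E, G, S, img_src, label_src, img_tgt):
    """Eq. (1): lambda_rec * L_rec + L_seg."""
    feat_src, feat_tgt = E(img_src), E(img_tgt)
    # L_rec: L1 reconstruction of both the source and target images.
    loss_rec = (F.l1_loss(G(feat_src), img_src)
                + F.l1_loss(G(feat_tgt), img_tgt))
    # L_seg: supervision on the labeled source images only; assumes
    # per-pixel cross-entropy with the usual void-label index ignored.
    loss_seg = F.cross_entropy(S(feat_src), label_src, ignore_index=255)
    return LAMBDA_REC * loss_rec + loss_seg
```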

Output Adaptation: Furthermore, we apply adversarial training in the output space of the segmentation network to align the structured predictions on the source and target images, since both domains should share similar spatial layouts. As shown in Fig. 1, a discriminative network is trained to discriminate whether a softmax prediction comes from the source domain or the target domain. The segmentation network \(S\circ E(\cdot )\), in turn, tries to fool the discriminator so that the target images' structured predictions resemble the source images' pixel maps. This provides gradient updates to the segmentation network whenever the target images' predictions are not reasonably structured. As a whole, the segmentation network and the discriminative network play a minimax game.

To retain spatial information, D is specified as a fully convolutional network that discriminates the domain label of each spatial unit. Following [17], we adopt Atrous Spatial Pyramid Pooling (ASPP) in our discriminative network, which helps align the structured output at multiple scales. The adversarial loss is formulated as

$$\begin{aligned} \begin{aligned} \max _{D} \ \min _{E,S} \ \mathcal L_{adv}\, = \,&\mathbb E_{{I_T} \sim \mathcal X_{t}} [ \frac{1}{HW} \sum _{i=1}^{H} \sum _{j=1}^{W} \log (1-D_{i,j}( S \circ E(I_T))) ] \\&+ \mathbb E_{{I_S} \sim \mathcal X_{s}} [ \frac{1}{HW} \sum _{i=1}^{H} \sum _{j=1}^{W} \log (D_{i,j}( S \circ E(I_S))) ]. \end{aligned} \end{aligned}$$
(2)

H and W are the height and width of the discriminator’s output, respectively.
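For illustration, Eq. (2) can be implemented with per-location binary cross-entropy on the discriminator's sigmoid outputs, where pred_src and pred_tgt denote the softmax predictions \(S \circ E(I_S)\) and \(S \circ E(I_T)\). The sketch below uses the common non-saturating variant for the segmenter (maximizing \(\log D\) on target predictions rather than minimizing \(\log (1-D)\)), an implementation choice the text does not prescribe.

```python
import torch
import torch.nn.functional as F

def d_loss(D, pred_src, pred_tgt):
    """Discriminator side of Eq. (2): source units -> 1, target units -> 0."""
    d_src = D(pred_src.detach())  # block gradients into E and S
    d_tgt = D(pred_tgt.detach())
    return (F.binary_cross_entropy(d_src, torch.ones_like(d_src))
            + F.binary_cross_entropy(d_tgt, torch.zeros_like(d_tgt)))

def adv_loss(D, pred_tgt):
    """Segmenter side of Eq. (2): make target predictions look 'source'."""
    d_tgt = D(pred_tgt)  # gradients flow back into E and S
    return F.binary_cross_entropy(d_tgt, torch.ones_like(d_tgt))
```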

In conclusion, combining the above sub-objectives, our final objective function is defined as

$$\begin{aligned} \max _{D} \ \min _{E,S,G} \ \mathcal L_{seg} + \lambda _{rec}\mathcal L_{rec} + \lambda _{adv} \mathcal L_{adv}. \end{aligned}$$
(3)

In our defined minimax game, we alternately optimize each sub-network while holding the other parts fixed. The parameters of the encoding network E are updated by averaging the gradients from the two branches.
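This alternating scheme can be sketched as follows, reusing objective_eq1, d_loss, and adv_loss from the sketches above; the paired data loader and optimizer names are hypothetical. Autograd accumulates into the shared encoder E the gradients from the segmentation, reconstruction, and adversarial terms, which realizes the gradient averaging described above up to constant weights.

```python
import torch

LAMBDA_ADV = 1e-3  # trade-off weight lambda_adv, as set in Sect. 4.2

for img_src, label_src, img_tgt in loader:  # hypothetical paired loader
    # Step 1: update E, S, G with D held fixed (the min part of Eq. (3)).
    opt_seg.zero_grad()
    pred_tgt = torch.softmax(S(E(img_tgt)), dim=1)
    loss = (objective_eq1(E, G, S, img_src, label_src, img_tgt)
            + LAMBDA_ADV * adv_loss(D, pred_tgt))
    loss.backward()  # D also receives gradients here; they are
    opt_seg.step()   # cleared below before D's own update.

    # Step 2: update D with E, S, G held fixed (the max part of Eq. (3)).
    opt_d.zero_grad()
    pred_src = torch.softmax(S(E(img_src)), dim=1)
    d_loss(D, pred_src, pred_tgt).backward()
    opt_d.step()
```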

4 Experiments

4.1 Dataset

To evaluate the performance of our method, we conduct experiments on "GTA5 to Cityscapes", a standard benchmark of domain adaptation for semantic segmentation. GTA5 contains 24,966 synthetic images with a resolution of \(1914 \times 1052\), rendered by the game engine Grand Theft Auto V; the pixel-level annotations of the GTA5 images are generated automatically. Cityscapes, on the other hand, is a dataset that focuses on autonomous driving; it consists of 2,975 training images and 500 validation images, all with a resolution of \(2048 \times 1024\). We use the 19 semantic categories shared between GTA5 and Cityscapes as labels. Following existing state-of-the-art works [7, 14], we train our domain adaptive segmentation network on the full GTA5 dataset and the 2,975 Cityscapes training images, and evaluate performance on the 500 Cityscapes validation images.

Table 1. Results of different methods on the “GTA5 to Cityscapes” dataset. Ablation studies are conducted for both the feature-level encoding and the output-level enhancement.

4.2 Implementation Details

We adopt DeepLab-v2 as our baseline [1]. Specifically, the encoding network E is implemented with ResNet-101; the outputs of its res5c layer are fed into both the segmentation branch S and the reconstruction network G. G follows the architecture in [8], except that all layers are shared by both domains. The discriminative network D contains three layers: an ASPP layer with four dilated convolution operators in parallel, a convolutional layer, and a sigmoid activation. The sampling rates in the ASPP layer are set to 1, 2, 3, and 4, respectively. We implement our method in PyTorch, and our experimental setting follows [14]. For E and S, we adopt stochastic gradient descent (SGD) with momentum 0.9 as the optimizer; the parameters of G and D are optimized with Adam with momentum 0.99. We initialize the learning rate to \(2.5 \times 10^{-4}\) and decay it with the polynomial policy with power 0.9. The trade-off parameters \(\lambda _{rec}\) and \(\lambda _{adv}\) are set to \(1.0 \times 10^{-5}\) and \(1.0 \times 10^{-3}\), respectively. The mIoU value is used as the evaluation metric.
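For concreteness, below is one way to realize the three-layer discriminator described above (an ASPP layer with four parallel dilated convolutions at rates 1 to 4, a convolutional layer, and a sigmoid) together with the polynomial learning-rate policy. The channel widths, kernel sizes, and iteration budget are our assumptions; the paper does not specify them.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Fully convolutional domain discriminator with an ASPP front end.

    in_ch = 19 matches the number of shared classes; mid_ch is an
    assumed width, not a value from the paper.
    """
    def __init__(self, in_ch=19, mid_ch=64):
        super().__init__()
        # ASPP: four parallel dilated convolutions with rates 1, 2, 3, 4.
        self.aspp = nn.ModuleList(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=r, dilation=r)
            for r in (1, 2, 3, 4))
        # Final convolution producing one domain score per spatial unit.
        self.out = nn.Conv2d(4 * mid_ch, 1, kernel_size=3, padding=1)

    def forward(self, x):
        x = torch.cat([conv(x) for conv in self.aspp], dim=1)
        return torch.sigmoid(self.out(x))

def poly_lr(it, base_lr=2.5e-4, max_it=250000, power=0.9):
    """Polynomial decay used for E and S; max_it is an assumed budget."""
    return base_lr * (1.0 - it / max_it) ** power
```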

Fig. 2. Qualitative example results. The first row shows the target images and the second row their ground-truth segmentation masks; the third and fourth rows show the predictions before and after adaptation with our method, respectively.

4.3 Experimental Results

In Table 1 and Fig. 2, we report our adaptation results quantitatively and qualitatively. The results demonstrate that our adaptation method effectively improves the structured predictions on the target images. From Fig. 2, we can see that the structure information of the target images' predictions is significantly enhanced, which is consistent with our motivation: with our method, the target images' pixel-level predictions clearly delineate the real spatial layout. As shown in Table 1, our method outperforms existing state-of-the-art methods. The ablation studies demonstrate that the feature-level encoding and the output-level enhancement work complementarily to improve adaptation performance. This can be ascribed to the fact that the two branches enhance the target images' structure information from complementary perspectives. Specifically, the reconstruction branch enforces the encoded representation to preserve the target images' visual cues, such as local context and spatial layout, which are essential for structured prediction, whereas the output-level enhancement directly leverages the source images' pixel maps to regularize the target images' structured predictions.
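For reference, the mIoU reported in Table 1 averages the per-class intersection-over-union, \(\mathrm{IoU}_k = TP_k / (TP_k + FP_k + FN_k)\), over the 19 evaluation classes. A minimal sketch, assuming a confusion matrix accumulated over the validation set elsewhere:

```python
import numpy as np

def mean_iou(conf_mat):
    """mIoU from a KxK confusion matrix (rows: ground truth, cols: prediction)."""
    tp = np.diag(conf_mat)
    fp = conf_mat.sum(axis=0) - tp  # predicted as class k but wrong
    fn = conf_mat.sum(axis=1) - tp  # pixels of class k that were missed
    iou = tp / np.maximum(tp + fp + fn, 1)  # guard against empty classes
    return iou.mean()
```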

5 Conclusion

In this paper, we propose an effective method to learn a domain adaptive segmentation network in the unsupervised domain adaptation setting. By enhancing the structure information of the target images at both the feature level and the output level, our method effectively improves the domain adaptation performance of segmentation networks. After adaptation with our method, the target images' pixel maps clearly reveal their structural characteristics, such as spatial layout and local context. The experimental results demonstrate that our method effectively bridges the source and target domains.