Neurocomputing

Volume 394, 21 June 2020, Pages 127-135

GAN-Based virtual-to-real image translation for urban scene semantic segmentation

https://doi.org/10.1016/j.neucom.2019.01.115

Abstract

Semantic image segmentation requires large amounts of pixel-wise labeled training data, and creating such data generally requires labor-intensive manual annotation. Extracting training data from video games is therefore a practical idea, because pixel-wise annotation can be automated from game engines with near-perfect accuracy. However, experiments show that models trained on raw video-game data cannot be applied directly to real-world scenes because of the domain shift problem. In this paper, we propose a domain-adaptive network based on CycleGAN that translates scenes from a virtual domain to a real domain in both the pixel and feature spaces. Our contributions are threefold: 1) we propose a dynamic perceptual network to improve the quality of the generated images in the feature spaces, making the translated images more conducive to semantic segmentation; 2) we introduce a novel weighted self-regularization loss to prevent semantic changes in translated images; and 3) we design a discrimination mechanism to coordinate multiple subnetworks and improve the overall training efficiency. We devise a series of metrics to evaluate the quality of the translated images in our experiments on the public GTA-V dataset (a video game dataset, i.e., the virtual domain) and Cityscapes (a real-world dataset, i.e., the real domain), and we achieve notably improved results, demonstrating the efficacy of the proposed model.

Introduction

Semantic segmentation is a fundamental topic in computer vision. Recently, various deep learning models [1], [2], [3] have been developed that perform excellently on this task, but they require massive amounts of annotated training data. Unfortunately, annotation is labor-intensive, which makes large amounts of high-quality training data extremely difficult to obtain. For example, Cityscapes [4], the well-known urban traffic scene semantic segmentation dataset, contains only 2500 finely annotated (pixel-wise) images, yet tens of millions of dollars have been invested in its annotation. Therefore, some researchers have attempted to extract accurately labeled data from video games to train semantic segmentation models [5], [6], [7]. With computer graphics techniques, the semantic information extracted from game scenes is naturally both precise and complete, making it possible to acquire large-scale training data efficiently through an automated process. However, previous experiments have shown that models trained directly on a game-scene dataset (the virtual domain) perform poorly in the real domain (see Fig. 1) because the data distribution varies significantly between the two domains. This discrepancy is referred to as the domain shift problem.

Domain adaptation techniques have recently been introduced to cope with the domain shift problem [8], [9]. Inspired by dual learning, several domain-adaptive models [10], [11], [12] have been proposed that reduce domain shift by translating images from the source domain to the target domain. These models work well for simple tasks that involve translating only a few objects. For example, to translate a grayscale apple image into a colored image, the model must determine only that the object in the image is an apple. However, these models face severe challenges when applied to more complex scenarios that contain many objects. When translating a typical complex scene (such as an urban road scene) from the virtual domain to the real domain, they are prone to learn only global features (e.g., richness of vegetation and overall tone) and pay less attention to the texture and other fine details of each object. This usually results in severe semantic-inconsistency failures, as shown in Fig. 2.
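
As background, the objective underlying these dual-learning translation models, in the CycleGAN formulation, couples two adversarial losses with a cycle-consistency term. With generators G: X → Y and F: Y → X and discriminators D_X and D_Y, the losses are:

    \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log(1 - D_Y(G(x)))]
    \mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y}[\lVert G(F(y)) - y \rVert_1]
    \mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X) + \lambda\,\mathcal{L}_{\mathrm{cyc}}(G, F)

The cycle-consistency term constrains only the round trip between the two domains, so a translator can reproduce global appearance while letting per-object semantics drift, which is exactly the failure mode described above.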

In this paper, we propose a semantic-consistent virtual-to-real image translation model. The model concentrates on translating texture and fine details while maintaining semantic consistency between the virtual domain and the real domain. With this model, scenes containing many objects, such as complex urban roads, can be translated effectively. The primary contributions of this study are as follows:

  • A dynamic perceptual loss is introduced to improve the quality of the generated images in the feature spaces.

  • An effective weighted self-regularization loss is employed to maintain the semantic consistency between the raw game images and the translated images (a sketch of this idea follows this list).

  • A discrimination mechanism is designed to coordinate the multiple subnetworks the model employs and improve the overall training efficiency.
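
To make the second contribution concrete, the following is a minimal PyTorch-style sketch of one plausible form of a weighted self-regularization loss: an L1 penalty between the input and translated images, weighted per pixel by class-dependent weights taken from the game's semantic labels. The function name, tensor shapes, and weighting scheme are illustrative assumptions rather than the paper's exact formulation.

    import torch

    def weighted_self_regularization(x_virtual, x_translated, label_map, class_weights):
        # Hypothetical sketch; not the paper's exact loss.
        #   x_virtual:     (N, 3, H, W) raw game image
        #   x_translated:  (N, 3, H, W) translated (virtual-to-real) image
        #   label_map:     (N, H, W)    per-pixel semantic labels from the game engine
        #   class_weights: (C,)         per-class weights
        w = class_weights[label_map].unsqueeze(1)             # (N, 1, H, W) weight map
        # A weighted L1 distance keeps the translation close to the source image,
        # discouraging semantic changes during translation.
        return (w * (x_translated - x_virtual).abs()).mean()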

Extensive experiments on the public GTA-V and Cityscapes datasets demonstrate that the proposed approach generates highly realistic images and improves the performance of segmentation models trained on the translated data.

Section snippets

Semantic image segmentation

Semantic image segmentation [13], [14] assigns semantic labels (e.g., “road” or “sidewalk”) to every pixel in an image. The first successful study to introduce deep learning into semantic segmentation was the Fully Convolutional Network (FCN), which removed the fully connected layers of traditional CNNs and replaced fixed bilinear interpolation kernels with trainable deconvolution kernels to obtain finer segmentation. However, FCNs can be very problematic (e.g., they have redundant…
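
As a concrete illustration of the upsampling change mentioned above, the fragment below contrasts fixed bilinear upsampling with an FCN-style trainable transposed convolution. It is a minimal PyTorch sketch under assumed shapes, not code from the paper or from the original FCN implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_classes = 19                               # e.g., the Cityscapes label set
    scores = torch.randn(1, num_classes, 32, 64)   # coarse per-class score map

    # Fixed, non-trainable bilinear upsampling of the coarse scores.
    fixed = F.interpolate(scores, scale_factor=2, mode="bilinear", align_corners=False)

    # FCN-style trainable upsampling: a transposed convolution whose kernel is
    # learned with the rest of the network (commonly initialized to bilinear).
    learned_upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                          kernel_size=4, stride=2, padding=1, bias=False)
    learned = learned_upsample(scores)             # (1, 19, 64, 128)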

The proposed method

Our work aims to translate virtual-domain images into real-domain images by performing image-to-image translation without paired training data. Given a virtual-domain image, our goal is to translate it into a realistic and semantically identical image. Based on CycleGAN, we present a new model that achieves this goal. Previous methods [10], [11], [12] are unsuitable for virtual-to-real translation, as shown in Fig. 2. Models similar to CycleGAN are prone to learn global features…
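
For orientation, the sketch below shows the generic form of a feature-space (perceptual) loss on which components of this kind build: the translated and reference images are compared in the feature space of a pretrained network rather than in pixel space. The fixed VGG-16 backbone and the chosen layer are illustrative assumptions; the paper's dynamic perceptual network is a distinct, trainable variant of this idea.

    import torch.nn as nn
    from torchvision.models import vgg16

    class PerceptualLoss(nn.Module):
        # Generic perceptual-loss sketch; not the paper's dynamic perceptual network.
        def __init__(self, num_layers=16):                   # up to relu3_3 in VGG-16
            super().__init__()
            # Older torchvision API; newer versions use the `weights` argument.
            backbone = vgg16(pretrained=True).features[:num_layers]
            for p in backbone.parameters():
                p.requires_grad = False                       # keep the extractor fixed
            self.backbone = backbone.eval()

        def forward(self, translated, reference):
            # Compare images in deep feature space rather than pixel space.
            return (self.backbone(translated) - self.backbone(reference)).abs().mean()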

Settings

Our experiments are based on the GTA-V and Cityscapes datasets. The GTA-V dataset contains 24,966 labeled virtual-domain images, and Cityscapes provides the real-domain scenes. The Cityscapes training set contains 19,998 images, of which 2000 are finely labeled and 17,998 are coarsely labeled. Note that we do not use the coarse labels; these 17,998 images are therefore treated as unlabeled data. In addition, the validation set contains another 500 finely labeled images. The semantic segmentation performance…
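
For clarity, the split described above can be summarized as follows; this is only an illustrative tally of the counts given in the text, with loader and path details omitted.

    # Illustrative summary of the data split (counts from the text above).
    splits = {
        "gta5_train":       24966,                                        # labeled virtual-domain images
        "cityscapes_train": {"fine": 2000, "coarse_as_unlabeled": 17998}, # coarse labels are not used
        "cityscapes_val":   {"fine": 500},                                # finely labeled validation images
    }
    assert 2000 + 17998 == 19998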

Conclusion and future work

In this paper, we propose a unified model that translates virtual scene data to the real domain. The model introduces an effective weighted self-regularization loss to maintain semantic consistency between the raw game images and the translated images, and employs a dynamic perceptual loss to improve the quality of the generated images in the feature spaces. A discrimination mechanism is also designed to train the perceptual network for better semantic segmentation. Extensive experiments on…

Conflict of interest

None.

Acknowledgment

This research was supported by the China State Key Laboratory of Software Development Environment (SKLSDE-2017ZX-22) and Beijing Science and Technology Project (Z171100000917016).


References (41)

  • W. Qiu et al.

    UnrealCV: Connecting computer vision to Unreal Engine

  • S.R. Richter et al.

    Playing for benchmarks

    Proceedings of the International Conference on Computer Vision (ICCV)

    (2017)
  • J.-Y. Zhu et al.

    Unpaired image-to-image translation using cycle-consistent adversarial networks

  • T. Kim et al.

    Learning to discover cross-domain relations with generative adversarial networks

  • Z. Yi et al.

    DualGAN: Unsupervised dual learning for image-to-image translation

  • O. Ronneberger et al.

    U-Net: Convolutional networks for biomedical image segmentation

    Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention

    (2015)
  • F. Yu, V. Koltun, Multi-scale context aggregation by dilated convolutions, arXiv preprint...
  • L.-C. Chen et al.

    Semantic image segmentation with deep convolutional nets and fully connected CRFs

    Comput. Sci.

    (2014)
  • L.-C. Chen et al.

    DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2018)
  • L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, arXiv...

    Xi Guo received the B.S. degree in Nuclear Engineering and Technology from North China Electric Power University in 2015. He is currently pursuing the Ph.D. degree at the State Key Laboratory of Software Development Environment, Beihang University. His research interests include computer vision and machine learning.

    Zhicheng Wang received the B.S. degree in Software Engineering from Beihang University, Beijing. He is currently studying for a master's degree at Beihang University. His research interests include visual domain adaptation and machine learning.

    Qin Yang is a Professor and Ph.D. Supervisor of Computer Science, working at the State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University. He received his Ph.D. degree in Computer Software from Beihang University, Beijing, in 2001. He has been researching and developing software in the geological and geophysical industry for 20 years. His research areas include Computer Graphics, Geometry Modeling, 3D Geological Modeling, Mesh Generation, and Spatial Data Processing and Management.

    Weifeng Lv received the Ph.D. degree in computer science from Beihang University. His research interests include deep learning and massive information systems. He is a professor, the dean of the School of Computer Science and Engineering, vice director of the State Key Laboratory of Software Development Environment, Secretary-General of the China Software Industry Association, and director of the National Engineering Research Center for Science and Technology Resources Sharing Service.

    Xianglong Liu received the B.S. and Ph.D. degrees in computer science from Beihang University, Beijing, China, in 2008 and 2014, respectively. From 2011 to 2012, he visited the Digital Video and Multimedia (DVMM) Lab, Columbia University, as a joint Ph.D. student. He is currently an Associate Professor with the School of Computer Science and Engineering, Beihang University. He has published over 30 research papers at top venues such as the IEEE Transactions on Image Processing, the IEEE Transactions on Cybernetics, the Conference on Computer Vision and Pattern Recognition, the International Conference on Computer Vision, and the Association for the Advancement of Artificial Intelligence. His research interests include machine learning, computer vision and multimedia information retrieval.

    Qiong Wu received her master's degree from Beihang University. She currently works at the Advanced Research Institute of CTFO. Her research interests include high-accuracy positioning and test technology for intelligent and self-driving cars.

    Jian Huang received his M.Sc. and Ph.D. degrees from Beihang University, Beijing, P.R. China, in 2003 and 2015, respectively. In 2003, he joined the School of Software, Beihang University. Since May 2016, he has been an Associate Professor in the same department. His research interests include Smart Transportation, Big Data, and Image and Video Processing. Prof. Huang has published more than 30 papers in these areas.
