Elsevier

Pattern Recognition Letters

Volume 136, August 2020, Pages 257-263
OccGAN: Semantic image augmentation for driving scenes

https://doi.org/10.1016/j.patrec.2020.06.011

Highlights

  • The OccGAN structure is a semantic augmentation method on Cityscapes.

  • The Rationality Module utilizes prior knowledge to implant occluders.

  • The Authenticity Module ensures plausibility via a generative adversarial network.

  • Our method improves the performance of several state-of-the-art algorithms.

Abstract

Difficult images with complicated environments and occlusion significantly affect algorithm performance. They follow a long-tail distribution in widely used datasets, so rare samples are overwhelmed during training. This paper presents a new approach that generates plausible occluded images with annotations, as a form of data augmentation consistent with scene semantics. To achieve this, we propose the Occlusion-based Generative Adversarial Network (OccGAN) structure, which consists of a Rationality Module and an Authenticity Module. The Rationality Module generates preliminary occluded samples under the guidance of prior semantic knowledge, and the Authenticity Module is a generative adversarial structure that ensures the realism of the produced images. Qualitative visualization results are given to support the ablation study. Experiments on the semantic segmentation task indicate that several state-of-the-art algorithms combined with our OccGAN, such as DRN, Deeplabv3+, PSPNet and ResNet-38, achieve consistent gains in IoU class and IoU category scores.

Introduction

When an image contains abundant objects or the scene is quite complicated, as in MS COCO [1] and KITTI [2], deep networks are not as effective as humans in recognition [3]. A common property of these complicated images is occlusion. On the one hand, occlusion is inherently arbitrary and varied. Several studies have shown that the key parts of an object usually play an important role in recognition [4], [5], [6]. When occlusion occurs at the position of key parts, valuable information is lost. Therefore, occlusion leads to immense difficulty in recognition, detection, and segmentation tasks. On the other hand, deep learning tools rely heavily on data [7], while occlusion follows a long-tail distribution in nature [8], as shown in Fig. 1. Widely used large-scale datasets of natural scenes inevitably contain occluded and other difficult images. Information about occluded objects is almost impossible to obtain directly: a straightforward approach cannot recover the contour or the bounding box of an occluded object. Obtaining annotations for occlusions therefore requires extra manual work and human imagination, which is laborious and costly.

Since the long-tail distribution means that rare images are likely to be overwhelmed by common ones, direct training on a large-scale dataset does not yield satisfactory results. Several data augmentation methods therefore assist network training [9], [10], [11]. Various transformations of the original image can improve the robustness of the trained model, including cropping, flipping, rotation, shifting, and scaling. Random Erasing [11] also provides occluded samples by overlaying gray blocks. These low-level augmentations mainly rely on known invariances but lack semantic information. In other words, most traditional augmentation methods are low-level manipulations of the data rather than instance-level operations that match scene semantics. The core of this paper is an augmentation method at the semantic level that enlarges the number of plausible occlusion samples to balance the distribution (Fig. 2).
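To make the contrast with semantic augmentation concrete, a low-level method such as Random Erasing [11] amounts to overlaying a gray rectangle at a random position. The sketch below is an illustrative NumPy version of this idea (function name, parameter choices, and the aspect-ratio range are our assumptions, not the code of [11]):

```python
import numpy as np

def random_erase(img, scale=(0.02, 0.2), fill=128, rng=None):
    """Occlude a random rectangle of `img` (H, W, C) with a gray block,
    in the spirit of Random Erasing. Illustrative sketch only."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    # pick the erased area as a random fraction of the image area
    area = rng.uniform(*scale) * h * w
    aspect = rng.uniform(0.3, 3.3)
    eh = min(int(round(np.sqrt(area / aspect))), h)
    ew = min(int(round(np.sqrt(area * aspect))), w)
    y = rng.integers(0, h - eh + 1)
    x = rng.integers(0, w - ew + 1)
    out = img.copy()
    out[y:y + eh, x:x + ew] = fill  # gray block: no semantic content
    return out
```

Note that the erased block carries no semantic content at all; this is exactly the limitation the proposed semantic-level augmentation addresses.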

Thus we propose an occlusion-based generative adversarial network, abbreviated as OccGAN, to generate occlusion samples from natural images, exploiting the strength of generative adversarial networks (GANs) [12] in image generation. A generative adversarial network is usually composed of two parts: a generator and a discriminator. The generator attempts to fool the discriminator by producing samples similar to real ones, while the discriminator is trained to distinguish the generated samples. The two models are trained together toward a Nash equilibrium via an adversarial loss. In the seminal paper [12], the original GAN cannot control the pattern of generated samples because its input is random noise. Conditional GAN [13] was proposed to handle this problem, and many subsequent works are variants of it. Convolutional neural networks (CNNs) have also been adopted in GAN training: by proposing a set of architectural guidelines, the authors of [14] obtained a stable deep convolutional GAN. GANs have further been employed for image-to-image and text-to-image translation [15], [16], [17].
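The adversarial objective described above can be written down in a few lines. The sketch below computes the standard GAN losses of [12] from discriminator output probabilities in NumPy; it is a conceptual illustration of the minimax game, not the OccGAN training code:

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy of probabilities p against a 0/1 target."""
    p = np.clip(p, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def discriminator_loss(d_real, d_fake):
    # D is trained to output 1 on real samples and 0 on generated ones
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # G is trained to make D output 1 on its generated samples
    return bce(d_fake, 1.0)
```

In practice the two losses are minimized alternately; the equilibrium is reached when the discriminator can no longer tell generated samples from real ones.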

A fully learned GAN can hardly control whether its output images are occluded or how they are transformed; moreover, the generated occlusion may be as uncontrollable as noise. Although the GAN structure in A-Fast-RCNN [8] is used to generate occlusions to counter the long-tail distribution, these occlusions occur on deep features rather than on RGB images. All of the generated occlusions are modeled inside the network's convolutional layers to make it more robust; in effect, they merely highlight the essential region of an image. In our research, the OccGAN structure is designed to generate occluded images with natural objects instead of black or gray blocks.

Lee et al. [18] use a GAN to insert object instance masks of specific classes into the semantic label maps of the Cityscapes [19] dataset. Spatial and shape distributions are learned by two separate generators. The corresponding RGB images can then be generated in two ways: one is to use another conditional GAN with semantic label maps as input [20], and the other is simple cropping and mapping. Due to the instability of GANs, the output may have blurred content or twisted edges, resulting in strange synthetic RGB images. By contrast, our approach utilizes prior knowledge to generate occlusions on both semantic label maps and RGB images, and exploits the GAN structure to ensure authenticity.

Unlike the idea of generating difficult occluded samples for training, some other studies apply GANs to image inpainting to overcome occlusion. In [21] a recurrent neural network is used to complete the occluded part of an image, and a multi-scale deconvolutional network has been proposed to restore occluded images [22]. The GAN structure has also been introduced to recognize occluded parts of objects [23].

In this paper, we also choose the Cityscapes dataset, a practical autonomous driving dataset with urban scenes. Crucially, we are able to establish reliable location priors because in-vehicle cameras have a fixed viewpoint: the posture and spatial relationships of pedestrians and vehicles are relatively fixed in driving scenes. Random persons and cars selected from natural images are exploited to occlude the input. At the same time, inspired by the benefit of location priors in driving scenes [24] and scene understanding [25], [26], data statistics and spatial zooming factors determine where and what to occlude. We therefore propose the OccGAN structure, which consists of a Rationality Module and an Authenticity Module. The whole structure builds the human concept of occlusion into a GAN to produce plausible occluded images guided by prior knowledge. Our method uses instance-level operations to achieve semantic image augmentation of the number, position and relationships of targets. Unlike classic data augmentation results, most of which contradict the real-world situation, the results generated by our OccGAN are consistent with the scene semantics.
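As a rough illustration of how a location prior and a spatial zooming factor might guide occluder implantation in a fixed-viewpoint driving scene, consider the hypothetical sketch below. The function name, the sampling scheme, and the linear perspective model are our assumptions for illustration, not the paper's exact formulation (Section 2):

```python
import numpy as np

def place_occluder(prior_map, base_size, rng=None):
    """Sample an occluder position from a spatial prior over the image
    grid, then scale it with vertical position: in a driving scene with
    a fixed camera, objects lower in the frame appear larger.
    Hypothetical sketch only."""
    rng = rng or np.random.default_rng()
    h, w = prior_map.shape
    # normalize the prior into a categorical distribution over pixels
    probs = prior_map.ravel() / prior_map.sum()
    idx = rng.choice(h * w, p=probs)
    y, x = divmod(idx, w)
    # simple linear perspective model: zoom grows from 0.5x at the top
    # of the image to 1.5x at the bottom
    zoom = 0.5 + y / (h - 1)
    size = int(round(base_size * zoom))
    return (y, x), size
```

A prior map concentrated on road and sidewalk regions would then keep implanted persons and cars in semantically plausible locations, which is the role the Rationality Module plays in our structure.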

Our main contributions are four-fold. First, we put forward OccGAN, an instance-level semantic augmentation method matched to the scenario. Second, the Rationality Module utilizes prior knowledge, including scene understanding and spatial relationships, to implant occluders reasonably. Third, the Authenticity Module exploits a generative adversarial structure to promote the authenticity of the images. Finally, experiments on the semantic segmentation task verify the effectiveness of our method, which improves the performance of existing state-of-the-art algorithms.

The rest of the paper is organized as follows. Section 2 introduces the whole OccGAN structure, including the Rationality Module and the Authenticity Module. Section 3 presents experiments validating the effectiveness of our method on the semantic segmentation task. Section 4 gives the conclusion and discussion.

Approach

The proposed OccGAN model takes a single RGB image as input and attempts to predict occluders that are as plausible as possible with respect to real images. Our model is divided into two modules: a Rationality Module that generates occluded images by occlusion implantation, and an Authenticity Module that ensures occlusion authenticity via a generative adversarial network.

Experiments

The OccGAN architecture can produce occluded images quickly and plausibly. Moreover, thanks to the semantic synthesis, these occluded images come with fine annotations of the occluded objects. Our method can be regarded as a new kind of augmentation tool that produces a large number of occluded images from original real images, yielding sufficient training samples. Besides, it can be combined with traditional augmentation methods to improve several state-of-the-art algorithms. In order to prove the

Conclusion

This paper proposes the OccGAN structure as an augmentation method to generate plausible occluded images on an autonomous driving dataset. Under the guidance of prior knowledge, the Rationality Module of OccGAN generates preliminary occluded images using a prior probability distribution and a scale factor. Furthermore, the Authenticity Module handles the problems of edge sharpness and style inconsistency via a generative adversarial structure. At the same time, we can obtain the precise

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key R&D Plan (No.2016YFB0100901), the National Natural Science Foundation of China (No. 61773231, No. 61673237) and the Beijing Municipal Science & Technology Project (No.Z191100007419001).

References (31)

  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Neural Inf. Process. Syst. (2012)

  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, Comput. Sci. (2014)

  • Z. Zhong et al., Random erasing data augmentation, Comput. Vis. Pattern Recognit. (2017)

  • I.J. Goodfellow et al., Generative adversarial nets, International Conference on Neural Information Processing Systems (2014)

  • M. Mirza et al., Conditional generative adversarial nets, Comput. Sci. (2014)
Handling Editor: Wei Zhang

1 Contributed equally to this work.
