Neurocomputing

Volume 401, 11 August 2020, Pages 123-132

A novel data augmentation scheme for pedestrian detection with attribute preserving GAN

https://doi.org/10.1016/j.neucom.2020.02.094

Abstract

Pedestrian detection has progressed significantly in recent years. However, detecting small-scale or heavily occluded pedestrians remains notoriously difficult. Moreover, the generalization ability of pre-trained detectors across different datasets still needs improvement. Both issues can be attributed to insufficient training data coverage. To cope with this, we present an efficient data augmentation scheme that transfers pedestrians from other datasets into the target scene with a novel Attribute Preserving Generative Adversarial Network (APGAN). The proposed methodology consists of two steps: pedestrian embedding and style transfer. The former simulates pedestrian images at various scales and occlusion levels, with arbitrary poses and backgrounds, thus greatly increasing data variation. The latter aims to make the generated samples more realistic while guaranteeing data coverage. To achieve this goal, we propose APGAN, which pursues both good visual quality and attribute preservation after style transfer. With the proposed method, we can generate effective augmented samples that improve the generalization ability of the trained detectors and enhance their robustness to scale change and occlusion. Extensive experimental results validate the effectiveness and advantages of our method.

Introduction

Pedestrian detection has made great progress in recent years. Performance on public datasets appears promising; however, several challenges remain unresolved. First, pedestrian detection in real applications must handle complex lighting conditions, background changes, pose variations, occlusions, and scale changes. Public datasets cover only limited data variation, so existing methods may struggle with these complex situations, especially with small-scale and occluded pedestrians. Moreover, the domain gap between public datasets makes pre-trained detectors generalize poorly across different datasets. As illustrated in Fig. 1, detectors trained on typical training data can hardly deal with the above challenges.

Apart from algorithmic deficiencies, insufficient training sample coverage is another important cause of unsatisfactory detection performance, particularly for methods based on deep learning. However, collecting pedestrian samples that exhaustively cover all variations is infeasible in practice. Therefore, many researchers resort to data augmentation strategies that increase data coverage by making full use of the available training data. Common data augmentation methods include random cropping, color jittering, random deformation, etc. However, these methods introduce only limited data variation, so they improve performance slightly. Recently, several works have proposed crop-and-paste data augmentation schemes for object detection [1], [2] and instance segmentation [3]; that is, cropping object foregrounds and pasting them into the target scene by following certain rules. However, these methods do not consider whether the pasted patches fit the target scene. As a result, the augmented samples may look unrealistic and hinder model learning. Meanwhile, many works [4], [5], [6], [7] leverage Generative Adversarial Networks (GANs) to conduct domain adaptation for the person re-identification problem. They can simulate the target resolution and illumination conditions to some extent, but can hardly generate pedestrian samples at novel scales or with novel occlusion patterns.
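
To make the crop-and-paste idea concrete, below is a minimal sketch of the naive scheme (the helper name and interface are ours; the cited works [1], [2], [3] use more elaborate placement rules). Note that nothing here checks whether the patch fits the target scene, which is precisely the drawback pointed out above.

```python
import numpy as np

def crop_and_paste(scene, person_crop, person_mask, top_left):
    """Paste a cropped person into a scene at a given position.

    scene:       H x W x 3 uint8 target image
    person_crop: h x w x 3 uint8 pedestrian patch
    person_mask: h x w boolean foreground mask of the patch
    top_left:    (y, x) paste position, assumed to lie fully inside the scene
    """
    out = scene.copy()
    y, x = top_left
    h, w = person_mask.shape
    region = out[y:y + h, x:x + w]                  # view into the copy
    region[person_mask] = person_crop[person_mask]  # overwrite foreground pixels only
    return out
```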

To tackle the above issues, this paper proposes a novel person transferable data augmentation approach for pedestrian detection. As shown in Fig. 2, it involves two stages: (1) embed persons from other datasets into the target scene randomly, guided by the scene semantics; (2) crop pedestrian patches from the pedestrian embedding images, transfer their style into the target domain, and embed them back to obtain the generated training samples. The newly generated images and labels can be combined with the original training data for pedestrian detector training. Pedestrian embedding has two main benefits. First, it significantly increases the diversity of pedestrian samples, improving the generalization ability of the detectors. Second, we can simulate specific augmentation targets (e.g., occlusion, small scale) during the embedding stage to boost detection performance in such special situations.
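
This excerpt does not spell out the embedding rules, so the sketch below illustrates one plausible semantic-guided placement: sample a foot point on a walkable surface and scale the person by its distance to the horizon. The WALKABLE label ids, the horizon heuristic, and all function names are our assumptions, and crop_and_paste is reused from the previous sketch.

```python
import numpy as np

WALKABLE = (0, 1)  # hypothetical label ids for road and sidewalk

def embed_pedestrian(scene, semantic_map, crop, mask, rng, horizon_y=200):
    """Embed one source pedestrian into the target scene (schematic).

    The foot point is drawn from walkable pixels of the semantic map, and
    the crop is rescaled so that feet closer to the horizon yield a smaller
    person -- a simple perspective heuristic, not the paper's exact rule.
    """
    ys, xs = np.nonzero(np.isin(semantic_map, WALKABLE))
    i = rng.integers(len(ys))                    # random walkable foot point
    fy, fx = int(ys[i]), int(xs[i])
    scale = max(0.1, (fy - horizon_y) / (scene.shape[0] - horizon_y))
    h = max(1, int(crop.shape[0] * scale))
    w = max(1, int(crop.shape[1] * scale))
    yi = np.arange(h) * crop.shape[0] // h       # nearest-neighbour resize,
    xi = np.arange(w) * crop.shape[1] // w       # keeps the sketch dependency-free
    y0, x0 = fy - h, fx - w // 2                 # assumes the spot leaves room for the crop
    out = crop_and_paste(scene, crop[yi][:, xi], mask[yi][:, xi], (y0, x0))
    return out, (x0, y0, x0 + w, y0 + h)         # augmented image + new bounding box
```

Because the embedding is synthetic, the bounding box of each inserted person is known exactly, so labels for the augmented images come for free.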

To make the generated samples look more realistic while guaranteeing data variation, we propose APGAN in Stage 2, a novel variant of CycleGAN [8]. The proposed APGAN transfers the person style from the source domain into the target domain while preserving person attributes such as clothing colors and dress patterns. Preserving these attributes guarantees sufficient variation among the embedded pedestrians.

The original CycleGAN considers only whether the generated sample looks realistic. In contrast, our proposed APGAN pursues both good visual quality and attribute preservation by introducing two extra losses. One is the Masked Reconstruction Loss (MR-Loss), which constrains the background as well as the attributes of the source persons to remain unchanged during style transfer. The other is the Total Variation Loss (TV-Loss), which enforces a spatially smooth color transformation of person images.
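
The loss formulas are not given in this excerpt; a minimal PyTorch sketch of how such terms are commonly written follows. The mask convention (1 where pixels may change, 0 where they must be preserved) and the relative weighting are our assumptions, not the paper's exact definitions.

```python
import torch

def masked_reconstruction_loss(source, translated, change_mask):
    """MR-Loss sketch: L1-penalize changes in the regions that should stay
    fixed during style transfer (background and attribute-bearing pixels).
    change_mask is 1 where pixels may change and 0 where they must not.
    """
    preserve = 1.0 - change_mask                                       # pixels to keep fixed
    per_pixel = (translated - source).abs().mean(dim=1, keepdim=True)  # L1 over channels
    return (preserve * per_pixel).sum() / preserve.sum().clamp(min=1.0)

def total_variation_loss(image):
    """TV-Loss: encourage spatially smooth color transformations by
    penalizing differences between neighbouring pixels."""
    dh = (image[..., 1:, :] - image[..., :-1, :]).abs().mean()  # vertical neighbours
    dw = (image[..., :, 1:] - image[..., :, :-1]).abs().mean()  # horizontal neighbours
    return dh + dw

# Toy usage: the two extra terms added to the usual CycleGAN objective.
src = torch.rand(2, 3, 64, 32)   # source person patches
gen = torch.rand(2, 3, 64, 32)   # APGAN-translated patches
msk = torch.zeros(2, 1, 64, 32)
msk[..., 16:48, 8:24] = 1.0      # region allowed to change
extra = masked_reconstruction_loss(src, gen, msk) + 0.1 * total_variation_loss(gen)
```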

In summary, this paper makes the following contributions:

  • A novel data augmentation scheme for pedestrian detection is proposed. It effectively increases the variation of training data by transferring persons from other datasets into the target scene.

  • We propose an efficient APGAN by introducing the novel Masked Reconstruction Loss into CycleGAN, achieving good visual quality as well as attribute preservation after style transfer.

  • Our approach consistently improves the performance of two representative pedestrian detectors, i.e., Adapted FasterRCNN [9] and Asymptotic Localization Fitting Networks (ALFNet) [10], especially in detecting small-scale and occluded pedestrians on the Cityscapes dataset [11]. It also enhances their generalization ability across different datasets, including Caltech [12], KITTI [13], INRIA [14], ETH [15], and TUD-Brussels [16].

The paper is organized as follows: Section 2 reviews related work on pedestrian detection, data augmentation, and image-to-image translation; Section 3 introduces the two steps of our data augmentation scheme, pedestrian embedding and style transfer; Section 4 presents the experimental results of our augmentation method on pedestrian detection; Section 5 concludes the paper.

Section snippets

Pedestrian detection

Recent works [9], [17], [18] on pedestrian detection are based on R-CNN [19], Fast R-CNN [20], Faster R-CNN [21], or customized architectures such as MS-CNN [22] and SA-FastRCNN [23]. Moreover, increasing research effort has been devoted to breaking the performance bottleneck of small-scale object detection. Lin et al. [24] develop a Feature Pyramid Network (FPN) to locate objects at all scales. Zhang et al. [25] propose a real-time Single Shot Scale-Invariant Face Detector (S3FD), which…

Methodology

To date, insufficient training data coverage remains a bottleneck for pedestrian detection in real-world applications. To resolve this issue, we propose a novel person transferable data augmentation scheme. As illustrated in Fig. 2, our proposed method includes two steps (a schematic sketch follows the list):

  • Pedestrian Embedding: Extract source pedestrians from other datasets and embed them into the target scene randomly.

  • Style Transfer: Crop the pedestrian patches from the embedding images, transfer their styles with APGAN, and…
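
Putting the two steps together, below is a schematic sketch of one augmentation pass (function names are placeholders; style_transfer stands in for the trained APGAN generator, and embed_pedestrian is reused from the sketch in the Introduction).

```python
import numpy as np

def augment_scene(scene, semantic_map, person_bank, style_transfer, rng, n_embed=3):
    """One pass of the two-step augmentation (schematic).

    Step 1, pedestrian embedding: paste randomly chosen source persons
    into the target scene. Step 2, style transfer: crop each embedded
    patch, map it into the target domain, and paste it back in place.
    """
    image, boxes = scene.copy(), []
    for _ in range(n_embed):
        crop, mask = person_bank[rng.integers(len(person_bank))]
        image, bbox = embed_pedestrian(image, semantic_map, crop, mask, rng)
        boxes.append(bbox)
    for x0, y0, x1, y1 in boxes:
        patch = image[y0:y1, x0:x1]
        image[y0:y1, x0:x1] = style_transfer(patch)  # APGAN generator stand-in
    return image, boxes                              # augmented image + new GT boxes
```

Setting style_transfer to the identity (lambda p: p) reduces the pass to plain pedestrian embedding, which is a handy sanity check when wiring up the pipeline.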

Experiments

In this section, we conduct experiments to validate the effectiveness of the proposed approach in two respects: boosting detection performance for small-scale and occluded pedestrians, and improving the generalization ability of pre-trained detectors. Specifically, we perform two groups of experiments: (1) transfer persons from MPII [43] and KITTI [13] to the target scene in CityPersons [9], and combine the augmented samples with the original training set of CityPersons to train the pedestrian…

Conclusion

This paper proposes a novel data augmentation method that tackles the problem of insufficient training data coverage by embedding source pedestrians into a target scene and transferring their style with an Attribute Preserving GAN. The experimental results show that our method can be combined with different pedestrian detectors and achieves substantial improvements in both detection performance and generalization ability.

CRediT authorship contribution statement

Songyan Liu: Conceptualization, Methodology, Investigation, Writing - original draft. Haiyun Guo: Validation, Formal analysis, Writing - review & editing. Jian-Guo Hu: Writing - review & editing. Xu Zhao: Validation, Formal analysis. Chaoyang Zhao: Validation. Tong Wang: Investigation. Yousong Zhu: Validation. Jinqiao Wang: Writing - review & editing. Ming Tang: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by National Natural Science Foundation of China (No. 61772527, 61806200 and 61976210), the Research and Development Projects in the Key Areas of Guangdong Province (No. 2019B010142002, 2019B010153001), and China Postdoctoral Science Foundation (No. 2019M660859).

References (50)

  • J. Zhu et al., Toward multimodal image-to-image translation, Proceedings of the Advances in Neural Information Processing Systems, 2017.

  • N. Dvornik et al., Modeling visual context is key to augmenting object detection datasets, Proceedings of the European Conference on Computer Vision (ECCV), 2018.

  • D. Dwibedi et al., Cut, paste and learn: surprisingly easy synthesis for instance detection, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

  • H.-S. Fang et al., InstaBoost: boosting instance segmentation via probability map guided copy-pasting, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.

  • Z. Zhong et al., Camera style adaptation for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  • W. Deng et al., Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  • L. Wei et al., Person transfer GAN to bridge domain gap for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  • J. Liu et al., Pose transferrable person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  • J. Zhu et al., Unpaired image-to-image translation using cycle-consistent adversarial networks, Proceedings of the International Conference on Computer Vision, 2017.

  • S. Zhang et al., CityPersons: a diverse dataset for pedestrian detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  • W. Liu et al., Learning efficient single-stage pedestrian detectors by asymptotic localization fitting, Proceedings of the European Conference on Computer Vision (ECCV), 2018.

  • M. Cordts et al., The Cityscapes dataset for semantic urban scene understanding, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

  • P. Dollár et al., Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell., 2012.

  • A. Geiger et al., Are we ready for autonomous driving? The KITTI vision benchmark suite, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.

  • N. Dalal et al., Histograms of oriented gradients for human detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005.

  • A. Ess et al., Depth and appearance for mobile scene analysis, Proceedings of the International Conference on Computer Vision, 2007.

  • C. Wojek et al., Multi-cue onboard pedestrian detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

  • J. Hosang et al., Taking a deeper look at pedestrians, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

  • S. Zhang et al., How far are we from solving pedestrian detection?, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

  • R. Girshick, Fast R-CNN, Proceedings of the International Conference on Computer Vision (ICCV), 2015.

  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, Proceedings of the Advances in Neural Information Processing Systems, 2015.

  • Z. Cai et al., A unified multi-scale deep convolutional neural network for fast object detection, Proceedings of the European Conference on Computer Vision, 2016.

  • J. Li, X. Liang, S. Shen, T. Xu, S. Yan, Scale-aware fast R-CNN for pedestrian detection, arXiv preprint...

  • T.Y. Lin, P. Dollár, R.B. Girshick, K. He, B. Hariharan, S.J. Belongie, Feature pyramid networks for object detection,...

Songyan Liu received the B.E. degree in 2015 from Southeast University, Nanjing, China. He has been pursuing a Ph.D. degree in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing, China, since 2015. His research interests include the analysis of deep learning networks and the application of generative adversarial networks.

Haiyun Guo received the B.E. degree from Wuhan University in 2013 and the Ph.D. degree in pattern recognition and intelligence systems from the Institute of Automation, University of Chinese Academy of Sciences, in 2018. She is currently an Assistant Researcher with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. Her current research interests include pattern recognition and machine learning, image and video processing, and intelligent video surveillance.

Jian-Guo Hu received the B.S. and M.S. degrees from the National University of Defense Technology in 2000 and 2004, respectively, and the Ph.D. degree in communication and information systems from the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China, in 2010. He is currently a professor with the School of Microelectronics Science and Technology, Sun Yat-sen University, and the director of the Development Research Institute of Guangzhou Smart City. He is a leading talent in science and technology under the "Special Support Plan" of Guangdong Province, the leader of the Innovation Leading Team of Guangzhou, and an outstanding expert of Guangzhou. He is also the director of the Guangdong Internet of Things Chip and System Application Engineering Center, the Guangdong Biological Identification Chip and System Engineering Technology Research Center, and the Guangzhou Key Laboratory of Internet of Things Identification and Perception Chip.

Xu Zhao received the B.E. degree in 2014 from Dalian University of Technology and the Ph.D. degree in pattern recognition and intelligence systems from the Institute of Automation, Chinese Academy of Sciences and University of Chinese Academy of Sciences in 2019. He is currently an assistant researcher in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include object detection, scene text detection, image and video processing, and intelligent video surveillance.

Chaoyang Zhao received the B.E. degree and the M.S. degree in 2009 and 2012, respectively, from the University of Electronic Science and Technology of China. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2016. He is currently an Assistant Professor in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include object detection, image and video processing, and intelligent video surveillance.

Tong Wang received the B.E. degree in 2017 from Nankai University, Tianjin, China. He has been pursuing a Ph.D. degree in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, since 2017. His research interests include object detection, image and video processing, and intelligent video surveillance.

Yousong Zhu received the B.E. degree from Central South University in 2014 and the Ph.D. degree in pattern recognition and intelligence systems from the Institute of Automation, Chinese Academy of Sciences and University of Chinese Academy of Sciences in 2019. He is currently an assistant researcher in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include object detection, video object detection, pattern recognition and machine learning, and intelligent video surveillance.

Jinqiao Wang received the B.E. degree in 2001 from Hebei University of Technology, China, and the M.S. degree in 2004 from Tianjin University, China. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2008. He is currently a Professor with the Chinese Academy of Sciences. His research interests include pattern recognition and machine learning, image and video processing, mobile multimedia, and intelligent video surveillance.

Ming Tang received the B.S. degree in computer science and engineering and the M.S. degree in artificial intelligence from Zhejiang University, Hangzhou, China, in 1984 and 1987, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from the Chinese Academy of Sciences, Beijing, China, in 2002. He is currently a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include computer vision and machine learning.
