Neurocomputing

Volume 401, 11 August 2020, Pages 123-132

A novel data augmentation scheme for pedestrian detection with attribute preserving GAN

https://doi.org/10.1016/j.neucom.2020.02.094

Abstract

Pedestrian detection has progressed significantly in recent years. However, detecting small-scale or heavily occluded pedestrians remains notoriously difficult. Moreover, the generalization ability of pre-trained detectors across different datasets still needs improvement. Both issues can be attributed to insufficient training data coverage. To cope with this, we present an efficient data augmentation scheme that transfers pedestrians from other datasets into the target scene with a novel Attribute Preserving Generative Adversarial Network (APGAN). The proposed methodology consists of two steps: pedestrian embedding and style transfer. The former simulates pedestrian images at various scales and occlusion levels, with arbitrary poses and backgrounds, thus greatly increasing data variation. The latter aims to make the generated samples more realistic while guaranteeing data coverage. To achieve this goal, we propose APGAN, which pursues both good visual quality and attribute preservation after style transfer. With the proposed method, we can generate effective augmented samples that improve the generalization ability of the trained detectors and enhance their robustness to scale change and occlusion. Extensive experimental results validate the effectiveness and advantages of our method.

Introduction

Pedestrian detection has made great progress in recent years. Performance on public datasets appears promising; however, several challenges remain unresolved. First, pedestrian detection in real applications must handle complex lighting conditions, background changes, pose variations, occlusions, and scale changes. Public datasets cover only limited data variation, so existing methods may struggle with these complex situations, especially with small-scale and occluded pedestrians. Moreover, the domain gap between public datasets makes pre-trained detectors generalize poorly across different datasets. As illustrated in Fig. 1, detectors trained on typical training data can hardly deal with the above challenges.

Apart from algorithmic deficiencies, insufficient training sample coverage is another important cause of unsatisfactory detection performance, particularly for methods based on deep learning. However, collecting pedestrian samples that exhaustively cover all variations is infeasible in practice. Therefore, many researchers resort to data augmentation strategies that increase data coverage by making full use of the available training data. Common data augmentation methods include random cropping, color jittering, random deformation, etc. However, these methods introduce only limited data variation, so they improve performance slightly. Recently, several works have proposed crop-and-paste data augmentation schemes for object detection [1], [2] and instance segmentation [3]; that is, cropping object foregrounds and pasting them into the target scene by following certain rules. However, these methods do not consider whether the pasted patches fit the target scene. As a result, the augmented samples may look unrealistic and hinder model learning. Meanwhile, many works [4], [5], [6], [7] leverage Generative Adversarial Networks (GANs) to conduct domain adaptation for the person re-identification problem. They can simulate the target resolution and illumination conditions to some extent, but can hardly generate pedestrian samples at novel scales or with novel occlusion patterns.
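
To make the crop-and-paste idea concrete, below is a minimal sketch of the naive scheme (the helper name and interface are ours; the cited works [1], [2], [3] use more elaborate placement rules). Note that nothing here checks whether the patch fits the target scene, which is precisely the drawback pointed out above.

```python
import numpy as np

def crop_and_paste(scene, person_crop, person_mask, top_left):
    """Paste a cropped person into a scene at a given position.

    scene:       H x W x 3 uint8 target image
    person_crop: h x w x 3 uint8 pedestrian patch
    person_mask: h x w boolean foreground mask of the patch
    top_left:    (y, x) paste position, assumed to lie fully inside the scene
    """
    out = scene.copy()
    y, x = top_left
    h, w = person_mask.shape
    region = out[y:y + h, x:x + w]                  # view into the copy
    region[person_mask] = person_crop[person_mask]  # overwrite foreground pixels only
    return out
```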

To tackle the above issues, this paper proposes a novel person transferable data augmentation approach for pedestrian detection. As shown in Fig. 2, it involves two stages: (1) embed persons from other datasets into the target scene randomly, guided by the scene semantics; (2) crop pedestrian patches from the pedestrian embedding images, transfer their style into the target domain, and embed them back to obtain the generated training samples. The newly generated images and labels can be combined with the original training data for pedestrian detector training. Pedestrian embedding has two main benefits. First, it significantly increases the diversity of pedestrian samples, improving the generalization ability of the detectors. Second, we can simulate specific augmentation targets (e.g., occlusion, small scale) during the embedding stage to boost detection performance in such special situations.
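
This excerpt does not spell out the embedding rules, so the sketch below illustrates one plausible semantic-guided placement: sample a foot point on a walkable surface and scale the person by its distance to the horizon. The WALKABLE label ids, the horizon heuristic, and all function names are our assumptions, and crop_and_paste is reused from the previous sketch.

```python
import numpy as np

WALKABLE = (0, 1)  # hypothetical label ids for road and sidewalk

def embed_pedestrian(scene, semantic_map, crop, mask, rng, horizon_y=200):
    """Embed one source pedestrian into the target scene (schematic).

    The foot point is drawn from walkable pixels of the semantic map, and
    the crop is rescaled so that feet closer to the horizon yield a smaller
    person -- a simple perspective heuristic, not the paper's exact rule.
    """
    ys, xs = np.nonzero(np.isin(semantic_map, WALKABLE))
    i = rng.integers(len(ys))                    # random walkable foot point
    fy, fx = int(ys[i]), int(xs[i])
    scale = max(0.1, (fy - horizon_y) / (scene.shape[0] - horizon_y))
    h = max(1, int(crop.shape[0] * scale))
    w = max(1, int(crop.shape[1] * scale))
    yi = np.arange(h) * crop.shape[0] // h       # nearest-neighbour resize,
    xi = np.arange(w) * crop.shape[1] // w       # keeps the sketch dependency-free
    y0, x0 = fy - h, fx - w // 2                 # assumes the spot leaves room for the crop
    out = crop_and_paste(scene, crop[yi][:, xi], mask[yi][:, xi], (y0, x0))
    return out, (x0, y0, x0 + w, y0 + h)         # augmented image + new bounding box
```

Because the embedding is synthetic, the bounding box of each inserted person is known exactly, so labels for the augmented images come for free.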

To make the generated samples look more realistic while guaranteeing data variation, we propose APGAN in Stage 2, a novel variant of CycleGAN [8]. The proposed APGAN transfers the person style from the source domain into the target domain while preserving person attributes such as clothing colors and dress patterns. Preserving these attributes guarantees sufficient variation among the embedded pedestrians.

The original CycleGAN considers only whether the generated sample looks realistic. In contrast, our proposed APGAN pursues both good visual quality and attribute preservation by introducing two extra losses. One is the Masked Reconstruction Loss (MR-Loss), which constrains the background as well as the attributes of the source persons to remain unchanged during style transfer. The other is the Total Variation Loss (TV-Loss), which enforces a spatially smooth color transformation of person images.
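
The loss formulas are not given in this excerpt; a minimal PyTorch sketch of how such terms are commonly written follows. The mask convention (1 where pixels may change, 0 where they must be preserved) and the relative weighting are our assumptions, not the paper's exact definitions.

```python
import torch

def masked_reconstruction_loss(source, translated, change_mask):
    """MR-Loss sketch: L1-penalize changes in the regions that should stay
    fixed during style transfer (background and attribute-bearing pixels).
    change_mask is 1 where pixels may change and 0 where they must not.
    """
    preserve = 1.0 - change_mask                                       # pixels to keep fixed
    per_pixel = (translated - source).abs().mean(dim=1, keepdim=True)  # L1 over channels
    return (preserve * per_pixel).sum() / preserve.sum().clamp(min=1.0)

def total_variation_loss(image):
    """TV-Loss: encourage spatially smooth color transformations by
    penalizing differences between neighbouring pixels."""
    dh = (image[..., 1:, :] - image[..., :-1, :]).abs().mean()  # vertical neighbours
    dw = (image[..., :, 1:] - image[..., :, :-1]).abs().mean()  # horizontal neighbours
    return dh + dw

# Toy usage: the two extra terms added to the usual CycleGAN objective.
src = torch.rand(2, 3, 64, 32)   # source person patches
gen = torch.rand(2, 3, 64, 32)   # APGAN-translated patches
msk = torch.zeros(2, 1, 64, 32)
msk[..., 16:48, 8:24] = 1.0      # region allowed to change
extra = masked_reconstruction_loss(src, gen, msk) + 0.1 * total_variation_loss(gen)
```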

In summary, this paper makes the following contributions:

  • A novel data augmentation scheme for pedestrian detection is proposed. It effectively increases the variation of training data by transferring persons from other datasets into the target scene.

  • We propose an efficient APGAN by introducing the novel Masked Reconstruction Loss into CycleGAN, achieving good visual quality as well as attribute preservation after style transfer.

  • Our approach consistently improves the performance of two representative pedestrian detectors, i.e., Adapted FasterRCNN [9] and Asymptotic Localization Fitting Networks (ALFNet) [10], especially in detecting small-scale and occluded pedestrians on the Cityscapes dataset [11]. It also enhances their generalization ability across different datasets, including Caltech [12], KITTI [13], INRIA [14], ETH [15], and TUD-Brussels [16].

The paper is organized as follows: Section 2 reviews related work on pedestrian detection, data augmentation, and image-to-image translation; Section 3 introduces the two steps of our data augmentation scheme, pedestrian embedding and style transfer; Section 4 presents the experimental results of our augmentation method on pedestrian detection; Section 5 concludes the paper.

Section snippets

Pedestrian detection

Recent works [9], [17], [18] on pedestrian detection are based on R-CNN [19], Fast R-CNN [20], Faster R-CNN [21], or customized architectures such as MS-CNN [22] and SA-FastRCNN [23]. Moreover, increasing research effort has been devoted to breaking the performance bottleneck of small-scale object detection. Lin et al. [24] develop a Feature Pyramid Network (FPN) to locate objects at all scales. Zhang et al. [25] propose a real-time Single Shot Scale-Invariant Face Detector (S3FD), which…

Methodology

To date, insufficient training data coverage remains a bottleneck for pedestrian detection in real-world applications. To resolve this issue, we propose a novel person transferable data augmentation scheme. As illustrated in Fig. 2, our proposed method includes two steps (a schematic sketch follows the list):

  • Pedestrian Embedding: Extract source pedestrians from other datasets and embed them into the target scene randomly.

  • Style Transfer: Crop the pedestrian patches from the embedding images, transfer their styles with APGAN, and…
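
Putting the two steps together, below is a schematic sketch of one augmentation pass (function names are placeholders; style_transfer stands in for the trained APGAN generator, and embed_pedestrian is reused from the sketch in the Introduction).

```python
import numpy as np

def augment_scene(scene, semantic_map, person_bank, style_transfer, rng, n_embed=3):
    """One pass of the two-step augmentation (schematic).

    Step 1, pedestrian embedding: paste randomly chosen source persons
    into the target scene. Step 2, style transfer: crop each embedded
    patch, map it into the target domain, and paste it back in place.
    """
    image, boxes = scene.copy(), []
    for _ in range(n_embed):
        crop, mask = person_bank[rng.integers(len(person_bank))]
        image, bbox = embed_pedestrian(image, semantic_map, crop, mask, rng)
        boxes.append(bbox)
    for x0, y0, x1, y1 in boxes:
        patch = image[y0:y1, x0:x1]
        image[y0:y1, x0:x1] = style_transfer(patch)  # APGAN generator stand-in
    return image, boxes                              # augmented image + new GT boxes
```

Setting style_transfer to the identity (lambda p: p) reduces the pass to plain pedestrian embedding, which is a handy sanity check when wiring up the pipeline.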

Experiments

In this section, we conduct experiments to validate the effectiveness of the proposed approach in two respects: boosting detection performance for small-scale and occluded pedestrians, and improving the generalization ability of pre-trained detectors. Specifically, we perform two groups of experiments: (1) transfer persons from MPII [43] and KITTI [13] to the target scene in CityPersons [9], and combine the augmented samples with the original training set of CityPersons to train the pedestrian…

Conclusion

This paper proposes a novel data augmentation method that tackles the problem of insufficient training data coverage by embedding source pedestrians into a target scene and transferring their style with an Attribute Preserving GAN. The experimental results show that our method can be combined with different pedestrian detectors and achieves substantial improvements in both detection performance and generalization ability.

CRediT authorship contribution statement

Songyan Liu: Conceptualization, Methodology, Investigation, Writing - original draft. Haiyun Guo: Validation, Formal analysis, Writing - review & editing. Jian-Guo Hu: Writing - review & editing. Xu Zhao: Validation, Formal analysis. Chaoyang Zhao: Validation. Tong Wang: Investigation. Yousong Zhu: Validation. Jinqiao Wang: Writing - review & editing. Ming Tang: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by National Natural Science Foundation of China (No. 61772527, 61806200 and 61976210), the Research and Development Projects in the Key Areas of Guangdong Province (No. 2019B010142002, 2019B010153001), and China Postdoctoral Science Foundation (No. 2019M660859).

References (50)

  • J. Zhu et al., Toward multimodal image-to-image translation, Proceedings of the Advances in Neural Information Processing Systems, 2017.

  • N. Dvornik et al., Modeling visual context is key to augmenting object detection datasets, Proceedings of the European Conference on Computer Vision (ECCV), 2018.

  • D. Dwibedi et al., Cut, paste and learn: surprisingly easy synthesis for instance detection, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.

  • H.-S. Fang et al., InstaBoost: boosting instance segmentation via probability map guided copy-pasting, Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.

  • Z. Zhong et al., Camera style adaptation for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  • W. Deng et al., Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  • L. Wei et al., Person transfer GAN to bridge domain gap for person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  • J. Liu et al., Pose transferrable person re-identification, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

  • J. Zhu et al., Unpaired image-to-image translation using cycle-consistent adversarial networks, Proceedings of the International Conference on Computer Vision, 2017.

  • S. Zhang et al., CityPersons: a diverse dataset for pedestrian detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  • W. Liu et al., Learning efficient single-stage pedestrian detectors by asymptotic localization fitting, Proceedings of the European Conference on Computer Vision (ECCV), 2018.

  • M. Cordts et al., The Cityscapes dataset for semantic urban scene understanding, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

  • P. Dollár et al., Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell., 2012.

  • A. Geiger et al., Are we ready for autonomous driving? The KITTI vision benchmark suite, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.

  • N. Dalal et al., Histograms of oriented gradients for human detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005.

  • A. Ess et al., Depth and appearance for mobile scene analysis, Proceedings of the International Conference on Computer Vision, 2007.

  • C. Wojek et al., Multi-cue onboard pedestrian detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

  • J. Hosang et al., Taking a deeper look at pedestrians, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

  • S. Zhang et al., How far are we from solving pedestrian detection?, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.

  • R. Girshick, Fast R-CNN, Proceedings of the International Conference on Computer Vision (ICCV), 2015.

  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, Proceedings of the Advances in Neural Information Processing Systems, 2015.

  • Z. Cai et al., A unified multi-scale deep convolutional neural network for fast object detection, Proceedings of the European Conference on Computer Vision, 2016.

  • J. Li, X. Liang, S. Shen, T. Xu, S. Yan, Scale-aware fast R-CNN for pedestrian detection, arXiv preprint...

  • T.Y. Lin, P. Dollár, R.B. Girshick, K. He, B. Hariharan, S.J. Belongie, Feature pyramid networks for object detection,...

Songyan Liu received the B.E. degree in 2015 from Southeast University, Nanjing, China. He has been pursuing a Ph.D. degree in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing, China, since 2015. His research interests include the analysis of deep learning networks and the application of generative adversarial networks.

Haiyun Guo received the B.E. degree from Wuhan University in 2013 and the Ph.D. degree in pattern recognition and intelligence systems from the Institute of Automation, University of Chinese Academy of Sciences, in 2018. She is currently an Assistant Researcher with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. Her current research interests include pattern recognition and machine learning, image and video processing, and intelligent video surveillance.

Jian-Guo Hu received the B.S. and M.S. degrees from the National University of Defense Technology in 2000 and 2004, respectively, and the Ph.D. degree in communication and information systems from the School of Information Science and Technology, Sun Yat-sen University, Guangzhou, China, in 2010. He is currently a professor with the School of Microelectronics Science and Technology, Sun Yat-sen University, and the director of the Development Research Institute of Guangzhou Smart City. He is a leading talent in science and technology under the "Special Support Plan" of Guangdong Province, the leader of the Innovation Leading Team of Guangzhou, and an outstanding expert of Guangzhou. He is also the director of the Guangdong Internet of Things Chip and System Application Engineering Center, the Guangdong Biological Identification Chip and System Engineering Technology Research Center, and the Guangzhou Key Laboratory of Internet of Things Identification and Perception Chip.

Xu Zhao received the B.E. degree in 2014 from Dalian University of Technology and the Ph.D. degree in pattern recognition and intelligence systems from the Institute of Automation, Chinese Academy of Sciences and University of Chinese Academy of Sciences in 2019. He is currently an assistant researcher in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include object detection, scene text detection, image and video processing, and intelligent video surveillance.

Chaoyang Zhao received the B.E. degree and the M.S. degree in 2009 and 2012, respectively, from the University of Electronic Science and Technology of China. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2016. He is currently an Assistant Professor in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His research interests include object detection, image and video processing, and intelligent video surveillance.

Tong Wang received the B.E. degree in 2017 from Nankai University, Tianjin, China. He has been pursuing a Ph.D. degree in pattern recognition and intelligence systems at the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, since 2017. His research interests include object detection, image and video processing, and intelligent video surveillance.

Yousong Zhu received the B.E. degree from Central South University in 2014 and the Ph.D. degree in pattern recognition and intelligence systems from the Institute of Automation, Chinese Academy of Sciences and University of Chinese Academy of Sciences in 2019. He is currently an assistant researcher in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include object detection, video object detection, pattern recognition and machine learning, and intelligent video surveillance.

Jinqiao Wang received the B.E. degree in 2001 from Hebei University of Technology, China, and the M.S. degree in 2004 from Tianjin University, China. He received the Ph.D. degree in pattern recognition and intelligence systems from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, in 2008. He is currently a Professor with the Chinese Academy of Sciences. His research interests include pattern recognition and machine learning, image and video processing, mobile multimedia, and intelligent video surveillance.

Ming Tang received the B.S. degree in computer science and engineering and the M.S. degree in artificial intelligence from Zhejiang University, Hangzhou, China, in 1984 and 1987, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from the Chinese Academy of Sciences, Beijing, China, in 2002. He is currently a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include computer vision and machine learning.
