Neurocomputing, Volume 394, 21 June 2020, Pages 114-126

A parallel vision approach to scene-specific pedestrian detection

https://doi.org/10.1016/j.neucom.2019.03.095

Abstract

In recent years, with the growth of computing power and the development of deep learning algorithms, pedestrian detection has made great progress. Nevertheless, once a detection model trained on generic datasets (such as PASCAL VOC and MS COCO) is applied to a specific scene, its precision is limited by the distribution gap between the generic data and the specific scene data. It is difficult to train the model for a specific scene due to the lack of labeled data from that scene. Even if we manage to obtain some labeled data from a specific scene, changing environmental conditions cause the pre-trained model to perform poorly. In light of these issues, we propose a parallel vision approach to scene-specific pedestrian detection. Given an object detection model, it is trained in two sequential stages: (1) the model is pre-trained on augmented-reality data to address the lack of scene-specific training data; (2) the pre-trained model is incrementally optimized with newly synthesized data as the specific scene evolves over time. On publicly available datasets, our approach achieves higher precision than models trained on generic data. To tackle dynamically changing scenes, we further evaluate our approach on webcam data collected from Church Street Market Place, and the results are also encouraging.

Introduction

Scene-specific pedestrian detection, with cameras fixed in the scene, plays an important role in the surveillance of traffic and other public spaces [1], [2]. Fig. 1 shows some application scenes of scene-specific pedestrian detection. In recent years, with the widespread deployment of the Internet of Things, webcams have been installed all over the world. The large amounts of video data captured by these webcams can be analyzed for traffic flow prediction, smart communities, and public safety. As the basis of many such applications, pedestrian detection [3], [4], [5] has made great progress with the development of deep learning [6] together with big data and parallel computing.

So far, much work has been done on pedestrian detection, from traditional methods [3], [7] to deep-learning-based models [8], [9]. However, pedestrian detection in specific scenes with fixed cameras remains challenging, for several reasons. First, compared to other classes of objects, pedestrians are often of low resolution, making it difficult for a detection model to extract the effective features that are critical to discriminate pedestrians from background [10], [11]. At low resolution, complex scenes often produce hard negative samples, e.g., traffic signs in traffic scenes, plastic mannequins in shop windows, and pillar boxes in general street scenes. Second, pedestrians may vary significantly in scale and shape due to different capturing angles and distances. For detecting pedestrians of different scales, Zhang et al. [11] propose an approach that makes accurate predictions of pedestrian locations across multi-layer feature representations. Third, varied illumination and weather conditions make pedestrian detection even more challenging. To make pedestrian detection robust to adverse conditions, Xu et al. [12] train a cross-modal deep model that combines RGB images with thermal images, which inevitably increases the cost. Last but not least, scene-specific labeled data for training a pedestrian detection model for a given scene are lacking. A simple but widely exploited way to handle this problem is to train the model on generic datasets. However, detection accuracy drops significantly [13] once such a pre-trained model is applied directly to a specific scene, due to the distribution difference between the generic training data and the specific scene caused by inconsistencies in geometric layout, illumination, camera field of view, and other factors.

The biggest difference between pedestrian detection with a fixed camera and on a mobile platform is whether the background layout changes. For a mobile platform, the background layout changes unpredictably and rapidly, so one must use generic data with diverse background layouts to train a robust detection model. For pedestrian detection in a specific scene, in contrast, the background layout is relatively constant and rather different from the background layouts covered by generic data. In this case, we naturally expect to exploit the background layout information of the specific scene to improve the accuracy of pedestrian detection.
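
To make this idea concrete, the sketch below (our own illustration, not the method proposed in this paper) shows one simple way a fixed camera's stable background can be exploited: a per-pixel background model yields foreground masks that can suppress detector false positives lying on pure background, such as the hard negatives mentioned above.

```python
# A minimal sketch, assuming OpenCV is available; this illustrates the idea
# of exploiting a static background and is not the paper's method.
import cv2
import numpy as np

# MOG2 maintains a per-pixel Gaussian-mixture model of the static background.
bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

def foreground_ratio(mask, box):
    """Fraction of foreground pixels inside a detection box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    roi = mask[y1:y2, x1:x2]
    return float(np.count_nonzero(roi == 255)) / max(roi.size, 1)

# Per frame: update the background model once, then score each detection.
# 'frame' comes from the fixed camera; 'detections' is a list of
# (box, score) pairs from any detector (both hypothetical variables here).
# mask = bg_model.apply(frame)  # 255 = foreground, 127 = shadow, 0 = background
# kept = [(b, s) for b, s in detections if foreground_ratio(mask, b) > 0.2]
```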

Even in the same scene, the illumination and other conditions change gradually over time, so a model pre-trained on a scene-specific dataset may perform poorly in practical application. Because the models in existing scene-specific monitoring systems are trained offline and then applied to the real scene, the real scene may have already changed by the time the models are put to work. As a result, pre-trained models for the specific scene will fail to follow the scene changes and produce bad results.

All the problems mentioned above can be addressed by reasonably training and updating the model. However, the lack of training data is a serious obstacle, considering the tremendous effort required to collect and annotate scene-specific data. It is time-consuming and labor-intensive to compile training data manually for every specific scene.

To deal with the problems of 1) the lack of scene-specific labeled data for training and 2) the failure of pre-trained models in a changing scene, we propose a parallel vision approach to scene-specific pedestrian detection [14]. The theoretical framework of parallel vision was proposed by Wang et al. [14], who extended the parallel system theory and ACP methodology [15], [16] to the computer vision field for accurate perception and understanding of complex scenes [17]. ACP can be expressed as a trilogy: ACP = Artificial Systems + Computational Experiments + Parallel Execution. While [17] is a review article that builds a parallel vision framework for perception and understanding of complex scenes without experimental validation, this work differs in that it proposes concrete methods to realize the ACP trilogy, together with detailed experimental results.

For parallel vision, the ACP trilogy serves as a unity. Through the ACP trilogy, the system builds artificial scenes as proxies of the real scene for algorithm design and evaluation. The artificial and real spaces together compose the complete “complex space” in which the complex problems of scene understanding are solved. As the real scene changes, the artificial scenes change synchronously. From the real scene, much information can be obtained to build the artificial scenes. From the artificial scenes, in turn, a large amount of data can be collected, on which controllable and repeatable computational experiments can be conducted to design and evaluate vision algorithms. As for parallel execution, the vision model works in the real scene and the artificial scenes concurrently, and the artificial scenes are updated gradually according to the changing real scene so that the vision model can be optimized online.

Recently, with the development of computer graphics, it has become feasible to build realistic artificial scenes from 3D object models. Based on the parallel vision theory, the artificial scenes reflect various aspects of the real scene and can simulate many realistic appearances, some of which are rare yet plausible in the real world. From the artificial scenes, a large amount of labeled synthetic data can be obtained to train the vision model. Meanwhile, in the process of parallel execution, the artificial scenes are kept consistent with the real scene, and the model is continuously updated.

In this paper, we deal with the problem of long-term pedestrian detection in specific scenes, which, as discussed above, is difficult to handle with generic models. Fig. 2 shows the framework of our proposed scene-specific pedestrian detection approach based on parallel vision; the details can be found in Section 3. To realize the ACP trilogy, we construct artificial scenes corresponding to the real scene. We build up the artificial scene of the specific scene with augmented reality and keep it up to date with the real scene by changing the scene background at regular time intervals, together with artificial illumination and weather conditions. From the artificial scenes, we can collect massive amounts of automatically labeled data. Through computational experiments, scene-specific pedestrian detection models are trained and applied to the real scene. Finally, we make the system execute in the artificial scene and the real scene concurrently and collect new training data from the artificial scene. With more training data available, we continuously fine-tune the vision model to make it work better in the real scene.
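
As a schematic summary, the sketch below captures this two-stage pipeline in pseudocode form. Every interface in it (the detector, artificial scene, and camera objects and their method names) is a hypothetical placeholder standing in for the components of Fig. 2, not this paper's actual implementation.

```python
# A schematic sketch of the two-stage parallel vision pipeline; all object
# interfaces are hypothetical placeholders, not the actual implementation.
import time

def parallel_vision_loop(detector, artificial_scene, camera,
                         sync_interval=3600.0):
    # Stage 1: pre-train on automatically labeled synthetic data rendered
    # from the artificial scene built for this specific camera view.
    detector.train(artificial_scene.render_labeled_frames(10000))

    # Stage 2: parallel execution. The detector runs on the real stream
    # while the artificial scene tracks the real one, and freshly
    # synthesized data drive incremental fine-tuning.
    last_sync = time.time()
    for frame in camera:
        yield frame, detector.detect(frame)  # hand results to the application

        if time.time() - last_sync >= sync_interval:
            # Re-capture the background and match illumination/weather, then
            # fine-tune on newly synthesized, automatically labeled data.
            artificial_scene.sync_with(frame)
            detector.fine_tune(artificial_scene.render_labeled_frames(1000))
            last_sync = time.time()
```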

We evaluate the utility of synthetic data from the artificial scene on public datasets and then apply our proposed approach to a real 24-hour dataset collected from a webcam. All the results show that our parallel vision approach to scene-specific pedestrian detection works well; the experimental details are presented in Section 4.

In this paper, three contributions are made: 1) we validate the effectiveness of synthetic data in scene-specific pedestrian detection; 2) we propose an online learning system for scene-specific pedestrian detection based on the parallel vision theory; 3) we create a new validation dataset from the webcam of Church Street Market Place for research on long-term scene-specific pedestrian detection. This dataset will be released at our group website (http://openpv.cn).

The remainder of this paper is organized as follows: Section 2 describes the related work. The details of the parallel monitoring framework are described in Section 3. In Section 4, the validation data are presented together with the experimental evaluation. The conclusion is drawn in Section 5.

Related work

As part of parallel vision, the construction of artificial scenes is important. In artificial scenes, we are able to change the appearance and motion of target objects and generate large amounts of labeled data, just as if collecting data from the real world. Based on such virtual data, data-driven vision models [18] can be trained.

As Bainbridge [19] pointed out, video game engines allow the virtual world to simulate the complex physical world, providing a new space for scientific research.

Proposed approach

As shown in Fig. 2, the proposed scene-specific pedestrian detection approach can be regarded as an online learning system, based on the parallel vision theory, that addresses the lack of training data and the scene changes encountered in visual monitoring applications.

The whole framework consists of 1) the artificial virtual scene, 2) the real scene, 3) the model training module, and 4) environmental perception and understanding. Based on the real scene, the artificial virtual scene can be easily built up as a peer …
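
One hypothetical way to read these four modules as software interfaces is sketched below; the Protocol names and method signatures are our own illustration of the framework's structure, not code from the paper.

```python
# A structural sketch of the four modules as Python Protocols; the names
# and signatures are a hypothetical reading of the framework in Fig. 2.
from typing import Iterable, List, Protocol, Tuple

Box = Tuple[int, int, int, int]     # (x1, y1, x2, y2) pedestrian box
Labeled = Tuple[object, List[Box]]  # (image, ground-truth boxes)

class RealScene(Protocol):
    """The fixed-camera view being monitored."""
    def frames(self) -> Iterable[object]: ...

class ArtificialScene(Protocol):
    """A virtual peer of the real scene, kept synchronized with it."""
    def sync_with(self, frame: object) -> None: ...
    def render_labeled_frames(self, n: int) -> List[Labeled]: ...

class ModelTraining(Protocol):
    """Pre-trains and incrementally fine-tunes the detector on synthetic data."""
    def train(self, data: List[Labeled]) -> None: ...
    def fine_tune(self, data: List[Labeled]) -> None: ...

class Perception(Protocol):
    """Applies the current detector to real frames."""
    def detect(self, frame: object) -> List[Box]: ...
```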

Experimental evaluation

In this section, we first train models on synthetic datasets collected from the virtual scenes and evaluate them on public datasets offline. Then we conduct online optimization experiments on the long-term datasets. Through the offline experiments and online optimization, we evaluate the effectiveness of our proposed application of parallel vision [14] and the ACP methodology [15] to scene-specific pedestrian detection.
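
For reference, the sketch below shows the standard IoU-based matching that underlies such detection evaluations; it is an illustrative implementation of the common protocol, with default thresholds, and is not the paper's own evaluation code.

```python
# Illustrative IoU-based detection scoring; the 0.5 threshold and greedy
# matching are common conventions, not necessarily those used in this paper.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truth, iou_thr=0.5):
    """Greedily match detections, given as (box, score) pairs, to GT boxes."""
    matched, unmatched_gt = 0, list(ground_truth)
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best = max(unmatched_gt, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= iou_thr:
            matched += 1
            unmatched_gt.remove(best)
    precision = matched / len(detections) if detections else 0.0
    recall = matched / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```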

Conclusion and future work

We propose a parallel vision approach to pedestrian detection that generates large amounts of synthetic data by building a virtual world and adapts the generic model to specific scenes. The experimental results show that synthetic data are able to train a scene-specific detector without real labeled data. Furthermore, we apply the parallel vision theory to long-term scene-specific pedestrian detection and keep the model updated through parallel execution of the system in both the real scene …

Declarations of interest

None.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (61533019, U1811463).

References

  • B. Sun et al., From virtual to reality: fast adaptation of virtual object detectors to real domains, BMVC, 2014.
  • W. Zhang et al., Scene-specific pedestrian detection based on parallel vision, IEEE International Conference on Intelligent Transportation Systems, 2017.
  • K. Wang et al., A multi-view learning approach to foreground detection for traffic surveillance applications, IEEE Trans. Veh. Technol., 2016.
  • N. Dalal et al., Histograms of oriented gradients for human detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
  • P. Felzenszwalb et al., A discriminatively trained, multiscale, deformable part model, IEEE Conference on Computer Vision and Pattern Recognition, 2008.
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, International Conference on Neural Information Processing Systems, 2015.
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, 2012.
  • P.F. Felzenszwalb et al., Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell., 2010.
  • L. Zhang et al., Is Faster R-CNN doing well for pedestrian detection?, European Conference on Computer Vision, 2016.
  • J. Li et al., Scale-aware Fast R-CNN for pedestrian detection, IEEE Trans. Multim., 2018.
  • J. Mao et al., What can help pedestrian detection?, IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • X. Zhang et al., Too far to see? Not really! Pedestrian detection with scale-aware localization policy, IEEE Trans. Image Process., 2018.
  • D. Xu, W. Ouyang, E. Ricci, X. Wang, N. Sebe, Learning cross-modal deep representations for robust pedestrian detection, ...
  • J. Ferryman, A. Shahrokni, An overview of the PETS 2009 challenge, ...
  • K. Wang et al., Parallel vision: an ACP-based approach to intelligent vision computing, Acta Autom. Sin., 2016.
  • F.-Y. Wang, Parallel system methods for management and control of complex systems, Control Decis., 2004.
  • F.-Y. Wang, The emergence of intelligent enterprises: from CPS to CPSS, IEEE Intell. Syst., 2010.
  • K. Wang et al., Parallel vision for perception and understanding of complex scenes: methods, framework, and perspectives, Artif. Intell. Rev., 2017.
  • I. Goodfellow et al., Deep Learning, 2016.
  • W.S. Bainbridge, The scientific research potential of virtual worlds, Science, 2007.
  • H. Prendinger et al., Tokyo Virtual Living Lab: designing smart cities based on the 3D internet, IEEE Internet Comput., 2013.
  • I. Karamouzas et al., Simulating and evaluating the local behavior of small pedestrian groups, IEEE Trans. Vis. Comput. Graph., 2012.
  • F. Qureshi et al., Smart camera networks in virtual reality, Proc. IEEE, 2008.
  • W. Starzyk et al., Software laboratory for camera networks research, IEEE J. Emerg. Sel. Top. Circuits Syst., 2013.

Wenwen Zhang received his bachelor degree in software engineering from Xi'an Jiaotong University, Xi'an, China, in 2014. He is currently a Ph.D. student in the School of Software Engineering, Xi'an Jiaotong University, as well as the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences. His research interests include intelligent visual surveillance and deep learning.

Kunfeng Wang received his Ph.D. degree in control theory and control engineering from the Graduate University of Chinese Academy of Sciences, Beijing, China, in 2008. From December 2015 to January 2017, he was a Visiting Scholar at the School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA, USA. He is currently an Associate Professor at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences. His research interests include intelligent transportation systems, intelligent vision computing, and machine learning.

    Yating Liu received her B.Eng. degree from the Civil Aviation University of China in 2014. She is currently a Ph.D. student at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences as well as University of Chinese Academy of Sciences. Her research interests include visual object detection and tracking, machine learning, and intelligent transportation systems.

    Yue Lu received his bachelor degree in automation from Tongji University, Shanghai, China, in 2016. He is currently a Ph.D. student at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences as well as University of Chinese Academy of Sciences. His research interests include computer vision, machine learning, and robotics.

Fei-Yue Wang received his Ph.D. in computer and systems engineering from Rensselaer Polytechnic Institute, Troy, New York, in 1990. He joined the University of Arizona in 1990 and became a Professor and Director of the Robotics and Automation Lab (RAL) and Program in Advanced Research for Complex Systems (PARCS). In 1999, he founded the Intelligent Control and Systems Engineering Center at the Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China, under the support of the Outstanding Overseas Chinese Talents Program from the State Planning Council and the “100 Talent Program” from CAS, and in 2002 was appointed Director of the Key Lab of Complex Systems and Intelligence Science, CAS. In 2011, he became the State Specially Appointed Expert and the Director of the State Key Laboratory for Management and Control of Complex Systems. Dr. Wang's current research focuses on methods and applications for parallel systems, social computing, and knowledge automation. He was the Founding Editor-in-Chief of the International Journal of Intelligent Control and Systems (1995–2000), Founding EiC of IEEE ITS Magazine (2006–2007), EiC of IEEE Intelligent Systems (2009–2012), and EiC of IEEE Transactions on ITS (2009–2016). Currently he is EiC of China's Journal of Command and Control. Since 1997, he has served as General or Program Chair of more than 20 IEEE, INFORMS, ACM, and ASME conferences. He was the President of the IEEE ITS Society (2005–2007), the Chinese Association for Science and Technology (CAST, USA) in 2005, and the American Zhu Kezhen Education Foundation (2007–2008), and the Vice President of the ACM China Council (2010–2011). Since 2008, he has been the Vice President and Secretary General of the Chinese Association of Automation. Dr. Wang is an elected Fellow of IEEE, INCOSE, IFAC, ASME, and AAAS. In 2007, he received the 2nd Class National Prize in Natural Sciences of China and was awarded Outstanding Scientist by ACM for his work in intelligent control and social computing. He received the IEEE ITS Outstanding Application and Research Awards in 2009 and 2011, and the IEEE SMC Norbert Wiener Award in 2014.
