Elsevier

Knowledge-Based Systems

Volume 123, 1 May 2017, Pages 102-115

Passenger flow estimation based on convolutional neural network in public transportation system

https://doi.org/10.1016/j.knosys.2017.02.016

Abstract

Automatic passenger flow estimation is very useful in public transportation systems: it can improve the efficiency of public transportation service by optimizing route planning and traffic scheduling. However, the task faces many challenges in this setting, such as low resolution, background clutter, and variation of illumination, pose and scale. In this paper we propose a passenger counting system based on a convolutional neural network (CNN) and the spatio-temporal context (STC) model, where the CNN model detects the passengers and the STC model tracks the moving head of each passenger. Unlike traditional hand-engineered representation methods, our method uses the CNN to automatically learn the relevant features of passengers. Meanwhile, target pre-location is performed by combining the mixture of Gaussian (MoG) model with background subtraction, which greatly reduces the subsequent detection time. To address the tracking drift problem, inspired by the movement of ants in nature, we exploit the trajectory information to build a biologically inspired pheromone map and a 3D peak confidence map. The number of passengers is then obtained by counting the regions of interest (ROI). Experimental results on an actual public bus transportation dataset show that this method outperforms some existing methods.

Introduction

Automatic passenger flow estimation is very useful for traffic management and for detecting overcrowding in public transportation. Accurate passenger flow information can improve the efficiency of public transportation service by optimizing route planning and traffic scheduling. It can also help prevent severe traffic accidents caused by overloading. Traditionally, automatic passenger counting is done with contact-type counters, optical sensors, or vision-based systems. Contact-type counters can be applied in many public places with entrances, such as subways and bus stations. However, they can cause congestion when the passenger flow is high because they count passengers one by one in sequence. Optical sensors, such as radiation beam systems, do not block the doorways but suffer from undercounting. In recent years, automatic counting of passing passengers based on digital image processing has attracted more attention, since it can reduce cost and requires no user intervention.

In the past few decades, several methods for people counting have been proposed. Generally speaking, these methods can be divided into three approaches. The first approach is trajectory clustering based counting. Methods of this kind count passengers based on the hypothesis that trajectories belonging to the same human body are more similar than trajectories belonging to different individuals. In [1], the trajectories of visual features were clustered, and the number of passengers was estimated by the number of clusters. Based on Dirichlet process mixture models (DPMMs), Topkaya et al. [2] employed a clustering scheme to estimate the number of passengers, fusing a set of spatial, color and temporal features for each detection. However, the performance of these methods degrades greatly under illumination variation and in low-resolution transportation scenes, since these conditions usually reduce the stability of the algorithms. The second approach is regression based counting. Methods of this type estimate the number of passengers by learning a regression function between features extracted from the input images and the number of people in the scene. Many regression functions, such as Bayesian regression [3], neural networks [4], [5] and support vector regression (SVR) [6], [7], have been used to estimate the crowd density. However, these methods provide only the number of people and fail to locate the individuals. The location of individuals is sometimes important for video surveillance security systems, especially in public transportation scenarios. The last approach is detection based counting, in which a detector is carefully designed to detect people in the input images. Different types of detection methods, such as body detection [8], [9], [10], head-shoulder detection [11], [12], [13], skeleton graph [14], [15] and head template matching [16], [17], have been proposed. Although methods of this kind offer promising results, a carefully hand-engineered detector is not robust in low-resolution scenes and fails in some real surveillance applications. Besides, the stereo camera based method in [16] needs multiple cameras and an extra calibration procedure, which can cause considerable inconvenience in deployment.
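To make the regression based approach described above concrete, the following is a minimal sketch, not taken from any of the cited works. It assumes a single hypothetical feature per frame (foreground pixel area) and made-up training pairs, and fits a straight line by ordinary least squares. Note that the output is only a number: nothing in it says where each person is, which is the limitation noted above.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical training data (assumption, for illustration only):
# foreground pixel area of a frame and the manually labelled people count.
areas = [0, 400, 800, 1200, 1600]
counts = [0, 1, 2, 3, 4]
a, b = fit_line(areas, counts)


def predict_count(area):
    """Estimate the number of people from the foreground area of one frame."""
    return round(a * area + b)


estimate = predict_count(1250)
```

Real regression based systems use richer features and regressors such as those cited in [3] to [7]; the single-feature line here only illustrates the input-to-count mapping.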

In recent years, although a few advances have been proposed [18], [19] in this field, challenges remain, especially in complicated low-resolution scenes. The resolution of most videos in public transportation surveillance systems is relatively low, and the task is further complicated by illumination, occlusion, and the scale and pose variation of passengers against cluttered backgrounds. Some recent methods, such as the regression based method [19], are not suitable for our application, in which the number of passengers is time-varying. Typical examples from our application are shown in Fig. 1.

From Fig. 1, we can observe that, in addition to the occlusion caused by the movement of passengers in Fig. 1(a) and (b), the images in Fig. 1(a)–(h) have a low resolution of 320 × 320, and the scales of the heads in Fig. 1(c) and (d) change gradually as the passengers get on or off. Fig. 1(e) and (f) show the background clutter caused by the vibration of buses, which makes the subsequent counting difficult. Meanwhile, many issues arise from the variation of illumination, as shown in Fig. 1(g) and (h). To the best of our knowledge, few papers discuss the problem of passenger counting in complicated low-resolution public transportation scenarios.

Recently, convolutional neural networks (CNN) have shown outstanding performance on image classification tasks [20], [21], [22] and object detection tasks [23], [24], [25], [26]. Unlike traditional hand-engineered representations, they have deep architectures [27] and can learn powerful object representations. Motivated by this, we adopt a CNN to handle the above-mentioned challenges, since it can extract target features automatically and is robust to variations of illumination, pose and scale.

In this paper, we propose a passenger counting system to address the counting problem by combining the CNN detection model and the spatio-temporal context (STC) model [28], which offers the following advantages. Firstly, by combining the mixture of Gaussian (MoG) model [29] and background subtraction, target pre-location will greatly reduce the detection time of CNN. The CNN model is used to automatically learn the related features of passengers and then detect the moving passengers. Secondly, inspired by the movement of ants in nature, a biologically inspired pheromone map and a 3D peak confidence map are proposed to address the tracking drift.
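The pre-location idea above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it keeps a single Gaussian per pixel rather than a full mixture of Gaussians, frames are plain 2D lists of grayscale values, and the class name, the parameters `lr`, `k` and `init_var`, and the single bounding box are all hypothetical choices. The point is that only the returned foreground region would need to be passed to the CNN detector, instead of scanning the whole frame.

```python
class GaussianBackground:
    """Per-pixel single-Gaussian background model (simplified stand-in for MoG)."""

    def __init__(self, first_frame, lr=0.05, k=2.5, init_var=25.0):
        self.lr = lr  # learning rate for the running statistics (assumed value)
        self.k = k    # foreground threshold in standard deviations (assumed value)
        self.mean = [[float(v) for v in row] for row in first_frame]
        self.var = [[init_var for _ in row] for row in first_frame]

    def apply(self, frame):
        """Return a binary foreground mask and update the background statistics."""
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, v in enumerate(row):
                diff = v - self.mean[y][x]
                is_fg = diff * diff > (self.k ** 2) * self.var[y][x]
                mask_row.append(1 if is_fg else 0)
                if not is_fg:
                    # Only pixels judged to be background update the model.
                    self.mean[y][x] += self.lr * diff
                    self.var[y][x] += self.lr * (diff * diff - self.var[y][x])
            mask.append(mask_row)
        return mask


def bounding_box(mask):
    """Inclusive (x0, y0, x1, y1) box around all foreground pixels, or None."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        return None
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return min(xs), min(ys), max(xs), max(ys)


# Toy example: a uniform background and one bright moving blob.
background = [[10] * 6 for _ in range(5)]
model = GaussianBackground(background)
frame = [row[:] for row in background]
for y in (1, 2):
    for x in (2, 3, 4):
        frame[y][x] = 200
box = bounding_box(model.apply(frame))
empty = bounding_box(model.apply(background))
```

A practical system would separate several moving targets with connected-component labelling rather than a single box, and would typically rely on a tested MoG implementation from a vision library.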

The rest of the paper is organized as follows. Section 2 briefly describes the overview of our framework and introduces each module of the proposed method in detail. In Section 3, we evaluate our method with extensive experiments on an actual public transportation dataset. Finally, conclusions and future work are given in Section 4.

Section snippets

Overview of our framework

As shown in Fig. 2, the proposed passenger counting framework includes two stages: the off-line training stage and the online counting stage. In the off-line training stage, the CNN model is trained on both positive and negative samples from a new dataset constructed from real, complicated public transportation scenarios. The layers, weights and neuron numbers of the CNN model are obtained by the off-line training. After that, the online counting stage is used to count the passengers

Experimental results

Our dataset is collected from surveillance video of public bus transportation in China. Typical scenes of our application are shown in Fig. 1. We manually extract 34,322 positive head samples and collect 179,398 negative samples from real surveillance video in bus transportation. We will show our counting demo video with front door, rear door and bi-directional counting scenes in the supplemental materials.
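For the bi-directional counting scenes mentioned above, one common way to turn head trajectories into boarding and alighting counts is a virtual counting line. The sketch below is an assumption for illustration, not the counting rule used in the paper: tracks are hypothetical lists of (x, y) head centres in image coordinates, the line is horizontal at `line_y`, and motion towards larger y is arbitrarily treated as boarding.

```python
def count_crossings(tracks, line_y):
    """Count tracks that cross the horizontal line y = line_y in each direction."""
    boarding = 0
    alighting = 0
    for track in tracks:
        if len(track) < 2:
            continue  # a single point cannot cross the line
        start_y = track[0][1]
        end_y = track[-1][1]
        if start_y < line_y <= end_y:
            boarding += 1   # moved from above the line to on or below it
        elif end_y < line_y <= start_y:
            alighting += 1  # moved from on or below the line to above it
    return boarding, alighting


# Hypothetical head trajectories (assumption, for illustration only).
tracks = [
    [(100, 20), (102, 90), (105, 180)],   # crosses downward
    [(200, 220), (198, 150), (195, 60)],  # crosses upward
    [(150, 30), (151, 70), (149, 40)],    # never reaches the line
]
result = count_crossings(tracks, line_y=120)
```

Comparing only the first and last points of each track means that jitter around the line does not produce double counts, at the cost of requiring the track to be complete before it is counted.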

Conclusion

The proposed system uses a single inexpensive camera mounted overhead, which eliminates the need for calibration and creates a low-cost system. The presented passenger counting system combines the CNN detection model and the spatio-temporal context model to address the counting problem in scenes with low resolution and with variation of illumination, pose and scale. Unlike many sliding window based detection methods, our method adopts the MoG model to obtain the foreground, which can

Acknowledgement

We thank Hitachi (China) Research & Development Corporation for providing the foundation for this research. This work was supported in part by the National Natural Science Foundation of China under Grants No. 61671452, No. 61675036 and No. 61302054.

References (45)

  • G. Huang et al. Trends in extreme learning machines: a review. Neural Netw. (2015)
  • G. Antonini et al. Counting pedestrians in video sequences using trajectory clustering. IEEE Trans. Circuits Syst. Video Technol. (2006)
  • I.S. Topkaya et al. Counting people by clustering person detector outputs.
  • A.B. Chan et al. Counting people with low-level features and Bayesian regression. IEEE Trans. Image Process. (2012)
  • C. Zhang et al. Cross-scene crowd counting via deep convolutional neural networks.
  • D. Kong et al. Counting pedestrians in crowds using viewpoint invariant training.
  • H. Idrees et al. Multi-source multi-scale counting in extremely dense crowd images.
  • D. Conte et al. A method based on the indirect approach for counting people in crowded scenes.
  • W. Ouyang et al. Joint deep learning for pedestrian detection.
  • P. Buehler et al. Upper body detection and tracking in extended signing sequences. Int. J. Comput. Vision (2011)
  • N. Dalal et al. Histograms of oriented gradients for human detection.
  • C. Zeng et al. Robust head-shoulder detection by PCA-based multilevel HOG-LBP detector for people counting.
  • S. Wang et al. A new edge feature for head-shoulder detection.
  • L. Chen et al. Head-shoulder detection using joint HOG features for people counting and video surveillance in library.
  • D. Merad et al. Fast people counting using head detection from skeleton graph.
  • K. Aziz et al. Head detection based on skeleton graph method for counting people in crowded environments. J. Electron. Imaging (2016)
  • T. Van Oosterhout et al. Head detection in stereo data for people counting and segmentation. VISAPP (2011)
  • P. Dollar et al. Pedestrian detection: an evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. (2012)
  • C. Gao et al. People counting based on head detection combining Adaboost and CNN in crowded surveillance environment. Neurocomputing (2016)
  • Y. Zhang et al. Single-image crowd counting via multi-column convolutional neural network.
  • K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: visualising image classification models and...
  • Y. Jia et al. Caffe: convolutional architecture for fast feature embedding.