
Pattern Recognition Letters

Volume 36, 15 January 2014, Pages 125-134

People counting by learning their appearance in a multi-view camera environment

https://doi.org/10.1016/j.patrec.2013.10.006

Abstract

We present a people counting system that, based on the information gathered by multiple cameras, is able to tackle the occlusions and lack of visibility that are typical of crowded and cluttered scenes. In our method, evidence of the foreground likelihood in each available view is obtained through a bio-inspired mechanism of self-organizing background subtraction, which is robust against well-known foreground detection challenges and is able to detect both moving and stationary foreground objects. This information is gathered into a synergistic framework that exploits the homography between each scene view and the scene ground plane, making it possible to reconstruct people's feet positions in a single "feet map" image. Finally, people counting is obtained by k-NN classification that learns count estimates from the feet maps, supported by a tracking mechanism that follows people's movements and identities over time and tolerates occasional misdetections. Experimental results with detailed qualitative and quantitative analysis and comparisons with state-of-the-art methods are provided on publicly available benchmark datasets with different crowd densities and environmental conditions.

Introduction

Localization and counting of people in image sequences is a video surveillance task with many useful applications. Indeed, people counting can serve several aims, such as surveying passenger load in urban transportation (buses, ferries, railways, airports, etc.) to facilitate service planning, or collecting detailed counts of visitors and customers in public facilities (museums, libraries, etc.) and commercial areas (trade centers, supermarkets, etc.) to optimize resource management.

People counting in image sequences is, however, a complex process. Objects in the scene interact, giving rise to overlaps that lead to the temporary loss of some of them, the so-called "occlusions". If a person is visually isolated in the images, localization and visual tracking are relatively easy, because the information usually exploited for identification (e.g., color distribution, shape) remains largely unchanged as the person moves. As the density of objects in the scene increases, occlusions also intensify; consequently, a region of contiguous foreground pixels (blob) may belong not to a single person but to several persons. Under such conditions of limited visibility and crowding, it is extremely difficult to correctly detect and track all persons based only on images coming from a single camera (single-view). Using several views of the same scene (multi-view) makes it possible to recover information that may be hidden in any one view.

Several people counting approaches have been proposed in the past twenty years. They have been classified into detection-based methods, which determine the number of people, as well as their locations, by identifying individuals in the scene, and map-based methods, which exploit the relationship between the number of people and some features extracted from the images (Hou and Pang, 2011). More recently, they have been subdivided into individual-centric methods, based on detecting and tracking individuals and counting the resulting tracks, and crowd-centric methods, based on the analysis of global low-level features extracted from crowd imagery to produce accurate counts (Chan and Vasconcelos, 2012).

Most of the people counting literature relies on a single-view approach, due to the wide availability of single surveillance cameras and to the relative ease of implementation, since such methods require neither calibrated cameras nor specific knowledge of the scene geometry. Examples include Davies et al., 1995, Wren et al., 1996, Zhao and Nevatia, 2003, Rabaud and Belongie, 2006, Kilambi et al., 2008, Albiol et al., 2009, Chan et al., 2009, Sharma et al., 2009, Choudri et al., 2009, Conte et al., 2010, Patzold et al., 2010, Zeng and Ma, 2010, Subburaman et al., 2012. Neural networks can also be exploited for people counting and crowd density estimation (Maddalena and Petrosino, 2012a), as, for instance, in Marana et al., 1998, Cho et al., 1999, Kong et al., 2006, Hou and Pang, 2011. Generally, single-view approaches struggle with the analysis of crowded scenes, where severe occlusions are very likely, and some of them are not robust to illumination changes or incur a heavy computational load.

Several research directions have been taken in order to handle occlusions. For example, the adoption of cameras looking straight down from the ceiling greatly helps to reduce occlusions (Albiol et al., 2001, Kim et al., 2002, Velipasalar et al., 2006, Englebienne and Krose, 2010). However, this application is limited to indoor environments; moreover, either the acquired sequences still present occlusions in all but the central portion of the image, or the cameras have a limited field of view (Harville, 2004). Stereo cameras have also been considered, exploiting depth information to project moving people onto the ground plane, producing an occupancy map and reducing occlusions (Beymer, 2000, Harville, 2004, Qiuyu et al., 2010, Yahiaoui et al., 2010, van Oosterhout et al., 2011). The use of multiple cameras proves fundamental for localizing and counting people in crowded environments. Multi-view approaches aim at reducing the image regions hidden by occlusions, while at the same time making it possible to reconstruct the 3D position of targets from the abundant information provided by different observation points. Examples include Kim and Davis, 2006, Alahi et al., 2009, Krahnstoever et al., 2009, Stalder et al., 2009, Ge and Collins, 2010, Ma et al., 2012. However, multi-view approaches usually require calibrated and synchronized cameras and have a complex structure, resulting in computationally demanding algorithms.

In this work we propose an individual-centric system, characterized by a biologically inspired neural approach, for robustly counting the number of people under occlusion through multiple cameras with overlapping fields of view. Compared to other occlusion handling methods, our multi-view approach relies on learning motion templates over time, can adapt to detect both moving and stationary people, and is robust to gradual lighting variations, moving backgrounds, and cast shadows. The proposed approach is based on the idea of performing accurate moving object detection in each available view and suitably fusing this information to limit problems related to occlusions. To this end, the neural approach to moving object detection recently proposed in Maddalena and Petrosino (2013b) is adopted, where the background model is built by learning, in a self-organizing manner, image sequence variations seen as trajectories of pixels in time. The neural model is here adapted to compute "foreground likelihood maps" from different views that are effectively merged in the multi-view setting. Information fusion is based on the Homographic Occupancy Constraint (Khan and Shah, 2006), which exploits the homography between each scene view and the scene ground plane to combine the visual information available from different viewpoints. This makes it possible to reconstruct people's feet positions on the scene ground plane in a single "feet map" image, obtained through the homographies of the foreground likelihood maps. Subsequent tracking in the feet map images is adopted to support people counting, by keeping track of people's movements and identities over time. Finally, people counting is obtained by supervised classification, based on learning the counts from the feet maps.
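As a concrete illustration of the fusion step, the following Python sketch (not the authors' code; the names likelihood_maps and ground_homographies are assumptions) warps each view's foreground likelihood map onto the reference ground plane with its homography and combines the warped maps multiplicatively, so that only locations supported by all views keep high values, which is the essence of the Homographic Occupancy Constraint used to build the feet map.

    # Minimal sketch of multi-view fusion on the ground plane (assumed interfaces).
    import cv2
    import numpy as np

    def fuse_feet_map(likelihood_maps, ground_homographies, map_size):
        """likelihood_maps: list of HxW float32 arrays in [0, 1], one per view.
        ground_homographies: list of 3x3 homographies mapping each image plane
        to the reference ground plane. map_size: (width, height) of the feet map."""
        feet_map = np.ones((map_size[1], map_size[0]), dtype=np.float32)
        for lik, H in zip(likelihood_maps, ground_homographies):
            warped = cv2.warpPerspective(lik, H, map_size)  # project onto ground plane
            feet_map *= warped                              # keep evidence common to all views
        return feet_map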

The paper is organized as follows. Section 2 describes the moving object detection approach, based on neural modeling of motion templates, which provides "foreground likelihood" information for each view. People localization, achieved by reconstructing people's feet positions on the scene ground plane, is described in Section 3, while tracking in the feet maps is described in Section 4, and people counting is described in Section 5. Experimental results and comparisons on different real datasets are reported in Section 6, while concluding remarks are provided in Section 7.

Section snippets

Neural modelling on motion templates

Foreground detection in each single scene view is the basic building block of our proposed people counting system and its accuracy is crucial for the entire process. Therefore, we adopt here the self-organizing background model for image sequences presented in Maddalena and Petrosino (2013b), whose high accuracy and robustness to well-known moving object detection challenges have already been proven. Indeed, extensive experimental results on daytime, night-time, and thermal sequences made
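Although the full self-organizing model of Maddalena and Petrosino (2013b) is considerably richer, a much-simplified grayscale sketch can help fix the idea of a per-pixel model that is selectively updated with a learning rate and yields a foreground likelihood; the number of samples, the threshold eps, and the learning rate alpha below are illustrative assumptions, not the paper's values.

    # Simplified self-organizing background model sketch (assumed parameters).
    import numpy as np

    class SelfOrganizingBackground:
        def __init__(self, first_frame, n_models=3, eps=20.0, alpha=0.05):
            f = first_frame.astype(np.float32)
            self.models = np.stack([f] * n_models, axis=0)  # (n_models, H, W) samples per pixel
            self.eps, self.alpha = eps, alpha

        def step(self, frame):
            f = frame.astype(np.float32)
            dist = np.abs(self.models - f)              # distance to each stored sample
            best = dist.min(axis=0)                     # best-matching distance per pixel
            idx = dist.argmin(axis=0)                   # index of the best-matching sample
            # foreground likelihood grows with the distance to the closest sample
            likelihood = np.clip(best / (2.0 * self.eps), 0.0, 1.0)
            # selectively update only the best-matching sample where the pixel looks like background
            bg = best < self.eps
            for m in range(self.models.shape[0]):
                sel = bg & (idx == m)
                self.models[m][sel] += self.alpha * (f[sel] - self.models[m][sel])
            return likelihood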

People localization

In order to localize people in a way that is robust against occlusions and lack of visibility, we exploit and fuse the information available in a multi-view setting, assuming that at least one reference scene plane is visible in the scene views. This is a reasonable assumption in typical video surveillance installations, where the ground plane or a planar structure, like a building wall, is usually visible in the monitored area. Here, we propose to first train, frame by frame, different
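The localization step presumes that, for each view, a homography to the reference ground plane is available. A hedged sketch of how such a homography could be estimated from a few manually marked ground-plane correspondences is given below; the point coordinates are placeholders, and RANSAC (Fischler and Bolles, 1981) is used to tolerate mislabelled points.

    # Sketch of per-view image-to-ground-plane homography estimation (placeholder points).
    import cv2
    import numpy as np

    # image coordinates (pixels) of landmarks lying on the ground plane in one view
    img_pts = np.array([[102, 540], [618, 512], [590, 300], [150, 310]], dtype=np.float32)
    # the same landmarks expressed in ground-plane ("feet map") coordinates
    gnd_pts = np.array([[0, 0], [500, 0], [500, 300], [0, 300]], dtype=np.float32)

    H, inlier_mask = cv2.findHomography(img_pts, gnd_pts, cv2.RANSAC, 3.0)
    # H maps image points onto the reference ground plane and can be passed to
    # cv2.warpPerspective to project a foreground-likelihood map onto the feet map.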

Tracking

In the proposed system the tracking module is adopted to support people counting, by keeping track of people's movements and identities over time, thus allowing the system to properly handle occasional misdetections in the feet maps. The choice of the tracking approach is driven by the fact that, contrary to the usual case (i.e., blobs to be tracked come from foreground moving people in a single-view scene), here blob locations of moving people come from the scene ground plane. Indeed,
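A minimal sketch of the kind of identity-preserving association this implies (not the paper's actual tracker; the gating distance and miss tolerance are assumptions) is shown below: feet-map blob centroids are matched frame to frame by greedy nearest-neighbour search, and a per-track miss counter tolerates occasional misdetections before an identity is dropped.

    # Greedy nearest-neighbour tracking of feet-map blob centroids (assumed parameters).
    import numpy as np

    class FeetMapTracker:
        def __init__(self, gate=30.0, max_missed=5):
            self.gate, self.max_missed = gate, max_missed
            self.tracks = {}      # track id -> last centroid (x, y)
            self.missed = {}      # track id -> consecutive misses
            self._next_id = 0

        def update(self, centroids):
            centroids = [np.asarray(c, dtype=float) for c in centroids]
            unmatched = list(range(len(centroids)))
            for tid in list(self.tracks):
                if not unmatched:
                    self.missed[tid] += 1
                else:
                    d = [np.linalg.norm(self.tracks[tid] - centroids[j]) for j in unmatched]
                    j = int(np.argmin(d))
                    if d[j] < self.gate:                        # matched: refresh position
                        self.tracks[tid] = centroids[unmatched.pop(j)]
                        self.missed[tid] = 0
                    else:                                       # missed in this frame
                        self.missed[tid] += 1
                if self.missed[tid] > self.max_missed:          # drop stale identity
                    del self.tracks[tid], self.missed[tid]
            for j in unmatched:                                 # start new identities
                self.tracks[self._next_id] = centroids[j]
                self.missed[self._next_id] = 0
                self._next_id += 1
            return self.tracks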

People counting

As already pointed out, the appearance of blob locations in the feet map changes according to the scene dynamics, often abruptly from one frame to the next. Indeed, due to crowd density, there are cases where it is not possible to simply count moving blobs containing single persons. This is the case of groups of people walking so close together that, even in a multi-view setting, they are detected all together in a single blob of the feet map. In order to more accurately handle such blob density
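To illustrate the count-by-classification idea, the following hedged sketch feeds simple geometric features of each feet-map blob (area and bounding-box size, which are assumptions rather than the paper's descriptors) to a k-NN classifier trained on blobs with known person counts; the training values shown are toy placeholders.

    # Sketch of k-NN count estimation from feet-map blob features (toy training data).
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def blob_features(blob_mask):
        """blob_mask: boolean array of a single connected component in the feet map."""
        ys, xs = np.nonzero(blob_mask)
        area = float(blob_mask.sum())
        width = float(xs.max() - xs.min() + 1)
        height = float(ys.max() - ys.min() + 1)
        return [area, width, height]

    # training: features of labelled blobs and the number of persons in each (placeholders)
    X_train = [[120, 14, 12], [480, 30, 22], [950, 55, 26]]
    y_train = [1, 2, 4]
    knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

    def count_people(blob_masks):
        """Classify each blob and sum the predicted per-blob counts."""
        if not blob_masks:
            return 0
        feats = [blob_features(b) for b in blob_masks]
        return int(np.sum(knn.predict(feats)))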

Experimental results

In this section we present experimental results carried out to evaluate the accuracy of the proposed multi-view people counting system, as well as its ability to tackle occlusions and lack of visibility in the crowded and cluttered scenes typical of complex environments. Comparisons are provided between results achieved by the approach based on supervised classification as described in Section 5.1 (in the following referred to as SC) and the approach based on blob density in the feet map as

Conclusions

In this paper we propose a multi-view camera system for people counting in video surveillance. The system relies on recent achievements in multi-view video analysis, integrating modules that have proven to be very efficient.

The 3D self-organizing neural network approach enables the integration of spatial and temporal templates into a common framework, simultaneously combining motion state and viewpoint changes while reducing the high dimensionality of the temporal templates. The

References (60)

  • S.Y. Cho et al., A neural-based crowd estimation by hybrid global learning algorithm, IEEE Trans. Syst. Man Cybern. Part B: Cybern. (1999)
  • S. Choudri et al., Robust background model for pixel based people counting using a single uncalibrated camera
  • D. Conte et al., A method for counting moving people in video surveillance videos, EURASIP J. Adv. Signal Process. (2010)
  • M. Cottrell et al., A stochastic model of retinotopy: a self organizing process, Biol. Cybern. (1986)
  • P. Cunningham, S.J. Delany, K-Nearest Neighbour Classifiers, Technical Report UCD-CSI-2007-4, University College... (2007)
  • A. Davies et al., Crowd monitoring using image processing, Electron. Commun. Eng. J. (1995)
  • R.O. Duda et al., Pattern Classification (2000)
  • A. Ellis, J. Ferryman, Pets2010 and pets2009 evaluation of results using individual ground truthed single... (2010)
  • G. Englebienne, B. Krose, Fast bayesian people detection, in: The 22nd Benelux Conference on Artificial... (2010)
  • J. Ferryman et al., An overview of the PETS 2009 challenge
  • M.A. Fischler et al., Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM (1981)
  • R. Fisher, Change detection in color images... (1999)
  • F. Fleuret et al., Multi-camera people tracking with a probabilistic occupancy map, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
  • R.M. Gaze, The Formation of Nerve Connections (1970)
  • W. Ge et al., Crowd detection with a multiview sampler
  • B. Gloyer, H.K. Aghajan, K.Y. Siu, T. Kailath, Video-based freeway-monitoring system using recursive vehicle... (1995)
  • R.I. Hartley et al., Multiple View Geometry in Computer Vision (2003)
  • D. Hebb, The Organization of Behavior (1949)
  • Y.L. Hou et al., People counting and human detection in a challenging situation, IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans (2011)
  • R.E. Kalman, A new approach to linear filtering and prediction problems, Trans. ASME–J. Basic Eng. (1960)