People counting by learning their appearance in a multi-view camera environment
Introduction
Localization and counting of people in image sequences is a video surveillance task with many useful applications. People counting serves several purposes, such as surveying passenger load in urban transportation (buses, ferries, railways, airports, etc.) to facilitate service planning, or collecting detailed counts of visitors and customers in public facilities (museums, libraries, etc.) and commercial areas (trade centers, supermarkets, etc.) to optimize resource management.
People counting in image sequences is certainly a complex process. Objects in the scene interact, producing overlaps that lead to the temporary loss of some of them, the so-called “occlusions”. If a person is visually isolated in the images, localizing and tracking him is fairly easy, because the information usually exploited for identification (e.g., color distribution, shape) remains mostly unchanged as the person moves. As the density of objects in the scene increases, occlusions also intensify; consequently, a region of contiguous foreground pixels (blob) may belong not to a single person but to several. Under such conditions of limited visibility and crowding, it is extremely difficult to correctly detect and track all the persons based only on images coming from a single camera (single-view). Using several views of the same scene (multi-view) makes it possible to recover information that may have been hidden in a specific view.
Several people counting approaches have been proposed in the past twenty years. They have been classified into detection-based methods, which determine the number of people, as well as their locations, by identifying individuals in the scene, and map-based methods, which exploit the relationship between the number of people and features extracted from the images (Hou and Pang, 2011). More recently, they have been subdivided into individual-centric methods, based on detecting and tracking individuals and counting the resulting tracks, and crowd-centric methods, based on the analysis of global low-level features extracted from crowd imagery to produce accurate counts (Chan and Vasconcelos, 2012).
Most of the literature on people counting relies on a single-view approach, due to the wide availability of single surveillance cameras and the relative ease of implementation, since such methods require neither calibrated cameras nor specific knowledge of the scene geometry. Examples include Davies et al., 1995, Wren et al., 1996, Zhao and Nevatia, 2003, Rabaud and Belongie, 2006, Kilambi et al., 2008, Albiol et al., 2009, Chan et al., 2009, Sharma et al., 2009, Choudri et al., 2009, Conte et al., 2010, Patzold et al., 2010, Zeng and Ma, 2010, Subburaman et al., 2012. Neural networks can also be exploited for people counting and crowd density estimation (Maddalena and Petrosino, 2012a), as, for instance, in Marana et al., 1998, Cho et al., 1999, Kong et al., 2006, Hou and Pang, 2011. Generally, single-view approaches have difficulty analyzing crowded scenes, where severe occlusions are likely, and some of them are not robust to illumination changes or carry a heavy computational load.
Several research directions have been pursued to handle occlusions. For example, adopting cameras looking straight down from the ceiling greatly helps reduce occlusions (Albiol et al., 2001, Kim et al., 2002, Velipasalar et al., 2006, Englebienne and Krose, 2010). However, the application is limited to indoor environments; moreover, either the acquired sequences still present occlusions in all but the central portion of the image, or the cameras have a limited field of view (Harville, 2004). Stereo cameras have also been considered, exploiting depth information to project moving people onto the ground plane, producing an occupancy map and reducing occlusions (Beymer, 2000, Harville, 2004, Qiuyu et al., 2010, Yahiaoui et al., 2010, van Oosterhout et al., 2011). The use of multiple cameras proves fundamental for localizing and counting people in crowded environments. Multi-view approaches aim at reducing image regions hidden by occlusions, while at the same time allowing reconstruction of the target’s 3D position from the abundant information provided by different observation points. Examples include Kim and Davis, 2006, Alahi et al., 2009, Krahnstoever et al., 2009, Stalder et al., 2009, Ge and Collins, 2010, Ma et al., 2012. However, multi-view approaches usually require calibrated and synchronized cameras and have a complex structure, resulting in computationally demanding algorithms.
In this work we propose an individual-centric, brain-inspired neural system for robustly counting people under occlusion using multiple cameras with overlapping fields of view. Compared to other occlusion-handling methods, our multi-view approach relies on learning motion templates over time, can adapt to detect both moving and stationary people, and turns out to be robust to gradual lighting variations, moving backgrounds, and cast shadows. The proposed approach is based on the idea of performing accurate moving object detection in each available view and suitably fusing this information to limit problems related to occlusions. To this end, we adopt the neural approach to moving object detection recently proposed in Maddalena and Petrosino (2013b), where the background model is built by learning, in a self-organizing manner, image sequence variations seen as trajectories of pixels in time. The neural model is here adapted to compute “foreground likelihood maps” from the different views, which are then merged in the multi-view setting. Information fusion is based on the Homographic Occupancy Constraint (Khan and Shah, 2006), which exploits the homography between each scene view and the scene ground plane to combine the visual information available from different viewpoints. This allows the reconstruction of people’s feet positions on the scene ground plane in a single “feet map” image, obtained through the homographies of the foreground likelihood maps. Subsequent tracking in the feet map images supports people counting by keeping track of people’s movements and identities over time. Finally, people counting is obtained by supervised classification, based on learning the counts from the feet maps.
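The fusion step just described can be sketched in code. The following is a minimal illustration of the homographic occupancy idea under stated assumptions, not the authors' implementation: each view's foreground likelihood map is splatted onto a common ground-plane grid through its view-to-ground homography, and the warped maps are multiplied so that only ground cells that are foreground in every view survive in the feet map. All function names, the grid resolution, and the max-splatting choice are our own.

```python
import numpy as np

def warp_to_ground(likelihood, H, grid_shape):
    """Map per-pixel foreground likelihoods of one view onto a
    ground-plane grid via the view-to-ground homography H (3x3)."""
    h, w = likelihood.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous coords
    gp = H @ pts
    gx = (gp[0] / gp[2]).round().astype(int)                  # ground-plane x
    gy = (gp[1] / gp[2]).round().astype(int)                  # ground-plane y
    grid = np.zeros(grid_shape)
    ok = (gx >= 0) & (gx < grid_shape[1]) & (gy >= 0) & (gy < grid_shape[0])
    # keep the strongest likelihood landing in each ground cell
    np.maximum.at(grid, (gy[ok], gx[ok]), likelihood.ravel()[ok])
    return grid

def feet_map(likelihoods, homographies, grid_shape):
    """Multiply the warped likelihoods of all views: a cell survives
    only if it is foreground in every view (occupancy constraint)."""
    fused = np.ones(grid_shape)
    for L, H in zip(likelihoods, homographies):
        fused *= warp_to_ground(L, H, grid_shape)
    return fused
```

The product rule is what makes the constraint selective: a phantom foreground region visible in only one view is multiplied by near-zero likelihoods from the other views and vanishes from the feet map.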
The paper is organized as follows. Section 2 describes the moving object detection approach, which yields “foreground likelihood” information for each view, based on neural modeling of motion templates. People localization, achieved by reconstructing people’s feet positions on the scene ground plane, is described in Section 3; tracking in the feet maps is described in Section 4, and people counting in Section 5. Experimental results and comparisons on different real datasets are reported in Section 6, and concluding remarks are provided in Section 7.
Section snippets
Neural modelling on motion templates
Foreground detection in each single scene view is the basic building block of our proposed people counting system, and its accuracy is crucial for the entire process. Therefore, we adopt here the self-organizing background model for image sequences presented in Maddalena and Petrosino (2013b), whose high accuracy and robustness to well-known moving object detection challenges have already been proven. Indeed, extensive experimental results on daytime, night-time, and thermal sequences made
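To make the self-organizing idea concrete, here is a drastically simplified, scalar-intensity sketch of such a background model. The actual model of Maddalena and Petrosino (2013b) maintains a neuronal map of weight vectors per pixel and also propagates updates to spatial neighbors; this sketch keeps only the core mechanism, with parameter values chosen arbitrarily for illustration: a pixel is background if some stored sample is close to the observation, and the best-matching sample is moved toward the observation.

```python
import numpy as np

def update_model(model, frame, eps=20.0, alpha=0.05):
    """One step of a simplified self-organizing background model.
    model: (K, H, W) intensity samples per pixel; frame: (H, W).
    Returns the foreground mask and the updated model."""
    dist = np.abs(model - frame)          # distance to each stored sample
    best = dist.min(axis=0)               # best-matching sample per pixel
    fg = best > eps                       # no close sample -> foreground
    idx = dist.argmin(axis=0)
    H, W = frame.shape
    rows, cols = np.mgrid[0:H, 0:W]
    upd = model.copy()
    sel = (idx, rows, cols)
    # move only the best-matching sample toward the observation, and
    # only where the pixel was classified as background (selective update)
    upd[sel] = np.where(fg, upd[sel], (1 - alpha) * upd[sel] + alpha * frame)
    return fg, upd
```

The selective update is the reason such models adapt to gradual lighting changes without absorbing true foreground into the background.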
People localization
In order to localize people in a way that is robust against occlusions and lack of visibility, we exploit and fuse the information available in a multi-view setting, assuming that at least one reference scene plane is visible in the scene views. This is a reasonable assumption in typical video surveillance installations, where the ground plane or a planar structure, like a building wall, is usually visible in the monitored area. Here, we propose to first train, frame by frame, different
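The snippet above is truncated, but it presumes that a homography between each view and the reference plane is available. Assuming a few image-to-ground point correspondences can be marked on the visible reference plane, such a homography can be estimated with the standard Direct Linear Transform (DLT); this is a generic textbook sketch (cf. Hartley and Zisserman's Multiple View Geometry), not taken from the paper, and the function name is ours.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate the 3x3 homography mapping src -> dst from N >= 4
    point pairs, using the standard DLT linear system A h = 0."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # the null vector of A (last right-singular vector) holds H up to scale
    _, _, Vt = np.linalg.svd(np.asarray(A, float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]
```

In practice one would normalize coordinates first and use a robust estimator (e.g., RANSAC, also cited in the references) to reject bad correspondences.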
Tracking
In the proposed system the tracking module is adopted to support people counting, by keeping track of people movements and of their identities along time, thus allowing the system to properly handle occasional misdetections in the feet maps. The choice of the tracking approach is driven by the fact that, contrary to the usual case (i.e., blobs to be tracked come from foreground moving people in a single-view scene), here blob locations of moving people come from the scene ground plane. Indeed,
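Although the paper's tracking module is not detailed in this snippet, the role described, keeping identities over time and tolerating occasional misdetections in the feet maps, can be illustrated with a minimal nearest-neighbour tracker over feet-map blob centroids. All names, the gating distance, and the miss tolerance below are illustrative assumptions, not the authors' design.

```python
import numpy as np

def associate(tracks, detections, gate=3.0, max_missed=5):
    """Greedily match feet-map blob centroids to existing tracks.
    tracks: dict id -> {'pos': (x, y), 'missed': int};
    detections: list of (x, y) centroids for the current frame.
    Unmatched detections start new tracks; a track survives up to
    max_missed consecutive misdetections before being dropped."""
    free = list(range(len(detections)))
    for tid, tr in list(tracks.items()):
        if free:
            d = [np.hypot(*np.subtract(detections[j], tr['pos'])) for j in free]
            k = int(np.argmin(d))
            if d[k] <= gate:                      # close enough: same identity
                tr['pos'] = detections[free.pop(k)]
                tr['missed'] = 0
                continue
        tr['missed'] += 1                         # coast through a misdetection
        if tr['missed'] > max_missed:
            del tracks[tid]
    for j in free:                                # leftover detections: new ids
        tracks[max(tracks, default=0) + 1] = {'pos': detections[j], 'missed': 0}
    return tracks
```

A production tracker would typically add a motion model (e.g., a Kalman filter, whose original paper appears in the references) to predict positions before association.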
People counting
As already pointed out, the appearance of blob locations in the feet map changes according to the scene dynamics, often abruptly from one frame to the next. Indeed, due to crowd density, there are cases where it is not possible to simply count moving blobs containing single persons. This is the case of groups of people that walk so close that, even in a multi-view setting, they are detected all together in a single blob of the feet map. In order to more accurately handle such blob density
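As a hedged illustration of learning counts from feet-map blobs (the paper's supervised classifier is described in Section 5.1, which this snippet does not include), a simple k-nearest-neighbour vote on blob area can map a blob containing a close-walking group to a per-blob count; the scene count is then the sum over blobs. The feature choice, k, and the training data here are all illustrative assumptions.

```python
import numpy as np

def predict_count(blob_areas, train_areas, train_counts, k=3):
    """Predict the number of people in each feet-map blob by k-NN on
    blob area, trained on blobs with known ground-truth counts, and
    return the total scene count."""
    train_areas = np.asarray(train_areas, float)
    train_counts = np.asarray(train_counts)
    total = 0
    for a in blob_areas:
        nearest = np.argsort(np.abs(train_areas - a))[:k]  # k closest areas
        votes = np.bincount(train_counts[nearest])         # majority count
        total += int(votes.argmax())
    return total
```

Richer blob features (perimeter, density, view count) would plug into the same scheme without changing its structure.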
Experimental results
In this section we present experimental results carried out in order to evaluate the accuracy of the proposed multi-view people counting system, as well as its ability to tackle occlusion and lack of visibility in crowded and cluttered scenes typical of complex environments. Comparisons are provided between results achieved by the approach based on supervised classification as described in Section 5.1 (in the following referred to as SC) and the approach based on blob density in the feet map as
Conclusions
In this paper we propose a multi-view camera system for people counting in video surveillance. The system relies on recent achievements in multi-view video analysis, integrating modules that turned out to be very efficient.
The 3D self-organizing neural network approach enables the integration of spatial and temporal templates into a common framework, combining simultaneously motion state and viewpoint changes, at the same time reducing the high dimensionality of the temporal templates. The
References (60)
- Amari, S., 1980. Topographic organisation of nerve fields. Bull. Math. Biol.
- Harville, M., 2004. Stereo person tracking with adaptive plan-view templates of height and occupancy statistics. Image Vision Comput.
- Kilambi, P., et al., 2008. Estimating pedestrian counts in groups. Comput. Vision Image Understanding.
- Marana, A.N., et al., 1998. Automatic estimation of crowd density using texture. Saf. Sci.
- Alahi, A., et al., 2009. Sparsity-driven people localization algorithm: evaluation in crowded scenes environments.
- Albiol, A., et al., 2001. Real-time high density people counter using morphological tools. IEEE Trans. Intell. Transp. Syst.
- Albiol, A., et al., 2009. Video analysis using corner motion statistics.
- Beymer, D., 2000. Person counting using stereo. In: Proc. Workshop on Human Motion, pp. ...
- Chan, A.B., et al., 2009. Analysis of crowded scenes using holistic properties.
- Chan, A.B., Vasconcelos, N., 2012. Counting people with low-level features and Bayesian regression. IEEE Trans. Image Process.
- Cho, S.-Y., et al., 1999. A neural-based crowd estimation by hybrid global learning algorithm. IEEE Trans. Syst. Man Cybern. Part B: Cybern.
- Choudri, S., et al., 2009. Robust background model for pixel based people counting using a single uncalibrated camera.
- Conte, D., et al., 2010. A method for counting moving people in video surveillance videos. EURASIP J. Adv. Signal Process.
- Cottrell, M., Fort, J.-C., 1986. A stochastic model of retinotopy: a self organizing process. Biol. Cybern.
- Davies, A.C., et al., 1995. Crowd monitoring using image processing. Electron. Commun. Eng. J.
- Duda, R.O., Hart, P.E., Stork, D.G. Pattern Classification.
- Ferryman, J., Shahrokni, A., 2009. An overview of the PETS 2009 challenge.
- Fischler, M.A., Bolles, R.C., 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM.
- Fleuret, F., et al., 2008. Multi-camera people tracking with a probabilistic occupancy map. IEEE Trans. Pattern Anal. Mach. Intell.
- Gaze, R.M. The Formation of Nerve Connections.
- Ge, W., Collins, R.T., 2010. Crowd detection with a multiview sampler.
- Hartley, R., Zisserman, A. Multiple View Geometry in Computer Vision.
- Hebb, D.O., 1949. The Organization of Behavior.
- Hou, Y.-L., Pang, G.K.H., 2011. People counting and human detection in a challenging situation. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans.
- Kalman, R.E., 1960. A new approach to linear filtering and prediction problems. Trans. ASME–J. Basic Eng.