1 Introduction

Computer vision algorithms that automatically detect persons in a scene captured by a camera are the enabling technology for several important real-world applications. Notable examples in which person detection is a required preliminary step include counting the persons who cross a virtual line, gathering statistics on how long persons remain in specific areas (for instance in front of shop windows), and detecting overcrowding conditions [2, 6, 9, 14].

Unfortunately, accurate person detection is seriously hampered by a series of problems that arise in real contexts. Among the most important is occlusion, i.e. the situation in which a subject is not detected because he/she is partly or completely hidden by scene elements (typically other subjects) interposed between the camera and the subject. Intuitively, the probability of occlusion is directly related to the density of people in the area. To mitigate this phenomenon, several authors propose installing the camera in a zenithal position, which eliminates occlusions in the area immediately below the camera, with only a gradual increase as persons move away from the optical axis of the device. A further phenomenon that typically has a significant impact on the reliability of person detection methods in real contexts is the variability of the lighting conditions of the scene, due to light switching in indoor environments or to the slow variation of solar illumination during the day in outdoor environments [15], as well as the presence of shadows [12] and specular reflections [1]. To cope with these issues, in recent years several authors have proposed using the depth map provided by the Microsoft Kinect device, as it proves to be partially immune to lighting problems. Furthermore, the availability of depth information may ease the detection of persons, starting from the observation that in typical real-world applications the head is the part of the person closest to the camera when the latter is installed in a zenithal position. The adoption of top-view depth cameras also has an additional positive effect: it makes it easier to comply with the stringent privacy regulations in force in the vast majority of countries, which makes this solution preferable to those based on traditional optical cameras mounted in positions from which the faces of persons are acquired.

All the above observations have motivated several research groups in recent years to propose solutions based on top-view depth cameras. The recent literature on this topic can be divided into two main streams. On one side, there are papers proposing unsupervised approaches, which find persons by looking for their head, this being the body part closest to the camera. Specifically, in [17], Zhang et al. propose an unsupervised approach to locate persons in the scene; the method simulates water filling with the aim of finding the local minimum regions in the input depth map, which should correspond to the heads of people. Similarly, in [8], Galcík et al. propose a method that locates people in the scene by detecting maxima in the depth images, followed by region growing. The regions found are considered heads if they satisfy criteria related to size and roundness and if there is evidence that they lie above shoulder-like structures. Lin and Jhuang in [10] assume that the shapes of the pedestrians in the scene are similar to ellipses and that the projected area of the upper portion of a person is normally smaller than that of the lower one, so they compare and stack the areas of every layer from low to high, estimating the top portion of each object (head and shoulders). Nalepa et al. [11] determine local minima of pixel depth values and use a modified flood fill algorithm to append neighboring pixels to the minima found. Then, the method groups blobs representing various parts of a human body into a single connected component.

A second group of methods formulates person localization in the scene as a detection problem. Rauter in [13] introduces simplified local ternary patterns, a new feature descriptor used for human tracking based on the head and shoulder part of the human body, and then uses a support vector machine (SVM) for the classification stage. Vera et al. in [16] also propose an SVM classifier to detect people, although in this case the description is based on histograms of oriented gradients. In Zhu and Wong [18], the 3D map of a person is described by a data structure called the head and shoulder profile (HASP), based on Haar wavelet features; the classifier is based on the AdaBoost algorithm [7]. Unfortunately, most methods have been tested on private datasets or, in a few cases, on very small public datasets, which do not allow a thorough assessment of performance in realistic conditions or a detailed insight into the pros and cons of each method.

Contribution. This paper addresses the latter issue by benchmarking two alternative person detection methods selected from the recent literature, both of which use depth-based vision systems mounted in a top-view position. The methods have been selected to be representative of the two categories of approaches described above, i.e. unsupervised and supervised ones. To the best of our knowledge, no paper in the literature provides a similar contribution. The benchmarking is carried out on a common, large and publicly available dataset, with the aim of providing the reader with a detailed view of the performance of each method when subjected to the two main sources of error, namely the lighting conditions and the people density.

The paper is organized as follows: in Sect. 2 we provide basic information on the methods considered for the benchmarking; then, in Sects. 3 and 4 we describe, respectively, the dataset adopted for the experiments and the results achieved by the two methods, focusing on their behavior in two different lighting scenarios and under varying person densities. Finally, in Sect. 5 we draw conclusions and outline future directions of our research.

2 Methods Considered for the Benchmarking

In this section we briefly describe the two methods [16, 17] considered for the benchmarking; for details, the interested reader may consult the original papers. Hereinafter, we refer to the method proposed by Zhang et al. in [17] as WATERFILLING, and to the method by Vera et al. in [16] as HOG-SVM. The two methods have been selected as representative of two complementary approaches: WATERFILLING is an unsupervised method devised to locate the heads of persons by searching for local minimum regions within the depth map, while HOG-SVM adopts a supervised approach based on a support vector machine classifier.

2.1 WATERFILLING

The method starts from the idea that the head is the part of the human body closest to the top-view sensor, so the authors formulate person detection as the problem of searching for local minimum regions in the depth image; such regions should correspond to the heads of persons. Formally, the localization of a head in the depth image is done by finding a region A and its neighborhood N satisfying the following constraint:

$$\begin{aligned} E_A(f(x,y)) + \eta \le E_{N \setminus A}(f(x,y)), A \in N \end{aligned}$$
(1)

The operator \(E(\cdot )\) pools the depth information in a region into a real value that reflects the total depth information in that region. \(\eta \) is a predefined threshold that ensures that the depth in A is lower than the depth in \(N \setminus A\) by a margin. The idea is that A and N represent the head and the shoulders, respectively. In order to find the local regions A, the authors employ a methodology based on the water filling process which, starting from a representation of the depth map as a land with humps and hollows, simulates raindrops falling on it. After the rainfall simulation, the hollow regions have gathered the raindrops, and the hollow regions that are sufficiently large and deep are considered heads. For our tests, the authors of the WATERFILLING method provided us with the original code, implemented in C++ and based on the OpenCV library.
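As an illustration of the constraint in Eq. (1) only, the following minimal Python sketch checks whether a candidate region A stands out from the rest of its neighborhood N by the margin \(\eta \). It is not the water filling simulation of [17]: the choice of the mean as pooling operator \(E(\cdot )\), the function name `is_head_candidate`, the representation of regions as sets of (row, column) pixels and the toy depth values are all assumptions made for the example.

```python
from statistics import mean


def is_head_candidate(depth, region_a, neighborhood_n, eta):
    """Check E_A(f) + eta <= E_{N \\ A}(f) on a depth map.

    depth           -- 2D list of depth values (distance from the sensor)
    region_a        -- set of (row, col) pixels of the candidate head A
    neighborhood_n  -- set of (row, col) pixels of the neighborhood N
    eta             -- required depth margin between A and N \\ A
    The pooling operator E is taken to be the mean (assumption).
    """
    if not region_a <= neighborhood_n:
        raise ValueError("region A must be contained in neighborhood N")
    ring = neighborhood_n - region_a  # N \ A
    if not region_a or not ring:
        return False
    e_a = mean(depth[r][c] for r, c in region_a)
    e_ring = mean(depth[r][c] for r, c in ring)
    return e_a + eta <= e_ring


# Toy depth map (hypothetical values): floor at 300, shoulders at 160,
# head at 140, as seen from a sensor looking straight down.
toy_depth = [
    [300, 300, 300, 300, 300],
    [300, 160, 160, 160, 300],
    [300, 160, 140, 160, 300],
    [300, 160, 160, 160, 300],
    [300, 300, 300, 300, 300],
]
head = {(2, 2)}
head_and_shoulders = {(r, c) for r in range(1, 4) for c in range(1, 4)}

print(is_head_candidate(toy_depth, head, head_and_shoulders, eta=10))
print(is_head_candidate(toy_depth, head, head_and_shoulders, eta=30))
```

With these toy values the head is 20 units closer to the sensor than the surrounding ring, so the check succeeds for a margin of 10 and fails for a margin of 30.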

2.2 HOG-SVM

The HOG-SVM is based on the method initially proposed by Dalal and Triggs in [3] for pedestrian detection and adapted by Vera et al. in [16] for people detection from top-view depth cameras. The HOG-SVM method describes a candidate in terms of histograms of oriented gradients (HOG). The analysis is performed on image patches of fixed size (\(96\times 96\) pixels); each patch is divided into blocks, each consisting of \(2 \times 2\) cells of \(8 \times 8\) pixels. The blocks partially overlap, and the amount of overlap corresponds to the size of a cell. The features are extracted by computing the gradient over the cells. The orientation of the gradient is quantized into nine-bin histograms, in which the frequencies are weighted by the gradient magnitude. In the end, a person is described by a feature vector of size 1089, and classification is performed using a support vector machine. The HOG-SVM person detector uses a sliding window that is moved over the image on a dense grid. At each position the HOG description is derived from the \(96 \times 96\) pixel patch and used by the SVM to classify the patch as either person or not a person. In order to detect persons at different scales, the image is subsampled to multiple sizes and each of the subsampled images is searched for people. We used our own implementation of the HOG-SVM method, also written in C++ using the OpenCV library.
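The patch geometry described above can be checked with a few lines of arithmetic. The sketch below is only a worked example: the helper name `hog_layout` is hypothetical, and the block stride of 8 pixels is inferred from the statement that the overlap equals the size of a cell. The text does not say how the 1089 figure is obtained; the example shows that it equals the number of blocks times nine bins, i.e. one nine-bin histogram per block (an assumption), whereas concatenating the four cell histograms of each block, as in the usual formulation of [3], would give a longer vector.

```python
def hog_layout(window=96, cell=8, cells_per_block_side=2, stride=8, bins=9):
    """Return the block grid and two possible descriptor lengths.

    stride is the block step in pixels; a stride of one cell (8 px)
    corresponds to neighboring 16 px blocks overlapping by one cell.
    """
    block = cell * cells_per_block_side           # block side in pixels
    blocks_per_side = (window - block) // stride + 1
    n_blocks = blocks_per_side ** 2
    return {
        "blocks_per_side": blocks_per_side,
        "n_blocks": n_blocks,
        # Assumption: a single histogram of `bins` values per block.
        "one_histogram_per_block": n_blocks * bins,
        # Alternative: the histograms of all cells of a block concatenated.
        "cell_histograms_concatenated": n_blocks * cells_per_block_side ** 2 * bins,
    }


layout = hog_layout()
for name, value in layout.items():
    print(f"{name}: {value}")
```

For the \(96 \times 96\) patch this gives an \(11 \times 11\) grid of 121 blocks, hence \(121 \times 9 = 1089\) under the one-histogram-per-block reading and \(121 \times 4 \times 9 = 4356\) under the concatenated reading.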

3 The Adopted Dataset

The experimental validation of the methods has been carried out using the dataset presented in [4] and subsequently adopted in [5]. The dataset has been acquired using two image sensors, namely a traditional RGB camera and the depth sensor of a Kinect device. Both acquisition devices are mounted in a zenithal position; video sequences were captured at 30 fps with a resolution of \(640 \times 480\) pixels. Since in this paper we are interested only in the images provided by the depth sensor, the RGB images were not considered. The dataset includes scenes captured with a prevalence of either solar illumination (OUTDOOR) or artificial light (INDOOR). The dataset comprises sequences with a variable number of persons flowing within the area of interest in the same direction and/or in opposite directions. In particular, in the simplest case there is a single person in the area framed by the camera, while in the most complex cases there are up to four persons moving within the area and proceeding either in the same direction, as in a queue, or in two opposite directions. As a consequence, the adoption of this dataset for our tests makes it possible to characterize the accuracy of the analyzed methods under different illumination and crowding conditions. Example images from the INDOOR and the OUTDOOR environments are shown in Fig. 1, while in Table 1 we report the number of frames in the dataset containing the number of persons specified in the leftmost columns.

Fig. 1. Examples of depth images acquired in the INDOOR (left image) and in the OUTDOOR (right image) scenarios. The OUTDOOR case is characterized by high noise due to the sunlight illumination and appearing in the form of numerous black spots.

Fig. 2. Ground truth used for the people detection: the orange and red solid lines show the head-shoulders ground truth used for the HOG-SVM method, the blue solid line shows the head ground truth used for the WATERFILLING method. (Color figure online)

Table 1. Dataset information: number of frames with 0 to 4 persons, under different illumination conditions (INDOOR/OUTDOOR).

The dataset was originally devised to test methods for counting people crossing a virtual line. Thus, in order to allow the benchmarking of the methods considered in this paper, we augmented the ground truth of the dataset with information regarding the position of each person in each frame. In particular, for each person we added a smaller box containing the head of the person, and a second box that also includes the shoulders, as shown in Fig. 2.

4 Experimental Analysis

In this section we report and analyze the results achieved by the two considered people detection approaches on the adopted dataset. Specifically, we first describe the performance indices used for comparing the methods, then we provide information regarding the configuration parameters of the methods, and finally we report the performance and comment on the pros and cons of both approaches.

4.1 Performance Indices

The figure of merit adopted for measuring the detection performance of the considered approaches is the f-index, defined as the harmonic mean of Precision and Recall. Following [16], we declare a person as correctly detected by a method if the following condition holds:

$$\begin{aligned} \frac{area ( B_d \cap B_g)}{area ( B_d \cup B_g)} > 0.5 \end{aligned}$$
(2)

where \(B_d\) and \(B_g\) are the bounding box generated by the method and the ground truth bounding box, respectively. It has to be noted that the outputs of the two methods considered for the benchmarking are not exactly the same: the WATERFILLING method only provides the location of the head of the person, while the HOG-SVM method provides the head and shoulder area. Consequently, in the evaluation of the performance the condition in Eq. (2) was checked using the proper ground truth for each method (the head bounding box for WATERFILLING and the head and shoulder bounding box for HOG-SVM). We also highlight that in our evaluation we did not consider the persons in the dataset whose head and shoulder bounding box is not completely contained in the capture area of the camera; consistently, we ignored the objects detected by the methods that lie across the borders of the frame.
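The criterion in Eq. (2) and the resulting indices can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the benchmark: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the function names are hypothetical, and the greedy one-to-one assignment of detections to ground truth boxes is an assumption, since the text does not specify how multiple detections of the same person are handled.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def iou(box_d, box_g):
    """Ratio of intersection area to union area, as in Eq. (2)."""
    inter_box = (max(box_d[0], box_g[0]), max(box_d[1], box_g[1]),
                 min(box_d[2], box_g[2]), min(box_d[3], box_g[3]))
    inter = area(inter_box)
    union = area(box_d) + area(box_g) - inter
    return inter / union if union > 0 else 0.0


def evaluate(detections, ground_truth, threshold=0.5):
    """Return (precision, recall, f_index) for one frame or a pooled set.

    Each ground truth box can be matched by at most one detection
    (greedy assignment, an assumption of this sketch).
    """
    unmatched = list(ground_truth)
    true_pos = 0
    for det in detections:
        best = max(unmatched, key=lambda g: iou(det, g), default=None)
        if best is not None and iou(det, best) > threshold:
            unmatched.remove(best)
            true_pos += 1
    false_pos = len(detections) - true_pos
    false_neg = len(unmatched)
    precision = true_pos / (true_pos + false_pos) if detections else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth else 0.0
    total = precision + recall
    f_index = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_index


# Hypothetical boxes: one good detection, one false alarm, one missed person.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
det_boxes = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(evaluate(det_boxes, gt_boxes))
```

In the toy example the first detection overlaps its ground truth box with a ratio of 81/119, above the 0.5 threshold, while the second detection matches nothing, so Precision, Recall and f-index are all 0.5.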

4.2 Training Procedure

Both people detection methods required a training phase aimed at setting the optimal parameters to be used during the tests. To this aim, we extracted a total of 102 frames from the dataset described in the previous section. The frames, each containing at least one person, were randomly selected from the whole dataset, preserving the original distribution of the number of persons present in each frame and equally distributed between the two scenarios. The frames used for the training stage were not used during the tests. Furthermore, the training dataset was augmented using rotated and flipped versions of the images; this was particularly important for achieving higher generalization of the SVM stage of the HOG-SVM method from the given set of samples extracted from the original dataset. During the training phase we noticed that, while the HOG-SVM method is able to cope with the large difference in signal-to-noise ratio between the video sequences captured in the INDOOR and the OUTDOOR scenarios (see Fig. 1), for the WATERFILLING approach the optimal values of the parameters change greatly between the two scenarios. Consequently, for the latter method we used two different parameterizations, one for the INDOOR and one for the OUTDOOR case.
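The augmentation with rotated and flipped versions can be sketched as below. The text does not state which rotation angles were used; the example assumes rotations by multiples of 90 degrees, which leave a square patch unchanged in size, and represents a patch as a plain list of rows. The function names are hypothetical.

```python
def flip_horizontal(patch):
    """Mirror a patch left to right."""
    return [row[::-1] for row in patch]


def rotate_90_clockwise(patch):
    """Rotate a patch by 90 degrees clockwise."""
    return [list(column) for column in zip(*patch[::-1])]


def augment(patch):
    """Return the patch under the four 90-degree rotations, each with
    and without a horizontal flip (8 variants in total)."""
    variants = []
    current = [list(row) for row in patch]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate_90_clockwise(current)
    return variants


# Tiny asymmetric patch standing in for a 96 x 96 training sample.
sample = [[1, 2],
          [3, 4]]
for variant in augment(sample):
    print(variant)
```

For an asymmetric patch the eight variants are all distinct, so each annotated training sample yields eight samples under this assumption.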

Table 2. Overall performance of the HOG-SVM and the WATERFILLING methods over the considered dataset.

4.3 Analysis of the Experimental Results

Table 2 reports the overall performance achieved by the HOG-SVM and the WATERFILLING methods over the considered dataset, expressed in terms of the indices defined above. We immediately notice the large difference between the two methods: HOG-SVM largely outperforms the WATERFILLING approach with respect to all three indices, with a \(28.9\%\) relative improvement of the f-index.

Table 3. Performance of the HOG-SVM and the WATERFILLING methods over the considered dataset in the INDOOR and OUTDOOR cases.
Table 4. Performance of the HOG-SVM method under different flow densities.

In Table 3 we analyze the performance of the two approaches with respect to the scenario, reporting the values of the indices separately for the INDOOR and the OUTDOOR cases. Focusing on the HOG-SVM method, we notice that its performance does not depend on the scenario; in fact, the f-index remains practically unchanged between the two cases (0.984 vs 0.986), which shows that the method is highly robust to image noise. Conversely, WATERFILLING behaves very differently, being strongly affected by the noise, especially the noise that characterizes the OUTDOOR scenario (see Fig. 1). This is demonstrated by the very large difference between the f-index achieved in the INDOOR and in the OUTDOOR cases, 0.919 vs 0.670, respectively. The results in Table 3 show that the strongest limitation of WATERFILLING in the OUTDOOR scenario is the high incidence of false alarms and, to a lesser extent, of false negatives. The high false alarm rate is explained by the fact that the high noise level in the background often causes the person’s head to fragment into several connected components, which generate spurious detections.

Table 5. Performance of the WATERFILLING method under different flow densities.
Fig. 3. The first two rows show the output in the INDOOR scenario, the last two rows show the output in the OUTDOOR scenario. In the first and third columns there are false negative and false positive events from the HOG-SVM method, while in the second and fourth columns there are false negative and false positive events from the WATERFILLING method.

In Tables 4 and 5 we report the performance of the methods for the two scenarios and for different numbers of persons simultaneously present in the scene. We notice that for both methods the number of persons in the scene does not have a significant influence on performance. Specifically, in the case of HOG-SVM the f-index stays within a narrow range, from 0.972 to 0.996. Furthermore, the best values are obtained when there is a single person in the scene. This is explained by the fact that, in a few cases in which persons are close to each other, the method produces false detections in the region separating two persons (see Fig. 3 for some examples of this situation). In the case of WATERFILLING, the f-index varies within two relatively narrow ranges, one for the INDOOR scenario (from 0.882 to 0.983) and one for the OUTDOOR scenario (from 0.654 to 0.701), highlighting that the illumination source has a higher impact than the crowding level on the performance of this method.

5 Conclusions and Future Work

In this paper, we studied two methods available in the scientific literature for people detection from top-view depth cameras. The methods under consideration follow two alternative approaches: the WATERFILLING is an unsupervised method aimed at locating the head of persons by looking for the local minima in the depth map; conversely, the HOG-SVM is a supervised method based on an SVM classifier fed by the description of the head and shoulder pattern through the histograms of oriented gradients.

The two methods have been tested on a publicly available dataset characterized by two illumination scenarios (indoor and outdoor) and containing images with varying person density. The experimental results show that the overall accuracy of the HOG-SVM method is higher than that of the unsupervised approach, especially in the outdoor scenario, where the latter generates many false positives. Furthermore, for both methods the crowd density does not appear to have a significant impact on performance.

In our future benchmarking efforts, we will consider the following aspects: expanding the set of methods with others available in the scientific literature; enlarging the dataset in order to account for other issues that may affect performance, such as the installation height and the depth sensor technology (e.g. stereo camera and Kinect 2); and studying the complementarity of the responses of the considered detectors and, consequently, the possibility of improving performance by fusing their outputs.