Abstract
Automatic people detection from videos is an important task in many computer vision applications, whether for security and safety or for business intelligence. To achieve high detection accuracy, many authors propose mounting a depth sensor in a top-view position, which mitigates the effects of occlusions and illumination conditions on performance. Unfortunately, most approaches presented so far in the scientific literature have been tested on very small datasets that do not account for the typical situations arising in real scenarios and consequently do not allow interested readers to determine which method should be used in the specific scenario at hand. In this paper we benchmark two different approaches available in the literature for people detection from a zenithally mounted depth camera: the former is an unsupervised method that finds the heads of persons, defined as the local minimum regions in the depth map, while the latter combines the histogram of oriented gradients descriptor with a support vector machine classifier. The benchmarking is performed on a public dataset of images captured in two different lighting conditions and with a varying number of persons; this makes it possible to assess the performance of the considered approaches under different real-world scenarios. A detailed analysis of the two methods is reported in the experimental section of the paper, allowing the reader to understand the pros and cons of each approach on the considered scenes.
1 Introduction
Computer vision algorithms for automatically detecting persons within a scene captured by a camera are the enabling technology for several important real-world applications. Notable examples where person detection is required as a preliminary step are: counting the number of persons passing through a virtual line, gathering statistics on how long persons remain in specific areas (e.g., in front of shop windows), detecting overcrowding conditions, etc. [2, 6, 9, 14].
Unfortunately, accurate person detection is seriously hampered by a series of problems that arise in real contexts. Among the most important issues are occlusions, i.e. situations in which a subject is not detected because he/she is partly or completely obscured by scene elements (typically other subjects) interposed between the camera and the subject. Intuitively, the probability of occurrence of this phenomenon is directly related to the density of people in the area. To mitigate this phenomenon, several authors propose installing the camera in a zenithal position, which eliminates occlusions in the area immediately below the camera, with only a gradual increase as persons move away from the optical axis of the device. A further phenomenon that typically has a significant impact on the reliability of person detection methods in real contexts is the variability of the lighting conditions of the scene, due to light switching in indoor environments or to the slow variation of solar illumination during the day in outdoor environments [15], as well as the presence of shadows [12] and specular reflections [1]. To cope with these issues, in recent years several authors have proposed using the depth map provided by the Microsoft Kinect device, as it proves to be partially immune to lighting problems. Furthermore, the availability of depth information may ease the detection of persons, starting from the observation that in typical real-world applications the head is the part of the person closest to the camera, if the latter is installed in a zenithal position. It is interesting to note that the adoption of top-view depth cameras also has an additional positive effect: it makes it easier to comply with the stringent privacy regulations applied in the vast majority of countries, making this solution preferable to those based on traditional optical cameras mounted in positions where they acquire the faces of persons.
All the above observations have motivated several research groups in recent years to propose solutions based on top-view depth cameras. The recent literature on this topic can be divided into two main streams. On one side, there are papers proposing unsupervised approaches that find persons by looking for their heads, the head being the body part closest to the camera. Specifically, in [17], Zhang et al. propose an unsupervised approach to locate persons in the scene; the method simulates water filling with the aim of finding the local minimum regions in the input depth map, which should correspond to the heads of people. Similarly, in [8], Galčík and Gargalík propose a method that locates people in the scene by detecting maxima in the depth images followed by region growing. The regions found are considered heads if they satisfy criteria related to size and roundness, and if there is evidence that they lie above shoulder-like structures. Lin and Jhuang [10] assume that the shapes of the pedestrians in the scene are similar to ellipses and that the projected area of the upper portion of a person is normally smaller than that of the lower one, so they compare and stack the areas of every layer from low to high, estimating the top portion of each object (head and shoulders). Nalepa et al. [11] determine local minima of pixel depth values and use a modified flood fill algorithm to append neighboring pixels to the minima found. Then, the method groups blobs representing various parts of a human body into a single connected component.
A second group of methods formulates person localization in the scene as a detection problem. Rauter [13] introduces simplified local ternary patterns, a new feature descriptor used for human tracking based on the head and shoulder part of the human body, and then applies a support vector machine (SVM) in the classification stage. Vera et al. [16] also propose using an SVM classifier to detect people, although in this case the description is based on histograms of oriented gradients. In Zhu and Wong [18], the 3D map of a person is described by a data structure called the head and shoulder profile (HASP), based on Haar wavelet features; the classifier is based on the AdaBoost algorithm [7]. Unfortunately, most methods have been tested on private datasets, or in a few cases on very small public datasets, which do not allow a thorough assessment of the performance of the approaches in realistic conditions or a detailed insight into the pros and cons of each method.
Contribution. This paper addresses the latter issue by benchmarking two alternative person detection methods selected from the recent literature, both of which use depth-based vision systems mounted in a top-view position. The methods have been selected to be representative of the two categories of approaches described above, i.e. unsupervised and supervised ones. To the best of our knowledge, no paper in the literature provides a similar contribution. The benchmarking is carried out on a common, large, publicly available dataset, with the aim of providing the reader with a detailed view of the performance of each method when subjected to the two main sources of error, namely the lighting conditions and the people density.
The paper is organized as follows: in Sect. 2 we provide basic information on the methods considered for the benchmarking; then, in Sects. 3 and 4 we describe, respectively, the dataset adopted for the experiments and the results achieved by the two methods, focusing on their behavior in two different lighting scenarios and under varying person densities. Finally, in Sect. 5 we draw conclusions and outline future directions of our research.
2 Methods Considered for the Benchmarking
In this section we briefly describe the two methods [16, 17] considered for the benchmarking; for details the interested reader may consult the original papers. Hereinafter, we refer to the method proposed by Zhang et al. in [17] as WATERFILLING, and to the method by Vera et al. in [16] as HOG-SVM. The methods have been selected as representative of two complementary approaches: WATERFILLING is an unsupervised method devised to locate the heads of persons by searching for the local minimum regions within the depth map, while HOG-SVM adopts a supervised approach based on a support vector machine classifier.
2.1 WATERFILLING
The method starts from the idea that the head is the part of the human body closest to the top-view sensor, so the authors formulate person detection as the problem of searching for the local minimum regions in the depth image; such regions should correspond to the heads of persons. Formally, a head is localized in the depth image by finding a region A and its neighborhood N satisfying the following constraint:
\[E(A) + \eta \le E(N \setminus A) \tag{1}\]
The operator \(E(\cdot )\) pools the depth information in a region into a real value that reflects the total depth information in the region, and \(\eta \) is a predefined threshold ensuring that the depth in A is lower than that in \(N \setminus A\) by a margin. The idea is that A and N represent the head and the shoulders, respectively. To find the local regions A, the authors employ a methodology based on a water filling process which, starting from a representation of the depth map as a landscape with humps and hollows, simulates raindrops falling over it. After the simulation, the hollow regions will have gathered the raindrops. Hollow regions that are sufficiently large and deep are considered heads. For our tests, the authors of the WATERFILLING method provided us with their original code, implemented in C++ and based on the OpenCV library.
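To make the head/shoulder constraint concrete, the following minimal sketch checks whether a candidate region A is lower than the rest of its neighborhood \(N \setminus A\) by at least the margin \(\eta \). It is an illustration only and not the original WATERFILLING code: the function name `satisfies_head_constraint`, the use of the mean as the pooling operator \(E(\cdot )\), and the toy depth values are all assumptions.

```python
import numpy as np

def satisfies_head_constraint(depth, region_a, neighborhood_n, eta, pool=np.mean):
    """Return True if pool(depth in A) + eta <= pool(depth in N minus A).

    The pooling operator defaults to the mean, which is an assumption;
    the text only states that E pools a region into a single real value.
    """
    region_a = np.asarray(region_a, dtype=bool)
    neighborhood_n = np.asarray(neighborhood_n, dtype=bool)
    ring = neighborhood_n & ~region_a  # neighborhood without the candidate head
    if not region_a.any() or not ring.any():
        return False  # the constraint is undefined for empty regions
    return bool(pool(depth[region_a]) + eta <= pool(depth[ring]))

# Toy depth map: distance from a ceiling-mounted sensor, arbitrary units.
depth = np.full((7, 7), 200.0)  # floor
depth[1:6, 1:6] = 150.0         # shoulders
depth[2:5, 2:5] = 120.0         # head (closest to the sensor)

head = np.zeros((7, 7), dtype=bool)
head[2:5, 2:5] = True
around = np.zeros((7, 7), dtype=bool)
around[1:6, 1:6] = True

accepted = satisfies_head_constraint(depth, head, around, eta=20.0)
rejected = satisfies_head_constraint(depth, head, around, eta=40.0)
print(accepted, rejected)
```

In this toy example the head is 30 units closer to the sensor than the surrounding shoulder ring, so the candidate is accepted with a margin of 20 and rejected with a margin of 40.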
2.2 HOG-SVM
HOG-SVM is based on the method initially proposed by Dalal and Triggs in [3] for pedestrian detection and adapted by Vera et al. in [16] for people detection from top-view depth cameras. The method describes a candidate in terms of histograms of oriented gradients (HOG). The analysis is performed on image patches of fixed size (\(96\times 96\) pixels); each patch is divided into blocks, each of which is divided into \(2 \times 2\) cells of \(8 \times 8\) pixels. The blocks partially overlap; the amount of overlap corresponds to the size of a cell. The features are extracted by computing the gradient over the cells. The gradient orientations are quantized into nine-bin histograms, with each vote weighted by the gradient magnitude. At the end, a person is described by a feature vector of size 1089. Classification is then performed using a support vector machine. The HOG-SVM person detector uses a sliding window that is moved over the image on a dense grid. At each position the HOG description is derived from the \(96 \times 96\)-pixel patch and used by the SVM to classify the patch as either person or not person. In order to detect persons at different scales, the image is subsampled to multiple sizes and each of the subsampled images is searched for people. We used our own implementation of the HOG-SVM method; in this case too, the method has been implemented in C++ using the OpenCV library.
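As a sanity check on the descriptor length, the sketch below counts the block positions in a \(96 \times 96\) window. It assumes that a block measures \(16 \times 16\) pixels and that adjacent blocks are shifted by one cell (8 pixels), which is one reading of the overlap described above; the function name `hog_layout` is hypothetical. Under these assumptions, the stated length of 1089 matches one nine-bin histogram per block position (\(11 \times 11 \times 9\)), whereas concatenating one histogram for each of the four cells of every block, as in the layout of Dalal and Triggs [3], would give 4356. The sketch only reproduces this arithmetic and does not establish which variant the implementation uses.

```python
def hog_layout(window=96, cell=8, cells_per_block=2, bins=9):
    """Count block positions and candidate descriptor lengths for a square window."""
    block = cell * cells_per_block  # block side in pixels (16 with the defaults)
    stride = cell                   # assumption: blocks are shifted by one cell
    blocks_per_side = (window - block) // stride + 1
    n_blocks = blocks_per_side ** 2
    return {
        "blocks_per_side": blocks_per_side,
        "n_blocks": n_blocks,
        # one histogram per block position
        "one_histogram_per_block": n_blocks * bins,
        # one histogram per cell, concatenated within each block
        "one_histogram_per_cell": n_blocks * cells_per_block ** 2 * bins,
    }

layout = hog_layout()
print(layout)
```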
3 The Adopted Dataset
The experimental validation of the methods has been carried out using the dataset presented in [4] and subsequently adopted in [5]. The dataset was acquired using two image sensors, namely a traditional RGB camera and the depth sensor of a Kinect device. Both acquisition devices are mounted in a zenithal position; video sequences were captured at 30 fps with a resolution of \(640 \times 480\) pixels. Since in this paper we are interested only in the images provided by the depth sensor, the RGB images were not considered. The dataset includes scenes captured under predominantly solar illumination (OUTDOOR) or artificial light (INDOOR). It comprises sequences with a variable number of persons moving through the area of interest in the same direction and/or in opposite directions. In particular, in the simplest case there is a single person in the area framed by the camera, while in the most complex cases there are up to four persons moving within the area, proceeding either in the same direction, as in a queue, or in two opposite directions. As a consequence, adopting this dataset for our tests makes it possible to characterize the accuracy of the analyzed methods under different illumination and crowding conditions. Example images from the INDOOR and the OUTDOOR environments are shown in Fig. 1, while Table 1 reports the number of frames in the dataset containing the number of persons specified in the leftmost columns.
The dataset was originally devised to test methods for counting people crossing a virtual line. Thus, in order to allow the benchmarking of the methods considered in this paper, we augmented the ground truth of the dataset with the position of each person in each frame. In particular, for each person we added a smaller box containing the head of the person, and a second box that also includes the shoulders, as shown in Fig. 2.
4 Experimental Analysis
In this section we report and analyze the results achieved by the two considered people detection approaches on the adopted dataset. Specifically, we first describe the performance indices used for comparing the methods, then we provide information regarding the configuration parameters of the methods, and finally we report the performance and comment on the pros and cons of both approaches.
4.1 Performance Indices
The figure of merit adopted for measuring the detection performance of the considered approaches is the f-index, defined as the harmonic mean of Precision and Recall. Following [16], we declare a person as correctly detected by a method if the following condition holds:
where \(B_d\) and \(B_g\) are the bounding boxes generated by the method and of the ground truth, respectively. Note that the outputs of the two methods considered for the benchmarking are not exactly the same: the WATERFILLING method only provides the location of the head of the person, while the HOG-SVM method provides the head and shoulder area. Consequently, in the performance evaluation the condition in Eq. (2) was checked using the appropriate ground truth for each method (head bounding box for WATERFILLING and shoulder bounding box for HOG-SVM). We also highlight that our evaluation did not consider the persons in the dataset whose head and shoulder bounding box is not completely contained in the capture area of the camera; consistently, we disregarded detections lying across the borders of the frame.
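For reference, the three indices can be computed from the counts of correct detections, false alarms and missed persons as in the minimal sketch below. The function name `detection_scores` and the counts in the example are illustrative and are not taken from the experiments.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, f_index) from detection counts.

    tp: persons correctly detected
    fp: detections that match no ground-truth person (false alarms)
    fn: ground-truth persons that were missed (false negatives)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_index = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_index

# Illustrative counts only.
precision, recall, f_index = detection_scores(tp=90, fp=10, fn=30)
print(precision, recall, f_index)
```

Because it is a harmonic mean, the f-index is pulled towards the lower of the two values, so a method cannot score well by trading many false alarms for a high recall or vice versa.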
4.2 Training Procedure
Both people detection methods required a training phase aimed at setting the optimal parameters to be used during the tests. To this end, we extracted a total of 102 frames from the dataset described in the previous section. The frames, each containing at least one person, were randomly selected from the whole dataset, preserving the original distribution of the number of persons per frame and equally distributed between the two scenarios. The frames used for the training stage were not used during the tests. Furthermore, the training set was augmented using rotated and flipped versions of the images; this was particularly important for achieving higher generalization of the SVM stage of the HOG-SVM method from the given set of samples extracted from the original dataset. During the training phase we noticed that, while the HOG-SVM method is able to cope with the large difference in signal-to-noise ratio between the video sequences captured in the INDOOR and the OUTDOOR scenarios (see Fig. 1), the optimal parameter values of the WATERFILLING approach change greatly between the two scenarios. Consequently, we used two different parameterizations of the latter method, one for the INDOOR and one for the OUTDOOR case.
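The augmentation step can be sketched as follows. The rotation angles are not specified above, so the use of multiples of 90 degrees combined with a horizontal mirror is an assumption, as is the function name `augment`; in a top-view setting such transformations are plausible because a person may walk through the scene in any direction.

```python
import numpy as np

def augment(patch):
    """Return the 8 rotated/mirrored variants of a square 2-D patch.

    Rotations by multiples of 90 degrees are an assumption made for this sketch.
    """
    variants = []
    for k in range(4):
        rotated = np.rot90(patch, k)          # rotate by k * 90 degrees
        variants.append(rotated)
        variants.append(np.fliplr(rotated))   # mirrored copy of the rotation
    return variants

# Small asymmetric patch standing in for a 96 x 96 depth sample.
patch = np.arange(9).reshape(3, 3)
variants = augment(patch)
print(len(variants))
```

For a patch without symmetries all eight variants are distinct, so each annotated sample yields eight training samples under this scheme.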
4.3 Analysis of the Experimental Results
Table 2 reports the overall performance achieved by the HOG-SVM and the WATERFILLING methods on the considered dataset, expressed in terms of the indices defined above. We immediately notice the large difference between the two methods: HOG-SVM largely outperforms the WATERFILLING approach on all three indices, with a \(28.9\%\) relative improvement in the f-index.
In Table 3 we analyze the performance of the two approaches with respect to the scenario, reporting the values of the indices separately for the INDOOR and the OUTDOOR cases. Focusing on the HOG-SVM method, we notice that its performance does not depend on the scenario: the f-index remains practically unchanged between the two cases (0.984 vs 0.986), which shows that the method is highly robust to image noise. Conversely, WATERFILLING behaves very differently, being strongly affected by noise, especially the noise that characterizes the OUTDOOR scenario (see Fig. 1). This is demonstrated by the large difference between the f-index values achieved in the INDOOR and OUTDOOR cases, 0.919 vs 0.670, respectively. The results in Table 3 show that the strongest limitation of WATERFILLING in the OUTDOOR scenario is the high incidence of false alarms and, to a lesser extent, of false negatives. The high false alarm rate is explained by the fact that the high noise level in the background often causes the person's head to fragment into several connected components, which generate spurious detections.
In Tables 4 and 5 we report the performance of the methods for the two scenarios and for different numbers of persons simultaneously present in the scene. We notice that for both methods the number of persons in the scene does not have a significant influence on performance. Specifically, in the case of HOG-SVM the value of the f-index is confined to a narrow range, from 0.972 to 0.996. Furthermore, the best values are obtained when there is a single person in the scene. This is explained by the fact that, in a few cases when persons are close to each other, the method produces false detections in the region separating the two persons (see Fig. 3 for some examples of this situation). In the case of WATERFILLING, the values of the f-index vary within two relatively narrow ranges for the INDOOR scenario (from 0.882 to 0.983) and the OUTDOOR scenario (from 0.654 to 0.701), highlighting that the illumination source has a higher impact than the crowding level on the performance of this method.
5 Conclusions and Future Work
In this paper, we studied two methods available in the scientific literature for people detection from top-view depth cameras. The methods under consideration follow two alternative approaches: the WATERFILLING is an unsupervised method aimed at locating the head of persons by looking for the local minima in the depth map; conversely, the HOG-SVM is a supervised method based on an SVM classifier fed by the description of the head and shoulder pattern through the histograms of oriented gradients.
The two methods have been tested on a publicly available dataset characterized by two illumination scenarios (indoor and outdoor) and containing images with varying person density. The experimental results show that the overall accuracy of the HOG-SVM method is higher than that of the unsupervised approach, especially in the outdoor scenario, where the latter generates many false positives. Furthermore, for both methods the crowd density does not appear to have a significant impact on performance.
In our future benchmarking effort, we will consider the following aspects: expanding the set of methods with others available in the scientific literature; enlarging the dataset in order to account for other factors that may affect performance, such as the installation height and the depth sensor technology (e.g. stereo camera and Kinect 2); and studying the complementarity of the responses of the considered detectors and, consequently, the possibility of improving performance by fusing their outputs.
References
1. Conte, D., Foggia, P., Percannella, G., Vento, M.: Removing object reflections in videos by global optimization. IEEE Trans. Circuits Syst. Video Technol. 22(11), 1623–1633 (2012)
2. Conte, D., Foggia, P., Percannella, G., Vento, M.: Counting moving persons in crowded scenes. Mach. Vis. Appl. 24(5), 1029–1042 (2013)
3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, pp. 886–893. IEEE (2005)
4. Del Pizzo, L., Foggia, P., Greco, A., Percannella, G., Vento, M.: A versatile and effective method for counting people on either RGB or depth overhead cameras. In: 2015 IEEE International Conference on Multimedia and Expo Workshops, ICMEW 2015 (2015)
5. Del Pizzo, L., Foggia, P., Greco, A., Percannella, G., Vento, M.: Counting people by RGB or depth overhead cameras. Pattern Recogn. Lett. 81, 41–50 (2016)
6. Erickson, V.L., Lin, Y., Kamthe, A., Brahme, R., Surana, A., Cerpa, A.E., Sohn, M.D., Narayanan, S.: Energy efficient building environment control strategies using real-time occupancy measurements. In: Proceedings of 1st ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Buildings, BuildSys 2009, pp. 19–24. ACM, New York (2009)
7. Freund, Y., Schapire, R.E.: A desicion-theoretic generalization of on-line learning and an application to boosting. In: Vitányi, P. (ed.) EuroCOLT 1995. LNCS, vol. 904, pp. 23–37. Springer, Heidelberg (1995). doi:10.1007/3-540-59119-2_166
8. Galčík, F., Gargalík, R.: Real-time depth map based people counting. In: Blanc-Talon, J., Kasinski, A., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2013. LNCS, vol. 8192, pp. 330–341. Springer, Cham (2013). doi:10.1007/978-3-319-02895-8_30
9. Karpagavalli, P., Ramprasad, A.: Estimating the density of the people and counting the number of people in a crowd environment for human safety. pp. 663–667 (2013)
10. Lin, D.-T., Jhuang, D.-H.: A novel layer-scanning method for improving real-time people counting. In: Stephanidis, C. (ed.) HCI 2013. CCIS, vol. 374, pp. 661–665. Springer, Heidelberg (2013). doi:10.1007/978-3-642-39476-8_133
11. Nalepa, J., Szymanek, J., Kawulok, M.: Real-time people counting from depth images. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds.) BDAS 2015. CCIS, vol. 521, pp. 387–397. Springer, Cham (2015). doi:10.1007/978-3-319-18422-7_34
12. Prati, A., Mikic, I., Trivedi, M.M., Cucchiara, R.: Detecting moving shadows: algorithms and evaluation. IEEE Trans. Pattern Anal. Mach. Intell. 25(7), 918–923 (2003)
13. Rauter, M.: Reliable human detection and tracking in top-view depth images. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 529–534 (2013)
14. Saleh, S.A.M., Suandi, S.A., Ibrahim, H.: Recent survey on crowd density estimation and counting for visual surveillance. Eng. Appl. Artif. Intell. 41, 103–114 (2015)
15. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: principles and practice of background maintenance. In: Proceedings of 7th IEEE International Conference on Computer Vision, vol. 1, pp. 255–261 (1999)
16. Vera, P., Zenteno, D., Salas, J.: Counting pedestrians in bidirectional scenarios using zenithal depth images. In: Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Rodríguez, J.S., di Baja, G.S. (eds.) MCPR 2013. LNCS, vol. 7914, pp. 84–93. Springer, Heidelberg (2013). doi:10.1007/978-3-642-38989-4_9
17. Zhang, X., Yan, J., Feng, S., Lei, Z., Yi, D., Li, S.Z.: Water filling: unsupervised people counting via vertical KINECT sensor. In: 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance (AVSS), pp. 215–220. IEEE (2012)
18. Zhu, L., Wong, K.-H.: Human tracking and counting using the KINECT range sensor based on Adaboost and Kalman filter. In: Bebis, G., et al. (eds.) ISVC 2013. LNCS, vol. 8034, pp. 582–591. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41939-3_57
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Carletti, V., Del Pizzo, L., Percannella, G., Vento, M. (2017). Benchmarking Two Algorithms for People Detection from Top-View Depth Cameras. In: Battiato, S., Gallo, G., Schettini, R., Stanco, F. (eds) Image Analysis and Processing - ICIAP 2017 . ICIAP 2017. Lecture Notes in Computer Science(), vol 10484. Springer, Cham. https://doi.org/10.1007/978-3-319-68560-1_7
DOI: https://doi.org/10.1007/978-3-319-68560-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68559-5
Online ISBN: 978-3-319-68560-1