Crowd counting in public video surveillance by label distribution learning
Introduction
As populations grow, threats in crowded environments, such as fighting, rioting, and violent protest, are on the rise. Crowd size is the most common indicator of such behaviors, and its estimation, known as crowd counting or crowd density estimation, is attracting increasing attention.
Given a video captured by a static camera in a crowded scene, crowd counting approaches estimate the number of people or the level of crowd density. Crowd counting has many potential real-world applications [1], [2]: public surveillance for safety and security by detecting abnormally large crowds; resource management in retail, where floor plans or product displays are optimized by quantifying the number of people entering and exiting at different times of the day; and urban planning, where long-term crowd management strategies or evacuation routes in public spaces are designed by statistically analysing the flow rate of people around an area. Crowd counting methods are also applicable in other fields, for instance to count animals passing through a particular boundary, to count blood cells flowing through a blood vessel under a microscope, or to measure the rate of car traffic.
Existing methods for crowd counting can be roughly divided into three categories: pedestrian detection based approaches, trajectory clustering based approaches, and feature-based regression approaches [1]. Pedestrian detection based approaches [3], [4], [5], [6] estimate the number of people by detecting whole instances of people in a crowd image, scanning the image at different scales with a trained detector. Trajectory clustering based approaches [7], [8] count people by analyzing the feature trajectories extracted from the crowd frames. These two families rely on either explicit object segmentation or feature point tracking, which demands considerable computation and high frame rate video. They are therefore poorly suited to crowd scenes with cluttered backgrounds and frequent occlusion. In contrast, feature-based regression approaches [9], [10], [11], [12], [13], [2] aim to learn a direct mapping between multi-dimensional features and class labels, relying only on low-level features extracted from crowd frames at an ordinary frame rate. For these reasons, feature-based regression approaches are more appropriate for real applications.
However, feature-based regression approaches also have inherent disadvantages. The performance of a regression function depends heavily on the amount of training data. Existing benchmarking datasets for crowd counting, such as Mall and UCSD, are insufficient and imbalanced, as shown in Fig. 1. Insufficiency means that certain classes have only a limited number of samples, while imbalance means that the number of samples varies greatly across classes. Both have a significantly negative effect on crowd estimation. To address this challenge, Chen et al. [12] propose a cumulative attribute based ridge regression (CA-RR) method for crowd counting. Experimental results on the Mall dataset show that CA-RR outperforms the state of the art, but the solution seems complicated and not straightforward. Besides the attribute based solution, we can consider a more fundamental solution based on information reuse. According to the discussion in [12], crowd counting has the characteristic that the features of crowd images containing similar numbers of people are strongly correlated. In other words, the number of people varies continuously in feature space but discretely in label space. For instance, a crowd image containing 30 people shows features similar to those of an image containing 29 people, but significantly different from those of an image containing 10 people. Thus, besides its real number of people, a crowd image can also contribute to the learning of neighboring people counts. In this way, the information in each sample is reused to compensate for insufficient and imbalanced training data.
The most popular strategy for combining regression with information reuse is label distribution learning [14], which learns a regression function between multi-dimensional features and a label distribution instead of the conventional single label. Label distribution learning has been used successfully in facial age estimation, a problem that shares with crowd counting the characteristic that samples vary continuously in features but discretely in labels. We therefore argue that label distribution learning can also be employed for crowd counting, addressing the insufficiency and imbalance of training data by reusing the information in the samples.
In this paper, we assign a label distribution, rather than a traditional single label, to each crowd image. The label distribution of a crowd instance covers a number of class labels with a probability model, which represents the degree to which each class label describes the instance. In this way, a crowd instance contributes not only to the learning of its real class but also to the learning of its neighboring classes. Hence the training data are increased significantly, and classes with insufficient training data are supplemented with training data belonging to their adjacent classes. Then, a regression function between the feature set and the label distributions is learned by the IIS-LDL algorithm, whose optimization uses a strategy similar to Improved Iterative Scaling (IIS). IIS is a well-known algorithm for maximizing the likelihood of the maximum entropy model, which makes IIS-LDL an iterative optimization process. Finally, given an unseen instance, the regression function generates a label distribution, and the predicted value is estimated as the weighted average of that distribution.
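The prediction step can be illustrated with a minimal Python sketch. It assumes the conditional maximum entropy form p(y | x) ∝ exp(θ_y · x), and it reads "weighted average" as the expectation of the count under the predicted distribution. The parameter matrix `theta`, the feature vector `x`, the label range, and both function names are hypothetical placeholders, not values or code from the paper; `theta` is drawn at random as a stand-in for parameters that IIS-LDL would learn.

```python
import numpy as np

def predict_distribution(theta, x):
    """Maximum entropy model: p(y | x) = exp(theta[y] @ x) / Z."""
    scores = theta @ x
    scores = scores - scores.max()   # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def expected_count(labels, p):
    """Weighted average of the labels under the predicted distribution."""
    return float(np.dot(labels, p))

# Hypothetical setup: people counts 0..4 and a 3-dimensional feature vector.
rng = np.random.default_rng(0)
labels = np.arange(5)
theta = rng.normal(size=(labels.size, 3))   # stand-in for learned parameters
x = np.array([0.2, -1.0, 0.5])

p = predict_distribution(theta, x)   # one probability per candidate count
count = expected_count(labels, p)    # real-valued estimate of the crowd size
```

Because the estimate averages over all labels, it uses the whole predicted distribution rather than only its most probable label.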
When designing the label distribution of a training instance, the real class label must be the leading description. In other words, the highest probability in the label distribution is assigned to the real class label, which ensures its leading position in the class description. The probability of every other class label decreases with its distance from the real class label, so that classes closer to the real class contribute more to the class description. Consequently, the labels in a predicted distribution are also correlated, which provides the theoretical justification for using the whole predicted distribution to estimate the result.
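This design rule can be sketched in Python as follows. The Gaussian shape and the spread `sigma` are assumptions chosen for illustration: the text only requires that the real count receive the highest probability and that the probability fall off with distance from it. The function name and the label range are likewise hypothetical.

```python
import numpy as np

def gaussian_label_distribution(true_count, labels, sigma=2.0):
    """Discretised Gaussian centred on the true count, normalised to sum to 1."""
    d = np.exp(-((labels - true_count) ** 2) / (2.0 * sigma ** 2))
    return d / d.sum()

labels = np.arange(0, 61)            # hypothetical range of people counts
d = gaussian_label_distribution(30, labels)

# The real label leads, and labels closer to it receive more probability.
assert labels[d.argmax()] == 30
assert d[29] > d[25] > d[10]
```

Under these assumptions, a frame with 30 people also serves as a weaker training example for counts 29 and 31, while contributing almost nothing to count 10.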
Our framework is illustrated in Fig. 2. First, the label set of a dataset is transformed into label distributions by allocating different probabilities to each label within a certain range. Second, normalization is applied to remove the effect of perspective before features are extracted from the dataset. Third, three types of features are extracted, including global features and local texture features. Finally, a regression function used for prediction is learned by the IIS-LDL algorithm. Experimental results on benchmarking datasets show that label distribution learning for crowd counting effectively reduces the effect of insufficient and imbalanced training data, and thus generally improves accuracy over the state of the art.
Section snippets
Related work
Various approaches for crowd counting have been proposed [1]. Existing techniques fall into three categories: counting by pedestrian detection, counting by trajectory clustering, and counting by feature-based regression.
Counting by detection finds pedestrian instances by scanning the image space with a trained detector. This paradigm includes several different detection methods.
Monolithic detection method [3], [15], [16] is a typical pedestrian detection
Perspective normalization
Before features are extracted from the datasets, the effects of perspective must be taken into account. Perspective makes objects closer to the camera appear larger in a frame, so it is important to normalize the features to reduce these effects. Since perspective varies linearly, the feature extracted from each pixel can be weighted by the relative location of the pixel in the frame. The weight of a pixel depends on the depth of the object that contains the pixel. That is to
Label distribution
In this subsection, we follow the discussion in [14] to present the formulation of a label distribution and explain how it is used in our framework.
Each label y in a label distribution is assigned a real number in [0,1], which represents the degree to which y describes the instance. These numbers sum to 1 over all labels, meaning that the instance is fully described. Traditionally, an instance is labelled with a single label or multiple labels, which can be viewed as several
Datasets and evaluation settings
Datasets: Experiments are conducted on two benchmarking datasets, the UCSD dataset and the Mall dataset, which represent an outdoor and an indoor scene, respectively. The UCSD dataset was collected from a stationary digital camcorder overlooking a pedestrian walkway at the University of California, San Diego (UCSD). The Mall dataset was captured using a publicly accessible surveillance camera in a shopping mall. Detailed information on the two datasets is given in Table 1. As shown in Fig. 8,
Conclusions
This paper applies label distribution learning to the problem of crowd counting: training data are labelled with label distributions, so that a training sample contributes not only to the learning of its real class but also to the learning of its neighboring classes. In this way, the training data are increased significantly, and classes with insufficient training data are supplemented with training data belonging to their adjacent classes. As for the label distribution,
Acknowledgment
This work is funded by the National Natural Science Foundation of China (Nos. 61375036, 61273300, 61232007), the Beijing Natural Science Foundation (No. 4132064), the Jiangsu Natural Science Funds for Distinguished Young Scholar (BK20140022), the Program for New Century Excellent Talents in University, the Beijing Higher Education Young Elite Teacher Project, the Fundamental Research Funds for the Central Universities, and the Key Lab of Computer Network and Information Integration of Ministry
References (41)
- et al., Partial least-squares regression: a tutorial, Anal. Chim. Acta (1986)
- Chen Change Loy, Ke Chen, Shaogang Gong, Tao Xiang, Crowd counting and profiling: methodology and evaluation, in:...
- Antoni B. Chan, Z.-S.J. Liang, Nuno Vasconcelos, Privacy preserving crowd monitoring: counting people without people...
- Navneet Dalal, Bill Triggs, Histograms of oriented gradients for human detection, in: IEEE Computer Society Conference...
- et al., Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- et al., Segmentation and tracking of multiple humans in crowded environments, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
- Danny B. Yang, Héctor H González-Baños, Leonidas J. Guibas, Counting people in crowds with a real-time network of...
- Vincent Rabaud, Serge Belongie, Counting crowded moving objects, in: 2006 IEEE Computer Society Conference on Computer...
- Gabriel J. Brostow, Roberto Cipolla, Unsupervised Bayesian detection of independent motion in crowds, in: 2006 IEEE...
- et al., Crowd monitoring using image processing, Electron. Commun. Eng. J. (1995)
- Facial age estimation by learning from label distributions, IEEE Trans. Pattern Anal. Mach. Intell.
- Pedestrian detection via classification on Riemannian manifolds, IEEE Trans. Pattern Anal. Mach. Intell.
- Robust real-time face detection, Int. J. Comput. Vis.
Zhaoxiang Zhang received his B.S. degree in electronic science and technology from the University of Science and Technology of China, Hefei, China, in 2004 and his Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2009. He then joined the School of Computer Science and Engineering, Beihang University, Beijing 100191, China, as a faculty member. He is now an Associate Professor in the School of Computer Science and Engineering and the vice-director of the Department of Computer Application Technology. His research interests include computer vision, pattern recognition and image processing. He is the corresponding author of this paper.
Mo Wang received his B.E. degree in computer science and technology from Beihang University, China (BUAA) in 2013, and is an M.E. candidate in the School of Computer Science and Engineering, Beihang University, China.
Xin Geng is currently a professor and the director of the PALM lab (http://palm.seu.edu.cn/) of Southeast University, China. He received the B.Sc. (2001) and M.Sc. (2004) degrees in computer science from Nanjing University, China, and the Ph.D. (2008) degree in computer science from Deakin University, Australia. His research interests include pattern recognition, machine learning, and computer vision. He has published 38 refereed papers in these areas, including those published in prestigious journals and top international conferences. He has been an Associate Editor or Guest Editor of several international journals, such as FCS, PRL and IJPRAI. He has served as a Program Committee Chair for several international/national conferences, such as PRICAI’18, JSAI’14, VALSE’13, etc., and a Program Committee Member for a number of top international conferences, such as CVPR, ICCV, IJCAI, AAAI, ACMMM, ECCV, etc.