1 Introduction

Crowd management has become an important task, and video surveillance systems can be of great help in this context. In daily life, people gather at public places such as railway stations and marketplaces, and for activities or events such as sports and cultural programmes. To ensure safety and proper management, crowd behavior analysis is crucial. Behavioral anomalies of a crowd depend not only on the nature of the participating group but also on the crowd volume and density. Hence, estimating these parameters through a video surveillance system is an important step towards crowd behavior analysis and management. In this paper, we present three novel methods for classifying a crowd image as dense or sparse using low-level features based on domain knowledge. Finally, the classifiers are fused to develop a robust system.

The paper is organized as follows. This brief introduction is followed by a review of past work in Sect. 2. The proposed methodology is elaborated in Sect. 3. Section 4 presents the experimental results and discussion. Concluding remarks are given in Sect. 5.

2 Past Work

A large variety of methods exists in the literature. Some works are based on still images and some on videos. Some works focus only on dense crowd images. One of the main approaches towards crowd density estimation is to count the population. This approach [1, 2] can be sub-grouped as human detection based and motion based. In the human detection based approach [3], the challenge lies in designing the human detector, and the subsequent counting is straightforward. In the motion based approach, the number of components with independent motion is taken as the count [4, 5].

Marana et al. [6] used texture features in the form of Gray-level Dependence Matrices (GLDM) and applied a Self Organizing Map (SOM) to classify crowd images into different density categories ranging from very low to very high. Li et al. [7] applied a head detector on the segmented foreground to obtain the count. Cheriyadat et al. [4] worked on image sequences with moving crowds, where low-level feature points are tracked, and regions with coherent motion are detected as objects for counting. SIFT features are also used for crowd detection in [8]. Corner point based methods are widely used to count the number of moving people [5, 9]. Subburaman et al. [3] used gradient orientation features at interest points and an Adaboost classifier. Jiang [10] proposed an improvement on the regression based crowd counting mechanism. Idrees et al. [11] proposed a hybrid approach for highly dense crowd images, where head detector and interest point based counts were combined with Fourier analysis. Hafeezallah et al. [12] introduced curvelet frame change detection, which enhances the statistical features for counting the individuals in the crowd.

In recent times, convolutional neural networks (CNN) have been used for crowd density estimation [13, 14]. The network is trained with known crowd patches and then adapted to the target scenario. It is well known that obtaining a meaningful result from a deep learning based method requires a huge training set whose distribution is a good representative of the population from which the test (target) data would be drawn. Such a training set may not always be available. Furthermore, although a considerable variety of methods exists, no single method can handle all sorts of crowds. Moreover, some methods can handle images of dense crowds only. Thus, characterizing a crowd as dense or sparse at the outset is essential in choosing an optimal strategy. In this work, we attempt to develop a robust system that can classify crowd images as dense or sparse based on a small training set.

3 Proposed Methodology

In this work, we try to determine whether a crowd seen in an image is dense or sparse. Here, a crowd image is conceived as a texture image: a dense crowd image appears as a fine (micro) texture, while a sparse crowd resembles a coarse (macro) texture. Thus, sparse/dense crowd classification reduces to coarse/fine texture classification. This motivates us to look for a variety of texture descriptors suitable for the task. Here we consider three different texture descriptors. The first two rely on fractal dimension, whereas the last one is based on the count of interest (corner) points over image patches. The feature extraction processes are detailed as follows.

Fig. 1. Sample images from the dataset

Fig. 2. Distance transform based descriptor for example sparse crowd in Fig. 1

Fig. 3. Distance transform based descriptor for example dense crowd in Fig. 1

Descriptor Based on Distance Transform and Fractal Dimension: First, the color image is converted into a gray-scale image and segmented using the morphological watershed algorithm [15, 16], where the gray-scale value of a pixel represents the altitude at that location. The watershed line surrounds each region depicting a uniform surface feature. For dense crowd images, a large number of small segments are obtained, while for sparse crowds the segments are large and few in number. The watershed algorithm produces a binary image of distinct regions with watershed lines in between. It may be noted that any other segmentation scheme that generates closed contours could have been used.

Fig. 4. Fractal dimension descriptor for example sparse and dense crowds in Fig. 1

Fig. 5. Corner point based descriptor for example sparse and dense crowd in Fig. 1

To extract the texture feature from the said binary image, we apply the distance transform [17]. The result of the transform is a two-dimensional matrix (say, T) of the same size as the image, in which each element denotes the distance of the corresponding pixel from the nearest watershed line. Hence, it reveals whether the texture is fine or coarse. Finally, the texture feature is extracted from the distance matrix in terms of fractal dimension. Note that fractal dimension has already been used for texture segmentation [18, 19]. It indicates roughness and self-similarity in the image. For a dense image, more self-similarity is expected compared to a sparse one. T is divided into \(K \times K\) patches with a stride of K/p. Fractal dimension is computed over each patch. A normalized histogram of these fractal dimensions is taken as the feature vector. Here, we empirically choose \(K=100\) and \(p=2\), i.e., a stride of 50 pixels.
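The patch-wise histogram described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: the function name `patch_histogram`, the number of bins, and the histogram range of 2 to 3 are assumptions, and the fractal dimension estimator is passed in as a callable because the text does not specify one. The matrix T is assumed to be already computed.

```python
import numpy as np

def patch_histogram(matrix, estimator, K=100, p=2, bins=10, value_range=(2.0, 3.0)):
    """Normalized histogram of a per-patch statistic over K x K patches.

    `estimator` is any callable mapping a K x K array to a scalar, e.g. a
    fractal dimension estimator. `bins` and `value_range` are assumptions
    made for this sketch; the paper does not state them.
    """
    matrix = np.asarray(matrix, dtype=float)
    stride = K // p  # stride of K/p, i.e. 50 pixels for K=100 and p=2
    rows, cols = matrix.shape
    values = []
    for top in range(0, rows - K + 1, stride):
        for left in range(0, cols - K + 1, stride):
            values.append(estimator(matrix[top:top + K, left:left + K]))
    # Clip so that out-of-range estimates fall into the first or last bin.
    values = np.clip(np.asarray(values, dtype=float), *value_range)
    hist, _ = np.histogram(values, bins=bins, range=value_range)
    return hist / max(hist.sum(), 1)  # guard against images smaller than K
```

With K=100 and p=2, a 200 x 200 matrix yields nine overlapping patches, and the resulting histogram sums to one.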

The watershed algorithm has a threshold parameter that controls the segmentation process; its selection is data-dependent and non-trivial. The impact of different threshold values on the segmentation varies with the crowd density, and this variation pattern can itself be an indicator of density. In our work, we take three threshold values, 40, 35 and 30, which are chosen empirically and applied to all the images in the dataset. The corresponding histograms are concatenated to form the image texture descriptor. Figure 1 shows sample sparse and dense images. The corresponding distance transform matrices and histograms are shown in Figs. 2 and 3. It is evident that the fractal dimension distribution differs between the two types of crowd.

Descriptor Based on Fractal Dimension: The above algorithm is intuitively very promising, but it is affected by certain issues. For example, we expect a small number of large segments in the binarized sparse crowd image, but this assumption fails in the case of a textured background. To avoid this problem, we drop the segmentation step. Fractal dimension is computed over each patch of the gray-level image, and these values are summarized into a normalized histogram of fractal dimension. The histograms of fractal dimension for the sample sparse and dense crowds are shown in Fig. 4.
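The text does not state which fractal dimension estimator is applied to a gray-level patch. As one possibility, the sketch below uses differential box counting, which is an assumption made here for illustration and not a confirmed detail of the method. The function name `dbc_fractal_dimension`, the box sizes, and the number of gray levels are likewise hypothetical choices.

```python
import numpy as np

def dbc_fractal_dimension(patch, box_sizes=(2, 4, 5, 10, 20), levels=256):
    """Differential box-counting estimate of fractal dimension (assumed estimator).

    The patch is treated as a surface whose height is the gray level. For each
    box size s, the patch is tiled with s x s blocks and the number of boxes of
    height h needed to cover the gray-level range of each block is counted.
    """
    patch = np.asarray(patch, dtype=float)
    m = min(patch.shape)
    log_inv_r, log_n = [], []
    for s in box_sizes:
        n_blocks = m // s
        h = s * levels / m  # box height scales with the box side
        side = n_blocks * s
        blocks = patch[:side, :side].reshape(n_blocks, s, n_blocks, s)
        block_max = blocks.max(axis=(1, 3))
        block_min = blocks.min(axis=(1, 3))
        # Boxes needed per block: index of top box minus index of bottom box, plus one.
        n_r = np.floor(block_max / h) - np.floor(block_min / h) + 1
        log_n.append(np.log(n_r.sum()))
        log_inv_r.append(np.log(m / s))
    # The fractal dimension is the slope of log N_r against log(1/r).
    slope, _ = np.polyfit(log_inv_r, log_n, 1)
    return float(slope)
```

For a perfectly flat patch the estimate is 2, the dimension of a plane, while rougher patches give larger values. This ordering is what makes the histogram of per-patch values usable as a texture descriptor.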

Descriptor Based on Corner Point: The fractal dimension based feature is global in nature and is influenced by background texture. To reduce this influence and to incorporate local character, we focus on a corner point based descriptor. The number of such points in a small patch of a dense crowd image is usually higher than that in a sparse crowd image.

We extract corner points using the Harris-Stephens algorithm [20]. The sensitivity factor is taken as 0.05. The image is then divided into patches as before, and the corner points in each patch are counted. The histogram of normalized counts is taken as the descriptor. The histograms of the example crowd images are shown in Fig. 5. Usually, for a dense crowd the non-zero histogram bins spread over large counts, whereas for a sparse crowd they are usually restricted to the lower range of counts with a strong peak. To reduce the effect of noise, an edge preserving smoothing [21] can be applied as pre-processing.
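The counting step can be sketched as below, assuming the corner coordinates have already been produced by a Harris-Stephens detector. The function name `corner_count_histogram`, the number of bins, and the cap `max_count` on the per-patch count are assumptions for this illustration; the text does not specify how the counts are normalized or binned.

```python
import numpy as np

def corner_count_histogram(corners, image_shape, K=100, p=2, bins=10, max_count=50):
    """Normalized histogram of per-patch corner counts.

    `corners` is an (N, 2) array of (row, col) positions from any corner
    detector. `bins` and `max_count` are assumed values for this sketch.
    """
    corners = np.asarray(corners, dtype=float).reshape(-1, 2)
    stride = K // p
    rows, cols = image_shape
    counts = []
    for top in range(0, rows - K + 1, stride):
        for left in range(0, cols - K + 1, stride):
            inside = (
                (corners[:, 0] >= top) & (corners[:, 0] < top + K)
                & (corners[:, 1] >= left) & (corners[:, 1] < left + K)
            )
            counts.append(int(inside.sum()))
    # Counts above max_count are clipped into the last bin.
    counts = np.clip(np.asarray(counts, dtype=float), 0, max_count)
    hist, _ = np.histogram(counts, bins=bins, range=(0, max_count))
    return hist / max(hist.sum(), 1)
```

Under these assumptions, a sparse crowd would concentrate its mass in the first few bins, while a dense crowd would spread it towards higher counts.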

3.1 Classification

For all three descriptors, we have used a decision tree as the classifier [22]. During training, the data is split at each decision node so as to maximize the information gain at the child nodes. During testing, a simple condition on a feature is tested at each node and the corresponding branch is taken. This process continues recursively until a leaf node is reached, based on which we predict the class label.
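The split criterion mentioned above can be illustrated for a single feature. The sketch below computes the entropy-based information gain of a threshold split and searches the midpoints between sorted feature values. The function names are hypothetical, and a full decision tree would repeat this search over all features and recurse on the child nodes.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    prob = counts / counts.sum()
    return float(-(prob * np.log2(prob)).sum())

def information_gain(feature, labels, threshold):
    """Entropy reduction obtained by splitting on feature <= threshold."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = labels.size
    children = (left.size / n) * entropy(left) + (right.size / n) * entropy(right)
    return entropy(labels) - children

def best_threshold(feature, labels):
    """Threshold with maximum information gain among midpoints of sorted values."""
    values = np.unique(np.asarray(feature, dtype=float))
    if values.size < 2:
        raise ValueError("need at least two distinct feature values to split")
    candidates = (values[:-1] + values[1:]) / 2
    gains = [information_gain(feature, labels, t) for t in candidates]
    best = int(np.argmax(gains))
    return float(candidates[best]), float(gains[best])
```

For example, with feature values 0.1, 0.2, 0.8, 0.9 and labels sparse, sparse, dense, dense, the best split lies midway between 0.2 and 0.8 and removes all uncertainty about the class.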

Fusing the Classifier: It is understood from the description of the features that some of them supplement each other while some are redundant. In addition, the classifier must be robust; that is, the standard deviation of accuracy across test runs must be as low as possible. So it may be worth exploring the fusion of the classifiers based on these features. We have tried both kinds of fusion: feature level fusion and decision level fusion. In the former case, the three sets of features obtained from (i) distance transform and fractal dimension, (ii) fractal dimension, and (iii) count of corner points are concatenated to form a single feature vector, which is then fed to the classifier. In the latter case, the outputs or decisions obtained from the classifiers using the three different feature sets are combined through an artificial neural network with three input nodes, two output nodes and a hidden layer. Results of the fused classifiers are also reported.
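The two fusion schemes differ mainly in what is combined, which the following sketch makes concrete. The function names, the hidden layer size, the tanh activation and the softmax output are assumptions for illustration; the text only fixes three input nodes, two output nodes and one hidden layer, and the network weights would have to be learned from training data.

```python
import numpy as np

def feature_level_fusion(f_dt_fd, f_fd, f_corner):
    """Concatenate the three descriptors into a single feature vector."""
    return np.concatenate([f_dt_fd, f_fd, f_corner])

def decision_level_input(d_dt_fd, d_fd, d_corner):
    """Stack the three per-descriptor decisions (0 = sparse, 1 = dense)."""
    return np.array([d_dt_fd, d_fd, d_corner], dtype=float)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer network with two output nodes."""
    hidden = np.tanh(W1 @ x + b1)       # assumed activation
    logits = W2 @ hidden + b2
    exp = np.exp(logits - logits.max()) # numerically stable softmax
    return exp / exp.sum()
```

In feature level fusion a single classifier sees the concatenated vector, whereas in decision level fusion the network only sees the three decisions and learns how much to trust each of them.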

4 Experimental Results and Discussion

We have performed the experiments on a machine with an Intel® Core™ i5-5200U CPU and 4 GB RAM. All the code is written in MATLAB®.

Although there are many public datasets for crowd counting and tracking, a dataset for crowd density based classification is, to the best of our knowledge, not readily available. Hence, we have created a dataset by collecting images from the UCF-CC50 dataset [23] and the ShanghaiTech dataset [24]. The images are selected manually such that they mostly contain the region of interest, i.e., the spaces where the crowd is actually present. Multiple raters were employed to categorize each image as dense or sparse, and the ground-truth label of each image is assigned by majority voting over the raters' opinions. The dataset thus prepared contains 64 dense and 64 sparse crowd images, so that the two classes are balanced.
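Majority voting over rater opinions can be sketched as follows. The function name `majority_label` and the handling of ties (returning `None` so that the image can be reviewed or discarded) are assumptions; the text does not say how many raters were used or how ties were resolved.

```python
from collections import Counter

def majority_label(votes):
    """Return the most frequent label among the raters, or None on a tie."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumed tie policy: leave the image unlabeled
    return ranked[0][0]
```

With an odd number of raters and two classes, ties cannot occur and every image receives a label.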

To run the experiment with the given dataset, we have randomly partitioned the images of each category into two halves, trained the model, i.e., the decision tree classifier, on one half, and tested it on the other half. This is done 50 times, and the average accuracy is reported in Table 1 as a quantitative measure of the performance of the proposed system.
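The evaluation protocol above can be sketched as a repeated, per-class random half split. The function name `repeated_half_split_accuracy`, the `fit_predict` callable interface and the fixed random seed are assumptions for this illustration; any classifier that trains on one half and predicts on the other can be plugged in.

```python
import numpy as np

def repeated_half_split_accuracy(X, y, fit_predict, runs=50, seed=0):
    """Mean and standard deviation of accuracy over repeated half splits.

    Each class is split into two random halves, so that training and test
    sets stay balanced. `fit_predict(X_train, y_train, X_test)` must return
    the predicted labels for X_test.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    scores = []
    for _ in range(runs):
        train_idx, test_idx = [], []
        for label in np.unique(y):
            idx = rng.permutation(np.flatnonzero(y == label))
            half = len(idx) // 2
            train_idx.extend(idx[:half])
            test_idx.extend(idx[half:])
        train_idx = np.array(train_idx)
        test_idx = np.array(test_idx)
        predicted = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(np.asarray(predicted) == y[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))
```

Reporting the standard deviation alongside the mean gives the robustness measure used when comparing the descriptors and the fused classifiers.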

For comparison among the descriptors, the experiment is done for each descriptor separately, and the average classification accuracy is shown in the first three rows of Table 1. Table 1 reveals that the accuracy of the corner point based descriptor (96.18%) is significantly higher than that of the fractal dimension based descriptors (80.06% and 88.59%). Moreover, the lower standard deviation of the former indicates that it yields a more robust classifier than the other two. We also tried other widely used classifiers such as neural networks and SVM, but their performance was poor, which can be attributed to the limited dataset. For the same reason, we could not explore a deep learning approach.

Table 1. Classification accuracy for different descriptors

As suggested earlier, we have explored both feature level and decision level fusion of the classifiers. Results are shown in the 4th and 5th rows of Table 1. Although robustness is improved in both cases, the accuracy does not exceed that of the corner point based descriptor. This indicates that the fractal dimension based features are largely redundant with respect to the corner point based descriptor and do not add value in terms of accuracy when fused. In addition, the performances of decision level fusion and feature level fusion are the same in terms of statistical significance.

We have compared the performance with the Multi-column CNN (MCNN) used in [24]. The pretrained network is used to prepare the density map for each image of our dataset, and the map is used as input to a neural network with one hidden layer. Results in Table 1 show that the accuracy of MCNN is lower than that of the corner point based descriptor and the fused classifiers (both feature level and decision level).

5 Conclusion

In this work, we have presented a simple method to classify a crowd image as dense or sparse. The proposed method exploits three different descriptors based on domain knowledge. It is found that, among these features, the interest point based feature performs best because it includes local information. Importantly, none of the features requires interest region segmentation or background subtraction. It is also seen that classifier fusion leads to more robustness, i.e., less variation in performance. However, as these methods rely on texture information, a texture-heavy sparse crowd image may be wrongly classified as a dense one. This issue may be addressed in future work. Moreover, the dataset can be further enhanced to include more variety and to enable the use of deep learning. Nevertheless, the work shows that the proposed feature based methodology has good potential in classifying a crowd as dense or sparse.