1 Introduction

Automatic estimation of crowd density, as a means of crowd control and management, is an important research area in current video surveillance. Existing methods perform density estimation by extracting complex features, but they struggle to meet the requirements of practical applications because of mutual occlusion within crowds and cluttered environments. Convolutional neural networks have marked capabilities in feature learning. Automatically and reliably estimating the number of people, or the population density, in surveillance footage can not only trigger alerts for abnormal crowd behavior, but can also support crowd simulation and research on crowd behavior and crowd psychology.

Many methods have been developed that incorporate scale information into the learning process. Some early methods relied on people detection to estimate the instantaneous count of pedestrians crossing a line of interest in a video sequence, catering only to low-density crowded scenes [1]. These methods are hampered in dense crowds, and their performance falls far short of expectations. Inspired by the success of Convolutional Neural Networks (CNNs) on various computer vision tasks, many CNN-based methods have been developed, some addressing visual understanding in general [2, 3] and some devoted to overcoming the difficulties of crowd counting [4, 5]. With respect to receptive fields and the loss of details, even the more accurate algorithms remain limited: certain CNN-based methods specifically address the use of features at different scales via multi-column architectures, or recover spatial resolution via transposed convolutions, as in the CNN-based cascaded multi-task learning (Cascaded-MTL) network [6, 7]. Although these methods demonstrate robustness to the corresponding issues, they are still restricted to the scales seen during training and hence are limited in their capacity to learn well-generalized models.

This paper is devoted to simplifying and generalizing the model: we attempt to extract as many rich features as possible while minimizing the number of network parameters. Global features are learned along with density map estimation via a dilated convolutional neural network. Results of the proposed method on a sample image and the corresponding density map are shown in Fig. 1. Unlike the latest works that apply deep CNNs with auxiliary components, we focus on designing an easy-to-train CNN-based density map generator. Our model uses pure convolutional layers as the backbone, so it supports input images of flexible resolutions. We deploy dilated convolution layers as the front-end to enlarge receptive fields and fractionally strided convolution layers as the back-end to restore spatial resolution. With such a simple structure, the number of network parameters is reduced, which makes CAFN easy to train. In addition, we achieve lower MAE than previous crowd counting solutions on the ShanghaiTech Part A and Part B datasets.

Fig. 1.

Results of the proposed method. Left: the input image (from the ShanghaiTech dataset [6]) with a ground truth count of 819. Right: the corresponding density map generated by the proposed method, with an estimated count of 834.

The rest of the paper is structured as follows. Section 2 reviews previous work on crowd counting and density map generation. Section 3 introduces the architecture and configuration of our model, while Sect. 4 presents the experimental results on several datasets. In Sect. 5, we conclude the paper.

2 Related Work

Traditional approaches to crowd counting from images relied on hand-crafted representations to extract low-level features, which were then mapped to counts or density maps via various regression techniques. Detection-based methods typically employ sliding-window detection algorithms to count people in an image [8]. These methods are adversely affected by high-density crowds and background clutter. To overcome these obstacles, researchers turned to counting by regression, learning a mapping from features extracted from local image patches to the corresponding counts [9].

Unlike counting by detection, regression-based methods estimate crowd counts without recognizing the location of each person. Early works employ edge and texture features such as HOG and LBP to learn a mapping from image patterns to the corresponding crowd counts [10,11,12]. Multi-source information is utilized in [13] to regress crowd counts in extremely dense crowd images. An end-to-end CNN model adapted from AlexNet [14] was recently constructed for counting in extremely crowded scenes. Later, instead of regressing the count directly, the spatial distribution of crowds was taken into consideration by regressing CNN feature maps into crowd density maps. Similar frameworks are developed in [15], where a Hydra-CNN architecture is designed to estimate crowd densities in a variety of scenes. Better performance can be obtained by further exploiting switching structures or contextual correlations using LSTMs [16,17,18]. Although counting by regression is reliable in crowded settings, without object location information its predictions for low-density crowds tend to be overestimated. The robustness of such methods depends on the stability of the statistical data, whereas in low-density scenarios the number of instances is too small to reveal the underlying statistics. Detection and regression methods ignore key spatial information present in the images because they regress on the global count. Hence, Lempitsky et al. [10] proposed learning a linear mapping between local patch features and corresponding object density maps so as to incorporate the spatial information present in the images.

Most recently, Sam et al. [17] proposed Switch-CNN, which uses a density-level classifier to choose different regressors for particular input patches. Sindagi et al. [18] presented a Contextual Pyramid CNN, which uses CNN networks to estimate context at various levels for achieving lower count error and better quality density maps. These two solutions achieve state-of-the-art performance, and both use a multi-column architecture (MCNN) and a density-level classifier. However, we observe several disadvantages in these approaches. Multi-column CNNs are hard to train with the training method described in [6], and such an inflated network structure requires more time to train. Both solutions require a density-level classifier before sending patches into the MCNN, yet the granularity of density levels is hard to define in real-time congested scene analysis, since the number of objects varies over a large range. Moreover, using a classifier means more columns need to be implemented, which complicates the design and introduces redundancy. These works spend a large portion of their parameters on density-level classification to label the input regions, rather than feeding parameters into the final density map generation. Since the branch structure in MCNN is not efficient, the shortage of parameters devoted to density map generation hurts the final accuracy. Taking all the above drawbacks into consideration, we propose a novel approach that concentrates on encoding wider and deeper features in congested scenes and generating high-quality density maps. Our CAFN enlarges receptive fields through atrous (dilated) convolution while reducing the number of convolutional layers, and the loss of details is finally restored as much as possible by transposed convolutional layers.

3 Proposed Method

The fundamental idea of the proposed design is to deploy a double-column dilated CNN that captures high-level features with larger receptive fields and generates high-quality density maps without brutally expanding the network complexity. In this section, we first introduce the architecture: a network whose input is an image and whose output is a crowd density map (i.e., how many people per unit area), from which the head count is obtained by integration. We then present the corresponding training method.

3.1 CAFN Architecture

The VGG-16 [19] front-end of the model named CSRNet in [20] outputs a feature map that is 1/8 the size of the original input. As discussed in CSRNet, the output would shrink further if more convolution and pooling layers (the basic components of VGG-16) were stacked, making it difficult to generate high-quality density maps; its back-end therefore employs dilated convolutions to extract deeper salient information and improve the output resolution. Inspired by this idea, we use dilated convolutions as the front-end of CAFN because of their larger receptive fields, rather than adopting dilated convolution only after the resolution has already dropped to a very low level, as in CSRNet. In this paper, atrous convolutions are primarily used to capture more information from the original image, and a transposed convolutional layer is then used to enlarge the feature map, upsampling the previous layer's output to compensate for the loss of details.

To build the training set, we crop 9 patches from each image at different locations, each 1/4 the size of the original image. The first four patches are the four non-overlapping quarters of the image, while the other five patches are cropped from the input image at random locations. Starting from the three branches of MCNN, we add dilation rates to the filters to enlarge their receptive fields. To reduce the number of network parameters, we consider 4 types of double-column combinations as experimental candidates, which are discussed in detail in Sect. 4. After extracting features with filters of different scales, we deploy transposed convolutional layers as the back-end to maintain the output resolution. Considering the stability of the model, we choose a relatively better configuration whose MAE is not the lowest (but whose MSE is the lowest) by comparing the different groups.
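
As a concrete illustration of this cropping scheme, the following minimal sketch (plain Python with a hypothetical helper name crop_patches, assuming the image is a NumPy-style array) produces the four quarter patches plus five randomly placed patches:

```python
import random

def crop_patches(img, n_random=5):
    """Crop 9 training patches, each 1/4 the area of the input image:
    the four non-overlapping quarters plus n_random randomly placed patches."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    # The four quarters of the image (no overlap).
    patches = [img[:h, :w], img[:h, w:2 * w],
               img[h:2 * h, :w], img[h:2 * h, w:2 * w]]
    # Additional patches cropped at random locations.
    for _ in range(n_random):
        y = random.randint(0, img.shape[0] - h)
        x = random.randint(0, img.shape[1] - w)
        patches.append(img[y:y + h, x:x + w])
    return patches
```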

The overall structure of our network is illustrated in Fig. 2. It contains two parallel columns whose filters have different dilation rates and local receptive fields of different sizes. The two columns are merged to fuse features from different scales: we use torch.cat to concatenate the feature maps output by the two columns along the channel dimension. For simplicity, we use the same structure for both columns (i.e., conv–pooling–conv–pooling) except for the sizes and numbers of filters. Max pooling is applied over 2 × 2 regions, and the Parametric Rectified Linear Unit (PReLU) is adopted as the activation function because of its favourable performance for CNNs. To reduce the computational complexity (the number of parameters to be optimized), we use fewer filters in convolutional layers with larger filters. We stack the output feature maps of all convolutional layers and map them to a density map using 1 × 1 filters. The configuration of our network is given in Table 1.
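
The sketch below illustrates this double-column idea in PyTorch. The dilation rates, channel counts, and kernel sizes here are placeholders chosen for illustration only; the actual configuration is the one listed in Table 1.

```python
import torch
import torch.nn as nn

class CAFNSketch(nn.Module):
    """Illustrative double-column network: two dilated columns
    (conv-pool-conv-pool with PReLU), channel-wise concatenation (torch.cat),
    a transposed-convolution back-end, and a 1x1 output layer.
    Kernel/channel numbers are placeholders, not the values of Table 1."""

    def __init__(self, in_ch=1):
        super().__init__()

        def column(dilation, ch):
            pad = dilation  # keeps the spatial size for 3x3 kernels
            return nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, padding=pad, dilation=dilation), nn.PReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(ch, 2 * ch, 3, padding=pad, dilation=dilation), nn.PReLU(),
                nn.MaxPool2d(2),
            )

        self.col1 = column(dilation=2, ch=16)  # smaller receptive field
        self.col2 = column(dilation=3, ch=16)  # larger receptive field
        self.back_end = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(32, 1, 1),  # 1x1 convolution maps features to the density map
        )

    def forward(self, x):
        fused = torch.cat([self.col1(x), self.col2(x)], dim=1)  # fuse both columns
        return self.back_end(fused)
```

For an input x of shape (N, 1, H, W), CAFNSketch()(x) returns a density map (at half the input resolution in this toy setup: two poolings followed by one upsampling step); summing the map gives the estimated head count.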

Fig. 2.

The structure of the proposed double-column convolutional neural network for crowd density map estimation.

Table 1. Configuration of CAFN

All convolutional layers use padding to maintain the spatial size. Convolutional layers' parameters are denoted as "conv (kernel size) @ (number of filters)". Max-pooling layers are conducted over a 2 × 2 pixel window with stride 2. The fractionally strided convolutional layer is denoted as "ConvTransposed (kernel size) @ (number of filters)". PReLU is used as the non-linear activation layer.

The Euclidean distance is then used to measure the difference between the estimated density map and the ground truth. The loss function is defined as follows:

$$ L\left( {\theta } \right) = \frac{1}{2N}\sum\limits_{i = 1}^{N} {\left\| {Y\left( {X_{i} ;\theta } \right) - Y_{i}^{GT} } \right\|}_{2}^{2} $$
(1)

where \( \theta \) is the set of learnable parameters of CAFN, \( N \) is the number of training images, \( X_{i} \) is the \( i \)th input image, and \( Y_{i}^{GT} \) is its ground truth density map. \( Y\left( {X_{i} ;\theta } \right) \) stands for the estimated density map generated by CAFN, parameterized by \( \theta \), for sample \( X_{i} \). \( L \) is the loss between the estimated density map and the ground truth density map.
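
A minimal PyTorch rendering of Eq. (1), assuming pred and gt are batches of estimated and ground-truth density maps, could look like this:

```python
import torch

def density_loss(pred, gt):
    """Eq. (1): average over the batch of 0.5 * ||Y(X_i; theta) - Y_i^GT||_2^2.
    pred and gt are (N, 1, H, W) estimated and ground-truth density maps."""
    n = pred.size(0)
    return 0.5 * torch.sum((pred - gt) ** 2) / n
```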

3.2 Dilated Convolutions and Transposed Convolutions

The network takes an image of arbitrary size as input and outputs a crowd density map. It has two parts corresponding to two functions: the first part learns larger-scale features, and the second part restores the resolution to perform density map estimation. One of the critical components of our design is the dilated convolutional layer. Systematic dilation supports exponential expansion of the receptive field without loss of resolution or coverage.

The deeper the network layer, the more of the original image each unit aggregates, i.e., the larger the receptive field; in conventional CNNs, however, this is achieved by pooling, at the cost of reduced resolution and lost information from the original image. Because of the pooling layers, the feature maps in later layers become smaller and smaller. Dilated convolution was introduced for image semantic segmentation to avoid the loss of resolution and information caused by downsampling. It enlarges the receptive field without increasing the number of parameters or the amount of computation. An example is shown in Fig. 3(c): a normal 3 × 3 convolution (dilation rate = 1) has a 3 × 3 receptive field, while the dilated version (dilation rate = 2) delivers a 5 × 5 receptive field.
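
As a quick check of this receptive-field arithmetic: a single k × k convolution with dilation rate d covers k + (k − 1)(d − 1) pixels per side. The hypothetical helper below simply evaluates this formula:

```python
def effective_kernel(k, d):
    """Effective receptive field (per side) of one k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

# A 3x3 kernel: dilation 1 -> 3x3, dilation 2 -> 5x5, dilation 3 -> 7x7, dilation 4 -> 9x9
print([effective_kernel(3, d) for d in (1, 2, 3, 4)])  # [3, 5, 7, 9]
```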

Fig. 3.

Different convolution methods in this paper. Blue maps are inputs, and cyan maps are outputs. (a) Half padding, no strides. (b) No padding, no strides, transposed. (c) No padding, no stride, dilation. (Color figure online)

The second component consists of a transposed convolutional layer that upsamples the previous layer's output to compensate for the loss of details caused by the earlier pooling layers, followed by a 1 × 1 convolutional layer as the output layer.

The convolution arithmetic is illustrated in Fig. 3.
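
The small sketch below (with illustrative layer sizes, not those of Table 1) shows how a stride-2 transposed convolution restores the resolution removed by one 2 × 2 pooling stage before the 1 × 1 output layer:

```python
import torch
import torch.nn as nn

# A stride-2 transposed convolution doubles the spatial resolution that one
# 2x2 max-pooling stage removed; kernel size and channel counts are illustrative.
pooled = torch.randn(1, 32, 56, 56)                    # feature map after pooling
upsample = nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2, padding=1)
to_density = nn.Conv2d(32, 1, kernel_size=1)           # 1x1 output layer
print(to_density(upsample(pooled)).shape)              # torch.Size([1, 1, 112, 112])
```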

We use a simple method to ensure that the improvements achieved are due to the proposed method and not to sophisticated procedures for calculating the ground truth density maps. The ground truth density map \( D_{i} \) corresponding to the \( i \)th training patch is computed by summing 2D Gaussian kernels centered at every person's location, as defined below:

$$ D_{i} \left( x \right) = \sum\limits_{{x_{g} \in P}} {G\left( {x - x_{g} ,\sigma } \right)} $$
(2)

where \( \sigma \) is the scale parameter of the 2D Gaussian kernel and \( P \) is the set of all points at which people are located. Training and evaluation were performed on an NVIDIA TITAN-X GPU using the Torch framework [21]. Adam optimization with a learning rate of 0.00001 and a momentum of 0.9 was used to train the model.
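
One simple way to realize Eq. (2), following the description above, is to place a unit impulse at each head location and convolve with a Gaussian. The sketch below assumes SciPy is available and uses an illustrative value of σ (the paper does not state it here):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ground_truth_density(points, shape, sigma=4.0):
    """Eq. (2): sum of 2D Gaussian kernels centred at each annotated head.
    points: iterable of (row, col) head coordinates; sigma is an assumed,
    fixed scale parameter."""
    density = np.zeros(shape, dtype=np.float32)
    for r, c in points:
        r = min(max(int(r), 0), shape[0] - 1)
        c = min(max(int(c), 0), shape[1] - 1)
        density[r, c] += 1.0
    # Convolving the impulse map with a Gaussian preserves the total count,
    # so density.sum() equals the number of annotated people.
    return gaussian_filter(density, sigma)
```

The model would then be optimized against such maps with Adam at the learning rate quoted above, e.g. torch.optim.Adam(model.parameters(), lr=1e-5).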

4 Experiments

In this section, we present the experimental details and evaluation results on three publicly available datasets: ShanghaiTech Part_A, Part_B, and UCF_CC_50 [13]. For evaluation, we adopt the standard metrics used by many existing crowd counting methods, defined as follows:

$$ MAE = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left| {y_{i} - \hat{y}_{i} } \right|} $$
(3)
$$ MSE = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {\left( {y_{i} - \hat{y}_{i} } \right)^{2} } } $$
(4)

where \( MAE \) is the mean absolute error, \( MSE \) is the mean squared error, \( N \) is the number of test samples, \( y_{i} \) is the ground truth count, and \( \hat{y}_{i} \) is the estimated count corresponding to the \( i \)th image. Roughly speaking, \( MAE \) indicates the accuracy of the estimates, and \( MSE \) indicates the robustness of the estimates.
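
Given per-image ground-truth and estimated counts, the two metrics of Eqs. (3)–(4) can be computed with the following minimal NumPy sketch:

```python
import numpy as np

def mae_mse(gt_counts, est_counts):
    """Eqs. (3)-(4): mean absolute error and (root of the) mean squared error
    over the N test images."""
    gt = np.asarray(gt_counts, dtype=np.float64)
    est = np.asarray(est_counts, dtype=np.float64)
    mae = np.mean(np.abs(gt - est))
    mse = np.sqrt(np.mean((gt - est) ** 2))
    return mae, mse
```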

We compare four types of column combinations (see Fig. 4) with different dilation rates. Type 1 combines column 1 (dilation rate = 2) with column 2 (dilation rate = 3). Type 2 fuses column 2 with column 3 (dilation rate = 4). Type 3 combines columns 1 and 3. Type 4 merges all three columns. The experimental results are shown in Table 2.

Fig. 4.

Four types of combinations

Table 2. Comparison of experimental results on the ShanghaiTech dataset

4.1 ShanghaiTech Dataset

The ShanghaiTech dataset contains 1198 annotated images with a total of 330,165 people whose head centers are annotated. As far as we know, this dataset is the largest in terms of the number of annotated people. It consists of two parts: Part A contains 482 images randomly crawled from the Internet, and Part B contains 716 images taken on the busy streets of metropolitan areas in Shanghai. The crowd density varies significantly between the two subsets, making accurate estimation of the crowd more challenging than on most existing datasets. Both parts are divided into training and testing sets: 300 images of Part A are used for training and the remaining 182 for testing; 400 images of Part B are used for training and 316 for testing.

First, we design an experiment to demonstrate that MCNN does not perform better than a regular network; the comparison is reported in Table 3.

Table 3. Comparison to MCNN

Then, to demonstrate that MCNN may not be the best choice, we design a deeper double-column network with fewer parameters than MCNN. Results show that the double-column version achieves better performance on the ShanghaiTech Part A dataset with the lowest MAE, as shown in Table 4.

Table 4. Estimation errors on the ShanghaiTech dataset.

4.2 UCF_CC_50 Dataset

The UCF_CC_50 dataset includes 50 images with different perspectives and resolutions. The number of persons annotated per image ranges from 94 to 4543, with an average of 1280. 5-fold cross-validation is performed following the standard setting in [13]. Comparisons of MAE and MSE are listed in Table 5.

Table 5. Estimation errors on UCF_CC_50 dataset

5 Conclusions

In this paper, we presented the CAFN network, which jointly adopts dilated convolutions and fractionally strided convolutions. Atrous convolutions are devoted to enlarging receptive fields, which helps incorporate rich characteristics into the network and enables the model to learn globally relevant discriminative features, thereby accounting for large count variations in the dataset. Additionally, we employed fractionally strided convolutional layers as the back-end to restore the loss of details caused by max-pooling layers in the earlier stages, allowing us to regress full-resolution density maps. The model has moderate complexity and strong generalization ability, and experiments on multiple datasets show that it delivers respectable density estimation performance in densely crowded scenes.