Learning a deep network with cross-hierarchy aggregation for crowd counting
Introduction
With the world population growing rapidly, crowd counting has attracted considerable interest in recent years. Its main objective is to estimate the number of people in a single image or video frame captured by surveillance cameras. Crowd counting is therefore widely applied in video security monitoring, traffic congestion control, urban planning, and other fields [1]. However, it remains difficult because of illumination changes, severe occlusions, uneven crowd distributions, and camera perspective distortions.
Following the success of Convolutional Neural Networks (CNNs) in computer vision tasks including classification [2], [3], detection [4], [5], segmentation [6], [7], and person re-identification [8], [9], researchers have recently proposed many methods [10], [11], [12], [13], [14], [15], [16], [17], [18] that use CNNs to extract features from crowd images and generate density maps for crowd counting. Integrating a generated density map, that is, summing its values over all pixels, yields the estimated number of people in the image. These methods have made significant progress on the problems listed above. However, two main challenges remain, as illustrated in Fig. 1. One is uneven distribution: within a single scene, the crowd may be dense in one area and sparse in another. The other is camera perspective distortion, under which the scale of people varies from barely visible tiny points to clear body outlines.
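As a minimal sketch of how a count is read off a density map, the example below builds a toy map and sums it. The Gaussian kernel, its size and width, the head coordinates, and all function names are assumptions made for illustration; they are not taken from the paper, which describes its own ground truth generation in Section 3.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    half = size // 2
    raw = [
        [math.exp(-((x - half) ** 2 + (y - half) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(sum(row) for row in raw)
    return [[value / total for value in row] for row in raw]


def make_density_map(height, width, heads, size=5, sigma=1.0):
    """Place one unit-mass kernel at each (row, col) head position."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    density = [[0.0] * width for _ in range(height)]
    for r, c in heads:
        for dy in range(size):
            for dx in range(size):
                y, x = r + dy - half, c + dx - half
                # Mass that falls outside the image is dropped in this toy version.
                if 0 <= y < height and 0 <= x < width:
                    density[y][x] += kernel[dy][dx]
    return density


def count_from_density(density):
    """Integrate (sum) the density map to obtain the estimated count."""
    return sum(sum(row) for row in density)


# Hypothetical head annotations, all far enough from the border
# that no kernel mass is lost.
heads = [(5, 5), (5, 6), (10, 12), (14, 3)]
density = make_density_map(20, 20, heads)
estimated_count = count_from_density(density)
```

Because each person contributes a kernel of total mass one, the sum of the map matches the number of annotated heads up to floating-point error, even where neighbouring kernels overlap.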
Multi-column [11], [12], [19], multi-scale [13], [20], [21], [22], and other designs [14], [15], [16] have been built on CNN-based frameworks to address these obstacles. Although these methods alleviate the problems, they fuse features from only certain layers and aggregate them at a specific stage of the network. Moreover, when fusing features from different hierarchies, they consider only multi-scale information for handling various crowd scales. In fact, hierarchical features carry more than scale information. According to Hou et al. [23], deep hierarchies extract high-level features rich in semantic information, while shallow hierarchies extract low-level features rich in spatial information. We therefore believe that each hierarchy contains a great deal of useful information, including multi-scale information, that can enhance the accuracy of crowd counting. Furthermore, although the information in the crowd features differs from hierarchy to hierarchy, all of it can improve feature representation and model performance.
To fully extract information from each hierarchy and aggregate cross-hierarchy features from the preceding hierarchies, we propose a novel Cross-Hierarchy Aggregation Network (CHANet) that learns as much information as possible from hierarchical features without over-assigning parameters for feature reuse. Residual [24] and dense [25] connections are two different ways to aggregate features from different hierarchies. Residual connections pass information directly to the following hierarchies of the network and address the degradation problem, while dense connections directly concatenate features from different hierarchies and improve feature reuse. Both reuse the features from preceding hierarchies and allow maximum information flow. We therefore propose a combination of residual and dense connections, named Cross-Hierarchy Aggregation (CHA), that fully extracts abundant local features from the states of preceding hierarchies and captures maximum information from the crowd features without over-assigning parameters. Specifically, our method adopts several CHA modules, each containing a set of convolutional layers. Within a CHA module, local hierarchical aggregation concatenates the features from each hierarchy. A 1 × 1 convolution then adaptively aggregates cross-hierarchy information from the concatenated features and enhances the representation ability of the aggregated features. In addition, a local residual connection further aggregates the local hierarchical features with the input of the CHA module. By combining residual and dense connections, the CHA module preserves many kinds of useful information from every hierarchy, including but not limited to scale information, and ensures maximum information flow in the network. Furthermore, to fuse the global hierarchical features of the whole crowd image from the shallow hierarchies, we present a global residual connection that connects the local features generated by the preceding CHA modules.
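The data flow described above can be sketched in NumPy as follows. This is an illustrative reading of the paragraph, not the authors' implementation: every layer is simplified to a pointwise (1 × 1) convolution with ReLU, each layer is assumed to receive the concatenation of all preceding states, the global residual is assumed to add the shallow input features to the output of the last module, and the names `cha_module`, `cha_stack`, `growth`, and the chosen sizes are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """Pointwise convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def relu(x):
    return np.maximum(x, 0.0)


def cha_module(x, layer_weights, fuse_weight):
    """One CHA-style module (sketch): dense concatenation, 1x1 fusion, local residual."""
    states = [x]
    for weight in layer_weights:
        # Dense connection: each layer sees all preceding states (assumption).
        dense_input = np.concatenate(states, axis=0)
        states.append(relu(conv1x1(dense_input, weight)))
    # Local hierarchical aggregation: concatenate the features of every hierarchy.
    aggregated = np.concatenate(states, axis=0)
    # 1x1 convolution fuses the concatenation back to the input channel count.
    fused = conv1x1(aggregated, fuse_weight)
    # Local residual connection with the module input.
    return fused + x


def init_cha_params(rng, channels, growth, n_layers):
    """Random weights for one module; shapes follow the growing concatenation."""
    layer_weights = [
        rng.normal(0.0, 0.1, size=(growth, channels + i * growth))
        for i in range(n_layers)
    ]
    fuse_weight = rng.normal(0.0, 0.1, size=(channels, channels + n_layers * growth))
    return layer_weights, fuse_weight


def cha_stack(shallow, modules):
    """Chain several modules and add a global residual from the shallow features."""
    features = shallow
    for layer_weights, fuse_weight in modules:
        features = cha_module(features, layer_weights, fuse_weight)
    return features + shallow  # global residual connection (assumed form)


rng = np.random.default_rng(0)
channels, growth, n_layers, n_modules = 8, 4, 3, 2  # hypothetical sizes
modules = [init_cha_params(rng, channels, growth, n_layers) for _ in range(n_modules)]
shallow = rng.normal(size=(channels, 16, 16))
out = cha_stack(shallow, modules)
```

Fusing with a 1 × 1 convolution keeps the output channel count equal to the input, which is what lets the local residual addition work and lets modules be chained without the channel dimension growing from one module to the next.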
The major contributions of this work are as follows:
- A novel CHANet is proposed for crowd counting to extract conducive crowd information in hierarchical features and aggregate cross-hierarchy features to generate more reasonable density maps.
- We present a CHA module to ensure maximum information flow by combining residual and dense connections and also study the number of convolutional layers in each CHA module as well as the number of CHA modules.
- Experimental evaluations on four crowd counting datasets demonstrate that the proposed CHANet outperforms the state-of-the-art methods.
Section 2 briefly reviews CNN-based methods for crowd counting. Section 3 details the proposed CHANet. Section 4 presents extensive experimental results on four datasets to evaluate our method. Finally, Section 5 summarizes this work.
Section snippets
Related work
Because CNNs have more powerful representation ability for feature extraction than hand-crafted approaches, researchers have recently focused on developing CNN-based methods that produce crowd density maps, so that the number of people can be calculated by integrating the complete density map. In this section, we review representative CNN-based counting methods and counting methods based on feature aggregation.
Proposed method
In this section, we describe the proposed CHANet in detail. We first give the overall architecture of the network. Then we show the detailed structure of the CHA module. Finally, we present the ground truth generation and the loss function of our method.
Experiments
In this section, we evaluate the performance of the proposed CHANet. First, we give the evaluation metrics used to measure counting performance. Second, we describe the implementation details. Next, we report the results of CHANet on four datasets and compare them with other state-of-the-art methods. Then, we study the values of C and H. Finally, we conduct cross-dataset experiments to show the generalization performance and effectiveness of CHANet.
Conclusion
In this work, we propose a novel Cross-Hierarchy Aggregation Network (CHANet) for density estimation and crowd counting. CHANet extracts conducive information in crowd features from each hierarchy and aggregates cross-hierarchy features to generate reasonable density maps. A Cross-Hierarchy Aggregation (CHA) module, which combines residual and dense connections without over-assigning parameters for feature reuse, is proposed to fully extract local hierarchical features and capture maximum
CRediT authorship contribution statement
Qiang Guo: Conceptualization, Investigation, Methodology, Visualization, Writing - original draft, Writing - review & editing. Xin Zeng: Investigation, Methodology, Validation, Writing - review & editing. Shizhe Hu: Conceptualization, Writing - review & editing. Sonephet Phoummixay: Investigation, Data curation. Yangdong Ye: Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 61772475) and the National Key R&D Program of China (No. 2018YFB1201403).
References (51)
- et al., Multi-criteria active deep learning for image classification, Knowl. Based Syst. (2019)
- et al., Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection, Inf. Fusion (2019)
- et al., People detection and articulated pose estimation framework for crowded scenes, Knowl. Based Syst. (2017)
- et al., Knowledge based domain adaptation for semantic segmentation, Knowl. Based Syst. (2020)
- et al., Improving person re-identification by attribute and identity learning, Pattern Recognit. (2019)
- et al., A multi-context representation approach with multi-task learning for object counting, Knowl. Based Syst. (2020)
- et al., Adversarial learning for multiscale crowd counting under complex scenes, IEEE Trans. Cybern. (2020)
- et al., Fast crowd density estimation with convolutional neural networks, Eng. Appl. Artif. Intell. (2015)
- V.S. Lempitsky, A. Zisserman, Learning to count objects in images, in: NIPS, 2010, pp....
- et al., Going deeper with convolutions
- Mask R-CNN
- Pose-guided feature alignment for occluded person re-identification
- An aggregated multicolumn dilated convolution network for perspective-free counting
- Improving the learning of multi-column convolutional neural network for crowd counting
- Attentional neural fields for crowd counting
- DADNet: Dilated-attention-deformable convnet for crowd counting
- Relational attention network for crowd counting
- Crowd counting using deep recurrent spatial-aware network
- Crowd counting with deep structured scale integration network
- Single-image crowd counting via multi-column convolutional neural network
- Crowd counting via adversarial cross-scale consistency pursuit
- DSPNet: Deep scale purifier network for dense crowd counting, Expert Syst. Appl.
- Crowd density estimation using fusion of multi-layer features, IEEE Trans. Intell. Transp. Syst.
- Deeply supervised salient object detection with short connections, IEEE Trans. Pattern Anal. Mach. Intell.
- Deep residual learning for image recognition