Elsevier

Knowledge-Based Systems

Volume 213, 15 February 2021, 106691

Learning a deep network with cross-hierarchy aggregation for crowd counting

https://doi.org/10.1016/j.knosys.2020.106691

Highlights

  • A novel deep model for crowd counting is proposed.

  • We propose cross-hierarchy aggregation to reuse hierarchical features.

  • Results show that our method outperforms the state-of-the-art methods.

  • Cross-scene evaluation verifies the superior generalization ability of our model.

Abstract

Crowd counting, a significant but challenging task in computer vision, aims to estimate the number of people in an image or video. Recent methods for crowd counting have achieved promising performance thanks to deep neural networks, but most of them ignore the abundant conducive information in hierarchical features. In this paper, a novel Cross-Hierarchy Aggregation Network (CHANet) is proposed to exploit multi-hierarchy information in the crowd features from each hierarchy and to aggregate cross-hierarchy features into reasonable density maps. First, we propose a CHA module that fully extracts local hierarchical features and captures maximum information from the crowd features. The CHA module combines residual and dense connections to reuse features without over-assigning parameters. Then, we exploit the global hierarchical features from the shallow hierarchies through a global residual connection to obtain a more powerful representation ability. Experimental evaluations on four publicly available crowd counting datasets (ShanghaiTech, UCF-QNRF, WorldExpo’10, and Beijing BRT) demonstrate that the proposed CHANet achieves superior performance compared with other state-of-the-art methods.

Introduction

With the world population growing rapidly, crowd counting has attracted a great deal of interest in the past few years. Its main objective is to estimate the number of people in a single image or video frame collected by surveillance cameras, so crowd counting is widely applied in video security monitoring, traffic congestion control, urban planning, and other fields [1]. However, it remains difficult because of illumination changes, severe occlusions, uneven crowd distributions, and camera perspective distortions.

Owing to the successful application of Convolutional Neural Networks (CNNs) in computer vision, including classification [2], [3], detection [4], [5], segmentation [6], [7], and person re-identification [8], [9], researchers have recently proposed many methods [10], [11], [12], [13], [14], [15], [16], [17], [18] that use CNNs to extract features from crowd images and generate density maps for crowd counting. Integrating a generated density map then yields the number of people in the image. These methods have made significant progress on the aforementioned problems that hinder crowd counting performance. However, two main challenges remain, as illustrated in Fig. 1. One is uneven distribution: within a single scene, the crowd is dense in one area and sparse in another. The other is camera perspective distortion: the scales of people vary from barely visible tiny points to clear body outlines.
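The idea that "integrating a density map yields the count" can be made concrete with a minimal NumPy sketch. Here each annotated head is replaced by a unit-mass Gaussian blob, so the sum over the map recovers the head count; the function names (`gaussian_kernel`, `density_map`) and the fixed kernel size/sigma are illustrative assumptions, not the paper's exact ground-truth procedure.

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    """Normalized 2-D Gaussian kernel whose entries sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def density_map(shape, head_points, size=15, sigma=4.0):
    """Place one unit-mass Gaussian blob per annotated head location."""
    dmap = np.zeros(shape)
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for y, x in head_points:
        # clip the kernel at the image borders
        y0, y1 = max(0, y - r), min(shape[0], y + r + 1)
        x0, x1 = max(0, x - r), min(shape[1], x + r + 1)
        ky0, kx0 = y0 - (y - r), x0 - (x - r)
        dmap[y0:y1, x0:x1] += k[ky0:ky0 + (y1 - y0), kx0:kx0 + (x1 - x0)]
    return dmap

heads = [(30, 40), (32, 45), (70, 20)]   # toy annotations (row, col)
dmap = density_map((100, 100), heads)
print(round(float(dmap.sum()), 2))       # 3.0 — integration recovers the count
```

Because each blob integrates to one, the estimated count is simply `dmap.sum()`; overlapping blobs in dense regions add up correctly, which is why density-map regression handles heavy occlusion better than detection.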

Multi-column [11], [12], [19], multi-scale [13], [20], [21], [22], and other approaches [14], [15], [16] have been built on CNN-based frameworks to tackle these obstacles. Although these methods alleviate the problems, they fuse features only from certain layers and aggregate them at a specific stage of the network. Moreover, when fusing features from different hierarchies, they consider only multi-scale information for various crowd scales. In fact, hierarchical features carry more than scale information. According to Hou et al. [23], deep hierarchies extract high-level features rich in semantic information, while shallow hierarchies extract low-level features rich in spatial information. We therefore believe that each hierarchy contains a substantial amount of conducive information, including but not limited to multi-scale information, that can enhance crowd counting accuracy. Furthermore, although the crowd features extracted from different hierarchies differ, all of them can improve the feature representation ability and the performance of the model.

To fully extract information from each hierarchy and aggregate cross-hierarchy features from the preceding hierarchies, we propose a novel Cross-Hierarchy Aggregation Network (CHANet) that learns maximum information from hierarchical features without over-assigning parameters for feature reuse. Residual [24] and dense [25] connections are two different ways to aggregate features from different hierarchies: residual connections pass information directly to the following hierarchies and address the degradation problem, while dense connections directly concatenate features from different hierarchies and improve feature reuse. Both reuse the features of preceding hierarchies and allow maximum information flow. We therefore propose a combination of residual and dense connections, named CHA, that fully extracts abundant local features from the states of preceding hierarchies and captures maximum information from the crowd features without over-assigning parameters. Specifically, our method stacks several CHA modules, each containing a set of convolutional layers. Within each CHA module, local hierarchical aggregation concatenates the features from every hierarchy, and a 1 × 1 convolution adaptively aggregates the cross-hierarchy information to enhance the representation ability of the aggregated features. In addition, a local residual connection further aggregates the local hierarchical features with the input of the CHA module. By combining residual and dense connections, the CHA module preserves multiple kinds of advantageous information, including but not limited to scale information, from every hierarchy and ensures maximum information flow in the network. Furthermore, to fuse the global hierarchical features of the whole crowd image from the shallow hierarchies, we present a global residual connection that connects the previous local features generated by the CHA modules.

The major contributions of this work are as follows:

  • A novel CHANet is proposed for crowd counting to extract conducive crowd information in hierarchical features and aggregate cross-hierarchy features to generate more reasonable density maps.

  • We present a CHA module to ensure maximum information flow by combining residual and dense connections and also study the number of convolutional layers in each CHA module as well as the number of CHA modules.

  • Experimental evaluations on four crowd counting datasets demonstrate that the proposed CHANet outperforms the state-of-the-art methods.

We give a brief review of CNN-based methods for crowd counting in Section 2. Details of the proposed CHANet are illustrated in Section 3. In Section 4, extensive experimental results on four datasets evaluate the performance of our method. Finally, we conclude this work in Section 5.


Related work

Since CNNs have a more powerful ability to extract features than hand-crafted descriptors, researchers have recently focused on CNN-based methods that produce crowd density maps, from which the number of people is obtained by integrating the complete density map. In this section, we review representative CNN-based counting methods and counting methods based on feature aggregation.

Proposed method

In this section, we specifically demonstrate our proposed CHANet. We first give the overall architecture of the network. Then we show the detailed structure of the CHA module. Finally, we present the ground truth generation and loss function of our method.
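For the loss function mentioned above, density-map regression networks are commonly trained with a pixel-wise Euclidean loss between the estimated and ground-truth density maps. Whether CHANet uses exactly this form is specified only in the full text, so treat the sketch below as the standard baseline, not the paper's definitive choice; the `1/(2N)` scaling follows the usual convention in counting papers.

```python
import numpy as np

def euclidean_loss(pred, gt):
    """Pixel-wise Euclidean loss over a batch of N density maps:
    L = (1 / 2N) * sum_i ||pred_i - gt_i||_2^2
    (the loss commonly used for density-map regression)."""
    n = pred.shape[0]
    return float(np.sum((pred - gt) ** 2) / (2 * n))

# Toy batch: 2 maps of 4x4 pixels, prediction all zeros, ground truth all ones.
pred = np.zeros((2, 4, 4))
gt = np.ones((2, 4, 4))
print(euclidean_loss(pred, gt))  # 32 squared errors / (2 * 2) = 8.0
```

Minimizing this loss pushes the predicted map toward the ground-truth map pixel by pixel, which in turn drives the integrated count toward the true count.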

Experiments

In this section, we evaluate the performance of the proposed CHANet. First, we give the evaluation metrics used to measure counting performance. Implementation details are described in the second part. Next, we report the results of CHANet on four datasets and compare them with other state-of-the-art methods. Then, we study the values of C and H. Finally, cross-dataset experiments demonstrate the generalization ability and effectiveness of CHANet.
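Crowd counting papers conventionally report Mean Absolute Error (MAE) and a root-mean-squared error that the literature usually labels "MSE"; assuming the evaluation metrics referenced above are these standard ones, they can be computed as follows (the predicted and ground-truth counts here are made-up toy numbers).

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean Absolute Error over the test images."""
    p, g = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return float(np.mean(np.abs(p - g)))

def rmse(pred_counts, gt_counts):
    """Root Mean Squared Error (often reported as "MSE" in counting papers)."""
    p, g = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return float(np.sqrt(np.mean((p - g) ** 2)))

pred = [105.0, 48.0, 310.0]  # integrated density-map counts (toy values)
gt = [100.0, 50.0, 300.0]    # annotated ground-truth counts (toy values)
print(round(mae(pred, gt), 2))   # (5 + 2 + 10) / 3 ≈ 5.67
print(round(rmse(pred, gt), 2))  # sqrt((25 + 4 + 100) / 3) ≈ 6.56
```

MAE measures overall accuracy, while the squared-error metric penalizes large per-image failures more heavily and so reflects robustness across scenes.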

Conclusion

In this work, we propose a novel Cross-Hierarchy Aggregation Network (CHANet) for density estimation and crowd counting. The CHANet extracts conducive information in crowd features from each hierarchy and aggregates cross-hierarchy features to generate reasonable density maps. The Cross-Hierarchy Aggregation (CHA) module, which combines residual and dense connections without over-assigning parameters for feature reuse, is proposed to fully extract local hierarchical features and capture maximum information of the crowd features.

CRediT authorship contribution statement

Qiang Guo: Conceptualization, Investigation, Methodology, Visualization, Writing - original draft, Writing - review & editing. Xin Zeng: Investigation, Methodology, Validation, Writing - review & editing. Shizhe Hu: Conceptualization, Writing - review & editing. Sonephet Phoummixay: Investigation, Data curation. Yangdong Ye: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61772475) and the National Key R&D Program of China (No. 2018YFB1201403).

References (51)

  • K. He et al., Mask R-CNN

  • J. Miao et al., Pose-guided feature alignment for occluded person re-identification

  • D. Deb et al., An aggregated multicolumn dilated convolution network for perspective-free counting

  • Z. Cheng et al., Improving the learning of multi-column convolutional neural network for crowd counting

  • A. Zhang et al., Attentional neural fields for crowd counting

  • D. Guo et al., DADNet: Dilated-attention-deformable convnet for crowd counting

  • A. Zhang et al., Relational attention network for crowd counting

  • L. Liu et al., Crowd counting using deep recurrent spatial-aware network

  • L. Liu et al., Crowd counting with deep structured scale integration network

  • Y. Zhang et al., Single-image crowd counting via multi-column convolutional neural network

  • Z. Shen et al., Crowd counting via adversarial cross-scale consistency pursuit

  • X. Zeng et al., DSPNet: Deep scale purifier network for dense crowd counting, Expert Syst. Appl. (2020)

  • X. Ding et al., Crowd density estimation using fusion of multi-layer features, IEEE Trans. Intell. Transp. Syst. (2020)

  • Q. Hou et al., Deeply supervised salient object detection with short connections, IEEE Trans. Pattern Anal. Mach. Intell. (2020)

  • K. He et al., Deep residual learning for image recognition