Elsevier

Knowledge-Based Systems

Volume 213, 15 February 2021, 106691

Learning a deep network with cross-hierarchy aggregation for crowd counting

https://doi.org/10.1016/j.knosys.2020.106691

Highlights

  • A novel deep model for crowd counting is proposed.

  • We propose cross-hierarchy aggregation to reuse hierarchical features.

  • Results show that our method outperforms the state-of-the-art methods.

  • Cross-scene evaluation verifies the superior generalization ability of our model.

Abstract

Crowd counting, a significant but challenging task in computer vision, aims to estimate the number of people in an image or video. Recent methods for crowd counting have achieved promising performance thanks to deep neural networks, but most of them ignore the abundant conducive information in hierarchical features. In this paper, a novel Cross-Hierarchy Aggregation Network (CHANet) is proposed to exploit multi-hierarchy information in the crowd features from each hierarchy and to aggregate cross-hierarchy features into reasonable density maps. First, we propose a CHA module that fully extracts local hierarchical features and captures maximum information from the crowd features. The CHA module combines residual and dense connections to reuse features without over-assigning parameters. Then, we exploit the global hierarchical features from the shallow hierarchies through a global residual connection to obtain a more powerful representation ability. Experimental evaluations on four publicly available crowd counting datasets (ShanghaiTech, UCF-QNRF, WorldExpo’10, and Beijing BRT) demonstrate that the proposed CHANet achieves superior performance compared with other state-of-the-art methods.

Introduction

With the world population growing rapidly, crowd counting has attracted a great deal of interest in the past few years. Its main objective is to estimate the number of people in a single image or video frame collected by surveillance cameras, so crowd counting is widely applied in video security monitoring, traffic congestion control, urban planning, and other fields [1]. However, it remains difficult because of illumination changes, severe occlusions, uneven crowd distributions, and camera perspective distortions.

Owing to the successful application of Convolutional Neural Networks (CNNs) in computer vision, including classification [2], [3], detection [4], [5], segmentation [6], [7], and person re-identification [8], [9], researchers have recently proposed many methods [10], [11], [12], [13], [14], [15], [16], [17], [18] that use CNNs to extract features from crowd images and generate density maps for crowd counting. Integrating a generated density map then yields the number of people in the image. These methods have made significant progress on the aforementioned problems that hinder crowd counting performance. However, two main challenges remain, as illustrated in Fig. 1. One is uneven distribution: within a single scene, the crowd is dense in one area and sparse in another. The other is camera perspective distortion: the scales of people vary from barely visible tiny points to clear body outlines.
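The idea that "integrating a density map yields the count" can be made concrete with a minimal NumPy sketch. Here each annotated head is replaced by a unit-mass Gaussian blob, so the sum over the map recovers the head count; the function names (`gaussian_kernel`, `density_map`) and the fixed kernel size/sigma are illustrative assumptions, not the paper's exact ground-truth procedure.

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    """Normalized 2-D Gaussian kernel whose entries sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def density_map(shape, head_points, size=15, sigma=4.0):
    """Place one unit-mass Gaussian blob per annotated head location."""
    dmap = np.zeros(shape)
    k = gaussian_kernel(size, sigma)
    r = size // 2
    for y, x in head_points:
        # clip the kernel at the image borders
        y0, y1 = max(0, y - r), min(shape[0], y + r + 1)
        x0, x1 = max(0, x - r), min(shape[1], x + r + 1)
        ky0, kx0 = y0 - (y - r), x0 - (x - r)
        dmap[y0:y1, x0:x1] += k[ky0:ky0 + (y1 - y0), kx0:kx0 + (x1 - x0)]
    return dmap

heads = [(30, 40), (32, 45), (70, 20)]   # toy annotations (row, col)
dmap = density_map((100, 100), heads)
print(round(float(dmap.sum()), 2))       # 3.0 — integration recovers the count
```

Because each blob integrates to one, the estimated count is simply `dmap.sum()`; overlapping blobs in dense regions add up correctly, which is why density-map regression handles heavy occlusion better than detection.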

Multi-column [11], [12], [19], multi-scale [13], [20], [21], [22], and other approaches [14], [15], [16] have been built on CNN-based frameworks to tackle these obstacles. Although these methods alleviate the problems, they fuse features only from certain layers and aggregate them at a specific stage of the network. Moreover, when fusing features from different hierarchies, they consider only multi-scale information for various crowd scales. In fact, hierarchical features carry more than scale information. According to Hou et al. [23], deep hierarchies extract high-level features rich in semantic information, while shallow hierarchies extract low-level features rich in spatial information. We therefore believe that each hierarchy contains a substantial amount of conducive information, including but not limited to multi-scale information, that can enhance crowd counting accuracy. Furthermore, although the crowd features extracted from different hierarchies differ, all of them can improve the feature representation ability and the performance of the model.

To fully extract information from each hierarchy and aggregate cross-hierarchy features from the preceding hierarchies, we propose a novel Cross-Hierarchy Aggregation Network (CHANet) that learns maximum information from hierarchical features without over-assigning parameters for feature reuse. Residual [24] and dense [25] connections are two different ways to aggregate features from different hierarchies: residual connections pass information directly to the following hierarchies and address the degradation problem, while dense connections directly concatenate features from different hierarchies and improve feature reuse. Both reuse the features of preceding hierarchies and allow maximum information flow. We therefore propose a combination of residual and dense connections, named CHA, that fully extracts abundant local features from the states of preceding hierarchies and captures maximum information from the crowd features without over-assigning parameters. Specifically, our method stacks several CHA modules, each containing a set of convolutional layers. Within each CHA module, local hierarchical aggregation concatenates the features from every hierarchy, and a 1 × 1 convolution adaptively aggregates the cross-hierarchy information to enhance the representation ability of the aggregated features. In addition, a local residual connection further aggregates the local hierarchical features with the input of the CHA module. By combining residual and dense connections, the CHA module preserves multiple kinds of advantageous information, including but not limited to scale information, from every hierarchy and ensures maximum information flow in the network. Furthermore, to fuse the global hierarchical features of the whole crowd image from the shallow hierarchies, we present a global residual connection that connects the previous local features generated by the CHA modules.

The major contributions of this work are as follows:

  • A novel CHANet is proposed for crowd counting to extract conducive crowd information in hierarchical features and aggregate cross-hierarchy features to generate more reasonable density maps.

  • We present a CHA module to ensure maximum information flow by combining residual and dense connections and also study the number of convolutional layers in each CHA module as well as the number of CHA modules.

  • Experimental evaluations on four crowd counting datasets demonstrate that the proposed CHANet outperforms the state-of-the-art methods.

We give a brief review of CNN-based methods for crowd counting in Section 2. Details of the proposed CHANet are illustrated in Section 3. In Section 4, extensive experimental results on four datasets evaluate the performance of our method. Finally, we conclude this work in Section 5.


Related work

Since CNNs have a more powerful ability to extract features than hand-crafted descriptors, researchers have recently focused on CNN-based methods that produce crowd density maps, from which the number of people is obtained by integrating the complete density map. In this section, we review representative CNN-based counting methods and counting methods based on feature aggregation.

Proposed method

In this section, we specifically demonstrate our proposed CHANet. We first give the overall architecture of the network. Then we show the detailed structure of the CHA module. Finally, we present the ground truth generation and loss function of our method.
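For the loss function mentioned above, density-map regression networks are commonly trained with a pixel-wise Euclidean loss between the estimated and ground-truth density maps. Whether CHANet uses exactly this form is specified only in the full text, so treat the sketch below as the standard baseline, not the paper's definitive choice; the `1/(2N)` scaling follows the usual convention in counting papers.

```python
import numpy as np

def euclidean_loss(pred, gt):
    """Pixel-wise Euclidean loss over a batch of N density maps:
    L = (1 / 2N) * sum_i ||pred_i - gt_i||_2^2
    (the loss commonly used for density-map regression)."""
    n = pred.shape[0]
    return float(np.sum((pred - gt) ** 2) / (2 * n))

# Toy batch: 2 maps of 4x4 pixels, prediction all zeros, ground truth all ones.
pred = np.zeros((2, 4, 4))
gt = np.ones((2, 4, 4))
print(euclidean_loss(pred, gt))  # 32 squared errors / (2 * 2) = 8.0
```

Minimizing this loss pushes the predicted map toward the ground-truth map pixel by pixel, which in turn drives the integrated count toward the true count.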

Experiments

In this section, we evaluate the performance of the proposed CHANet. First, we give the evaluation metrics used to measure counting performance. Implementation details are described in the second part. Next, we report the results of CHANet on four datasets and compare them with other state-of-the-art methods. Then, we study the values of C and H. Finally, cross-dataset experiments demonstrate the generalization ability and effectiveness of CHANet.
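Crowd counting papers conventionally report Mean Absolute Error (MAE) and a root-mean-squared error that the literature usually labels "MSE"; assuming the evaluation metrics referenced above are these standard ones, they can be computed as follows (the predicted and ground-truth counts here are made-up toy numbers).

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean Absolute Error over the test images."""
    p, g = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return float(np.mean(np.abs(p - g)))

def rmse(pred_counts, gt_counts):
    """Root Mean Squared Error (often reported as "MSE" in counting papers)."""
    p, g = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return float(np.sqrt(np.mean((p - g) ** 2)))

pred = [105.0, 48.0, 310.0]  # integrated density-map counts (toy values)
gt = [100.0, 50.0, 300.0]    # annotated ground-truth counts (toy values)
print(round(mae(pred, gt), 2))   # (5 + 2 + 10) / 3 ≈ 5.67
print(round(rmse(pred, gt), 2))  # sqrt((25 + 4 + 100) / 3) ≈ 6.56
```

MAE measures overall accuracy, while the squared-error metric penalizes large per-image failures more heavily and so reflects robustness across scenes.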

Conclusion

In this work, we propose a novel Cross-Hierarchy Aggregation Network (CHANet) for density estimation and crowd counting. The CHANet extracts conducive information in crowd features from each hierarchy and aggregates cross-hierarchy features to generate reasonable density maps. The Cross-Hierarchy Aggregation (CHA) module, which combines residual and dense connections without over-assigning parameters for feature reuse, is proposed to fully extract local hierarchical features and capture maximum information of the crowd features.

CRediT authorship contribution statement

Qiang Guo: Conceptualization, Investigation, Methodology, Visualization, Writing - original draft, Writing - review & editing. Xin Zeng: Investigation, Methodology, Validation, Writing - review & editing. Shizhe Hu: Conceptualization, Writing - review & editing. Sonephet Phoummixay: Investigation, Data curation. Yangdong Ye: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61772475) and the National Key R&D Program of China (No. 2018YFB1201403).

References (51)

  • K. He et al., Mask R-CNN

  • J. Miao et al., Pose-guided feature alignment for occluded person re-identification

  • D. Deb et al., An aggregated multicolumn dilated convolution network for perspective-free counting

  • Z. Cheng et al., Improving the learning of multi-column convolutional neural network for crowd counting

  • A. Zhang et al., Attentional neural fields for crowd counting

  • D. Guo et al., DADNet: Dilated-attention-deformable convnet for crowd counting

  • A. Zhang et al., Relational attention network for crowd counting

  • L. Liu et al., Crowd counting using deep recurrent spatial-aware network

  • L. Liu et al., Crowd counting with deep structured scale integration network

  • Y. Zhang et al., Single-image crowd counting via multi-column convolutional neural network

  • Z. Shen et al., Crowd counting via adversarial cross-scale consistency pursuit

  • X. Zeng et al., DSPNet: Deep scale purifier network for dense crowd counting, Expert Syst. Appl. (2020)

  • X. Ding et al., Crowd density estimation using fusion of multi-layer features, IEEE Trans. Intell. Transp. Syst. (2020)

  • Q. Hou et al., Deeply supervised salient object detection with short connections, IEEE Trans. Pattern Anal. Mach. Intell. (2020)

  • K. He et al., Deep residual learning for image recognition