A multi-context representation approach with multi-task learning for object counting
Introduction
Object counting aims to estimate the number of objects in a single image or video frame [1]. It is an essential building block for high-level tasks such as crowd monitoring, scene understanding and other computer vision tasks [2]. The objects to be counted vary widely: pedestrians or vehicles in surveillance videos [3], [4], cells in microscopic images [5], wildlife in field images [6], and even fish in the ocean [7]. With the rapid development and application of surveillance technology [8], [9], object counting, especially vehicle counting and crowd counting, has attracted much attention from both academia and industry [10], [11].
Existing object counting methods fall into three categories: detection-based counting [12], regression-based counting [13] and density estimation-based counting [7]. Because the density estimation-based approach provides more useful visual cues for related tasks and establishes a more reasonable mapping between the input image and the counting result, the majority of object counting methods adopt it [14]. Early work on object counting relied mainly on handcrafted features, whereas recent work has made remarkable progress thanks to the strong feature extraction ability of convolutional neural networks (CNNs) [15], so recent methods are mainly CNN-based. Learning to count the objects in a given scene image or video frame remains difficult because of many challenging factors, including severe occlusion, large variation in scale, non-uniform crowd density, and the varied appearance of the objects [13]. The negative effect of occlusion can be reduced by adopting a powerful CNN. To handle scale variation, some existing methods design the counting network around multi-scale analysis [16], [17], [18], [19], for example by using different convolutional kernel sizes, using different network depths, or training a classifier to select the convolutional kernel. However, these methods use only the original image as the input to the multi-scale processing part, are not end-to-end trainable, and do not consider the context in the image. Other methods do add context information from the image, but they model only single-scale semantic context.
To handle non-uniform crowd density and varied object appearance, some existing methods extract related context from the original image to estimate the object density map. However, these methods model only semantic context and do not address the scale-variation problem.
In this paper, we introduce a deep architecture that explicitly extracts appearance context and semantic context to learn a multi-context representation for the object counting task. To keep the two contexts distinct, and in line with their respective characteristics, the potential features for each context are extracted from different levels of a global context network, which is designed to classify the object density level. These features are then passed to the corresponding context modeling branches.
The contribution of this paper can be summarized as follows. We propose a multi-context representation approach for object counting that uses visual information from different levels of a CNN to model appearance context and semantic context for the final counting task. The potential features for the appearance context are taken from a mid-level layer (shallow layer) and processed by the designed appearance context modeling sub-network to obtain the appearance context. The potential features for the semantic context are taken from a high-level layer (deep layer) and processed by the designed multi-scale semantic modeling network to obtain the multi-scale semantic context. The modeled appearance and semantic contexts are then combined for the final object counting. The proposed method offers a new perspective on object counting and achieves competitive results on multiple public object counting datasets.
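The density estimation-based formulation discussed above can be illustrated with a minimal sketch: each annotated object centre contributes a small Gaussian of unit mass to a density map, and the count is recovered by summing the map. The function names, the kernel radius and sigma, and the toy annotations are all assumptions made for illustration; this is not the ground-truth generation procedure of any particular method cited here.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return a (2*radius+1) x (2*radius+1) Gaussian kernel that sums to 1."""
    size = 2 * radius + 1
    kernel = [[math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]

def density_map(points, height, width, radius=3, sigma=1.5):
    """Add one unit-mass Gaussian per annotated centre (row, col).

    Mass falling outside the image is dropped, so objects near the
    border contribute slightly less than 1 in this simple version.
    """
    kernel = gaussian_kernel(radius, sigma)
    dmap = [[0.0] * width for _ in range(height)]
    for r, c in points:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = r + dy, c + dx
                if 0 <= y < height and 0 <= x < width:
                    dmap[y][x] += kernel[dy + radius][dx + radius]
    return dmap

def count_from_density(dmap):
    """The estimated count is the sum (discrete integral) of the density map."""
    return sum(map(sum, dmap))

# Hypothetical annotations, all far enough from the border to keep full mass.
points = [(5, 5), (10, 12), (14, 7)]
dmap = density_map(points, height=20, width=20)
estimated_count = count_from_density(dmap)
```

In a learned counter, a network predicts a map like `dmap` from the image, and the same summation gives the predicted count.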
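The introduction implies two supervision signals: a density-level classification for the global context network, and a density map for the final count. The excerpt does not state the training objective, so the sketch below shows only one common way to combine such signals in multi-task learning: a weighted sum of a pixel-wise regression loss and a classification loss. The loss choices, the `cls_weight` value, and all function names are assumptions, not the authors' formulation.

```python
import math

def mse(pred, target):
    """Mean squared error between two equally sized 2-D maps."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)) / n

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def multitask_loss(pred_density, gt_density, level_logits, gt_level, cls_weight=0.1):
    """Density regression loss plus weighted density-level classification loss.

    cls_weight is a hypothetical balancing hyperparameter.
    """
    return mse(pred_density, gt_density) + cls_weight * cross_entropy(level_logits, gt_level)

# Toy values: a 2x2 predicted density map and three hypothetical density levels.
pred = [[0.0, 0.5], [0.25, 0.25]]
gt = [[0.0, 0.5], [0.5, 0.0]]
loss = multitask_loss(pred, gt, level_logits=[0.0, 0.0, 0.0], gt_level=1)
```

With uniform logits over three levels, the classification term equals `log(3)`, which makes the toy value easy to check by hand.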
Section snippets
Related work
In this section, we review recent developments in the object counting task. Since recent convolutional neural network-based models have achieved significant improvements in object counting, many research works explore network structures to address object counting efficiently.
Multi-context representation approach
For a human being, completing an object counting task generally requires obtaining visual information about the appearance of the objects while simultaneously processing high-level semantic information for a global understanding of the whole scene. Meanwhile, owing to the hierarchical working principle of CNNs, the features generated from shallow to deep layers generally progress from low-level to high-level. Motivated by this, we attempt to establish an object counting model which could
Experiments
Because the current public object counting datasets mainly focus on crowd and vehicle counting tasks, we conduct performance evaluations on the publicly available TRANCOS [4] dataset for vehicle counting and the typical Mall [3] and Shanghaitech_A [17] datasets for crowd counting. In this section, we first compare performance with typical existing methods on these public datasets, and then report the results of the ablation study for the proposed MCENet structure. The
Conclusion
In this paper, we propose a multi-context representation approach for object counting, especially for vehicle counting and crowd counting tasks. The proposed approach integrates high-level semantic information and mid-level visual information to provide multiple contexts, including multi-scale semantic context and appearance context, for the final counting task. For the object counting task, the proposed approach provides a new insight similar to the mode of human thinking. However,
CRediT authorship contribution statement
Weihang Kong: Funding acquisition, Conceptualization, Investigation, Methodology, Writing - original draft, Writing - review & editing. He Li: Data curation, Investigation, Methodology, Writing - original draft, Writing - review & editing, Project administration. Xi Zhang: Writing - original draft, Formal analysis, Resources, Validation. Gongda Zhao: Formal analysis, Software, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant 2017ZX05019001-011, the Natural Science Foundation of Hebei Province in China under Grant No. F2019203526, the Project funded by China Postdoctoral Science Foundation under Grant 2018M631763, the Yanshan University, China Doctoral Foundation under Grant BL18010, and the Science and Technology Research & Development Program of Qinhuangdao City, China (No.
References (33)
- et al., A survey of recent advances in CNN-based single image crowd counting and density estimation, Pattern Recognit. Lett. (2018)
- et al., An object counting network based on hierarchical context and feature fusion, J. Vis. Commun. Image Represent. (2019)
- et al., Fast visual object counting via example-based density estimation
- et al., Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding
- et al., Feature mining for localised crowd counting
- et al., Extremely overlapping vehicle counting
- et al., Counting in the wild
- C. Spampinato, Y.H. Chen-Burger, G. Nadarajan, R.B. Fisher, Detecting, tracking and counting fish in low quality...
- V. Lempitsky, A. Zisserman, Learning to count objects in images, in: 24th Annual Conference on Neural Information...
- et al., Privacy preserving crowd monitoring: Counting people without people models or tracking
- Video-based vehicle counting framework, IEEE Access
- Crowd scene understanding from video: A survey, ACM Trans. Multimed. Comput. Commun. Appl.
- Scene invariant virtual gates using DNNs, IEEE Trans. Circuits Syst. Video Technol.
- Cross-scene counting based on domain adaptation-extreme learning machine, IEEE Access
- Multi-source multi-scale counting in extremely dense crowd images
- Crowd counting and profiling: Methodology and evaluation