A multi-context representation approach with multi-task learning for object counting

https://doi.org/10.1016/j.knosys.2020.105927

Abstract

Object counting is a fundamental yet challenging computer vision task, as it requires both object appearance information and a semantic understanding of the object. In this paper, we propose an end-to-end multi-context embedding deep network for object counting (MCENet), which approaches the counting task from three different perspectives, e.g., counting the vehicles in a traffic video frame or estimating the number of pedestrians in a highly congested scene. The first sub-network of MCENet extracts potential features for the appearance context and the semantic context from layers at different levels. These two different-level features are then transferred into two parallel and complementary sub-networks, which model the appearance context and the semantic context for the final counting; the multiple contexts are thus represented and embedded to assist the counting task. Extensive experimental evaluations on three different object counting benchmarks show that the proposed approach achieves competitive performance in all of these heterogeneous scenarios.

Introduction

Object counting aims to count the number of objects in a single image or video frame [1]. It is significant and essential for building high-level cognition for crowd monitoring, scene understanding and other computer vision tasks [2]. The object in this task can take many forms, including pedestrians or vehicles in surveillance videos [3], [4], cells in microscopic images [5], wildlife in field images [6], and even fish in the ocean [7]. With the rapid development and application of surveillance technology [8], [9], object counting, especially vehicle counting and crowd counting, has attracted much attention from both academia and industry [10], [11].

Existing object counting methods can be divided into three categories: detection-based counting [12], regression-based counting [13] and density estimation-based counting [7]. Because the density estimation-based approach can provide more effective visual cues for other related tasks and establishes a more reasonable mapping between the input image and the counting result, the majority of object counting methods adopt it [14]. Early work on object counting mainly relied on handcrafted features, while recent work has made remarkable progress owing to the strong feature extraction ability of the convolutional neural network (CNN) [15]; hence recent methods are mostly CNN-based. Learning to count the number of objects in a given scene image or video frame is difficult due to many challenging factors, including severe occlusion of objects, large variation in scale, non-uniform crowd density, and the varied appearance of the objects [13]. The negative effect of occlusion on the counting task can be reduced by adopting a powerful CNN. To address the scale-variation problem, some existing methods design the counting network with multi-scale analysis [16], [17], [18], [19], e.g., using different convolutional kernel sizes, different network depths, or training classifiers to select the convolutional kernel. However, these methods only feed the original image into the multi-scale processing part, are not end-to-end trainable, and do not consider the context in the image. Although some existing methods add context information from the images, they only model single-scale semantic context. For the non-uniform crowd density and the varied appearance of objects, some existing methods extract the related context from the original image to estimate object density maps, but they only model semantic contexts and do not address the scale-variation problem.
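As background for the density estimation paradigm discussed above, the following minimal Python sketch illustrates how a ground-truth density map is commonly built from point annotations and how the count is recovered as the integral (sum) of the map. It is an illustrative example of the general formulation, not the authors' implementation; the function name, the Gaussian bandwidth `sigma`, and the toy annotations are assumptions made purely for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(points, height, width, sigma=4.0):
    """Build a ground-truth density map from point annotations.

    Each annotated object contributes a unit-mass Gaussian, so the
    integral (sum) of the map equals the number of annotated objects.
    """
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        density[row, col] += 1.0
    # Smoothing with a (mass-preserving) Gaussian spreads each unit
    # of mass over the object's neighborhood.
    return gaussian_filter(density, sigma=sigma)

# Toy example: three annotated objects in a 64x64 frame.
annotations = [(10.5, 20.0), (30.0, 30.0), (50.2, 12.7)]
gt_density = make_density_map(annotations, height=64, width=64)
print(f"count recovered from the map = {gt_density.sum():.2f}")  # ~3.00
```

A counting network trained in this paradigm regresses such a density map from the input image, and the predicted count is simply the sum of the predicted map.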

In this paper, we introduce a deep architecture that explicitly extracts appearance context and semantic context to learn a multi-context representation for the object counting task. To keep the appearance context and the semantic context distinct, and according to their respective characteristics, the potential features for the two contexts are extracted from different levels of the global context network, which is designed to classify the object density degree. The potential features for the two contexts are then transferred into the corresponding context modeling branches.

The contribution of this paper can be summarized as follows: we propose a multi-context representation approach for object counting, that is, using visual information from different levels of a CNN to model the appearance context and the semantic context for the final counting task. The potential information for the appearance context is obtained from a mid-level (shallow) layer and processed by the designed appearance context modeling sub-network, yielding the appearance context. The potential information for the semantic context is obtained from a high-level (deep) layer and processed by the designed multi-scale semantic modeling network, yielding the multi-scale semantic context. The modeled appearance context and semantic context are then combined for the final object counting. The proposed method provides a novel perspective on object counting and achieves competitive results on multiple public object counting datasets.
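To make the two-branch idea concrete, the following PyTorch sketch shows one possible realization of a counter that takes appearance context from a mid-level feature map and multi-scale semantic context from a deeper feature map, then fuses both into a density map. The backbone, branch depths, channel widths, kernel sizes, and class name are assumptions for illustration; this is a minimal sketch of the multi-context idea, not the authors' exact MCENet configuration.

```python
import torch
import torch.nn as nn

class MultiContextCounter(nn.Module):
    """Illustrative two-branch counter (not the original MCENet):
    appearance context from mid-level features, multi-scale semantic
    context from high-level features, fused into one density map."""

    def __init__(self):
        super().__init__()
        # Shared front-end producing mid-level and high-level features.
        self.shallow = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.deep = nn.Sequential(
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Appearance-context branch on the mid-level features.
        self.appearance = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Multi-scale semantic branch on the high-level features:
        # parallel convolutions with different kernel sizes.
        self.semantic = nn.ModuleList([
            nn.Conv2d(256, 32, k, padding=k // 2) for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(64 + 3 * 32, 1, 1)  # 1x1 conv -> density map
        self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=False)

    def forward(self, x):
        mid = self.shallow(x)    # mid-level (appearance) features
        high = self.deep(mid)    # high-level (semantic) features
        app = self.appearance(mid)
        sem = torch.cat([self.up(torch.relu(conv(high)))
                         for conv in self.semantic], dim=1)
        density = self.fuse(torch.cat([app, sem], dim=1))
        return density           # predicted count = density.sum()

# Usage: the count is the integral of the predicted density map.
model = MultiContextCounter()
frame = torch.randn(1, 3, 256, 256)
print(model(frame).sum().item())
```

The full MCENet described in the paper additionally employs a global context network that classifies the object density degree and is trained in a multi-task fashion; the sketch above only conveys the two-branch, multi-context representation idea.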

Section snippets

Related work

In this section, we review recent developments in the object counting task. Since recent convolutional neural network-based models have achieved significant improvements on object counting, many research works explore the network structure to address object counting efficiently.

Multi-context representation approach

For a human being, finishing the object counting task generally requires obtaining visual information about the appearance of the objects and, simultaneously, processing high-level semantic information for a global understanding of the whole scene. Meanwhile, due to the hierarchical working principle of CNNs, the features generated from the shallow to the deep layers typically range from low-level to high-level. Motivated by this, we attempt to establish an object counting model which could

Experiments

Considering that the current public object counting datasets mainly focus on the crowd and vehicle counting tasks, we conduct performance evaluations on the publicly available TRANCOS [4] dataset for vehicle counting and the typical Mall [3] and Shanghaitech_A [17] datasets for crowd counting. In this section, we first conduct performance comparisons with existing typical methods on these public datasets, and report the results of the ablation study for the proposed MCENet structure. The

Conclusion

In this paper, we propose a multi-context representation approach for object counting, especially for the vehicle counting and crowd counting tasks. The proposed approach integrates high-level semantic information and mid-level visual information to provide multiple contexts, including multi-scale semantic context and appearance context, for the final counting task. For the object counting task, the proposed approach offers a new insight that resembles the mode of human thinking. However,

CRediT authorship contribution statement

Weihang Kong: Funding acquisition, Conceptualization, Investigation, Methodology, Writing - original draft, Writing - review & editing. He Li: Data curation, Investigation, Methodology, Writing - original draft, Writing - review & editing, Project administration. Xi Zhang: Writing - original draft, Formal analysis, Resources, Validation. Gongda Zhao: Formal analysis, Software, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant 2017ZX05019001-011, the Natural Science Foundation of Hebei province in China under Grant No. F2019203526, the Project funded by China Postdoctoral Science Foundation under Grant 2018M631763, the Yanshan University, China Doctoral Foundation under Grant BL18010, and the Science and Technology Research & Development Program of Qinhuangdao City, China (No.

References (33)

  • Sindagi, V.A., et al., A survey of recent advances in CNN-based single image crowd counting and density estimation, Pattern Recognit. Lett. (2018)
  • Zhang, S.H., et al., An object counting network based on hierarchical context and feature fusion, J. Vis. Commun. Image Represent. (2019)
  • Wang, Y., et al., Fast visual object counting via example-based density estimation
  • Liu, N., et al., Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding
  • Chen, K., et al., Feature mining for localised crowd counting
  • Guerrero-Gómez, R.O., et al., Extremely overlapping vehicle counting
  • Arteta, C., et al., Counting in the wild
  • Spampinato, C., Chen-Burger, Y.H., Nadarajan, G., Fisher, R.B., Detecting, tracking and counting fish in low quality...
  • Lempitsky, V., Zisserman, A., Learning to count objects in images, in: 24th Annual Conference on Neural Information...
  • Chan, A.B., et al., Privacy preserving crowd monitoring: Counting people without people models or tracking
  • Dai, Z., et al., Video-based vehicle counting framework, IEEE Access (2019)
  • Grant, J.M., et al., Crowd scene understanding from video: A survey, ACM Trans. Multimed. Comput. Commun. Appl. (2017)
  • Denman, S., et al., Scene invariant virtual gates using DNNs, IEEE Trans. Circuits Syst. Video Technol. (2019)
  • Yang, B., et al., Cross-scene counting based on domain adaptation-extreme learning machine, IEEE Access (2018)
  • Idrees, H., et al., Multi-source multi-scale counting in extremely dense crowd images
  • Loy, C.C., et al., Crowd counting and profiling: Methodology and evaluation
