A multi-context representation approach with multi-task learning for object counting

https://doi.org/10.1016/j.knosys.2020.105927

Abstract

Object counting is a fundamental yet challenging computer vision task, as it requires both object appearance information and a semantic understanding of the object. In this paper, we propose an end-to-end multi-context embedding deep network for object counting (MCENet), which approaches the counting task from three different perspectives, e.g., counting the vehicles in a traffic video frame or estimating the number of pedestrians in a highly congested scene. The first sub-network of MCENet extracts potential features for the appearance context and the semantic context from layers at different levels. These two different-level features are then transferred into two parallel and complementary sub-networks, which model the appearance context and the semantic context for the final counting; the multiple contexts are thus represented and embedded to assist the counting task. Extensive experimental evaluations on three different object counting benchmarks show that the proposed approach achieves competitive performance in all of these heterogeneous scenarios.

Introduction

Object counting aims to count the number of objects in a single image or video frame [1]. It is significant and essential for building high-level cognition for crowd monitoring, scene understanding and other computer vision tasks [2]. The object in this task can take many forms, including pedestrians or vehicles in surveillance videos [3], [4], cells in microscopic images [5], wildlife in field images [6], and even fish in the ocean [7]. With the rapid development and application of surveillance technology [8], [9], object counting, especially vehicle counting and crowd counting, has attracted much attention from both academia and industry [10], [11].

Existing object counting methods can be divided into three categories: detection-based counting [12], regression-based counting [13] and density estimation-based counting [7]. Because the density estimation-based approach can provide more effective visual cues for other related tasks and establishes a more reasonable mapping between the input image and the counting result, the majority of object counting methods adopt it [14]. Early work on object counting mainly relied on handcrafted features, while recent work has made remarkable progress owing to the strong feature extraction ability of the convolutional neural network (CNN) [15]; hence recent methods are mostly CNN-based. Learning to count the number of objects in a given scene image or video frame is difficult due to many challenging factors, including severe occlusion of objects, large variation in scale, non-uniform crowd density, and the varied appearance of the objects [13]. The negative effect of occlusion on the counting task can be reduced by adopting a powerful CNN. To address the scale-variation problem, some existing methods design the counting network with multi-scale analysis [16], [17], [18], [19], e.g., using different convolutional kernel sizes, different network depths, or training classifiers to select the convolutional kernel. However, these methods only feed the original image into the multi-scale processing part, are not end-to-end trainable, and do not consider the context in the image. Although some existing methods add context information from the images, they only model single-scale semantic context. For the non-uniform crowd density and the varied appearance of objects, some existing methods extract the related context from the original image to estimate object density maps, but they only model semantic contexts and do not address the scale-variation problem.
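As background for the density estimation paradigm discussed above, the following minimal Python sketch illustrates how a ground-truth density map is commonly built from point annotations and how the count is recovered as the integral (sum) of the map. It is an illustrative example of the general formulation, not the authors' implementation; the function name, the Gaussian bandwidth `sigma`, and the toy annotations are assumptions made purely for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_density_map(points, height, width, sigma=4.0):
    """Build a ground-truth density map from point annotations.

    Each annotated object contributes a unit-mass Gaussian, so the
    integral (sum) of the map equals the number of annotated objects.
    """
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        density[row, col] += 1.0
    # Smoothing with a (mass-preserving) Gaussian spreads each unit
    # of mass over the object's neighborhood.
    return gaussian_filter(density, sigma=sigma)

# Toy example: three annotated objects in a 64x64 frame.
annotations = [(10.5, 20.0), (30.0, 30.0), (50.2, 12.7)]
gt_density = make_density_map(annotations, height=64, width=64)
print(f"count recovered from the map = {gt_density.sum():.2f}")  # ~3.00
```

A counting network trained in this paradigm regresses such a density map from the input image, and the predicted count is simply the sum of the predicted map.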

In this paper, we introduce a deep architecture that explicitly extracts appearance context and semantic context to learn a multi-context representation for the object counting task. To keep the appearance context and the semantic context distinct, and according to their respective characteristics, the potential features for the two contexts are extracted from different levels of the global context network, which is designed to classify the object density degree. The potential features for the two contexts are then transferred into the corresponding context modeling branches.

The contribution of this paper can be summarized as follows: we propose a multi-context representation approach for object counting, that is, using visual information from different levels of a CNN to model the appearance context and the semantic context for the final counting task. The potential information for the appearance context is obtained from a mid-level (shallow) layer and processed by the designed appearance context modeling sub-network, yielding the appearance context. The potential information for the semantic context is obtained from a high-level (deep) layer and processed by the designed multi-scale semantic modeling network, yielding the multi-scale semantic context. The modeled appearance context and semantic context are then combined for the final object counting. The proposed method provides a novel perspective on object counting and achieves competitive results on multiple public object counting datasets.
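To make the two-branch idea concrete, the following PyTorch sketch shows one possible realization of a counter that takes appearance context from a mid-level feature map and multi-scale semantic context from a deeper feature map, then fuses both into a density map. The backbone, branch depths, channel widths, kernel sizes, and class name are assumptions for illustration; this is a minimal sketch of the multi-context idea, not the authors' exact MCENet configuration.

```python
import torch
import torch.nn as nn

class MultiContextCounter(nn.Module):
    """Illustrative two-branch counter (not the original MCENet):
    appearance context from mid-level features, multi-scale semantic
    context from high-level features, fused into one density map."""

    def __init__(self):
        super().__init__()
        # Shared front-end producing mid-level and high-level features.
        self.shallow = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.deep = nn.Sequential(
            nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Appearance-context branch on the mid-level features.
        self.appearance = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Multi-scale semantic branch on the high-level features:
        # parallel convolutions with different kernel sizes.
        self.semantic = nn.ModuleList([
            nn.Conv2d(256, 32, k, padding=k // 2) for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(64 + 3 * 32, 1, 1)  # 1x1 conv -> density map
        self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=False)

    def forward(self, x):
        mid = self.shallow(x)    # mid-level (appearance) features
        high = self.deep(mid)    # high-level (semantic) features
        app = self.appearance(mid)
        sem = torch.cat([self.up(torch.relu(conv(high)))
                         for conv in self.semantic], dim=1)
        density = self.fuse(torch.cat([app, sem], dim=1))
        return density           # predicted count = density.sum()

# Usage: the count is the integral of the predicted density map.
model = MultiContextCounter()
frame = torch.randn(1, 3, 256, 256)
print(model(frame).sum().item())
```

The full MCENet described in the paper additionally employs a global context network that classifies the object density degree and is trained in a multi-task fashion; the sketch above only conveys the two-branch, multi-context representation idea.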

Section snippets

Related work

In this section, we review recent developments in the object counting task. Since recent convolutional neural network-based models have achieved significant improvements on object counting, many research works explore the network structure to address object counting efficiently.

Multi-context representation approach

For a human being, finishing the object counting task generally requires obtaining visual information about the appearance of the objects and, simultaneously, processing high-level semantic information for a global understanding of the whole scene. Meanwhile, due to the hierarchical working principle of CNNs, the features generated from the shallow to the deep layers typically range from low-level to high-level. Motivated by this, we attempt to establish an object counting model which could

Experiments

Considering that the current public object counting datasets mainly focus on the crowd and vehicle counting tasks, we conduct performance evaluations on the publicly available TRANCOS [4] dataset for vehicle counting and the typical Mall [3] and Shanghaitech_A [17] datasets for crowd counting. In this section, we first conduct performance comparisons with existing typical methods on these public datasets, and report the results of the ablation study for the proposed MCENet structure. The

Conclusion

In this paper, we propose a multi-context representation approach for object counting, especially for the vehicle counting and crowd counting tasks. The proposed approach integrates high-level semantic information and mid-level visual information to provide multiple contexts, including multi-scale semantic context and appearance context, for the final counting task. For the object counting task, the proposed approach offers a new insight that resembles the mode of human thinking. However,

CRediT authorship contribution statement

Weihang Kong: Funding acquisition, Conceptualization, Investigation, Methodology, Writing - original draft, Writing - review & editing. He Li: Data curation, Investigation, Methodology, Writing - original draft, Writing - review & editing, Project administration. Xi Zhang: Writing - original draft, Formal analysis, Resources, Validation. Gongda Zhao: Formal analysis, Software, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under Grant 2017ZX05019001-011, the Natural Science Foundation of Hebei province in China under Grant No. F2019203526, the Project funded by China Postdoctoral Science Foundation under Grant 2018M631763, the Yanshan University, China Doctoral Foundation under Grant BL18010, and the Science and Technology Research & Development Program of Qinhuangdao City, China (No.

References (33)

  • Sindagi, V.A., et al., A survey of recent advances in CNN-based single image crowd counting and density estimation, Pattern Recognit. Lett. (2018)
  • Zhang, S.H., et al., An object counting network based on hierarchical context and feature fusion, J. Vis. Commun. Image Represent. (2019)
  • Wang, Y., et al., Fast visual object counting via example-based density estimation
  • Liu, N., et al., Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding
  • Chen, K., et al., Feature mining for localised crowd counting
  • Guerrero-Gómez, R.O., et al., Extremely overlapping vehicle counting
  • Arteta, C., et al., Counting in the wild
  • Spampinato, C., Chen-Burger, Y.H., Nadarajan, G., Fisher, R.B., Detecting, tracking and counting fish in low quality...
  • Lempitsky, V., Zisserman, A., Learning to count objects in images, in: 24th Annual Conference on Neural Information...
  • Chan, A.B., et al., Privacy preserving crowd monitoring: Counting people without people models or tracking
  • Dai, Z., et al., Video-based vehicle counting framework, IEEE Access (2019)
  • Grant, J.M., et al., Crowd scene understanding from video: A survey, ACM Trans. Multimed. Comput. Commun. Appl. (2017)
  • Denman, S., et al., Scene invariant virtual gates using DNNs, IEEE Trans. Circuits Syst. Video Technol. (2019)
  • Yang, B., et al., Cross-scene counting based on domain adaptation-extreme learning machine, IEEE Access (2018)
  • Idrees, H., et al., Multi-source multi-scale counting in extremely dense crowd images
  • Loy, C.C., et al., Crowd counting and profiling: Methodology and evaluation
