Dense convolutional feature histograms for robust visual object tracking

https://doi.org/10.1016/j.imavis.2020.103933

Highlights

  • Novel visual object tracking algorithm based on extracting histograms of convolutional features

  • Proposed architecture is better suited to the tracking task, facilitating the training process

  • Extensive experimental validation in multiple tracking benchmarks

  • Entirely offline training, real-time speeds even on embedded systems

Abstract

Despite recent breakthroughs in the field, Visual Object Tracking remains an open and challenging task in Computer Vision. Modern applications require trackers to not only be accurate but also very fast, even on embedded systems. In this work, we use features from Convolutional Neural Networks to build histograms, which are more adept at handling appearance variations, in an end-to-end trainable architecture. To deal with the internal covariate shift that occurs when extracting histograms from convolutional features, as well as to incorporate information from multiple levels of the neural hierarchy, we propose and use a novel densely connected architecture where histograms from multiple layers are concatenated to produce the final representation. Experimental results validate our hypotheses on the benefits of using histograms as opposed to standard convolutional features, as the proposed histogram-based tracker surpasses recently proposed sophisticated trackers on multiple benchmarks. Long-term tracking results also reaffirm the usefulness of the proposed tracker in more challenging scenarios, where appearance variations are more severe and traditional trackers fail.

Introduction

Visual Object Tracking is the task of locating an arbitrary target, defined by a Region of Interest (ROI), in a video sequence. Various applications such as security and surveillance, cinematography, robotics, augmented reality, and even entertainment [1], [2], all rely on tracking as a first step. Despite recent breakthroughs in the field, owing mostly to the success of Convolutional Neural Networks (CNNs), it remains a very challenging task in the field of Computer Vision.

Typically, a tracker generates a model of the target ROI and uses this model to search for the target in subsequent frames. Challenges arise first from the target model, which should be discriminative enough for the tracker to avoid drifting to similar objects, while remaining adaptive enough to the target's inter-frame appearance changes. Furthermore, a tracker must be equipped to deal with occlusions, fast target movements and out-of-view scenarios. Another important aspect is the speed of the tracker, i.e., the time it takes to locate the target in each frame. Real-time requirements impose hard constraints on the tracker's per-frame processing time.

Recently, CNNs have excelled in other Computer Vision tasks, such as image classification, object detection and semantic segmentation [3]. Their success can be attributed to the semantically meaningful representations they extract from visual information. Hence, CNNs have begun to be used in tracking tasks as well. In general, most recently released CNN-based trackers work by extracting a convolutional feature-based model of the target as well as of each subsequent frame, and cross-correlating the target model with the frame model to locate the target. They rely either on online optimization of the network parameters or on offline learning to produce discriminative features while maintaining high tracking speeds on GPUs.
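To make the cross-correlation step above concrete, the following minimal PyTorch sketch (our illustration, not the authors' implementation; the function name and tensor shapes are assumed) correlates a target feature map with a larger search-region feature map and returns the position of the response peak:

import torch
import torch.nn.functional as F

def locate_target(target_feat, search_feat):
    """Cross-correlate target features with search-region features and
    return the location of the response peak.

    target_feat: (C, h, w) tensor -- features of the target ROI (the target model)
    search_feat: (C, H, W) tensor -- features of the current frame's search region
    """
    # Use the target features as a correlation kernel over the search features.
    response = F.conv2d(search_feat.unsqueeze(0),    # (1, C, H, W)
                        target_feat.unsqueeze(0))    # (1, C, h, w) -> single response map
    response = response.squeeze()                    # (H - h + 1, W - w + 1)
    # The peak of the response map is the most likely target location.
    peak = int(torch.argmax(response))
    row, col = divmod(peak, response.shape[1])
    return row, col, response

Siamese-style trackers follow essentially this scheme, with both feature maps produced by the same offline-trained network.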

To address the challenges imposed by modern applications, such as those on embedded machines and robots [2], a tracking algorithm must push the speed-accuracy trade-off to its limit. For CNN-based trackers, this means using modern neural architectures while keeping the number of layers relatively small. Learning a discriminative model is then a matter of training the network well offline, as well as updating the network and the model online. The abundance of annotated video datasets, such as ImageNet VID [4] or TrackingNet [5], facilitates the training process of such trackers.

Before convolutional features, histogram-based descriptors were widely used, such as Histogram of Oriented Gradients (HOG) descriptors in the Discriminative Correlation Filter (DCF) paradigm [6], or color histograms in the Mean Shift (MS) algorithm [7]. Depending on the encoded feature type, histograms can be more robust to appearance changes while being computationally efficient. Motivated by the success of histogram-based trackers, as well as by the powerful representations extracted by deep CNNs, in this work we combine the two in a single neural architecture, where histograms are extracted from deep convolutional features, exploiting the best attributes of each.

The main contribution of this work is a novel visual object tracking algorithm based on extracting histograms of convolutional features in a single, end-to-end learnable architecture. Using histograms, the proposed tracker becomes robust to the appearance variations of the target, while remaining discriminative due to the use of convolutional features, inspired by recent works coupling encoding methods with CNNs [23], [24]. We furthermore introduce an architecture which is better suited to the tracking task and facilitates the training process of neural networks using histograms. The effectiveness of the histogram-based representations is validated experimentally in various tracking scenarios. Our tracker is trained entirely offline, thus remaining very fast during deployment and capable of running in real time even on embedded systems.
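As a rough illustration of what "histograms of convolutional features" means here (a simplified sketch under our own assumptions, not the exact LBoF formulation used in the paper), each spatial feature vector of a convolutional feature map can be soft-assigned to a set of learned codewords, and the assignments averaged over all spatial locations to form a histogram:

import torch
import torch.nn.functional as F

def soft_histogram(feat, codewords, sigma=1.0):
    """Soft-quantize a convolutional feature map into a histogram.

    feat:      (C, H, W) feature map from some convolutional layer
    codewords: (K, C)    learned codebook (the histogram's bin centers)
    returns:   (K,)      histogram over the K codewords
    """
    C, H, W = feat.shape
    x = feat.permute(1, 2, 0).reshape(-1, C)   # (H*W, C): one feature vector per location
    d = torch.cdist(x, codewords)              # (H*W, K): distance to every codeword
    a = F.softmax(-d / sigma, dim=1)           # soft membership of each location in each bin
    return a.mean(dim=0)                       # average over locations -> histogram (sums to 1)

Because the codewords are ordinary parameters, the whole operation is differentiable and can be trained end-to-end together with the convolutional layers, which is the property the proposed architecture exploits.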

The rest of this paper is organized as follows. Section 2 is a summary of related work on visual object tracking. Section 3 introduces and analyzes in depth the proposed tracker and methodology. The experimental results, which validate our hypotheses, are presented in Section 4, and finally Section 5 summarizes our findings.

Section snippets

Related work

The following subsections offer an overview of related tracking methods, both CNN-based and histogram-based, and highlight the proposed method's contribution.

Proposed method

The following subsections describe the proposed method in detail, starting from an overview of histogram-based tracking, a description of histogram extraction from convolutional features, and finally building up to the proposed convolutional feature histogram tracker.

Network and training details

Following [9], we use an AlexNet-like base architecture for our network, keeping four of its convolutional layers and adding LBoF modules [25], [27]. The network architecture is illustrated in Table 1. Each conv_block consists of a convolutional layer, batch normalization and a ReLU activation function. The lbof layers are built as convolutional layers, followed by an absolute value activation function, l1 normalization and an average pooling layer. In our experiments, we use convolutional layers
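Read literally, the two building blocks described above could be sketched in PyTorch as follows (a hedged sketch only: the kernel sizes, channel counts, pooling size and the use of a 1x1 convolution for the lbof layer are our assumptions, not the values listed in Table 1):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """conv_block as described: convolution -> batch normalization -> ReLU."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class LBoFBlock(nn.Module):
    """lbof layer as described: convolution (codeword similarities) ->
    absolute value -> l1 normalization -> average pooling."""
    def __init__(self, in_ch, n_codewords, pool_size=2):
        super().__init__()
        self.codewords = nn.Conv2d(in_ch, n_codewords, kernel_size=1)
        self.pool = nn.AvgPool2d(pool_size)

    def forward(self, x):
        u = torch.abs(self.codewords(x))                 # similarity of each location to each codeword
        u = u / (u.sum(dim=1, keepdim=True) + 1e-8)      # l1 normalization across codewords
        return self.pool(u)                              # locally pooled histograms

In the densely connected variant described in the abstract, the histograms produced by lbof modules attached to different conv_blocks would be concatenated to form the final representation.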

Conclusions & future work

Inspired by the success of histogram-based trackers such as Mean Shift, we have presented a novel tracker based on extracting histograms of convolutional features in an end-to-end learnable architecture. We have proposed an architecture which facilitates the training of Bag-of-Features modules inside a convolutional neural network, alleviating the problem of covariate shift and allowing earlier layers to be trained more efficiently. Tracking-oriented features are learned and histograms are

CRediT authorship contribution statement

Paraskevi Nousi: Investigation, Software, Validation, Formal analysis, Conceptualization, Writing - original draft, Visualization. Anastasios Tefas: Methodology, Conceptualization, Resources, Writing - review & editing, Supervision, Project administration. Ioannis Pitas: Resources, Conceptualization, Writing - review & editing, Supervision.

Declaration of Competing Interest

No potential conflict of interest was reported by the authors.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 731667 (MULTIDRONE). This publication reflects the authors' views only. The European Commission is not responsible for any use that may be made of the information it contains.

References (40)

  • S. Kamate et al.

    Application of object detection and tracking techniques for unmanned aerial vehicles

    Procedia Computer Science

    (2015)
  • J. Schmidhuber

    Deep learning in neural networks: An overview

    Neural Networks

    (2015)
  • T. Vojir et al.

    Robust scale-adaptive mean-shift for tracking

    Pattern Recognition Letters

    (2014)
  • N. Passalis et al.

    Neural bag-of-features learning

    Pattern Recognition

    (2017)
  • B. Deori et al.

    A survey on moving object tracking in video

    International Journal on Information Theory (IJIT)

    (2014)
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image...
  • M. Muller et al.

    TrackingNet: A large-scale dataset and benchmark for object tracking in the wild, in: Proceedings of the European Conference on Computer Vision (ECCV)

    (2018)
  • J.F. Henriques et al.

    High-speed tracking with kernelized correlation filters

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2015)
  • D. Comaniciu et al.

    Real-time tracking of non-rigid objects using mean shift, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2000)
  • D. Held et al.

    Learning to track at 100 fps with deep regression networks

  • L. Bertinetto et al.

    Fully-convolutional siamese networks for object tracking, in: European Conference on Computer Vision (ECCV)

    (2016)
  • B. Li et al.

    High performance visual tracking with siamese region proposal network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • H. Nam et al.

    Learning multi-domain convolutional neural networks for visual tracking, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2016)
  • S. Yun et al.

    Action-decision networks for visual tracking with deep reinforcement learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • Y. Song et al.

    CREST: Convolutional residual learning for visual tracking, in: Proceedings of the IEEE International Conference on Computer Vision

    (2017)
  • J. Valmadre et al.

    End-to-end representation learning for correlation filter based tracking

  • Q. Wang et al.

    DCFNet: Discriminant correlation filters network for visual tracking, arXiv preprint

  • O. Zoidi et al.

    Visual object tracking based on local steering kernels and color histograms

    IEEE Transactions on Circuits and Systems for Video Technology

    (2013)
  • W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in: 2012 IEEE Conference on...
  • S. He et al.

    Visual tracking via locality sensitive histograms
