Dense convolutional feature histograms for robust visual object tracking

https://doi.org/10.1016/j.imavis.2020.103933

Highlights

  • Novel visual object tracking algorithm based on extracting histograms of convolutional features

  • Proposed architecture is better suited to the tracking task, facilitating the training process

  • Extensive experimental validation in multiple tracking benchmarks

  • Entirely offline training, real-time speeds even on embedded systems

Abstract

Despite recent breakthroughs in the field, Visual Object Tracking remains an open and challenging task in Computer Vision. Modern applications require trackers to not only be accurate but also very fast, even on embedded systems. In this work, we use features from Convolutional Neural Networks to build histograms, which are more adept at handling appearance variations, in an end-to-end trainable architecture. To deal with the internal covariate shift that occurs when extracting histograms from convolutional features, as well as to incorporate information from multiple levels of the neural hierarchy, we propose and use a novel densely connected architecture where histograms from multiple layers are concatenated to produce the final representation. Experimental results validate our hypotheses on the benefits of using histograms as opposed to standard convolutional features, as the proposed histogram-based tracker surpasses recently proposed sophisticated trackers on multiple benchmarks. Long-term tracking results also reaffirm the usefulness of the proposed tracker in more challenging scenarios, where appearance variations are more severe and traditional trackers fail.

Introduction

Visual Object Tracking is the task of locating an arbitrary target, defined by a Region of Interest (ROI), in a video sequence. Various applications such as security and surveillance, cinematography, robotics, augmented reality, and even entertainment [1], [2], all rely on tracking as a first step. Despite recent breakthroughs in the field, owing mostly to the success of Convolutional Neural Networks (CNNs), it remains a very challenging task in the field of Computer Vision.

Typically, a tracker generates a model of the target ROI and uses this model to search for the target in subsequent frames. Challenges arise first from the target model, which should be discriminative enough for the tracker to avoid drifting to similar objects, while remaining adaptive enough to the target's inter-frame appearance changes. Furthermore, a tracker must be equipped to deal with occlusions, fast target movements and out-of-view scenarios. Another important aspect is the speed of the tracker, i.e., the time it takes to locate the target in each frame. Real-time requirements impose hard constraints on the tracker's per-frame processing time.

Recently, CNNs have excelled in other Computer Vision tasks, such as image classification, object detection and semantic segmentation [3]. Their success can be attributed to the semantically meaningful representations they extract from visual information. Hence, CNNs have begun to be used in tracking tasks as well. In general, most recently released CNN-based trackers work by extracting a convolutional feature-based model of the target as well as of each subsequent frame, and cross-correlating the target model with the frame model to locate the target. They rely either on online optimization of the network parameters or on offline learning to produce discriminative features while maintaining high tracking speeds on GPUs.
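To make the cross-correlation step above concrete, the following minimal PyTorch sketch (our illustration, not the authors' implementation; the function name and tensor shapes are assumed) correlates a target feature map with a larger search-region feature map and returns the position of the response peak:

import torch
import torch.nn.functional as F

def locate_target(target_feat, search_feat):
    """Cross-correlate target features with search-region features and
    return the location of the response peak.

    target_feat: (C, h, w) tensor -- features of the target ROI (the target model)
    search_feat: (C, H, W) tensor -- features of the current frame's search region
    """
    # Use the target features as a correlation kernel over the search features.
    response = F.conv2d(search_feat.unsqueeze(0),    # (1, C, H, W)
                        target_feat.unsqueeze(0))    # (1, C, h, w) -> single response map
    response = response.squeeze()                    # (H - h + 1, W - w + 1)
    # The peak of the response map is the most likely target location.
    peak = int(torch.argmax(response))
    row, col = divmod(peak, response.shape[1])
    return row, col, response

Siamese-style trackers follow essentially this scheme, with both feature maps produced by the same offline-trained network.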

To address the challenges imposed by modern applications, such as those on embedded machines and robots [2], a tracking algorithm must push the speed-accuracy trade-off to its limit. For CNN-based trackers, this means using modern neural architectures while keeping the number of layers relatively small. Learning a discriminative model is then a matter of training the network well offline, as well as updating the network and the model online. The abundance of annotated video datasets, such as ImageNet VID [4] or TrackingNet [5], facilitates the training process of such trackers.

Before convolutional features, histogram-based descriptors were widely used, such as Histogram of Oriented Gradients (HOG) descriptors in the Discriminative Correlation Filter (DCF) paradigm [6], or color histograms in the Mean Shift (MS) algorithm [7]. Depending on the encoded feature type, histograms can be more robust to appearance changes while being computationally efficient. Motivated by the success of histogram-based trackers, as well as by the powerful representations extracted by deep CNNs, in this work we combine the two in a single neural architecture, where histograms are extracted from deep convolutional features, exploiting the best attributes of each.

The main contribution of this work is a novel visual object tracking algorithm based on extracting histograms of convolutional features in a single, end-to-end learnable architecture. Using histograms, the proposed tracker becomes robust to the appearance variations of the target, while remaining discriminative due to the use of convolutional features, inspired by recent works coupling encoding methods with CNNs [23], [24]. We furthermore introduce an architecture which is better suited to the tracking task and facilitates the training process of neural networks using histograms. The effectiveness of the histogram-based representations is validated experimentally in various tracking scenarios. Our tracker is trained entirely offline, thus remaining very fast during deployment and capable of running in real time even on embedded systems.
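As a rough illustration of what "histograms of convolutional features" means here (a simplified sketch under our own assumptions, not the exact LBoF formulation used in the paper), each spatial feature vector of a convolutional feature map can be soft-assigned to a set of learned codewords, and the assignments averaged over all spatial locations to form a histogram:

import torch
import torch.nn.functional as F

def soft_histogram(feat, codewords, sigma=1.0):
    """Soft-quantize a convolutional feature map into a histogram.

    feat:      (C, H, W) feature map from some convolutional layer
    codewords: (K, C)    learned codebook (the histogram's bin centers)
    returns:   (K,)      histogram over the K codewords
    """
    C, H, W = feat.shape
    x = feat.permute(1, 2, 0).reshape(-1, C)   # (H*W, C): one feature vector per location
    d = torch.cdist(x, codewords)              # (H*W, K): distance to every codeword
    a = F.softmax(-d / sigma, dim=1)           # soft membership of each location in each bin
    return a.mean(dim=0)                       # average over locations -> histogram (sums to 1)

Because the codewords are ordinary parameters, the whole operation is differentiable and can be trained end-to-end together with the convolutional layers, which is the property the proposed architecture exploits.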

The rest of this paper is organized as follows. Section 2 is a summary of related work on visual object tracking. Section 3 introduces and analyzes in depth the proposed tracker and methodology. The experimental results, which validate our hypotheses, are presented in Section 4, and finally Section 5 summarizes our findings.

Section snippets

Related work

The following subsections offer an overview of related tracking methods, both CNN-based and histogram-based, and highlight the proposed method's contribution.

Proposed method

The following subsections describe the proposed method in detail, starting from an overview of histogram-based tracking, a description of histogram extraction from convolutional features, and finally building up to the proposed convolutional feature histogram tracker.

Network and training details

Following [9], we use an AlexNet-like base architecture for our network, keeping four of its convolutional layers and adding LBoF modules [25], [27]. The network architecture is illustrated in Table 1. Each conv_block consists of a convolutional layer, batch normalization and a ReLU activation function. The lbof layers are built as convolutional layers, followed by an absolute value activation function, l1 normalization and an average pooling layer. In our experiments, we use convolutional layers
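Read literally, the two building blocks described above could be sketched in PyTorch as follows (a hedged sketch only: the kernel sizes, channel counts, pooling size and the use of a 1x1 convolution for the lbof layer are our assumptions, not the values listed in Table 1):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """conv_block as described: convolution -> batch normalization -> ReLU."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class LBoFBlock(nn.Module):
    """lbof layer as described: convolution (codeword similarities) ->
    absolute value -> l1 normalization -> average pooling."""
    def __init__(self, in_ch, n_codewords, pool_size=2):
        super().__init__()
        self.codewords = nn.Conv2d(in_ch, n_codewords, kernel_size=1)
        self.pool = nn.AvgPool2d(pool_size)

    def forward(self, x):
        u = torch.abs(self.codewords(x))                 # similarity of each location to each codeword
        u = u / (u.sum(dim=1, keepdim=True) + 1e-8)      # l1 normalization across codewords
        return self.pool(u)                              # locally pooled histograms

In the densely connected variant described in the abstract, the histograms produced by lbof modules attached to different conv_blocks would be concatenated to form the final representation.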

Conclusions & future work

Inspired by the success of histogram-based trackers such as Mean Shift, we have presented a novel tracker based on extracting histograms of convolutional features in an end-to-end learnable architecture. We have proposed an architecture which facilitates the training of Bag-of-Features modules inside a convolutional neural network, alleviating the problem of covariate shift and allowing earlier layers to be trained more efficiently. Tracking-oriented features are learned and histograms are

CRediT authorship contribution statement

Paraskevi Nousi: Investigation, Software, Validation, Formal analysis, Conceptualization, Writing - original draft, Visualization. Anastasios Tefas: Methodology, Conceptualization, Resources, Writing - review & editing, Supervision, Project administration. Ioannis Pitas: Resources, Conceptualization, Writing - review & editing, Supervision.

Declaration of Competing Interest

No potential conflict of interest was reported by the authors.

Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 731667 (MULTIDRONE). This publication reflects the authors' views only. The European Commission is not responsible for any use that may be made of the information it contains.

References (40)

  • S. Kamate et al.

    Application of object detection and tracking techniques for unmanned aerial vehicles

    Procedia Computer Science

    (2015)
  • J. Schmidhuber

    Deep learning in neural networks: An overview

    Neural Networks

    (2015)
  • T. Vojir et al.

    Robust scale-adaptive mean-shift for tracking

    Pattern Recognition Letters

    (2014)
  • N. Passalis et al.

    Neural bag-of-features learning

    Pattern Recognition

    (2017)
  • B. Deori et al.

    A survey on moving object tracking in video

    International Journal on Information Theory (IJIT)

    (2014)
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image...
  • M. Muller et al.

    TrackingNet: A large-scale dataset and benchmark for object tracking in the wild, in: Proceedings of the European Conference on Computer Vision (ECCV)

    (2018)
  • J.F. Henriques et al.

    High-speed tracking with kernelized correlation filters

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2015)
  • D. Comaniciu et al.

    Real-time tracking of non-rigid objects using mean shift, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2000)
  • D. Held et al.

    Learning to track at 100 fps with deep regression networks

  • L. Bertinetto et al.

    Fully-convolutional siamese networks for object tracking, in: European Conference on Computer Vision (ECCV)

    (2016)
  • B. Li et al.

    High performance visual tracking with siamese region proposal network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • H. Nam et al.

    Learning multi-domain convolutional neural networks for visual tracking, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2016)
  • S. Yun et al.

    Action-decision networks for visual tracking with deep reinforcement learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • Y. Song et al.

    CREST: Convolutional residual learning for visual tracking, in: Proceedings of the IEEE International Conference on Computer Vision

    (2017)
  • J. Valmadre et al.

    End-to-end representation learning for correlation filter based tracking

  • Q. Wang et al.

    DCFNet: Discriminant correlation filters network for visual tracking, arXiv preprint

  • O. Zoidi et al.

    Visual object tracking based on local steering kernels and color histograms

    IEEE Transactions on Circuits and Systems for Video Technology

    (2013)
  • W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in: 2012 IEEE Conference on...
  • S. He et al.

    Visual tracking via locality sensitive histograms
