PaaRPN: Probabilistic anchor assignment with region proposal network for visual tracking

doi:10.1016/j.ins.2022.03.070

Information Sciences

Volume 598, June 2022, Pages 19-36

https://doi.org/10.1016/j.ins.2022.03.070

Abstract

Recently, visual trackers based on region proposal networks (RPN) have attracted widespread attention due to their relatively high efficiency and excellent performance. RPN-based trackers mainly combine a classification branch and a regression branch to predict a target’s state. Both branches are guided by pre-defined anchor boxes. RPN-based trackers, however, first compute the Intersection-over-Union (IoU) between the anchor boxes and ground truth boxes, and then use a fixed IoU threshold to separate negative and positive training samples. The limitation of this design is that these trackers do not analyze the actual content of the intersecting regions, which may include distractor objects or few meaningful regions of the tracked target. In this research, we propose a probabilistic anchor assignment with region proposal network (PaaRPN) that can adaptively separate anchors into negative samples and positive samples according to the model’s current learning status. To this end, we first calculate the classification scores of the anchor boxes conditioned on the current model and fit a probability distribution to the classification scores. The whole tracking model is then trained with anchor boxes separated into negative and positive samples in a probabilistic manner. Moreover, we introduce an online learning method in the PaaRPN framework that enables the model to have powerful discriminative abilities by exploiting both background and target appearance information. We tested the PaaRPN tracker on six tracking benchmarks to demonstrate the effectiveness of the proposed method. In particular, our model outperforms a strong RPN tracker, SiamRPN++, with AUC score improvements of 0.613 $\to$ 0.657 and 0.496 $\to$ 0.565 on UAV123 and LaSOT, respectively.

Introduction

Visual tracking is a fundamental topic that aims to predict the tracking target’s state in a given video frame [7]. In practice, the state is usually denoted by a bounding box in a video sequence [13]. Current visual tracking tasks mainly include long-term and short-term tracking. For long-term tracking, several tracking methods have become popular in the tracking field. Alan et al. [26] proposed a fully correlational long-term tracker that employs correlation filters trained on different time scales as detector components. Yan et al. [38] developed a ‘Skimming-Perusal’ module in a SiamRPN tracker [22]. Specifically, the perusal module is meant to predict the tracking target in a local search region, while the skimming module is developed to precisely select the most reliable local regions from the predefined sampling sliding window. Dai et al. [6] proposed an offline-trained meta-updater that integrates discriminative, geometric, and appearance cues from a video sequence to guide the updating of the online tracker. Huang et al. [16] introduced a purely global instance search based on two-stage object detectors; it imposes no constraints or assumptions concerning temporal consistency. For short-term tracking, region proposal network (RPN)-based trackers [40], [21], [22], [46], [42], [48] are widely used by the tracking community. A key characteristic of RPN-based trackers [40], [21], [22], [46], [42], [48] is that they are based on anchor boxes, which are the inputs of both the classification branch and the regression branch of the RPN. RPN-based trackers first utilize the classification branch to separate the foreground from the background of tracking scenes under the guidance of hand-crafted anchor boxes. The regression branch is then used to fine-tune the candidate anchor boxes to obtain a more accurate tracking box.

Current RPN-based trackers [40], [21], [22], [46], [42], [48], however, have two drawbacks. First, RPN-based trackers generate anchor boxes with various shapes and sizes so that they can better encapsulate the tracking target. For this design, anchor assignment, in which anchor boxes are defined as negative or positive samples, needs to be performed in advance. Traditional RPN-based trackers mainly use the following steps to define positive and negative samples: 1) the Intersection-over-Union (IoU) values between the hand-crafted anchors and a ground truth box are calculated; 2) for anchor assignment, the anchor boxes are defined as positive samples if their IoU values exceed a given threshold. Subsequently, the classified anchor boxes are fed into the classification and regression branches to obtain the final tracking box. This simple and intuitive anchor assignment strategy is a popular choice for RPN-based trackers. This assignment strategy, however, ignores the target content of the intersecting region, which may include background distractors or few important parts of the tracking target [19]. To overcome this limitation, several recent methods [19], [33], [44] have proposed various anchor assignment strategies. All these methods suggest that a proper anchor assignment method can bring performance gains. Second, current RPN-based trackers [40], [21], [22], [46], [42], [48] have shown inferior robustness in comparison to other state-of-the-art trackers [47], [1] due to the lack of a powerful model update strategy. Intuitively, the appearance of the tracked target may change in a video sequence due to several challenging factors, such as fast motion, deformation, occlusion, etc. Traditional RPN-based trackers can easily drift to other distractor locations due to the lack of online learning of the appearance changes of a tracking target.
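The two-step fixed-threshold assignment described above can be illustrated with a minimal sketch. The corner box format, the helper names `iou` and `assign_by_iou`, and the threshold value 0.6 are assumptions chosen for illustration; they are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) corners.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_by_iou(anchors: List[Box], gt: Box, pos_thr: float = 0.6) -> List[int]:
    """Step 1: compute IoU per anchor. Step 2: label 1 if IoU > pos_thr, else 0.

    pos_thr is an illustrative value, not the setting used in the paper.
    """
    return [1 if iou(anchor, gt) > pos_thr else 0 for anchor in anchors]


gt_box = (10.0, 10.0, 50.0, 50.0)
anchors = [
    (10.0, 10.0, 50.0, 50.0),  # identical to the ground truth
    (12.0, 12.0, 52.0, 52.0),  # slightly shifted
    (30.0, 30.0, 70.0, 70.0),  # partial overlap
    (60.0, 60.0, 90.0, 90.0),  # no overlap
]
labels = assign_by_iou(anchors, gt_box)
print(labels)
```

Note that the label depends only on box geometry: nothing in this rule inspects what the overlapping region actually contains, which is the limitation the text points out.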

This research aims to design a new anchor assignment strategy that can flexibly determine the number of positive samples by developing the assignment inference method of a tracking model in a probabilistic manner rather than using fixed IoUs between anchor boxes and ground truth boxes. To achieve this, the tracking model must adaptively determine the number of positive and negative samples according to the distribution of training samples. When no positive samples have a high IoU between anchor boxes and a ground truth box, the tracking model needs to define some new positive samples to balance the distribution of positive and negative samples. In this case, the tracking model may regard the most meaningful content as positive samples, and anchor boxes with high IoU values are not necessarily classified as positive samples. On the other hand, when there exist many positive samples, the tracking model needs to treat high-quality and competitive anchor boxes as positive samples, and the rest of the samples should be defined as negative samples. Therefore, certain positive samples with noisy backgrounds can be eliminated through this new anchor assignment strategy. Most importantly, the current learning state of a model is required to reflect the assignment quality of anchor boxes.

Motivated by the aforementioned analyses, this study introduces a probabilistic anchor assignment with RPN that is capable of adaptively separating the preset anchor boxes into negative and positive samples according to the current learning status of a tracking model. Specifically, we first define a classification score for the preset anchor boxes that represents the location qualities of the tracking target. Subsequently, we establish a probability distribution of the tracking model that defines which anchor boxes are negative or positive samples. For anchor assignments, anchor boxes are defined as positive samples if the boxes from the positive sample sets have high probabilities. This strategy transforms the assignment of positive and negative samples into a maximum likelihood estimation based on a probability distribution. The parameters of the entire probability model are determined by the classification scores of the anchors. The probabilistic assignment model is trained using training samples that are drawn from a probability distribution. The probabilistic model then classifies the positive and negative samples in a probabilistic manner, leading to a more straightforward training process for our proposed PaaRPN tracker than for other RPN-based trackers [40], [21], [22], [46], [42], [48]. Furthermore, the PaaRPN tracker does not contain a fixed IoU threshold or number of positive samples. In addition, PaaRPN is equipped with a plug-in online learning procedure that has been successfully utilized in IoU-Net trackers [1], [8]. An online learning mechanism with hard negative mining is trained in an effective end-to-end manner with a discriminative training loss by utilizing an iterative optimization operation. The entire discriminative model employs the steepest descent method [1] with an optimal step length to reduce the online learning time. With this design, PaaRPN has better target-background discriminative abilities than traditional RPN-based trackers (see Fig. 1). To achieve better accuracy while maintaining high computational efficiency, we also explore channel-wise multiplication for cross-correlation in both the classification and regression branches of the PaaRPN tracker rather than using the depth-wise correlation operation in other RPN-based trackers.
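The idea of fitting a probability distribution to anchor scores and assigning labels by likelihood can be sketched as follows. The text here does not state which distribution family is fitted, so this sketch assumes a one-dimensional two-component Gaussian mixture trained with EM; the function names, the variance floor, and the 0.5 posterior cut-off are likewise hypothetical choices, not the authors' implementation.

```python
import math
from typing import List, Tuple


def gauss(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior(s: float, means: List[float], variances: List[float],
              weights: List[float]) -> List[float]:
    """Posterior probability of each mixture component given score s."""
    dens = [w * gauss(s, m, v) for m, v, w in zip(means, variances, weights)]
    total = sum(dens)
    if total == 0.0:  # both densities underflowed; fall back to a neutral split
        return [0.5, 0.5]
    return [d / total for d in dens]


def fit_two_gaussians(scores: List[float], iters: int = 50,
                      var_floor: float = 1e-4) -> Tuple[List[float], List[float], List[float]]:
    """Maximum-likelihood fit of a two-component mixture via EM (assumed model)."""
    n = len(scores)
    mean_all = sum(scores) / n
    var_all = max(sum((s - mean_all) ** 2 for s in scores) / n, var_floor)
    means = [min(scores), max(scores)]  # start one component at each extreme
    variances = [var_all, var_all]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities of each component for each anchor score.
        resp = [posterior(s, means, variances, weights) for s in scores]
        # M-step: re-estimate the parameters from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var_k = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var_k, var_floor)
            weights[k] = nk / n
    return means, variances, weights


def assign_probabilistic(scores: List[float]) -> List[int]:
    """Label an anchor positive when the high-score component explains it best."""
    means, variances, weights = fit_two_gaussians(scores)
    pos = 0 if means[0] > means[1] else 1
    return [1 if posterior(s, means, variances, weights)[pos] > 0.5 else 0
            for s in scores]


# Hypothetical per-anchor scores produced by the current model.
anchor_scores = [0.05, 0.10, 0.08, 0.12, 0.85, 0.90, 0.80, 0.15]
prob_labels = assign_probabilistic(anchor_scores)
print(prob_labels)
```

Because the split point comes from the fitted mixture rather than from a constant, the number of positives changes with the score distribution, which is the adaptive behaviour the paragraph above describes.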

Specifically, the main contributions of this work are summarized as follows:

• We propose a new anchor assignment strategy that transforms the assignment of positive and negative samples into a probabilistic prediction procedure. The assignment is computed from the classification scores of the RPN by maximizing the likelihood with respect to the probability distribution of the prediction scores. This operation transforms the anchor assignment method into a probability method that adaptively determines the number of positive samples.
• We introduce an online learning mechanism to enable the proposed model, PaaRPN, to be more robust to the appearance changes of a tracking target during inference.
• We employ a hard negative mining strategy to enhance the discriminative power of the online model in the presence of distractor objects.
• We utilize channel-wise multiplication to compute the correlation features instead of using the depth-wise convolution operation in SiamRPN++.
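The online learning mechanism in the contributions above is said to use steepest descent with an optimal step length. A minimal sketch of that optimizer is given below on a stand-in problem: the discriminative loss of the paper is not shown in this text, so a ridge-regression loss L(w) = 0.5·||Xw − y||² + 0.5·λ·||w||² is assumed purely for illustration, and hard negative mining is omitted. For this quadratic loss the exact line-search step is α = gᵀg / (||Xg||² + λ·gᵀg), where g is the gradient. All names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Vector) -> Vector:
    """Compute M v."""
    return [dot(row, v) for row in m]


def rmatvec(m: Matrix, v: Vector) -> Vector:
    """Compute M^T v."""
    return [sum(m[i][j] * v[i] for i in range(len(m))) for j in range(len(m[0]))]


def loss(w: Vector, X: Matrix, y: Vector, lam: float) -> float:
    """Assumed stand-in loss: 0.5*||Xw - y||^2 + 0.5*lam*||w||^2."""
    r = [p - t for p, t in zip(matvec(X, w), y)]
    return 0.5 * dot(r, r) + 0.5 * lam * dot(w, w)


def steepest_descent(X: Matrix, y: Vector, lam: float = 0.1,
                     steps: int = 5) -> Tuple[Vector, List[float]]:
    """Gradient descent where each step length minimizes the loss along -g."""
    w = [0.0] * len(X[0])
    history = [loss(w, X, y, lam)]
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(X, w), y)]
        g = [gi + lam * wi for gi, wi in zip(rmatvec(X, residual), w)]
        Xg = matvec(X, g)
        denom = dot(Xg, Xg) + lam * dot(g, g)  # g^T H g with H = X^T X + lam*I
        if denom == 0.0:  # zero gradient: already at the minimum
            break
        alpha = dot(g, g) / denom  # exact line-search step for a quadratic loss
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
        history.append(loss(w, X, y, lam))
    return w, history


# Hypothetical toy features (rows) and target responses.
X_toy = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
y_toy = [1.0, 0.0, 1.0]
w_fit, loss_history = steepest_descent(X_toy, y_toy)
print(loss_history)
```

Choosing the step length in closed form avoids tuning a learning rate and lets a few iterations make substantial progress, which is consistent with the stated goal of reducing online learning time.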

To examine the effectiveness of our proposed model, we compared the proposed PaaRPN with other state-of-the-art trackers on six tracking benchmarks: OTB-2015 [36], VOT2019 [20], UAV123 [28], NFS30 [18], LaSOT [11], and TrackingNet [29]. Experimental results show that both our probabilistic anchor assignment method and online learning strategy can improve the tracking performance. In particular, our tracker, PaaRPN, achieves state-of-the-art results and outperforms the strong baseline tracker SiamRPN++ with AUC gains of 4.4 percentage points on the UAV123 dataset (0.613 $\to$ 0.657) and 6.9 percentage points on the LaSOT dataset (0.496 $\to$ 0.565). We also performed an extensive ablation study to verify the effectiveness of each component.

The remaining contents of this paper are organized as follows. In Section 2, we discuss the difference between the proposed method and other traditional methods, including Siamese trackers with RPN, anchor assignment strategies, online learning approaches and transformer trackers. In Section 3, we first describe how to determine the positive and negative samples through a probabilistic assignment method. We then discuss the loss function and online learning model of PaaRPN. Finally, we highlight the key differences between PaaRPN and existing RPN-based trackers. In Section 4, we perform an extensive comparison of PaaRPN and other state-of-the-art trackers on six tracking datasets and present ablation experiments to verify the effectiveness of the proposed components. Finally, conclusions, along with proposals for future work, are presented in Section 5.

Section snippets

Related work

Generic object tracking has developed rapidly in recent years due to the popularity of many deep learning techniques. Recently, trackers based on Siamese networks [40], [21], [46] have drawn much attention due to their high efficiency and end-to-end learning capability. In this section, we mainly review Siamese trackers with RPN, anchor assignment methods, online learning approaches, and transformer trackers, which are highly relevant to our work.

Siamese Trackers with RPN. Recently, some

Overview of our framework

In this work, we propose a probabilistic anchor assignment with RPN for tracking. Similar to RPN-based trackers, the proposed method benefits from end-to-end training on large-scale training sets. However, unlike common RPN-based trackers, our method separates positive and negative samples in a probabilistic manner during training and provides a powerful online update model for new image sequences. Our tracking framework is derived from two principles: (1) the model should adaptively determine

Experimental results

The proposed PaaRPN was implemented in Python using PyTorch. To facilitate further study, both the training code and testing code will be released at https://github.com/yangkai12/. On a single NVIDIA RTX 3090 GPU, the PaaRPN tracker runs at over 50 frames per second (FPS) with ResNet-50 as the feature extractor.

Training Details. We employed ResNet-50 pre-trained on ImageNet [32] as the backbone of the framework. The training splits of the TrackingNet [29], LaSOT [11], GOT-10k [15], and

Conclusion

In this paper, we propose a probabilistic anchor assignment method in which the assignment of training samples is converted into a likelihood optimization problem based on anchor scores computed by the classification network. The core of anchor assignment is to assign positive and negative samples in a probabilistic manner through the PaaRPN model, instead of using heuristic IoU hard assignment. In addition to the probabilistic assignment, we introduce an online learning mechanism in the PaaRPN

CRediT authorship contribution statement

Kai Yang: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Formal analysis, Software, Investigation. Haijun Zhang: Resources, Writing – review & editing, Supervision, Project administration. Dongliang Zhou: Writing – review & editing, Validation, Data curation. Li Dong: Software, Visualization, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61972112 and No. 61832004, the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2021B1515020088, the Shenzhen Science and Technology Program under Grant No. JCYJ20210324131203009, and the HITSZ-J&A Joint Laboratory of Digital Design and Intelligent Fabrication under Grant No. HITSZ-J&A-2021A01.

References (48)

D. Elayaperumal et al., Robust visual object tracking using context-based spatial variation via multi-feature fusion, Inf. Sci. (2021)
P. Gao et al., Learning reinforced attentional representation for end-to-end visual tracking, Inf. Sci. (2020)
X. He et al., Learning object-uncertainty policy for visual tracking, Inf. Sci. (2022)
S. Moorthy et al., Multi-expert visual tracking using hierarchical convolutional feature fusion via contextual information, Inf. Sci. (2021)
K. Yang et al., SiamAtt: Siamese attention network for visual tracking, Knowledge-based Systems (2020)
G. Bhat et al., Learning discriminative model prediction for tracking
G. Bhat et al., Unveiling the power of deep tracking
X. Chen et al., Transformer tracking
Z. Chen et al., Siamese Box Adaptive Network for Visual Tracking
Y. Cui, C. Jiang, L. Wang, G. Wu, Fully Convolutional Online Tracking, arXiv preprint...
K. Dai et al., High-performance long-term tracking with meta-updater
M. Danelljan et al., Atom: Accurate tracking by overlap maximization
M. Danelljan et al., Probabilistic regression for visual tracking
M. Danelljan, A. Robinson, F.S. Khan, M. Felsberg, Beyond correlation filters: Learning continuous convolution...
H. Fan et al., Lasot: A high-quality benchmark for large-scale single object tracking
H. Fan et al., Siamese cascaded region proposal networks for real-time visual tracking
L. Huang et al., Got-10k: A large high-diversity benchmark for generic object tracking in the wild, IEEE Trans. Pattern Anal. Mach. Intell. (2019)
L. Huang et al., Globaltrack: A simple and strong baseline for long-term tracking, Proceedings of the AAAI Conference on Artificial Intelligence (2020)
I. Jung et al., Real-time mdnet
H. Kiani Galoogahi et al., Need for speed: A benchmark for higher frame rate object tracking
K. Kim, H.S. Lee, Probabilistic anchor assignment with iou prediction for object detection, in: Proceedings of the...
M. Kristan et al., The seventh visual object tracking vot2019 challenge results
B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan, Siamrpn++: Evolution of siamese visual tracking with very deep...
B. Li et al., High performance visual tracking with siamese region proposal network
