PaaRPN: Probabilistic anchor assignment with region proposal network for visual tracking
Introduction
Visual tracking is a fundamental topic that aims to predict the tracking target’s state in a given video frame [7]. In practice, the state is usually denoted by a bounding box in a video sequence [13]. Current visual tracking tasks mainly include long-term and short-term tracking. For long-term tracking, several methods have become popular. Alan et al. [26] proposed a fully correlational long-term tracker that employs correlation filters trained on different time scales as detector components. Yan et al. [38] developed a ‘Skimming-Perusal’ module in a SiamRPN tracker [22]. Specifically, the perusal module predicts the tracking target in a local search region, while the skimming module selects the most reliable local regions from the predefined sampling sliding window. Dai et al. [6] proposed an offline-trained meta-updater that integrates discriminative, geometric, and appearance cues from a video sequence to guide the updating of the online tracker. Huang et al. [16] introduced a purely global instance search based on two-stage object detectors; it imposes no constraints or assumptions concerning temporal consistency. For short-term tracking, region proposal network (RPN)-based trackers [40], [21], [22], [46], [42], [48] are widely used by the tracking community. A key characteristic of these trackers is that they are based on anchor boxes, which are the inputs of both the classification branch and the regression branch of the RPN. RPN-based trackers first utilize the classification branch to separate the foreground from the background of tracking scenes under the guidance of hand-crafted anchor boxes. The regression branch is then used to fine-tune the candidate anchor boxes to obtain a more accurate tracking box.
Current RPN-based trackers [40], [21], [22], [46], [42], [48], however, have two drawbacks. First, RPN-based trackers generate anchor boxes with various shapes and sizes so that they can better encapsulate the tracking target. For this design, anchor assignment, in which anchor boxes are defined as negative or positive samples, needs to be performed in advance. Traditional RPN-based trackers mainly use the following steps to define positive and negative samples: 1) the Intersection-over-Union (IoU) values between the hand-crafted anchors and a ground truth box are calculated; 2) for anchor assignment, the anchor boxes are defined as positive samples if their IoU values exceed a given threshold. Subsequently, the classified anchor boxes are fed into the classification and regression branches to obtain the final tracking box. This simple and intuitive anchor assignment strategy is a popular choice for RPN-based trackers. This assignment strategy, however, ignores the target content of the intersecting region, which may include background distractors or few important parts of the tracking target [19]. To overcome this limitation, several recent methods [19], [33], [44] have proposed various anchor assignment strategies. All these methods suggest that a proper anchor assignment method can bring performance gains. Second, current RPN-based trackers [40], [21], [22], [46], [42], [48] have shown inferior robustness in comparison to other state-of-the-art trackers [47], [1] due to the lack of a powerful model update strategy. Intuitively, the appearance of the tracked target may change in a video sequence due to several challenging factors, such as fast motion, deformation, occlusion, etc. Traditional RPN-based trackers can easily drift to other distractor locations due to the lack of online learning of the appearance changes of a tracking target.
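The two-step IoU assignment described above can be sketched as follows. This is a minimal illustration and not the code of any particular tracker: the corner-format boxes, the helper names `iou` and `assign_by_iou`, and the single threshold value of 0.6 are all assumptions made for the example.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_by_iou(anchors, gt_box, pos_thr=0.6):
    """Step 1: compute IoU per anchor. Step 2: label 1 (positive) if the
    IoU exceeds pos_thr (an assumed value), otherwise 0 (negative)."""
    return [1 if iou(a, gt_box) > pos_thr else 0 for a in anchors]


gt = (10.0, 10.0, 50.0, 50.0)
anchors = [
    (10.0, 10.0, 50.0, 50.0),      # identical to the ground truth
    (12.0, 12.0, 52.0, 52.0),      # slightly shifted
    (40.0, 40.0, 80.0, 80.0),      # small overlap only
    (100.0, 100.0, 140.0, 140.0),  # no overlap
]
labels = assign_by_iou(anchors, gt)
print(labels)  # [1, 1, 0, 0]
```

Note that the label depends only on box geometry: nothing in this rule looks at what the overlapping region actually contains, which is the limitation discussed above.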
This research aims to design a new anchor assignment strategy that flexibly determines the number of positive samples by formulating the assignment inference of a tracking model in a probabilistic manner rather than using fixed IoU thresholds between anchor boxes and ground truth boxes. To achieve this, the tracking model must adaptively determine the number of positive and negative samples according to the distribution of training samples. When no anchor box has a high IoU with the ground truth box, the tracking model needs to define some new positive samples to balance the distribution of positive and negative samples. In this case, the tracking model may regard the anchors with the most meaningful content as positive samples, and anchor boxes with high IoU values are not necessarily classified as positive samples. On the other hand, when there are many candidate positive samples, the tracking model needs to treat only high-quality and competitive anchor boxes as positive samples and define the rest as negative samples. In this way, positive samples with noisy backgrounds can be eliminated. Most importantly, the current learning state of the model is required to reflect the assignment quality of anchor boxes.
Motivated by the aforementioned analyses, this study introduces a probabilistic anchor assignment with RPN that is capable of adaptively separating the preset anchor boxes into negative and positive samples according to the current learning status of a tracking model. Specifically, we first define a classification score for the preset anchor boxes that represent the location qualities of the tracking target. Subsequently, we establish a probability distribution of the tracking model that defines which anchor boxes are negative or positive samples. For anchor assignments, anchor boxes are defined as positive samples if the boxes from the positive sample sets have high probabilities. This strategy transforms the assignment of positive and negative samples into a maximum likelihood estimation based on a probability distribution. The parameters of the entire probability model are determined by the classification scores of the anchors. The probabilistic assignment model is trained using training samples that are drawn from a probability distribution. The probabilistic model then classifies the positive and negative samples in a probabilistic manner, leading to a more straightforward training process for our proposed PaaRPN tracker than for other RPN-based trackers [40], [21], [22], [46], [42], [48]. Furthermore, the PaaRPN tracker does not contain a fixed IoU threshold or number of positive samples. In addition, PaaRPN is equipped with a plug-in online learning procedure that has been successfully utilized in IoU-Net trackers [1], [8]. An online learning mechanism with hard negative mining is trained in an effective end-to-end manner with a discriminative training loss by utilizing an iterative optimization operation. The entire discriminative model employs the steepest descent method [1] with an optimal step length to reduce the online learning time. 
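To make the idea of a likelihood-based split concrete, the sketch below fits a two-component one-dimensional Gaussian mixture to a set of anchor scores with EM and labels an anchor positive when the high-mean component is the more probable one. The text above does not state which distribution family PaaRPN uses, so the Gaussian mixture, the function names, and the toy scores are illustrative assumptions rather than the authors' implementation.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def positive_posteriors(scores, weights, means, variances):
    """E-step: posterior probability of the positive component per score."""
    post = []
    for s in scores:
        p0 = weights[0] * gaussian_pdf(s, means[0], variances[0])
        p1 = weights[1] * gaussian_pdf(s, means[1], variances[1])
        total = p0 + p1
        post.append(p1 / total if total > 0 else 0.5)
    return post


def fit_two_gaussians(scores, iters=50, min_var=1e-4):
    """Maximum-likelihood fit of a 2-component mixture (assumed model).

    Component 0 is initialised at the lowest score (negative),
    component 1 at the highest score (positive).
    """
    lo, hi = min(scores), max(scores)
    means = [lo, hi]
    variances = [max((hi - lo) ** 2 / 4.0, min_var)] * 2
    weights = [0.5, 0.5]
    for _ in range(iters):
        post = positive_posteriors(scores, weights, means, variances)
        # M-step: re-estimate weight, mean and variance of each component.
        for k in (0, 1):
            r = [p if k == 1 else 1.0 - p for p in post]
            nk = sum(r)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(scores)
            means[k] = sum(ri * s for ri, s in zip(r, scores)) / nk
            var = sum(ri * (s - means[k]) ** 2 for ri, s in zip(r, scores)) / nk
            variances[k] = max(var, min_var)
    return positive_posteriors(scores, weights, means, variances)


# Hypothetical anchor scores: five weak candidates and three strong ones.
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.82, 0.88, 0.91]
posteriors = fit_two_gaussians(scores)
labels = [1 if p > 0.5 else 0 for p in posteriors]
num_pos = sum(labels)
```

In this sketch the number of positives is not fixed in advance: it falls out of where the fitted components separate, so a different score distribution yields a different count without changing any threshold.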
With this design, PaaRPN has better target-background discriminative ability than traditional RPN-based trackers (see Fig. 1). To achieve better accuracy while maintaining high computational efficiency, we also explore channel-wise multiplication for cross-correlation in both the classification and regression branches of the PaaRPN tracker rather than the depth-wise correlation operation used in other RPN-based trackers.
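The difference between the two correlation operations can be illustrated on toy feature maps. Depth-wise cross-correlation is shown in its usual form (each template channel slides over the matching search channel). The exact form of channel-wise multiplication in PaaRPN is not given in this section, so the variant below, which pools each template channel to one scalar and scales the search channel by it, is only one possible reading and is labeled as an assumption in the code.

```python
def depthwise_xcorr(search, template):
    """Depth-wise cross-correlation on lists of 2-D channels.

    Each template channel slides over the matching search channel
    (valid positions only), giving a smaller response map per channel.
    """
    out = []
    for x, z in zip(search, template):
        H, W, h, w = len(x), len(x[0]), len(z), len(z[0])
        out.append([
            [sum(x[i + di][j + dj] * z[di][dj]
                 for di in range(h) for dj in range(w))
             for j in range(W - w + 1)]
            for i in range(H - h + 1)
        ])
    return out


def channelwise_multiply(search, template):
    """ASSUMED variant: average-pool each template channel to a scalar
    and multiply the matching search channel by that scalar."""
    out = []
    for x, z in zip(search, template):
        weight = sum(sum(row) for row in z) / (len(z) * len(z[0]))
        out.append([[v * weight for v in row] for row in x])
    return out


# Toy features: 2 channels, 3x3 search region, 2x2 template.
search = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
template = [
    [[1, 0], [0, 1]],
    [[2, 2], [2, 2]],
]
dw = depthwise_xcorr(search, template)
cw = channelwise_multiply(search, template)
```

Under this assumed reading, the depth-wise version needs h·w multiplications per output position and channel, while the channel-wise version needs one multiplication per search position and channel and keeps the spatial size of the search features.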
Specifically, the main contributions of this work are summarized as follows:
- We propose a new anchor assignment strategy that transforms the assignment of positive and negative samples into a probabilistic prediction procedure. The assignment is computed from the classification scores of the RPN by maximizing the likelihood with respect to the probability distribution of the prediction scores. This operation transforms the anchor assignment method into a probability method that adaptively determines the number of positive samples.
- We introduce an online learning mechanism to enable the proposed model, PaaRPN, to be more robust to the appearance changes of a tracking target during inference.
- We employ a hard negative mining strategy to enhance the discriminative power of the online model in the presence of distractor objects.
- We utilize channel-wise multiplication to compute the correlation features instead of using the depth-wise correlation operation in SiamRPN++.
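The introduction notes that the online model is optimized by steepest descent with an optimal step length [1]. The sketch below shows that idea on a toy ridge-regularized least-squares problem, where the optimal step along the negative gradient has a closed form. The objective, the data, and all names are assumptions chosen for illustration; the tracker's actual discriminative loss is not specified in this section.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(A, b, w, lam):
    """Toy objective: 0.5 * ||A w - b||^2 + 0.5 * lam * ||w||^2."""
    r = [p - t for p, t in zip(matvec(A, w), b)]
    return 0.5 * dot(r, r) + 0.5 * lam * dot(w, w)


def steepest_descent(A, b, lam=0.1, iters=20):
    """Gradient descent where each step length minimizes the loss
    exactly along the current negative-gradient direction."""
    w = [0.0] * len(A[0])
    At = [list(col) for col in zip(*A)]
    for _ in range(iters):
        r = [p - t for p, t in zip(matvec(A, w), b)]
        g = [gi + lam * wi for gi, wi in zip(matvec(At, r), w)]
        gg = dot(g, g)
        if gg < 1e-18:  # gradient vanished: already at the optimum
            break
        Ag = matvec(A, g)
        # For a quadratic loss: alpha = g.g / (g.H.g), H = A^T A + lam * I.
        alpha = gg / (dot(Ag, Ag) + lam * gg)
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w


A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [1.0, 2.0, 2.0]
w = steepest_descent(A, b)
final_loss = loss(A, b, w, 0.1)
```

Because the step length is computed rather than tuned, no learning rate has to be chosen, and a small number of iterations is typically enough on a well-conditioned problem such as this one.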
To examine the effectiveness of our proposed model, we compared the proposed PaaRPN with other state-of-the-art trackers on six tracking benchmarks: OTB-2015 [36], VOT2019 [20], UAV123 [28], NFS30 [18], LaSOT [11], and TrackingNet [29]. Experimental results show that both our probabilistic anchor assignment method and online learning strategy can improve the tracking performance. In particular, our tracker, PaaRPN, achieves state-of-the-art results and outperforms the strong baseline tracker SiamRPN++ with performance gains of 4.4% on the UAV123 dataset and 6.9% on the LaSOT dataset. We also performed an extensive ablation study to verify the effectiveness of each component.
The remaining contents of this paper are organized as follows. In Section 2, we discuss the difference between the proposed method and other traditional methods, including Siamese trackers with RPN, anchor assignment strategies, online learning approaches and transformer trackers. In Section 3, we first describe how to determine the positive and negative samples through a probabilistic assignment method. We then discuss the loss function and online learning model of PaaRPN. Finally, we highlight the key differences between PaaRPN and existing RPN-based trackers. In Section 4, we perform an extensive comparison of PaaRPN and other state-of-the-art trackers on six tracking datasets and present ablation experiments to verify the effectiveness of the proposed components. Finally, conclusions, along with proposals for future work, are presented in Section 5.
Related work
Generic object tracking has developed rapidly in recent years due to the popularity of many deep learning techniques. Recently, trackers based on Siamese networks [40], [21], [46] have drawn much attention due to their high efficiency and end-to-end learning capability. In this section, we mainly review Siamese trackers with RPN, anchor assignment methods, online learning approaches, and transformer trackers, which are highly relevant to our work.
Siamese Trackers with RPN. Recently, some
Overview of our framework
In this work, we propose a probabilistic anchor assignment with RPN for tracking. Similar to RPN-based trackers, the proposed method benefits from end-to-end training on large-scale training sets. However, unlike common RPN-based trackers, our method separates positive and negative samples in a probabilistic manner during training and provides a powerful online update model for new image sequences. Our tracking framework is derived from two principles: (1) the model should adaptively determine
Experimental results
The proposed PaaRPN was implemented in Python using PyTorch. To facilitate further study, both the training code and testing code will be released at https://github.com/yangkai12/. On a single NVIDIA RTX 3090 GPU, the PaaRPN tracker runs at over 50 frames per second (FPS) with ResNet-50 as the feature extractor.
Training Details. We employed ResNet-50 pre-trained on ImageNet [32] as the backbone of the framework. The training splits of the TrackingNet [29], LaSOT [11], GOT-10k [15], and
Conclusion
In this paper, we propose a probabilistic anchor assignment method in which the assignment of training samples is converted into a likelihood optimization problem based on anchor scores computed by the classification network. The core of anchor assignment is to assign positive and negative samples in a probabilistic manner through the PaaRPN model, instead of using heuristic IoU hard assignment. In addition to the probabilistic assignment, we introduce an online learning mechanism in the PaaRPN
CRediT authorship contribution statement
Kai Yang: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Formal analysis, Software, Investigation. Haijun Zhang: Resources, Writing – review & editing, Supervision, Project administration. Dongliang Zhou: Writing – review & editing, Validation, Data curation. Li Dong: Software, Visualization, Investigation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant No. 61972112 and No. 61832004, the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2021B1515020088, the Shenzhen Science and Technology Program under Grant No. JCYJ20210324131203009, and the HITSZ-J&A Joint Laboratory of Digital Design and Intelligent Fabrication under Grant No. HITSZ-J&A-2021A01.
References (48)
- et al. Robust visual object tracking using context-based spatial variation via multi-feature fusion. Inf. Sci. (2021)
- et al. Learning reinforced attentional representation for end-to-end visual tracking. Inf. Sci. (2020)
- et al. Learning object-uncertainty policy for visual tracking. Inf. Sci. (2022)
- et al. Multi-expert visual tracking using hierarchical convolutional feature fusion via contextual information. Inf. Sci. (2021)
- et al. SiamAtt: Siamese attention network for visual tracking. Knowledge-based Systems (2020)
- et al. Learning discriminative model prediction for tracking
- et al. Unveiling the power of deep tracking
- et al. Transformer tracking
- et al. Siamese Box Adaptive Network for Visual Tracking
- Y. Cui, C. Jiang, L. Wang, G. Wu, Fully Convolutional Online Tracking, arXiv preprint...
- High-performance long-term tracking with meta-updater
- Atom: Accurate tracking by overlap maximization
- Probabilistic regression for visual tracking
- Lasot: A high-quality benchmark for large-scale single object tracking
- Siamese cascaded region proposal networks for real-time visual tracking
- Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell.
- Globaltrack: A simple and strong baseline for long-term tracking. Proceedings of the AAAI Conference on Artificial Intelligence
- Real-time mdnet
- Need for speed: A benchmark for higher frame rate object tracking
- The seventh visual object tracking vot2019 challenge results
- High performance visual tracking with siamese region proposal network