Elsevier

Information Sciences

Volume 598, June 2022, Pages 19-36

PaaRPN: Probabilistic anchor assignment with region proposal network for visual tracking

https://doi.org/10.1016/j.ins.2022.03.070

Abstract

Recently, visual trackers based on region proposal networks (RPN) have attracted widespread attention due to their relatively high efficiency and excellent performance. RPN-based trackers mainly combine a classification branch and a regression branch to predict a target’s state, and both branches are guided by pre-defined anchor boxes. However, these trackers first compute the Intersection-over-Union (IoU) between the anchor boxes and ground truth boxes, and then use a fixed IoU threshold to separate negative and positive training samples. The limitation of this design is that it ignores the actual content of the intersecting regions, which may include distractor objects or only a few meaningful regions of the tracked target. In this research, we propose a probabilistic anchor assignment with region proposal network (PaaRPN) that can adaptively separate anchors into negative samples and positive samples according to the model’s current learning status. To this end, we first calculate the classification scores of the anchor boxes conditioned on the current model and fit a probability distribution to the classification scores. The whole tracking model is then trained with anchor boxes separated into negative and positive samples in a probabilistic manner. Moreover, we introduce an online learning method in the PaaRPN framework that gives the model powerful discriminative abilities by exploiting both background and target appearance information. We tested the PaaRPN tracker on six tracking benchmarks to demonstrate the effectiveness of the proposed method. In particular, our model outperforms a strong RPN tracker, SiamRPN++, improving the AUC score from 0.613 to 0.657 on UAV123 and from 0.496 to 0.565 on LaSOT.

Introduction

Visual tracking is a fundamental topic that aims to predict the tracking target’s state in a given video frame [7]. In practice, the state is usually denoted by a bounding box in a video sequence [13]. Current visual tracking tasks mainly include long-term and short-term tracking. For long-term tracking, several tracking methods have become popular in the tracking field. Alan et al. [26] proposed a fully correlational long-term tracker that employs correlation filters trained on different time scales as detector components. Yan et al. [38] developed a ‘Skimming-Perusal’ module in a SiamRPN tracker [22]. Specifically, the perusal module is meant to predict the tracking target in a local search region, while the skimming module is developed to precisely select the most reliable local regions from the predefined sampling sliding window. Dai et al. [6] proposed an offline-trained meta-updater that integrates discriminative, geometric, and appearance cues from a video sequence to guide the updating of the online tracker. Huang et al. [16] introduced a purely global instance search based on two-stage object detectors; it imposes no constraints or assumptions concerning temporal consistency. For short-term tracking, region proposal network (RPN)-based trackers [40], [21], [22], [46], [42], [48] are widely used by the tracking community. A key characteristic of these trackers is that they are based on anchor boxes, which are the inputs of both the classification branch and the regression branch of the RPN. RPN-based trackers first utilize the classification branch to separate the foreground from the background of tracking scenes under the guidance of hand-crafted anchor boxes. The regression branch is then used to fine-tune the candidate anchor boxes to obtain a more accurate tracking box.

Current RPN-based trackers [40], [21], [22], [46], [42], [48], however, have two drawbacks. First, RPN-based trackers generate anchor boxes with various shapes and sizes so that they can better encapsulate the tracking target. For this design, anchor assignment, in which anchor boxes are defined as negative or positive samples, needs to be performed in advance. Traditional RPN-based trackers mainly use the following steps to define positive and negative samples: 1) the Intersection-over-Union (IoU) values between the hand-crafted anchors and a ground truth box are calculated; 2) for anchor assignment, the anchor boxes are defined as positive samples if their IoU values exceed a given threshold. Subsequently, the classified anchor boxes are fed into the classification and regression branches to obtain the final tracking box. This simple and intuitive anchor assignment strategy is a popular choice for RPN-based trackers. This assignment strategy, however, ignores the target content of the intersecting region, which may include background distractors or only a few important parts of the tracking target [19]. To overcome this limitation, several recent methods [19], [33], [44] have proposed various anchor assignment strategies. All these methods suggest that a proper anchor assignment method can bring performance gains. Second, current RPN-based trackers [40], [21], [22], [46], [42], [48] have shown inferior robustness in comparison to other state-of-the-art trackers [47], [1] due to the lack of a powerful model update strategy. Intuitively, the appearance of the tracked target may change in a video sequence due to several challenging factors, such as fast motion, deformation, and occlusion. Traditional RPN-based trackers can easily drift to other distractor locations because they do not learn the appearance changes of a tracking target online.
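The two-step IoU-based assignment described above can be illustrated with a minimal sketch. The function names `iou` and `assign_by_iou`, the `(x1, y1, x2, y2)` box layout, and the threshold value `0.6` are assumptions made for illustration; they are not taken from this paper.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_by_iou(anchors, gt_box, pos_thr=0.6):
    """Label each anchor 1 (positive) if its IoU with the ground truth
    exceeds pos_thr, else 0 (negative). pos_thr is a hypothetical value."""
    return [1 if iou(anchor, gt_box) > pos_thr else 0 for anchor in anchors]
```

Note that the label depends only on box geometry: an anchor that overlaps the ground truth heavily is positive even if most of the overlap covers background, which is the weakness the paragraph above points out.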

This research aims to design a new anchor assignment strategy that can flexibly determine the number of positive samples by developing the assignment inference method of a tracking model in a probabilistic manner rather than using fixed IoUs between anchor boxes and ground truth boxes. To achieve this, the tracking model must adaptively determine the number of positive and negative samples according to the distribution of training samples. When no positive samples have a high IoU between anchor boxes and a ground truth box, the tracking model needs to define some new positive samples to balance the distribution of positive and negative samples. In this case, the tracking model may regard the most meaningful content as positive samples, and anchor boxes with high IoU values are not necessarily classified as positive samples. On the other hand, when there exist many positive samples, the tracking model needs to treat high-quality and competitive anchor boxes as positive samples, and the rest of the samples should be defined as negative samples. Therefore, certain positive samples with noisy backgrounds can be eliminated through this new anchor assignment strategy. Most importantly, the current learning state of a model is required to reflect the assignment quality of anchor boxes.

Motivated by the aforementioned analyses, this study introduces a probabilistic anchor assignment with RPN that is capable of adaptively separating the preset anchor boxes into negative and positive samples according to the current learning status of a tracking model. Specifically, we first define a classification score for the preset anchor boxes that represents the location qualities of the tracking target. Subsequently, we establish a probability distribution of the tracking model that defines which anchor boxes are negative or positive samples. For anchor assignments, anchor boxes are defined as positive samples if the boxes from the positive sample sets have high probabilities. This strategy transforms the assignment of positive and negative samples into a maximum likelihood estimation based on a probability distribution. The parameters of the entire probability model are determined by the classification scores of the anchors. The probabilistic assignment model is trained using training samples that are drawn from a probability distribution. The probabilistic model then classifies the positive and negative samples in a probabilistic manner, leading to a more straightforward training process for our proposed PaaRPN tracker than for other RPN-based trackers [40], [21], [22], [46], [42], [48]. Furthermore, the PaaRPN tracker does not contain a fixed IoU threshold or number of positive samples. In addition, PaaRPN is equipped with a plug-in online learning procedure that has been successfully utilized in IoU-Net trackers [1], [8]. An online learning mechanism with hard negative mining is trained in an effective end-to-end manner with a discriminative training loss by utilizing an iterative optimization operation. The entire discriminative model employs the steepest descent method [1] with an optimal step length to reduce the online learning time. With this design, PaaRPN gains better target-background discriminative abilities than traditional RPN-based trackers (see Fig. 1). To achieve better accuracy while maintaining high computational efficiency, we also explore channel-wise multiplication for cross-correlation in both the classification and regression branches of the PaaRPN tracker rather than using the depth-wise correlation operation in other RPN-based trackers.
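To make the idea of fitting a probability distribution to anchor scores concrete, the following is a minimal sketch. It assumes a two-component one-dimensional Gaussian mixture fitted by expectation-maximization, in the spirit of the probabilistic anchor assignment of Kim and Lee listed in the references; the exact distribution and fitting procedure used by PaaRPN are not given in this excerpt. All function names, the initialization, and the 0.5 posterior cut-off are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def positive_posterior(score, weights, means, variances):
    """Posterior probability that a score belongs to component 1 (positives)."""
    p_neg = weights[0] * gaussian_pdf(score, means[0], variances[0])
    p_pos = weights[1] * gaussian_pdf(score, means[1], variances[1])
    total = p_neg + p_pos
    if total == 0.0:
        # Both densities underflowed: fall back to the nearest component mean.
        return 1.0 if abs(score - means[1]) < abs(score - means[0]) else 0.0
    return p_pos / total


def fit_two_gaussians(scores, iters=50, min_var=1e-4):
    """EM for a two-component mixture. Component 0 starts at the lowest
    score (negatives), component 1 at the highest score (positives)."""
    lo, hi = min(scores), max(scores)
    init_var = max(min_var, ((hi - lo) / 4.0) ** 2)
    weights, means, variances = [0.5, 0.5], [lo, hi], [init_var, init_var]
    for _ in range(iters):
        # E-step: responsibility of the positive component for each score.
        resp_pos = [positive_posterior(s, weights, means, variances) for s in scores]
        # M-step: maximum likelihood update of each component's parameters.
        for k in range(2):
            r = [p if k == 1 else 1.0 - p for p in resp_pos]
            nk = sum(r)
            if nk < 1e-12:
                continue  # component received no mass; keep its parameters
            weights[k] = nk / len(scores)
            means[k] = sum(ri * s for ri, s in zip(r, scores)) / nk
            variances[k] = max(
                min_var,
                sum(ri * (s - means[k]) ** 2 for ri, s in zip(r, scores)) / nk,
            )
    return weights, means, variances


def assign_probabilistic(scores):
    """Label an anchor positive (1) when the positive component explains
    its score better than the negative component, else negative (0)."""
    weights, means, variances = fit_two_gaussians(scores)
    return [
        1 if positive_posterior(s, weights, means, variances) > 0.5 else 0
        for s in scores
    ]
```

In this sketch the number of positives is not fixed in advance: it follows from where the fitted components separate, so it changes as the classification scores change during training.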

Specifically, the main contributions of this work are summarized as follows:

  • We propose a new anchor assignment strategy that transforms the assignment of positive and negative samples into a probabilistic prediction procedure. The procedure computes the classification scores of the RPN and maximizes the likelihood with respect to the probability distribution of these prediction scores. This operation turns anchor assignment into a probabilistic method that adaptively determines the number of positive samples.

  • We introduce an online learning mechanism to enable the proposed model, PaaRPN, to be more robust to the appearance changes of a tracking target during inference.

  • We employ a hard negative mining strategy to enhance the discriminative power of the online model in the presence of distractor objects.

  • We utilize channel-wise multiplication to compute the correlation features instead of using the depth-wise convolution operation in SiamRPN++.
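The online learning procedure mentioned in the contributions is described in the introduction as using steepest descent with an optimal step length. A minimal sketch of that optimization idea follows, assuming a ridge-regularized least-squares objective as a stand-in for the discriminative training loss, which is not specified in this excerpt. For such a quadratic objective, the step length that minimizes the loss along the negative gradient direction g is α = gᵀg / (‖Xg‖² + λ gᵀg). The names `ridge_loss` and `steepest_descent` and the default values are hypothetical.

```python
def matvec(X, w):
    """Compute X @ w for a matrix stored as a list of rows."""
    return [sum(x * wj for x, wj in zip(row, w)) for row in X]


def rmatvec(X, r):
    """Compute X.T @ r for a matrix stored as a list of rows."""
    return [sum(X[i][j] * r[i] for i in range(len(X))) for j in range(len(X[0]))]


def ridge_loss(X, y, w, lam):
    """0.5 * ||Xw - y||^2 + 0.5 * lam * ||w||^2 (stand-in objective)."""
    residual = [p - t for p, t in zip(matvec(X, w), y)]
    return 0.5 * sum(r * r for r in residual) + 0.5 * lam * sum(wj * wj for wj in w)


def steepest_descent(X, y, w, lam=0.01, steps=5):
    """Gradient steps with the exact (optimal) step length for a quadratic loss."""
    w = list(w)
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(X, w), y)]
        # Gradient of the ridge objective: X^T (Xw - y) + lam * w.
        g = [gj + lam * wj for gj, wj in zip(rmatvec(X, residual), w)]
        gg = sum(gj * gj for gj in g)
        if gg == 0.0:
            break  # already at the minimizer
        Xg = matvec(X, g)
        denom = sum(v * v for v in Xg) + lam * gg
        if denom == 0.0:
            break
        alpha = gg / denom  # minimizes the loss along the direction -g
        w = [wj - alpha * gj for wj, gj in zip(w, g)]
    return w
```

Because the step length is computed in closed form at every iteration, no learning rate needs to be tuned and the loss cannot increase from one step to the next, which is why only a few iterations are typically needed in this kind of scheme.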

To examine the effectiveness of our proposed model, we compared the proposed PaaRPN with other state-of-the-art trackers on six tracking benchmarks: OTB-2015 [36], VOT2019 [20], UAV123 [28], NFS30 [18], LaSOT [11], and TrackingNet [29]. Experimental results show that both our probabilistic anchor assignment method and online learning strategy can improve the tracking performance. In particular, our tracker, PaaRPN, achieves state-of-the-art results and outperforms the strong baseline tracker SiamRPN++ with performance gains of 4.4% on the UAV123 dataset and 6.9% on the LaSOT dataset. We also performed an extensive ablation study to verify the effectiveness of each component.

The remaining contents of this paper are organized as follows. In Section 2, we discuss the difference between the proposed method and other traditional methods, including Siamese trackers with RPN, anchor assignment strategies, online learning approaches and transformer trackers. In Section 3, we first describe how to determine the positive and negative samples through a probabilistic assignment method. We then discuss the loss function and online learning model of PaaRPN. Finally, we highlight the key differences between PaaRPN and existing RPN-based trackers. In Section 4, we perform an extensive comparison of PaaRPN and other state-of-the-art trackers on six tracking datasets and present ablation experiments to verify the effectiveness of the proposed components. Finally, conclusions, along with proposals for future work, are presented in Section 5.

Section snippets

Related work

Generic object tracking has developed rapidly in recent years due to the popularity of many deep learning techniques. Recently, trackers based on Siamese networks [40], [21], [46] have drawn much attention due to their high efficiency and end-to-end learning capability. In this section, we mainly review Siamese trackers with RPN, anchor assignment methods, online learning approaches, and transformer trackers, which are highly relevant to our work.

Siamese Trackers with RPN. Recently, some

Overview of our framework

In this work, we propose a probabilistic anchor assignment with RPN for tracking. Similar to RPN-based trackers, the proposed method benefits from end-to-end training on large-scale training sets. However, unlike common RPN-based trackers, our method separates positive and negative samples in a probabilistic manner during training and provides a powerful online update model for new image sequences. Our tracking framework is derived from two principles: (1) the model should adaptively determine

Experimental results

The proposed PaaRPN was implemented in Python using PyTorch. To facilitate further study, both the training code and testing code will be released at https://github.com/yangkai12/. On a single NVIDIA RTX 3090 GPU, the PaaRPN tracker runs at over 50 frames per second (FPS) with ResNet-50 as the feature extractor.

Training Details. We employed ResNet-50 pre-trained on ImageNet [32] as the backbone of the framework. The training splits of the TrackingNet [29], LaSOT [11], GOT-10k [15], and

Conclusion

In this paper, we propose a probabilistic anchor assignment method in which the assignment of training samples is converted into a likelihood optimization problem based on anchor scores computed by the classification network. The core of anchor assignment is to assign positive and negative samples in a probabilistic manner through the PaaRPN model, instead of using heuristic IoU hard assignment. In addition to the probabilistic assignment, we introduce an online learning mechanism in the PaaRPN

CRediT authorship contribution statement

Kai Yang: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Formal analysis, Software, Investigation. Haijun Zhang: Resources, Writing – review & editing, Supervision, Project administration. Dongliang Zhou: Writing – review & editing, Validation, Data curation. Li Dong: Software, Visualization, Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61972112 and No. 61832004, the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2021B1515020088, the Shenzhen Science and Technology Program under Grant No. JCYJ20210324131203009, and the HITSZ-J&A Joint Laboratory of Digital Design and Intelligent Fabrication under Grant No. HITSZ-J&A-2021A01.

References (48)

  • K. Dai et al., High-performance long-term tracking with meta-updater
  • M. Danelljan et al., Atom: Accurate tracking by overlap maximization
  • M. Danelljan et al., Probabilistic regression for visual tracking
  • M. Danelljan, A. Robinson, F.S. Khan, M. Felsberg, Beyond correlation filters: Learning continuous convolution...
  • H. Fan et al., Lasot: A high-quality benchmark for large-scale single object tracking
  • H. Fan et al., Siamese cascaded region proposal networks for real-time visual tracking
  • L. Huang et al., Got-10k: A large high-diversity benchmark for generic object tracking in the wild, IEEE Trans. Pattern Anal. Mach. Intell. (2019)
  • L. Huang et al., Globaltrack: A simple and strong baseline for long-term tracking, Proceedings of the AAAI Conference on Artificial Intelligence (2020)
  • I. Jung et al., Real-time mdnet
  • H. Kiani Galoogahi et al., Need for speed: A benchmark for higher frame rate object tracking
  • K. Kim, H.S. Lee, Probabilistic anchor assignment with iou prediction for object detection, in: Proceedings of the...
  • M. Kristan et al., The seventh visual object tracking vot2019 challenge results
  • B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan, Siamrpn++: Evolution of siamese visual tracking with very deep...
  • B. Li et al., High performance visual tracking with siamese region proposal network
