Abstract:
Although both local and context information are crucial for robust tracking, existing CNN-based and transformer-based methods mainly focus on one of these aspects. Consequently, the former fail to exploit rich global context information because of their limited receptive field, while the latter are deficient in modeling the local relationships among neighboring regions. To address this issue, we propose the SiamPIN tracker, based on our Parallel Interaction Network. It consists of two effective modules, namely the Global Aggregation Block (GAB) and the Local Process Block (LPB). GAB perceives the global context to capture long-range spatial dependencies through a transformer-based architecture. Meanwhile, LPB performs local information extraction using a CNN model to retain the detailed appearance information of the target. These two modules are connected consecutively to compose a Trans-Conv unit block, which transmits the global context information to the local feature extraction procedure and hence enables interaction between the global and local information flows. Several such blocks are cascaded so that our model can learn to aggregate local and context information interactively. The proposed tracker achieves state-of-the-art performance on six benchmark datasets while maintaining real-time running speed.
Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume: 33, Issue: 4, April 2023)
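
The Trans-Conv unit described in the abstract (a global step followed by a local step, with several units cascaded) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no layer details, so everything here is an assumption made for illustration. Features are a 1-D sequence of vectors rather than 2-D feature maps, the global step is single-head self-attention with identity projections, the local step is a fixed 3-tap convolution, and the function names (global_aggregation, local_process, trans_conv_unit, cascade_units) are hypothetical. There are no learned weights and no normalization.

```python
import math


def global_aggregation(tokens):
    """Toy stand-in for GAB (assumption): single-head self-attention with
    identity query/key/value projections, followed by a residual add."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for query in tokens:
        # Scaled dot-product score of this position against every position.
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        context = [
            sum(w * value[d] for w, value in zip(weights, tokens)) / total
            for d in range(dim)
        ]
        out.append([x + c for x, c in zip(query, context)])
    return out


def local_process(tokens, kernel=(0.25, 0.5, 0.25)):
    """Toy stand-in for LPB (assumption): 1-D convolution over neighboring
    positions with edge-replicated padding, followed by a residual add."""
    n = len(tokens)
    dim = len(tokens[0])
    half = len(kernel) // 2
    out = []
    for i in range(n):
        mixed = [0.0] * dim
        for offset, weight in enumerate(kernel):
            j = min(max(i + offset - half, 0), n - 1)  # clamp index at the borders
            for d in range(dim):
                mixed[d] += weight * tokens[j][d]
        out.append([x + m for x, m in zip(tokens[i], mixed)])
    return out


def trans_conv_unit(tokens):
    """Global step first, then local step, so the local step sees global context."""
    return local_process(global_aggregation(tokens))


def cascade_units(tokens, num_units=3):
    """Apply several Trans-Conv units one after another."""
    for _ in range(num_units):
        tokens = trans_conv_unit(tokens)
    return tokens


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
fused = cascade_units(features, num_units=2)
```

Because the local step runs on the output of the global step, every local neighborhood it mixes already carries information from the whole sequence. That ordering is the only property this sketch shares with the design the abstract describes.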