Single object tracking remains challenging because it requires localizing an arbitrary object in a sequence of frames given only its appearance in the first frame. Many trackers, especially those built on a Vision Transformer (ViT) backbone, have achieved strong performance. However, the gap between performance measured on the training data and on the test data remains large. To narrow this gap, we propose a deformable masking module for transformer-based trackers. The module is inserted after each layer of the ViT. First, it masks out complete vectors of the output representations of the ViT layer. The masked representations are then fed into a deformable convolution to reconstruct new, reliable representations. The output of the last ViT layer is fused with all intermediate outputs of the deformable masking modules to produce a final robust attentional feature map. We extensively evaluate our model, named DMTrack, on seven tracking benchmarks. It outperforms previous state-of-the-art techniques by \(+2\%\) while using \(92.4\%\) fewer parameters. Moreover, it matches the performance of models with far more parameters, indicating the effectiveness of our training strategy.
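As a rough illustration of the masking step only, the following minimal sketch zeroes out complete token vectors rather than individual channels. The function name `mask_token_vectors`, the `mask_ratio` parameter, the choice of zero as the masked value, and the uniform random selection are all assumptions made for illustration; they are not taken from DMTrack. The deformable-convolution reconstruction and the fusion of intermediate outputs are not shown.

```python
import random


def mask_token_vectors(tokens, mask_ratio, rng):
    """Zero out complete token vectors (rows), not individual channels.

    tokens     : list of equal-length lists of floats, one per token
    mask_ratio : fraction of tokens to mask (hypothetical parameter)
    rng        : random.Random instance, passed in for reproducibility
    Returns the masked copy and the set of masked token indices.
    """
    n_tokens = len(tokens)
    dim = len(tokens[0])
    n_masked = int(round(mask_ratio * n_tokens))
    # Assumption: tokens are chosen uniformly at random without replacement.
    masked_idx = set(rng.sample(range(n_tokens), n_masked))
    masked = [
        [0.0] * dim if i in masked_idx else list(tok)
        for i, tok in enumerate(tokens)
    ]
    return masked, masked_idx


# Toy input: 8 tokens of dimension 4, standing in for one ViT layer output.
rng = random.Random(0)
tokens = [[float(i + 1)] * 4 for i in range(8)]
masked, masked_idx = mask_token_vectors(tokens, 0.25, rng)
```

In a full model, the masked tokens would next be passed to a reconstruction step; a learned deformable convolution would take that role in the approach described above.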
Data and materials availability
Not applicable.
Code availability
Not applicable.
Conceptualization: Omar Abdelaziz, Mohamed Shehata; Methodology: Omar Abdelaziz; Formal analysis and investigation: Omar Abdelaziz; Writing - original draft preparation: Omar Abdelaziz; Writing - review and editing: Omar Abdelaziz, Mohamed Shehata; Supervision: Mohamed Shehata
