
Learning convolutional self-attention module for unmanned aerial vehicle tracking

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

Siamese network-based trackers have demonstrated excellent performance. Recently, visual tracking has been applied to unmanned aerial vehicle (UAV) tasks. However, UAV tracking remains challenging because of aspect-ratio changes, out-of-view targets, scale variation, etc. Some Siamese-based trackers ignore the context-related information generated in the time dimension across continuous frames, lose substantial foreground information, and produce redundant background information. In this paper, we propose a novel feature fusion network based on convolutional self-attention blocks, which combine ResNet bottleneck blocks with multi-head self-attention (MHSA) blocks. By replacing the spatial (\(3\times 3\)) convolution in the last-stage bottleneck blocks of ResNet with MHSA blocks, we remove the locality limitation of the convolution operator. The convolutional self-attention blocks capture global context-related information of the given target image and further improve the accuracy of global matching between the given target and a search region. We conduct extensive experimental evaluations on OTB2015 and four UAV benchmarks, i.e., UAV123, UAV20L, DTB70 and UAV123@10fps. The experimental results demonstrate that the proposed tracker achieves excellent performance against SOTA trackers for UAV tracking, with a real-time average tracking speed of 181 FPS on a single GPU.
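To illustrate the core idea of replacing the spatial \(3\times 3\) convolution with global attention, the following is a minimal NumPy sketch of multi-head self-attention applied to a flattened feature map. It is not the authors' implementation: the random projection weights stand in for learned parameters, and position encodings (which BoTNet-style blocks add as relative-position terms) are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(feat, num_heads=4, seed=0):
    """Multi-head self-attention over an H x W x C feature map.

    Conceptually replaces the spatial 3x3 convolution of a ResNet
    bottleneck: every spatial position attends to every other
    position, so the output at each location aggregates global
    context rather than a local 3x3 neighborhood.
    """
    H, W, C = feat.shape
    assert C % num_heads == 0
    d = C // num_heads
    rng = np.random.default_rng(seed)
    # Illustrative random projections (these are learned in a real model).
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    x = feat.reshape(H * W, C)                       # HW tokens of dim C
    q = (x @ Wq).reshape(H * W, num_heads, d).transpose(1, 0, 2)
    k = (x @ Wk).reshape(H * W, num_heads, d).transpose(1, 0, 2)
    v = (x @ Wv).reshape(H * W, num_heads, d).transpose(1, 0, 2)
    # Scaled dot-product attention: heads x HW x HW weight matrix.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    out = (attn @ v).transpose(1, 0, 2).reshape(H, W, C)
    return out

feat = np.random.default_rng(1).standard_normal((14, 14, 64))
out = mhsa(feat)
print(out.shape)  # (14, 14, 64): same spatial resolution, globally mixed
```

Because the attention matrix is \(HW \times HW\), this is only practical on the low-resolution last-stage feature maps, which is why the paper confines MHSA to the final bottleneck blocks of ResNet.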


Download references

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Nos. 61861032 and 61865012) and by the Science and Technology Research Project of the Education Department of Jiangxi Province, China (No. GJJ190955).

Author information

Corresponding author

Correspondence to Yuanyun Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, J., Meng, C., Deng, C. et al. Learning convolutional self-attention module for unmanned aerial vehicle tracking. SIViP 17, 2323–2331 (2023). https://doi.org/10.1007/s11760-022-02449-z

