Abstract
In object tracking, motion blur is a common challenge caused by rapid movement of the target object or long exposure times of the camera, and it degrades tracking performance. Traditional solutions perform image restoration before tracking; however, most restoration methods are computationally expensive, which reduces tracking speed. To address these problems, we propose a deblurring Transformer-based tracking method that embeds conditional cross-attention. The proposed method integrates three key modules: (1) an image quality assessment (IQA) module that estimates image quality; (2) an image deblurring module based on a lightweight adversarial network that improves image quality; and (3) a Transformer-based tracking module with conditional cross-attention that enhances object localization. Experimental results on two UAV object tracking benchmarks show that the proposed trackers achieve competitive results compared with several state-of-the-art trackers.
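To make the pipeline described in the abstract concrete, the following sketch (ours, not the authors' released code) wires the three modules together in PyTorch. The QualityHead class, the quality_threshold of 0.5, the stand-in deblurrer and tracker, and all tensor shapes are illustrative assumptions; only the control flow mirrors what the abstract describes.

```python
# Minimal sketch of the three-stage pipeline from the abstract:
# quality assessment -> deblur only if the frame is blurred -> track.
# Module names, threshold, and shapes are assumptions for illustration.
import torch
import torch.nn as nn


class QualityHead(nn.Module):
    """Hypothetical IQA module: predicts a scalar quality score per frame."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, 1),
            nn.Sigmoid(),  # score in (0, 1); higher means sharper
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)


def track_frame(frame, template, iqa, deblurrer, tracker, quality_threshold=0.5):
    """Process one frame: restore it only when the IQA score indicates blur."""
    score = iqa(frame)
    if score.item() < quality_threshold:  # low score: treat frame as blurred
        frame = deblurrer(frame)
    return tracker(template, frame)       # predict the target bounding box


if __name__ == "__main__":
    iqa = QualityHead()
    deblurrer = nn.Identity()                 # stand-in for the lightweight deblurring GAN
    tracker = lambda z, x: torch.zeros(1, 4)  # stand-in for the Transformer tracker
    frame = torch.rand(1, 3, 256, 256)
    template = torch.rand(1, 3, 128, 128)
    box = track_frame(frame, template, iqa, deblurrer, tracker)
    print(box.shape)  # torch.Size([1, 4]) -> (x, y, w, h)
```

The point of gating the deblurring step on the IQA score is that the costly restoration network runs only on frames that need it, which is how a deblurring stage can be added without sacrificing overall tracking speed.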









Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was partly supported by the National Natural Science Foundation of China under Grants 61976042 and 61972068, the Innovative Talents Program for Liaoning Universities under Grant LR2019020, and the Liaoning Revitalization Talents Program under Grant XLYC2007023, and partly by the Applied Basic Research Project of Liaoning Province under Grant 2022JH2/101300279.
Author information
Contributions
F. Sun and B. Zhu conceived this study. T. Zhao and F. Wang conducted the experiment and wrote the initial manuscript. X. Jia and F. Wang reviewed and edited it.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, F., Zhao, T., Zhu, B. et al. Deblurring transformer tracking with conditional cross-attention. Multimedia Systems 29, 1131–1144 (2023). https://doi.org/10.1007/s00530-022-01043-0