Feature Disentanglement and Adaptive Fusion for Improving Multi-modal Tracking

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14436)

Included in the following conference series:

  • Pattern Recognition and Computer Vision (PRCV 2023)


Abstract

Multi-modal tracking has gained increasing attention due to its superior accuracy and robustness in complex scenarios. The primary challenges in this field lie in effectively extracting and fusing multi-modal data across the gap that inherently separates modalities. To address these issues, we propose a novel regularized single-stream multi-modal tracking framework inspired by the perspective of feature disentanglement. Specifically, accounting for the intrinsic similarities and differences between modalities, we design a modality-specific weight-sharing feature extraction module that produces well-disentangled multi-modal features. To exploit the feature-level specificity of each modality, we further propose a cross-modal deformable attention mechanism that adaptively and efficiently integrates multi-modal features. Extensive experiments on three multi-modal tracking benchmarks, covering RGB+Thermal-infrared and RGB+Depth, demonstrate that our method significantly outperforms existing multi-modal tracking algorithms. Code is available at https://github.com/ccccwb/Multimodal-Detection-and-Tracking-UAV.
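To make the fusion mechanism concrete, the following PyTorch snippet is a minimal, single-scale, single-head sketch of cross-modal deformable attention in the Deformable-DETR style: queries from one modality predict a small set of sampling locations in the other modality's feature map and aggregate the bilinearly sampled values with learned attention weights. This is an illustration under our own assumptions, not the authors' released implementation (see the linked repository for that); the class name CrossModalDeformableAttention, the num_points parameter, and all tensor shapes are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalDeformableAttention(nn.Module):
    # Hypothetical sketch: each query from one modality samples num_points
    # locations in the other modality's feature map (single scale, single head).
    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, 2 * num_points)  # (dx, dy) per sampling point
        self.weight_proj = nn.Linear(dim, num_points)      # one weight per sampling point
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query_feat, other_feat):
        # query_feat: (B, C, H, W) features of the querying modality (e.g. RGB)
        # other_feat: (B, C, H, W) features of the auxiliary modality (e.g. TIR or depth)
        B, C, H, W = query_feat.shape
        q = query_feat.flatten(2).transpose(1, 2)          # (B, HW, C)
        v = self.value_proj(other_feat)                    # (B, C, H, W)

        # Reference grid in normalized [-1, 1] coordinates, one point per query.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=q.device),
            torch.linspace(-1, 1, W, device=q.device),
            indexing="ij")
        ref = torch.stack([xs, ys], dim=-1).view(1, H * W, 1, 2)

        # Each query predicts offsets (in normalized coordinates here, for
        # simplicity) and softmax-normalized attention weights.
        offsets = self.offset_proj(q).view(B, H * W, self.num_points, 2)
        weights = self.weight_proj(q).softmax(dim=-1)      # (B, HW, K)
        loc = (ref + offsets).clamp(-1, 1)                 # (B, HW, K, 2)

        # Bilinearly sample the other modality at the predicted locations.
        sampled = F.grid_sample(v, loc, align_corners=True)   # (B, C, HW, K)
        fused = (sampled * weights.unsqueeze(1)).sum(dim=-1)  # (B, C, HW)
        out = self.out_proj(fused.transpose(1, 2))            # (B, HW, C)
        return out.transpose(1, 2).reshape(B, C, H, W)

# Example usage on dummy features:
attn = CrossModalDeformableAttention(dim=256, num_points=4)
rgb = torch.randn(2, 256, 16, 16)
tir = torch.randn(2, 256, 16, 16)
out = attn(rgb, tir)  # (2, 256, 16, 16): RGB queries enriched with TIR context

Because each query attends to only num_points sampled locations rather than the full feature map, the cost of fusion grows linearly in the number of queries instead of quadratically, which is the usual source of the efficiency claim for deformable attention.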


Acknowledgments

This project is supported in part by the Key-Area Research and Development Program of Guangzhou (202206030003) and the National Natural Science Foundation of China (U22A2095, 62072482). We would like to thank Qi Chen and Jintang Bian for insightful discussions.

Author information

Corresponding author

Correspondence to Xiaohua Xie.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, Z., Cai, W., Dong, J., Lai, J., Xie, X. (2024). Feature Disentanglement and Adaptive Fusion for Improving Multi-modal Tracking. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14436. Springer, Singapore. https://doi.org/10.1007/978-981-99-8555-5_6

  • DOI: https://doi.org/10.1007/978-981-99-8555-5_6

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8554-8

  • Online ISBN: 978-981-99-8555-5

  • eBook Packages: Computer Science, Computer Science (R0)
