Abstract
Multi-modal tracking has gained increasing attention due to its superior accuracy and robustness in complex scenarios. The primary challenges in this field lie in effectively extracting and fusing multi-modal data across the inherent modality gap. To address these issues, we propose a novel regularized single-stream multi-modal tracking framework, drawing inspiration from the perspective of disentanglement. Specifically, accounting for the intrinsic similarities and differences between modalities, we design a modality-specific weight-sharing feature extraction module to extract well-disentangled multi-modal features. To emphasize feature-level specificity across modalities, we propose a cross-modal deformable attention mechanism that adaptively and efficiently integrates multi-modal features. Through extensive experiments on three multi-modal tracking benchmarks, including RGB+Thermal infrared and RGB+Depth, we demonstrate that our method significantly outperforms existing multi-modal tracking algorithms. Code is available at https://github.com/ccccwb/Multimodal-Detection-and-Tracking-UAV.
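Since the chapter body is not reproduced on this page, the sketch below is only a rough PyTorch illustration of the two ideas the abstract names: a shared-weight extractor with small modality-specific heads, and a simplified cross-modal deformable attention that samples auxiliary-modality features at learned offsets. All module names, shapes, and hyper-parameters here (DisentangledExtractor, CrossModalDeformableAttention, dim=256, n_points=4) are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of the abstract's two components, written from the abstract
# alone: every name and shape below is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DisentangledExtractor(nn.Module):
    """Shared trunk weights plus small modality-specific projections, so
    common structure is reused while each modality keeps its own cues."""

    def __init__(self, dim=256):
        super().__init__()
        # Shared trunk: applied identically to both modalities.
        self.shared = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3),
            nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        # Modality-specific 1x1 heads (hypothetical disentangling branches).
        self.head_rgb = nn.Conv2d(dim, dim, 1)
        self.head_aux = nn.Conv2d(dim, dim, 1)  # thermal or depth

    def forward(self, rgb, aux):
        return self.head_rgb(self.shared(rgb)), self.head_aux(self.shared(aux))


class CrossModalDeformableAttention(nn.Module):
    """Each RGB location predicts a few sampling offsets, gathers auxiliary
    features at those offsets via grid_sample, and fuses them with learned
    weights -- a simplified, single-head take on deformable attention."""

    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Conv2d(dim, 2 * n_points, 1)  # (dx, dy) per point
        self.weights = nn.Conv2d(dim, n_points, 1)      # per-point weight
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, query, aux):
        B, C, H, W = query.shape
        # Base sampling grid in [-1, 1] (grid_sample convention, x before y).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=query.device),
            torch.linspace(-1, 1, W, device=query.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)                # (H, W, 2)
        off = self.offsets(query).view(B, self.n_points, 2, H, W)
        w = self.weights(query).softmax(dim=1)              # (B, P, H, W)
        fused = 0.0
        for p in range(self.n_points):
            grid = base + off[:, p].permute(0, 2, 3, 1)     # (B, H, W, 2)
            sampled = F.grid_sample(aux, grid, align_corners=True)
            fused = fused + w[:, p : p + 1] * sampled
        # Residual fusion: keep RGB content, add adaptively sampled aux cues.
        return query + self.proj(fused)


if __name__ == "__main__":
    rgb = torch.randn(2, 3, 256, 256)
    thermal = torch.randn(2, 3, 256, 256)
    extractor = DisentangledExtractor()
    fuse = CrossModalDeformableAttention()
    f_rgb, f_aux = extractor(rgb, thermal)
    print(fuse(f_rgb, f_aux).shape)  # torch.Size([2, 256, 32, 32])
```

The residual form of the fusion step reflects the single-stream framing: the RGB query stream is kept intact and only augmented with adaptively sampled auxiliary features, rather than concatenating both modalities into a second stream.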
Acknowledgments
This project is supported in part by the Key-Area Research and Development Program of Guangzhou (202206030003) and the National Natural Science Foundation of China (U22A2095, 62072482). We would like to thank Qi Chen and Jintang Bian for insightful discussions.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, Z., Cai, W., Dong, J., Lai, J., Xie, X. (2024). Feature Disentanglement and Adaptive Fusion for Improving Multi-modal Tracking. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14436. Springer, Singapore. https://doi.org/10.1007/978-981-99-8555-5_6
DOI: https://doi.org/10.1007/978-981-99-8555-5_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8554-8
Online ISBN: 978-981-99-8555-5