Abstract
With the development of depth sensors in recent years, RGBD object tracking has received significant attention. Compared with traditional RGB object tracking, the additional depth modality can effectively resolve interference between the target and the background. However, some existing RGBD trackers process the two modalities separately and thus ignore particularly useful information shared between them, while other methods fuse the two modalities by treating them equally, losing modality-specific features. To tackle these limitations, we propose a novel Dual-fused Modality-aware Tracker (termed DMTracker), which aims to learn informative and discriminative representations of target objects for robust RGBD tracking. The first fusion module extracts the information shared between modalities via cross-modal attention. The second integrates RGB-specific and depth-specific information to enhance the fused features. By fusing both modality-shared and modality-specific information in a modality-aware scheme, our DMTracker can learn discriminative representations in complex tracking scenes. Experiments show that the proposed tracker achieves very promising results on challenging RGBD benchmarks. Code is available at https://github.com/ShangGaoG/DMTracker.
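To make the two fusion steps concrete, below is a minimal PyTorch sketch of the idea the abstract describes: a cross-modal attention block in which each modality attends to the other to extract shared features, followed by a gated re-injection of RGB-specific and depth-specific features. The module names, the gating scheme, and all tensor shapes here are illustrative assumptions rather than the authors' implementation; the actual code is in the repository linked above.

```python
# Sketch of a dual-fused modality-aware pipeline (assumed design, not the
# authors' code): shared-feature extraction via cross-modal attention,
# then gated enhancement with modality-specific features.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Extracts modality-shared features: each modality attends to the other."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # RGB queries attend to depth keys/values, and vice versa.
        self.rgb_to_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        # f_rgb, f_depth: (B, N, C) token sequences from each modality branch.
        shared_rgb, _ = self.rgb_to_depth(f_rgb, f_depth, f_depth)
        shared_depth, _ = self.depth_to_rgb(f_depth, f_rgb, f_rgb)
        # Merge the two cross-attended streams into one shared representation.
        return self.norm((shared_rgb + shared_depth) / 2)


class SpecificityEnhancement(nn.Module):
    """Re-injects RGB-specific and depth-specific cues into the shared features."""

    def __init__(self, dim: int):
        super().__init__()
        # Per-modality sigmoid gates decide how much specific information
        # from each branch is added back to the shared representation.
        self.gate_rgb = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_depth = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, shared, f_rgb, f_depth):
        return shared + self.gate_rgb(f_rgb) * f_rgb + self.gate_depth(f_depth) * f_depth


if __name__ == "__main__":
    B, N, C = 2, 64, 256  # batch, tokens, channels (arbitrary for the demo)
    f_rgb, f_depth = torch.randn(B, N, C), torch.randn(B, N, C)
    shared = CrossModalAttentionFusion(C)(f_rgb, f_depth)
    fused = SpecificityEnhancement(C)(shared, f_rgb, f_depth)
    print(fused.shape)  # torch.Size([2, 64, 256])
```

The key design point, under these assumptions, is that the shared and specific pathways stay separate until the final sum, so neither the common cross-modal cues nor the per-modality details are averaged away.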
Notes
1. For a fair comparison, we use the DeT-DiMP50-Max checkpoint in all experiments.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant Nos. 61972188 and 62122035.
Electronic Supplementary Material
Supplementary material 1 (mp4, 14157 KB)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Gao, S., Yang, J., Li, Z., Zheng, F., Leonardis, A., Song, J. (2023). Learning Dual-Fused Modality-Aware Representations for RGBD Tracking. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13808. Springer, Cham. https://doi.org/10.1007/978-3-031-25085-9_27
DOI: https://doi.org/10.1007/978-3-031-25085-9_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-25084-2
Online ISBN: 978-3-031-25085-9
eBook Packages: Computer Science (R0)