Learning Dual-Fused Modality-Aware Representations for RGBD Tracking

Conference paper in: Computer Vision – ECCV 2022 Workshops (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13808)

Abstract

With the development of depth sensors in recent years, RGBD object tracking has received significant attention. Compared with traditional RGB-only tracking, the additional depth modality can effectively resolve interference between the target and the background. However, some existing RGBD trackers process the two modalities separately, so particularly useful information shared between them is ignored. Other methods fuse the two modalities by treating them equally, which loses modality-specific features. To tackle these limitations, we propose a novel Dual-fused Modality-aware Tracker (termed DMTracker), which aims to learn informative and discriminative representations of target objects for robust RGBD tracking. The first fusion module extracts the information shared between modalities based on cross-modal attention. The second integrates RGB-specific and depth-specific information to enhance the fused features. By fusing both modality-shared and modality-specific information in a modality-aware scheme, DMTracker can learn discriminative representations in complex tracking scenes. Experiments show that our proposed tracker achieves very promising results on challenging RGBD benchmarks. Code is available at https://github.com/ShangGaoG/DMTracker.
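To make the two fusion stages concrete, the sketch below shows one way a modality-shared module (cross-modal attention) and a modality-specific module (a learned gate) could be composed in PyTorch. It is a minimal illustration under our own assumptions: the class names, the gating design, and the use of torch.nn.MultiheadAttention are hypothetical stand-ins, not the authors' released architecture, which lives in the repository linked above.

```python
import torch
import torch.nn as nn


class SharedFusion(nn.Module):
    """Modality-shared fusion: each modality attends to the other
    (cross-modal attention), keeping information common to RGB and depth."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.rgb_from_d = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.d_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_rgb: torch.Tensor, f_d: torch.Tensor) -> torch.Tensor:
        # f_rgb, f_d: (B, N, C) token sequences from the two feature extractors.
        shared_rgb, _ = self.rgb_from_d(f_rgb, f_d, f_d)  # RGB queries, depth keys/values
        shared_d, _ = self.d_from_rgb(f_d, f_rgb, f_rgb)  # depth queries, RGB keys/values
        return shared_rgb + shared_d


class SpecificFusion(nn.Module):
    """Modality-specific fusion (illustrative): re-injects RGB-only and
    depth-only cues into the shared features via a learned per-token gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, shared, f_rgb, f_d):
        g = self.gate(torch.cat([f_rgb, f_d], dim=-1))  # per-token weights in (0, 1)
        specific = g * f_rgb + (1.0 - g) * f_d          # weighted modality-specific mix
        return shared + specific                        # modality-aware fused feature


# Toy usage: fuse 256-dim features over 64 spatial tokens for a batch of 2.
f_rgb, f_d = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
fused = SpecificFusion(256)(SharedFusion(256)(f_rgb, f_d), f_rgb, f_d)
print(fused.shape)  # torch.Size([2, 64, 256])
```

Here the cross-attention step lets each modality query the other for cues they share, while the gate decides, per token, how much RGB-specific versus depth-specific evidence to re-inject into the fused representation.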

Notes

  1. For a fair comparison, we use the DeT-DiMP50-Max checkpoint in all experiments.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants No. 61972188 and 62122035.

Author information

Corresponding author

Correspondence to Feng Zheng.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mp4 14157 KB)

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Gao, S., Yang, J., Li, Z., Zheng, F., Leonardis, A., Song, J. (2023). Learning Dual-Fused Modality-Aware Representations for RGBD Tracking. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13808. Springer, Cham. https://doi.org/10.1007/978-3-031-25085-9_27

  • DOI: https://doi.org/10.1007/978-3-031-25085-9_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25084-2

  • Online ISBN: 978-3-031-25085-9

  • eBook Packages: Computer Science, Computer Science (R0)
