3D Siamese Transformer Network for Single Object Tracking on Point Clouds

  • Conference paper
  • First Online:

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13662)

Abstract

Siamese network based trackers formulate 3D single object tracking as cross-correlation learning between the point features of a template and a search area. Because the target's appearance varies greatly between the template and the search area during tracking, learning a cross-correlation robust enough to identify the potential target in the search area remains challenging. In this paper, we explicitly use the Transformer to form a 3D Siamese Transformer network that learns robust cross-correlation between the template and the search area of point clouds. Specifically, we develop a Siamese point Transformer network to learn shape context information of the target. Its encoder uses self-attention to capture non-local information of point clouds and characterize the shape of the object, and its decoder utilizes cross-attention to upsample discriminative point features. After that, we develop an iterative coarse-to-fine correlation network to learn robust cross-correlation between the template and the search area. It formulates cross-feature augmentation to associate the template with the potential target in the search area via cross-attention. To further enhance the potential target, it employs ego-feature augmentation, which applies self-attention to a local k-NN graph in feature space to aggregate target features. Experiments on the KITTI, nuScenes, and Waymo datasets show that our method achieves state-of-the-art performance on the 3D single object tracking task. Source code is available at https://github.com/fpthink/STNet.
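
To make the two augmentation steps concrete, the sketch below illustrates them in PyTorch. It is a minimal sketch under stated assumptions, not the released STNet implementation: the module names (CrossFeatureAugmentation, EgoFeatureAugmentation), the feature dimension, the number of heads, the value of k, and the use of nn.MultiheadAttention in place of the paper's exact attention formulation are all assumptions, and positional encodings and the iterative coarse-to-fine loop are omitted.

```python
# Hedged sketch of the abstract's two correlation steps; module names,
# dimensions, and k are illustrative assumptions, not the authors' code
# (see https://github.com/fpthink/STNet for the real implementation).
import torch
import torch.nn as nn


class CrossFeatureAugmentation(nn.Module):
    """Associate the search area with the template via cross-attention:
    queries come from search-area points, keys/values from the template."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, search_feat, template_feat):
        # search_feat:   (B, N_s, C) point features of the search area
        # template_feat: (B, N_t, C) point features of the template
        out, _ = self.attn(search_feat, template_feat, template_feat)
        return self.norm(search_feat + out)  # residual connection


class EgoFeatureAugmentation(nn.Module):
    """Enhance the potential target via self-attention over each point's
    k nearest neighbors in feature space (a local k-NN graph)."""

    def __init__(self, dim: int = 128, heads: int = 4, k: int = 8):
        super().__init__()
        self.k = k
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat):
        # feat: (B, N, C); neighbors are found by feature-space distance,
        # so each point's nearest neighbor is itself (distance zero).
        B, N, C = feat.shape
        dist = torch.cdist(feat, feat)                  # (B, N, N)
        idx = dist.topk(self.k, largest=False).indices  # (B, N, k)
        # Gather the k neighbor features for every point: (B, N, k, C).
        neighbors = torch.gather(
            feat.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))
        # Each point attends to its own k-NN neighborhood.
        q = feat.reshape(B * N, 1, C)
        kv = neighbors.reshape(B * N, self.k, C)
        out, _ = self.attn(q, kv, kv)
        return self.norm(feat + out.reshape(B, N, C))
```

In the paper's coarse-to-fine scheme these two blocks alternate over several iterations on progressively refined features; the sketch shows a single pass, e.g. `search = EgoFeatureAugmentation()(CrossFeatureAugmentation()(search, template))`.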

Acknowledgment

The authors would like to thank the reviewers for their detailed comments and instructive suggestions. This work was supported by the National Science Fund of China (Grant Nos. U1713208 and 61876084).

Author information

Correspondence to Jin Xie or Jian Yang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 2759 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hui, L., Wang, L., Tang, L., Lan, K., Xie, J., Yang, J. (2022). 3D Siamese Transformer Network for Single Object Tracking on Point Clouds. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13662. Springer, Cham. https://doi.org/10.1007/978-3-031-20086-1_17

  • DOI: https://doi.org/10.1007/978-3-031-20086-1_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20085-4

  • Online ISBN: 978-3-031-20086-1

  • eBook Packages: Computer Science, Computer Science (R0)
