
STMT: Spatio-temporal memory transformer for multi-object tracking

  • Published in: Applied Intelligence

Abstract

Modern online Multi-Object Tracking (MOT) methods typically first detect objects in each frame and then associate the detections across successive frames. However, obtaining high-quality trajectories is difficult under camera motion, fast object motion, and occlusion. To address these problems, this paper proposes a transformer-based MOT system named the Spatio-Temporal Memory Transformer (STMT), which exploits temporal and historical information. STMT includes a Spatio-Temporal Enhancement Module (STEM) that uses 3D convolution to model the spatial and temporal interactions of objects and extract features rich in spatio-temporal information. Moreover, a Dynamic Spatio-Temporal Memory (DSTM) is presented to associate detections with tracklets; it contains three units: an Identity Aggregation Module (IAM), a Linear Dynamic Encoder (LD-Encoder), and a memory Decoder. The IAM exploits the geometric changes of objects to reduce the impact of deformation on tracking performance, the LD-Encoder captures the dependencies between objects, and the Decoder generates appearance similarity scores. Furthermore, a Score Fusion Equilibrium Strategy (SFES) balances the fused appearance similarity and position distance scores. Extensive experiments demonstrate that the proposed STMT is generally superior to state-of-the-art trackers on the MOT16 and MOT17 benchmarks.
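The abstract does not specify how the SFES combines the two cues, so the following is only a minimal, generic sketch of balancing appearance similarity against a position (IoU) score for detection-to-tracklet association; the weight `alpha` and the function names are hypothetical, not the paper's implementation.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def fused_scores(app_sim, track_boxes, det_boxes, alpha=0.6):
    """Blend appearance similarity with IoU-based position scores.

    app_sim: (num_tracks, num_dets) appearance similarities in [0, 1],
             e.g. produced by a memory decoder.
    track_boxes, det_boxes: lists of boxes in [x1, y1, x2, y2] format.
    alpha: hypothetical balance weight between the two cues.
    """
    pos = np.array([[iou(t, d) for d in det_boxes] for t in track_boxes])
    return alpha * np.asarray(app_sim, dtype=float) + (1.0 - alpha) * pos
```

In practice a tracker would run an assignment solver (e.g. `scipy.optimize.linear_sum_assignment` on the negated fused matrix) to obtain one-to-one matches between tracklets and detections.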


(Figs. 1–11 appear in the full article; no captions are available in this preview.)



Acknowledgements

This research was funded by the Young Elite Scientist Sponsorship Program of the Henan Association for Science and Technology (No. 2021HYTP014); the Henan Province Scientific and Technological Research Project (No. 222102220028); the Key Projects of Henan Province Colleges (No. 22A416004); and the National Natural Science Foundation of China (No. 62002100).

Author information


Corresponding authors

Correspondence to Qiyang Xiao or Wentao Shi.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gu, S., Ma, J., Hui, G. et al. STMT: Spatio-temporal memory transformer for multi-object tracking. Appl Intell 53, 23426–23441 (2023). https://doi.org/10.1007/s10489-023-04617-1

