Abstract
Typically, modern online Multi-Object Tracking (MOT) methods first detect objects in each frame and then associate the detections across successive frames. However, it is difficult to obtain high-quality trajectories under camera motion, fast motion, and occlusion. To address these problems, this paper proposes a transformer-based MOT system named Spatio-Temporal Memory Transformer (STMT), which exploits temporal and historical information. STMT includes a Spatio-Temporal Enhancement Module (STEM) that uses 3D convolution to model the spatial and temporal interactions of objects and to extract features rich in spatio-temporal information. Moreover, a Dynamic Spatio-Temporal Memory (DSTM) is presented to associate detections with tracklets; it contains three units: an Identity Aggregation Module (IAM), a Linear Dynamic Encoder (LD-Encoder), and a memory decoder. The IAM exploits the geometric changes of objects to reduce the impact of deformation on tracking performance, the LD-Encoder captures the dependencies between objects, and the decoder generates appearance similarity scores. Furthermore, a Score Fusion Equilibrium Strategy (SFES) is employed to balance the appearance similarity and position distance scores during fusion. Extensive experiments demonstrate that the proposed STMT approach is generally superior to state-of-the-art trackers on the MOT16 and MOT17 benchmarks.
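The pipeline described above ends with fusing appearance-similarity scores and position-distance scores before associating detections with tracklets. As a rough, hypothetical sketch of that final step (the paper's actual SFES weighting is not reproduced here; `alpha`, `fuse_scores`, and the greedy matcher below are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def fuse_scores(appearance_sim, position_dist, alpha=0.5):
    """Blend appearance similarity with (inverted) position distance.

    Both inputs are (num_tracklets, num_detections) matrices with values
    in [0, 1]; a higher fused score means a more likely match. The fixed
    `alpha` is a placeholder for the paper's equilibrium strategy.
    """
    return alpha * appearance_sim + (1.0 - alpha) * (1.0 - position_dist)

def greedy_match(scores, threshold=0.5):
    """Greedily pair tracklets with detections by descending fused score."""
    matches = []
    used_rows, used_cols = set(), set()
    # Enumerate all (tracklet, detection) pairs, best score first.
    order = np.dstack(
        np.unravel_index(np.argsort(-scores, axis=None), scores.shape)
    )[0]
    for r, c in order:
        if scores[r, c] < threshold:
            break  # remaining pairs are all below the match threshold
        if r not in used_rows and c not in used_cols:
            matches.append((int(r), int(c)))
            used_rows.add(r)
            used_cols.add(c)
    return matches

# Toy example: two tracklets, two detections.
sim = np.array([[0.9, 0.2], [0.1, 0.8]])    # appearance similarity
dist = np.array([[0.1, 0.9], [0.8, 0.2]])   # normalized position distance
fused = fuse_scores(sim, dist)
print(greedy_match(fused))  # → [(0, 0), (1, 1)]
```

In practice, trackers often replace the greedy loop with the Hungarian algorithm for a globally optimal assignment; the greedy variant is shown only to keep the sketch self-contained.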
Acknowledgements
This research was funded by the Young Elite Scientist Sponsorship Program of the Henan Association for Science and Technology (No. 2021HYTP014), the Henan Province Scientific and Technological Research Project (No. 222102220028), the Key Projects of Henan Province Colleges (No. 22A416004), and the National Natural Science Foundation of China (No. 62002100).
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gu, S., Ma, J., Hui, G. et al. STMT: Spatio-temporal memory transformer for multi-object tracking. Appl Intell 53, 23426–23441 (2023). https://doi.org/10.1007/s10489-023-04617-1
Accepted:
Published:
Issue Date: