Abstract
Typically, modern online Multi-Object Tracking (MOT) methods first detect objects in each frame and then associate the detections across successive frames. However, it is difficult to obtain high-quality trajectories under camera motion, fast motion, and occlusion. To address these problems, this paper proposes a transformer-based MOT system named Spatio-Temporal Memory Transformer (STMT), which exploits temporal and historical information. STMT includes a Spatio-Temporal Enhancement Module (STEM) that uses 3D convolution to model the spatial and temporal interactions of objects and to extract features rich in spatio-temporal information. Moreover, a Dynamic Spatio-Temporal Memory (DSTM) is presented to associate detections with tracklets; it contains three units: an Identity Aggregation Module (IAM), a Linear Dynamic Encoder (LD-Encoder), and a memory decoder. The IAM exploits the geometric changes of objects to reduce the impact of deformation on tracking performance, the LD-Encoder captures the dependencies between objects, and the decoder generates appearance similarity scores. Furthermore, a Score Fusion Equilibrium Strategy (SFES) is employed to balance the appearance similarity and position distance scores during fusion. Extensive experiments demonstrate that the proposed STMT approach is generally superior to state-of-the-art trackers on the MOT16 and MOT17 benchmarks.
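The pipeline described above ends with fusing appearance-similarity scores and position-distance scores before associating detections with tracklets. As a rough, hypothetical sketch of that final step (the paper's actual SFES weighting is not reproduced here; `alpha`, `fuse_scores`, and the greedy matcher below are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def fuse_scores(appearance_sim, position_dist, alpha=0.5):
    """Blend appearance similarity with (inverted) position distance.

    Both inputs are (num_tracklets, num_detections) matrices with values
    in [0, 1]; a higher fused score means a more likely match. The fixed
    `alpha` is a placeholder for the paper's equilibrium strategy.
    """
    return alpha * appearance_sim + (1.0 - alpha) * (1.0 - position_dist)

def greedy_match(scores, threshold=0.5):
    """Greedily pair tracklets with detections by descending fused score."""
    matches = []
    used_rows, used_cols = set(), set()
    # Enumerate all (tracklet, detection) pairs, best score first.
    order = np.dstack(
        np.unravel_index(np.argsort(-scores, axis=None), scores.shape)
    )[0]
    for r, c in order:
        if scores[r, c] < threshold:
            break  # remaining pairs are all below the match threshold
        if r not in used_rows and c not in used_cols:
            matches.append((int(r), int(c)))
            used_rows.add(r)
            used_cols.add(c)
    return matches

# Toy example: two tracklets, two detections.
sim = np.array([[0.9, 0.2], [0.1, 0.8]])    # appearance similarity
dist = np.array([[0.1, 0.9], [0.8, 0.2]])   # normalized position distance
fused = fuse_scores(sim, dist)
print(greedy_match(fused))  # → [(0, 0), (1, 1)]
```

In practice, trackers often replace the greedy loop with the Hungarian algorithm for a globally optimal assignment; the greedy variant is shown only to keep the sketch self-contained.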
Acknowledgements
This research was funded by the Young Elite Scientist Sponsorship Program of the Henan Association for Science and Technology (No. 2021HYTP014), the Henan Province Scientific and Technological Research Project (No. 222102220028), the Key Projects of Henan Province Colleges (No. 22A416004), and the National Natural Science Foundation of China (No. 62002100).
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gu, S., Ma, J., Hui, G. et al. STMT: Spatio-temporal memory transformer for multi-object tracking. Appl Intell 53, 23426–23441 (2023). https://doi.org/10.1007/s10489-023-04617-1
Accepted:
Published:
Issue Date: