Abstract
Multi-Object Tracking (MOT), an essential task in computer vision, underperforms when occlusions or motion blur alter object appearances. We develop three modules based on Graph Neural Networks (GNNs) to handle such appearance changes. The appearance enhancement module strengthens appearance features by applying self-attention and a Graph Convolutional Neural Network (GCNN) to local features. The temporal feature updating module automatically updates a tracklet's appearance template using GCNNs with different Laplacian operators. The spatial feature updating module encodes interactive spatial features by combining a graph attention network with a GCNN. After processing the input video frames with these three modules, our tracker stores all extracted features in a memory bank and forwards them to a matching algorithm to complete tracking. On the popular benchmark datasets MOT16, MOT17, and MOT20, we show that introducing GNNs to MOT benefits tracking and that the proposed tracker surpasses state-of-the-art trackers, including StrongSORT, ByteTrack, and BoT-SORT. Specifically, we achieve 81.1% (77.9%) MOTA, 80.3% (77.3%) IDF1, and 65.1% (63.2%) HOTA on the challenging MOT17 (and newest MOT20) datasets.
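All three modules described above rest on graph-convolution propagation over a detection graph. As a rough, generic illustration (not the paper's implementation, which additionally uses self-attention, graph attention, and learned weights), a single GCN layer with the standard symmetric-normalized adjacency can be sketched in NumPy:

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization A_hat = D^{-1/2} (A + I) D^{-1/2},
    the propagation operator of a standard GCN layer."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    d = A_tilde.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(H, A, W):
    """One graph-convolution step: aggregate neighbor features,
    project with W, then apply ReLU."""
    return np.maximum(normalized_adjacency(A) @ H @ W, 0.0)

# Toy graph of 3 detections: 0-1 and 1-2 interact (e.g. spatially close).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.random.default_rng(0).normal(size=(3, 4))  # per-detection appearance features
W = np.random.default_rng(1).normal(size=(4, 4))  # projection (random here, learned in practice)
H_out = gcn_layer(H, A, W)
print(H_out.shape)  # (3, 4)
```

Each output row mixes a detection's own features with those of its graph neighbors, which is the mechanism the appearance, temporal, and spatial modules exploit (with different graph structures and Laplacian variants).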
Data availability
Data will be made available on request.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (No. 61771155).
Author information
Contributions
Yubo Zhang: methodology; software; visualization; writing—original draft. Liying Zheng: conceptualization; funding acquisition; resources; writing—review and editing; project administration; supervision. Qingming Huang: supervision; writing—review and editing.
Ethics declarations
Conflict of interest
The authors declare no relevant conflicts of interest.
Additional information
Communicated by Junyu Gao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Y., Zheng, L. & Huang, Q. Multi-object tracking based on graph neural networks. Multimedia Systems 31, 89 (2025). https://doi.org/10.1007/s00530-025-01679-8