Multi-object tracking based on graph neural networks

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Multi-Object Tracking (MOT), an essential task in computer vision, underperforms in the presence of occlusions and motion blur, both of which change an object’s appearance. We develop three modules based on Graph Neural Networks (GNNs) to handle such appearance changes. The appearance enhancement module boosts appearance features by applying self-attention and a Graph Convolutional Neural Network (GCNN) to local features. The temporal feature updating module automatically updates a tracklet’s appearance template using GCNNs with different Laplacian operations. The spatial feature updating module encodes interactive spatial features by combining a graph attention network with a GCNN. After processing input video frames with these three modules, our tracker stores all extracted features in a memory bank and forwards them to a matching algorithm to complete tracking. On the popular benchmark datasets MOT16, MOT17, and MOT20, we show that introducing GNNs to MOT benefits tracking, and that the proposed tracker surpasses state-of-the-art trackers, including StrongSORT, ByteTrack, and BoT-SORT. Specifically, it achieves 81.1% (77.9%) MOTA, 80.3% (77.3%) IDF1, and 65.1% (63.2%) HOTA on the challenging MOT17 (and the newer MOT20) datasets.
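The building block shared by the three modules, graph convolution over per-detection features, can be sketched as a standard symmetrically normalized GCN layer. The snippet below is an illustrative NumPy sketch under stated assumptions: the function name, toy adjacency, and feature sizes are hypothetical and do not reproduce the paper's implementation.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: add self-loops, symmetrically
    normalize the adjacency (D^{-1/2} A_hat D^{-1/2}), aggregate
    neighbor features, then apply a linear projection and ReLU."""
    a_hat = adj + np.eye(adj.shape[0])          # adjacency with self-loops
    deg = a_hat.sum(axis=1)                     # node degrees
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))    # D^{-1/2}
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt  # normalized propagation matrix
    return np.maximum(norm_adj @ feats @ weight, 0.0)  # ReLU activation

# Toy example: 3 detections linked in a chain, 4-dim appearance
# features projected to 2 dims (all sizes are illustrative).
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
feats = rng.random((3, 4))
weight = rng.random((4, 2))
out = gcn_layer(adj, feats, weight)  # shape (3, 2)
```

In a tracker, the nodes would be detections or tracklet templates and the adjacency would encode spatial or temporal neighborhoods; stacking such layers lets each detection's feature absorb context from its neighbors, which is what makes the features robust to per-frame appearance corruption.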



Data availability

Data will be made available on request.

References

  1. Wang, G., et al.: Track without appearance: learn box and tracklet embedding with local and global motion patterns for vehicle tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, Canada, pp. 9856–9866 (2021). https://doi.org/10.1109/ICCV48922.2021.00973

  2. Li, J., Gao, X., Jiang, T.: Graph networks for multiple object tracking. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Snowmass, USA, pp. 708–717 (2020). https://doi.org/10.1109/WACV45572.2020.9093347

  3. Hu, Y., Gao, J., Xu, C.: Learning scene-aware spatio-temporal gnns for few-shot early action prediction. IEEE Trans. Multim. 25, 2061–2073 (2023). https://doi.org/10.1109/TMM.2022.3142413


  4. Hu, J., Hooi, B., He, B.: Efficient heterogeneous graph learning via random projection. IEEE Trans. Knowl. Data Eng. 36(12), 8093–8107 (2024). https://doi.org/10.1109/TKDE.2024.3434956


  5. Hu, Y., Gao, J., Xu, C.: Learning dual-pooling graph neural networks for few-shot video classification. IEEE Trans. Multim. 23, 4285–4296 (2021). https://doi.org/10.1109/TMM.2020.3039329


  6. Zhang, J.: TGCN: time domain graph convolutional network for multiple objects tracking. arXiv:2101.01861 (2021)

  7. Yu, F., et al.: Poi: multiple object tracking with high performance detection and appearance feature. In: European Conference on Computer Vision (ECCV). Amsterdam, The Netherlands, vol. 9914, pp. 36–42 (2016). https://doi.org/10.1007/978-3-319-48881-3_3

  8. Zhang, Y., et al.: Fairmot: on the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 129(11), 3069–3087 (2021). https://doi.org/10.1007/s11263-021-01513-4


  9. Du, Y., et al.: Strongsort: make deepsort great again. IEEE Trans. Multim. 25, 8725–8737 (2023). https://doi.org/10.1109/TMM.2023.3240881


  10. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE international conference on image processing (ICIP). Beijing, China, pp. 3645–3649 (2017). https://doi.org/10.1109/ICIP.2017.8296962

  11. Papakis, I., Sarkar, A., Karpatne, A.: A graph convolutional neural network based approach for traffic monitoring using augmented detections with optical flow. In: 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). Indianapolis, USA, pp. 2980–2986 (2021). https://doi.org/10.1109/ITSC48978.2021.9564655

  12. Lan, L., et al.: Interacting tracklets for multi-object tracking. IEEE Trans. Image Process. 27(9), 4585–4597 (2018). https://doi.org/10.1109/TIP.2018.2843129


  13. He, J., et al.: Learnable graph matching: incorporating graph partitioning with deep feature learning for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA, pp. 5295–5305 (2021). https://doi.org/10.1109/CVPR46437.2021.00526

  14. Aharon, N., Orfaig, R., Bobrovsky, B.-Z.: BoT-SORT: robust associations multi-pedestrian tracking. arXiv:2206.14651 (2022)

  15. Liang, T., et al.: Enhancing the association in multi-object tracking via neighbor graph. Int. J. Intell. Syst. 36(11), 6713–6730 (2021). https://doi.org/10.1002/int.22565


  16. Ma, C., et al.: Deep association: end-to-end graph-based learning for multiple object tracking with conv-graph neural network. In: Proceedings of the 2019 on International Conference on Multimedia Retrieval (ICMR). Ottawa ON, Canada, pp. 253–261 (2019). https://doi.org/10.1145/3323873.3325010

  17. Zhang, Y., et al.: Bytetrack: multi-object tracking by associating every detection box. In: European Conference on Computer Vision (ECCV). Tel Aviv, Israel, pp. 1–21 (2022). https://doi.org/10.1007/978-3-031-20047-2_1

  18. Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Lecce, Italy, pp. 1–6 (2017). https://doi.org/10.1109/AVSS.2017.8078516

  19. Brasó, G., Leal-Taixé, L.: Learning a neural solver for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, USA, pp. 6247–6257 (2020). https://doi.org/10.1109/CVPR42600.2020.00628

  20. Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South), pp. 941–951 (2019). https://doi.org/10.1109/ICCV.2019.00103

  21. Cui, Y., et al.: Tf-blender: temporal feature blender for video object detection. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8118–8127 (2021). https://doi.org/10.1109/ICCV48922.2021.00803

  22. Ren, H., et al.: Focus on details: online multi-object tracking with diverse fine-grained representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada, pp. 11289–11298 (2023). https://doi.org/10.1109/CVPR52729.2023.01086

  23. Du, Y., et al.: GIAOTracker: a comprehensive framework for mcmot with global information and optimizing strategies in visdrone 2021. In: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Montreal, Canada, pp. 2809–2819 (2021). https://doi.org/10.1109/ICCVW54120.2021.00315

  24. Wang, Y.-H., et al.: Smiletrack: similarity learning for occlusion-aware multiple object tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, Canada (2024), vol. 38, no. 6, pp. 5740–5748. https://doi.org/10.1609/aaai.v38i6.28386

  25. Liu, D., et al.: Sg-net: spatial granularity network for one-stage video instance segmentation. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9811–9820 (2021). https://doi.org/10.1109/CVPR46437.2021.00969

  26. Liu, Q., et al.: GSM: graph similarity model for multi-object tracking. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI). Yokohama, Japan, pp. 530–536 (2021)

  27. Wang, Y., Kitani, K., Weng, X.: Joint object detection and multi-object tracking with graph neural networks. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). Xi’an, China, pp. 13708–13715 (2021). https://doi.org/10.1109/ICRA48506.2021.9561110

  28. Chu, P., et al.: Transmot: spatial-temporal graph transformer for multiple object tracking. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Waikoloa, USA, pp. 4859–4869 (2023). https://doi.org/10.1109/WACV56688.2023.00485

  29. Hyun, J., et al.: Detection recovery in online multi-object tracking with sparse graph tracker. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). Waikoloa, USA, pp. 4839–4848 (2023). https://doi.org/10.1109/WACV56688.2023.00483

  30. Wu, M., et al.: Multiview vehicle tracking by graph matching model. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 29–36 (2019)

  31. Wang, Z., et al.: Towards real-time multi-object tracking. In: European Conference on Computer Vision (ECCV). Glasgow, UK, vol. 12356, pp. 107–122 (2020). https://doi.org/10.1007/978-3-030-58621-8_7

  32. Cai, J., et al.: Memot: multi-object tracking with memory. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, USA, pp. 8080–8090 (2022). https://doi.org/10.1109/CVPR52688.2022.00792

  33. Wang, H., et al.: Sture: spatial–temporal mutual representation learning for robust data association in online multi-object tracking. Comput. Vis. Image Underst. 220, 103433 (2022). https://doi.org/10.1016/j.cviu.2022.103433


  34. Zhu, T., et al.: Looking beyond two frames: End-to-end multi-object tracking using spatial and temporal transformers. IEEE Trans. Pattern Anal. Mach. Intell. 45(11), 12783–12797 (2023). https://doi.org/10.1109/TPAMI.2022.3213073


  35. Li, Q., Han, Z., Wu, X.-M.: Deeper insights into graph convolutional networks for semi-supervised learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans, USA, vol. 32, no. 1 (2018). https://doi.org/10.1609/aaai.v32i1.11604

  36. Feng, W., et al.: Online multiple-pedestrian tracking with detection-pair-based graph convolutional networks. IEEE Internet Things J. 9(24), 25086–25099 (2022). https://doi.org/10.1109/JIOT.2022.3195359


  37. He, L., et al.: Fastreid: a pytorch toolbox for general instance re-identification. In: Proceedings of the 31st ACM International Conference on Multimedia (MM). Ottawa, Canada, pp. 9664–9667 (2023). https://doi.org/10.1145/3581783.3613460

  38. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. (2017)

  39. Tan, H., et al.: MHSA-Net: multihead self-attention network for occluded person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 34(11), 8210–8224 (2023). https://doi.org/10.1109/TNNLS.2022.3144163


  40. Veličković, P., et al.: Graph attention networks. arXiv:1710.10903 (2017)

  41. Milan, A., et al.: MOT16: a benchmark for multi-object tracking. arXiv:1603.00831 (2016)

  42. Dendorfer, P., et al.: Motchallenge: a benchmark for single-camera multiple target tracking. Int. J. Comput. Vis. 129, 845–881 (2021). https://doi.org/10.1007/s11263-020-01393-0


  43. Dendorfer, P., et al.: Mot20: a benchmark for multi object tracking in crowded scenes. arXiv:2003.09003 (2020)

  44. Luiten, J., et al.: Hota: a higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 129(2), 548–578 (2021). https://doi.org/10.1007/s11263-020-01375-2


  45. Ren, W., et al.: Joint counting, detection and re-identification for multi-object tracking. arXiv:2212.05861 (2024)

  46. You, S., et al.: UTM: a unified multiple object tracking model with identity-aware feature enhancement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada, pp. 21876–21886 (2023). https://doi.org/10.1109/CVPR52729.2023.02095

  47. Kong, J., et al.: MOTFR: multiple object tracking based on feature recoding. IEEE Trans. Circuits Syst. Video Technol. 32(11), 7746–7757 (2022). https://doi.org/10.1109/TCSVT.2022.3182709


  48. Wu, J., et al.: Track to detect and segment: An online multi-object tracker. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA, pp. 12347–12356 (2021). https://doi.org/10.1109/CVPR46437.2021.01217

  49. Ren, S., et al.: Faster r-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2016). https://doi.org/10.1109/TPAMI.2016.2577031


  50. Zeng, F., et al.: Motr: end-to-end multiple-object tracking with transformer. In: European Conference on Computer Vision (ECCV). Tel Aviv, Israel, vol. 13687, pp. 659–675 (2022). https://doi.org/10.1007/978-3-031-19812-0_38

  51. Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: European Conference on Computer Vision (ECCV). Glasgow, UK, vol. 12349, pp. 474–490 (2020). https://doi.org/10.1007/978-3-030-58548-8_28

  52. Liang, J., et al.: Clusterfomer: clustering as a universal visual learner. Adv. Neural. Inf. Process. Syst. 36, 64029–64042 (2023)


  53. Wang, T., et al.: M2pt: multimodal prompt tuning for zero-shot instruction learning. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Miami, Florida, USA, pp. 3723–3740 (2024). https://doi.org/10.18653/v1/2024.emnlp-main.218

  54. Han, C., et al.: Prototypical transformer as unified motion learners. In: Forty-first international conference on machine learning (ICML), no. 69. (2024). https://openreview.net/forum?id=JOrLz5d7OW


Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61771155).

Author information

Authors and Affiliations

Authors

Contributions

Yubo Zhang: methodology; software; visualization; writing—original draft. Liying Zheng: conceptualization; funding acquisition; resources; writing—review and editing; project administration; supervision. Qingming Huang: supervision; writing—review and editing.

Corresponding author

Correspondence to Liying Zheng.

Ethics declarations

Conflict of interest

All authors disclosed no relevant relationships.

Additional information

Communicated by Junyu Gao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Zhang, Y., Zheng, L. & Huang, Q. Multi-object tracking based on graph neural networks. Multimedia Systems 31, 89 (2025). https://doi.org/10.1007/s00530-025-01679-8


Keywords