
Beyond MOT: Semantic Multi-object Tracking

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15093)


Abstract

Current multi-object tracking (MOT) aims to predict trajectories of targets (i.e., “where”) in videos. Yet knowing merely “where” is insufficient for many crucial applications. By contrast, semantic understanding of videos, such as fine-grained behaviors, interactions, and overall summarized captions (i.e., “what”), associated with “where”, is highly desired for comprehensive video analysis. Thus motivated, we introduce Semantic Multi-Object Tracking (SMOT), which aims to estimate object trajectories while also understanding the semantic details of the associated trajectories, including instance captions, instance interactions, and overall video captions, integrating “where” and “what” for tracking. To foster the exploration of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT. Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various scenarios for semantic tracking of humans. BenSMOT provides annotations for the trajectories of targets, along with associated instance captions in natural language, instance interactions, and an overall caption for each video sequence. To the best of our knowledge, BenSMOT is the first publicly available benchmark for SMOT. In addition, to encourage future research, we present a novel tracker named SMOTer, which is specially designed and end-to-end trained for SMOT, showing promising performance. By releasing BenSMOT, we expect to go beyond conventional MOT by predicting both “where” and “what” for SMOT, opening up a new direction in tracking for video understanding. We will release BenSMOT and SMOTer at https://github.com/Nathan-Li123/SMOTer.

H. Fan and L. Zhang—Equal advising and co-last author.
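To make the annotation structure described in the abstract more concrete, the following is a minimal Python sketch of how a BenSMOT-style record could be organized: per-identity trajectories (“where”) paired with instance captions, pairwise interactions, and an overall video caption (“what”). The class and field names are illustrative assumptions, not the released dataset format.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    # Hypothetical schema for one annotated sequence; the actual BenSMOT
    # file format may differ from this sketch.
    BBox = Tuple[float, float, float, float]  # (x, y, w, h) in pixels

    @dataclass
    class Trajectory:
        track_id: int
        boxes: Dict[int, BBox] = field(default_factory=dict)  # frame index -> box ("where")
        instance_caption: str = ""  # natural-language description of this instance ("what")

    @dataclass
    class Interaction:
        subject_id: int        # track_id of the acting instance
        object_id: int         # track_id of the instance acted upon
        description: str = ""  # e.g., "hands an object to"

    @dataclass
    class SMOTAnnotation:
        video_id: str
        trajectories: List[Trajectory] = field(default_factory=list)
        interactions: List[Interaction] = field(default_factory=list)
        video_caption: str = ""  # overall summary of the whole sequence

Under such a layout, an SMOT prediction would be compared against a record of this kind on both fronts: box trajectories with standard MOT association metrics and the text fields with captioning metrics; the exact evaluation protocol for BenSMOT is defined in the full paper.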


Notes

  1. Here, by “semantic” we emphasize high-level, trajectory-based activity understanding in videos in the context of tracking, rather than object categories as in semantic segmentation.


Acknowledgements

Heng Fan was not supported by any funding for this work.

Author information


Corresponding author

Correspondence to Libo Zhang.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, Y. et al. (2025). Beyond MOT: Semantic Multi-object Tracking. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15093. Springer, Cham. https://doi.org/10.1007/978-3-031-72761-0_16

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72761-0_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72760-3

  • Online ISBN: 978-3-031-72761-0

  • eBook Packages: Computer Science, Computer Science (R0)
