Abstract
Current multi-object tracking (MOT) aims to predict the trajectories of targets (i.e., “where”) in videos. Yet, knowing merely “where” is insufficient for many crucial applications. In comparison, semantic understanding from videos, such as fine-grained behaviors, interactions, and overall summarized captions (i.e., “what”), associated with “where”, is highly desired for comprehensive video analysis. Thus motivated, we introduce Semantic Multi-Object Tracking (SMOT), which aims to estimate object trajectories while simultaneously understanding the semantic details of the associated trajectories, including instance captions, instance interactions, and an overall video caption, integrating “where” and “what” for tracking. To foster the exploration of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT. Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various scenarios for the semantic tracking of humans. BenSMOT provides annotations for target trajectories, along with associated instance captions in natural language, instance interactions, and an overall caption for each video sequence. To the best of our knowledge, BenSMOT is the first publicly available benchmark for SMOT. In addition, to encourage future research, we present a novel tracker named SMOTer, which is specially designed and end-to-end trained for SMOT, showing promising performance. By releasing BenSMOT, we expect to go beyond conventional MOT by predicting both “where” and “what”, opening up a new direction in tracking for video understanding. We will release BenSMOT and SMOTer at https://github.com/Nathan-Li123/SMOTer.
H. Fan and L. Zhang: equal advising and co-last authors.
Notes
1. Here by “semantic”, we emphasize high-level, trajectory-based activity understanding in videos in the context of tracking, rather than object category as in semantic segmentation.
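To make the task outputs concrete, below is a minimal, hypothetical sketch (in Python) of how a single BenSMOT-style annotation or SMOT prediction could be organized, based only on the four output types named in the abstract (trajectories, instance captions, instance interactions, and an overall video caption). The class names, box format, and interaction encoding are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of an SMOT output record; field names and formats are assumed.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A bounding box as (frame_index, x, y, width, height); the format is an assumption.
Box = Tuple[int, float, float, float, float]


@dataclass
class InstanceAnnotation:
    """Per-target output: the 'where' (trajectory) plus the 'what' (caption)."""
    track_id: int
    trajectory: List[Box] = field(default_factory=list)
    caption: str = ""  # natural-language description of this instance's behavior


@dataclass
class SMOTAnnotation:
    """Video-level output: all instances, their interactions, and an overall caption."""
    video_id: str
    instances: Dict[int, InstanceAnnotation] = field(default_factory=dict)
    # Interactions between pairs of tracked instances, e.g. (1, 2, "talks to").
    interactions: List[Tuple[int, int, str]] = field(default_factory=list)
    video_caption: str = ""  # overall caption summarizing the whole sequence


if __name__ == "__main__":
    # Toy example illustrating the structure; the contents are invented.
    ann = SMOTAnnotation(video_id="example_0001")
    ann.instances[1] = InstanceAnnotation(
        track_id=1,
        trajectory=[(0, 10.0, 20.0, 50.0, 120.0), (1, 12.0, 21.0, 50.0, 120.0)],
        caption="a person in a red jacket walking along the sidewalk",
    )
    ann.interactions.append((1, 2, "walks next to"))
    ann.video_caption = "Two people walk together down a city street."
    print(ann.video_caption)
```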
Acknowledgements
Heng Fan was not supported by any fund for this work.