Abstract
Current multi-object tracking (MOT) aims to predict the trajectories of targets (i.e., “where”) in videos. Yet, knowing merely “where” is insufficient for many crucial applications. In comparison, semantic understanding from videos, such as fine-grained behaviors, interactions, and overall summarized captions (i.e., “what”), associated with “where”, is highly desired for comprehensive video analysis. Thus motivated, we introduce Semantic Multi-Object Tracking (SMOT), which aims to estimate object trajectories while simultaneously understanding the semantic details of the associated trajectories, including instance captions, instance interactions, and an overall video caption, integrating “where” and “what” for tracking. To foster the exploration of SMOT, we propose BenSMOT, a large-scale Benchmark for Semantic MOT. Specifically, BenSMOT comprises 3,292 videos with 151K frames, covering various scenarios for the semantic tracking of humans. BenSMOT provides annotations for target trajectories, along with associated instance captions in natural language, instance interactions, and an overall caption for each video sequence. To the best of our knowledge, BenSMOT is the first publicly available benchmark for SMOT. In addition, to encourage future research, we present a novel tracker named SMOTer, which is specially designed and end-to-end trained for SMOT, showing promising performance. By releasing BenSMOT, we expect to go beyond conventional MOT by predicting both “where” and “what”, opening up a new direction in tracking for video understanding. We will release BenSMOT and SMOTer at https://github.com/Nathan-Li123/SMOTer.
H. Fan and L. Zhang: equal advising and co-last authors.
Notes
1. Here by “semantic”, we emphasize high-level, trajectory-based activity understanding in videos in the context of tracking, rather than object category as in semantic segmentation.
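To make the task outputs concrete, below is a minimal, hypothetical sketch (in Python) of how a single BenSMOT-style annotation or SMOT prediction could be organized, based only on the four output types named in the abstract (trajectories, instance captions, instance interactions, and an overall video caption). The class names, box format, and interaction encoding are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of an SMOT output record; field names and formats are assumed.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A bounding box as (frame_index, x, y, width, height); the format is an assumption.
Box = Tuple[int, float, float, float, float]


@dataclass
class InstanceAnnotation:
    """Per-target output: the 'where' (trajectory) plus the 'what' (caption)."""
    track_id: int
    trajectory: List[Box] = field(default_factory=list)
    caption: str = ""  # natural-language description of this instance's behavior


@dataclass
class SMOTAnnotation:
    """Video-level output: all instances, their interactions, and an overall caption."""
    video_id: str
    instances: Dict[int, InstanceAnnotation] = field(default_factory=dict)
    # Interactions between pairs of tracked instances, e.g. (1, 2, "talks to").
    interactions: List[Tuple[int, int, str]] = field(default_factory=list)
    video_caption: str = ""  # overall caption summarizing the whole sequence


if __name__ == "__main__":
    # Toy example illustrating the structure; the contents are invented.
    ann = SMOTAnnotation(video_id="example_0001")
    ann.instances[1] = InstanceAnnotation(
        track_id=1,
        trajectory=[(0, 10.0, 20.0, 50.0, 120.0), (1, 12.0, 21.0, 50.0, 120.0)],
        caption="a person in a red jacket walking along the sidewalk",
    )
    ann.interactions.append((1, 2, "walks next to"))
    ann.video_caption = "Two people walk together down a city street."
    print(ann.video_caption)
```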
Acknowledgements
Heng Fan was not supported by any fund for this work.