
Multi-object Tracking with Spatial-Temporal Tracklet Association

Published: 11 January 2024

Abstract

Recently, tracking-by-detection methods have achieved excellent performance in Multi-Object Tracking (MOT); they focus on obtaining a robust feature for each object and generating tracklets based on feature similarity. However, they are confronted with two issues: (1) unstable features under short-term occlusion and (2) insufficient matching under long-term occlusion. Specifically, the unstable features are caused by appearance variation under occlusion, and associating objects with these unstable features leads to insufficient matching under long-term occlusion. To address these issues, we propose a two-stage tracklet-level association method, Spatial-Temporal Tracklet Association (STTA), which effectively exploits spatial-temporal context in both feature extraction and data association. In the first stage, we propose the Tracklet-guided Spatial-Temporal Attention network (TSTA) to generate robust and stable features; TSTA captures spatial-temporal context to locate the most salient regions between the current and previous clips. In the second stage, we design the Bi-Tracklet Spatial-Temporal association (BTST) module to fully exploit spatial-temporal context in data association; BTST merges tracklets into long-term trajectories by jointly learning visual features and spatial-temporal context, and uses a bidirectional interpolation to recover missed objects between matched tracklets. Extensive experiments with both public and private detections on four benchmarks demonstrate the robustness of STTA. Furthermore, the proposed method is model-agnostic and can be plugged into existing trackers to boost their performance, e.g., yielding IDF1 improvements of 11.0%, 10.1%, 2.9%, 3.2%, and 7.8% on the MOT16 validation set for Tracktor, CenterTrack, DeepSORT, JDE, and CTracker, respectively.
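
To make the bidirectional interpolation step concrete, below is a minimal, hypothetical Python sketch of how missed objects in the gap between two matched tracklets could be recovered: extrapolate forward from the tail of the earlier tracklet, extrapolate backward from the head of the later one, and blend the two estimates. The Tracklet class, the constant-velocity assumption, and the blending weights are illustrative assumptions for exposition, not the authors' implementation.

# Hypothetical sketch (not the authors' code): fill the gap between two matched
# tracklets by extrapolating forward from the first and backward from the second,
# then blending the two constant-velocity estimates frame by frame.
from dataclasses import dataclass
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


@dataclass
class Tracklet:
    track_id: int
    boxes: Dict[int, Box]  # frame index -> bounding box

    @property
    def start(self) -> int:
        return min(self.boxes)

    @property
    def end(self) -> int:
        return max(self.boxes)

    def velocity(self, at_end: bool) -> Box:
        """Per-frame box change estimated from the two boxes nearest one end."""
        frames = sorted(self.boxes)
        if len(frames) < 2:
            return (0.0, 0.0, 0.0, 0.0)
        f1, f2 = (frames[-2], frames[-1]) if at_end else (frames[0], frames[1])
        b1, b2 = self.boxes[f1], self.boxes[f2]
        return tuple((v2 - v1) / (f2 - f1) for v1, v2 in zip(b1, b2))


def bidirectional_interpolation(ta: Tracklet, tb: Tracklet) -> Dict[int, Box]:
    """Recover missed boxes between two matched tracklets (ta ends before tb starts)."""
    assert ta.end < tb.start, "tracklets must be temporally disjoint"
    va, vb = ta.velocity(at_end=True), tb.velocity(at_end=False)
    a_last, b_first = ta.boxes[ta.end], tb.boxes[tb.start]
    span = tb.start - ta.end
    filled: Dict[int, Box] = {}
    for f in range(ta.end + 1, tb.start):
        # Forward estimate: extrapolate from the tail of the earlier tracklet.
        fwd = tuple(x + v * (f - ta.end) for x, v in zip(a_last, va))
        # Backward estimate: extrapolate from the head of the later tracklet.
        bwd = tuple(x - v * (tb.start - f) for x, v in zip(b_first, vb))
        # Blend the two estimates, trusting each tracklet more near its own end.
        w = (f - ta.end) / span
        filled[f] = tuple((1 - w) * fc + w * bc for fc, bc in zip(fwd, bwd))
    return filled

In a full pipeline, the filled boxes would be appended to the merged trajectory produced by the tracklet association stage, so the occluded frames between the two matched tracklets carry interpolated detections rather than being left empty.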

Cited By

  • (2024) P2FTrack: Multi-Object Tracking with Motion Prior and Feature Posterior. ACM Transactions on Multimedia Computing, Communications, and Applications 21, 1, 1–22. DOI: 10.1145/3700443
  • (2024) SPLICEGNN: SPLIt and ConnEct Tracklets in a Unified Graph Neural Network. Pattern Recognition and Computer Vision, 315–329. DOI: 10.1007/978-981-97-8858-3_22

    Information

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 5
    May 2024
    650 pages
    EISSN: 1551-6865
    DOI: 10.1145/3613634
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 January 2024
    Online AM: 30 November 2023
    Accepted: 24 November 2023
    Revised: 20 October 2023
    Received: 25 May 2023
    Published in TOMM Volume 20, Issue 5


    Author Tags

    1. Multi-object tracking
    2. spatial-temporal tracklet association
    3. tracklet-guided spatial-temporal attention network
    4. bi-tracklet spatial-temporal association

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development Program of China
    • National Natural Science Foundation of China
    • Beijing Natural Science Foundation
    • Key Research and Development Program of Jiangsu Province

