
Joint Detection and Association for End-to-End Multi-object Tracking

Published in: Neural Processing Letters

Abstract

Multi-object tracking (MOT) detects and tracks objects across multiple cameras and is widely applied in intelligent video surveillance and intelligent security. The MOT pipeline generally involves three important parts: feature extraction, multi-task learning, and object matching. Unfortunately, existing methods still have some drawbacks. First, the feature extraction module cannot effectively fuse shallow and deep features. Moreover, the multi-task learning module cannot strike a good balance between detection and re-identification. In addition, the object matching module associates pedestrians with a hand-crafted rule rather than a trained model. To address these problems, we propose joint detection and association (JDA), an end-to-end multi-object tracking network that combines multi-scale feature extraction with learnable object association. It first combines a feature extraction backbone based on multi-scale feature fusion with a point-based multi-task object detection branch to handle feature extraction and object detection. Then, a learnable object motion association module is embedded, which uses information from historical frames to infer the position of each object and to associate object identities between previous and subsequent frames. Because detection and matching are handled jointly, JDA can be trained end to end. The proposed JDA is evaluated through a series of experiments on MOT16 and MOT17. The results show that JDA outperforms existing methods in terms of the precision and stability of MOT.
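The abstract only outlines the architecture, so the following minimal PyTorch-style sketch is offered purely to fix ideas about the three named components: a backbone that fuses shallow and deep features, point-based multi-task heads, and an identity-association step. Every module name, layer size, and the greedy cosine matcher are hypothetical stand-ins, not the paper's actual implementation (in particular, the paper's association module is learnable and motion-based, whereas the matcher below is a simple appearance-similarity placeholder).

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBackbone(nn.Module):
    """Illustrative fusion of a shallow (high-resolution) and a deep (semantic) feature map."""
    def __init__(self, ch=64):
        super().__init__()
        self.shallow = nn.Conv2d(3, ch, 3, stride=2, padding=1)
        self.deep = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        s = F.relu(self.shallow(x))                 # shallow, fine-grained features
        d = F.relu(self.deep(s))                    # deeper, coarser features
        d_up = F.interpolate(d, size=s.shape[-2:])  # upsample deep map to match
        return self.fuse(torch.cat([s, d_up], 1))   # fused multi-scale feature map

class PointHeads(nn.Module):
    """Point-based multi-task heads: center heatmap, box size, identity embedding."""
    def __init__(self, ch=64, emb_dim=128):
        super().__init__()
        self.heatmap = nn.Conv2d(ch, 1, 1)      # object-center heatmap
        self.size = nn.Conv2d(ch, 2, 1)         # width/height regressed at each center
        self.embed = nn.Conv2d(ch, emb_dim, 1)  # per-location identity embedding

    def forward(self, f):
        return torch.sigmoid(self.heatmap(f)), self.size(f), self.embed(f)

def associate(prev_emb, curr_emb):
    """Greedy association by cosine similarity (placeholder for the learnable
    motion-association module described in the abstract)."""
    sim = F.normalize(prev_emb, dim=1) @ F.normalize(curr_emb, dim=1).T
    return sim.argmax(dim=1)  # best current-frame match for each previous track

# Usage on a dummy frame: detect, embed, then match against 5 existing tracks.
backbone, heads = FusionBackbone(), PointHeads()
frame = torch.randn(1, 3, 128, 128)
hm, wh, emb = heads(backbone(frame))
prev = torch.randn(5, 128)          # embeddings of 5 tracked identities
curr = emb[0].flatten(1).T[:10]     # embeddings of 10 candidate detections
print(associate(prev, curr))        # matched detection index per track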



Funding

This work was supported in part by the Young People Fund of Xinjiang Science and Technology Department (No. 2022D01B05).

Author information

Corresponding authors

Correspondence to Xinzhong Wang, Guangqiang Yin or Zhiguo Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest or competing interests. The data that support the findings of this study are available online; these datasets were derived from the following public-domain resources: MOT16 and MOT17.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, Y., Luo, X., Shi, J. et al. Joint Detection and Association for End-to-End Multi-object Tracking. Neural Process Lett 55, 11823–11844 (2023). https://doi.org/10.1007/s11063-023-11397-9

