Abstract
Multi-object tracking (MOT) requires accurately identifying and tracking multiple targets over long periods. However, tracking performance is highly susceptible to factors such as target deformation and occlusion. Moreover, most existing MOT models perform simple aggregation and classification of target features, ignoring the inherent differences and connections between detection and re-identification, which often leads to frequent identity switches. To address these issues, we propose IFMOT, a simple and efficient tracker that combines an interactive perception network with feature optimization. Specifically, the interactive perception network uses a multi-head cross-attention mechanism to alleviate conflicts between the detection and re-identification features, and a feature optimization module refines the target representation to improve the quality of the extracted feature embeddings. Furthermore, a feature integration similarity matrix is used to comprehensively assess the similarity between objects and to handle unreliable similarity matching. Experiments on the MOT16, MOT17, MOT20, and DanceTrack datasets show that the proposed method achieves higher accuracy than other state-of-the-art trackers while maintaining comparable tracking speed.
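The article itself includes no code, but as a rough, hypothetical sketch of how a feature-integrated similarity matrix might fuse spatial overlap (IoU) with appearance similarity before association, consider the Python example below. The weighting factor alpha, the helper functions, and the embedding dimension are illustrative assumptions and are not details taken from IFMOT.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """Pairwise IoU between track boxes and detection boxes (x1, y1, x2, y2)."""
    t = tracks[:, None, :]                      # (T, 1, 4)
    d = dets[None, :, :]                        # (1, D, 4)
    lt = np.maximum(t[..., :2], d[..., :2])     # top-left of intersection
    rb = np.minimum(t[..., 2:], d[..., 2:])     # bottom-right of intersection
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_t = (t[..., 2] - t[..., 0]) * (t[..., 3] - t[..., 1])
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    return inter / (area_t + area_d - inter + 1e-9)

def fused_similarity(tracks, dets, track_emb, det_emb, alpha=0.5):
    """Blend spatial IoU with appearance cosine similarity into one matrix."""
    iou = iou_matrix(tracks, dets)
    t = track_emb / (np.linalg.norm(track_emb, axis=1, keepdims=True) + 1e-9)
    d = det_emb / (np.linalg.norm(det_emb, axis=1, keepdims=True) + 1e-9)
    cos = t @ d.T                               # (T, D) appearance similarity
    return alpha * iou + (1.0 - alpha) * cos

# Hungarian assignment on the fused cost (1 - similarity)
tracks = np.array([[10, 10, 50, 80], [60, 20, 100, 90]], dtype=float)
dets = np.array([[12, 11, 52, 82], [58, 22, 99, 91]], dtype=float)
sim = fused_similarity(tracks, dets,
                       track_emb=np.random.rand(2, 128),
                       det_emb=np.random.rand(2, 128))
row, col = linear_sum_assignment(1.0 - sim)
print(list(zip(row, col)))                      # matched (track, detection) pairs
```

In such a scheme, weighting alpha toward the IoU term favors motion continuity in crowded scenes, while emphasizing the cosine term relies more on the re-identification embeddings; how IFMOT actually balances and gates these cues is described in the paper itself.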
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Funding
No funding was received for conducting this study.
Author information
Contributions
D. C. and W. R. proposed the idea, designed and performed the simulations, and wrote the paper. C. Y. and B. W. analyzed the data. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest in this study.
Ethical approval
This research does not involve any studies with human participants or animals performed by any of the authors.
Consent to participate
Not applicable.
Consent for publication
All authors of this manuscript consent to its publication.
Additional information
Communicated by Haojie Li.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cao, D., Ren, W., Yu, C. et al. IFMOT: interactive perception and feature optimization network for multi-object tracking. Multimedia Systems 31, 136 (2025). https://doi.org/10.1007/s00530-025-01694-9