Abstract
Although Siamese networks are the most popular approach to object tracking, they can collapse to an undesirable trivial solution and require a large amount of training data that reflects changes in the object's shape in every frame. To solve this problem, this paper proposes a self-supervised learning method for multi-object tracking (SSL-MOT) based on a contrastive structure. Unlike existing SSL methods, we adopt a generative adversarial network as a preprocessing step to generate various pose changes of the tracked objects. A positive pair composed of the augmented image and the pose data is fed to the SSL network to learn an encoder that generates a non-collapsed output vector. To improve the discriminative power of the encoder output features, we propose an affinity correlation distance, which combines invariance and redundancy terms, as the loss function for learning. At test time, data association uses only the dot product between the output vectors of the tracker and the detection, which significantly reduces the computation time and makes real-time online tracking at about 12 fps possible. The proposed method is the first attempt to apply SSL to online MOT. Experimental results on the MOT16, MOT17, and MOT20 challenge datasets show that the proposed method is a fast and reasonable tracking method that occupies less memory and achieves excellent tracking performance compared with other state-of-the-art methods.
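The test-time association described in the abstract, a dot product between tracker and detection output vectors, can be illustrated with a minimal sketch. The abstract does not say how affinities are turned into matches, so the L2 normalization, the greedy one-to-one matching, the `min_affinity` threshold, and all function names below are assumptions made for illustration rather than the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumption: features are compared by cosine)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def affinity_matrix(track_feats, det_feats):
    """Dot product between every tracker vector and every detection vector."""
    tracks = [l2_normalize(t) for t in track_feats]
    dets = [l2_normalize(d) for d in det_feats]
    return [[sum(a * b for a, b in zip(t, d)) for d in dets] for t in tracks]


def greedy_associate(affinity, min_affinity=0.5):
    """Hypothetical matching rule: take the highest affinities first, one-to-one."""
    candidates = sorted(
        ((score, i, j)
         for i, row in enumerate(affinity)
         for j, score in enumerate(row)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, i, j in candidates:
        if score < min_affinity:
            break  # remaining pairs are too dissimilar to associate
        if i in used_tracks or j in used_dets:
            continue  # each track and each detection is matched at most once
        used_tracks.add(i)
        used_dets.add(j)
        matches.append((i, j))
    return sorted(matches)


# Toy example: two tracks, three detections (the third matches neither track).
tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.0, 2.0], [3.0, 0.1], [-1.0, -1.0]]
print(greedy_associate(affinity_matrix(tracks, detections)))
```

Because each comparison is a single dot product over fixed-length vectors, the cost of building the affinity matrix grows only with the number of track-detection pairs, which is consistent with the reduced computation time the abstract reports.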
Acknowledgements
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), Ministry of Education, under Grant 2019R1I1A3A01042506.
Cite this article
Kim, S., Lee, J. & Ko, B.C. SSL-MOT: self-supervised learning based multi-object tracking. Appl Intell 53, 930–940 (2023). https://doi.org/10.1007/s10489-022-03473-9
DOI: https://doi.org/10.1007/s10489-022-03473-9