Abstract
Object tracking in thermal imagery is a challenging problem with a growing range of applications. Fusing the complementary features of RGB and thermal images can substantially improve tracking performance. mfSiamTrack (Multi-modality fusion for Siamese Network based RGB-T Tracking) is a dual-mode short-term single-object (STSO) tracker that operates in either Thermal mode or Multi-modality fusion (RGBT) mode. RGBT mode is activated when the dataset contains both thermal-infrared sequences and the corresponding RGB sequences: the complementary features of the two modalities are fused, and tracking is performed on the fused sequences. When only thermal sequences are available, the tracker falls back to Thermal mode. For multi-modality fusion, an auto-encoder (AE) based fusion network is proposed. The encoder decomposes the RGB and thermal images into background and detail feature maps, the Fusion Layer fuses the corresponding feature maps of the two source images, and the decoder reconstructs the fused image. To handle objects at different scales and viewpoints, mfSiamTrack introduces a Multi-Scale Structural Similarity (MS-SSIM) based reconstruction method. mfSiamTrack is a fully-convolutional Siamese tracker that also incorporates semi-supervised video object segmentation (VOS) for pixel-wise target identification. The tracker was evaluated on the VOT-RGBT2019 dataset using Accuracy, Robustness, Expected Average Overlap (EAO) and average IoU as performance measures, and it is observed to outperform the state of the art.
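The abstract does not give the exact form of the MS-SSIM reconstruction objective, but a minimal sketch of the standard multi-scale SSIM (Wang et al., 2003) turned into a reconstruction loss might look as follows. For simplicity this version computes the SSIM statistics globally per image rather than with the usual 11x11 Gaussian window, and all function names are illustrative rather than taken from the paper:

```python
import numpy as np

def _downsample(img):
    # 2x2 average pooling, halving each spatial dimension.
    h, w = img.shape
    return img[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def _lum_cs(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Global (whole-image) luminance and contrast-structure terms of SSIM;
    # the standard formulation uses a sliding Gaussian window instead.
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    lum = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    cs = (2 * cov + c2) / (x.var() + y.var() + c2)
    return lum, max(cs, 1e-8)  # clip so fractional powers stay real

def ms_ssim(x, y, weights=(0.0448, 0.2856, 0.3001, 0.2363, 0.1333)):
    # Contrast-structure is accumulated at every scale; luminance is
    # applied only at the coarsest scale, as in the original MS-SSIM.
    score = 1.0
    for j, w in enumerate(weights):
        lum, cs = _lum_cs(x, y)
        score *= cs ** w
        if j == len(weights) - 1:
            score *= lum ** w
        else:
            x, y = _downsample(x), _downsample(y)
    return score

def ms_ssim_loss(reconstruction, target):
    # Reconstruction loss: 0 for a perfect match, approaching 1 as
    # structural similarity vanishes.
    return 1.0 - ms_ssim(reconstruction, target)
```

Under this reading, the decoder would be trained to minimise `ms_ssim_loss` between the reconstructed fused image and the source images, so that structure is preserved across several spatial scales rather than at a single resolution.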
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Chithra, A.V., Mishra, D. (2023). Multi-modality Fusion for Siamese Network Based RGB-T Tracking (mfSiamTrack). In: Gupta, D., Bhurchandi, K., Murala, S., Raman, B., Kumar, S. (eds) Computer Vision and Image Processing. CVIP 2022. Communications in Computer and Information Science, vol 1776. Springer, Cham. https://doi.org/10.1007/978-3-031-31407-0_31
Print ISBN: 978-3-031-31406-3
Online ISBN: 978-3-031-31407-0