Abstract
Visual tracking based on Siamese networks has achieved excellent performance on many tracking datasets. However, these trackers still fail to deliver desirable results in unconstrained environments, such as fast motion and extensive scale variation. To address this problem, this paper proposes an Adaptive Dilated Fusion module, a Depth Pixel-Wise Correlation module, and a Feature Alignment module. The Adaptive Dilated Fusion module handles extensive scale variation by adding a receptive-field pyramid on the last layer of the Siamese network; the Depth Pixel-Wise Correlation module extracts pixel-level features through average pooling and maximum pooling and reduces the influence of background noise; the Feature Alignment module alleviates the mismatch between the classification and regression tasks. Experiments are performed on several public datasets, including VOT2017, OTB100, and LaSOT, and tracking performance is evaluated on complex scenes such as fast motion, various resolutions, and extensive scale variation. On the OTB100 dataset, the proposed tracker (named SiamAPA) improves AUC over the reference network by 2.4% on fast-motion scenes, 4.9% on various-resolution scenes, and 1.3% on scenes with extensive scale variation. On the VOT2017 dataset, SiamAPA improves EAO by 3.7% over the reference network. On the LaSOT dataset, accuracy is improved by 1% and robustness by 1.9% compared with the reference network. Thanks to the coordination of these three innovations, the proposed algorithm outperforms classical algorithms such as SPM-Tracker on many datasets while tracking in real time.
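The pixel-wise correlation idea described in the abstract can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the function name, tensor shapes, and omission of the average/max pooling step are assumptions. The core operation is that each template location's channel vector acts as a 1×1 kernel correlated with every search-region location, yielding one response channel per template pixel.

```python
import numpy as np

def pixelwise_correlation(template, search):
    """Correlate every template pixel (a C-dim vector, i.e. a 1x1 kernel)
    with every search-region pixel.

    template: (C, Ht, Wt) feature map of the target template
    search:   (C, Hs, Ws) feature map of the search region
    returns:  (Ht*Wt, Hs, Ws) response, one channel per template pixel
    """
    c, ht, wt = template.shape
    _, hs, ws = search.shape
    kernels = template.reshape(c, ht * wt)   # one column per template pixel
    feats = search.reshape(c, hs * ws)       # one column per search pixel
    resp = kernels.T @ feats                 # all pairwise inner products
    return resp.reshape(ht * wt, hs, ws)

# Illustrative shapes only: a 2-channel 2x2 template against a 3x3 search region.
t = np.arange(2 * 2 * 2, dtype=float).reshape(2, 2, 2)
s = np.ones((2, 3, 3))
r = pixelwise_correlation(t, s)
```

Unlike depth-wise correlation, which slides the whole template as one kernel, this keeps the spatial granularity of the template, which is what allows pixel-level matching.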
Data Availability
All data generated during this study are included in these published articles:
1) OTB100.
https://doi.org/10.1109/TPAMI.2014.2388226.
2) VOT2017.
https://doi.org/10.1109/ICCVW.2017.230.
3) LaSOT.
https://doi.org/10.1109/CVPR.2019.00552.
4) Trackingnet.
https://doi.org/10.1007/978-3-030-01246-5_19.
5) GOT-10k.
https://doi.org/10.1109/TPAMI.2019.2957464.
References
Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., & Torr, P. H. S. (2016). Fully-convolutional siamese networks for object tracking. European conference on computer vision, 850–865. https://doi.org/10.1007/978-3-319-48881-3_56
Li, B., Yan, J., Wu, W., Zhu, Z., & Hu, X. L. (2018). High performance visual tracking with Siamese region proposal network. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8971–8980. https://doi.org/10.1109/CVPR.2018.00935
Ren, S. Q., He, K. M., Girshick, R., & Sun, J. (2017). Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
Guo, D. Y., Wang, J., Cui, Y., Wang, Z. H., & Chen, S. Y. (2020). SiamCAR: Siamese fully convolutional classification and regression for visual tracking. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6268–6276. https://doi.org/10.1109/CVPR42600.2020.00630
Tian, Z., Shen, C., Chen, H., & He, T. (2019). FCOS: Fully Convolutional One-Stage Object Detection, Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 9626–9635. https://doi.org/10.1109/ICCV.2019.00972
Wang, G., Luo, C., Xiong, Z., & Zeng, W. (2019). SPM-Tracker: Series-parallel matching for real-time visual object tracking. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3638–3647. https://doi.org/10.1109/CVPR.2019.00376
Voigtlaender, P., Luiten, J., Torr, P. H., & Leibe, B. (2020). Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 6578–6588. https://doi.org/10.1109/cvpr42600.2020.00661
Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 936–944. https://doi.org/10.1109/CVPR.2017.106
Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement.arXiv preprint arXiv:1804.02767
Liu, S., & Huang, D. (2018). Receptive field block net for accurate and fast object detection. In Proceedings of the European conference on computer vision, 385–400. https://doi.org/10.1007/978-3-030-01252-6_24
Jiang, B., Luo, R., Mao, J., Xiao, T., & Jiang, Y. (2018). Acquisition of localization confidence for accurate object detection. Proceedings of the 15th European Conference on Computer Vision, 784–799. https://doi.org/10.1007/978-3-030-01264-9_48
Yang, Z., Liu, S., Hu, H., Wang, L., & Lin, S. (2019). RepPoints: Point Set Representation for Object Detection, Proceedings of 2019 IEEE/CVF International Conference on Computer Vision, 9656–9665. https://doi.org/10.1109/ICCV.2019.00975
Wu, Y., Chen, Y., Yuan, L., Liu, Z., Wang, L., Li, H., … (2020). Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10183–10192. https://doi.org/10.1109/CVPR42600.2020.01020
Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., & Yan, J. (2019). SiamRPN++: Evolution of Siamese visual tracking with very deep networks. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4282–4291. https://doi.org/10.1109/CVPR.2019.00441
Yan, B., Zhang, X., Wang, D., Lu, H., & Yang, X. (2021). Alpha-Refine: Boosting tracking performance by precise bounding box estimation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5289–5298. https://doi.org/10.1109/CVPR46437.2021.00525
Fan, H., & Ling, H. (2020). CRACT: Cascaded Regression-Align-Classification for Robust Visual Tracking. arXiv preprint arXiv:2011.12483
Chen, Q., Wang, Y., Yang, T., Zhang, X., Cheng, J., & Sun, J. (2021). You only look one-level feature. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13039–13048. https://doi.org/10.1109/CVPR46437.2021.01284
Wu, Y., Lim, J., & Yang, M. H. (2015). Object Tracking Benchmark. IEEE Transactions on Pattern Analysis & Machine Intelligence, 37(9), 1834–1848. https://doi.org/10.1109/TPAMI.2014.2388226
Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Čehovin Zajc, L., & Fernandez, G. (2017). The visual object tracking VOT2017 challenge results. Proceedings of the IEEE International Conference on Computer Vision Workshops, 1949–1972. https://doi.org/10.1109/ICCVW.2017.230
Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., & Yu, S. (2019). Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5374–5383. https://doi.org/10.1109/CVPR.2019.00552
Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., & Ghanem, B. (2018). Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV) pp. 300–317 https://doi.org/10.1007/978-3-030-01246-5_19
Huang, L., Zhao, X., & Huang, K. (2019). Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5), 1562–1577. https://doi.org/10.1109/tpami.2019.2957464
Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., & Torr, P. H. (2017). End-to-end representation learning for correlation filter based tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2805–2813. https://doi.org/10.1109/CVPR.2017.531
Zhang, Z. P., & Peng, H. W. (2019). Deeper and wider Siamese networks for real-time visual tracking. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4586–4595. https://doi.org/10.1109/CVPR.2019.00472
Danelljan, M., Bhat, G., Khan, F. S., & Felsberg, M. (2017). ECO: Efficient convolution operators for tracking. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, 6931–6939. https://doi.org/10.1109/CVPR.2017.733
Li, P., Chen, B., Ouyang, W., Wang, D., Yang, X., & Lu, H. (2019). GradNet: Gradient-guided network for visual object tracking. Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 6161–6170. https://doi.org/10.1109/ICCV.2019.00626
Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J. J., & Hu, W. M. (2018). Distractor-aware Siamese networks for visual object tracking, Proceedings of the 15th European Conference on Computer Vision, 103–119 https://doi.org/10.1007/978-3-030-01240-3_7
Wang, Q., Zhang, L., Bertinetto, L., Hu, W. M., & Torr, P. H. S. (2019). Fast online object tracking and segmentation: A unifying approach. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1328–1338. https://doi.org/10.1109/CVPR.2019.00142
Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R. W. H., & Yang, M. (2017). CREST: Convolutional Residual Learning for Visual Tracking, Proceedings of 2017 IEEE International Conference on Computer Vision, 2574–2583. https://doi.org/10.1109/ICCV.2017.279
Zhang, Z., Xie, Y., Xing, F., McGough, M., & Yang, L. (2017). Mdnet: A semantically and visually interpretable medical image diagnosis network. Proceedings of the IEEE conference on computer vision and pattern recognition, 6428–6436. https://doi.org/10.1109/CVPR.2017.378
Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., & Torr, P. H. S. (2016). Staple: Complementary learners for real-time tracking. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 1401–1409. https://doi.org/10.1109/CVPR.2016.156
Bhat, G., Johnander, J., Danelljan, M., Khan, F. S., & Felsberg, M. (2018). Unveiling the power of deep tracking. In Proceedings of the European Conference on Computer Vision 483–498 https://doi.org/10.1007/978-3-030-01216-8_30
Ma, C., Huang, J. B., Yang, X., & Yang, M. H. (2015). Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE international conference on computer vision pp. 3074–3082 https://doi.org/10.1109/iccv.2015.352
Danelljan, M., Robinson, A., Shahbaz Khan, F., & Felsberg, M. (2016). Beyond correlation filters: Learning continuous convolution operators for visual tracking. In European conference on computer vision pp. 472–488 https://doi.org/10.1007/978-3-319-46454-1_29
Held, D., Thrun, S., & Savarese, S. (2016). Learning to track at 100 fps with deep regression networks. In European conference on computer vision pp. 749–765 https://doi.org/10.1007/978-3-319-46448-0_45
Funding
This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 61871445 and 61302156, and by the Key R&D Foundation Project of Jiangsu Province under Grant BE2016001-4.
Author information
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Yao Xiao, Fuxiang Wang, Xuhui Liu. The first draft of the manuscript was written by Guang Han and Yao Xiao and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval
This paper is not a study with human subjects, so no ethics approval is required.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Han, G., Xiao, Y., Wang, F. et al. Visual tracking based on depth cross-correlation and feature alignment. J Sign Process Syst 95, 37–47 (2023). https://doi.org/10.1007/s11265-022-01791-2