Abstract
Siamese networks have drawn increasing interest in visual object tracking because they balance precision and efficiency. However, Siamese trackers typically use relatively shallow backbone networks, such as AlexNet, and therefore do not take full advantage of modern deep convolutional neural networks (CNNs). Moreover, the feature representation of the target object in a Siamese tracker is extracted from the last CNN layer and mainly captures semantic information, so the tracker's precision is relatively low and it drifts easily in the presence of similar distractors. In this paper, we design a new nonpadding residual unit and stack it into a 22-layer deep ResNet, referred to as ResNet22. Using ResNet22 as the backbone, we build a deep Siamese network that greatly improves tracking performance. Because feature maps at different levels of a CNN represent different aspects of the target object, we aggregate several convolutional layers of ResNet22 so that its multilevel feature maps form a hyperfeature representation of the target. The final network architecture is named DSiamLA. Experimental results show that DSiamLA achieves significant improvements on multiple benchmark datasets.
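To make the idea of matching on aggregated multilevel features concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes that the feature maps of two layers share the same stride, that they are aligned by center cropping and fused by channel concatenation, and that matching is a plain no-padding cross-correlation as in fully convolutional Siamese trackers. The helper names (`center_crop`, `aggregate`, `cross_correlate`), the layer shapes, and the random stand-in features are all hypothetical; the actual fusion used in DSiamLA may differ.

```python
import numpy as np


def center_crop(feat, size):
    """Crop a (C, H, W) feature map to (C, size, size) around its center."""
    _, h, w = feat.shape
    top, left = (h - size) // 2, (w - size) // 2
    return feat[:, top:top + size, left:left + size]


def aggregate(levels):
    """Concatenate feature maps from several layers along the channel axis.

    Assumption: all levels share one stride, so center-cropping each map to
    the smallest spatial size keeps them spatially aligned.
    """
    size = min(f.shape[1] for f in levels)
    return np.concatenate([center_crop(f, size) for f in levels], axis=0)


def cross_correlate(exemplar, search):
    """Slide the exemplar over the search features without padding.

    Returns a response map of shape (H_s - H_e + 1, W_s - W_e + 1).
    """
    _, he, we = exemplar.shape
    _, hs, ws = search.shape
    out = np.empty((hs - he + 1, ws - we + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(exemplar * search[:, i:i + he, j:j + we])
    return out


# Random stand-ins for two backbone levels of the search image (hypothetical
# shapes: a shallower 24x24 map and a deeper 22x22 map).
rng = np.random.default_rng(0)
search_levels = [rng.standard_normal((4, 24, 24)),
                 rng.standard_normal((8, 22, 22))]

# Build the exemplar by cutting the same target window out of every level,
# so the true target offset in the aggregated search features is (row, col).
row, col = 5, 9
exemplar_levels = [search_levels[0][:, row:row + 8, col:col + 8],
                   search_levels[1][:, row:row + 6, col:col + 6]]

response = cross_correlate(aggregate(exemplar_levels),
                           aggregate(search_levels))
peak = tuple(int(i) for i in np.unravel_index(np.argmax(response),
                                              response.shape))
print(response.shape, peak)  # the peak should fall at (row, col)
```

The explicit double loop is kept for readability; a practical tracker would compute the same response with a batched convolution on the GPU.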
Cite this article
Yu, L., Zhao, Y. & Zheng, X. Towards real-time object tracking with deep Siamese network and layerwise aggregation. SIViP 15, 1303–1311 (2021). https://doi.org/10.1007/s11760-021-01861-1
DOI: https://doi.org/10.1007/s11760-021-01861-1