Abstract
The Vision Meets Drone (VisDrone2020) Single Object Tracking challenge is the third annual UAV tracking evaluation activity organized by the VisDrone team, in conjunction with the European Conference on Computer Vision (ECCV 2020). This paper presents and discusses in detail the results of the 13 algorithms participating in the VisDrone-SOT2020 Challenge. By using an ensemble of different trackers trained on several large-scale datasets, the top performer in VisDrone-SOT2020 achieves better results than its counterparts in VisDrone-SOT2018 and VisDrone-SOT2019. The challenge results, the collected videos, and the evaluation toolkit are made available at http://aiskyeye.com/. By holding the VisDrone-SOT2020 challenge, we hope to provide the community with a dedicated platform for developing and evaluating drone-based tracking approaches.
References
Ahn, N., Kang, B., Sohn, K.A.: Efficient deep neural network for photo-realistic image super-resolution. arXiv (2019)
Kristan, M., et al.: The sixth visual object tracking VOT2018 challenge results. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11129, pp. 3–53. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11009-3_1
Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional Siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 850–865. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_56
Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: ICIP (2016)
Bhat, G., Danelljan, M., Gool, L.V., Timofte, R.: Learning discriminative model prediction for tracking. In: ICCV (2019)
Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M.: Visual object tracking using adaptive correlation filters. In: CVPR (2010)
Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ECO: efficient convolution operators for tracking. In: CVPR (2017)
Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: accurate tracking by overlap maximization. In: CVPR (2019)
Danelljan, M., Gool, L.V., Timofte, R.: Probabilistic regression for visual tracking. In: CVPR (2020)
Danelljan, M., Häger, G., Khan, F., Felsberg, M.: Accurate scale estimation for robust visual tracking. In: BMVC (2014)
Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M.: Learning spatially regularized correlation filters for visual tracking. In: ICCV (2015)
Danelljan, M., Robinson, A., Shahbaz Khan, F., Felsberg, M.: Beyond correlation filters: learning continuous convolution operators for visual tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 472–488. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_29
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
Du, D., et al.: The unmanned aerial vehicle benchmark: object detection and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 375–391. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_23
Du, D., Wen, L., Qi, H., Huang, Q., Tian, Q., Lyu, S.: Iterative graph seeking for object tracking. TIP 27(4), 1809–1821 (2018)
Du, D., et al.: VisDrone-SOT2019: the vision meets drone single object tracking challenge results. In: ICCVW (2019)
Fan, H., et al.: LaSOT: a high-quality benchmark for large-scale single object tracking. In: CVPR (2019)
Fan, H., Ling, H.: Parallel tracking and verifying: a framework for real-time and high accuracy visual tracking. In: ICCV (2017)
Fan, H., Ling, H.: SANet: structure-aware network for visual tracking. In: CVPRW (2017)
Fan, H., Ling, H.: Siamese cascaded region proposal networks for real-time visual tracking. In: CVPR (2019)
Galoogahi, H.K., Fagg, A., Huang, C., Ramanan, D., Lucey, S.: Need for speed: a benchmark for higher frame rate object tracking. In: ICCV (2017)
Galoogahi, H.K., Fagg, A., Lucey, S.: Learning background-aware correlation filters for visual tracking. In: ICCV (2017)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. TPAMI 37(3), 583–596 (2015)
Huang, L., Zhao, X., Huang, K.: GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. TPAMI (2019)
Jiang, B., Luo, R., Mao, J., Xiao, T., Jiang, Y.: Acquisition of localization confidence for accurate object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11218, pp. 816–832. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01264-9_48
Jung, I., Son, J., Baek, M., Han, B.: Real-time MDNet. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 89–104. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_6
Kristan, M., et al.: A novel performance evaluation methodology for single-target trackers. TPAMI 38(11), 2137–2155 (2016)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: SiamRPN++: evolution of Siamese visual tracking with very deep networks. In: CVPR (2019)
Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with Siamese region proposal network. In: CVPR (2018)
Li, F., Tian, C., Zuo, W., Zhang, L., Yang, M.H.: Learning spatial-temporal regularized correlation filters for visual tracking. In: CVPR (2018)
Li, S., Yeung, D.Y.: Visual object tracking for unmanned aerial vehicles: a benchmark and new motion models. In: AAAI (2017)
Li, Y., Zhu, J.: A scale adaptive kernel correlation filter tracker with feature integration. In: Agapito, L., Bronstein, M.M., Rother, C. (eds.) ECCV 2014. LNCS, vol. 8926, pp. 254–265. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16181-5_18
Liang, P., Blasch, E., Ling, H.: Encoding color information for visual tracking: algorithms and benchmark. TIP 24(12), 5630–5644 (2015)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, T., Wang, G., Yang, Q.: Real-time part-based visual tracking via adaptive correlation filters. In: CVPR (2015)
Lukezic, A., et al.: CDTB: a color and depth visual object tracking dataset and benchmark. In: ICCV (2019)
Lv, F., Lu, F., Wu, J., Lim, C.: MBLLEN: low-light image/video enhancement using CNNs. In: BMVC (2018)
Ma, C., Huang, J.B., Yang, X., Yang, M.H.: Hierarchical convolutional features for visual tracking. In: ICCV (2015)
Marvasti-Zadeh, S.M., Khaghani, J., Ghanei-Yakhdan, H., Kasaei, S., Cheng, L.: COMET: context-aware IoU-guided network for small object tracking. arXiv (2020)
Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 445–461. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_27
Mueller, M., Smith, N., Ghanem, B.: Context-aware correlation filter tracking. In: CVPR (2017)
Müller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: TrackingNet: a large-scale dataset and benchmark for object tracking in the wild. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 310–327. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_19
Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: CVPR (2016)
Real, E., Shlens, J., Mazzocchi, S., Pan, X., Vanhoucke, V.: YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video. In: CVPR, pp. 7464–7473 (2017)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)
Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015)
Smeulders, A.W., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: an experimental survey. TPAMI 36(7), 1442–1468 (2014)
Song, Y., et al.: VITAL: visual tracking via adversarial learning. In: CVPR (2018)
Tao, R., Gavves, E., Smeulders, A.W.: Siamese instance search for tracking. In: CVPR (2016)
Valmadre, J., et al.: Long-term tracking in the wild: a benchmark. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 692–707. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_41
Voigtlaender, P., Luiten, J., Torr, P.H., Leibe, B.: Siam R-CNN: visual tracking by re-detection. In: CVPR (2020)
Wang, G., Luo, C., Xiong, Z., Zeng, W.: SPM-Tracker: series-parallel matching for real-time visual object tracking. In: CVPR (2019)
Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.: Fast online object tracking and segmentation: a unifying approach. In: CVPR (2019)
Wen, L., et al.: VisDrone-SOT2018: the vision meets drone single-object tracking challenge results. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11133, pp. 469–495. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11021-5_28
Wu, Y., Lim, J., Yang, M.H.: Online object tracking: a benchmark. In: CVPR (2013)
Wu, Y., Lim, J., Yang, M.H.: Object tracking benchmark. TPAMI 37(9), 1834–1848 (2015)
Yan, B., Wang, D., Lu, H., Yang, X.: Alpha-Refine: boosting tracking performance by precise bounding box estimation. arXiv (2020)
Yang, G., Ramanan, D.: Volumetric correspondence networks for optical flow. In: NeurIPS (2019)
Ying, Z., Li, G., Ren, Y., Wang, R., Wang, W.: A new low-light image enhancement algorithm using camera response model. In: ICCVW (2017)
Yuan, D., Fan, N., He, Z.: Learning target-focusing convolutional regression model for visual object tracking. Knowl.-Based Syst. (2020)
Zhang, Y., Zhang, J., Guo, X.: Kindling the darkness: a practical low-light image enhancer. In: ACM MM (2019)
Zhou, J., Wang, P., Sun, H.: Discriminative and robust online learning for Siamese visual tracking. In: AAAI (2020)
Zhou, W., Wen, L., Zhang, L., Du, D., Luo, T., Wu, Y.: SiamMan: Siamese motion-aware network for visual tracking. CoRR abs/1912.05515 (2019)
Zhu, P., Wen, L., Du, D., Bian, X., Hu, Q., Ling, H.: Vision meets drones: past, present and future. CoRR abs/2001.06303 (2020)
Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., Hu, W.: Distractor-aware Siamese networks for visual object tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 103–119. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_7
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61876127 and Grant 61732011, and in part by the Natural Science Foundation of Tianjin under Grant 17JCZDJC30800.
A Descriptions of Submitted Trackers
In this appendix, we summarize the 13 trackers submitted to the VisDrone-SOT2020 Challenge, ordered by the submission time of their final results.
A.1 Strategy and Motion Integrated Long-Term Experts-Version 2 (SMILEv2)
Yuxuan Li, Zhongjian Huang and Biao Wang
liyuxuan_xidian@126.com, huangzj@stu.xidian.edu.cn, biaowang@webank.com
SMILEv2 combines three kinds of base trackers within our IPIU-tracking framework. In this new framework, we can select different trackers in different situations in a semi-automatic way. As shown in Fig. 9, the framework has three parts: a prediction module, a tracking module, and a fix module. For the prediction module, we introduce a Kalman filter and the optical flow method of VCN [61] to capture object motion information and camera motion information, respectively. For the tracking module, we use three trackers: DiMP [5], SiamMask [56], and SORT [4]. For the fix module, we first obtain the outputs of the prediction and tracking modules and then determine the final result.
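The paper does not release the fix module's decision rule; the following is a minimal sketch of one plausible realization, in which the tracker output that best agrees with the motion prediction wins. All names and the agreement threshold are our assumptions, not the authors' IPIU-tracking API.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def fix_module(predicted_box, tracker_boxes, min_agreement=0.3):
    """Pick the tracker output that best agrees with the motion prediction;
    fall back to the prediction itself when every tracker drifts away."""
    scores = [iou(predicted_box, b) for b in tracker_boxes]
    best = int(np.argmax(scores))
    if scores[best] < min_agreement:
        return predicted_box  # all trackers disagree with the motion cue
    return tracker_boxes[best]
```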
A.2 Long-Term Tracking with Night-Enhancement and Motion Integrated (LTNMI)
Yuting Yang, Yanjie Gao, Ruiyan Ma and Xin Hou
{ytyang_1,yjgao}@stu.xidian.edu.cn, 3028408083@qq.com, xinhou@webank.com
LTNMI is a combination of ATOM [8], SiamRPN++ [31], Siam R-CNN [54], and DiMP [5]. We combine ATOM and SiamRPN++ to obtain a better fused result; the method then sets a reliability lower bound for each of the two trackers under different confidence levels, which makes the system more reliable, since different features play different roles in tracking depending on their reliability. In addition, we improve prediction in blurred scenes by matching features with the SIFT algorithm. By estimating motion, the regression boxes can keep tracking the target under occlusion. For dark or low-resolution scenes, we apply threshold judgment and image brightness enhancement, using the MBLLEN [40] algorithm for low-light enhancement, and then run DiMP on the enhanced sequences. Finally, we use Siam R-CNN to recover lost frames: when the overlap area between the fused result and the result generated by Siam R-CNN is nearly 95%, we conclude that the Siam R-CNN result is better because of its more accurate bounding box.
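A hedged sketch of that final acceptance rule, reusing the iou helper from the SMILEv2 sketch above: a Siam R-CNN re-detection replaces the fused ATOM/SiamRPN++ result only when the two boxes overlap almost completely. The 0.95 threshold comes from the text; the function name is illustrative.

```python
def choose_result(fused_box, siamrcnn_box, overlap_thresh=0.95):
    """Prefer the Siam R-CNN box when it nearly coincides with the fusion,
    since its bounding box is assumed more accurate; otherwise keep fusion."""
    if siamrcnn_box is not None and iou(fused_box, siamrcnn_box) >= overlap_thresh:
        return siamrcnn_box
    return fused_box
```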
A.3 Ensemble of Classification and Matching Models with Alpha-Refine for UAV Tracking (ECMMAR)
Shuhao Chen, Zezhou Wang, Simiao Lai, Dong Wang and Huchuan Lu
{shuhaochn,zzwang}@mail.dlut.edu.cn, laisimiao1@gmail.com,
{wdice,lhchuan}@dlut.edu.cn
The ECMMAR tracker is built upon DiMP [5] and SiamRPN++ [31] with an online update module [65]. DiMP performs well in distinguishing distractors, while SiamRPN++ with a re-detection module performs well in recovering the target after it disappears due to full occlusion or fast viewpoint changes. The main modifications are: 1) an interactive mechanism to handle long-term tracking and improve robustness; 2) multi-scale search regions to help re-detect the target after full occlusion or fast viewpoint changes; 3) a refinement module [60] to refine the localized bounding box; 4) a low-light image enhancement method [62] to deal with low-light scenes; 5) fine-tuning of the SuperDiMP and Alpha-Refine pre-trained models on the VisDrone2020 dataset; 6) motion compensation when the camera viewing angle changes greatly; 7) inertial motion when both tracker results are unreliable, as sketched below.
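A minimal sketch (not the authors' code) of modifications 6) and 7): when both tracker outputs are unreliable, the last reliable box is propagated by its recent velocity, optionally shifted by an estimated global camera motion first. All names here are illustrative.

```python
def fallback_box(history, camera_shift=(0.0, 0.0)):
    """history: list of [x, y, w, h] boxes from reliable frames, newest last."""
    x, y, w, h = history[-1]
    if len(history) >= 2:
        vx = x - history[-2][0]  # per-frame target velocity in x
        vy = y - history[-2][1]  # per-frame target velocity in y
    else:
        vx, vy = 0.0, 0.0
    # camera-motion compensation, then inertial extrapolation of the target
    return [x + vx + camera_shift[0], y + vy + camera_shift[1], w, h]
```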
A.4 UAV Tracking with Extra Proposals Based on Corrected Velocity Prediction (CVP-superdimp)
Zitong Yi and Yanyun Zhao
{zitong.yi,zyy}@mail.dlut.edu.cn
CVP-superdimp is a robust tracking strategy for UAV tracking, especially for the challenging problems of severe camera motion and long-term full occlusion. The base tracker follows [5, 9] and contains two modules: an object classification module based on DiMP and a bounding box regression module based on PrDiMP. Our tracking strategy adds a velocity prediction module covering both short-term and long-term motion, which provides additional high-quality proposals for the tracker to search in the next frame.
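A minimal sketch of the velocity-prediction idea under our own assumptions: extrapolate short-term and long-term average velocities of the target center to generate extra search proposals for the next frame. The window sizes and function name are illustrative, not the authors' settings.

```python
import numpy as np

def velocity_proposals(centers, short=3, long=15):
    """centers: (N, 2) array of past target centers, newest last.
    Returns candidate centers: stay-put plus one-step velocity predictions."""
    c = np.asarray(centers, dtype=float)
    proposals = [c[-1]]                      # stay-put proposal
    for win in (short, long):
        if len(c) > win:
            v = (c[-1] - c[-1 - win]) / win  # mean velocity over the window
            proposals.append(c[-1] + v)      # extrapolate one frame ahead
    return np.stack(proposals)
```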
A.5 LTCOMET: Context-Aware IoU-Guided Network for Small Object Tracking (LTCOMET)
Seyed Mojtaba Marvasti-Zadeh, Javad Khaghani, Li Cheng, Hossein Ghanei-Yakhdan and Shohreh Kasaei
{mojtaba.marvasti,khaghani,lcheng5}@ualberta.ca,hghaneiy@yazd.ac.ir,
kasaei@sharif.edu
To bridge the gap between aerial-view tracking methods and modern trackers, the modified context-aware IoU-guided tracker (LTCOMET) exploits the offline reference proposal generation strategy of the COMET tracker [42], its multitask two-stream network [42], Kindling the Darkness (KinD) [64], and the photo-realistic cascading residual network (PCARN) [1]. The network architecture is the same as [42], but without channel reduction after the multi-scale aggregation and fusion modules (MSAFs). KinD, which uses a network for light adjustment and degradation removal, is employed to preprocess target patches. LTCOMET also employs the generator network of PCARN to recover high-resolution patches of the target and its context from low-resolution ones. Furthermore, the proposed method uses a windowing search strategy when it loses the target (sketched below). LTCOMET has been trained on a broad range of tracking datasets with various photometric and geometric distortions (i.e., data augmentations) to improve the variability of target regions.
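The paper does not spell out the windowing search strategy; one plausible form is an expanding search window for each frame the target remains lost. The growth factor and cap below are assumptions, not values from the paper.

```python
def next_search_size(base_size, frames_lost, grow=1.3, max_scale=4.0):
    """Grow the search window for every frame the target stays lost,
    capped so the window never exceeds max_scale times the base size."""
    scale = min(grow ** frames_lost, max_scale)
    return base_size[0] * scale, base_size[1] * scale
```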
A.6 Discriminative and Robust Online Learning for Long Term Siamese Visual Tracking (DROL_LT)
Jinghao Zhou, Peng Wang, Haoyang Sun and Zikai Zhang
{jensen.zhoujh,zzkdemail}@gmail.com,{peng.wang,sunhaoyang}@gmail.com
DROL_LT is based on DROL [65]. DROL proposes an online module with an attention mechanism that lets offline Siamese networks extract target-specific features under an L2 error, a filter update strategy adaptive to treacherous background noise for discriminative learning, and a template update strategy that handles large target deformations for robust learning. DROL_LT adds two modules to improve DROL in long-term tracking: (1) a detector that helps DROL recover targets that disappear and reappear many times, with ROI Align used to extract features from the combined offline feature maps given the detector's bounding boxes; (2) a mechanism that decides when to update the online classifier and when to invoke the detector, based on a set of empirically chosen thresholds (see the sketch below).
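A hedged sketch of the threshold mechanism in (2): the online classifier's confidence decides whether to update the model, track without updating, or fall back to the detector. The threshold values are illustrative, "given from experience" per the text.

```python
def decide(conf, update_thresh=0.6, redetect_thresh=0.3):
    """Map online-classifier confidence to an action for the current frame."""
    if conf >= update_thresh:
        return "track_and_update"  # reliable: also update the online classifier
    if conf >= redetect_thresh:
        return "track_only"        # uncertain: track but freeze the model
    return "run_detector"          # likely lost: recover with the detector
```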
A.7 Discriminative and Robust Online Learning for Long Term Siamese Visual Tracking (DIMP-SiamRPN)
Zhipeng Luo, Penghao Zhang, Yubo Sun and Bin Dong
{luozp,zhangph,sunyb,Dongbin}@deepblueai.com
DIMP-SiamRPN builds on PrDiMP [9] and SiamRPN++ [31]. First, we use the frame count to divide the challenge-set videos into long-term and short-term videos. Short videos are handled by a PrDiMP model with tuned hyper-parameters. Daytime scenes in long videos are handled by the SiamRPN++ model, in which we enlarge the instance (search) size by 15 pixels every frame, with an upper limit of 1000. In addition, when the target appears lost, we reset the center of the search scope to the center of the image, and we define a make-up strategy to deal with occlusion. Night scenes in long videos are further divided into strong-light and dark scenes according to light intensity, with different inference parameters for each.
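A sketch of that routing logic. The 15-pixel per-frame growth is from the text; interpreting the 1000 upper limit as a size cap, and the frame-count and brightness thresholds, are our assumptions.

```python
def route(num_frames, mean_brightness, long_video_thresh=1500, night_thresh=60):
    """Choose which model and parameter set handles a given video/scene."""
    if num_frames < long_video_thresh:
        return "PrDiMP"                    # short video
    if mean_brightness < night_thresh:
        return "SiamRPN++(night params)"   # dark scene in a long video
    return "SiamRPN++(day params)"         # daytime scene in a long video

def grow_instance_size(size, step=15, cap=1000):
    """Enlarge the SiamRPN++ instance size by 15 px per frame, capped."""
    return min(size + step, cap)
```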
A.8 Discriminative Model Prediction and Accurate Re-detection for Drone Tracking (DiMP_AR)
Xuefeng Zhu, Xiaojun Wu and Tianyang Xu
{xuefeng_zhu95,xiaojun_wu_jnu,tianyang_xu}@163.com
DiMP_AR extends DiMP [5] with a re-detection module. DiMP serves as a local tracker that predicts the target state in the normal case, while RT-MDNet [28] acts as a verifier of DiMP's predictions. If the verification score is above a predefined threshold, normal local tracking continues in the next frame; otherwise, the re-detection module is activated. First, the Faster R-CNN detector [48] detects highly likely target candidates over the whole next frame; then the SiamRPN++ [31] tracker evaluates the search regions around these candidates. Once the target is regained, we switch back to local tracking with DiMP.
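A minimal control-loop sketch of that verify-then-re-detect logic. The objects (dimp, verifier, detector, siamrpn) and their methods are hypothetical stand-ins for the components named above, not real APIs.

```python
def track_frame(frame, dimp, verifier, detector, siamrpn, thresh=0.5):
    box = dimp.track(frame)
    if verifier.score(frame, box) >= thresh:
        return box                              # normal local tracking
    for cand in detector.detect(frame):         # global re-detection stage
        refined, score = siamrpn.match(frame, cand)
        if score >= thresh:
            dimp.reset(refined)                 # target regained: resume locally
            return refined
    return box                                  # nothing better: keep prediction
```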
A.9 Precise Visual Tracking by Re-detection (PrSiamR-CNN)
Zhongzhou Zhang, Lei Zhang, Keyang Wang and Zhenwei He
{zz.zhang,leizhang,wangkeyang,hzw}@cqu.edu.cn
PrSiamR-CNN is modified from the recently proposed state-of-the-art single object tracker Siam R-CNN [54] by using extra training data from VisDrone-SOT2020.
A.10 Discriminative Model Prediction with Deeper ResNet-101 (DiMP-101)
Liting Lin and Yong Xu
l.lt@mail.scut.edu.cn
DiMP-101 is based on the DiMP [5] model, adopting the deeper ResNet-101 as the backbone. With the higher learning capacity of this feature extraction network, tracking performance improves significantly.
A.11 ECO: Efficient Convolution Operators for Tracking (ECO)
Lei Pang
panglei2015@ia.ac.cn
ECO [7] is a discriminative correlation filter based tracker using deep features. It introduces a factorized convolution operator and a compact generative model of the training sample distribution to reduce the number of model parameters, and proposes a conservative model update strategy with improved robustness and reduced complexity. More details can be found in [7].
A.12 Target-Focusing Convolutional Regression Tracking (TFCR)
Di Yuan, Nana Fan and Zhenyu He
dyuanhit@gmail.com
TFCR [63] is a target-focusing convolutional regression (CR) model for visual object tracking. It uses a target-focusing loss function to alleviate the influence of background noise on the response map of the current frame, which effectively improves tracking accuracy. In particular, it balances the disequilibrium between positive and negative samples by reducing the effect of negative samples on the object appearance model. A possible form of such a loss is sketched below.
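A hedged sketch in the spirit of a target-focusing loss: the squared error between the predicted response map and a Gaussian label is re-weighted so that background positions contribute less. The exact formulation in [63] may differ; bg_weight is an assumption.

```python
import numpy as np

def target_focusing_loss(response, label, bg_weight=0.1):
    """response, label: (H, W) maps; label is a Gaussian peaked on the target.
    Weights are ~1 near the target and ~bg_weight on the background, which
    down-weights negative (background) samples in the regression."""
    focus = label + bg_weight * (1.0 - label)
    return np.mean(focus * (response - label) ** 2)
```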
A.13 DDL-Tracker (DDL)
Yong Wang, Lu Ding, Dongjie Zhou and Wentao He
wangyong5@mail.sysu.edu.cn,dinglu@sjtu.edu.cn,13520071811@163.com, weishiinsky@126.com
The DDL tracker employs deep layers to extract features, while a HOG detector is trained online in parallel. If the tracking confidence falls below a threshold, we use the detector's result instead, as in the sketch below.
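A minimal sketch of that fallback rule; the object interfaces and the threshold are illustrative assumptions, not the authors' code.

```python
def ddl_step(frame, tracker, hog_detector, conf_thresh=0.4):
    """Trust the deep-feature tracker while confident; otherwise take the
    online HOG detector's output for this frame."""
    box, conf = tracker.track(frame)
    if conf < conf_thresh:
        box = hog_detector.detect(frame)
    return box
```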
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Fan, H. et al. (2020). VisDrone-SOT2020: The Vision Meets Drone Single Object Tracking Challenge Results. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol. 12538. Springer, Cham. https://doi.org/10.1007/978-3-030-66823-5_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66822-8
Online ISBN: 978-3-030-66823-5