VisDrone-MOT2020: The Vision Meets Drone Multiple Object Tracking Challenge Results

Conference paper in Computer Vision – ECCV 2020 Workshops (ECCV 2020)

Abstract

The Vision Meets Drone (VisDrone2020) Multiple Object Tracking (MOT) challenge is the third annual UAV MOT evaluation activity organized by the VisDrone team, in conjunction with the European Conference on Computer Vision (ECCV 2020). VisDrone-MOT2020 consists of 79 challenging video sequences, including 56 videos (\(\sim \)24K frames) for training, 7 videos (\(\sim \)3K frames) for validation, and 17 videos (\(\sim \)6K frames) for evaluation. All frames in these sequences are manually annotated with high-quality bounding boxes. Results of 12 participating MOT algorithms are presented and analyzed in detail. The challenge results, video sequences, and the evaluation toolkit are made available at http://aiskyeye.com/. By holding the VisDrone-MOT2020 challenge, we hope to facilitate future research on and applications of MOT algorithms for drone videos.


Notes

  1. https://github.com/ultralytics/yolov5

References

  1. Al-Shakarji, N.M., Bunyak, F., Seetharaman, G., Palaniappan, K.: Multi-object tracking cascade with multi-step data association and occlusion handling. In: AVSS (2018)

  2. Al-Shakarji, N.M., Seetharaman, G., Bunyak, F., Palaniappan, K.: Robust multi-object tracking with semantic color correlation. In: AVSS (2017)

  3. Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. In: ICCV (2019)

  4. Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: AVSS (2017)

  5. Brasó, G., Leal-Taixé, L.: Learning a neural solver for multiple object tracking. In: CVPR (2020)

  6. Cai, Z., Vasconcelos, N.: Cascade R-CNN: delving into high quality object detection. In: CVPR (2018)

  7. Chang, Z., et al.: Weighted bilinear coding over salient body parts for person re-identification. Neurocomputing 407, 454–464 (2020)

  8. Chen, B., Deng, W., Hu, J.: Mixed high-order attention network for person re-identification. In: ICCV (2019)

  9. Chen, K., et al.: Hybrid task cascade for instance segmentation. In: CVPR (2019)

  10. Chu, P., Fan, H., Tan, C.C., Ling, H.: Online multi-object tracking with instance-aware tracker and dynamic model refreshment. In: WACV (2019)

  11. Chu, P., Ling, H.: FAMNet: joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In: ICCV (2019)

  12. Dave, A., Khurana, T., Tokmakov, P., Schmid, C., Ramanan, D.: TAO: a large-scale benchmark for tracking any object. arXiv (2020)

  13. Dendorfer, P., et al.: MOT20: a benchmark for multi object tracking in crowded scenes. arXiv (2020)

  14. Du, D., et al.: The unmanned aerial vehicle benchmark: object detection and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 375–391. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_23

  15. Evangelidis, G.D., Psarakis, E.Z.: Parametric image alignment using enhanced correlation coefficient maximization. PAMI 30(10), 1858–1865 (2008)

  16. Fan, H., et al.: LaSOT: a high-quality benchmark for large-scale single object tracking. In: CVPR (2019)

  17. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)

  18. Girshick, R.: Fast R-CNN. In: ICCV (2015)

  19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  20. Hsieh, M.R., Lin, Y.L., Hsu, W.H.: Drone-based object counting by spatially regularized regional proposal network. In: ICCV (2017)

  21. Keuper, M., Tang, S., Andres, B., Brox, T., Schiele, B.: Motion segmentation & multiple object tracking by correlation co-clustering. PAMI 42(1), 140–153 (2018)

  22. Kim, C., Li, F., Rehg, J.M.: Multi-object tracking with neural gating using bilinear LSTM. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11212, pp. 208–224. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01237-3_13

  23. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2(1–2), 83–97 (1955)

  24. Li, J., Wang, J., Tian, Q., Gao, W., Zhang, S.: Global-local temporal representations for video person re-identification. In: ICCV (2019)

  25. Li, J., Zhang, S., Huang, T.: Multi-scale 3D convolution network for video based person re-identification. In: AAAI (2019)

  26. Li, S., Yu, H., Hu, H.: Appearance and motion enhancement for video-based person re-identification. In: AAAI (2020)

  27. Li, W., Zhao, R., Xiao, T., Wang, X.: DeepReID: deep filter pairing neural network for person re-identification. In: CVPR (2014)

  28. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)

  29. Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  30. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2

  31. Luo, H., Gu, Y., Liao, X., Lai, S., Jiang, W.: Bag of tricks and a strong baseline for deep person re-identification. In: CVPRW (2019)

  32. Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: MOT16: a benchmark for multi-object tracking. arXiv (2016)

  33. Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 445–461. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_27

  34. Pan, S., Tong, Z., Zhao, Y., Zhao, Z., Su, F., Zhuang, B.: Multi-object tracking hierarchically in visual data taken from drones. In: ICCVW (2019)

  35. Park, E., Liu, W., Russakovsky, O., Deng, J., Li, F.F., Berg, A.: Large Scale Visual Recognition Challenge 2017. http://image-net.org/challenges/LSVRC/2017

  36. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv (2018)

  37. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS (2015)

  38. Robicquet, A., Sadeghian, A., Alahi, A., Savarese, S.: Learning social etiquette: human trajectory understanding in crowded scenes. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 549–565. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_33

  39. Wang, G., Wang, Y., Zhang, H., Gu, R., Hwang, J.: Exploit the connectivity: multi-object tracking with TrackletNet. In: ACM MM, pp. 482–490 (2019)

  40. Wang, G., Yuan, Y., Chen, X., Li, J., Zhou, X.: Learning discriminative features with multiple granularities for person re-identification. In: ACM MM (2018)

  41. Wang, J., et al.: Deep high-resolution representation learning for visual recognition. PAMI (2020)

  42. Wen, L., et al.: UA-DETRAC: a new benchmark and protocol for multi-object detection and tracking. Comput. Vis. Image Underst. 193, 102907 (2020)

  43. Wen, L., Du, D., Li, S., Bian, X., Lyu, S.: Learning non-uniform hypergraph for multi-object tracking. In: AAAI, pp. 8981–8988 (2019)

  44. Wen, L., Li, W., Yan, J., Lei, Z., Yi, D., Li, S.Z.: Multiple target tracking based on undirected hierarchical relation hypergraph. In: CVPR (2014)

  45. Wen, L., Zhang, Y., Bo, L., Shi, H., Zhu, R., et al.: VisDrone-MOT2019: the vision meets drone multiple object tracking challenge results. In: ICCVW, pp. 189–198 (2019)

  46. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: ICIP (2017)

  47. Wu, Y., Lim, J., Yang, M.H.: Online object tracking: a benchmark. In: CVPR (2013)

  48. Wu, Y., He, K.: Group normalization. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8_1

  49. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR (2017)

  50. Yang, Y., Wen, L., Lyu, S., Li, S.Z.: Unsupervised learning of multi-level descriptors for person re-identification. In: AAAI (2017)

  51. Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: A simple baseline for multi-object tracking. arXiv (2020)

  52. Zhao, L., Li, X., Zhuang, Y., Wang, J.: Deeply-learned part-aligned representations for person re-identification. In: ICCV (2017)

  53. Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: CVPR (2013)

  54. Zhou, K., Yang, Y., Cavallaro, A., Xiang, T.: Omni-scale feature learning for person re-identification. In: ICCV (2019)

  55. Zhou, Q., et al.: Graph correspondence transfer for person re-identification. In: AAAI (2018)

  56. Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. arXiv (2020)

  57. Zhu, J., Yang, H., Liu, N., Kim, M., Zhang, W., Yang, M.-H.: Online multi-object tracking with dual matching attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11209, pp. 379–396. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_23

  58. Zhu, P., Wen, L., Du, D., Bian, X., Hu, Q., Ling, H.: Vision meets drones: past, present and future. CoRR abs/2001.06303 (2020)

  59. Zhu, P., et al.: VisDrone-VDT2018: the vision meets drone video detection and tracking challenge results. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11133, pp. 496–518. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11021-5_29


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61876127 and Grant 61732011, and in part by the Natural Science Foundation of Tianjin under Grant 17JCZDJC30800.

Author information

Correspondence to Pengfei Zhu.

A Descriptions of Submitted Trackers

In this appendix, we summarize the 12 trackers submitted to the VisDrone-MOT2020 challenge, ordered by the submission time of their final results.

A.1 Coarse-to-Fine Multi-Class Multi-Object Tracking (COFE)

Yuhang He, Wentao Yu, Jie Han, Xiaopeng Hong, Xing Wei and Yihong Gong

{hyh1379478,yu1034397129,hanjie1997}@stu.xjtu.edu.cn,

{hongxiaopeng,weixing,ygong}@mail.xjtu.edu.cn

COFE is proposed to track multiple targets of different categories under different scenarios. As shown in Fig. 1, the method contains three major modules: 1) multi-class object detection, 2) coarse-category multi-object tracking, and 3) fine-grained trajectory fine-tuning. First, we use a deep convolutional neural network (DCNN) based object detector [6] to detect targets of interest in the image plane, where each detection is denoted by a bounding box with a class label and a confidence score. Second, we track multiple targets in coarse categories, where fine-grained classes (such as van, bus, and car) are merged into coarse categories (e.g., vehicle). For each coarse category, we perform multi-object tracking by exploiting the appearance and motion information of targets: the appearance feature is extracted with a DCNN feature extractor [54], and the motion pattern of each target is modeled by a Kalman filter. Finally, for each obtained trajectory, we fine-tune its fine-grained class label by simple voting and refine the tracking results by post-processing (i.e., bounding box smoothing).

Fig. 1. The framework of COFE.
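The fine-grained label voting in the third module can be made concrete with a minimal Python sketch; the function name and the unweighted majority vote are our assumptions, since the challenge paper does not publish the COFE code:

```python
from collections import Counter

def vote_trajectory_label(per_frame_labels):
    """Assign one fine-grained class to a whole trajectory by majority
    vote over its per-frame detector labels."""
    return Counter(per_frame_labels).most_common(1)[0][0]

# A vehicle trajectory whose per-frame labels flicker between classes:
print(vote_trajectory_label(["car", "car", "van", "car", "bus"]))  # -> car
```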

A.2 Simple Online Multi-Object Tracker (SOMOT)

Zhipeng Luo, Yuehan Yao, Zhenyu Xu, Bin Dong and Wang Sai

{luozp,yaoyh,xuzy,dongb,wangs}@deepblueai.com

Following the separate detection and embedding paradigm, we build a strong detector based on Cascade R-CNN [6] and an embedding model based on the Multiple Granularity Network (MGN) [40]. For the association step, we build a simple online multi-object tracker inspired by DeepSORT [46] and FairMOT [51]. The detector is Cascade R-CNN [6] pretrained on COCO [29]. For the embedding model, a bag of tricks is used to improve the performance of MGN [40]. For association, we initialize a number of tracklets from the estimated boxes in the first frame. In subsequent frames, we associate boxes to the existing (activated) tracklets according to their distances measured by embedding features, updating the appearance features of the tracklets at each time step to handle appearance variations. Then, unmatched activated tracklets and estimated boxes are associated by their Intersection-over-Union (IoU) distance, and finally inactivated tracklets and estimated boxes are associated by IoU distance as well.
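This cascaded association can be sketched as follows; it is a minimal illustration assuming precomputed cost matrices (embedding distances for the first stage, 1 - IoU for the second) and illustrative gate thresholds, not the SOMOT implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gated_match(cost, gate):
    """Hungarian matching that keeps only pairs whose cost passes the gate."""
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]
    matched_rows = {r for r, _ in matches}
    matched_cols = {c for _, c in matches}
    unmatched_rows = [r for r in range(cost.shape[0]) if r not in matched_rows]
    unmatched_cols = [c for c in range(cost.shape[1]) if c not in matched_cols]
    return matches, unmatched_rows, unmatched_cols

def two_stage_association(emb_cost, iou_cost, emb_gate=0.4, iou_gate=0.7):
    """Stage 1: embedding distances; stage 2: IoU distance on the leftovers."""
    matches, un_t, un_d = gated_match(emb_cost, emb_gate)
    if un_t and un_d:
        sub = iou_cost[np.ix_(un_t, un_d)]   # restrict to unmatched pairs
        extra, _, _ = gated_match(sub, iou_gate)
        matches += [(un_t[r], un_d[c]) for r, c in extra]
    return matches  # (tracklet_index, detection_index) pairs
```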

A.3 Position-, Appearance- and Size-Aware Tracker (PAS Tracker)

Daniel Stadler, Lars Wilko Sommer and Arne Schumann

daniel.stadler@kit.edu,{lars.sommer,arne.schumann}@iosb.fraunhofer.de

The PAS tracker follows the tracking-by-detection paradigm. As detectors, we train two Cascade R-CNN models [6] with FPN [28] on the VisDrone2020 MOT train and val sets, using ResNeXt-101 [49] and HRNetV2p-W32 [41] as backbones, respectively. Training is performed on randomly sampled image crops (\(608 \times 608\) pixels) with the SSD [30] data augmentation pipeline. To improve the quality of the detections, we use test-time strategies such as horizontal flipping and multi-scale testing. Additionally, we generate category-specific expert models using weights from different epochs and from the two detectors with different backbones. For associating detections, we build a similarity measure that integrates position, appearance and size information of objects. A constant-velocity model is assumed for motion prediction, and a camera motion compensation model based on Enhanced Correlation Coefficient Maximization [15] is also applied. The appearance of an object is represented by a feature vector computed with a re-identification model from [31] based on ResNet-50 [19]. The association of tracks and new detections is solved with the Hungarian method [23]. Additionally, to remove false positive detections in crowded scenarios, a simple filtering approach considering the overlap of existing tracks and new detections is proposed. Finally, we remove short tracks with fewer than 10 frames and small tracks with a mean size of less than 100 pixels, as most of them are false positives.
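The combined similarity measure can be illustrated with the following sketch; the Gaussian position term, the cosine appearance term mapped to [0, 1], the area-ratio size term, and the equal default weights are all our assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np

def pas_similarity(track, det, w_pos=1.0, w_app=1.0, w_size=1.0):
    """Combine position, appearance and size cues into one similarity score.

    track/det are dicts with keys:
      'center': (x, y) box centre in pixels
      'feat'  : L2-normalised re-identification feature vector
      'size'  : (w, h) box size in pixels
    """
    # Position: Gaussian falloff with centre distance (50 px bandwidth assumed).
    d = np.linalg.norm(np.subtract(track["center"], det["center"]))
    s_pos = np.exp(-d**2 / (2 * 50.0**2))
    # Appearance: cosine similarity of re-id features, mapped to [0, 1].
    s_app = 0.5 * (1.0 + float(np.dot(track["feat"], det["feat"])))
    # Size: ratio of the smaller to the larger box area.
    a_t = track["size"][0] * track["size"][1]
    a_d = det["size"][0] * det["size"][1]
    s_size = min(a_t, a_d) / max(a_t, a_d)
    return (w_pos * s_pos + w_app * s_app + w_size * s_size) / (w_pos + w_app + w_size)
```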

A.4 Simple Online and Realtime Tracking with a Deep Association Metric (DeepSORT)

Zhaoze Zhao

hanjie@smail.swufe.edu.cn

Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking with a focus on simple, effective algorithms. Following [46], we integrate appearance information to improve the performance of SORT. This extension makes it possible to track objects through longer periods of occlusion, effectively reducing the number of identity switches. In the spirit of the original framework, much of the computational complexity is placed in an offline pre-training stage, where a deep association metric is learned on a large-scale person re-identification dataset. During online application, measurement-to-track associations are established using nearest-neighbour queries in visual appearance space.
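A minimal sketch of the nearest-neighbour query in appearance space: each track keeps a gallery of past L2-normalised re-identification features, and a detection's association cost to a track is its smallest cosine distance to any gallery entry (the gallery layout and function name are our assumptions):

```python
import numpy as np

def appearance_cost(track_galleries, det_feats):
    """Cost matrix of smallest cosine distances in appearance space.

    track_galleries: list of (n_i, d) arrays of L2-normalised features,
                     one gallery of past appearances per track.
    det_feats:       (m, d) array of L2-normalised detection features.
    Returns a (num_tracks, m) matrix for the assignment step.
    """
    cost = np.empty((len(track_galleries), det_feats.shape[0]))
    for i, gallery in enumerate(track_galleries):
        # cosine distance = 1 - cosine similarity; keep the best gallery match
        cost[i] = (1.0 - gallery @ det_feats.T).min(axis=0)
    return cost
```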

A.5 YOLOv5-Based V-IOU Tracker (YOLO-TRAC)

Zhizhao Duan, Xi Wu, Duo Xu and Zhen Xie

{Duanai,21725018}@zju.edu.cn,wuxi9410@gmail.com,zjutxz@hotmail.com

YOLO-TRAC is a tracking-by-detection framework. We use YOLOv5 (see Note 1) as the detection network and the V-IOU tracker [4] for tracking.
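For reference, here is a minimal sketch of the plain IOU tracker's per-frame greedy association [4]; V-IOU additionally bridges detection gaps with a visual single-object tracker, which we omit, and the data layout is our assumption:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def iou_track_step(tracks, detections, sigma_iou=0.5):
    """One frame of greedy IoU association.

    tracks:     list of dicts with a 'boxes' list (last entry = current box)
    detections: list of (x1, y1, x2, y2) boxes for the current frame
    """
    remaining = list(detections)
    for t in tracks:
        if not remaining:
            break
        best = max(remaining, key=lambda d: iou(t["boxes"][-1], d))
        if iou(t["boxes"][-1], best) >= sigma_iou:
            t["boxes"].append(best)
            remaining.remove(best)
    # Any leftover detection starts a new track.
    tracks += [{"boxes": [d]} for d in remaining]
    return tracks
```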

A.6 An Improved Multi-Object Tracking Method for VisDrone Videos Based on CenterTrack (VDCT)

Shengwen Li and Yanyun Zhao

{2019140337,zyy}@bupt.edu.cn

VDCT is an improvement of CenterTrack, a point-based framework that combines detection and tracking [56]. Its inputs are the current frame, the previous frame, and the tracked objects in the previous frame; its outputs are the displacements of the tracked objects. Our improvements are as follows. (1) Tracked objects that remain unmatched for up to 20 frames are still allowed to associate with objects detected in the current frame, by properly extending the survival time of tracked objects. (2) Because object motion is continuous, its direction rarely changes abruptly between adjacent frames, so we compute the dot product of displacements across adjacent frames to decide whether to associate objects (see the sketch below). (3) We use the NIOU method [34] to perform non-maximum suppression on vehicle objects. (4) We adopt the hierarchical matching strategy of DeepSORT [46] to handle long occlusions. (5) OSNet [54] is used to extract each trajectory's appearance feature and measure its distance to other trajectories; two trajectories are merged if their distance is small enough. Experimental results show the effectiveness of these improvements.
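Improvement (2) can be sketched as a simple direction-consistency gate; the normalisation and the threshold value are our assumptions, since the text only states that the dot product of displacements is used to decide the association:

```python
import numpy as np

def direction_consistent(prev_disp, new_disp, min_cos=0.0):
    """Gate an association on motion-direction continuity.

    prev_disp: the track's displacement over the previous frame pair, (dx, dy)
    new_disp:  the displacement implied by the candidate association, (dx, dy)
    Accept only if the cosine of the turning angle is above the threshold;
    min_cos=0.0 rejects direction reversals.
    """
    p, n = np.asarray(prev_disp, float), np.asarray(new_disp, float)
    denom = np.linalg.norm(p) * np.linalg.norm(n)
    if denom < 1e-6:          # one of the objects barely moved: accept
        return True
    return float(p @ n) / denom >= min_cos
```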

A.7 Cascade R-CNN Based IOU Tracker (Cascade R-CNN+IOU)

Ting Sun and Xingjie Zhao

sunting9999@stu.xjtu.edu.cn,1243273854@qq.com

We use Cascade R-CNN [6] as the detector with four improvements: (1) we use Group Normalization [48] instead of Batch Normalization; (2) we use online hard example mining to select positive and negative samples; (3) we test at multiple scales; (4) we train models with two stronger backbones and ensemble them. Detections are then associated with the IOU tracker [4] (sketched in A.5 above).

A.8 Hybrid Task Cascade Based IOU Tracker (HTC+IOU)

Ting Sun, Xingjie Zhao and Guizhong Liu

sunting9999@stu.xjtu.edu.cn,1243273854@qq.com

We use Hybrid Task Cascade for instance segmentation [9] as the detector, with the same four improvements as in A.7: (1) Group Normalization [48] instead of Batch Normalization; (2) online hard example mining to select positive and negative samples; (3) multi-scale testing; (4) training with two stronger backbones and ensembling them. Detections are then associated with the IOU tracker [4].

A.9 Multi-Object Tracking Based on HRNet (HR-GNN)

Zheng Yang and Kaojin Zhu

151776257@qq.com,1320531351@qq.com

HR-GNN is built on a detector with HRNet [41] as the backbone. Tracking results are then generated by using a graph neural network (GNN) to analyze and associate the detection results.

A.10 Multi-Object Tracking with TrackletNet (TNT)

Haritha V, Melvin Kuriakose, Hrishikesh PS and Linu Shine

vakkatharitha@gmail.com

TNT is based on the work of [39], which merges temporal and appearance information in a unified framework. We learn appearance similarity among tracklets with a graph model, using CNN features and intersection-over-union (IOU) with epipolar constraints to compensate for camera movement between adjacent frames. Finally, the tracklets are clustered into groups, resulting in trajectories with individual object IDs.

A.11 A Simple Baseline for One-Shot Multi-Object Tracking (anchor-free_mot)

Min Yao and Libo Zhang

libo@iscas.ac.cn

The anchor-free_mot method is based on FairMOT [51]. Specifically, we use an encoder-decoder network to extract feature maps. Two simple parallel heads then predict the bounding boxes and re-ID features of the targets, respectively. Notably, targets are represented as points, following the anchor-free object detection paradigm.
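To make the point-based representation concrete, the following NumPy sketch reads candidate object centres off a predicted heatmap, in the spirit of anchor-free, point-based detection; the threshold, the top-k value, and the omission of local peak suppression (real implementations typically apply a 3x3 max-pool first) are simplifications of what FairMOT actually does:

```python
import numpy as np

def topk_centers(heatmap, k=100, score_thr=0.3):
    """Pick up to k candidate object centres from a per-class heatmap.

    heatmap: (H, W) array of centre scores in [0, 1].
    Returns (row, col, score) tuples above the threshold, best first.
    """
    flat = heatmap.ravel()
    k = min(k, flat.size)
    idx = np.argpartition(flat, -k)[-k:]      # unordered top-k candidates
    idx = idx[np.argsort(flat[idx])[::-1]]    # sort by descending score
    w = heatmap.shape[1]
    return [(int(i) // w, int(i) % w, float(flat[i]))
            for i in idx if flat[i] >= score_thr]
```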

A.12 Semantic Color Correlation Tracker (SCTrack)

Noor M. Al-Shakarji, Filiz Bunyak, Guna Seetharaman and Kannappan Palaniappan

{nmahyd,bunyak,palaniappank}@mail.missouri.edu,

gunasekaran.seetharaman@rl.af.mil

SCTrack is a time-efficient detection-based multi-object tracking method. It uses a three-step cascaded data association scheme that combines a fast, spatial-distance-only short-term association, a robust tracklet-linking step using discriminative object appearance models, and an explicit occlusion-handling unit relying not only on tracked objects' motion patterns but also on environmental constraints such as the presence of potential occluders in the scene. Details can be found in [1, 2].
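The three-step cascade can be summarised as a skeleton; the three callables stand in for the components named above, and their (tracks, dets) -> (matches, tracks_left, dets_left) signature is our assumption for illustration:

```python
def sctrack_cascade(tracks, dets, spatial_step, appearance_step, occlusion_step):
    """Skeleton of a three-step cascaded data association in the spirit
    of SCTrack [1, 2]."""
    # Step 1: fast short-term association using spatial distance only.
    matched, tracks, dets = spatial_step(tracks, dets)
    # Step 2: link remaining tracklets with discriminative appearance models.
    linked, tracks, dets = appearance_step(tracks, dets)
    # Step 3: explicit occlusion handling from motion patterns and scene
    # constraints (e.g. the presence of potential occluders).
    recovered, _, _ = occlusion_step(tracks, dets)
    return matched + linked + recovered
```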


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Fan, H. et al. (2020). VisDrone-MOT2020: The Vision Meets Drone Multiple Object Tracking Challenge Results. In: Bartoli, A., Fusiello, A. (eds.) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol. 12538. Springer, Cham. https://doi.org/10.1007/978-3-030-66823-5_43

  • DOI: https://doi.org/10.1007/978-3-030-66823-5_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66822-8

  • Online ISBN: 978-3-030-66823-5

  • eBook Packages: Computer Science (R0)
