
The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline

International Journal of Computer Vision

Abstract

With the increasing popularity of Unmanned Aerial Vehicles (UAVs) in computer vision applications, intelligent UAV video analysis has recently attracted the attention of a growing number of researchers. To facilitate research in this field, this paper presents a UAV dataset of 100 videos featuring approximately 2700 vehicles recorded under unconstrained conditions, with 840k manually annotated bounding boxes. These videos were recorded in complex real-world scenarios and pose significant new challenges to existing object detection and tracking methods, such as complex scenes, high object density, small objects, and large camera motion. These challenges motivated us to define a benchmark for three fundamental computer vision tasks on our UAV dataset: object detection, single object tracking (SOT), and multiple object tracking (MOT). Specifically, our UAV benchmark enables evaluation and detailed analysis of state-of-the-art detection and tracking methods on the proposed dataset. Furthermore, we propose a novel approach based on a Context-aware Multi-task Siamese Network (CMSN) model, which exploits new cues in UAV videos by judging the degree of consistency between objects and their contexts, and which can be used for both SOT and MOT. The experimental results demonstrate that our model makes tracking more robust in both SOT and MOT, and show that current tracking and detection methods have limitations on the proposed UAV benchmark, so further research is indeed needed.
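The core CMSN idea, as the abstract describes it, is to score how consistent a candidate object is with its surrounding context. Below is a minimal, hypothetical PyTorch sketch of that notion: a shared Siamese embedding scores target/context crop pairs by cosine similarity. The layer sizes, crop sizes, and similarity choice are illustrative assumptions and do not reproduce the paper's actual CMSN architecture or training losses.

```python
# Hypothetical sketch only: a shared Siamese embedding that scores the
# consistency between an object crop and its surrounding context crop.
# Layer sizes, crop sizes, and the cosine-similarity score are assumptions
# for illustration; they are not the paper's actual CMSN architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEmbed(nn.Module):
    """Shared backbone that maps an image crop to a unit-length embedding."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # global average pooling
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x).flatten(1)        # (B, 64)
        return F.normalize(self.fc(f), dim=1)  # (B, dim), L2-normalized

def consistency_score(embed: SiameseEmbed,
                      target: torch.Tensor,
                      context: torch.Tensor) -> torch.Tensor:
    """Cosine similarity in [-1, 1]: higher means the candidate object
    crop is more consistent with its surrounding context crop."""
    return (embed(target) * embed(context)).sum(dim=1)

# Usage: score a batch of candidate/context crop pairs (dummy data).
embed = SiameseEmbed()
targets = torch.randn(4, 3, 64, 64)   # candidate object crops
contexts = torch.randn(4, 3, 64, 64)  # corresponding context crops
print(consistency_score(embed, targets, contexts))
```

In a tracker, such a score could serve as an additional association cue alongside appearance similarity, down-weighting candidates whose context disagrees with the target's history.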


Notes

  1. We use a DJI Inspire 2 to collect the videos. More information about the UAV platform can be found at http://www.dji.com/inspire-2.

  2. Our dataset is available for download at https://sites.google.com/site/daviddo0323/.

  3. http://carlvondrick.com/vatic/

  4. The detection results are taken from http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=2d.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61620106009, 61836002, U1636214, 61931008, 61772494, and 61976069; in part by the Key Research Program of Frontier Sciences, CAS, under Grant QYZDJ-SSW-SYS013; in part by the Italy-China collaboration project TALENT under Grant 2018YFE0118400; in part by the University of Chinese Academy of Sciences; in part by the Youth Innovation Promotion Association CAS; in part by ARO Grant W911NF-15-1-0290; and in part by Faculty Research Gift Awards from NEC Laboratories of America and Blippar.

Author information

Correspondence to Guorong Li or Qingming Huang.

Additional information

Communicated by Andreas Geiger.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Yu, H., Li, G., Zhang, W. et al. The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline. Int J Comput Vis 128, 1141–1159 (2020). https://doi.org/10.1007/s11263-019-01266-1
