Abstract
Visual algorithms for traffic surveillance systems typically locate and observe traffic by representing all road users with 2D bounding boxes. Such 2D boxes around vehicles are insufficient for deriving accurate real-world positions, while 3D-annotated datasets for training and evaluating 3D detection in traffic surveillance are not available. Therefore, a new dataset for training a 3D detector is required. We propose and validate seven annotation configurations for the automated generation of 3D box annotations, using only the camera calibration, scene information (static vanishing points) and existing 2D annotations. The proposed novel Simple Box method does not require vehicle segmentation and offers a simpler 3D box construction, which assumes a fixed, predefined vehicle width and height. The existing CNN-based KM3D 3D detection model, which directly estimates 3D boxes around vehicles in the camera image, is adapted to traffic surveillance by training it on the newly generated dataset. The KM3D detector trained with the Simple Box configuration achieves the best 3D object detection results, with 51.9% AP3D on this data. The resulting detector estimates accurate 3D boxes up to a distance of 125 m from the camera, with a median middle-point error of only 0.5–1.0 m.
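The Simple Box idea described above can be illustrated with a minimal sketch: back-project the bottom edge of a 2D box onto the ground plane, take the vehicle length along the driving direction (in practice derived from the scene's vanishing points), and extrude a 3D box with a fixed width and height. The homography `H_IMG_TO_GROUND`, the fixed dimensions, and the function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical ground-plane homography (image pixels -> ground metres).
# In the paper this follows from the camera calibration; here it is a toy value.
H_IMG_TO_GROUND = np.array([
    [0.02, 0.0,   -10.0],
    [0.0,  0.05,  -20.0],
    [0.0,  0.001,   1.0],
])

FIXED_WIDTH = 1.8   # assumed vehicle width in metres
FIXED_HEIGHT = 1.5  # assumed vehicle height in metres


def image_to_ground(u, v):
    """Back-project an image point onto the ground plane (z = 0)."""
    p = H_IMG_TO_GROUND @ np.array([u, v, 1.0])
    return p[:2] / p[2]


def simple_box(bbox2d, driving_dir):
    """Build 8 corners of a 3D box from a 2D box and a ground-plane
    driving direction, using fixed width/height (illustrative sketch)."""
    u1, v1, u2, v2 = bbox2d
    # Bottom corners of the 2D box, projected onto the ground plane.
    rear = image_to_ground(u1, v2)
    front = image_to_ground(u2, v2)
    d = np.asarray(driving_dir, dtype=float)
    d /= np.linalg.norm(d)
    length = abs((front - rear) @ d)      # extent along the driving direction
    n = np.array([-d[1], d[0]])           # ground-plane normal to d
    corners = []
    for along in (0.0, length):
        for side in (-FIXED_WIDTH / 2, FIXED_WIDTH / 2):
            base = rear + along * d + side * n
            corners.append((*base, 0.0))            # bottom corner
            corners.append((*base, FIXED_HEIGHT))   # top corner
    return np.array(corners)              # shape (8, 3), metres


box = simple_box((400, 300, 520, 380), driving_dir=(1.0, 0.0))
```

The sketch keeps only the geometric core: with a known ground plane and driving direction, the only free parameter left by the 2D box is the vehicle length, which is why fixing width and height makes the construction so simple.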
M. H. Zwemer and D. Scholte—These authors contributed equally to this work.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zwemer, M.H., Scholte, D., de With, P.H.N. (2023). Semi-automated Generation of Accurate Ground-Truth for 3D Object Detection. In: de Sousa, A.A., et al. Computer Vision, Imaging and Computer Graphics Theory and Applications. VISIGRAPP 2022. Communications in Computer and Information Science, vol 1815. Springer, Cham. https://doi.org/10.1007/978-3-031-45725-8_2
Print ISBN: 978-3-031-45724-1
Online ISBN: 978-3-031-45725-8