Abstract
Visual algorithms for traffic surveillance systems typically locate and observe traffic by representing all road users with 2D bounding boxes. Such 2D boxes around vehicles are insufficient for deriving accurate real-world positions, while 3D-annotated datasets for training and evaluating 3D detection in traffic surveillance are not available. Therefore, a new dataset for training a 3D detector is required. We propose and validate seven annotation configurations for the automated generation of 3D box annotations, using only the camera calibration, scene information (static vanishing points) and existing 2D annotations. The proposed novel Simple Box method does not require vehicle segmentation and offers a simpler 3D box construction, which assumes a fixed, predefined vehicle width and height. The existing CNN-based KM3D 3D detection model, which directly estimates 3D boxes around vehicles in the camera image, is adapted to traffic surveillance by training it on the newly generated dataset. The KM3D detector trained with the Simple Box configuration achieves the best 3D object detection results, with 51.9% AP3D on this data. The resulting detector estimates accurate 3D boxes up to a distance of 125 m from the camera, with a median middle-point error of only 0.5–1.0 m.
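The Simple Box idea described above can be illustrated with a minimal sketch: back-project the bottom edge of a 2D box onto the ground plane, take the vehicle length along the driving direction (in practice derived from the scene's vanishing points), and extrude a 3D box with a fixed width and height. The homography `H_IMG_TO_GROUND`, the fixed dimensions, and the function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical ground-plane homography (image pixels -> ground metres).
# In the paper this follows from the camera calibration; here it is a toy value.
H_IMG_TO_GROUND = np.array([
    [0.02, 0.0,   -10.0],
    [0.0,  0.05,  -20.0],
    [0.0,  0.001,   1.0],
])

FIXED_WIDTH = 1.8   # assumed vehicle width in metres
FIXED_HEIGHT = 1.5  # assumed vehicle height in metres


def image_to_ground(u, v):
    """Back-project an image point onto the ground plane (z = 0)."""
    p = H_IMG_TO_GROUND @ np.array([u, v, 1.0])
    return p[:2] / p[2]


def simple_box(bbox2d, driving_dir):
    """Build 8 corners of a 3D box from a 2D box and a ground-plane
    driving direction, using fixed width/height (illustrative sketch)."""
    u1, v1, u2, v2 = bbox2d
    # Bottom corners of the 2D box, projected onto the ground plane.
    rear = image_to_ground(u1, v2)
    front = image_to_ground(u2, v2)
    d = np.asarray(driving_dir, dtype=float)
    d /= np.linalg.norm(d)
    length = abs((front - rear) @ d)      # extent along the driving direction
    n = np.array([-d[1], d[0]])           # ground-plane normal to d
    corners = []
    for along in (0.0, length):
        for side in (-FIXED_WIDTH / 2, FIXED_WIDTH / 2):
            base = rear + along * d + side * n
            corners.append((*base, 0.0))            # bottom corner
            corners.append((*base, FIXED_HEIGHT))   # top corner
    return np.array(corners)              # shape (8, 3), metres


box = simple_box((400, 300, 520, 380), driving_dir=(1.0, 0.0))
```

The sketch keeps only the geometric core: with a known ground plane and driving direction, the only free parameter left by the 2D box is the vehicle length, which is why fixing width and height makes the construction so simple.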
M. H. Zwemer and D. Scholte—These authors contributed equally to this work.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zwemer, M.H., Scholte, D., de With, P.H.N. (2023). Semi-automated Generation of Accurate Ground-Truth for 3D Object Detection. In: de Sousa, A.A., et al. Computer Vision, Imaging and Computer Graphics Theory and Applications. VISIGRAPP 2022. Communications in Computer and Information Science, vol 1815. Springer, Cham. https://doi.org/10.1007/978-3-031-45725-8_2
Print ISBN: 978-3-031-45724-1
Online ISBN: 978-3-031-45725-8