Abstract
In this work, we present a new dataset for monocular depth estimation, created by extracting images, dense depth maps, and odometer data from a realistic video game simulation, Euro Truck Simulator 2\(^\textrm{TM}\). We use the dataset to train state-of-the-art depth estimation models in both supervised and unsupervised settings and evaluate them on real-world sequences. Our results demonstrate that models trained exclusively on synthetic data achieve satisfactory performance in the real domain. The quantitative evaluation sheds light on possible causes of the domain gap in monocular depth estimation. Specifically, we discuss the effect of coarse-grained ground-truth depth maps in contrast to fine-grained depth estimates. The dataset and the code for data extraction and experiments are released as open source.
This research work has been supported by project TED2021-129162B-C22, funded by the Recovery and Resilience Facility program from the NextGenerationEU and the Spanish Research Agency (Agencia Estatal de Investigación); and PID2021-128362OB-I00, funded by the Spanish Plan for Scientific and Technical Research and Innovation of the Spanish Research Agency.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
María-Arribas, D., Cuesta-Infante, A., Pantrigo, J.J. (2023). The ETS2 Dataset, Synthetic Data from Video Games for Monocular Depth Estimation. In: Pertusa, A., Gallego, A.J., Sánchez, J.A., Domingues, I. (eds) Pattern Recognition and Image Analysis. IbPRIA 2023. Lecture Notes in Computer Science, vol 14062. Springer, Cham. https://doi.org/10.1007/978-3-031-36616-1_30
DOI: https://doi.org/10.1007/978-3-031-36616-1_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-36615-4
Online ISBN: 978-3-031-36616-1
eBook Packages: Computer Science (R0)