Abstract
Monocular depth estimation plays an increasingly important role in 3D scene understanding and has attracted growing attention in computer vision in recent years. Recent deep-learning-based methods have achieved strong performance by exploring various network architectures. However, compared with designing larger and more complex architectures, leveraging scene geometry relations to improve monocular depth estimation models has been less studied. To further exploit scene geometry relations, we propose a geometry-aware constraint that uses relative order information to improve the performance of monocular depth estimation models. Specifically, we first design a relative order descriptor (ROD) that describes the relative order at a single scene location. Then, based on the ROD, we build a relative order map (ROM) that represents the relative order information of the whole scene. Finally, we present a loss term, the relative order loss (ROL), which relies on the ROM to supervise the training of the monocular depth estimation model. The proposed method helps monocular depth estimation models predict more accurate depth maps. Moreover, with the geometry constraint from our method, the predictions better preserve high-quality scene structure. We conduct extensive experiments on the popular NYU Depth V2 and KITTI datasets, and the experimental results demonstrate the effectiveness of our method.
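The abstract does not give the exact form of the ROD, ROM or ROL, so the following is only an illustrative sketch under stated assumptions, not the authors' formulation. It assumes that the descriptor at a pixel records, for a few fixed neighbour offsets, whether the neighbour is farther (+1), nearer (-1) or at a similar depth (0), that the map stacks these descriptors over the whole image, and that a simple disagreement rate between the predicted and ground-truth maps stands in for the loss. The names `OFFSETS`, `relative_order_map` and `relative_order_disagreement` are hypothetical, and the score is not differentiable, so a training loss would need a smooth surrogate.

```python
import numpy as np

# Hypothetical neighbour offsets (dy, dx); the sampling pattern used in the
# paper is not stated in the abstract.
OFFSETS = [(0, 1), (1, 0), (0, -1), (-1, 0)]


def relative_order_map(depth, offsets=OFFSETS, tol=0.0):
    """Assumed ROM: one channel per offset, holding the per-pixel descriptor.

    Each entry is +1 if the neighbour is farther than the pixel, -1 if it is
    nearer, and 0 if the depth difference is within `tol`.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    pad = max(max(abs(dy), abs(dx)) for dy, dx in offsets)
    # Edge padding makes out-of-image neighbours equal to the border pixel,
    # so their relative order is 0 rather than wrapping around the image.
    padded = np.pad(depth, pad, mode="edge")
    rom = np.zeros((len(offsets), h, w), dtype=np.int8)
    for k, (dy, dx) in enumerate(offsets):
        neighbour = padded[pad + dy:pad + dy + h, pad + dx:pad + dx + w]
        diff = neighbour - depth
        rom[k] = np.sign(diff) * (np.abs(diff) > tol)
    return rom


def relative_order_disagreement(pred, gt, offsets=OFFSETS, tol=0.0):
    """Fraction of (pixel, offset) pairs whose relative order differs."""
    rom_pred = relative_order_map(pred, offsets, tol)
    rom_gt = relative_order_map(gt, offsets, tol)
    return float(np.mean(rom_pred != rom_gt))
```

Under these assumptions the score depends only on depth ordering, so any strictly increasing rescaling of a prediction leaves it unchanged; this is the sense in which such a term constrains scene structure rather than absolute depth values.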
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig5_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig6_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig7_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig8_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig9_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs10489-023-04851-7/MediaObjects/10489_2023_4851_Fig10_HTML.png)
Data Availability
We use only data from the public datasets mentioned in Section 4; no self-generated or self-collected data are used.
Ethics declarations
Conflicts of interest
No conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication.
About this article
Cite this article
Liu, C., Zuo, W., Yang, G. et al. Relative order constraint for monocular depth estimation. Appl Intell 53, 24804–24821 (2023). https://doi.org/10.1007/s10489-023-04851-7
DOI: https://doi.org/10.1007/s10489-023-04851-7