Abstract
Monocular depth estimation is a critical component of context-aware scene understanding: it takes an image from a single viewpoint as input and directly predicts a depth value for each pixel. However, predicting accurate object boundaries without copying texture detail is difficult, so predicted depth maps often miss small objects and blur object edges. In this paper, we propose a monocular depth estimation method based on an improved U-Net encoder-decoder architecture. We introduce a new training loss term, the edge-guide loss, which pushes the network to focus on object edges and thereby improves depth accuracy for small objects and boundaries. The encoder is built on DenseNet-169, and the decoder combines 2× bilinear up-sampling, skip-connections, and hybrid dilated convolution; the skip-connections pass multi-scale feature maps from the encoder to the decoder. The full training objective combines the edge-guide loss with three basic loss terms. We evaluate our method on the NYU Depth V2 dataset. The experimental results show that the proposed network produces depth maps from a single RGB image with sharp boundaries and better recovery of small-object depth, and that it outperforms state-of-the-art approaches in both visual quality and objective metrics.
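The abstract does not include an implementation, so the PyTorch sketch below is only one plausible reading of the two ideas it names: an edge-weighted loss term and a decoder stage built from 2× bilinear up-sampling, a skip-connection, and hybrid dilated convolution. The Sobel-based edge map, the weighting scheme with its `alpha` parameter, and the channel sizes and dilation rates are illustrative assumptions, not the paper's published formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sobel_edges(depth):
    """Approximate edge magnitude of a depth map (N, 1, H, W) with Sobel filters."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=depth.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)  # vertical-gradient kernel from the horizontal one
    gx = F.conv2d(depth, kx, padding=1)
    gy = F.conv2d(depth, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)


def edge_guide_loss(pred, gt, alpha=2.0):
    """Per-pixel L1 loss re-weighted so that pixels on ground-truth depth edges
    contribute more, pushing the network to sharpen object boundaries.
    `alpha` (assumed here) controls how strongly edge pixels are emphasized."""
    edges = sobel_edges(gt)
    weight = 1.0 + alpha * edges / (edges.amax(dim=(2, 3), keepdim=True) + 1e-6)
    return (weight * (pred - gt).abs()).mean()


class UpBlock(nn.Module):
    """One decoder stage: 2x bilinear up-sampling, concatenation with the
    encoder skip feature, then a hybrid-dilated-convolution stack. The
    dilation rates (1, 2, 5) follow a common HDC recipe and are assumptions."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(in_ch + skip_ch if i == 0 else out_ch, out_ch,
                          kernel_size=3, padding=d, dilation=d),
                nn.LeakyReLU(0.2),
            )
            for i, d in enumerate((1, 2, 5))
        ])

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)
        x = torch.cat([x, skip], dim=1)  # fuse multi-scale encoder features
        return self.convs(x)
```

As a rough usage pattern, the decoder would stack several such `UpBlock`s, each consuming the matching DenseNet-169 skip feature, and `edge_guide_loss` would be summed with the three basic loss terms during training.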
Acknowledgements
This work was supported by the Key R&D Program Project of Shaanxi Province, China (Grant Number 2020NY-144). The authors appreciate the funding organization for its financial support. The authors would also like to thank all the authors cited in this article and the anonymous reviewers for their helpful comments and suggestions.
Funding
The research leading to these results received funding from the Key R&D Program Project of Shaanxi Province, China (Grant Number 2020NY-144).
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
He, M., Gao, Y. & Long, Y. Single image depth estimation using improved U-Net and edge-guide loss. Multimed Tools Appl 83, 84619–84637 (2024). https://doi.org/10.1007/s11042-024-19235-3