Abstract
A self-supervised deep-learning algorithm is designed to estimate the depth of driving scenes. A depth estimation network and a pose estimation network, both built on convolutional neural networks, take video captured by a monocular camera as input and output, respectively, a depth map for each input frame and the pose change between two adjacent frames. View synthesis, i.e., the image reconstruction loss between adjacent frames, serves as the supervision signal for training the networks. The scale inconsistency inherent in monocular depth estimation is addressed with a scale consistency loss, and a weight mask derived from the scale inconsistency is used to mitigate the adverse effects of dynamic and occluded objects in the driving environment. Test results show that the proposed self-supervised depth estimation algorithm based on monocular video achieves high accuracy on the KITTI dataset, nearly matching supervised algorithms.
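The training objective described above can be illustrated with a minimal PyTorch sketch: the target frame's pixels are back-projected through the camera intrinsics using the predicted depth, transformed by the estimated relative pose, and reprojected into the source view; the photometric difference between the warped source frame and the target frame supervises both networks, while a depth-consistency term yields the scale consistency loss and the weight mask. All function and variable names here (`warp_source_to_target`, `self_supervised_losses`, `T_t2s`, etc.) are illustrative assumptions rather than the authors' implementation, and the SSIM term and smoothness regularizer typically used in practice are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def pixel_grid(b, h, w, device):
    """Homogeneous pixel coordinates, shape (B, 3, H*W)."""
    ys, xs = torch.meshgrid(
        torch.arange(h, device=device, dtype=torch.float32),
        torch.arange(w, device=device, dtype=torch.float32),
        indexing="ij",
    )
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(3, -1)
    return grid.unsqueeze(0).expand(b, -1, -1)

def warp_source_to_target(src_img, src_depth, tgt_depth, K, T_t2s):
    """View synthesis: back-project target pixels with tgt_depth, transform by
    the 4x4 target-to-source pose T_t2s, and reproject with intrinsics K.
    Returns the synthesized target image, the depth of the projected points,
    and the source depth sampled at the projected locations."""
    b, _, h, w = tgt_depth.shape
    cam_points = torch.inverse(K) @ pixel_grid(b, h, w, tgt_depth.device)  # rays
    cam_points = cam_points * tgt_depth.view(b, 1, -1)                     # 3D points
    ones = torch.ones(b, 1, h * w, device=tgt_depth.device)
    proj = K @ (T_t2s[:, :3, :] @ torch.cat([cam_points, ones], dim=1))    # (B,3,H*W)

    z = proj[:, 2:3].clamp(min=1e-6)
    uv = proj[:, :2] / z
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    u = 2.0 * uv[:, 0] / (w - 1) - 1.0
    v = 2.0 * uv[:, 1] / (h - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(b, h, w, 2)

    synth_img = F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)
    sampled_src_depth = F.grid_sample(src_depth, grid, padding_mode="border", align_corners=True)
    return synth_img, z.view(b, 1, h, w), sampled_src_depth

def self_supervised_losses(tgt_img, src_img, tgt_depth, src_depth, K, T_t2s):
    """Photometric (view-synthesis) loss weighted by a mask derived from the
    depth/scale-consistency error between adjacent frames."""
    synth, d_proj, d_src = warp_source_to_target(src_img, src_depth, tgt_depth, K, T_t2s)

    # Scale/geometry-consistency loss and the derived weight mask.
    depth_diff = (d_proj - d_src).abs() / (d_proj + d_src).clamp(min=1e-6)
    geometry_loss = depth_diff.mean()
    weight_mask = 1.0 - depth_diff      # small weight on dynamic/occluded pixels

    # Image reconstruction loss between the warped source and the target frame.
    photo_loss = (weight_mask * (synth - tgt_img).abs()).mean()
    return photo_loss, geometry_loss
```

In a training loop the two losses would simply be summed (with weighting factors) and back-propagated through both the depth and the pose network; handling of points that fall behind the camera or outside the image is omitted in this sketch.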