Semi-Supervised Monocular Depth Estimation Based on Semantic Supervision

Journal of Intelligent & Robotic Systems

Abstract

Unsupervised learning is a promising strategy for monocular depth estimation; such methods are mainly self-supervised, computing a view reconstruction loss from stereo pairs or monocular sequences. However, most existing works consider only geometric information during training and ignore semantics. We propose SE-Net, a semantic monocular depth estimation network that estimates depth from semantic information and video sequences. The framework as a whole is semi-supervised because it uses labelled semantic ground-truth data. Exploiting the structural consistency between the semantically segmented image and the depth map, we first perform semantic segmentation on the image and then use the semantic labels to guide the construction of the depth estimation network. Experiments on the KITTI dataset show that learning semantic information from images effectively improves the accuracy of monocular depth estimation, and that SE-Net outperforms state-of-the-art methods in depth estimation accuracy.
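
The abstract does not state the training objective in closed form, so the following is only a minimal NumPy sketch of the general idea, not SE-Net's actual loss. It combines an L1 view reconstruction term with a smoothness term that is switched off across semantic boundaries. The function names, the L1 form of the photometric term, the label-agreement mask, and the `smooth_weight` value are all assumptions made for illustration.

```python
import numpy as np


def photometric_l1(target, reconstructed):
    """Mean absolute difference between the target view and its reconstruction."""
    return float(np.mean(np.abs(target - reconstructed)))


def semantic_smoothness(depth, labels):
    """Penalize depth changes between neighbouring pixels that share a label.

    Assumption for illustration: depth may jump freely across semantic
    boundaries but should vary slowly inside a single segment.
    """
    # Absolute depth differences between horizontally / vertically adjacent pixels.
    dx = np.abs(depth[:, 1:] - depth[:, :-1])
    dy = np.abs(depth[1:, :] - depth[:-1, :])
    # 1.0 where both neighbours carry the same semantic label, 0.0 across a boundary.
    same_x = (labels[:, 1:] == labels[:, :-1]).astype(depth.dtype)
    same_y = (labels[1:, :] == labels[:-1, :]).astype(depth.dtype)
    return float(np.mean(dx * same_x) + np.mean(dy * same_y))


def total_loss(target, reconstructed, depth, labels, smooth_weight=0.1):
    """Hypothetical combined objective: reconstruction + weighted semantic smoothness."""
    return photometric_l1(target, reconstructed) + smooth_weight * semantic_smoothness(depth, labels)


if __name__ == "__main__":
    # Toy 2x4 scene: left half is one class (label 0), right half another (label 1).
    labels = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
    depth = np.array([[1.0, 1.0, 5.0, 5.0], [1.0, 1.0, 5.0, 5.0]])
    target = np.ones((2, 4))
    reconstructed = np.full((2, 4), 1.5)
    print(total_loss(target, reconstructed, depth, labels))
```

In this toy scene the only depth jump coincides with the label boundary, so the smoothness term contributes nothing and the loss is driven entirely by the reconstruction error. In a real pipeline the reconstructed view would come from warping a neighbouring frame with the predicted depth and camera pose, which this sketch omits.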

Author information

Corresponding author

Correspondence to Min Yue.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yue, M., Fu, G., Wu, M. et al. Semi-Supervised Monocular Depth Estimation Based on Semantic Supervision. J Intell Robot Syst 100, 455–463 (2020). https://doi.org/10.1007/s10846-020-01205-0

  • DOI: https://doi.org/10.1007/s10846-020-01205-0
