A self-supervised method of single-image depth estimation by feeding forward information using max-pooling layers

  • Original Article
  • Published in: The Visual Computer

Abstract

We propose an encoder–decoder CNN framework that predicts depth from a single image in a self-supervised manner. To this end, we design three kinds of encoders based on recent advances in deep neural networks, together with one decoder that generates multiscale predictions, and we devise eight loss functions on top of this framework to evaluate its performance. For training, the CNN takes rectified stereo image pairs as input and learns multiscale disparity maps by reconstructing one view of the pair from the other. For testing, the CNN estimates accurate depth from only a single image. We validate our framework on two public datasets against state-of-the-art methods as well as our own variants, evaluating different encoder–decoder architectures and loss functions to find the best combination. The results show that our method performs very well for single-image depth estimation without ground-truth supervision.

Acknowledgements

This work is supported by the National Key Research and Development Program of China (Nos. 2018YFC0309100 and 2018YFC0309104), the Six Talent Peaks Project in Jiangsu Province, the Scholarship Program of the Jiangsu Provincial Government, and Jiangsu University of Science and Technology.

Author information

Corresponding author

Correspondence to Jinlong Shi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shi, J., Sun, Y., Bai, S. et al. A self-supervised method of single-image depth estimation by feeding forward information using max-pooling layers. Vis Comput 37, 815–829 (2021). https://doi.org/10.1007/s00371-020-01832-6
