Efficient unsupervised monocular depth estimation using attention guided generative adversarial network

  • Special Issue Paper
  • Published in: Journal of Real-Time Image Processing

Abstract

Deep-learning-based approaches to depth estimation are advancing rapidly, outperforming traditional computer vision methods across many domains. However, for many critical applications, cutting-edge deep-learning approaches impose too much computational overhead to be operationally feasible. This is especially true for depth-estimation methods that leverage adversarial learning, such as generative adversarial networks (GANs). In this paper, we propose a computationally efficient GAN for unsupervised monocular depth estimation that uses factorized convolutions and an attention mechanism. Specifically, we incorporate the Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions (EESP) module of ESPNetv2 into the network, reducing the number of model parameters, FLOPs, and inference time by \(22.8\%\), \(35.37\%\), and \(31.5\%\), respectively, compared with the previous unsupervised GAN approach. Finally, we propose a context-aware attention architecture to generate detail-oriented depth images. We demonstrate the superior performance of our proposed model on two benchmark datasets, KITTI and Cityscapes, and provide additional qualitative examples (Fig. 8) at the end of this paper.
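To see why the depth-wise separable factorization inside the EESP module shrinks the model, it helps to compare parameter counts. The sketch below is illustrative only: the kernel and channel sizes are hypothetical, not the paper's actual layer dimensions, and it shows the basic factorization rather than the full dilated spatial pyramid.

```python
def conv_params(k, c_in, c_out):
    # Standard k x k convolution: every output channel mixes all
    # input channels over a k x k window (bias terms omitted).
    return k * k * c_in * c_out

def dws_conv_params(k, c_in, c_out):
    # Depth-wise separable factorization: one k x k filter per input
    # channel (depth-wise), followed by a 1 x 1 point-wise convolution
    # to mix channels. EESP further combines such filters with varying
    # dilation rates across a spatial pyramid.
    return k * k * c_in + c_in * c_out

# Hypothetical layer: 3 x 3 kernels, 256 input and 256 output channels.
standard = conv_params(3, 256, 256)        # 589,824 parameters
factorized = dws_conv_params(3, 256, 256)  # 67,840 parameters
savings = 1 - factorized / standard        # ~88.5% fewer parameters
```

The same factorization reduces FLOPs proportionally at each layer, which is consistent with the overall parameter, FLOP, and inference-time reductions reported in the abstract.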

References

  1. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in neural information processing systems, pp. 2366–2374 (2014)

  2. Xu, D., Wang, W., Tang, H., Liu, H., Sebe, N., Ricci, E.: Structured attention guided convolutional neural fields for monocular depth estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3917–3925 (2018)

  3. Godard, C., Aodha, O. M., Brostow, G. J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 270–279 (2017)

  4. Pilzer, A., Xu, D., Puscas, M., Ricci, E., Sebe, N.: Unsupervised adversarial depth estimation using cycled generative networks. In: 2018 International conference on 3D vision (3DV), IEEE, pp. 587–595 (2018)

  5. Mehta, S., Rastegari, M., Shapiro, L., Hajishirzi, H.: Espnetv2: a light-weight, power efficient, and general purpose convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9190–9200 (2019)

  6. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47(1–3), 7–42 (2002)

  7. Flynn, J., Neulander, I., Philbin, J., Snavely, N.: Deepstereo: learning to predict new views from the world’s imagery. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5515–5524 (2016)

  8. Saxena, A., Sun, M., Ng, A. Y.: Learning 3-d scene structure from a single still image. In: 2007 IEEE 11th International conference on computer vision, IEEE, pp. 1–8 (2007)

  9. Konrad, J., Wang, M., Ishwar, P., Wu, C., Mukherjee, D.: Learning-based, automatic 2d-to-3d image and video conversion. IEEE Trans. Image Process. 22(9), 3485–3496 (2013)

  10. Hoiem, D., Efros, A.A., Hebert, M.: Recovering surface layout from an image. Int. J. Comput. Vis. 75(1), 151–172 (2007)

  11. Chen, R., Mahmood, F., Yuille, A., Durr, N. J.: Rethinking monocular depth estimation with adversarial training. arXiv preprint. arXiv:1808.07528 (2018)

  12. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European conference on computer vision, Springer, pp. 746–760 (2012)

  13. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition, IEEE, pp. 3354–3361 (2012)

  14. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223 (2016)

  15. Li, B., Shen, C., Dai, Y., Van Den Hengel, A., He, M.: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1119–1127 (2015)

  16. Cao, Y., Wu, Z., Shen, C.: Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol. 28(11), 3174–3182 (2017)

  17. Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A. L.: Towards unified depth and semantic prediction from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2800–2809 (2015)

  18. Xu, D., Ouyang, W., Wang, X., Sebe, N.: Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 675–684 (2018)

  19. Chen, C., Wei, J., Peng, C., Zhang, W., Qin, H.: Improved saliency detection in rgb-d images using two-phase depth estimation and selective deep fusion. IEEE Trans. Image Process. 29, 4296–4307 (2020)

  20. Zhan, H., Garg, R., Weerasekera, C. S., Li, K., Agarwal, H., Reid, I.: Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 340–349 (2018)

  21. Wang, C., Buenaposada, J. M., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2022–2030 (2018)

  22. Garg, R., Bg, V. K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: geometry to the rescue. In: European conference on computer vision, Springer, pp. 740–756 (2016)

  23. Zhou, T., Krahenbuhl, P., Aubry, M., Huang, Q., Efros, A. A.: Learning dense correspondence via 3d-guided cycle consistency. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 117–126 (2016)

  24. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems, pp. 2672–2680 (2014)

  25. Kundu, J. N., Uppala, P. K., Pahuja, A., Babu, R. V.: Adadepth: unsupervised content congruent adaptation for depth estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2656–2665 (2018)

  26. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint. arXiv:1411.1784 (2014)

  27. Zhu, J.-Y., Park, T., Isola, P., Efros, A. A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp. 2223–2232 (2017)

  28. Kumar, A. C. S., Bhandarkar, S. M., Prasad, M.: Monocular depth prediction using generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 300–308 (2018)

  29. Almalioglu, Y., Saputra, M. R. U., de Gusmao, P. P., Markham, A., Trigoni, N.: Ganvo: unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks. In: 2019 International conference on robotics and automation (ICRA), IEEE, pp. 5474–5480 (2019)

  30. Hao, Z., Li, Y., You, S., Lu, F.: Detail preserving depth estimation from a single image using attention guided networks. In: 2018 International conference on 3D vision (3DV), IEEE, pp. 304–313 (2018)

  31. Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. In: International conference on machine learning, PMLR, pp. 7354–7363 (2019)

  32. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint. arXiv:1802.05957 (2018)

  33. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)

  34. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)

  35. Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE international conference on computer vision, pp. 1395–1403 (2015)

  36. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint. arXiv:1603.04467 (2016)

  37. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint. arXiv:1412.6980 (2014)

  38. Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 38(10), 2024–2039 (2016)

  39. Xu, D., Ricci, E., Ouyang, W., Wang, X., Sebe, N.: Monocular depth estimation using multi-scale continuous crfs as sequential deep networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(6), 1426–1440 (2019)

  40. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1851–1858 (2017)

Acknowledgements

This work is partially supported by the National Science Foundation (NSF) under Grant No. 1910844.

Author information

Corresponding author

Correspondence to Sumanta Bhattacharyya.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bhattacharyya, S., Shen, J., Welch, S. et al. Efficient unsupervised monocular depth estimation using attention guided generative adversarial network. J Real-Time Image Proc 18, 1357–1368 (2021). https://doi.org/10.1007/s11554-021-01092-0
