Efficient unsupervised monocular depth estimation using attention guided generative adversarial network

  • Special Issue Paper
  • Published in: Journal of Real-Time Image Processing

Abstract

Deep-learning-based approaches to depth estimation are advancing rapidly, outperforming traditional computer vision methods across many domains. However, for many critical applications, cutting-edge deep-learning approaches impose too much computational overhead to be operationally feasible. This is especially true for depth-estimation methods that leverage adversarial learning, such as generative adversarial networks (GANs). In this paper, we propose a computationally efficient GAN for unsupervised monocular depth estimation that uses factorized convolutions and an attention mechanism. Specifically, we incorporate the Extremely Efficient Spatial Pyramid of Depth-wise Dilated Separable Convolutions (EESP) module of ESPNetv2 into the network, reducing the number of model parameters, FLOPs, and inference time by \(22.8\%\), \(35.37\%\), and \(31.5\%\), respectively, compared with the previous unsupervised GAN approach. Finally, we propose a context-aware attention architecture to generate detail-oriented depth images. We demonstrate the superior performance of our proposed model on two benchmark datasets, KITTI and Cityscapes, and provide additional qualitative examples (Fig. 8) at the end of this paper.
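To see why the depth-wise separable factorization inside the EESP module shrinks the model, it helps to compare parameter counts. The sketch below is illustrative only: the kernel and channel sizes are hypothetical, not the paper's actual layer dimensions, and it shows the basic factorization rather than the full dilated spatial pyramid.

```python
def conv_params(k, c_in, c_out):
    # Standard k x k convolution: every output channel mixes all
    # input channels over a k x k window (bias terms omitted).
    return k * k * c_in * c_out

def dws_conv_params(k, c_in, c_out):
    # Depth-wise separable factorization: one k x k filter per input
    # channel (depth-wise), followed by a 1 x 1 point-wise convolution
    # to mix channels. EESP further combines such filters with varying
    # dilation rates across a spatial pyramid.
    return k * k * c_in + c_in * c_out

# Hypothetical layer: 3 x 3 kernels, 256 input and 256 output channels.
standard = conv_params(3, 256, 256)        # 589,824 parameters
factorized = dws_conv_params(3, 256, 256)  # 67,840 parameters
savings = 1 - factorized / standard        # ~88.5% fewer parameters
```

The same factorization reduces FLOPs proportionally at each layer, which is consistent with the overall parameter, FLOP, and inference-time reductions reported in the abstract.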

References

  1. Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in neural information processing systems, pp. 2366–2374 (2014)

  2. Xu, D., Wang, W., Tang, H., Liu, H., Sebe, N., Ricci, E.: Structured attention guided convolutional neural fields for monocular depth estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3917–3925 (2018)

  3. Godard, C., Aodha, O. M., Brostow, G. J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 270–279 (2017)

  4. Pilzer, A., Xu, D., Puscas, M., Ricci, E., Sebe, N.: Unsupervised adversarial depth estimation using cycled generative networks. In: 2018 International conference on 3D vision (3DV), IEEE, pp. 587–595 (2018)

  5. Mehta, S., Rastegari, M., Shapiro, L., Hajishirzi, H.: Espnetv2: a light-weight, power efficient, and general purpose convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9190–9200 (2019)

  6. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47(1–3), 7–42 (2002)

  7. Flynn, J., Neulander, I., Philbin, J., Snavely, N.: Deepstereo: learning to predict new views from the world’s imagery. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5515–5524 (2016)

  8. Saxena, A., Sun, M., Ng, A. Y.: Learning 3-d scene structure from a single still image. In: 2007 IEEE 11th International conference on computer vision, IEEE, pp. 1–8 (2007)

  9. Konrad, J., Wang, M., Ishwar, P., Wu, C., Mukherjee, D.: Learning-based, automatic 2d-to-3d image and video conversion. IEEE Trans. Image Process. 22(9), 3485–3496 (2013)

  10. Hoiem, D., Efros, A.A., Hebert, M.: Recovering surface layout from an image. Int. J. Comput. Vis. 75(1), 151–172 (2007)

  11. Chen, R., Mahmood, F., Yuille, A., Durr, N. J.: Rethinking monocular depth estimation with adversarial training. arXiv preprint. arXiv:1808.07528 (2018)

  12. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from rgbd images. In: European conference on computer vision, Springer, pp. 746–760 (2012)

  13. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: 2012 IEEE conference on computer vision and pattern recognition, IEEE, pp. 3354–3361 (2012)

  14. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223 (2016)

  15. Li, B., Shen, C., Dai, Y., Van Den Hengel, A., He, M.: Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1119–1127 (2015)

  16. Cao, Y., Wu, Z., Shen, C.: Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Trans. Circuits Syst. Video Technol. 28(11), 3174–3182 (2017)

  17. Wang, P., Shen, X., Lin, Z., Cohen, S., Price, B., Yuille, A. L.: Towards unified depth and semantic prediction from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2800–2809 (2015)

  18. Xu, D., Ouyang, W., Wang, X., Sebe, N.: Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 675–684 (2018)

  19. Chen, C., Wei, J., Peng, C., Zhang, W., Qin, H.: Improved saliency detection in rgb-d images using two-phase depth estimation and selective deep fusion. IEEE Trans. Image Process. 29, 4296–4307 (2020)

  20. Zhan, H., Garg, R., Weerasekera, C. S., Li, K., Agarwal, H., Reid, I.: Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 340–349 (2018)

  21. Wang, C., Buenaposada, J. M., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2022–2030 (2018)

  22. Garg, R., Bg, V. K., Carneiro, G., Reid, I.: Unsupervised cnn for single view depth estimation: geometry to the rescue. In: European conference on computer vision, Springer, pp. 740–756 (2016)

  23. Zhou, T., Krahenbuhl, P., Aubry, M., Huang, Q., Efros, A. A.: Learning dense correspondence via 3d-guided cycle consistency. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 117–126 (2016)

  24. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems, pp. 2672–2680 (2014)

  25. Kundu, J. N., Uppala, P. K., Pahuja, A., Babu, R. V.: Adadepth: unsupervised content congruent adaptation for depth estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2656–2665 (2018)

  26. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint. arXiv:1411.1784 (2014)

  27. Zhu, J.-Y., Park, T., Isola, P., Efros, A. A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp. 2223–2232 (2017)

  28. Kumar, A. C. S., Bhandarkar, S. M., Prasad, M.: Monocular depth prediction using generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 300–308 (2018)

  29. Almalioglu, Y., Saputra, M. R. U., de Gusmao, P. P., Markham, A., Trigoni, N.: Ganvo: unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks. In: 2019 International conference on robotics and automation (ICRA), IEEE, pp. 5474–5480 (2019)

  30. Hao, Z., Li, Y., You, S., Lu, F.: Detail preserving depth estimation from a single image using attention guided networks. In: 2018 International conference on 3D vision (3DV), IEEE, pp. 304–313 (2018)

  31. Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. In: International conference on machine learning, PMLR, pp. 7354–7363 (2019)

  32. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint. arXiv:1802.05957 (2018)

  33. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)

  34. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)

  35. Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE international conference on computer vision, pp. 1395–1403 (2015)

  36. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al.: Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint. arXiv:1603.04467 (2016)

  37. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint. arXiv:1412.6980 (2014)

  38. Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 38(10), 2024–2039 (2016)

  39. Xu, D., Ricci, E., Ouyang, W., Wang, X., Sebe, N.: Monocular depth estimation using multi-scale continuous crfs as sequential deep networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(6), 1426–1440 (2019)

  40. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1851–1858 (2017)

Acknowledgements

This work is partially supported by the National Science Foundation (NSF) under Grant No. 1910844.

Author information

Corresponding author

Correspondence to Sumanta Bhattacharyya.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bhattacharyya, S., Shen, J., Welch, S. et al. Efficient unsupervised monocular depth estimation using attention guided generative adversarial network. J Real-Time Image Proc 18, 1357–1368 (2021). https://doi.org/10.1007/s11554-021-01092-0
