Depth estimation from single monocular images using deep hybrid network

Multimedia Tools and Applications

Abstract

Depth estimation is an important task in robot vision. In this paper, we address depth estimation from a single monocular image, which is a challenging problem in automated vision systems because a single image carries no additional measurements. To this end, we design a deep hybrid neural network composed of convolutional layers and recurrent (ReNet) layers, where each ReNet layer is built from Long Short-Term Memory (LSTM) units, which are known for their ability to memorize long-range context. In the proposed network, the ReNet layers enrich the feature representation by directly capturing global context. Integrating ReNet and convolutional layers in a common CNN framework allows us to train the hybrid network end to end. Experimental evaluation on benchmark datasets demonstrates that the hybrid network achieves state-of-the-art results without any post-processing steps. Moreover, the composition of recurrent and convolutional layers provides more satisfying results.
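
As a rough illustration of how a ReNet layer can capture global context, the following Python sketch sweeps a feature map with bidirectional LSTMs, first along rows and then along columns, so that every output location depends on the whole input. It is a minimal sketch under stated assumptions, not the network described in the paper: the function names (`renet_layer`, `lstm_step`, and so on), the one-pixel patch size, the sweep order, the hidden size, and the random untrained weights are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_lstm(n_in, n_hid, rng):
    # Hypothetical initialisation: 4 * n_hid gate rows over [input, hidden, bias].
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hid + 1)]
            for _ in range(4 * n_hid)]


def lstm_step(x, h, c, weights):
    """One standard LSTM step on plain Python lists."""
    n = len(h)
    z = x + h + [1.0]  # concatenate input, previous hidden state, bias term
    pre = [sum(w * v for w, v in zip(row, z)) for row in weights]
    i = [sigmoid(p) for p in pre[:n]]            # input gate
    f = [sigmoid(p) for p in pre[n:2 * n]]       # forget gate
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]   # output gate
    g = [math.tanh(p) for p in pre[3 * n:]]      # candidate cell values
    c_new = [f[k] * c[k] + i[k] * g[k] for k in range(n)]
    h_new = [o[k] * math.tanh(c_new[k]) for k in range(n)]
    return h_new, c_new


def run_lstm(seq, weights, n_hid):
    """Run the LSTM over a sequence of vectors, returning every hidden state."""
    h, c = [0.0] * n_hid, [0.0] * n_hid
    out = []
    for x in seq:
        h, c = lstm_step(x, h, c, weights)
        out.append(h)
    return out


def bidirectional(seq, w_fwd, w_bwd, n_hid):
    """Concatenate forward and backward hidden states at each position."""
    fwd = run_lstm(seq, w_fwd, n_hid)
    bwd = run_lstm(seq[::-1], w_bwd, n_hid)[::-1]
    return [a + b for a, b in zip(fwd, bwd)]


def renet_layer(fmap, n_hid, rng):
    """fmap is an H x W x C nested list; the result is H x W x (2 * n_hid)."""
    n_in = len(fmap[0][0])
    w_h = [init_lstm(n_in, n_hid, rng) for _ in range(2)]
    w_v = [init_lstm(2 * n_hid, n_hid, rng) for _ in range(2)]
    # Horizontal sweep: each row is read left-to-right and right-to-left.
    rows = [bidirectional(row, w_h[0], w_h[1], n_hid) for row in fmap]
    # Vertical sweep over the row outputs: transpose, sweep columns, transpose back.
    cols = [bidirectional(list(col), w_v[0], w_v[1], n_hid) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


rng = random.Random(0)
fmap = [[[rng.random() for _ in range(3)] for _ in range(5)] for _ in range(4)]
out = renet_layer(fmap, n_hid=2, rng=rng)
print(len(out), len(out[0]), len(out[0][0]))  # 4 5 4
```

Because the vertical sweep reads the outputs of the horizontal sweep, the state at each location has seen every row and every column of the input, which is the sense in which such a layer captures global context in a single step rather than through many stacked convolutions.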

Acknowledgments

This work is partially funded by the MOE–Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, the Major State Basic Research Development Program of China (973 Program 2015CB351804), and the National Natural Science Foundation of China under Grant Nos. 61572155, 61672188, and 61272386. We would also like to thank NVIDIA Corporation, which kindly provided two sets of GPUs.

Author information

Corresponding author

Correspondence to Feng Jiang.

About this article

Cite this article

Grigorev, A., Jiang, F., Rho, S. et al. Depth estimation from single monocular images using deep hybrid network. Multimed Tools Appl 76, 18585–18604 (2017). https://doi.org/10.1007/s11042-016-4200-x

  • DOI: https://doi.org/10.1007/s11042-016-4200-x
