ABSTRACT
Depth estimation from monocular images is an ill-posed and inherently ambiguous problem. Recently, deep learning techniques have been applied to monocular depth estimation in search of data-driven solutions. However, most existing methods focus on minimizing the average pixel-level depth regression error and neglect to encode the global layout of the scene, resulting in layout-inconsistent depth maps. This paper proposes a novel Layout-Aware Convolutional Neural Network (LA-Net) for accurate monocular depth estimation that simultaneously perceives the scene layout and local depth details. Specifically, a Spatial Layout Network (SL-Net) is proposed to learn a layout map representing the depth ordering between local patches. A Layout-Aware Depth Estimation Network (LDE-Net) then estimates pixel-level depth details using multi-scale layout maps as structural guidance, leading to layout-consistent depth maps. A dense network module serves as the base network, learning effective visual details through dense feed-forward connections. Moreover, we formulate an order-sensitive softmax loss to better constrain the ill-posed depth inference problem. Extensive experiments on both indoor (NYUD-v2) and outdoor (Make3D) scene datasets demonstrate that the proposed LA-Net outperforms state-of-the-art methods and yields faithful 3D projections.
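The abstract does not give the form of the order-sensitive softmax loss. As a purely illustrative sketch, depth-as-classification methods commonly discretize depth into ordered bins and weight a cross-entropy-style term by each bin's ordinal distance to the ground-truth bin, so that predictions far from the true depth are penalized more. The function name, the exponential weighting, and the `alpha` parameter below are assumptions for illustration, not the paper's actual formulation:

```python
import numpy as np

def order_sensitive_softmax_loss(logits, true_bin, alpha=1.0):
    """Hypothetical sketch of an order-sensitive softmax loss.

    logits: scores over K ordered depth bins (1-D array).
    true_bin: index of the ground-truth depth bin.
    alpha: controls how fast the penalty weight decays with bin distance.
    """
    # numerically stable softmax over the ordered depth bins
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()

    # weight each bin's log-probability by its ordinal distance to the truth:
    # bins near the true depth contribute with weight close to 1,
    # distant bins with exponentially smaller weight
    bins = np.arange(len(logits))
    weights = np.exp(-alpha * np.abs(bins - true_bin))

    # weighted negative log-likelihood over all bins
    return -np.sum(weights * np.log(p + 1e-12))
```

Under this sketch, a prediction whose probability mass sits on the true bin incurs a smaller loss than one concentrated on a distant bin, which is the ordering-aware behavior the abstract describes.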
LA-Net: Layout-Aware Dense Network for Monocular Depth Estimation