research-article

Heuristic Depth Estimation with Progressive Depth Reconstruction and Confidence-Aware Loss

Authors:

Zhan Wang

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 2252 - 2261

https://doi.org/10.1145/3474085.3475386

Published: 17 October 2021

Abstract

Recently, deep learning-based depth estimation has shown promising results, especially with the help of sparse depth reference samples. Existing works focus on directly inferring depth information from sparse samples with high confidence. In this paper, we propose a Heuristic Depth Estimation Network (HDEN) with progressive depth reconstruction and a confidence-aware loss. HDEN leverages reference samples with low confidence to distill spatial geometric and local semantic information for dense depth prediction. Specifically, we first train a U-Net to generate a coarse-level dense reference map. Second, the progressive depth reconstruction module successively reconstructs the fine-level dense depth map from different scales, where a multi-level upsampling block is designed to recover the local structure of objects. Finally, the confidence-aware loss is proposed to trigger the reference samples with low confidence, which forces the model to focus on estimating the depth of tiny structures. Extensive experiments on the NYU-Depth-v2 and KITTI-Odometry datasets show the effectiveness of our method. Visualization results demonstrate that the dense depth maps generated by HDEN have better consistency with the RGB image at entity edges.
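
The abstract does not give the formula for the confidence-aware loss, so the following is only a minimal sketch of the general idea of weighting a depth error by per-pixel confidence so that low-confidence reference samples contribute more. The function name `confidence_aware_l1`, the weighting `1 + gamma * (1 - c)`, and the parameter `gamma` are assumptions made for illustration, not the loss defined in the paper.

```python
from typing import Sequence


def confidence_aware_l1(
    pred: Sequence[float],
    target: Sequence[float],
    confidence: Sequence[float],
    gamma: float = 1.0,
) -> float:
    """Weighted mean absolute depth error that emphasizes low-confidence pixels.

    Hypothetical weighting (an assumption, not the paper's formula):
        w_i = 1 + gamma * (1 - c_i), with confidence c_i in [0, 1].
    With gamma = 0 this reduces to the plain mean absolute error.
    """
    if not (len(pred) == len(target) == len(confidence)):
        raise ValueError("pred, target and confidence must have the same length")
    if len(pred) == 0:
        raise ValueError("inputs must be non-empty")

    # Lower confidence -> larger weight on that pixel's error.
    weights = [1.0 + gamma * (1.0 - c) for c in confidence]
    weighted_error = sum(
        w * abs(p - t) for w, p, t in zip(weights, pred, target)
    )
    # Normalize by the total weight so the result stays on the depth scale.
    return weighted_error / sum(weights)


# Two pixels; only the second one is predicted wrongly (error of 2.0).
pred = [1.0, 2.0]
target = [1.0, 4.0]

# The wrong pixel has low confidence, so its error is emphasized.
loss_low_conf_wrong = confidence_aware_l1(pred, target, [1.0, 0.0])
# The wrong pixel has high confidence, so its error is weighted less.
loss_high_conf_wrong = confidence_aware_l1(pred, target, [0.0, 1.0])
```

In this toy case the same prediction error yields a larger loss when it falls on the low-confidence pixel, which is the behavior the abstract attributes to the confidence-aware loss in qualitative terms.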

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
      2. Computer vision representations
        Appearance and texture representations

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, FACEBOOK, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Youth Innovation Promotion Association of Chinese Academy of Sciences under Grant
National Natural Science Foundation of China
Zhejiang Province Natural Science Foundation of China
National Key Research and Development Program of China under Grant
111 Project

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
