Research article. DOI: 10.1145/3474085.3475386

Heuristic Depth Estimation with Progressive Depth Reconstruction and Confidence-Aware Loss

Published: 17 October 2021

Abstract

Recently, deep learning-based depth estimation has shown promising results, especially with the help of sparse depth reference samples. Existing works focus on directly inferring depth information from sparse samples with high confidence. In this paper, we propose a Heuristic Depth Estimation Network (HDEN) with progressive depth reconstruction and a confidence-aware loss. HDEN leverages the reference samples with low confidence to distill spatial geometric and local semantic information for dense depth prediction. Specifically, we first train a U-Net to generate a coarse-level dense reference map. Second, the progressive depth reconstruction module successively reconstructs the fine-level dense depth map from different scales, where a multi-level upsampling block is designed to recover the local structure of objects. Finally, the confidence-aware loss is proposed to activate the reference samples with low confidence, which forces the model to focus on estimating the depth of tiny structures. Extensive experiments on the NYU-Depth-v2 and KITTI-Odometry datasets show the effectiveness of our method. Visualization results demonstrate that the dense depth maps generated by HDEN are more consistent with the RGB image at object edges.
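
The abstract does not state the form of the confidence-aware loss, so the following is only an illustrative sketch of the general idea of weighting low-confidence pixels more heavily, not the loss used in HDEN. The function name `confidence_weighted_l1`, the parameter `alpha`, and the weighting `w = 1 + alpha * (1 - c)` are all assumptions introduced here for illustration.

```python
from typing import Sequence


def confidence_weighted_l1(
    pred: Sequence[float],
    target: Sequence[float],
    confidence: Sequence[float],
    alpha: float = 1.0,
) -> float:
    """Weighted mean absolute depth error over a flattened depth map.

    Hypothetical weighting (an assumption, not taken from the paper):
    each pixel gets weight w = 1 + alpha * (1 - c), where c in [0, 1]
    is its confidence, so low-confidence pixels contribute more.
    """
    if not (len(pred) == len(target) == len(confidence)):
        raise ValueError("pred, target and confidence must have equal length")
    if len(pred) == 0:
        raise ValueError("inputs must be non-empty")

    weighted_sum = 0.0
    weight_total = 0.0
    for p, t, c in zip(pred, target, confidence):
        if not 0.0 <= c <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
        w = 1.0 + alpha * (1.0 - c)  # larger weight for lower confidence
        weighted_sum += w * abs(p - t)
        weight_total += w
    # Normalize by the total weight so the loss stays on the depth scale.
    return weighted_sum / weight_total


if __name__ == "__main__":
    pred = [1.0, 2.0]
    target = [1.0, 3.0]
    # The erroneous pixel has low confidence, so it dominates the loss.
    print(confidence_weighted_l1(pred, target, [1.0, 0.0]))
    # With alpha = 0 the function reduces to a plain mean absolute error.
    print(confidence_weighted_l1(pred, target, [1.0, 0.0], alpha=0.0))
```

Under this assumed weighting, an error on a low-confidence pixel costs more than the same error on a high-confidence pixel, which is one simple way to push a model toward hard regions such as thin structures.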

      Published In

      MM '21: Proceedings of the 29th ACM International Conference on Multimedia
      October 2021
      5796 pages
      ISBN:9781450386517
      DOI:10.1145/3474085
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. confidence-aware loss
      2. depth estimation
      3. multi-level up-sampling
      4. progressive depth reconstruction

      Qualifiers

      • Research-article

      Funding Sources

      • Youth Innovation Promotion Association of Chinese Academy of Sciences under Grant
      • National Natural Science Foundation of China
      • Zhejiang Province Natural Science Foundation of China
      • National Key Research and Development Program of China under Grant
      • 111 Project

      Conference

      MM '21
      MM '21: ACM Multimedia Conference
      October 20 - 24, 2021
      Virtual Event, China

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

      Article Metrics

      • Downloads (Last 12 months): 40
      • Downloads (Last 6 weeks): 7
      Reflects downloads up to 20 Feb 2025

      Cited By

      • (2024) Learning Domain Invariant Features for Unsupervised Indoor Depth Estimation Adaptation. ACM Transactions on Multimedia Computing, Communications, and Applications 20:9 (1-23). DOI: 10.1145/3672397. Online publication date: 13-Jun-2024
      • (2024) SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model. Proceedings of the 32nd ACM International Conference on Multimedia (3469-3478). DOI: 10.1145/3664647.3681405. Online publication date: 28-Oct-2024
      • (2024) Domain Shared and Specific Prompt Learning for Incremental Monocular Depth Estimation. Proceedings of the 32nd ACM International Conference on Multimedia (8306-8315). DOI: 10.1145/3664647.3681155. Online publication date: 28-Oct-2024
      • (2024) Fooling 3D Face Recognition with One Single 2D Image. Proceedings of the 32nd ACM International Conference on Multimedia (4043-4052). DOI: 10.1145/3664647.3680840. Online publication date: 28-Oct-2024
      • (2024) FS-Depth: Focal-and-Scale Depth Estimation From a Single Image in Unseen Indoor Scene. IEEE Transactions on Circuits and Systems for Video Technology 34:11_Part_1 (10604-10617). DOI: 10.1109/TCSVT.2024.3411688. Online publication date: 10-Jun-2024
