research-article

AggNet for Self-supervised Monocular Depth Estimation: Go An Aggressive Step Further

Authors:

Liusheng Huang,

Errui Ding

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 1526 - 1534

https://doi.org/10.1145/3474085.3475287

Published: 17 October 2021

Abstract

Without appealing to exhaustive labeled data, self-supervised monocular depth estimation (MDE) plays a fundamental role in computer vision. Previous methods usually adopt a one-stage MDE network, which is insufficient to achieve high performance. In this paper, we dig deep into this task to propose an aggressive framework termed AggNet. The framework is based on a training-only progressive two-stage module to perform pseudo counter-surveillance as well as a simple yet effective dual-warp loss function between image pairs. In particular, we first propose a residual module, which follows the MDE network to learn a refined depth. The residual module takes both the initial depth generated from MDE and the initial color image as input to generate refined depth with residual depth learning. Then, the refined depth is leveraged to supervise the initial depth simultaneously during the training period. For inference, only the MDE network is retained to regress depth from a single image, which gains better performance without introducing extra computation. In addition to self-distillation loss, a simple yet effective dual-warp consistency loss is introduced to encourage the MDE network to keep depth consistency between stereo image pairs. Extensive experiments show that our AggNet achieves state-of-the-art performance on the KITTI and Make3D datasets.
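
The training-time data flow described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the network architectures or the exact loss forms, so the residual predictor is a hypothetical toy function (`toy_residual`), the color image is reduced to a single grayscale channel, and the self-distillation loss is assumed to be a mean absolute difference. The dual-warp consistency loss is not sketched because the abstract does not specify it.

```python
from typing import Callable, List

Depth = List[List[float]]  # H x W depth map
Image = List[List[float]]  # H x W grayscale image (stand-in for the color input)


def refine_depth(initial: Depth, image: Image,
                 residual_fn: Callable[[float, float], float]) -> Depth:
    """Residual depth learning: refined depth = initial depth + predicted residual."""
    return [[d + residual_fn(d, c) for d, c in zip(depth_row, image_row)]
            for depth_row, image_row in zip(initial, image)]


def self_distillation_loss(initial: Depth, refined: Depth) -> float:
    """Assumed L1 form: the refined depth acts as a fixed target for the initial depth."""
    diffs = [abs(d - r)
             for depth_row, refined_row in zip(initial, refined)
             for d, r in zip(depth_row, refined_row)]
    return sum(diffs) / len(diffs)


def toy_residual(depth: float, color: float) -> float:
    """Hypothetical stand-in for the learned residual module."""
    return 0.5 * color


# Training-time flow: MDE output -> residual module -> refined depth -> loss.
initial = [[2.0, 4.0], [6.0, 8.0]]   # pretend output of the MDE network
image = [[0.0, 1.0], [1.0, 0.0]]     # pretend input image
refined = refine_depth(initial, image, toy_residual)
loss = self_distillation_loss(initial, refined)

# Inference-time flow: only the MDE output (`initial`) is used, so the
# residual module adds no computation at test time.
```

In a real training loop the refined depth would be excluded from gradient computation when it serves as the target, so that only the MDE network is pulled toward it; that detail is outside what this plain-Python sketch can express.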

Cited By

Liu Y, Xue F, Ming A, Zhao M, Ma H, Sebe N, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, Xu D. (2024). SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model. Proceedings of the 32nd ACM International Conference on Multimedia, 3469-3478. Online publication date: 28-Oct-2024
https://dl.acm.org/doi/10.1145/3664647.3681405
Gui J, Chen T, Zhang J, Cao Q, Sun Z, Luo H, Tao D. (2024). A Survey on Self-Supervised Learning: Algorithms, Applications, and Future Trends. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:12, 9052-9071. Online publication date: Dec-2024
https://doi.org/10.1109/TPAMI.2024.3415112
Wang C, Sun J, Liu L, Wu C, Shen Z, Wu D, Dai Y, Zhang L, El Saddik A, Mei T, Cucchiara R, Bertini M, Tobon Vallejo D, Atrey P, Hossain M. (2023). Digging into Depth Priors for Outdoor Neural Radiance Fields. Proceedings of the 31st ACM International Conference on Multimedia, 1221-1230. Online publication date: 26-Oct-2023
https://dl.acm.org/doi/10.1145/3581783.3612306

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Image and video acquisition
        3D imaging
  2. Machine learning
    1. Machine learning approaches
      1. Neural networks

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs:
Heng Tao Shen, University of Electronic Science and Technology of China, China
Yueting Zhuang, Zhejiang University, China
John R. Smith, IBM, USA

Program Chairs:
Yang Yang, University of Electronic Science and Technology of China, China
Pablo Cesar, CWI & TU Delft, The Netherlands
Florian Metze, Facebook, Inc., USA
Balakrishnan Prabhakaran, University of Texas at Dallas, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
