ABSTRACT
Self-supervised monocular depth estimation (MDE), which does not rely on exhaustively labeled data, plays a fundamental role in computer vision. Previous methods usually adopt a one-stage MDE network, which limits the performance they can reach. In this paper, we dig deeper into this task and propose an aggressive framework termed AggNet. The framework combines a training-only progressive two-stage module, which performs pseudo counter-surveillance, with a simple yet effective dual-warp loss function between image pairs. In particular, we first propose a residual module that follows the MDE network to learn a refined depth. The residual module takes both the initial depth generated by the MDE network and the initial color image as input, and generates the refined depth through residual depth learning. The refined depth is then used to supervise the initial depth simultaneously during training. For inference, only the MDE network is retained to regress depth from a single image, so the performance gain comes without extra computation. In addition to the self-distillation loss, a simple yet effective dual-warp consistency loss is introduced to encourage the MDE network to keep depth consistent between stereo image pairs. Extensive experiments show that AggNet achieves state-of-the-art performance on the KITTI and Make3D datasets.
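The training signals named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the L1 form of both losses, the disparity sign convention (a left pixel at x matches the right pixel at x - d), and the reading of "dual-warp" as warping each view's depth into the other view are all assumptions made for illustration; the abstract specifies none of them.

```python
import numpy as np


def refine_depth(initial_depth, residual):
    """Residual learning: refined depth = initial depth + predicted residual."""
    return initial_depth + residual


def self_distillation_loss(initial_depth, refined_depth):
    """Mean L1 gap between the initial depth and the refined depth as target.

    Assumption: in a real training framework the refined depth would be
    detached (stop-gradient) so only the MDE network is pulled toward it.
    """
    target = np.array(refined_depth, copy=True)  # stands in for a stop-gradient
    return float(np.mean(np.abs(initial_depth - target)))


def warp_rows(source, disparity):
    """Sample `source` at x - disparity along each row (linear interpolation)."""
    height, width = source.shape
    xs = np.arange(width, dtype=np.float64)
    warped = np.empty((height, width), dtype=np.float64)
    for row in range(height):
        # Clamp sample positions to the image border (hypothetical choice).
        sample_x = np.clip(xs - disparity[row], 0.0, width - 1.0)
        warped[row] = np.interp(sample_x, xs, source[row])
    return warped


def dual_warp_consistency_loss(depth_left, depth_right, disp_left, disp_right):
    """Assumed symmetric form: warp each view's depth into the other view."""
    right_in_left = warp_rows(depth_right, disp_left)    # right depth seen from left
    left_in_right = warp_rows(depth_left, -disp_right)   # left depth seen from right
    loss_left = np.mean(np.abs(depth_left - right_in_left))
    loss_right = np.mean(np.abs(depth_right - left_in_right))
    return float(loss_left + loss_right)
```

At inference time none of these functions would be called: consistent with the abstract, only the MDE network's initial depth is kept, so the refinement step and both losses add cost during training only.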
Index Terms
- AggNet for Self-supervised Monocular Depth Estimation: Go An Aggressive Step Furthe