DOI: 10.1145/3474085.3475287
research-article

AggNet for Self-supervised Monocular Depth Estimation: Go An Aggressive Step Further

Published: 17 October 2021

ABSTRACT

Self-supervised monocular depth estimation (MDE), which does not rely on exhaustively labeled data, plays a fundamental role in computer vision. Previous methods usually adopt a one-stage MDE network, which is insufficient to achieve high performance. In this paper, we dig deeper into this task and propose an aggressive framework termed AggNet. The framework is based on a training-only progressive two-stage module that performs pseudo counter-surveillance, together with a simple yet effective dual-warp loss function between image pairs. In particular, we first propose a residual module, which follows the MDE network to learn a refined depth. The residual module takes both the initial depth generated by the MDE network and the initial color image as input and generates the refined depth through residual depth learning. The refined depth is then used to supervise the initial depth simultaneously during training. For inference, only the MDE network is retained to regress depth from a single image, which yields better performance without introducing extra computation. In addition to the self-distillation loss, a dual-warp consistency loss is introduced to encourage the MDE network to keep depth consistent between stereo image pairs. Extensive experiments show that AggNet achieves state-of-the-art performance on the KITTI and Make3D datasets.
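
The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the additive form of the residual, the L1 form of the self-distillation loss, and the names `refine_depth` and `self_distillation_loss` are all assumptions made for illustration, and the residual values are hard-coded stand-ins for what a residual module would predict from the initial depth and the color image.

```python
from typing import List

DepthMap = List[List[float]]  # row-major H x W grid of depth values


def refine_depth(initial: DepthMap, residual: DepthMap) -> DepthMap:
    """Residual depth learning (assumed additive): refined = initial + residual."""
    return [[d + r for d, r in zip(d_row, r_row)]
            for d_row, r_row in zip(initial, residual)]


def self_distillation_loss(initial: DepthMap, refined: DepthMap) -> float:
    """Mean absolute difference between initial and refined depth.

    The refined depth plays the role of a pseudo label for the initial depth.
    The L1 form is an assumption; the abstract does not specify the loss.
    """
    total = 0.0
    count = 0
    for d_row, t_row in zip(initial, refined):
        for d, t in zip(d_row, t_row):
            total += abs(d - t)
            count += 1
    return total / count


# Toy 2 x 2 example with hypothetical values.
initial = [[10.0, 12.0], [8.0, 20.0]]   # output of the MDE network
residual = [[0.5, -1.0], [0.0, 2.0]]    # stand-in for the residual module's output
refined = refine_depth(initial, residual)
loss = self_distillation_loss(initial, refined)
```

At inference time only the MDE network would be kept, so neither function above would run; they matter only during training.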

      • Published in

        MM '21: Proceedings of the 29th ACM International Conference on Multimedia
        October 2021
        5796 pages
        ISBN: 9781450386517
        DOI: 10.1145/3474085

        Copyright © 2021 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 995 of 4,171 submissions, 24%
