
Self-distillation framework for indoor and outdoor monocular depth estimation

  • 1190: Depth-Related Processing and Applications in Visual Systems
  • Published in: Multimedia Tools and Applications

Abstract

As one of the most crucial tasks in scene perception, Monocular Depth Estimation (MDE) has developed considerably in recent years. Current MDE research focuses on the precision and speed of estimation, but pays less attention to generalization across scenes. For instance, MDE networks trained on outdoor scenes achieve impressive performance outdoors but poor performance indoors, and vice versa. To tackle this problem, we propose a self-distillation MDE framework that improves generalization across different scenes. Specifically, we design a student encoder that extracts features from two datasets, one of indoor and one of outdoor scenes. We then introduce a dissimilarity loss that pulls the encoded features of the two scene types apart in feature space. Finally, a decoder estimates the final depth from the encoded features. In this way, our self-distillation MDE framework can learn depth estimation on two different datasets. To the best of our knowledge, we are the first to tackle the generalization problem across datasets of different scenes in the MDE field. Experiments demonstrate that our method reduces the degradation that occurs when an MDE network faces datasets with complex data distributions. Note that evaluating on two datasets with a single network is more challenging than evaluating on two datasets with two separate networks.
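To make the pipeline described above concrete, the following is a minimal PyTorch sketch of a shared encoder, a depth decoder, and a dissimilarity term between indoor and outdoor features. It is an illustration under stated assumptions, not the authors' released implementation: the encoder/decoder modules, the L1 depth loss, the `lam` weighting, and the use of a cosine-similarity penalty as the dissimilarity loss are all hypothetical choices, since the abstract does not specify the exact loss form, and the self-distillation targets themselves are omitted because the abstract does not detail them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistillMDE(nn.Module):
    """Sketch of the framework: a single shared (student) encoder serves both
    indoor and outdoor images, and one decoder regresses depth from features."""

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # any backbone, e.g. a ResNet feature extractor
        self.decoder = decoder  # upsampling head producing a dense depth map

    def forward(self, x):
        feats = self.encoder(x)
        depth = self.decoder(feats)
        return feats, depth

def dissimilarity_loss(feat_a, feat_b):
    """Hypothetical dissimilarity term: penalize cosine similarity between
    pooled indoor and outdoor embeddings so the two scene types are pulled
    apart in feature space. Assumes the two batches have equal size."""
    z_a = F.normalize(feat_a.flatten(1), dim=1)
    z_b = F.normalize(feat_b.flatten(1), dim=1)
    return (z_a * z_b).sum(dim=1).mean()  # in [-1, 1]; lower = more dissimilar

def training_step(model, x_in, d_in, x_out, d_out, lam=0.1):
    """One joint step on an indoor batch (x_in, d_in) and an outdoor batch
    (x_out, d_out); lam weights the dissimilarity term (assumed value)."""
    f_in, pred_in = model(x_in)
    f_out, pred_out = model(x_out)
    depth_loss = F.l1_loss(pred_in, d_in) + F.l1_loss(pred_out, d_out)
    return depth_loss + lam * dissimilarity_loss(f_in, f_out)
```

Because the encoder is shared across both batches in each step, minimizing the combined loss trains a single network on both datasets while the dissimilarity term keeps the indoor and outdoor feature clusters separated.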


Notes

  1. Code will be released once the paper is accepted.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 62071500, 61701313) and the Sino-German Mobility Programme M-0421.

Author information

Corresponding author

Correspondence to Zhi Jin.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Pan, M., Zhang, H., Wu, J. et al. Self-distillation framework for indoor and outdoor monocular depth estimation. Multimed Tools Appl 81, 35899–35913 (2022). https://doi.org/10.1007/s11042-021-11500-z
