Abstract
As one of the most crucial tasks of scene perception, Monocular Depth Estimation (MDE) has developed considerably in recent years. Current MDE research focuses on the precision and speed of estimation but pays less attention to generalization across scenes. For instance, an MDE network trained on outdoor scenes achieves impressive performance outdoors but poor performance on indoor scenes, and vice versa. To tackle this problem, in this paper we propose a self-distillation MDE framework that improves generalization across different scenes. Specifically, we design a student encoder that extracts features from two datasets of indoor and outdoor scenes, respectively. We then introduce a dissimilarity loss to pull apart the encoded features of different scenes in the feature space. Finally, a decoder estimates the final depth from the encoded features. In this way, our self-distillation MDE framework can learn depth estimation on two different datasets. To the best of our knowledge, we are the first to tackle the generalization problem across datasets of different scenes in the MDE field. Experiments demonstrate that our method alleviates the degradation that occurs when an MDE network faces datasets with complex data distributions. Note that evaluating two datasets with a single network is more challenging than evaluating them with two separate networks.
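To make the training idea concrete, the following is a minimal PyTorch sketch of the setup the abstract describes: a shared student encoder processes an indoor batch and an outdoor batch, a dissimilarity loss pushes their encoded features apart, and a decoder regresses depth for both. All module definitions, tensor shapes, the cosine-based dissimilarity term, and the loss weight `lam` are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StudentEncoder(nn.Module):
    """Toy convolutional encoder shared by indoor and outdoor images."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)  # (B, feat_dim, H/4, W/4)


class DepthDecoder(nn.Module):
    """Toy decoder that maps encoded features back to a dense depth map."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Softplus(),  # positive depth
        )

    def forward(self, f):
        return self.net(f)


def dissimilarity_loss(feat_a, feat_b):
    """Push apart pooled indoor and outdoor features (assumed formulation).

    Uses 1 + cosine similarity of the batch-mean feature vectors, which is
    minimised when the two scene embeddings point in opposite directions.
    """
    va = feat_a.mean(dim=(0, 2, 3))
    vb = feat_b.mean(dim=(0, 2, 3))
    return 1.0 + F.cosine_similarity(va, vb, dim=0)


def training_step(encoder, decoder, indoor_rgb, indoor_gt, outdoor_rgb, outdoor_gt, lam=0.1):
    """One joint step over an indoor batch and an outdoor batch."""
    f_in, f_out = encoder(indoor_rgb), encoder(outdoor_rgb)
    d_in, d_out = decoder(f_in), decoder(f_out)
    depth_loss = F.l1_loss(d_in, indoor_gt) + F.l1_loss(d_out, outdoor_gt)
    return depth_loss + lam * dissimilarity_loss(f_in, f_out)


if __name__ == "__main__":
    enc, dec = StudentEncoder(), DepthDecoder()
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)
    # Random tensors stand in for indoor (e.g. NYU-Depth-v2) and outdoor (e.g. KITTI) batches.
    x_in, y_in = torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64)
    x_out, y_out = torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64)
    loss = training_step(enc, dec, x_in, y_in, x_out, y_out)
    loss.backward()
    opt.step()
    print(float(loss))
```

In this sketch, pulling the scene embeddings apart while sharing a single encoder and decoder is what lets one network serve both indoor and outdoor data; the depth supervision term and the loss weighting are placeholders standing in for whatever the paper actually uses.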
Notes
Code will be released once the paper is accepted.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 62071500, 61701313) and the Sino-German Mobility Programme M-0421.