Abstract
Monocular depth estimation (MDE) provides information about overall scene layout from a single image and is useful in robotics for autonomous navigation and vision-aided guidance. Advances in deep learning, particularly self-supervised convolutional neural networks (CNNs), have produced MDE models capable of providing highly accurate per-pixel depth maps. However, these models are typically tuned to specific datasets, leading to sharp performance degradation in real-world scenarios, particularly in robot vision tasks, where natural environments are too varied and complex to be sufficiently described by standard datasets. Motivated by biological vision, whose immense success relies on the optimal combination of multiple depth cues and knowledge of the underlying environment, we exploit structure from motion (SfM) through optical flow as an additional depth cue, together with prior knowledge about the depth distribution of the environment, to improve monocular depth prediction. However, the outputs of these cues are fundamentally incompatible: SfM measures absolute distances, whereas MDE is scale ambiguous and returns only depth ratios. We therefore show how the MDE cue can be promoted from an ordinal scale to the same metric scale as SfM, enabling their integration in a Bayesian optimal manner. Additionally, we generalize the relationship between camera tilt angles and the resulting MDE distortions, and show how this relationship can be used to further improve the robustness and accuracy of depth perception (by up to 6.2%) for a mobile robot whose heading is subject to arbitrary angular inclinations.
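The core idea of the abstract can be summarized in two steps: first promote the scale-ambiguous MDE output to metric scale using the sparse SfM depths, then fuse the two cues per pixel with inverse-variance (Bayesian optimal, under independent Gaussian noise) weighting. The following is a minimal illustrative sketch of that scheme, not the authors' implementation; the function names, noise models, and variance values are assumptions made for the example.

```python
import numpy as np

def promote_mde_to_metric(mde_rel, sfm_metric, mask):
    """Least-squares scale factor aligning relative MDE depths to sparse
    metric SfM depths (illustrative; the paper's alignment may differ)."""
    d_mde = mde_rel[mask]
    d_sfm = sfm_metric[mask]
    scale = np.dot(d_mde, d_sfm) / np.dot(d_mde, d_mde)
    return scale * mde_rel

def bayesian_fuse(d_mde, var_mde, d_sfm, var_sfm):
    """Per-pixel inverse-variance fusion of two depth cues (Bayesian optimal
    for independent Gaussian cue noise)."""
    w_mde = 1.0 / var_mde
    w_sfm = 1.0 / var_sfm
    fused = (w_mde * d_mde + w_sfm * d_sfm) / (w_mde + w_sfm)
    fused_var = 1.0 / (w_mde + w_sfm)
    return fused, fused_var

# Toy usage with synthetic depths (hypothetical values).
rng = np.random.default_rng(0)
h, w = 4, 5
true_depth = rng.uniform(1.0, 10.0, (h, w))
mde_rel = true_depth / true_depth.max()              # scale-ambiguous MDE output
sfm = true_depth + rng.normal(0.0, 0.3, (h, w))      # noisy metric SfM depths
mask = rng.random((h, w)) > 0.5                      # pixels where SfM is available

mde_metric = promote_mde_to_metric(mde_rel, sfm, mask)
fused, fused_var = bayesian_fuse(mde_metric, 0.2**2, sfm, 0.3**2)
```

In this formulation the fused estimate always has lower variance than either cue alone, which is the motivation for integrating SfM with the CNN-based prediction rather than choosing between them.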
Funding
We have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or personal interest in any business or product.
Author information
Contributions
The contributions of the authors to this manuscript are as follows: FM conceptualized the project framework. AM helped concretize and refine the initial ideas. Both authors jointly built the robotic platform and fitted the hardware components. The algorithms were developed and programmed jointly. Both authors carried out the experiments and wrote approximately equal portions of the text. The graphics are also the work of both authors.
Ethics declarations
Conflict of interest
We declare that there is no conflict of interest associated with this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Mumuni, F., Mumuni, A. Bayesian cue integration of structure from motion and CNN-based monocular depth estimation for autonomous robot navigation. Int J Intell Robot Appl 6, 191–206 (2022). https://doi.org/10.1007/s41315-022-00226-2