Abstract
This study introduces a deep learning approach to monocular depth estimation that delivers strong accuracy while using significantly less computation and memory. The proposed method, IBG-Mono, pairs an Intercept Block with cost-effective GhostNet components to extract relevant information from the input image efficiently. The Intercept Block retains low-resolution feature maps derived from the input image and integrates them with the downsampled feature maps at each resolution, while the GhostNet components allow these coarse-grained feature maps to be processed cheaply and fused seamlessly with the downsampled ones. In addition, progressive downsampling keeps the intercepted feature maps spatially aligned with the downsampled feature maps across resolutions. Extensive experiments on the NYU Depth V2 and KITTI datasets compare IBG-Mono, at only 0.63M parameters and 0.31 GMACs, against state-of-the-art lightweight monocular depth estimation methods, and the results demonstrate its superiority over existing lightweight approaches.
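To make the two building blocks concrete, the sketch below implements a Ghost module following the published GhostNet design (a cheap depthwise convolution generates "ghost" feature maps from a smaller set of intrinsic maps) together with a hypothetical `InterceptBlock` illustrating the fusion the abstract describes: the input image is downsampled to each stage's resolution and merged with that stage's feature maps. The Ghost module matches Han et al.'s formulation; `InterceptBlock`, its `lift`/`fuse` submodules, and all channel choices are illustrative assumptions inferred from the abstract, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostModule(nn.Module):
    """Ghost module (Han et al., CVPR 2020): a cheap depthwise convolution
    generates extra "ghost" feature maps from a smaller set of intrinsic
    maps produced by an ordinary pointwise convolution."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3):
        super().__init__()
        init_ch = out_ch // ratio          # intrinsic maps (costly conv)
        ghost_ch = out_ch - init_ch        # ghost maps (cheap depthwise conv)
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, ghost_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        intrinsic = self.primary(x)
        return torch.cat([intrinsic, self.cheap(intrinsic)], dim=1)

class InterceptBlock(nn.Module):
    """Hypothetical fusion stage: a low-resolution copy of the input image
    is intercepted at an encoder stage, lifted to the feature width by a
    cheap GhostModule, and fused with the spatially aligned features."""
    def __init__(self, feat_ch, img_ch=3):
        super().__init__()
        self.lift = GhostModule(img_ch, feat_ch)
        self.fuse = GhostModule(2 * feat_ch, feat_ch)

    def forward(self, feat, image):
        # Downsample the input image to the stage resolution so both
        # streams stay spatially aligned before concatenation.
        img_lr = F.interpolate(image, size=feat.shape[-2:],
                               mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([feat, self.lift(img_lr)], dim=1))
```

For example, with an NYU Depth V2-sized input, `InterceptBlock(32)(torch.randn(1, 32, 60, 80), torch.randn(1, 3, 480, 640))` returns a fused 32-channel map at 1/8 resolution.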
Data availability
No datasets were generated or analysed during the current study.
Author information
Contributions
Igi Ardiyanto developed and implemented the algorithms for the work. Resha Dwika Hefni Al-Fahsi implemented the algorithms on the mobile phone. Both authors drafted the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ardiyanto, I., Al-Fahsi, R.D.H. Lightweight monocular depth estimation network for robotics using intercept block GhostNet. SIViP 19, 34 (2025). https://doi.org/10.1007/s11760-024-03720-1