Abstract
Monocular depth estimation (MDE) is critical to enabling intelligent autonomous systems and has received considerable attention in recent years. Achieving both low latency and high accuracy in MDE is desirable but difficult, especially on edge devices. In this paper, we present a novel approach to balancing speed and accuracy in MDE on edge devices. We introduce FasterMDE, an efficient and fast encoder-decoder network architecture that leverages a multiobjective neural architecture search method to find the optimal encoder structure for the target edge device. Moreover, we incorporate a neural window fully connected CRF module into the network as the decoder, enhancing fine-grained depth prediction based on coarse depth and image features. To address the issue of bad local minima in the multiobjective neural architecture search, we propose a new approach for automatically learning the weights of the subobjective loss functions based on uncertainty. We also accelerate the FasterMDE model using TensorRT and implement it on a target edge device. The experimental results demonstrate that FasterMDE achieves a better balance of speed and accuracy on the KITTI and NYUv2 datasets than previous methods. We validate the effectiveness of the proposed method through an ablation study and verify the real-time monocular depth estimation performance of FasterMDE in realistic scenarios. On the KITTI dataset, FasterMDE achieves a high frame rate of 555.55 FPS with 9.1% Abs Rel on a single NVIDIA Titan RTX GPU and 14.46 FPS on the NVIDIA Jetson Xavier NX.
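The abstract mentions automatically learning subobjective loss weights from uncertainty, but does not spell out the formulation. As a hedged illustration only, the sketch below assumes the standard homoscedastic-uncertainty weighting (a learnable log-variance `s_i` per objective, weight `exp(-s_i)`, plus `s_i` as a regularizer so no weight collapses to zero); the paper's actual formulation may differ, and the function name `combined_loss` is our own.

```python
import math

def combined_loss(sub_losses, log_vars):
    """Combine subobjective losses with uncertainty-based weights.

    sub_losses: list of scalar loss values, one per subobjective
                (e.g. an accuracy term and a latency term in a
                multiobjective architecture search).
    log_vars:   list of learnable log-variances s_i, same length.

    Each loss is weighted by exp(-s_i); adding s_i back penalizes
    inflating the variance, so the weights stay balanced instead of
    all objectives being ignored.
    """
    assert len(sub_losses) == len(log_vars)
    total = 0.0
    for loss, s in zip(sub_losses, log_vars):
        total += math.exp(-s) * loss + s
    return total
```

In a real search, the `log_vars` would be trainable parameters updated by gradient descent alongside the architecture parameters, which is what lets the weighting adapt automatically rather than being hand-tuned.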
Data Availability
The datasets generated during the current study are available in the GitHub repository: https://github.com/douziwenhit/FasterMDE.git
Author information
Authors and Affiliations
Contributions
DouZiWen: conceptualization, methodology, validation, investigation, writing. YeDong: supervision. LiYuQi: data curation.
Corresponding author
Ethics declarations
Conflict of Interest
No potential conflict of interest was reported by the authors.
Competing Interests
The authors did not receive support from any organization for the submitted work.
Ethical and informed consent for data used
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
ZiWen, D., YuQi, L. & Dong, Y. FasterMDE: A real-time monocular depth estimation search method that balances accuracy and speed on the edge. Appl Intell 53, 24566–24586 (2023). https://doi.org/10.1007/s10489-023-04872-2