Abstract
IoT and edge devices capable of capturing data from their surroundings are becoming increasingly popular. However, onboard analysis of the acquired data is usually limited by their computational capabilities. Consequently, the most recent and accurate deep learning technologies, such as Vision Transformers (ViT) and their hybrid (hViT) variants, are typically too cumbersome for onboard inference. This work therefore analyzes the impact of efficient ViT methodologies on the monocular depth estimation (MDE) task, which computes a depth map from a single RGB image and is a critical capability for autonomous and robotic systems that must perceive their surrounding environment. Specifically, this work leverages recent solutions designed to reduce the computational cost of self-attention, the fundamental building block of ViTs, applying them to METER, a lightweight architecture designed for the MDE task. The proposed efficient variants, named Meta-METER and Pyra-METER, achieve average speed boosts of 41.4% and 34.4%, respectively, over the original model across a variety of edge devices, while limiting the degradation of estimation accuracy when tested on the indoor NYU dataset.
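The variant names point to the referenced MetaFormer (Yu et al.) and Pyramid Vision Transformer (Wang et al.) works. As an illustration of the kind of efficient attention substitution the abstract describes, below is a minimal PyTorch sketch of both ideas: a pooling-based token mixer that removes self-attention entirely, and spatial-reduction attention that downsamples keys and values before the attention product. Class names, hyperparameters, and the mapping to Meta-METER/Pyra-METER internals are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """MetaFormer-style token mixer (Yu et al.): replaces self-attention
    with average pooling, so token mixing is linear in the number of
    tokens. Sketch only; the sizes used in Meta-METER may differ."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); subtracting the input keeps only the mixed
        # residual, as in the MetaFormer/PoolFormer formulation.
        return self.pool(x) - x

class SpatialReductionAttention(nn.Module):
    """PVT-style attention (Wang et al.): keys and values are spatially
    downsampled by `sr_ratio`, shrinking the attention matrix from
    N x N to N x (N / sr_ratio^2). Hyperparameters are assumptions."""
    def __init__(self, dim: int, num_heads: int = 4, sr_ratio: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # Strided convolution performs the spatial reduction of K/V tokens.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads,
                              c // self.num_heads).transpose(1, 2)
        # Downsample the token grid before computing keys/values.
        x_ = x.transpose(1, 2).reshape(b, c, h, w)
        x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads,
                                 c // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N/r^2)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```

Both substitutes cut the quadratic token-mixing cost of standard self-attention: pooling mixes tokens in linear time, while spatial reduction shrinks the attention matrix by the squared reduction ratio, which is consistent with the reported speed-ups on edge hardware.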
References
Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Dong, X., et al.: Towards real-time monocular depth estimation for robotics: a survey. IEEE Trans. Intell. Transport. Syst. 23(10), 16940–16961 (2022)
Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. 27 (2014)
Han, K., et al.: A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 45(1), 87–110 (2022)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Koohpayegani, S.A., Pirsiavash, H.: SimA: simple softmax-free attention for vision transformers. arXiv preprint arXiv:2206.08898 (2022)
Li, Z., et al.: BinsFormer: revisiting adaptive bins for monocular depth estimation. arXiv preprint arXiv:2204.00987 (2022)
Lu, J., et al.: SOFT: softmax-free transformer with linear complexity. Adv. Neural Inf. Process. Syst. 34, 21297–21309 (2021)
Makarov, I., Borisenko, G.: Depth inpainting via vision transformer. In: 2021 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pp. 286–291. IEEE (2021)
Mehta, S., Rastegari, M.: MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178 (2021)
Papa, L., Russo, P., Amerini, I.: METER: a mobile vision transformer architecture for monocular depth estimation. IEEE Trans. Circuits Syst. Video Technol. (2023)
Papa, L., et al.: Lightweight and energy-aware monocular depth estimation models for IoT embedded devices: challenges and performances in terrestrial and underwater scenarios. Sensors 23(4), 2223 (2023)
Papa, L., et al.: SPEED: separable pyramidal pooling encoder-decoder for real-time monocular depth estimation on low-resource settings. IEEE Access 10, 44881–44890 (2022)
Poggi, M., et al.: Towards real-time unsupervised monocular depth estimation on CPU. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5848–5854. IEEE (2018)
Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188 (2021)
Sandler, M., et al.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Computer Vision – ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
Wofk, D., et al.: FastDepth: fast monocular depth estimation on embedded systems. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 6101–6108. IEEE (2019)
Wu, H., et al.: Flowformer: linearizing transformers with conservation flows. arXiv preprint arXiv:2202.06258 (2022)
Yu, W., et al.: MetaFormer is actually what you need for vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10819–10829 (2022)
Yucel, M.K., et al.: Real-time monocular depth estimation with sparse supervision on mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2428–2437 (2021)
Zhao, C.Q., Sun, Q.Y., Zhang, C.Z., Tang, Y., Qian, F.: Monocular depth estimation based on deep learning: an overview. Sci. China Technol. Sci. 63(9), 1612–1627 (2020)
Acknowledgments
This study has been partially supported by SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU, Sapienza University of Rome project 2022–2024 “EV2” (003_009_22), and project 2022–2023 “RobFastMDE”.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Schiavella, C., Cirillo, L., Papa, L., Russo, P., Amerini, I. (2024). Optimize Vision Transformer Architecture via Efficient Attention Modules: A Study on the Monocular Depth Estimation Task. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing - ICIAP 2023 Workshops. ICIAP 2023. Lecture Notes in Computer Science, vol 14365. Springer, Cham. https://doi.org/10.1007/978-3-031-51023-6_32
DOI: https://doi.org/10.1007/978-3-031-51023-6_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51022-9
Online ISBN: 978-3-031-51023-6