Abstract
IoT and edge devices capable of capturing data from their surroundings are becoming increasingly popular. However, onboard analysis of the acquired data is usually limited by their computational capabilities. Consequently, the most recent and accurate deep learning technologies, such as Vision Transformers (ViT) and their hybrid (hViT) variants, are typically too cumbersome for onboard inference. This work therefore analyzes the impact of efficient ViT methodologies on the monocular depth estimation (MDE) task, which computes a depth map from a single RGB image and is a critical capability for autonomous and robotic systems that must perceive their surrounding environment. Specifically, this work leverages recent solutions designed to reduce the computational cost of self-attention, the fundamental building block of ViTs, applying them to METER, a lightweight architecture designed for the MDE task. The proposed efficient variants, named Meta-METER and Pyra-METER, achieve average speed boosts of 41.4% and 34.4%, respectively, over the original model across a variety of edge devices, while limiting the degradation of estimation accuracy when tested on the indoor NYU dataset.
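The variant names point to the referenced MetaFormer (Yu et al.) and Pyramid Vision Transformer (Wang et al.) works. As an illustration of the kind of efficient attention substitution the abstract describes, below is a minimal PyTorch sketch of both ideas: a pooling-based token mixer that removes self-attention entirely, and spatial-reduction attention that downsamples keys and values before the attention product. Class names, hyperparameters, and the mapping to Meta-METER/Pyra-METER internals are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PoolingTokenMixer(nn.Module):
    """MetaFormer-style token mixer (Yu et al.): replaces self-attention
    with average pooling, so token mixing is linear in the number of
    tokens. Sketch only; the sizes used in Meta-METER may differ."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); subtracting the input keeps only the mixed
        # residual, as in the MetaFormer/PoolFormer formulation.
        return self.pool(x) - x

class SpatialReductionAttention(nn.Module):
    """PVT-style attention (Wang et al.): keys and values are spatially
    downsampled by `sr_ratio`, shrinking the attention matrix from
    N x N to N x (N / sr_ratio^2). Hyperparameters are assumptions."""
    def __init__(self, dim: int, num_heads: int = 4, sr_ratio: int = 2):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # Strided convolution performs the spatial reduction of K/V tokens.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w
        b, n, c = x.shape
        q = self.q(x).reshape(b, n, self.num_heads,
                              c // self.num_heads).transpose(1, 2)
        # Downsample the token grid before computing keys/values.
        x_ = x.transpose(1, 2).reshape(b, c, h, w)
        x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(b, -1, 2, self.num_heads,
                                 c // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N/r^2)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```

Both substitutes cut the quadratic token-mixing cost of standard self-attention: pooling mixes tokens in linear time, while spatial reduction shrinks the attention matrix by the squared reduction ratio, which is consistent with the reported speed-ups on edge hardware.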
References
Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Dong, X., et al.: Towards real-time monocular depth estimation for robotics: a survey. IEEE Trans. Intell. Transport. Syst. 23(10), 16940–16961 (2022)
Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. 27 (2014)
Han, K., et al.: A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 45(1), 87–110 (2022)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Koohpayegani, S.A., Pirsiavash, H.: SimA: simple softmax-free attention for vision transformers. arXiv preprint arXiv:2206.08898 (2022)
Li, Z., et al.: BinsFormer: revisiting adaptive bins for monocular depth estimation. arXiv preprint arXiv:2204.00987 (2022)
Lu, J., et al.: SOFT: softmax-free transformer with linear complexity. Adv. Neural Inf. Process. Syst. 34, 21297–21309 (2021)
Makarov, I., Borisenko, G.: Depth inpainting via vision transformer. In: 2021 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct), pp. 286–291. IEEE (2021)
Mehta, S., Rastegari, M.: MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178 (2021)
Papa, L., Russo, P., Amerini, I.: METER: a mobile vision transformer architecture for monocular depth estimation. IEEE Trans. Circuits Syst. Video Technol. (2023)
Papa, L., et al.: Lightweight and energy-aware monocular depth estimation models for IoT embedded devices: challenges and performances in terrestrial and underwater scenarios. Sensors 23(4), 2223 (2023)
Papa, L., et al.: SPEED: separable pyramidal pooling encoder-decoder for real-time monocular depth estimation on low-resource settings. IEEE Access 10, 44881–44890 (2022)
Poggi, M., et al.: Towards real-time unsupervised monocular depth estimation on CPU. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5848–5854. IEEE (2018)
Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12179–12188 (2021)
Sandler, M., et al.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Computer Vision – ECCV 2012. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4_54
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
Wofk, D., et al.: FastDepth: fast monocular depth estimation on embedded systems. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 6101–6108. IEEE (2019)
Wu, H., et al.: Flowformer: linearizing transformers with conservation flows. arXiv preprint arXiv:2202.06258 (2022)
Yu, W., et al.: MetaFormer is actually what you need for vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10819–10829 (2022)
Yucel, M.K., et al.: Real-time monocular depth estimation with sparse supervision on mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2428–2437 (2021)
Zhao, C.Q., Sun, Q.Y., Zhang, C.Z., Tang, Y., Qian, F.: Monocular depth estimation based on deep learning: an overview. Sci. China Technol. Sci. 63(9), 1612–1627 (2020)
Acknowledgments
This study has been partially supported by SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU, Sapienza University of Rome project 2022–2024 “EV2” (003_009_22), and project 2022–2023 “RobFastMDE”.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Schiavella, C., Cirillo, L., Papa, L., Russo, P., Amerini, I. (2024). Optimize Vision Transformer Architecture via Efficient Attention Modules: A Study on the Monocular Depth Estimation Task. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing - ICIAP 2023 Workshops. ICIAP 2023. Lecture Notes in Computer Science, vol 14365. Springer, Cham. https://doi.org/10.1007/978-3-031-51023-6_32
DOI: https://doi.org/10.1007/978-3-031-51023-6_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51022-9
Online ISBN: 978-3-031-51023-6