ABSTRACT
The success of ChatGPT is reshaping the landscape of the entire IT industry. The large language model (LLM) powering ChatGPT is developing rapidly, marked by enhanced capabilities, improved accuracy, and reduced latency. Because of the heavy execution overhead of LLMs, prevailing commercial LLM products typically process user queries on remote servers. However, the escalating volume of user queries and the growing complexity of LLMs have turned these servers into bottlenecks, compromising quality of service (QoS). A potential solution to this challenge is to shift LLM inference to edge devices, a strategy currently being explored by industry leaders such as Apple, Google, Qualcomm, and Samsung. Beyond alleviating the computational strain on servers and improving system scalability, deploying LLMs at the edge offers additional advantages, including real-time responses even in the absence of network connectivity and stronger privacy protection for customized or personal LLMs.
This article delves into the challenges and potential bottlenecks that currently hinder the effective deployment of LLMs on edge devices. By deploying the LLaMA-2 7B model with INT4 quantization on diverse edge devices and systematically analyzing the experimental results, we identify insufficient memory and/or computing resources on traditional edge devices as the primary obstacles. Based on our observations and empirical analysis, we further provide insights and design guidance for the next generation of edge devices and systems, from both hardware and software directions.
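As a rough illustration of why INT4 quantization is what makes a 7B-parameter model feasible on edge devices at all, the sketch below implements a generic per-group symmetric 4-bit weight quantizer in NumPy. This is a simplified illustration, not the exact Q4 format used by llama.cpp nor the activation-aware scheme of AWQ; the group size and rounding details here are assumptions chosen for clarity.

```python
import numpy as np

def quantize_int4(weights, group_size=32):
    """Symmetric per-group INT4 quantization: each group of weights is
    scaled so its values map onto the integer range [-7, 7]."""
    # Length must be divisible by group_size for this simple sketch.
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate FP32 weights from INT4 codes and per-group scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
err = np.abs(w - w_hat).max()  # bounded by half the largest group scale

# The footprint argument in back-of-the-envelope terms: a 7B-parameter
# model needs ~14 GB of weights in FP16 (2 bytes/param) but only
# ~3.5 GB at 4 bits/param (plus small per-group scales) -- often the
# difference between fitting in edge-device DRAM or not.
fp16_gb = 7e9 * 2 / 1e9   # ~14 GB
int4_gb = 7e9 * 0.5 / 1e9  # ~3.5 GB
```

In practice, llama.cpp stores two 4-bit codes per byte and fuses dequantization into its matrix-multiplication kernels, so the accuracy/footprint trade-off sketched here is paid once at model-conversion time rather than per inference.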
An Empirical Analysis and Resource Footprint Study of Deploying Large Language Models on Edge Devices