DOI: 10.1145/3603287.3651205
Research article
Open access

An Empirical Analysis and Resource Footprint Study of Deploying Large Language Models on Edge Devices

Published: 27 April 2024

Abstract

The success of ChatGPT is reshaping the landscape of the entire IT industry. The large language models (LLMs) powering ChatGPT are developing rapidly, marked by enhanced features, improved accuracy, and reduced latency. Due to the execution overhead of LLMs, prevailing commercial LLM products typically handle user queries on remote servers. However, the escalating volume of user queries and the growing complexity of LLMs have turned servers into bottlenecks, compromising quality of service (QoS). To address this challenge, a potential solution is to shift LLM inference to edge devices, a strategy currently being explored by industry leaders such as Apple, Google, Qualcomm, and Samsung. Beyond alleviating the computational strain on servers and improving system scalability, deploying LLMs at the edge offers additional advantages: real-time responses even without network connectivity, and stronger privacy protection for customized or personal LLMs.
This article examines the challenges and potential bottlenecks currently hindering the effective deployment of LLMs on edge devices. By deploying the LLaMA-2 7B model with INT4 quantization on diverse edge devices and systematically analyzing the experimental results, we identify insufficient memory and/or computing resources on traditional edge devices as the primary obstacles. Based on our observations and empirical analysis, we further provide insights and design guidance for the next generation of edge devices and systems, from both hardware and software perspectives.
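The memory obstacle the abstract identifies can be made concrete with a back-of-envelope calculation. The sketch below estimates the weight-storage footprint of a 7B-parameter model (the LLaMA-2 7B size used in the paper) at several precisions; the byte-per-parameter figures are nominal storage costs and deliberately ignore quantization metadata (scales, zero points) and runtime buffers such as the KV cache, so real footprints are somewhat larger.

```python
# Illustrative estimate of LLM weight-memory footprint at different
# precisions, showing why INT4 quantization matters on memory-limited
# edge devices. Parameter count follows the LLaMA-2 7B model; byte
# costs are nominal and exclude quantization metadata and KV cache.

PARAMS_7B = 7_000_000_000

BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

def weight_footprint_gib(num_params: int, precision: str) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30

for prec in BYTES_PER_PARAM:
    print(f"{prec}: {weight_footprint_gib(PARAMS_7B, prec):.1f} GiB")
# FP32: 26.1 GiB
# FP16: 13.0 GiB
# INT8: 6.5 GiB
# INT4: 3.3 GiB
```

Under these assumptions, a device with 4 GiB of RAM (Raspberry Pi class) can hold the INT4 weights but not the FP16 ones, which is consistent with the paper's finding that memory capacity, not just compute, is the primary obstacle on traditional edge devices.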



    Published In

    ACMSE '24: Proceedings of the 2024 ACM Southeast Conference
    April 2024
    337 pages
    ISBN:9798400702372
    DOI:10.1145/3603287
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Edge Computing
    2. Edge Devices
    3. LLaMA-2
    4. Large Language Models (LLMs)

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ACMSE '24: 2024 ACM Southeast Conference
    April 18-20, 2024
    Marietta, GA, USA

    Acceptance Rates

    ACMSE '24 paper acceptance rate: 44 of 137 submissions (32%)
    Overall acceptance rate: 502 of 1,023 submissions (49%)

    Article Metrics

    • Downloads (Last 12 months)1,282
    • Downloads (Last 6 weeks)157
    Reflects downloads up to 05 Mar 2025

    Cited By
    • (2025) Leveraging Large Language Models for Integrated Satellite-Aerial-Terrestrial Networks: Recent Advances and Future Directions. IEEE Open Journal of the Communications Society 6, 399-432. https://doi.org/10.1109/OJCOMS.2024.3522103
    • (2025) Federated and Edge Learning for Large Language Models. Information Fusion 117, 102840. https://doi.org/10.1016/j.inffus.2024.102840
    • (2025) Large Language Models (LLMs) for Smart Manufacturing and Industry X.0. In Artificial Intelligence for Smart Manufacturing and Industry X.0, 97-119. https://doi.org/10.1007/978-3-031-80154-9_5
    • (2024) LLM-Based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness. IEEE Open Journal of the Communications Society 5, 5799-5856. https://doi.org/10.1109/OJCOMS.2024.3456549
    • (2024) Activation Sparsity Opportunities for Compressing General Large Language Models. In 2024 IEEE International Performance, Computing, and Communications Conference (IPCCC), 1-9. https://doi.org/10.1109/IPCCC59868.2024.10850382
    • (2024) Characterizing and Understanding the Performance of Small Language Models on Edge Devices. In 2024 IEEE International Performance, Computing, and Communications Conference (IPCCC), 1-10. https://doi.org/10.1109/IPCCC59868.2024.10850044
    • (2024) LLMEdge: A Novel Framework for Localized LLM Inferencing at Resource Constrained Edge. In 2024 International Conference on IoT Based Control Networks and Intelligent Systems (ICICNIS), 1-8. https://doi.org/10.1109/ICICNIS64247.2024.10823332
    • (2024) Performance of LLMs on Computing Systems for Deployment in IoT Devices. In Advances on Broad-Band Wireless Computing, Communication and Applications, 252-262. https://doi.org/10.1007/978-3-031-76452-3_24
    • (2024) Leveraging Context-Aware Emotion and Fatigue Recognition Through Large Language Models for Enhanced Advanced Driver Assistance Systems (ADAS). In Recent Advances in Machine Learning Techniques and Sensor Applications for Human Emotion, Activity Recognition and Support, 49-85. https://doi.org/10.1007/978-3-031-71821-2_2
