DOI: 10.1145/3603287.3651205
Research Article · Open Access

An Empirical Analysis and Resource Footprint Study of Deploying Large Language Models on Edge Devices

Published: 27 April 2024

ABSTRACT

The success of ChatGPT is reshaping the landscape of the entire IT industry. The large language model (LLM) powering ChatGPT is experiencing rapid development, marked by enhanced features, improved accuracy, and reduced latency. Due to the execution overhead of LLMs, prevailing commercial LLM products typically manage user queries on remote servers. However, the escalating volume of user queries and the growing complexity of LLMs have led to servers becoming bottlenecks, compromising the quality of service (QoS). To address this challenge, a potential solution is to shift LLM inference services to edge devices, a strategy currently being explored by industry leaders such as Apple, Google, Qualcomm, Samsung, and others. Beyond alleviating the computational strain on servers and enhancing system scalability, deploying LLMs at the edge offers additional advantages. These include real-time responses even in the absence of network connectivity and improved privacy protection for customized or personal LLMs.

This article delves into the challenges and potential bottlenecks currently hindering the effective deployment of LLMs on edge devices. By deploying the Llama 2 7B model with INT4 quantization on diverse edge devices and systematically analyzing the experimental results, we identify insufficient memory and/or computing resources on traditional edge devices as the primary obstacles. Based on these observations and our empirical analysis, we further provide insights and design guidance for the next generation of edge devices and systems, from both hardware and software perspectives.
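The memory pressure the abstract identifies can be made concrete with a back-of-the-envelope calculation: weight storage scales linearly with bits per weight, so INT4 quantization shrinks a 7B-parameter model from roughly 13 GiB at FP16 to about 3.3 GiB, near the limit of many edge devices' RAM. The sketch below is illustrative only (the function name and round 7B parameter count are assumptions) and ignores real-world overheads such as the KV cache, activations, and per-group quantization metadata.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB for a model with the given
    parameter count and uniform quantization width."""
    total_bytes = num_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / (1024 ** 3)                # bytes -> GiB


if __name__ == "__main__":
    # Compare FP16, INT8, and INT4 storage for a 7B-parameter model.
    for bits in (16, 8, 4):
        gib = weight_footprint_gib(7e9, bits)
        print(f"7B model @ {bits}-bit weights: ~{gib:.2f} GiB")
```

Even the INT4 figure leaves little headroom on devices with 4 GB of RAM once the OS, runtime, and KV cache are accounted for, which is consistent with the paper's finding that memory is a primary obstacle.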


Published in

ACM SE '24: Proceedings of the 2024 ACM Southeast Conference, April 2024, 337 pages
ISBN: 9798400702372
DOI: 10.1145/3603287

Copyright © 2024 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

ACM SE '24 paper acceptance rate: 44 of 137 submissions (32%). Overall acceptance rate: 178 of 377 submissions (47%).