Abstract
Optimizing the deployment of large language models (LLMs) in edge computing environments is critical for enhancing privacy and computational efficiency. Toward efficient wireless LLM inference at the edge, this study comprehensively analyzes the impact of different splitting points in mainstream open-source LLMs. Building on this analysis, it introduces a framework inspired by model-based reinforcement learning (MBRL) to determine the optimal splitting point across the edge server and user equipment (UE). By incorporating a reward surrogate model, the approach significantly reduces the computational cost of frequent performance evaluations. Extensive simulations demonstrate that the method effectively balances inference performance and computational load under varying network conditions, providing a robust solution for LLM deployment in decentralized settings.
Data availability
The data that support the findings of this study are available in the zjunice GitHub repository at https://github.com/zjunice.
Author information
Contributions
Yuxuan CHEN and Rongpeng LI designed the research. Yuxuan CHEN performed the simulation, processed the data, and drafted the paper. Rongpeng LI and Honggang ZHANG helped organize the paper. Honggang ZHANG, Zhifeng ZHAO, and Xiaoxue YU revised the paper. Yuxuan CHEN and Rongpeng LI finalized the paper.
Ethics declarations
Honggang ZHANG is a guest editor of this special issue; he was not involved with the peer review process of this paper. All the authors declare that they have no conflict of interest.
Additional information
Project supported by the National Key Research and Development Program of China (No. 2024YFE0200600), the National Natural Science Foundation of China (No. 62071425), the Zhejiang Key Research and Development Plan, China (No. 2022C01093), the Zhejiang Provincial Natural Science Foundation of China (No. LR23F010005), the National Key Laboratory of Wireless Communications Foundation, China (No. 2023KP01601), and the Big Data and Intelligent Computing Key Lab of CQUPT, China (No. BDIC-2023-B-001).
Cite this article
Chen, Y., Li, R., Yu, X. et al. Adaptive layer splitting for wireless large language model inference in edge computing: a model-based reinforcement learning approach. Front Inform Technol Electron Eng 26, 278–292 (2025). https://doi.org/10.1631/FITEE.2400468
Key words
- Large language models (LLMs)
- Edge computing
- Model-based reinforcement learning (MBRL)
- Split inference
- Transformer