Research article
DOI: 10.1145/3642970.3655840

Navigating Challenges and Technical Debt in Large Language Models Deployment

Published: 22 April 2024

Abstract

Large Language Models (LLMs) have become essential tools for advancing artificial intelligence and machine learning, enabling outstanding capabilities in natural language processing and understanding. However, deploying LLMs efficiently in production environments reveals a complex landscape of challenges and technical debt.
In this paper, we highlight the distinctive forms of challenge and technical debt associated with LLM deployment, including those related to memory management, parallelism strategies, model compression, and attention optimization. These challenges demand tailored deployment approaches and sophisticated engineering solutions that are not readily available in general-purpose machine learning libraries or inference engines.
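
To make the memory-management challenge concrete, the short Python sketch below estimates the key-value (KV) cache footprint that a decoder-only transformer accumulates during inference. This is an illustrative sketch, not material from the paper: the helper kv_cache_bytes is hypothetical, and the model dimensions (32 layers, 32 KV heads, head dimension 128, fp16 storage) are assumptions chosen to resemble a 7B-parameter model.

    # Illustrative sketch: KV-cache footprint of a hypothetical 7B-class model.
    # All dimensions below are assumptions for illustration only.

    def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                       seq_len, batch_size, bytes_per_elem=2):
        """Bytes needed to cache keys and values for one batch of requests.

        The leading factor of 2 covers both the key and the value tensor at
        every layer; bytes_per_elem=2 assumes fp16/bf16 storage.
        """
        return (2 * num_layers * num_kv_heads * head_dim
                * seq_len * batch_size * bytes_per_elem)

    if __name__ == "__main__":
        # Assumed 7B-class configuration: 32 layers, 32 KV heads, head dim 128.
        size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                              seq_len=4096, batch_size=16)
        print(f"KV cache at batch=16, 4K context: ~{size / 2**30:.0f} GiB")  # ~32 GiB

At roughly 32 GiB for the cache alone, before model weights and activations are counted, it is easy to see why paged KV-cache management, cache quantization, and attention-level optimizations recur as deployment concerns.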



Published In

EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems
April 2024, 218 pages
ISBN: 9798400705410
DOI: 10.1145/3642970

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. High-Throughput LLM Processing
  2. LLM Deployment Challenges
  3. LLM Model Compression and Pruning
  4. LLMs Deployment
  5. Large Language Models (LLMs)
  6. Scalability Challenges in LLMs Deployment
  7. Technical Debt in AI

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroSys '24

Acceptance Rates

Overall Acceptance Rate 18 of 26 submissions, 69%

