Abstract
Cloud-based AI infrastructure is becoming increasingly important, especially for large-scale distributed training. To improve its efficiency and serviceability, real-time monitoring of the infrastructure and workload profiling have empirically proved to be effective approaches. However, the cloud environment poses great challenges: service providers cannot interfere with their tenants' workloads or touch user data, so neither previous instrumentation-based monitoring approaches nor workload trace collection can be applied.
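The nonintrusive alternative the abstract motivates — sampling device-level counters out-of-band instead of instrumenting tenant code — can be sketched as a simple polling loop. This is an illustrative sketch only: `read_counters` and `fake_gpu_counters` are hypothetical stand-ins for a real driver-level interface such as NVML or DCGM, which is what a production monitor would query.

```python
import time

def sample_metrics(read_counters, interval_s=0.1, n_samples=3):
    """Poll hardware counters out-of-band, never touching the
    tenant's process or data -- the constraint the abstract states."""
    samples = []
    for _ in range(n_samples):
        # Tag each reading with a wall-clock timestamp so samples
        # can later be correlated across nodes.
        samples.append({"ts": time.time(), **read_counters()})
        time.sleep(interval_s)
    return samples

# Hypothetical counter reader used only for this sketch; a real
# deployment would read from a driver-level API (e.g. NVML/DCGM).
def fake_gpu_counters():
    return {"sm_util_pct": 87.0, "mem_bw_gbps": 1200.0}

history = sample_metrics(fake_gpu_counters, interval_s=0.01)
print(len(history))  # 3 samples, each with a timestamp and counters
```

Because the sampler only reads counters exposed by the device driver, it needs no cooperation from, or visibility into, the tenant workload.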
Index Terms
- Moneo: Monitoring Fine-grained Metrics Nonintrusively in AI Infrastructure