
Moneo: Monitoring Fine-grained Metrics Nonintrusively in AI Infrastructure

Published: 14 June 2022

Abstract

Cloud-based AI infrastructure is becoming increasingly important, especially for large-scale distributed training. To improve its efficiency and serviceability, real-time monitoring of the infrastructure and workload profiling have empirically proven to be effective approaches. However, the cloud environment poses great challenges: service providers cannot interfere with their tenants' workloads or touch user data, so previous instrumentation-based monitoring approaches cannot be applied, nor can workload traces be collected.
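To make the nonintrusive constraint concrete, the sketch below shows one common way to collect GPU metrics out-of-band: polling driver-level counters (e.g. via `nvidia-smi`, an NVIDIA tool) from the host rather than instrumenting tenant code. This is a minimal illustrative example, not Moneo's actual implementation; the field list and sample output are assumptions.

```python
import subprocess

# Driver-level counters queried out-of-band; the tenant workload is untouched.
FIELDS = ["timestamp", "index", "utilization.gpu", "utilization.memory", "memory.used"]

def parse_smi_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output
    into one dict per GPU sample."""
    samples = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        samples.append(dict(zip(FIELDS, values)))
    return samples

def poll_gpus():
    """One polling step on a host with the NVIDIA driver installed."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_smi_csv(out)

# Parsing demonstrated on captured sample output (no GPU required):
sample = ("2022/02/01 12:00:00.000, 0, 87, 41, 30517\n"
          "2022/02/01 12:00:00.000, 1, 90, 45, 30211")
print(parse_smi_csv(sample)[0]["utilization.gpu"])  # → 87
```

Because only exported hardware counters are read, this style of collector sees per-device utilization without ever attaching to, or requiring cooperation from, the tenant's processes.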



Published in: ACM SIGOPS Operating Systems Review, Volume 56, Issue 1 (June 2022), 76 pages. ISSN: 0163-5980. DOI: 10.1145/3544497.

Copyright © 2022. Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States.

