Abstract
Cloud-based AI infrastructure is becoming increasingly important, especially for large-scale distributed training. To improve its efficiency and serviceability, real-time monitoring of the infrastructure and workload profiling have empirically proved to be effective approaches. However, the cloud environment poses great challenges: service providers cannot interfere with their tenants' workloads or touch user data, so neither previous instrumentation-based monitoring approaches nor workload trace collection can be applied.
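The nonintrusive alternative the abstract motivates — sampling device-level counters out-of-band instead of instrumenting tenant code — can be sketched as a simple polling loop. This is an illustrative sketch only: `read_counters` and `fake_gpu_counters` are hypothetical stand-ins for a real driver-level interface such as NVML or DCGM, which is what a production monitor would query.

```python
import time

def sample_metrics(read_counters, interval_s=0.1, n_samples=3):
    """Poll hardware counters out-of-band, never touching the
    tenant's process or data -- the constraint the abstract states."""
    samples = []
    for _ in range(n_samples):
        # Tag each reading with a wall-clock timestamp so samples
        # can later be correlated across nodes.
        samples.append({"ts": time.time(), **read_counters()})
        time.sleep(interval_s)
    return samples

# Hypothetical counter reader used only for this sketch; a real
# deployment would read from a driver-level API (e.g. NVML/DCGM).
def fake_gpu_counters():
    return {"sm_util_pct": 87.0, "mem_bw_gbps": 1200.0}

history = sample_metrics(fake_gpu_counters, interval_s=0.01)
print(len(history))  # 3 samples, each with a timestamp and counters
```

Because the sampler only reads counters exposed by the device driver, it needs no cooperation from, or visibility into, the tenant workload.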
Index Terms
- Moneo: Monitoring Fine-grained Metrics Nonintrusively in AI Infrastructure