Research article · Open access
DOI: 10.1145/3447545.3451185

GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows

Published: 19 April 2021

Abstract

Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has gone into making ML model training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow and PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelines, distributed training). Matching this trend, considerable effort has also gone into performance analysis tools focused on ML model training. However, as we identify in this work, model training rarely happens in isolation; it is instead one step in a larger ML workflow. It is therefore surprising that no performance analysis tool covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the performance of existing tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration into ML systems. Finally, we propose first steps towards implementing this vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.
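The first design pillar, a unified execution model, implies attributing time to every step of a workflow, not just training. The paper abstract does not prescribe an implementation, so the following is a purely illustrative sketch (all class and function names are our own, not GradeML's): a workflow is modeled as a tree of timed phases, and a simple traversal finds the phases whose own time, excluding their sub-phases, exceeds a threshold.

```python
from dataclasses import dataclass, field

@dataclass
class Phase:
    # One step in an ML workflow (e.g. data cleaning, training, serving).
    name: str
    start: float  # seconds since workflow start
    end: float
    children: list = field(default_factory=list)

    def duration(self) -> float:
        return self.end - self.start

    def self_time(self) -> float:
        # Time spent in this phase that no child phase accounts for.
        return self.duration() - sum(c.duration() for c in self.children)

def critical_phases(phase: Phase, threshold: float) -> list:
    # Recursively collect phases whose self-time exceeds the threshold.
    hits = [phase] if phase.self_time() > threshold else []
    for child in phase.children:
        hits += critical_phases(child, threshold)
    return hits

# Toy workflow: ingest -> train (two sub-phases) -> deploy.
train = Phase("train", 10, 100, [
    Phase("data-loading", 10, 40),
    Phase("gradient-steps", 40, 95),
])
workflow = Phase("workflow", 0, 110, [
    Phase("ingest", 0, 10), train, Phase("deploy", 100, 110),
])
print([p.name for p in critical_phases(workflow, 20)])
# → ['data-loading', 'gradient-steps']
```

In this toy trace the bottleneck is inside training, but the same traversal would equally surface a slow data-ingest or deployment step, which is the point of analyzing the whole workflow rather than training alone.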


Cited By

  • (2022) Tooling for Developing Data-Driven Applications: Overview and Outlook. Proceedings of Mensch und Computer 2022, pp. 66-77. DOI: 10.1145/3543758.3543779. Online publication date: 4-Sep-2022.

Published In

    ICPE '21: Companion of the ACM/SPEC International Conference on Performance Engineering
    April 2021, 198 pages
    ISBN: 9781450383318
    DOI: 10.1145/3447545
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. MLdevops
    2. data gathering
    3. gradeML
    4. machine learning workflow
    5. modeling
    6. performance analysis

    Acceptance Rates

    Overall acceptance rate: 252 of 851 submissions, 30%

