Research article · Open access
DOI: 10.1145/3447545.3451185

GradeML: Towards Holistic Performance Analysis for Machine Learning Workflows

Published: 19 April 2021

Abstract

Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has gone into making ML model training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow and PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelines, distributed training). Matching this trend, considerable effort has also gone into performance analysis tools focused on ML model training. However, as we identify in this work, model training rarely happens in isolation; it is instead one step in a larger ML workflow. It is therefore surprising that no performance analysis tool covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the performance of existing tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration into ML systems. Finally, we propose first steps towards implementing this vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.
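The first design pillar, a unified execution model, implies attributing time to every step of a workflow, not just training. The paper abstract does not prescribe an implementation, so the following is a purely illustrative sketch (all class and function names are our own, not GradeML's): a workflow is modeled as a tree of timed phases, and a simple traversal finds the phases whose own time, excluding their sub-phases, exceeds a threshold.

```python
from dataclasses import dataclass, field

@dataclass
class Phase:
    # One step in an ML workflow (e.g. data cleaning, training, serving).
    name: str
    start: float  # seconds since workflow start
    end: float
    children: list = field(default_factory=list)

    def duration(self) -> float:
        return self.end - self.start

    def self_time(self) -> float:
        # Time spent in this phase that no child phase accounts for.
        return self.duration() - sum(c.duration() for c in self.children)

def critical_phases(phase: Phase, threshold: float) -> list:
    # Recursively collect phases whose self-time exceeds the threshold.
    hits = [phase] if phase.self_time() > threshold else []
    for child in phase.children:
        hits += critical_phases(child, threshold)
    return hits

# Toy workflow: ingest -> train (two sub-phases) -> deploy.
train = Phase("train", 10, 100, [
    Phase("data-loading", 10, 40),
    Phase("gradient-steps", 40, 95),
])
workflow = Phase("workflow", 0, 110, [
    Phase("ingest", 0, 10), train, Phase("deploy", 100, 110),
])
print([p.name for p in critical_phases(workflow, 20)])
# → ['data-loading', 'gradient-steps']
```

In this toy trace the bottleneck is inside training, but the same traversal would equally surface a slow data-ingest or deployment step, which is the point of analyzing the whole workflow rather than training alone.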


Cited By

  • (2022) Tooling for Developing Data-Driven Applications: Overview and Outlook. Proceedings of Mensch und Computer 2022, pp. 66-77. DOI: 10.1145/3543758.3543779. Online publication date: 4-Sep-2022.

Published In

    ICPE '21: Companion of the ACM/SPEC International Conference on Performance Engineering
    April 2021, 198 pages
    ISBN: 9781450383318
    DOI: 10.1145/3447545
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. MLdevops
    2. data gathering
    3. gradeML
    4. machine learning workflow
    5. modeling
    6. performance analysis

    Acceptance Rates

    Overall acceptance rate: 252 of 851 submissions, 30%

