DOI: 10.1145/3579370.3594777
Research article

Predicting GPU Failures With High Precision Under Deep Learning Workloads

Published: 22 June 2023

Abstract

Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. In large-scale GPU clusters, GPU failures are inevitable and can have severe consequences: they disrupt distributed training, crash inference services, and lead to service-level agreement violations. In this paper, we study the problem of predicting GPU failures with machine learning (ML) models in order to mitigate their damage.
We train prediction models on a four-month production dataset of 350 million entries collected at ByteDance. We observe that classic prediction models (GBDT, MLP, LSTM, and 1D-CNN) do not perform well: their predictions are imprecise and unstable over time. We propose several techniques to improve the precision and stability of predictions, including parallel and cascade model-ensemble mechanisms and a sliding training method, and evaluate them on production workloads. The results show that our techniques improve prediction precision from 46.3% to 85.4%.
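To make the two headline ideas concrete, here is a minimal, illustrative sketch of a cascade ensemble (a second model re-checks the positives of the first to raise precision) and of sliding training (the model is periodically refit on only a recent window of data to stay stable under drift). The toy threshold rules, feature names, and window size are assumptions for illustration, not the paper's actual models or parameters.

```python
# Illustrative sketch, not the paper's implementation.
from collections import deque

def cascade_predict(x, stage1, stage2):
    """Flag a sample as a likely failure only if both stages agree.

    Requiring agreement trades recall for precision: the second stage
    vetoes false alarms raised by the first.
    """
    return stage1(x) and stage2(x)

class SlidingTrainer:
    """Keep only the most recent `window` samples and refit on them,
    so the model tracks drift in the production workload."""

    def __init__(self, window=1000):
        self.data = deque(maxlen=window)  # (features, label) pairs
        self.positive_rate = 0.0          # stand-in for a real model

    def ingest(self, batch):
        self.data.extend(batch)
        # "Refit" on the window: here we just recompute the positive
        # rate; in practice this would retrain GBDT/MLP/LSTM/1D-CNN
        # models on the windowed data.
        labels = [y for _, y in self.data]
        self.positive_rate = sum(labels) / len(labels)

# Toy usage: two threshold rules as cascade stages (hypothetical features).
stage1 = lambda x: x["ecc_errors"] > 5    # cheap, high-recall filter
stage2 = lambda x: x["temp_spikes"] > 3   # stricter confirmation
sample = {"ecc_errors": 8, "temp_spikes": 1}
print(cascade_predict(sample, stage1, stage2))  # False: stage 2 vetoes
```

A parallel ensemble would instead run the stages independently and combine their votes; the cascade variant shown here is the one that most directly targets precision, since a sample must survive every stage to be flagged.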


    Published In

    SYSTOR '23: Proceedings of the 16th ACM International Conference on Systems and Storage
    June 2023, 168 pages
    ISBN: 978-1-4503-9962-3
    DOI: 10.1145/3579370

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. machine learning
    2. GPU failure prediction
    3. deep learning workloads

    Conference

    SYSTOR '23

    Acceptance Rates

    SYSTOR '23 paper acceptance rate: 12 of 30 submissions (40%)
    Overall acceptance rate: 108 of 323 submissions (33%)
