Understanding failures through the lifetime of a top-level supercomputer

https://doi.org/10.1016/j.jpdc.2021.04.001

Highlights

  • A classification of failures that occurred during the lifetime of a supercomputer.

  • A statistical modelling of failure rates for different types of failures.

  • An analysis of the interplay between workload and failures on the system.

Abstract

High performance computing systems are required to solve grand challenges in many scientific disciplines. These systems assemble many components to be powerful enough to solve extremely complex problems. An inherent consequence is the intricacy of the interaction of all those components, especially when failures come into the picture. To design reliable supercomputing platforms in the future, it is crucial to develop an understanding of how these systems fail. This paper presents the results of studying multi-year failure and workload records of a powerful supercomputer that topped the world rankings. We provide a thorough analysis of the data and characterize the reliability of the system along several dimensions: failure classification, failure-rate modelling, and the interplay between failures and workload. The results shed some light on the dynamics of top-level supercomputers and on sensitive areas ripe for improvement.

Introduction

The convergence of high performance computing, data science, and artificial intelligence is providing fertile ground for developing creative solutions to grand challenges in many disciplines. From climate modelling to social network analysis, successful computational strategies demand a gigantic number of processing cycles. Supercomputers built from many heterogeneous computing parts are an essential tool to run simulations where real experiments would be too onerous or hazardous, to analyze massive amounts of scientific records, and to look for hidden patterns in the data. The lasting impact of these machines also comes with some downsides. Supercomputers integrate many components (processing cores, memory hierarchies, communication networks, storage servers, cooling equipment, software layers, and more), and understanding how those components interact, particularly in the presence of failures, is becoming an arduous task. For supercomputers, getting the entire machine allocated to a single simulation represents a significant challenge, since failures are likely to occur in some part of the machine. Therefore, one of the major problems extreme-scale systems face is developing a comprehension of their reliability [2], [8], [27].

This work aims at refining our understanding of failures in supercomputers and their components to improve hardware and software co-design for future systems. Our approach includes the analysis of component failures and workload records of a large supercomputer. We believe it is possible to study the historical records of a machine, create a profile of the dynamics of its reliability, and pin down sensitive areas for improvement.

Starting from a huge curated event database of the machine, we strive to unveil the hidden failure patterns of a top-level supercomputer.1 Such a database was built throughout the years by the system administrators and possibly represents the best available source of information on the machine's internal operation (albeit raw data without conclusions). We start by breaking down the failures into types, providing an enlightening classification and other descriptive statistics of the machine. We extend that understanding by modelling the failure rate with powerful distribution functions. Finally, we cross-correlate the reliability events with the workload, reflecting the use of the system. All our analyses are based on a five-year record dataset, representing most of the lifetime of the supercomputer. As far as we know, this is the first time such techniques have been applied to an uninterrupted, long-duration dataset taken from a supercomputer that topped the world rankings.

The contributions of this paper are:

  • A classification of failures that occurred during the lifetime of a top-level supercomputer. We classify the failures according to type, location, and impact on the system.

  • A statistical modelling of failure rates for different types of failures. We provide failure time series using different time units and use probability distributions to best fit the data.

  • An analysis of the interplay between workload and failures on the system. We contrast failure occurrences with job submissions and measure the impact of failures on the execution time of submitted jobs.

Section snippets

Related work

The analysis of failures on supercomputers has been an active research area. Schroeder and Gibson [25] studied the failure logs of 22 high performance supercomputing systems. They analyzed the root cause of failures, failure rates, statistical properties of time between failures, and repair time. According to their analysis, there is evidence of a correlation between failures and the type and intensity of the workload, and they found that the time between failures is not well modeled by an exponential distribution.

System description

Failure data from the Titan supercomputer were analyzed in this study. Titan, located at the Oak Ridge Leadership Computing Facility (OLCF), was a Cray XK7 machine and one of the first supercomputers to combine CPUs and GPUs in a hybrid architecture. It had 18,688 nodes, each with a 16-core 2.2 GHz AMD Opteron CPU, an NVIDIA Tesla K20X GPU, and 32 GB of RAM. Its maximum (LINPACK) performance was 17.59 petaFLOPS.
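To put those specifications in perspective, the aggregate scale of the machine follows directly from the per-node figures quoted above. A minimal Python sketch of the arithmetic (all inputs come from this section; nothing else is assumed):

    # Aggregate scale of Titan, derived from the per-node figures above.
    nodes = 18688
    cores_per_node = 16            # AMD Opteron cores per node
    ram_per_node_gb = 32           # GB of RAM per node

    total_cores = nodes * cores_per_node            # 299,008 CPU cores
    total_gpus = nodes                              # one GPU per node
    total_ram_tb = nodes * ram_per_node_gb / 1024   # 584 TB of RAM
    print(total_cores, total_gpus, total_ram_tb)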

Failure dataset

The Titan supercomputer was in operation from late 2012 until 2019. The failure dataset analyzed in this study covers five years of that period, representing most of the machine's lifetime.

Failure categorization

This section starts the reliability analysis of a top-level supercomputer with the fundamental goal of understanding how failures are classified. We set off by discovering which types of failures are most prominent, both for the system and user classes. Then, we visualize the physical location of failures and see what patterns emerge in each case. Finally, we figure out the impact failures have in terms of affected nodes, i.e., whether a single node or multiple nodes are taken down by a failure.
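As an illustration of this categorization step, a minimal pandas sketch follows. It assumes a hypothetical event log whose file name and columns (timestamp, category, failure_type, cabinet, nodes_affected) are our own naming for illustration, not the dataset's actual schema:

    import pandas as pd

    # Hypothetical failure log; file name and columns are assumptions.
    events = pd.read_csv("titan_failures.csv", parse_dates=["timestamp"])

    # Failure types, broken down by class (system vs. user).
    by_type = events.groupby(["category", "failure_type"]).size()
    print(by_type.sort_values(ascending=False).head(10))

    # Physical location: failures observed per cabinet.
    print(events.groupby("cabinet").size().sort_values(ascending=False).head(10))

    # Impact: fraction of failures that took down more than one node.
    print((events["nodes_affected"] > 1).mean())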


Failure rate analysis

This section continues the exploration of the reliability of a top-level supercomputer by analyzing the frequency of failures. First, we visualize both system and user failures as a time series across the five-year span. Second, we compute the mean time between failures (MTBF) and model that descriptor. Third, we use powerful statistical distributions to find a best fit for the failure frequencies.

We can understand failures as a discrete variable that changes through time. Fig. 4 shows a timeline of failure occurrences across the five-year span.
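As a concrete illustration of this kind of modelling, the sketch below derives inter-failure times from event timestamps, computes the MTBF, and fits one candidate distribution. The Weibull family and the timestamp column are assumptions for illustration, not necessarily the exact choices made in the paper:

    import pandas as pd
    from scipy import stats

    events = pd.read_csv("titan_failures.csv", parse_dates=["timestamp"])
    times = events["timestamp"].sort_values()

    # Time between consecutive failures, in hours.
    tbf = times.diff().dropna().dt.total_seconds() / 3600.0
    print(f"MTBF: {tbf.mean():.2f} hours")

    # Fit a two-parameter Weibull (location pinned at zero).
    shape, loc, scale = stats.weibull_min.fit(tbf, floc=0)

    # Kolmogorov-Smirnov test as a rough goodness-of-fit check.
    ks, p = stats.kstest(tbf, "weibull_min", args=(shape, loc, scale))
    print(f"Weibull shape={shape:.2f}, scale={scale:.2f}, KS p={p:.3f}")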

Workload and failure interplay

This section complements our study of the reliability of a top-level supercomputer by delving into the interaction between failures and workload. We start by computing the correlation between failures observed in the system and jobs submitted to the supercomputer. Then, we measure the impact of failures on the use of the system and, vice versa, the effect the workload has on the reliability of the supercomputer.

To analyze the possible interaction between the workload and the failures, we contrast failure occurrences with job-submission records over the same five-year period.
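One way to make that interaction concrete is to resample both event streams to a common time unit and correlate the counts. The sketch below assumes hypothetical failure and job logs (file names and the timestamp and submit_time columns are our own); Spearman rank correlation is chosen because weekly counts are typically heavy-tailed:

    import pandas as pd
    from scipy import stats

    failures = pd.read_csv("titan_failures.csv", parse_dates=["timestamp"])
    jobs = pd.read_csv("titan_jobs.csv", parse_dates=["submit_time"])

    # Weekly counts of failures and of job submissions.
    f_weekly = failures.set_index("timestamp").resample("W").size()
    j_weekly = jobs.set_index("submit_time").resample("W").size()

    # Keep only the weeks covered by both logs.
    f_weekly, j_weekly = f_weekly.align(j_weekly, join="inner")

    # Rank correlation between workload intensity and failure counts.
    rho, p = stats.spearmanr(f_weekly, j_weekly)
    print(f"Spearman rho={rho:.2f} (p={p:.3g})")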

Discussion

This section discusses various points that we think deserve consideration. Our in-depth study of the failures and workload of a top-level supercomputer has shed light on several research directions worth exploring:

  • Failure monitoring systems. After an in-depth analysis of the failure dataset, we realized how important it is to include relevant data about other vital hardware components, for instance, failure data on the network and storage. Using that missing information, we could develop a more complete reliability profile of the whole system.

Concluding remarks

Large-scale supercomputing systems integrate a vast number of components that make them powerful enough to solve highly complex problems. An unintended consequence of the sheer size of those systems is their low reliability. Creating a strong understanding of how these components fail and interact is fundamental to maintaining the productivity of HPC platforms.

Our analysis of five years of reliability and workload data on a top-level supercomputer shows that user-generated failures are

CRediT authorship contribution statement

Elvis Rojas: Conceptualization, Data curation, Investigation, Methodology, Software, Writing – original draft. Esteban Meneses: Conceptualization, Formal analysis, Methodology, Supervision. Terry Jones: Formal analysis, Validation, Writing – original draft. Don Maxwell: Formal analysis, Validation, Writing – original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research was partially supported by a machine allocation on Kabré supercomputer at the Costa Rica National High Technology Center.


References (37)

  • Y. Yuan et al., Job failures in high performance computing systems: a large-scale empirical study, Comput. Math. Appl. (Jan. 2012).
  • R. Ashraf et al., Analyzing the impact of system reliability events on applications in the Titan supercomputer.
  • F. Cappello et al., Toward exascale resilience: 2014 update, Supercomput. Front. Innov. Int. J. (Apr. 2014).
  • X. Chen et al., Study and analysis of the high performance computing failures in China meteorological field, J. Geosci. Environ. Prot. (2017).
  • S. Di et al., LogAider: a tool for mining potential correlations of HPC log events.
  • S. Di et al., Exploring properties and correlations of fatal events in a large-scale HPC system, IEEE Trans. Parallel Distrib. Syst. (2019).
  • S. Di et al., Characterizing and understanding HPC job failures over the 2k-day life of IBM BlueGene/Q system.
  • N. El-Sayed et al., Reading between the lines of failure logs: understanding how HPC systems fail.
  • E.N. Elnozahy et al., System resilience at extreme scale (2008).
  • M. Ezell, Understanding the impact of interconnect failures on system operation (2013).
  • D.A.G. Gonçalves de Oliveira et al., Evaluation and mitigation of radiation-induced soft errors in graphics processing units, IEEE Trans. Comput. (2016).
  • H. Guo et al., La VALSE: scalable log visualization for fault characterization in supercomputers.
  • S. Gupta et al., Failures in large scale systems: long-term measurement, analysis, and implications.
  • N. Gurumdimma et al., Understanding error log event sequence for failure analysis, Sci. World J. (2018).
  • A.A. Hwang et al., Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design, Comput. Archit. News (Mar. 2012).
  • Y. Liang et al., Filtering failure logs for a BlueGene/L prototype.
  • R.-T. Liu et al., A large-scale study of failures on petascale supercomputers, J. Comput. Sci. Technol. (Jan. 2018).
  • C.D. Martino et al., Lessons learned from the analysis of system failures at petascale: the case of Blue Waters.

Elvis Rojas is an Information Systems Engineer who graduated from the National University of Costa Rica. He obtained a Master's Degree in Computer Science with an emphasis in Telematics from the Costa Rica Institute of Technology in 2009. He is currently a student in the Engineering Doctorate Program at the Costa Rica Institute of Technology. In addition, he is part of the School of Informatics of the National University of Costa Rica.

Esteban Meneses is the director of the National Advanced Computing Collaboratory at the Costa Rica National High Technology Center, and a part-time Associate Professor in the School of Computing of the Costa Rica Institute of Technology. He holds a PhD in Computer Science from the University of Illinois at Urbana-Champaign. He was a Research Assistant Professor at the Center for Simulation and Modelling of the University of Pittsburgh. His research interests include scientific simulations using parallel objects, fault tolerance for HPC systems, and distributed machine learning frameworks. Dr. Meneses was the General Chair of the Latin America High Performance Computing Conference (CARLA) 2019.

Terry Jones is a scientist who has found a love of working on powerful computing systems, particularly the kind that use multiple computers to tackle massive problems by working in parallel. He started his career as a computer scientist worrying about time-critical real-time programming in the aerospace industry. In 2008, he joined Oak Ridge National Laboratory (ORNL) to work on system software for supercomputers. His interests include parallel programming, distributed systems, operating systems, runtime systems, programming languages, and programming environments. His recreational interests include sports, movies, and reading.

Don Maxwell is the team lead for the HPC team in the High-Performance Computing and Data Operations group. Maxwell began his career with the Department of Energy more than 25 years ago at the Y-12 Security Complex in Oak Ridge, Tennessee, before joining Oak Ridge National Laboratory in 2000. He was integral to the transition from IBM systems to the Cray XT series, including Jaguar, which ranked as the world's most powerful supercomputer in November 2009 on the Top500 list, and Jaguar's successor Titan. Don Maxwell received a lifetime achievement award from Adaptive Computing as part of their first annual Adaptive Awards.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.


Notice: This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
