invited-talk

Failures in Large-Scale Systems: Insights from the Field

Author:
Sudhanva Gurumurthi

Advanced Micro Devices, Inc. & University of Virginia, Boxborough, MA, USA

FTXS '15: Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, June 2015, Page 1. https://doi.org/10.1145/2751504.2751514

Published: 15 June 2015

ABSTRACT

The use of highly scaled technologies and large component counts poses significant reliability challenges for large-scale systems. Knowledge of the failures that occur in such systems is valuable to component and system vendors for driving RAS (reliability, availability, and serviceability) design decisions, and to the operators of those systems for improving resilience. Field studies play a key role in providing insights into the types of failures that occur in real systems, especially at scale. This talk will highlight the value of such studies, discuss implications for future exascale systems, and identify research needs using data from failure analyses of supercomputers and cloud data centers.

Index Terms

1. Hardware
  1. Robustness

Published in
FTXS '15: Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale
June 2015
78 pages
ISBN: 9781450335690
DOI: 10.1145/2751504
General Chair: Nathan DeBardeleben (Los Alamos National Laboratory, USA)
Program Chairs: Nathan DeBardeleben (Los Alamos National Laboratory, USA), Franck Cappello (Argonne National Laboratory and UIUC, USA), Robert Clay (Sandia National Laboratories, USA)
Copyright © 2015 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data centers
field data analysis
supercomputers
Qualifiers
- invited-talk
Acceptance Rates
FTXS '15 Paper Acceptance Rate: 9 of 15 submissions, 60%
Overall Acceptance Rate: 16 of 25 submissions, 64%