ABSTRACT
The use of highly scaled technologies and large component counts poses significant reliability challenges for large-scale systems. Knowledge of the failures that occur in such systems is valuable both to component and system vendors, for driving RAS (reliability, availability, and serviceability) design decisions, and to the operators of those systems, for improving resilience. Field studies play a key role in providing insights into the types of failures that occur in real systems, especially at scale. This talk will highlight the value of such studies, discuss implications for future exascale systems, and identify research needs using data from failure analyses of supercomputers and cloud data centers.
Index Terms
- Failures in Large-Scale Systems: Insights from the Field