A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log
- ORNL
Reliability, availability, and serviceability (RAS) logs of high performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated by multiple logging systems and sensors that cover many components of the system. Analyzing them for persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult, and the unstructured nature and subsystem-specific properties of the log data make it harder to identify implicit correlations among recorded events. To address these issues, we recently developed a multi-user Big Data analytics framework for HPC log data at Oak Ridge National Laboratory (ORNL). This paper introduces three in-progress data analytics projects that leverage this framework to assess system status, mine event patterns, and study correlations between user applications and system events. We describe the motivation of each project and detail its workflow using three years of log data collected from ORNL's Titan supercomputer.
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1570137
- Resource Relation:
- Conference: IEEE International Conference on Cluster Computing (CLUSTER 2018) - Belfast, United Kingdom - 9/10/2018 to 9/13/2018
- Country of Publication:
- United States
- Language:
- English