Title: A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log

Conference

Reliability, availability, and serviceability (RAS) logs of high-performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated by multiple logging systems and sensors that cover many components of the system. Analyzing these data for persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult, and the unstructured nature and unique properties of the log data produced by each subsystem add another dimension of difficulty in identifying implicit correlations among recorded events. To address these issues, we recently developed a multi-user Big Data analytics framework for HPC log data at Oak Ridge National Laboratory (ORNL). This paper introduces three in-progress data analytics projects that leverage this framework to assess system status, mine event patterns, and study correlations between user applications and system events. We describe the motivation of each project and detail its workflow using three years of log data collected from ORNL's Titan supercomputer.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1570137
Resource Relation:
Conference: IEEE International Conference on Cluster Computing (CLUSTER 2018), Belfast, United Kingdom, 9/10/2018 to 9/13/2018
Country of Publication:
United States
Language:
English

