A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log

Park, Byung; Hui, Yawei; Boehm, Swen; Ashraf, Rizwan; Layton, Chris; Engelmann, Christian

doi:10.1109/CLUSTER.2018.00073

Conference · November 1, 2018

DOI: https://doi.org/10.1109/CLUSTER.2018.00073 · OSTI ID: 1570137

ORNL

Reliability, availability, and serviceability (RAS) logs of high-performance computing (HPC) resources, when closely investigated in spatial and temporal dimensions, can provide invaluable information regarding system status, performance, and resource utilization. These data are often generated by multiple logging systems and sensors that cover many components of the system. Analyzing these data for persistent temporal and spatial insights faces two main difficulties: the volume of RAS logs makes manual inspection difficult, and the unstructured nature and unique properties of the log data produced by each subsystem add another dimension of difficulty in identifying implicit correlations among recorded events. To address these issues, we recently developed a multi-user Big Data analytics framework for HPC log data at Oak Ridge National Laboratory (ORNL). This paper introduces three in-progress data analytics projects that leverage this framework to assess system status, mine event patterns, and study correlations between user applications and system events. We describe the motivation of each project and detail its workflow using three years of log data collected from ORNL's Titan supercomputer.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1570137

Resource Relation: Conference: IEEE International Conference on Cluster Computing (CLUSTER 2018), Belfast, United Kingdom, September 10-13, 2018

Country of Publication: United States

Language: English
