OSTI.GOV — U.S. Department of Energy
Office of Scientific and Technical Information

Title: Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer

Abstract

Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database in order to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators and computer science researchers. Specifically, we investigate the impact of RAS events on application performance and its variability by comparing cases where events are recorded with corresponding cases where no events are recorded. Such a statistical investigation is possible since we observed that system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not always. We also find that different system components affect application performance differently. In particular, our investigation includes the following components: parallel file system, processor, memory, graphics processing units, and system and user software issues. Our work establishes the importance of providing feedback to system users for increasing operational efficiency of extreme-scale systems.

Authors:
Ashraf, Rizwan [1]; Engelmann, Christian [1]
  1. ORNL
Publication Date:
2018
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1486940
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), Dallas, Texas, United States of America, 11-16 November 2018
Country of Publication:
United States
Language:
English

Citation Formats

Ashraf, Rizwan, and Engelmann, Christian. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. United States: N. p., 2018. Web. doi:10.1109/FTXS.2018.00008.
Ashraf, Rizwan, & Engelmann, Christian. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. United States. https://doi.org/10.1109/FTXS.2018.00008
Ashraf, Rizwan, and Engelmann, Christian. 2018. "Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer". United States. https://doi.org/10.1109/FTXS.2018.00008. https://www.osti.gov/servlets/purl/1486940.
@inproceedings{osti_1486940,
title = {Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer},
author = {Ashraf, Rizwan and Engelmann, Christian},
abstractNote = {Extreme-scale computing systems employ Reliability, Availability and Serviceability (RAS) mechanisms and infrastructure to log events from multiple system components. In this paper, we analyze RAS logs in conjunction with the application placement and scheduling database in order to understand the impact of common RAS events on application performance. This study, conducted on the records of about 2 million applications executed on the Titan supercomputer, provides important insights for system users, operators and computer science researchers. Specifically, we investigate the impact of RAS events on application performance and its variability by comparing cases where events are recorded with corresponding cases where no events are recorded. Such a statistical investigation is possible since we observed that system users tend to execute their applications multiple times. Our analysis reveals that most RAS events do impact application performance, although not always. We also find that different system components affect application performance differently. In particular, our investigation includes the following components: parallel file system, processor, memory, graphics processing units, and system and user software issues. Our work establishes the importance of providing feedback to system users for increasing operational efficiency of extreme-scale systems.},
doi = {10.1109/FTXS.2018.00008},
url = {https://www.osti.gov/biblio/1486940},
booktitle = {2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)},
place = {United States},
year = {2018},
month = {nov}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.


Works referenced in this record:

Failures in large scale systems: long-term measurement, analysis, and implications
conference, January 2017

  • Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
  • https://doi.org/10.1145/3126908.3126937

Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design
conference, January 2012

  • Hwang, Andy A.; Stefanovici, Ioan A.; Schroeder, Bianca
  • Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '12
  • https://doi.org/10.1145/2150976.2150989

Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems
conference, May 2012

  • Gainaru, Ana; Cappello, Franck; Kramer, William
  • 2012 IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS)
  • https://doi.org/10.1109/IPDPS.2012.107

Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters
journal, November 2018


LogDiver
conference, June 2015


Measuring the Impact of Memory Errors on Application Performance
journal, January 2017


Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System
conference, June 2018


Memory Errors in Modern Systems
conference, March 2015

  • Sridharan, Vilas; DeBardeleben, Nathan; Blanchard, Sean
  • Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
  • https://doi.org/10.1145/2694344.2694348

Machine Learning Models for GPU Error Prediction in a Large Scale HPC System
conference, June 2018


Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters
conference, June 2014

  • Martino, Catello Di; Kalbarczyk, Zbigniew; Iyer, Ravishankar K.
  • 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
  • https://doi.org/10.1109/DSN.2014.62

Reading between the lines of failure logs: Understanding how HPC systems fail
conference, June 2013


Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5,000,000 HPC Application Runs
conference, June 2015

  • Martino, Catello Di; Kramer, William; Kalbarczyk, Zbigniew
  • 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
  • https://doi.org/10.1109/DSN.2015.50

DRAM errors in the wild: a large-scale field study
conference, January 2009

  • Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich
  • Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems - SIGMETRICS '09
  • https://doi.org/10.1145/1555349.1555372