Characterization and identification of HPC applications at leadership computing facility

Liu, Zhengchun; Rao, Nageswara; Kettimuthu, Rajkumar; Foster, Ian; Lewis, Ryan; Harms, Kevin; Carns, Philip; Papka, Michael

doi:10.1145/3392717.3392774

Conference · June 1, 2020

DOI: https://doi.org/10.1145/3392717.3392774 · OSTI ID: 1649007

Liu, Zhengchun ^[1]; Rao, Nageswara ^[2]; Kettimuthu, Rajkumar ^[1]; Foster, Ian ^[1]; Lewis, Ryan ^[3]; Harms, Kevin ^[1]; Carns, Philip ^[1]; Papka, Michael ^[1]

[1] Argonne National Laboratory (ANL)
[2] Oak Ridge National Laboratory (ORNL)
[3] Northern Illinois University

High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive but essential for running large HPC applications. The petascale era of supercomputers began in 2008, when the first machines achieved performance in excess of one petaflops; with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the exascale era will soon begin. However, high theoretical computing capability (i.e., peak FLOPS) is not the only meaningful target when designing a supercomputer, because the resource demands of applications vary. A deep understanding of the characteristics of applications that run on a leadership supercomputer is one of the most important inputs for planning its design, development, and operation. To improve our understanding of HPC applications, user demands, and resource usage characteristics, we perform correlative analysis of logs from different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns that in some cases conflict with existing assumptions and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while applications spend significant time on MPI, most applications spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs shows that the probability of a hardware FATAL event causing task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher-level parallel I/O. Based on holistic insights into the applications gained through co-analysis of multiple logs from different perspectives, together with general intuition, we engineer features to "fingerprint" HPC applications. We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features, and finally train machine learning models to identify HPC applications or group those with similar characteristics. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1649007

Resource Relation: Conference: ACM International Conference on Supercomputing (ICS20) - Virtual conference, Tennessee, United States of America - 6/29/2020-7/2/2020

Country of Publication: United States

Language: English

Similar Records

SCR-Exa: Enhanced Scalable Checkpoint Restart (SCR) Library for Next Generation Exascale Computing

Technical Report · February 21, 2022

Dai, Donglai

...And Eat it Too: High Read Performance in Write-Optimized HPC I/O Middleware File Formats

Conference · January 1, 2009

Klasky, Scott A; Lofstead, J.; Bent, John; +5 more

Scalable I/O Tracing and Analysis

Conference · January 1, 2009

Vijayakumar, Karthik; Mueller, Frank; Ma, Xiaosong; +1 more
