Understanding and Analyzing Interconnect Errors and Network Congestion on a Large-Scale HPC System
- Wayne State University, Detroit
- Intel Corporation
- Northeastern University, Boston
- University of Tennessee, Knoxville (UTK)
- University of North Texas
- Oak Ridge National Laboratory (ORNL)
Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops thanks to fast compute devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, since they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction among interconnect errors, network congestion, and application characteristics.
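The record itself contains no code, but a minimal sketch can illustrate the kind of log aggregation such a study involves. The one-event-per-line log format, regex pattern, node naming, and function names below are hypothetical, chosen for illustration; they are not Titan's actual Gemini log formats or the authors' tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical log line format: "<timestamp> <node-id> <error-type> ..."
# (illustrative only; real interconnect logs use different layouts).
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<node>c\d+-\d+c\d+s\d+) "
    r"(?P<etype>\w+)"
)

def summarize(log_lines):
    """Count error events per type and per node, and estimate the
    mean time between errors as a first-order resilience indicator."""
    by_type, by_node = Counter(), Counter()
    timestamps = []
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue  # skip lines that do not parse as error events
        by_type[m.group("etype")] += 1
        by_node[m.group("node")] += 1
        timestamps.append(datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"))
    timestamps.sort()
    gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    mtbe = sum(gaps) / len(gaps) if gaps else float("inf")
    return by_type, by_node, mtbe
```

For example, `summarize(open("netwatch.log"))` would return per-type and per-node error counts plus the mean inter-error gap in seconds, the sort of aggregate from which error-rate and congestion analyses typically start.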
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1465034
- Resource Relation:
- Conference: 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2018), Luxembourg City, Luxembourg, June 25-28, 2018
- Country of Publication:
- United States
- Language:
- English