Understanding and Analyzing Interconnect Errors and Network Congestion on a Large-Scale HPC System
- Wayne State University, Detroit
- Intel Corporation
- Northeastern University, Boston
- University of Tennessee, Knoxville (UTK)
- University of North Texas
- Oak Ridge National Laboratory (ORNL)
Today's High Performance Computing (HPC) systems are capable of delivering performance on the order of petaflops thanks to fast compute devices, network interconnects, and back-end storage systems. In particular, interconnect resilience and congestion resolution methods have a major impact on overall interconnect and application performance. This is especially true for scientific applications that run multiple processes on different compute nodes, since they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks state-of-practice experience reports that detail how different interconnect errors and congestion events occur on large-scale HPC systems. Therefore, in this paper, we process and analyze interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. We also study the interaction among interconnect errors, network congestion, and application characteristics.
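The record itself contains no code, but a minimal sketch can illustrate the kind of log aggregation such a study involves. The one-event-per-line log format, regex pattern, node naming, and function names below are hypothetical, chosen for illustration; they are not Titan's actual Gemini log formats or the authors' tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical log line format: "<timestamp> <node-id> <error-type> ..."
# (illustrative only; real interconnect logs use different layouts).
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<node>c\d+-\d+c\d+s\d+) "
    r"(?P<etype>\w+)"
)

def summarize(log_lines):
    """Count error events per type and per node, and estimate the
    mean time between errors as a first-order resilience indicator."""
    by_type, by_node = Counter(), Counter()
    timestamps = []
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue  # skip lines that do not parse as error events
        by_type[m.group("etype")] += 1
        by_node[m.group("node")] += 1
        timestamps.append(datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"))
    timestamps.sort()
    gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    mtbe = sum(gaps) / len(gaps) if gaps else float("inf")
    return by_type, by_node, mtbe
```

For example, `summarize(open("netwatch.log"))` would return per-type and per-node error counts plus the mean inter-error gap in seconds, the sort of aggregate from which error-rate and congestion analyses typically start.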
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1465034
- Resource Relation:
- Conference: 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2018), Luxembourg City, Luxembourg, June 25-28, 2018
- Country of Publication:
- United States
- Language:
- English