Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale

https://doi.org/10.1016/j.future.2013.04.014

Highlights

  • Simulation of different future high-performance computing architectures at scale.

  • Demonstrates scalability to 134,217,728 simulated MPI ranks on 960 real cores.

  • Evaluates MPI collective communication performance on 2,097,152 simulated ranks.

  • Estimates the performance of a Monte Carlo solver on 16,777,216 simulated ranks.

Abstract

As supercomputers scale to 1000 PFlop/s over the next decade, investigating the performance of parallel applications at scale on future architectures and the performance impact of different architecture choices for high-performance computing (HPC) hardware/software co-design is crucial. This paper summarizes recent efforts in designing and implementing a novel HPC hardware/software co-design toolkit. The presented Extreme-scale Simulator (xSim) permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale HPC system using architectural models and virtual timing. This paper demonstrates the capabilities and usefulness of the xSim performance investigation toolkit, such as its scalability to 2^27 simulated Message Passing Interface (MPI) ranks on 960 real processor cores, the capability to evaluate the performance of different MPI collective communication algorithms, and the ability to evaluate the performance of a basic Monte Carlo application with different architectural parameters.

Introduction

With the recent deployment of 10–20 PFlop/s (1 PFlop/s = 10^15 floating-point operations per second) supercomputers and the exascale roadmap targeting 100, 300, and eventually 1000 PFlop/s over the next decade, the trend in supercomputer architecture clearly points in one direction. Systems will dramatically scale up in size, i.e., in compute node and processor thread counts. By 2020, an exascale system may have 1,000,000 compute nodes with 1000–10,000 threads per node. This poses several challenges related to power consumption, performance, resilience, productivity, programmability, data movement, and data management.

The expected growth in concurrency from today’s 1.57 million hardware threads in the IBM BlueGene/Q Sequoia supercomputer at Lawrence Livermore National Laboratory to 1–10 billion hardware threads at exascale causes parallel application scalability issues due to sequential application parts, synchronizing communication, and other bottlenecks. High-performance computing (HPC) hardware/software co-design is crucial to enable extreme-scale computing by closing the gap between the peak capabilities of the hardware and the performance realized by HPC applications (the application-architecture performance gap).

Investigating the performance of parallel applications at scale on future architectures and the performance impact of different architecture choices is an important component of HPC hardware/software co-design. Without access to future architectures at scale, simulation approaches provide an alternative for estimating parallel application performance on potential architecture choices. As highly accurate simulations are extremely slow and scale poorly, different solution paths exist that trade off simulation accuracy for simulation performance and scalability.

This paper summarizes recent efforts in designing and implementing a novel HPC hardware/software co-design toolkit. The presented work focuses on a lightweight parallel discrete event simulation (PDES) solution for investigating the performance of Message Passing Interface (MPI) applications at extreme scale. The developed Extreme-scale Simulator (xSim) permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale system using architectural models and virtual timing. This paper demonstrates the capabilities and usefulness of the xSim performance toolkit. Specifically, it shows:

  • the scalability to 134,217,728 (2^27) simulated MPI ranks, each with its own context, on a 960-core Linux cluster (a world record in extreme-scale simulation);

  • the capability to evaluate the performance of MPI collective communication algorithms on 2,097,152 (2^21) simulated MPI ranks using the same cluster; and

  • the ability to estimate the performance of a Monte Carlo solver on 16,777,216 (2^24) simulated MPI ranks with varying architectural parameters using the same cluster.
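The last of these demonstrations studies a basic Monte Carlo solver whose simulated runtime depends on architectural parameters. As a rough, purely illustrative sketch of how such a study combines per-rank computation with a modeled collective, consider the following; the function name, cost constants, and the tree-allreduce term are hypothetical assumptions, not xSim’s actual models:

```python
import math
import random

# Illustrative sketch (not xSim's implementation): one representative rank
# performs its Monte Carlo samples, then the modeled runtime is the local
# compute time plus a tree-shaped allreduce across all simulated ranks.

def simulated_pi_run(ranks, samples_per_rank, sample_time=1e-8,
                     alpha=2e-6, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples_per_rank):          # one representative rank's work
        x, y = rng.random(), rng.random()
        hits += (x * x + y * y) <= 1.0         # dart lands inside quarter circle
    pi_est = 4.0 * hits / samples_per_rank
    compute = samples_per_rank * sample_time           # modeled local compute
    allreduce = math.ceil(math.log2(ranks)) * alpha    # modeled tree allreduce
    return pi_est, compute + allreduce
```

Varying `alpha` (network latency) or `sample_time` (processor speed) shifts the point at which communication begins to dominate, which is the kind of architectural question the Monte Carlo experiments address.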

The paper is structured as follows. Section 2 briefly discusses related work, while Section 3 provides an overview of xSim’s architecture and design. Section 4 demonstrates the capabilities and usefulness of xSim via a variety of experimental results. Sections 5 and 6 conclude this paper with a summary and an outlook on future work.

Section snippets

Related work

xSim’s predecessor, the Java Cellular Architecture Simulator (JCAS) [1], was developed in 2001 to investigate scalable fault-tolerant algorithms for large-scale systems. The prototype was able to run up to 500,000 simulated processes on a cluster with 5 native processors (using 1 for visualization) solving basic mathematical problems. While it was able to run algorithms at scale, it lacked important features, such as time-accurate simulation, high performance, support for running the

xSim: the Extreme-scale Simulator

The Extreme-scale Simulator (xSim) is a performance investigation toolkit that permits running HPC applications in a controlled environment with millions of concurrent execution threads. It allows observing application performance in a simulated extreme-scale HPC system for hardware/software co-design. Much of its architecture and design has been published before [12], [13], [14], [15] and is only summarized in this paper. Additionally, a few new features have been added that will be
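To convey the core idea of virtual timing in a lightweight discrete event simulation, the following is a minimal, purely illustrative sketch: each simulated rank carries its own virtual clock, a send charges local compute time to the sender and enqueues a delivery event, and a receiver's clock advances to the delivery time. All names and parameters here are hypothetical and greatly simplified relative to xSim:

```python
import heapq

# Hypothetical sketch of virtual-clock bookkeeping in a lightweight PDES.
# Events are processed in virtual-time order from a global priority queue.

class SimRank:
    def __init__(self, rank):
        self.rank = rank
        self.clock = 0.0  # virtual time in seconds

def run(num_ranks, latency=1e-6):
    ranks = [SimRank(r) for r in range(num_ranks)]
    events = []  # heap of (delivery_time, seq, dest_rank, payload)
    seq = 0

    def send(src, dest, payload, compute_time):
        nonlocal seq
        src.clock += compute_time  # charge local work to the sender's clock
        heapq.heappush(events, (src.clock + latency, seq, dest, payload))
        seq += 1

    # rank 0 pings every other rank, doing 5 us of simulated compute per send
    for r in range(1, num_ranks):
        send(ranks[0], r, "ping", compute_time=5e-6)

    while events:
        t, _, dest, payload = heapq.heappop(events)
        # a receiver's virtual clock advances to the delivery time, never backward
        ranks[dest].clock = max(ranks[dest].clock, t)

    return [r.clock for r in ranks]
```

The real xSim executes the application itself (each simulated rank in its own execution context) rather than scripted events, and applies full architectural models; this sketch only shows the virtual-clock accounting.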

Experimental results

The main contribution of this paper is the demonstration of the capabilities and usefulness of xSim. Experiments were executed focusing on evaluating xSim’s scalability, using it for investigating the performance of different MPI collective communication algorithms, and employing it to identify the impact of different architectural choices on the performance of a basic Monte Carlo application. The experiments were performed on a 960-core Linux cluster with 40 compute nodes, two 1.7 GHz AMD
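The collective communication experiments compare algorithms whose costs can be reasoned about with a simple latency/bandwidth ("alpha-beta") model. The sketch below contrasts a linear and a binomial-tree broadcast under such a model; the constants and function names are illustrative assumptions, not measurements or models from the paper:

```python
import math

# Back-of-the-envelope alpha-beta cost model for two MPI broadcast
# algorithms, the kind of trade-off such collective experiments explore.
ALPHA = 2e-6   # per-message latency in seconds (assumed value)
BETA = 1e-9    # per-byte transfer time in seconds (assumed value)

def bcast_linear(p, nbytes):
    """Root sends the message to each of the other p - 1 ranks in turn."""
    return (p - 1) * (ALPHA + nbytes * BETA)

def bcast_binomial(p, nbytes):
    """Binomial tree: ceil(log2 p) rounds; message holders double each round."""
    return math.ceil(math.log2(p)) * (ALPHA + nbytes * BETA)

# At the paper's scale of 2^21 ranks and a small message, the modeled tree
# broadcast is roughly (p - 1) / log2(p), i.e. about five orders of
# magnitude, faster than the linear one.
```

The point is that at 2^21 ranks, even crude analytic models separate algorithms dramatically; a simulator like xSim refines this with actual execution and a network model.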

Conclusions

This paper summarized a recent effort focused on the design and implementation of a novel HPC hardware/software co-design toolkit. The presented Extreme-scale Simulator (xSim) builds on a unique concept that utilizes a light-weight PDES to run an HPC application in a controlled environment with millions of concurrent execution threads. Using architectural models and virtual timing, application performance can be observed in a simulated extreme-scale HPC system.

The capabilities

Future work

While the presented work is novel and useful, a few limitations remain that will be addressed in the near future. First and foremost, full network contention modeling is planned as an optional part of xSim. The current sender-/receiver-based network contention modeling does not include all network nodes in the routing path; it will be extended to support this as an optional feature (though at some cost to scalability). Another deficiency is the rather simplistic processor model. Current plans

Acknowledgments

This research is sponsored by the Office of Advanced Scientific Computing Research, US Department of Energy (DOE). This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the DOE. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or


References (15)

  • C. Engelmann et al.

    Super-scalable algorithms for computing on 100,000 processors

  • G. Bosilca et al.

    Recovery patterns for iterative methods in a parallel unstable environment

    SIAM Journal on Scientific Computing (SISC)

    (2007)
  • Z. Chen et al.

    Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

  • H. Ltaief et al.

    Fault tolerant algorithms for heat transfer problems

    Journal of Parallel and Distributed Computing (JPDC)

    (2008)
  • G. Zheng et al.

    BigSim: a parallel simulator for performance prediction of extremely large parallel machines

  • L.V. Kale et al.

    Programming petascale applications with Charm++ and AMPI

  • K.S. Perumalla

    μπ: a highly scalable and transparent system for simulating MPI programs


Cited by (39)

  • Simulation-based optimization and sensibility analysis of MPI applications: Variability matters

    2022, Journal of Parallel and Distributed Computing
    Citation Excerpt:

    The main difficulty resides in capturing and modeling the interplay between the application and the platform while faithfully accounting for their respective complexity. A promising approach recently pioneered in several tools [9,13,19] consists in emulating the application in a controlled way so that a platform simulator governs its execution. Although this approach's scalability is a primary concern that has already received lots of attention, the accuracy of the simulation is even more challenging.

  • ARRC: A random ray neutron transport code for nuclear reactor simulation

    2018, Annals of Nuclear Energy
    Citation Excerpt:

    This extreme memory savings can also improve computational performance by increasing cache efficiency and reducing memory bandwidth requirements. Any method aiming to simulate entire nuclear reactors in high fidelity must be able to map efficiently onto supercomputer architectures, where many separate computers are networked together and working in tight coordination to run a simulation (Attig et al., 2011; Engelmann et al., 2014; Rajovic et al., 2013). Running on a supercomputer is necessary both to reduce the time to solution and to greatly increase the total amount of memory available for the simulation in order to allow a 3D full reactor core to be simulated with a sufficiently high resolution (Sanchez, 2012; Kochunas and Downar, 2013; Romano et al., 2013; Hoogenboom et al., 2013; Felker et al., 2012; Horelik et al., 2014).

  • Memory bottlenecks and memory contention in multi-core Monte Carlo transport codes

    2015, Annals of Nuclear Energy
    Citation Excerpt:

    One might reasonably conclude that 69% or 96% scaling out to 16 cores is adequate speedup. However, next-generation node architectures are likely to require up to thousand-way on-node shared memory parallelism (Dosanjh et al., 2014; Engelmann, 2014; Rajovic et al., 2013; Attig et al., 2011), and thus it is crucial to ascertain the cause of the observed degradation and the implications for greater levels of scalability. Considering nodes with 32, 64, 128, or 1024 shared memory cores and beyond, it cannot be taken for granted that performance will continue to improve.

  • Improving Simulations of Task-Based Applications on Complex NUMA Architectures

    2023, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Christian Engelmann is Task Lead of the System Software Team in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. He holds a Ph.D. and an M.Sc. degree in Computer Science from the University of Reading, UK, and a German Certified Engineer diploma in Computer Systems Engineering from the University of Applied Sciences Berlin. He has 12+ years of experience in research and development for next-generation extreme-scale high-performance computing (HPC) systems. In collaboration with other laboratories and universities, his research aims at computer science challenges for HPC system software, such as dependability, scalability, and portability. Dr. Engelmann’s primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. His secondary expertise is in HPC hardware/software co-design through lightweight simulation of extreme-scale systems with millions of processor cores to study the impact of hardware properties on parallel application performance.
