Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale

https://doi.org/10.1016/j.future.2013.04.014

Highlights

  • Simulation of different future high-performance computing architectures at scale.

  • Demonstrates scalability to 134,217,728 simulated MPI ranks on 960 real cores.

  • Evaluates MPI collective communication performance on 2,097,152 simulated ranks.

  • Estimates the performance of a Monte Carlo solver on 16,777,216 simulated ranks.

Abstract

As supercomputers scale to 1000 PFlop/s over the next decade, investigating the performance of parallel applications at scale on future architectures and the performance impact of different architecture choices for high-performance computing (HPC) hardware/software co-design is crucial. This paper summarizes recent efforts in designing and implementing a novel HPC hardware/software co-design toolkit. The presented Extreme-scale Simulator (xSim) permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale HPC system using architectural models and virtual timing. This paper demonstrates the capabilities and usefulness of the xSim performance investigation toolkit, such as its scalability to 2^27 simulated Message Passing Interface (MPI) ranks on 960 real processor cores, the capability to evaluate the performance of different MPI collective communication algorithms, and the ability to evaluate the performance of a basic Monte Carlo application with different architectural parameters.

Introduction

With the recent deployment of 10–20 PFlop/s (1 PFlop/s = 10^15 floating-point operations per second) supercomputers and the exascale roadmap targeting 100, 300, and eventually 1000 PFlop/s over the next decade, the trend in supercomputer architecture clearly points in one direction. Systems will dramatically scale up in size, i.e., in compute node and processor thread counts. By 2020, an exascale system may have 1,000,000 compute nodes with 1000–10,000 threads per node. This poses several challenges related to power consumption, performance, resilience, productivity, programmability, data movement, and data management.

The expected growth in concurrency from today’s 1.57 million hardware threads in the IBM BlueGene/Q Sequoia supercomputer at Lawrence Livermore National Laboratory to 1–10 billion hardware threads at exascale causes parallel application scalability issues due to sequential application parts, synchronizing communication, and other bottlenecks. High-performance computing (HPC) hardware/software co-design is crucial to enable extreme-scale computing by closing the gap between the peak capabilities of the hardware and the performance realized by HPC applications (the application-architecture performance gap).

Investigating the performance of parallel applications at scale on future architectures and the performance impact of different architecture choices is an important component of HPC hardware/software co-design. Without access to future architectures at scale, simulation approaches provide an alternative for estimating parallel application performance on potential architecture choices. As highly accurate simulations are extremely slow and scale poorly, different solution paths exist that trade off simulation accuracy for simulation performance and scalability.

This paper summarizes recent efforts in designing and implementing a novel HPC hardware/software co-design toolkit. The presented work focuses on a lightweight parallel discrete event simulation (PDES) solution for investigating the performance of Message Passing Interface (MPI) applications at extreme scale. The developed Extreme-scale Simulator (xSim) permits running an HPC application in a controlled environment with millions of concurrent execution threads while observing its performance in a simulated extreme-scale system using architectural models and virtual timing. This paper demonstrates the capabilities and usefulness of the xSim performance toolkit. Specifically, it shows:

  • the scalability to 134,217,728 (2^27) simulated MPI ranks, each with its own context, on a 960-core Linux cluster (a world record in extreme-scale simulation);

  • the capability to evaluate the performance of MPI collective communication algorithms on 2,097,152 (2^21) simulated MPI ranks using the same cluster; and

  • the ability to estimate the performance of a Monte Carlo solver on 16,777,216 (2^24) simulated MPI ranks with varying architectural parameters using the same cluster.
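The last of these demonstrations studies a basic Monte Carlo solver whose simulated runtime depends on architectural parameters. As a rough, purely illustrative sketch of how such a study combines per-rank computation with a modeled collective, consider the following; the function name, cost constants, and the tree-allreduce term are hypothetical assumptions, not xSim’s actual models:

```python
import math
import random

# Illustrative sketch (not xSim's implementation): one representative rank
# performs its Monte Carlo samples, then the modeled runtime is the local
# compute time plus a tree-shaped allreduce across all simulated ranks.

def simulated_pi_run(ranks, samples_per_rank, sample_time=1e-8,
                     alpha=2e-6, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples_per_rank):          # one representative rank's work
        x, y = rng.random(), rng.random()
        hits += (x * x + y * y) <= 1.0         # dart lands inside quarter circle
    pi_est = 4.0 * hits / samples_per_rank
    compute = samples_per_rank * sample_time           # modeled local compute
    allreduce = math.ceil(math.log2(ranks)) * alpha    # modeled tree allreduce
    return pi_est, compute + allreduce
```

Varying `alpha` (network latency) or `sample_time` (processor speed) shifts the point at which communication begins to dominate, which is the kind of architectural question the Monte Carlo experiments address.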

The paper is structured as follows. Section 2 briefly discusses related work, while Section 3 provides an overview of xSim’s architecture and design. Section 4 demonstrates the capabilities and usefulness of xSim via a variety of experimental results. Sections 5 and 6 conclude this paper with a summary and an outlook on future work.

Section snippets

Related work

xSim’s predecessor, the Java Cellular Architecture Simulator (JCAS) [1], was developed in 2001 to investigate scalable fault-tolerant algorithms for large-scale systems. The prototype was able to run up to 500,000 simulated processes on a cluster with 5 native processors (using 1 for visualization) solving basic mathematical problems. While it was able to run algorithms at scale, it lacked important features, such as time-accurate simulation, high performance, support for running the

xSim: the Extreme-scale Simulator

The Extreme-scale Simulator (xSim) is a performance investigation toolkit that permits running HPC applications in a controlled environment with millions of concurrent execution threads. It allows observing application performance in a simulated extreme-scale HPC system for hardware/software co-design. Much of its architecture and design has been published before [12], [13], [14], [15] and is only summarized in this paper. Additionally, a few new features have been added that will be
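To convey the core idea of virtual timing in a lightweight discrete event simulation, the following is a minimal, purely illustrative sketch: each simulated rank carries its own virtual clock, a send charges local compute time to the sender and enqueues a delivery event, and a receiver's clock advances to the delivery time. All names and parameters here are hypothetical and greatly simplified relative to xSim:

```python
import heapq

# Hypothetical sketch of virtual-clock bookkeeping in a lightweight PDES.
# Events are processed in virtual-time order from a global priority queue.

class SimRank:
    def __init__(self, rank):
        self.rank = rank
        self.clock = 0.0  # virtual time in seconds

def run(num_ranks, latency=1e-6):
    ranks = [SimRank(r) for r in range(num_ranks)]
    events = []  # heap of (delivery_time, seq, dest_rank, payload)
    seq = 0

    def send(src, dest, payload, compute_time):
        nonlocal seq
        src.clock += compute_time  # charge local work to the sender's clock
        heapq.heappush(events, (src.clock + latency, seq, dest, payload))
        seq += 1

    # rank 0 pings every other rank, doing 5 us of simulated compute per send
    for r in range(1, num_ranks):
        send(ranks[0], r, "ping", compute_time=5e-6)

    while events:
        t, _, dest, payload = heapq.heappop(events)
        # a receiver's virtual clock advances to the delivery time, never backward
        ranks[dest].clock = max(ranks[dest].clock, t)

    return [r.clock for r in ranks]
```

The real xSim executes the application itself (each simulated rank in its own execution context) rather than scripted events, and applies full architectural models; this sketch only shows the virtual-clock accounting.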

Experimental results

The main contribution of this paper is the demonstration of the capabilities and usefulness of xSim. Experiments were executed focusing on evaluating xSim’s scalability, using it for investigating the performance of different MPI collective communication algorithms, and employing it to identify the impact of different architectural choices on the performance of a basic Monte Carlo application. The experiments were performed on a 960-core Linux cluster with 40 compute nodes, two 1.7 GHz AMD
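The collective communication experiments compare algorithms whose costs can be reasoned about with a simple latency/bandwidth ("alpha-beta") model. The sketch below contrasts a linear and a binomial-tree broadcast under such a model; the constants and function names are illustrative assumptions, not measurements or models from the paper:

```python
import math

# Back-of-the-envelope alpha-beta cost model for two MPI broadcast
# algorithms, the kind of trade-off such collective experiments explore.
ALPHA = 2e-6   # per-message latency in seconds (assumed value)
BETA = 1e-9    # per-byte transfer time in seconds (assumed value)

def bcast_linear(p, nbytes):
    """Root sends the message to each of the other p - 1 ranks in turn."""
    return (p - 1) * (ALPHA + nbytes * BETA)

def bcast_binomial(p, nbytes):
    """Binomial tree: ceil(log2 p) rounds; message holders double each round."""
    return math.ceil(math.log2(p)) * (ALPHA + nbytes * BETA)

# At the paper's scale of 2^21 ranks and a small message, the modeled tree
# broadcast is roughly (p - 1) / log2(p), i.e. about five orders of
# magnitude, faster than the linear one.
```

The point is that at 2^21 ranks, even crude analytic models separate algorithms dramatically; a simulator like xSim refines this with actual execution and a network model.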

Conclusions

This paper summarized a recent effort focused on the design and implementation of a novel HPC hardware/software co-design toolkit. The presented Extreme-scale Simulator (xSim) builds on a unique concept that utilizes a light-weight PDES to run an HPC application in a controlled environment with millions of concurrent execution threads. Using architectural models and virtual timing, application performance can be observed in a simulated extreme-scale HPC system.

The capabilities

Future work

While the presented work is novel and useful, a few limitations remain that will be addressed in the near future. First and foremost, full network contention modeling is planned as an optional part of xSim. The current sender-/receiver-based network contention modeling does not include all network nodes in the routing path; it will be extended to support this as an optional feature (though at some cost to scalability). Another deficiency is the rather simplistic processor model. Current plans

Acknowledgments

This research is sponsored by the Office of Advanced Scientific Computing Research, US Department of Energy (DOE). This manuscript has been authored by UT-Battelle, LLC, under Contract No. DE-AC05-00OR22725 with the DOE. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or


References (15)

  • C. Engelmann et al.

    Super-scalable algorithms for computing on 100,000 processors

  • G. Bosilca et al.

    Recovery patterns for iterative methods in a parallel unstable environment

    SIAM Journal on Scientific Computing (SISC)

    (2007)
  • Z. Chen et al.

    Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

  • H. Ltaief et al.

    Fault tolerant algorithms for heat transfer problems

    Journal of Parallel and Distributed Computing (JPDC)

    (2008)
  • G. Zheng et al.

    BigSim: a parallel simulator for performance prediction of extremely large parallel machines

  • L.V. Kale et al.

    Programming petascale applications with Charm++ and AMPI

  • K.S. Perumalla

    μπ: a highly scalable and transparent system for simulating MPI programs


Cited by (39)

  • Simulation-based optimization and sensibility analysis of MPI applications: Variability matters

    2022, Journal of Parallel and Distributed Computing
    Citation Excerpt:

    The main difficulty resides in capturing and modeling the interplay between the application and the platform while faithfully accounting for their respective complexity. A promising approach recently pioneered in several tools [9,13,19] consists in emulating the application in a controlled way so that a platform simulator governs its execution. Although this approach's scalability is a primary concern that has already received lots of attention, the accuracy of the simulation is even more challenging.

  • ARRC: A random ray neutron transport code for nuclear reactor simulation

    2018, Annals of Nuclear Energy
    Citation Excerpt:

    This extreme memory savings can also improve computational performance by increasing cache efficiency and reducing memory bandwidth requirements. Any method aiming to simulate entire nuclear reactors in high fidelity must be able to map efficiently onto supercomputer architectures, where many separate computers are networked together and working in tight coordination to run a simulation (Attig et al., 2011; Engelmann et al., 2014; Rajovic et al., 2013). Running on a supercomputer is necessary both to reduce the time to solution and to greatly increase the total amount of memory available for the simulation in order to allow a 3D full reactor core to be simulated with a sufficiently high resolution (Sanchez, 2012; Kochunas and Downar, 2013; Romano et al., 2013; Hoogenboom et al., 2013; Felker et al., 2012; Horelik et al., 2014).

  • Memory bottlenecks and memory contention in multi-core Monte Carlo transport codes

    2015, Annals of Nuclear Energy
    Citation Excerpt:

    One might reasonably conclude that 69% or 96% scaling out to 16 cores is adequate speedup. However, next-generation node architectures are likely to require up to thousand-way on-node shared memory parallelism (Dosanjh et al., 2014; Engelmann, 2014; Rajovic et al., 2013; Attig et al., 2011), and thus it is crucial to ascertain the cause of the observed degradation and the implications for greater levels of scalability. Considering nodes with 32, 64, 128, or 1024 shared memory cores and beyond, it cannot be taken for granted that performance will continue to improve.

  • Improving Simulations of Task-Based Applications on Complex NUMA Architectures

    2023, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Christian Engelmann is Task Lead of the System Software Team in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. He holds a Ph.D. and an M.Sc. degree in Computer Science from the University of Reading, UK, and a German Certified Engineer diploma in Computer Systems Engineering from the University of Applied Sciences Berlin. He has 12+ years of experience in research and development for next-generation extreme-scale high-performance computing (HPC) systems. In collaboration with other laboratories and universities, his research aims at computer science challenges for HPC system software, such as dependability, scalability, and portability. Dr. Engelmann’s primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. His secondary expertise is in HPC hardware/software co-design through lightweight simulation of extreme-scale systems with millions of processor cores to study the impact of hardware properties on parallel application performance.
