Towards millions of communicating threads
We explore in this paper the advantages that accrue from avoiding the use of wildcards in MPI. We show that, with this change, one can efficiently support millions of concurrently communicating light-weight threads using send-receive communication.
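As a minimal illustration of what avoiding wildcards means at the API level (this is not the paper's runtime, and the tag value is arbitrary), a fully specified receive names its source and tag explicitly instead of using MPI_ANY_SOURCE / MPI_ANY_TAG:

```c
/* Sketch: a receive bound to one source and tag, versus the wildcard form
 * the paper argues against. Fully specified receives can be matched
 * without consulting a wildcard queue. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Fully specified: source 0, tag 7 -- no wildcards needed. */
        MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* The wildcard form would pass MPI_ANY_SOURCE and MPI_ANY_TAG here. */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```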
Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning
Emerging paradigms like High Performance Data Analytics (HPDA) and Deep Learning (DL) pose at least two new design challenges for existing MPI runtimes. First, these paradigms require efficient support for communicating unusually large messages ...
Generalisation of Recursive Doubling for AllReduce
The performance of AllReduce is crucial at scale. The recursive doubling with pairwise exchange algorithm theoretically achieves O(log2 N) scaling for short messages with N peers, but is limited by improvements in network latency. A multi-way exchange ...
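For reference, a minimal sketch of the classic recursive doubling pattern that the paper generalizes, assuming a power-of-two number of ranks and a single double value per rank (the helper name is ours, not the paper's):

```c
/* Classic recursive doubling for an allreduce (sum): log2(N) rounds,
 * partner = rank XOR distance in each round. Assumes N is a power of two. */
#include <mpi.h>

static void allreduce_recursive_doubling(double *val, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int dist = 1; dist < size; dist <<= 1) {
        int partner = rank ^ dist;           /* peer for this round */
        double recv;
        MPI_Sendrecv(val, 1, MPI_DOUBLE, partner, 0,
                     &recv, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        *val += recv;                        /* combine partial results */
    }
}
```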
Space Performance Tradeoffs in Compressing MPI Group Data Structures
MPI is a popular programming paradigm on parallel machines today. MPI libraries sometimes use O(N) data structures to implement MPI functionality. The IBM Blue Gene/Q machine has 16 GB memory per node. If each node runs 32 MPI processes, only 512 MB is ...
Modeling MPI Communication Performance on SMP Nodes: Is it Time to Retire the Ping Pong Test
The "postal" model of communication [3, 8] T = α + βn, for sending n bytes of data between two processes with latency α and bandwidth 1/β, is perhaps the most commonly used communication performance model in parallel computing. This performance model is ...
Introducing Task-Containers as an Alternative to Runtime-Stacking
- Jean-Baptiste Besnard
- Julien Adam
- Sameer Shende
- Marc Pérache
- Patrick Carribault
- Julien Jaeger
- Allen D. Malony
The advent of many-core architectures poses new challenges to the MPI programming model which has been designed for distributed memory message passing. It is now clear that MPI will have to evolve in order to exploit shared-memory parallelism, either by ...
The MIG Framework: Enabling Transparent Process Migration in Open MPI
This paper introduces the mig framework: an Open MPI extension to transparently support the migration of application processes across different nodes of a distributed High-Performance Computing (HPC) system. The framework provides mechanisms on top of ...
Architecting Malleable MPI Applications for Priority-driven Adaptive Scheduling
Future supercomputers will need to support both traditional HPC applications and Big Data/High Performance Analysis applications seamlessly in a common environment. This motivates traditional job scheduling systems to support malleable jobs along with ...
Infrastructure and API Extensions for Elastic Execution of MPI Applications
Dynamic Processes support was added to MPI in version 2.0 of the standard. This feature of MPI has not been widely used by application developers in part due to the performance cost and limitations of the spawn operation. In this paper, we propose an ...
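For context, a minimal sketch of the MPI-2 dynamic process interface whose cost and limitations motivate the paper; the "./worker" executable name and worker count are hypothetical:

```c
/* The parent spawns extra worker processes at runtime and obtains an
 * intercommunicator to them. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Comm workers;  /* intercommunicator to the spawned processes */
    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, /*maxprocs=*/4,
                   MPI_INFO_NULL, /*root=*/0, MPI_COMM_WORLD,
                   &workers, MPI_ERRCODES_IGNORE);
    /* ... communicate with the new processes over 'workers' ... */
    MPI_Finalize();
    return 0;
}
```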
A Library for Advanced Datatype Programming
We present a library providing functionality beyond the MPI standard for manipulating application data layouts described by MPI derived datatypes. The main contributions are: a) constructors for several new datatypes for describing application-relevant ...
On the Expected and Observed Communication Performance with MPI Derived Datatypes
We examine natural expectations on communication performance using MPI derived datatypes in comparison to the baseline, "raw" performance of communicating simple, noncontiguous data layouts. We show that common MPI libraries sometimes violate these ...
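A typical noncontiguous layout used in such comparisons, sketched with a standard derived datatype (the helper function and matrix size are ours, not the paper's benchmark):

```c
/* Send one column of a row-major N x N matrix: N blocks of 1 double,
 * stride of N doubles between blocks, described by MPI_Type_vector
 * instead of packing the column into a contiguous buffer by hand. */
#include <mpi.h>

#define N 1024

void send_column(double matrix[N][N], int col, int dest, MPI_Comm comm) {
    MPI_Datatype column;
    MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&matrix[0][col], 1, column, dest, 0, comm);
    MPI_Type_free(&column);
}
```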
MPI Sessions: Leveraging Runtime Infrastructure to Increase Scalability of Applications at Exascale
- Daniel Holmes
- Kathryn Mohror
- Ryan E. Grant
- Anthony Skjellum
- Martin Schulz
- Wesley Bland
- Jeffrey M. Squyres
MPI includes all processes in MPI_COMM_WORLD; this is untenable for reasons of scale, resiliency, and overhead. This paper offers a new approach, extending MPI with a new concept called Sessions, which makes two key contributions: a tighter integration ...
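The Sessions idea, as it was later standardized in MPI-4.0, looks roughly like the sketch below; the paper's proposed interface predates standardization and differs in detail, and the string tag here is a made-up example:

```c
/* Build a communicator from a named process set, without touching
 * MPI_COMM_WORLD or calling MPI_Init. */
#include <mpi.h>

int main(void) {
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;

    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "org.example.tag",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);
    /* ... use comm ... */
    MPI_Comm_free(&comm);
    MPI_Group_free(&group);
    MPI_Session_finalize(&session);
    return 0;
}
```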
Distributed Memory Implementation Strategies for the kinetic Monte Carlo Algorithm
This paper presents strategies to parallelize a previously implemented kinetic Monte Carlo (kMC) algorithm. The process under simulation is the precipitation in an aluminum scandium alloy. The selected parallel algorithm is called synchronous parallel ...
How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging Latent Synchronization in MPI Collective Algorithms
Scientific workloads running on current extreme-scale systems routinely generate tremendous volumes of data for postprocessing. This data movement has become a serious issue due to its energy cost and the fact that I/O bandwidths have not kept pace with ...
The Potential of Diffusive Load Balancing at Large Scale
Dynamic load balancing with diffusive methods is known to provide minimal load transfer and to require communication only between neighboring nodes. These are very attractive properties for highly parallel systems. We compare diffusive methods with state-of-...
Optimization of Message Passing Services on POWER8 InfiniBand Clusters
We present scalability and performance enhancements to MPI libraries on POWER8 InfiniBand clusters. We explore optimizations in the Parallel Active Messaging Interface (PAMI) libraries. We bypass IB VERBS via low-level inline calls, resulting in low ...
Using InfiniBand Hardware Gather-Scatter Capabilities to Optimize MPI All-to-All
The MPI all-to-all is a data-intensive, high-cost collective used by many scientific High Performance Computing applications. Optimizations for small data exchange use aggregation techniques, such as the Bruck algorithm, to minimize ...
Revisiting RDMA Buffer Registration in the Context of Lightweight Multi-kernels
Lightweight multi-kernel architectures, where HPC specialized lightweight kernels (LWKs) run side-by-side with Linux on compute nodes, have received a great deal of attention recently due to their potential for addressing many of the challenges system ...
An Evaluation of the One-Sided Performance in Open MPI
Open MPI provides an implementation of the MPI-3.1 standard supporting native communication over a wide range of high-performance network interconnects. As of version 2.0.0 Open MPI provides two implementations of the MPI-3.1 Remote Memory Access (RMA) ...
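A minimal example of the MPI-3 one-sided traffic such an evaluation exercises, using fence synchronization (the ring pattern and payloads are illustrative only):

```c
/* Expose a one-element window, then MPI_Put into the next rank between
 * two fences; the second fence completes all pending puts. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    double local = 0.0, payload;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    payload = (double)rank;

    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_fence(0, win);
    MPI_Put(&payload, 1, MPI_DOUBLE, (rank + 1) % size,
            /*target_disp=*/0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```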
Runtime Correctness Analysis of MPI-3 Nonblocking Collectives
The Message Passing Interface (MPI) includes nonblocking collective operations that support additional overlap between computation and communication. These new operations enable complex data movement between large numbers of processes. However, their ...
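A small sketch of the overlap these operations enable, and of the usage rule a runtime correctness tool must check, namely that neither buffer is touched between the start call and its completion:

```c
/* Start a nonblocking allreduce, do independent work, then complete it.
 * Reading or writing either buffer before MPI_Wait would be an error of
 * the kind such analysis detects. */
#include <mpi.h>

void overlapped_sum(const double *local, double *global, int n,
                    MPI_Comm comm) {
    MPI_Request req;
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);
    /* ... independent computation that touches neither buffer ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```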
CAF Events Implementation Using MPI-3 Capabilities
MPI-3.1 is currently the most recent version of the MPI standard. It adds important extensions to MPI-2, including simplified semantics for the one-sided communication routines and a new tool interface capable of exposing performance data of the MPI ...
Allowing MPI tools builders to forget about Fortran
C tool writers are forced to deal with a number of Fortran and C interoperability issues when intercepting MPI routines and completing them with PMPI. The C based tool has to intercept the Fortran MPI routines and marshal arguments between C and Fortran,...
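For context, the classic C-side PMPI interception pattern the paper builds on; Fortran callers reach separately mangled entry points (e.g. a symbol like mpi_send_), which is exactly what a C-only wrapper such as this one misses without additional glue:

```c
/* The tool redefines MPI_Send, does its bookkeeping, and forwards to the
 * profiling entry point PMPI_Send. */
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    fprintf(stderr, "intercepted MPI_Send to rank %d\n", dest);
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```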
FFT data distribution in plane-waves DFT codes. A case study from Quantum ESPRESSO
Density Functional Theory calculations with plane waves and pseudopotentials represent one of the most important simulation techniques in high performance computing. Together with parallel linear algebra (ZGEMM and matrix diagonalization), the most ...
Optimizing PARSEC for Knights Landing
PARSEC is a massively parallel Density-Functional-Theory (DFT) code. Within the modernization effort towards the new Intel Knights Landing platform, we adapted the main computational kernel, represented as high-order finite-difference stencils, to use ...
Effective Calculation with Halo communication using Halo Functions
Halo communication degrades parallel scalability. To overcome this issue, we introduced a "Halo thread" into our simulation code; however, this did not fundamentally solve the problem under strong scaling. In this study, we have ...
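For orientation, a generic one-dimensional halo exchange sketch (plain MPI_Sendrecv ghost-cell exchange, not the paper's halo-function mechanism):

```c
/* Each rank swaps one ghost cell with its left and right neighbors.
 * u[0] and u[n+1] are the ghost cells; u[1..n] is the interior. */
#include <mpi.h>

void exchange_halos(double *u, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Send rightmost interior cell right, receive left ghost from the left. */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
    /* Send leftmost interior cell left, receive right ghost from the right. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 1,
                 &u[n + 1], 1, MPI_DOUBLE, right, 1, comm, MPI_STATUS_IGNORE);
}
```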
MPI usage at NERSC: Present and Future
In this poster, we describe how MPI is used at the National Energy Research Scientific Computing Center (NERSC). NERSC is the production high-performance computing center for the US Department of Energy, with more than 5000 users and 800 distinct ...
Performance comparison of Eulerian kinetic Vlasov code between flat-MPI parallelism and hybrid parallelism on Fujitsu FX100 supercomputer
The present study deals with a Vlasov simulation code, which solves the first-principles kinetic equation for space plasma, the Vlasov equation. A five-dimensional Vlasov code with two spatial dimensions and three velocity ...
Acceptance Rates
| Year | Submitted | Accepted | Rate |
|---|---|---|---|
| EuroMPI '19 | 26 | 13 | 50% |
| EuroMPI '17 | 37 | 17 | 46% |
| EuroMPI '15 | 29 | 14 | 48% |
| EuroMPI '13 | 47 | 22 | 47% |
| Overall | 139 | 66 | 47% |