Architectural support for efficient message passing on shared memory multi-cores

https://doi.org/10.1016/j.jpdc.2016.02.005

Highlights

  • We present hardware support to reduce overheads incurred by message passing (MP).

  • We modified an MPI library to add support for our ISA extensions.

  • Our design eliminates 60%–92% of cache accesses during data transfer, depending on the configuration.

  • Adding simple MP support to shared memory multicores improves energy efficiency.

Abstract

Thanks to programming approaches like actor-based models, message passing is regaining popularity outside large-scale scientific computing as a way to build scalable distributed applications on multi-core processors. Unfortunately, the mismatch between message passing models and today's shared-memory hardware provided by commercial vendors results in suboptimal performance and wasted energy. This paper presents a set of architectural extensions to reduce the overheads incurred by message passing workloads running on shared memory multi-core architectures. It describes the instruction set extensions and their hardware implementation. To facilitate programmability, the proposed extensions are used by a message passing library, allowing programs to take advantage of them transparently. As a proof of concept, we use modified MPI libraries and unmodified MPI programs to evaluate the proposal. Experimental results show that a best-effort design can eliminate over 60% of cache accesses caused by message data transmission and reduce the cycles spent on data transfer by 75%, while the addition of a simple coprocessor can completely off-load data movement from the CPU, avoiding up to 92% of cache accesses and reducing network traffic by 12% on average. The design achieves an improvement of 11%–12% in the energy-delay product of the on-chip caches.

Introduction

Message passing (MP) has been around for a long time and is the most common programming model for high-performance computing (HPC) applications that run on distributed memory systems with a large number of nodes. More recently, actor-based programming languages and libraries, like Erlang [1] or Akka/Scala [10], are making the MP model gain popularity in domains beyond its traditional realm of HPC, as a promising approach to building parallel applications that exploit the increasing number of cores available in current and future chip multiprocessors. Regardless of this growing popularity of MP, shared memory multi-cores are likely to remain the design of choice for chip manufacturers, given their wide versatility (ranging from HPC environments to data centres to hand-held devices) and the vast experience that industry has accumulated in this domain, amongst other reasons.

These emerging workloads that communicate through messages are nowadays executed on commodity shared-memory multi-cores at a considerable loss of efficiency, caused by the mismatch between hardware and programming model. This mismatch leads to excessive copies of the message payload and to the exchange of unnecessary control messages. The hardware coherence mechanisms that provide a consistent view of shared memory are to blame for these overheads. Hence, there is an increasing demand for hardware support that enables more efficient execution of MP workloads on commodity multi-core processors, without burdening the execution of native shared memory applications.

This work attempts to fill that gap by proposing and evaluating a set of hardware extensions to shared memory multi-core processors aimed at optimizing the execution of MP workloads. These mechanisms do not interfere with the normal handling of shared data, and they are only invoked when the programmer explicitly uses the new interface, which can easily be encapsulated inside a library so that applications benefit from the improvement in communication performance in a completely transparent manner. Though our work focuses on MPI (the de facto standard model for message passing), the proposed architectural support can be used to enhance communication via shared memory in any other implementation of the MP model. Using MPI allows us to evaluate our proposals with existing libraries (MPICH [5] and HMPI [6]) and well-known benchmarks (NAS [2] and Mantevo [11]).

MPI has traditionally been used in HPC systems with a large number of distributed nodes, where the cost of sending messages is high. For this reason, programmers often try to minimize communication and tend to exchange a small number of large messages. In shared memory multi-cores, the trade-offs are different and MP can be used at a finer granularity when it better suits the nature of the algorithm. On the one hand, very small messages are embedded in the same cache line as the message meta-data (source, tag, etc.), thus keeping both latency and bandwidth overheads to a minimum. On the other hand, large messages can be handled by kernel extensions such as KNEM [8] and LiMIC [13], which provide single-copy message passing across different address spaces. The Nemesis communication subsystem of MPICH [4] also offers a kernel-assisted, single-copy model for communicating large messages within a node, using Linux system calls and KNEM. Unfortunately, the fixed cost of a system call makes kernel-based approaches unsuitable for small to medium-sized messages, which are typically exchanged using a two-copy method. Most MPI implementations, including MPICH, are process-based (i.e. each rank is a separate process) and thus must deal with the process separation enforced by the operating system: the message is first copied from the sender's private memory into a buffer in shared memory, and from there to its destination in the receiver's memory. This wastes precious cache resources and generates unnecessary coherence traffic. Hybrid MPI [6] (HMPI) attempts to provide single-copy transfers for messages of any size by combining an underlying process-based MPI implementation (such as MPICH) with a top layer that intercepts calls to malloc and implements a global heap shared by all ranks. Unfortunately, its applicability is limited to the cases where both send and receive buffers are allocated in that heap. Even when single-copy is possible in HMPI, the overheads introduced by the underlying cache coherence substrate hinder energy efficiency: when copying from send buffer to receive buffer, one of the buffers is fetched from the private cache at the other end and replicated into the local cache, wasting on-chip cache real estate and generating additional network traffic as a result of evictions, subsequent invalidations, etc.
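To make these copy schemes concrete, the following C sketch (our own illustration, not code from MPICH or HMPI; buffer and function names are assumptions) contrasts the two-copy path through a shared-memory staging buffer with the single-copy path that becomes possible when the receiver can address the sender's buffer directly, as in HMPI's shared global heap.

    #include <string.h>
    #include <stddef.h>

    /* Two-copy intra-node transfer: the payload crosses the cache
     * hierarchy twice and the staging buffer pollutes both caches. */
    void two_copy_send(char *shm_staging, const char *send_buf, size_t len) {
        memcpy(shm_staging, send_buf, len);   /* copy 1: sender buffer -> shared staging buffer */
    }
    void two_copy_recv(char *recv_buf, const char *shm_staging, size_t len) {
        memcpy(recv_buf, shm_staging, len);   /* copy 2: shared staging buffer -> receiver buffer */
    }

    /* Single-copy transfer (HMPI-style): only one memcpy, but the copy
     * still drags the remote buffer into the local cache and generates
     * coherence traffic, which is the overhead DiMP targets. */
    void single_copy_recv(char *recv_buf, const char *send_buf, size_t len) {
        memcpy(recv_buf, send_buf, len);
    }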

In this work, we propose a general hardware mechanism, Direct Message Passing (DiMP), that uses direct cache-level messages to enable single-copy message transfers from private buffer to private buffer. The proposed architectural support not only allows process-based MPI libraries like MPICH to bypass intermediate buffers, but also avoids data replication in a rank's private cache. In this way, DiMP addresses an inherent overhead of using memcpy to carry out message-passing communication on a shared-memory architecture. The main contributions of this paper are: (1) architectural extensions that can be used to improve the performance and energy efficiency of message-passing applications running on shared-memory multi-cores; (2) a possible hardware implementation for the aforementioned extensions that is simple enough to be energy efficient; (3) a proof of concept consisting of two modified MPI libraries that shows how the extensions can be used transparently by the programmer; and (4) an evaluation of the proposal.

A first approximation to the advantages of architectural extensions for MP on shared-memory multi-cores was presented in [22]. Here, we extend that work with the following contributions: (1) We have introduced DiMP support into a second library (HMPI) as another use case for our proposed architectural support; Section 5.2 describes the key changes to this state-of-the-art library. (2) We have expanded the analysis presented in Section 7, which compares both modified (i.e. DiMP-ready) and unmodified versions of MPICH and HMPI, including a new benchmark that suits HMPI's requirements. (3) The hardware implementation of DiMP (Section 3) and the software architecture of the modified MPICH library (Section 5) are described more precisely, with additional comments and explanatory figures that facilitate the understanding of their operation and of the software changes needed to make use of the ISA extensions. (4) The design has been comprehensively evaluated in terms of energy efficiency, significantly extending the performance-only analysis presented in prior work.

Section snippets

Background and related work

The message passing (MP) model has been used for decades to build scalable applications for systems with a large number of nodes. However, this programming model alone cannot achieve its maximum efficiency in systems where each node is in turn a shared-memory multiprocessor and runs several processes (ranks, in MPI jargon) of the distributed application. In this heterogeneous scenario, programmers have resorted to hybrid programming models such as MPI + OpenMP  [20], combining both shared
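As a point of reference for this hybrid style, a minimal MPI + OpenMP program in C (our own illustration, not taken from the paper) combines one rank per node with threads that share memory inside the rank:

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        /* Request thread support, since OpenMP threads coexist with MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* Shared-memory parallelism within the rank; message passing between ranks. */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }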

Direct cache-level message passing (basic DiMP)

In Direct Message Passing, or DiMP, the sender CPU requests to open a channel to the receiver, including matching metadata in the request. The message is matched to the receive operation, opening the channel. Then the sender CPU copies from the send buffer in its private cache directly into the receive buffer in the receiver’s private cache. The DiMP implementation includes a new hardware unit, the Message Passing Unit (MPU), which handles both incoming and outgoing channel connections. The MPU
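To give a flavour of how a library might drive this hardware, the sketch below wraps the new instructions in C intrinsics. Only dimp_recv_post and dimp_send_line are named in the text; their signatures here, and the dimp_channel_open wrapper, are our assumptions for illustration, not the actual ISA definition.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical intrinsics wrapping the DiMP ISA extensions; the real
     * encodings and operands are defined by the hardware proposal. */
    int  dimp_channel_open(int dst_rank, uint64_t tag);            /* sender: request a channel, carrying matching metadata */
    void dimp_recv_post(void *recv_buf, size_t len, uint64_t tag);  /* receiver: expose the destination buffer to the MPU */
    void dimp_send_line(int channel, const void *line);             /* sender: push one cache line into the receiver's cache */

    enum { CACHE_LINE = 64 };

    /* Library-internal send path: stream the payload line by line from
     * the sender's private cache directly into the receiver's private
     * cache, with no intermediate buffer. Assumes the buffer is padded
     * to a multiple of the cache line size. */
    void dimp_copy_out(int channel, const char *send_buf, size_t len) {
        for (size_t off = 0; off < len; off += CACHE_LINE)
            dimp_send_line(channel, send_buf + off);
    }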

Full DiMP: coprocessor-assisted message passing

The design presented in Section 3 provides “best-effort” single-copy delivery, relying on the relative ordering of the send and receive instructions. Alternatively, it would be possible to always enforce single-copy delivery by stalling the sender CPU until dimp_recv_post is executed. However, it seems unreasonable to stall the CPU for an unknown number of cycles for this reason. This is one of the reasons why we extend our basic DiMP design with a communication coprocessor.

The advantages of using a coprocessor are
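Conceptually, off-loading the transfer resembles posting a descriptor to a DMA-style engine and letting the CPU continue; the structure and field names in this sketch are our own illustration rather than the paper's hardware interface.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative message descriptor handed off to the communication
     * coprocessor; field names are assumptions for this sketch. */
    struct dimp_descriptor {
        const void *send_buf;   /* source buffer in the sender's address space */
        size_t      length;     /* payload size in bytes */
        int         dst_rank;   /* destination rank */
        uint64_t    tag;        /* matching metadata */
    };

    /* Hypothetical enqueue: the CPU fills a queue slot (in a real design,
     * likely an uncached or memory-mapped region) and returns immediately.
     * The coprocessor performs the line-by-line transfer once the matching
     * receive has been posted, so the CPU never stalls on dimp_recv_post. */
    void dimp_post_send(volatile struct dimp_descriptor *slot,
                        const struct dimp_descriptor *desc) {
        slot->send_buf = desc->send_buf;
        slot->length   = desc->length;
        slot->dst_rank = desc->dst_rank;
        slot->tag      = desc->tag;
    }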

Use cases: adding DiMP support to MPICH and HMPI

We consider that in any realistic scenario, application programmers will not deal directly with the MP support provided by the hardware; instead, it will be used transparently. In particular, programmers will use a programming model based on MP, and the library that implements it will use our proposed MP support whenever suitable. We have modified two state-of-the-art MPI libraries, MPICH and HMPI, to use our instructions. Thus, MPI applications can benefit from DiMP without the need
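As a concrete picture of this transparency, the dispatch inside a library-level send routine could look like the following hypothetical sketch; the function names, and the idea of a best-effort attempt with a fallback to the library's existing shared-memory path, are our assumptions rather than MPICH or HMPI internals.

    #include <stddef.h>

    /* Assumed library-internal entry points (names are hypothetical). */
    int shm_eager_send(int dst, const void *buf, size_t len, int tag);  /* existing two-copy shared-memory path */
    int dimp_try_send(int dst, const void *buf, size_t len, int tag);   /* DiMP path; non-zero if it cannot be used */

    /* Called from the MPI_Send implementation: applications keep calling
     * MPI as usual, and the library picks the DiMP path when it applies. */
    int library_send(int dst, const void *buf, size_t len, int tag) {
        if (dimp_try_send(dst, buf, len, tag) == 0)
            return 0;                               /* single-copy DiMP transfer succeeded */
        return shm_eager_send(dst, buf, len, tag);  /* transparent fallback */
    }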

Simulation environment and methodology

Simulation environment. We evaluate our architectural extensions for direct cache-level MP using the GEM5 simulator [3] in full-system mode. GEM5 provides functional simulation of the 64-bit x86 ISA and boots an unmodified Linux kernel. We use the Ruby detailed timing model for the memory subsystem, combined with the timing simple processor model. A distributed directory coherence protocol on a mesh-based network-on-chip is simulated. Each node in the mesh corresponds to a processing core with

Analysis of communication performance and efficiency

We begin our evaluation with a quantitative comparison of the different DiMP configurations, in both the MPICH and HMPI libraries. For this analysis, we use two workloads. On the one hand, in order to focus on the communication performance of all configurations, we use a simple ping-pong MPI program extracted from the OSU micro-benchmarks. On the other hand, we also selected the miniMD benchmark (2 ranks), as it is an ideal candidate to demonstrate the
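For reference, the ping-pong pattern reduces to the following simplified, standalone MPI program (a sketch in the spirit of the OSU latency test, not the actual benchmark code; message size and iteration count are arbitrary):

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define MSG_SIZE 1024   /* bytes per message; the real benchmark sweeps many sizes */
    #define ITERS    1000

    int main(int argc, char **argv) {
        int rank;
        char buf[MSG_SIZE];
        memset(buf, 0, sizeof buf);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("average round-trip latency: %f us\n",
                   (MPI_Wtime() - t0) * 1e6 / ITERS);

        MPI_Finalize();
        return 0;
    }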

Performance and energy evaluation

In this section we present the results obtained from running the benchmarks, for the baseline and the three DiMP configurations: Fallback-DiMP, Basic-DiMP and Full-DiMP. Fig. 8 shows the total number of L1D and L2 cache accesses performed by all ranks during data transfer, normalized to the baseline. All data array accesses except load/store hits (L1D_Data_Hit) entail a full cache line, including those generated by dimp_send_line instructions: L1D_DataRd_Send on the sender and L1D_DataWr_Recv

Conclusions

In this paper we have described a set of ISA extensions and their associated hardware support, and shown how they improve the efficiency of message passing workloads running on a shared memory multi-core. Our solution combines hardware and software support to enable efficient communication amongst processes that run in different address spaces. We have shown how the proposed extensions can be incorporated into two real-world MPI libraries, improving communication performance and reducing energy

Acknowledgment

This project and the research leading to these results have received funding from the European Community's Seventh Framework Programme [FP7/2007–2013] under grant agreement number 318693.

Rubén Titos-Gil received the M.S. and Ph.D. degrees in Computer Science from the University of Murcia, Spain, in 2006 and 2011, respectively. Between 2012 and 2014, he held a postdoc position at Chalmers University of Technology, Sweden. In April 2014, he joined the Barcelona Supercomputing Center (BSC) and began working on the ParaDIME project. His research interests lie in the fields of parallel computer architecture and programming models, including synchronization, coherence protocols and memory hierarchy.

References (22)

  • M.A. Heroux et al., Improving performance via mini-applications, Tech. Rep. SAND2009-5574, 2009.

Oscar Palomar received his degree in Computer Science in 2002 from the Universitat Politécnica de Catalunya and his Ph.D. in Computer Architecture in 2011 from the same university. Since 2010 he has been working at the Barcelona Supercomputing Center (BSC) in the Computer Architecture for Parallel Paradigms group. His research interests involve low-power vector architectures and energy minimization.

Osman Unsal received the B.S., M.S. and Ph.D. degrees in Electrical and Computer Engineering from Istanbul Technical University (Turkey), Brown University (USA) and the University of Massachusetts, Amherst (USA), respectively. Together with Dr. Adrián Cristal, he co-manages the Computer Architecture for Parallel Paradigms research group at BSC. His current research interests include many-core computer architecture, reliability, low-power computing, programming models and transactional memory.

Adrián Cristal is co-manager of the Computer Architecture for Parallel Paradigms research group at BSC. His interests include high-performance microarchitecture, multi- and many-core chip multiprocessors, transactional memory, and programming models. He received a Ph.D. from the Computer Architecture Department at the Polytechnic University of Catalonia (UPC), Spain, and he has a B.S. and an M.S. in computer science from the University of Buenos Aires, Argentina.

1. O. Palomar is also affiliated with UPC.
2. A. Cristal is also affiliated with UPC and IIIA-CSIC.
