Network interface active messages for low overhead communication on SMP PC clusters

https://doi.org/10.1016/S0167-739X(99)00137-5

Abstract

A communication layer, NICAM, is designed to reduce overhead and latency by directly utilizing the micro-processor on the network interface. NICAM runs active message handlers on the network interface for flexibility in programming. Running message handlers on the network interface reduces overhead by freeing the main processors from polling for incoming messages, and reduces synchronization latency by avoiding costly interactions between the main processors and the network interface. Moreover, this makes it possible to hide barrier latency completely in data-parallel programs, because barriers can be performed in the background of the main processors.

Introduction

Symmetric multi-processor PCs (SMP PCs) have recently attracted widespread attention, and clusters of SMP PCs with fast networks have emerged as important platforms for high performance computing. However, the shared memory bus can limit computation performance: an SMP PC quickly reveals a bottleneck when multiple processors access the bus simultaneously.

Even worse, a network interface for clustering further burdens the bus, because it is implemented as an adaptor card and shares the bus both for polling incoming messages and for transferring data between memory and the communication channels. Reducing communication overhead is therefore important, because the overhead, that is, the involvement of the main processors in communication, wastes bus bandwidth as well as the processing power of the processors. Moreover, the common technique of overlapping computation and communication tends to make the communication grain size finer, which increases the total cost of overhead.

Thus, we designed a communication layer, NICAM, which utilizes a micro-processor on the network interface in order to reduce communication overhead. As researchers have shown [1], overhead reduction directly affects the utilization of processing power, while latency reduction in data transfer matters less for performance. This means that overhead reduction by the network interface pays off even though it incurs larger latency, a consequence of the relatively slow micro-processor on the network interface.

In addition, NICAM can reduce the latency of synchronization primitives. While latency is not the first issue in data transfer, it is the only issue in synchronization. Direct handling of messages by the network interface not only frees the processors from polling overhead, but also eliminates the costly interaction between the processors and the network interface. Synchronization primitives such as barriers therefore become faster with on-board handling of incoming messages.

This paper reports on the design of NICAM and its basic performance. We first present our platform PC cluster and its performance characteristics in Section 2, and then show the NICAM primitives with their implementation and basic performance in Section 3. Then, we present a latency hiding technique of barriers in data-parallel programs in Section 4, and present two sets of experimental results in Section 5. We briefly discuss related work in Section 6 and then conclude in Section 7.

Section snippets

SMP PC cluster platform

Our research platform, COMPaS [2], is a PC-based SMP cluster consisting of eight server-type PCs (Fig. 1). Each node contains four Pentium Pro processors (200 MHz, 512 KB L2 cache, 450GX chip-set, 256 MB main memory) and a Myrinet network interface card. The nodes are connected by a single Myrinet switch. The operating system is Solaris 2.5. This section presents the performance characteristics of a node, which guided our design of the communication layer.

Table 1 shows the memory bus bandwidth for read,

Communication layer design

NICAM provides primitives for remote memory operations and synchronization. We based the data transfer primitives on remote memory operations because they are preferable to message passing with respect to overhead: message passing suffers from flow-control and buffer-management tasks for handling incoming messages, and sometimes requires copying messages, which sacrifices bus bandwidth. In addition, message passing may need mutual exclusion to coordinate processors in an SMP node. Using

Array class library in C++

The main target application of NICAM is a scientific computing library in C++ [8]. The array class library may be considered as a variant of the BSP (bulk synchronous parallel) model [9]. However, while synchronization points are explicitly specified as super-steps in programs in the BSP model, they are implicit in an array class and it is necessary to synchronize at each expression or statement.

Some researchers have been successful in avoiding explicit barriers at the end of the super-steps in

Overlapping communication and computation

Overlapping computation and communication is a common technique for exploiting hybrid distributed/shared-memory programming on SMP clusters [2]. The communication overhead must be small enough for overlapping to be effective. The overlapping effect is presented for an explicit Laplace equation solver using the Jacobi method. In each iteration, a new array u′ is computed from the old array u by averaging the four neighbors in a two-dimensional space:

u′_{i,j} = (1/4)(u_{i−1,j} + u_{i+1,j} + u_{i,j−1} + u_{i,j+1}).

The

Related work

Schauser et al. [12] reported on experiments running active messages on a network interface. Also, Krishnamurthy et al. [13] reported on running the active message handlers for Split-C primitives on various network interfaces. They reported that low latency is achieved on platforms such as the Paragon and Berkeley NOW. However, NICAM further exploits the benefits of utilizing the network interface.

The Cray T3E [14] provides primitives such as remote memory operations, DMA engines, and

Conclusion

NICAM makes use of the micro-processor on the network interface to reduce the overhead of data transfer, and also to reduce the latency of synchronization. In addition, background execution of barriers can hide their latency completely. NICAM employs an active messages framework for flexibility in programming the network interface, which allows easy integration of new primitives, such as ones that help rewrite message passing as remote memory operations.

While NICAM is based on remote memory

Motohiko Matsuda received the B.S. in science from Kyoto University in 1988. He joined Sumitomo Metal Industries, Ltd., in 1988. He was a senior researcher at Real World Computing Partnership from 1995 to 1999. Currently, he is a researcher at Sumitomo Metal Industries, Ltd. He received Dr. Eng. from Kyoto University in 1999. His research interests include object-oriented libraries for high performance computing systems.

References (16)

  • R.P. Martin, A.M. Vahdat, D.E. Culler, T.E. Anderson, Effects of communication latency overhead and bandwidth in a...
  • Y. Tanaka, M. Matsuda, K. Kubota, M. Sato, COMPaS: a Pentium Pro PC-based SMP cluster, in: R. Buyya (Ed.), High...
  • N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, S. Wen-King, Myrinet—a...
  • T. von Eicken, D.E. Culler, S.C. Goldstein, K.E. Schauser, Active messages: a mechanism for integrated communication...
  • H. Tezuka, A. Hori, Y. Ishikawa, M. Sato, PM: An operating system coordinated high performance communication library,...
  • M. Gupta, E. Schonberg, Static analysis to reduce synchronization costs in data-parallel programs, Symposium on...
  • R. Gupta, The fuzzy barrier: a mechanism for high speed synchronization of processors, Proceedings of the International...
  • M. Matsuda, M. Sato, Y. Ishikawa, OBP Lib: an object-oriented parallel library and its preliminary performance, RWCP...

Yoshio Tanaka received his B.E. in 1987, M.E. in 1989 and Ph.D. (Eng.) in 1995 all in mathematics from Keio University. He joined Real World Computing Partnership in 1996, and now is a senior researcher in Parallel and Distributed System Performance Laboratory. His current research interests include performance evaluation of parallel systems and high performance computing on cluster systems.

Kazuto Kubota received the Dr. Eng. from Waseda University in 1993. He was a research member at Real World Computing Partnership from 1995 to 1998. Currently, he is a research staff member at the Research & Development Center at Toshiba Corp. He is interested in cluster computer systems, programming tools, performance analysis, and data mining.

Mitsuhisa Sato received the M.S. and the Ph.D. in information science from the University of Tokyo in 1984 and 1990, respectively. He was a senior researcher at Electrotechnical Laboratory, working on the EM-X multiprocessor project, from 1991 to 1996. Currently, he is the Chief of Parallel and Distributed System Performance Laboratory in Real World Computing Partnership. His research interests include computer architecture, compilers and performance evaluation for parallel computer systems, and global computing. Dr. Sato is a member of IPSJ (the Information Processing Society of Japan) and JSIAM (the Japan Society for Industrial and Applied Mathematics).
