Directed Point: a communication subsystem for commodity supercomputing with Gigabit Ethernet☆
Introduction
Commodity supercomputing is one of the targets in building clusters. As clusters are a form of message-passing machine, their performance depends largely on the performance of the interconnection network and the communication software. Recently, microprocessor clock speeds have approached the gigahertz range. This explosive growth in processor performance has greatly improved computation; at the same time, it stresses the need for a higher-speed communication subsystem that provides low-overhead access to the interconnect.
Gigabit Ethernet is widely available and is becoming the commodity LAN technology of the next generation [3]. It appears to be an ideal match for the increasing demands placed on today's high-end servers, which operate at gigahertz clock rates. However, installing a Gigabit Ethernet adapter in an existing server generally will not yield a 10-fold performance boost over a Fast Ethernet adapter. To achieve high performance, the communication software must minimize protocol-processing overheads and resource consumption.
With the introduction of low-latency messaging systems such as Active Messages (AM) [24], Fast Messages (FM) [16], BIP [17], PM [18], U-Net [25], and GAMMA [6], the protocol-processing overheads incurred in communication have been significantly reduced. Some of these messaging systems achieve low latency by pinning down a large area of memory as send or receive buffers, which avoids the delay caused by the virtual memory system in mapping virtual addresses to physical addresses during the messaging stages. Such an approach trades memory space for shorter communication latency. However, without a good flow control (FC) mechanism or an efficient higher-level protocol, this type of messaging system is usually not scalable: because of its inefficient memory utilization, it may perform poorly when a large number of concurrent communication channels must be established during program execution. Other messaging systems adopt a user-level approach, which moves data from user space to the network adapter without context switches or additional memory copies. Such communication software may achieve shorter latency; however, memory copies are sometimes inevitable to maintain message integrity when the messaging system is extended into a higher communication layer with more functions, such as reliable delivery. Moreover, to avoid violating the OS protection mechanisms, these user-level solutions are usually restricted to a single process using the communication system on each host machine.
Besides performance, programmability is an essential goal in the design of a communication subsystem. Adequate programmability means that a programmer's parallel algorithm can be easily translated into parallel code through the provided API. This requires a communication abstraction model that can depict the various inter-process communication patterns exhibited during program execution, together with a powerful yet easy-to-learn API for translating such patterns into program code. Many existing low-latency communication packages provide good performance but neglect the need for a simple, user-friendly API. Most of them define their own programming interfaces with complex data structures and syntax, making them difficult to use. They usually lack a good abstraction model for the high-level description of the communication algorithm, or an API that matches the abstraction model and is easy to learn. For example, neither AM nor FM provides a standard set of operations like those of the widely accepted MPI [14]: a message has to be received within a handler routine, which is specified by the send function. Programs are thus more prone to errors, especially those with many communication partners and multiple ways of handling messages.
In this paper, we present Directed Point (DP), a communication system designed for both high performance and good programmability. The DP abstraction model depicts the communication channels built among a group of communicating processes. It supports not only point-to-point communication but also various types of group operations. In the abstraction model, all inter-process communication patterns are described by a directed graph; for example, a directed edge connecting two endpoints represents a uni-directional communication channel from a source process to a destination process. The application programming interface (API) of DP combines features from BSD Sockets and MPI to facilitate peer-to-peer communication in a cluster. The DP API preserves the syntax and semantics of the traditional UNIX I/O interface by associating each DP endpoint with a file descriptor. All messaging operations go through this file descriptor, so a process can access the communication system via traditional I/O system calls.
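As an illustration of this fd-centric style only: DP's own endpoint-creation calls are not reproduced here, so the sketch below uses a standard UNIX socketpair() as a stand-in for a DP channel between two endpoints. The point it demonstrates is the one made above, namely that once an endpoint is bound to a file descriptor, sending and receiving reduce to ordinary write() and read() system calls.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 when a message written on one endpoint's descriptor is
 * read back intact on the peer's descriptor.  The socketpair() is a
 * stand-in for a DP channel, not DP's actual endpoint interface. */
static int dp_style_roundtrip(void)
{
    int fd[2];
    char buf[16] = {0};

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fd) != 0)
        return -1;

    /* Sending is just write() on the endpoint's file descriptor... */
    if (write(fd[0], "hello", 6) != 6)
        return -1;

    /* ...and receiving is just read() on the peer's descriptor. */
    if (read(fd[1], buf, sizeof buf) != 6)
        return -1;

    close(fd[0]);
    close(fd[1]);
    return strcmp(buf, "hello") == 0 ? 0 : -1;
}
```

Because the endpoint is an ordinary descriptor, existing UNIX idioms (select()-style multiplexing, inheriting descriptors across fork()) carry over to DP programs without new concepts.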
To achieve high performance while remaining scalable across hardware generations, DP is designed and implemented around a realistic yet flexible cost model. The cost model captures the overheads incurred in the host machine and the network hardware, and serves as a tool for both communication algorithm design and performance analysis. We view data communication over the network as an extension of the memory hierarchy: a communication event is abstracted as a series of local and remote data movements, and every parameter is expressed by an associated cost function. This model helps the design of DP adapt to the varying speed gaps among processor, memory, I/O bus, and network technologies, and thus achieves good generation scalability.
Based on the cost model, we propose several optimization techniques, namely the directed message (DM), the token buffer pool (TBP), and the light-weight messaging call. DP improves communication performance by reducing protocol complexity through DM, by eliminating intermediate memory copies between protocol layers through TBP, and by reducing context-switching and scheduling overhead through light-weight messaging calls. DP allocates one TBP for each DP endpoint, so no common dedicated global buffer is needed for storing incoming messages in kernel or user space. When a process must maintain a large number of simultaneous connections, or when multiple parallel programs run concurrently, this separate control of receive buffers avoids locking overhead. Moreover, because the message buffers are mapped into both kernel space and user space, the memory of the host machine is utilized efficiently and unnecessary memory copies are eliminated.
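A minimal sketch of the per-endpoint pool idea follows. The names (tbp_*, TBP_SLOTS) are illustrative, not DP's actual interface, and a user-level sketch cannot show the kernel/user dual mapping; what it does show is how a private free-token list hands out fixed-size buffers without a global allocator or lock, and where flow control hooks in when the pool runs dry.

```c
#include <assert.h>

#define TBP_SLOTS 8      /* buffers in one endpoint's private pool */
#define TBP_BUFSZ 2048   /* fixed message buffer size */

/* One token buffer pool, owned by a single endpoint.  No other
 * endpoint touches it, so no lock is needed. */
struct tbp {
    char buf[TBP_SLOTS][TBP_BUFSZ];
    int  free_tok[TBP_SLOTS];  /* stack of free slot indices ("tokens") */
    int  ntok;                 /* tokens currently available */
};

static void tbp_init(struct tbp *p)
{
    p->ntok = TBP_SLOTS;
    for (int i = 0; i < TBP_SLOTS; i++)
        p->free_tok[i] = i;
}

/* Acquire a token: returns a buffer slot index, or -1 when the pool
 * is exhausted -- the point at which flow control must intervene. */
static int tbp_get(struct tbp *p)
{
    return p->ntok > 0 ? p->free_tok[--p->ntok] : -1;
}

/* Release a token once the message in its slot has been consumed. */
static void tbp_put(struct tbp *p, int tok)
{
    p->free_tok[p->ntok++] = tok;
}
```

Because each pool is private to its endpoint, exhausting one endpoint's buffers throttles only that channel; other endpoints and other programs on the same host are unaffected.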
We have implemented DP for various networks, including Intel EEPro Fast Ethernet, Digital 21140A Fast Ethernet, Packet Engines G-NIC II Gigabit Ethernet, and FORE PCA-200E ATM. DP effectively streamlines the communication steps and reduces protocol-processing overhead, network buffer management overhead, and process-kernel space transition overhead. Performance tests show that DP achieves low communication latency and high bandwidth with low memory resource consumption.
For the rest of the paper, we first introduce the DP abstraction model in Section 2. Section 3 describes the architectural background and assumptions of our communication model, together with a layout of all model parameters. The performance enhancement techniques inspired by this cost model are discussed in Section 4. In Section 5, we evaluate and discuss the performance characteristics of two DP implementations based on two different Ethernet technologies. In Section 6, we briefly compare DP with other Gigabit communication packages, and the conclusions are given in Section 7.
Section snippets
DP abstraction model
The communication traffic in a cluster is caused by the inter-process communication within a group of cooperating processes, which reside on different nodes to solve a single task. Various communication patterns are used in algorithm design, such as point-to-point, pair-wise data exchange, broadcast tree, and total exchange. A communication abstraction model can be used to describe the inter-process communication patterns during the algorithm design stage. It also serves as a guide to
DP cost model
We propose a communication cost model that can be used as a versatile tool for performance analysis, evaluation, and algorithm design. By providing a set of model parameters that captures the crucial performance characteristics of the cluster network, together with the methodologies to derive those parameters, efficient yet portable algorithms can be designed that are well fitted to the underlying cluster.
A model, in general, is an abstract view of a system or a part of a system, obtained
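Although the snippet above is truncated, the framing given earlier (a communication event abstracted as local and remote data movements, each with a cost function) suggests a LogP-style decomposition, in the spirit of the LogP model cited in the references. As an illustrative formulation only, not the paper's exact parameter set, the end-to-end time for an m-byte message might be written as:

```latex
T(m) = o_s(m) + \frac{m}{B} + L + o_r(m)
```

where $o_s(m)$ and $o_r(m)$ are the host-side send and receive overheads (local data movements through memory and the I/O bus), $B$ is the network bandwidth, and $L$ is the network latency (the remote data movement). Separating host terms from network terms is what lets the model track each technology's speed independently across hardware generations.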
DP system architecture
The cost model discussed in the previous section not only serves as a means of performance analysis and evaluation; it also sheds light on how to improve the communication system architecture. Based on the cost model, we derive the DP system architecture, which incorporates a number of performance enhancement techniques to be discussed in this section.
Performance analysis
We have implemented DP on two Ethernet clusters, one is a Fast Ethernet cluster (FEDP) and the other is a Gigabit Ethernet cluster (GEDP). The FEDP cluster consists of 16 PCs running Linux 2.0.36. Each node is equipped with a 450 MHz Pentium III processor with 512 KB L2 cache and 128 MB of main memory, and uses a Digital 21140A Fast Ethernet adapter for high-speed communication. The whole cluster is connected to a 24-port IBM 8275-326 Fast Ethernet switch which has 5 Gbps backplane capacity. For
Related work
In the past, several prototype cluster communication systems based on Gigabit networking technology have been built. The Genoa Active Message Machine (GAMMA) [6], [8] is an experimental communication system on Fast Ethernet and Gigabit Ethernet. The GAMMA driver uses a mechanism derived from AM called Active Ports [7]. An Active Port is an abstract communication endpoint inside a process through which processes exchange messages with each other. Active Ports differ from DP endpoints in that an active port is
Conclusions
The design of DP exploits the underlying hardware architecture and operating system characteristics to effectively utilize the network and host system resources. We emphasize the use of a realistic communication cost model so that designers can use it as a calibration tool to assess various design tradeoffs. The proposed communication model clearly delineates the characteristics of the communication network and allows the decoupling of communication overheads incurred in the host machine and
References (25)
- A. Bar-Noy, S. Kipnis, Designing broadcasting algorithms in the postal model for message-passing systems, in:...
- D. Becker, A Packet Engines GNIC-II Gigabit Ethernet Driver for Linux....
- Mark Baker (Ed.), Cluster Computing White Paper....
- B.W.L. Cheung, C.L. Wang, K. Hwang, JUMP-DP: a software DSM system with low-latency communication support, in:...
- A. Chien, et al., Design and evaluation of an HPVM-based Windows NT Supercomputer, Int. J. High-perform. Comput. Appl....
- G. Chiola, G. Ciaccio, GAMMA: a low-cost network of workstations based on active messages, in: Proceedings of the Fifth...
- G. Chiola, G. Ciaccio, Active ports: a performance-oriented operating system support to fast LAN communications, in:...
- G. Ciaccio, G. Chiola, GAMMA and MPI/GAMMA on Gigabit Ethernet, in: Proceedings the Seventh EuroPVM-MPI, Balatonfured,...
- D.E. Culler, R.M. Karp, D.A. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, T. von Eicken, LogP:...
- T. Heywood, C. Leopold, Models of parallelism, Technical Report CSR-28-93, Department of Computer Science, University...
☆ This research was supported by Hong Kong RGC Grant HKU 10201701 and HKU University Research Grants 1023009.