Parallel Computing

Volume 26, Issues 2–3, February 2000, Pages 313-332

Architectures and message-passing algorithms for cluster computing: Design and performance

https://doi.org/10.1016/S0167-8191(99)00107-6

Abstract

This paper considers the architecture of clusters and related message-passing (MP) software algorithms and their effect on the performance (speedup and efficiency) of cluster computing (CC). We present new architectures for multi-segment Ethernet clusters and new MP algorithms that fit these architectures. The multiple segments (e.g. commodity hubs) connect commodity processor nodes so as to allow MP to be highly parallelized, avoiding network contention and collisions in the many applications where all-gather and other collective operations are central. We analyze all-gather in some detail and present new network topologies and new MP algorithms that minimize latency. The new topologies are based on a design by Compbionics called two-by-four nets (2×4nets). An integrated MP software system, called Reduced Overhead Cluster Communication (ROCC), which embodies the MP algorithms, is also described. In brief, 2×4nets are networks of "supernodes", called 2×4's, each having four processors on two segments, the segments usually being Ethernet hubs. The supernodes are typically connected to form rings or tori of supernodes. We present actual test results and supporting analyses to demonstrate that 2×4nets with the ROCC MP software are faster than many existing clusters and generally less costly.

Introduction

Cluster computing (CC), as treated here, refers to the organization of a network of workstations or PC nodes to perform parallel-distributed computation (PDC) on a single application problem [1], [7], [22], [23], [24], [25], [26]. The nodes are allocated parts of the problem that can be computed in parallel between certain "synchronization" points. At these points, designated nodes pause in their computation phase and enter a communication phase in which they send other nodes messages containing the intermediate results needed in the next computation phase. CC implements PDC and makes it feasible to execute large cases of an application that would not be practical on a single workstation. Compared to a supercomputer, the low cost of a cluster assembled from commodity components makes it an affordable laboratory instrument for large-scale computer experiments.
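
This alternation of computation and communication phases can be summarized by a simple node-level loop. The sketch below is purely illustrative; compute_local_part, exchange_intermediate_results and converged are hypothetical placeholders standing in for application code and for whatever MP layer (PVM, MPI, ROCC, etc.) the cluster uses.

    /* Schematic compute/communicate cycle of one node in a PDC computation.
       The three helpers are hypothetical placeholders, not library calls. */
    #include <stddef.h>

    extern void compute_local_part(double *x, size_t n);            /* computation phase         */
    extern void exchange_intermediate_results(double *x, size_t n); /* communication phase (MP)  */
    extern int  converged(const double *x, size_t n);               /* global stopping test      */

    void pdc_node_main(double *x, size_t n)
    {
        do {
            compute_local_part(x, n);            /* work on this node's part of the problem   */
            exchange_intermediate_results(x, n); /* synchronization point: send/receive the   */
                                                 /* intermediate results needed next phase    */
        } while (!converged(x, n));
    }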

There are two well-known issues in PDC of a single application problem. The first may be characterized as load balancing and the second as process/processor communication. In CC, communication is done primarily by message-passing (MP) rather than by shared memory. In this paper, we focus on a class of applications whose structure suggests a natural load balancing. Our main interest is therefore in the communication issues, specifically in the context of nodes communicating on an Ethernet network. However, some of our analysis applies to clusters using other communication media, for example high-speed switchboards of the IBM SP2 type.

Several recently developed general environments for interprocess communication apply to clusters, for example Linda, MPICH, p4, PVM and TCGMSG. For a more complete list that includes the more recent implementations Active Messages, CHIMP, Express, LAM, MPICH, NXlib, PICL and others, see [9]. Linda [1] and PVM [2] are based on UDP sockets, although PVM provides an option to use TCP/IP sockets. p4 [3] and TCGMSG [4] are based on Unix TCP/IP sockets, as is the ROCC [24] MP system described herein. p4 is one of the parallel environment kernels for MPICH [12], which is a higher-level implementation of the MP standard MPI [5]. This paper is not concerned with distributed networks like NOW. It deals rather with local clusters comparable to the Beowulf [21], [22], [23], Loki [17] and LoBoS [19] clusters. The first Beowulf cluster had two Ethernet segments; Loki has a hypercube topology and LoBoS uses ring topologies. More recent Beowulf-derivative clusters use a mesh-like network [23] or switches [20].
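
As a point of reference for the socket-based systems above, the fragment below sketches how a length-prefixed message can be sent over a connected TCP/IP socket. It is illustrative only and is not code from ROCC, p4 or TCGMSG; mp_send and write_all are names chosen here for the example.

    /* Minimal sketch of length-prefixed message passing over a connected
       TCP/IP socket, in the spirit of socket-based MP layers; illustrative only. */
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <arpa/inet.h>

    /* Write exactly len bytes to fd; return 0 on success, -1 on error. */
    static int write_all(int fd, const void *buf, size_t len)
    {
        const char *p = buf;
        while (len > 0) {
            ssize_t k = write(fd, p, len);
            if (k <= 0) return -1;
            p += k;
            len -= (size_t)k;
        }
        return 0;
    }

    /* Send one message: a 4-byte network-order length header, then the payload. */
    int mp_send(int sockfd, const void *msg, uint32_t nbytes)
    {
        uint32_t hdr = htonl(nbytes);
        if (write_all(sockfd, &hdr, sizeof hdr) != 0) return -1;
        return write_all(sockfd, msg, nbytes);
    }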

In this paper, we present new and different topologies with multi-segment Ethernet architectures that facilitate highly parallel MP and avoid network contention and collisions in collective MP operations like all-gather. This reduces communication time, a critical factor in many parallel-distributed computations. Beowulf and other clusters have demonstrated the potential of CC. However, high-performance CC is not just a matter of a large number of processor nodes; it also depends on the MP software system as well as the network hardware topology. The new cluster topologies and MP software described here yield increases in the performance (speed and efficiency) of CC. They are based on a patented design, called 2×4nets, by Compbionics [24] and its integrated MP software system, Reduced Overhead Cluster Communication (ROCC) [24]. The topologies of 2×4nets and the ROCC MP algorithms are described in the sections which follow. Some early test results on small 2×4nets, reported in [6], demonstrate improvements in efficiency over existing clusters. In this paper, we present analyses which show that this improved performance will scale to large 2×4nets.
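
To make the all-gather pattern concrete, the sketch below shows the standard ring all-gather algorithm, written against the MPI interface [5] only for illustration; it is not the ROCC implementation, and the function name and block size are choices made here. In each step the exchanges involve disjoint neighbor pairs, which is the kind of traffic a multi-segment layout can carry in parallel without collisions when neighbors sit on separate segments.

    /* Standard ring all-gather: in step s, each node forwards the block it
       received in step s-1 to its right neighbor and receives a new block
       from its left neighbor. Illustrative MPI code, not the ROCC system. */
    #include <mpi.h>
    #include <string.h>

    /* Gather blocksize doubles from every rank into recvbuf[p * blocksize]. */
    void ring_allgather(const double *sendblk, double *recvbuf,
                        int blocksize, MPI_Comm comm)
    {
        int p, r;
        MPI_Comm_size(comm, &p);
        MPI_Comm_rank(comm, &r);

        int left  = (r - 1 + p) % p;   /* neighbor we receive from */
        int right = (r + 1) % p;       /* neighbor we send to      */

        /* Place our own block in its slot of the result buffer. */
        memcpy(recvbuf + r * blocksize, sendblk, (size_t)blocksize * sizeof(double));

        /* p-1 steps: pass the most recently received block around the ring. */
        for (int s = 0; s < p - 1; ++s) {
            int send_idx = (r - s + p) % p;       /* block leaving this node  */
            int recv_idx = (r - s - 1 + p) % p;   /* block arriving this node */
            MPI_Sendrecv(recvbuf + send_idx * blocksize, blocksize, MPI_DOUBLE, right, 0,
                         recvbuf + recv_idx * blocksize, blocksize, MPI_DOUBLE, left,  0,
                         comm, MPI_STATUS_IGNORE);
        }
    }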

An outline of the rest of the paper is as follows. In Section 2.1, we review some basic elements of cluster network architecture and some basic MP timing formulas. In Section 2.2, we describe the new cluster topologies of 2×4nets and introduce the associated ROCC MP software system. In Section 3, we carry out an analysis of the MPI all-gather operation, first for some well-known cluster topologies and MP algorithms and then for 2×4nets using the ROCC MP system. In Section 4, we analyze all-gather for switch-connected clusters like the SP2. In Section 5, we solve some special cluster optimization problems. In Section 6, we present our conclusions regarding cluster performance.

Section snippets

Basic elements of cluster topologies and MP timing

All the clusters we consider use commodity Ethernet hardware, e.g. network interface cards (NICs) and hubs/switches. For comparison, we also consider the SP2 as an exception which uses a special (non-commodity) switchboard. The clusters can be divided into three basic types: single-segment, multi-segment and switched. A segment is a shared communication medium such as an Ethernet UTP cable or hub. Hubs may be intelligent and/or segmentable. Ethernet switches may be viewed as providing multiple
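
The basic MP timing formulas themselves do not appear in this snippet; for orientation, a common first-order model, assumed here only for illustration, charges a message of m bytes on a segment

    T(m) = \alpha + m/B,

where \alpha is the per-message startup latency (software overhead plus medium-access time) and B is the segment bandwidth. Because a shared Ethernet segment carries only one frame at a time, k messages that must traverse the same segment cost roughly k T(m), whereas messages on different segments can overlap; this is the effect the multi-segment topologies exploit.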

Performance analysis of the all-gather operation

Performance evaluation of PC/workstation clusters has mainly concentrated on comparing the bandwidth and latency of various MP libraries [14], [15]. Unlike the communication operations send, receive, broadcast, gather and scatter, the all-gather operation is a collective one in which all the nodes send as well as receive messages [5], [6], [12]. Thus benchmarking the all-gather operation tests overall network throughput when multiple nodes are simultaneously or nearly simultaneously
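
A benchmark of this kind can be written directly against the MPI interface [5], [12]. The sketch below times repeated calls to MPI_Allgather; it is illustrative only, and the block size and repetition count are arbitrary choices, not those of the tests reported here.

    /* Time MPI_Allgather for a fixed block size on all ranks (illustrative). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int p, r;
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &r);

        const int blocksize = 1024;   /* doubles contributed per node (arbitrary) */
        const int reps = 100;
        double *sendbuf = malloc(blocksize * sizeof(double));
        double *recvbuf = malloc((size_t)p * blocksize * sizeof(double));
        for (int i = 0; i < blocksize; ++i) sendbuf[i] = r + i;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int k = 0; k < reps; ++k)
            MPI_Allgather(sendbuf, blocksize, MPI_DOUBLE,
                          recvbuf, blocksize, MPI_DOUBLE, MPI_COMM_WORLD);
        double t = (MPI_Wtime() - t0) / reps;

        if (r == 0)
            printf("all-gather of %d doubles/node on %d nodes: %g s\n",
                   blocksize, p, t);

        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }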

Switch-connected clusters

Commodity Ethernet switches (e.g. from 3COM, Intel, Nortel) may provide parallel MP between many pairs of nodes if the nodes' sends and receives are suitably programmed. However, switches introduce a "switch latency" that can add significantly to the time needed for a collective operation like all-gather. Before analyzing all-gather on commodity switches, it is instructive to consider all-gather on the non-commodity SP2 switch [13], for which we have some actual test data.
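
For orientation (the paper's own formulas are not reproduced in this snippet), the latency-bandwidth model of Section 2.1 can be extended with a per-message switch latency s, so that one point-to-point transfer of m bytes through the switch costs roughly

    \alpha + s + m/B.

If the switch can carry all p concurrent neighbor transfers of a ring all-gather at once, the p-1 steps then take approximately (p-1)(\alpha + s + m/B), which shows how even a modest s is multiplied by the number of steps of a collective operation.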

An algorithm is described in [18]

Cluster performance and optimizing cluster size

The performance of a cluster is a complicated issue, a function of many interrelated factors. These factors include the application (its algorithm and problem size), the cluster architecture, the MP software, and the network and node hardware characteristics.

We calculate some performance numbers using the all-gather time formulas developed in Section 3 and the analysis methods given below. We apply them to vector iterative (VI) computations in which there are O(n²)
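
For a vector iterative computation whose per-iteration arithmetic grows as O(n²), a standard accounting, assumed here only for illustration (the constants c and t_f below are generic, not values from this paper), is

    T_p \approx \frac{c\,n^2}{p}\,t_f + T_{\mathrm{ag}}(p, n/p), \qquad
    S_p = \frac{T_1}{T_p}, \qquad
    E_p = \frac{S_p}{p},

where t_f is the time per floating-point operation, c an application-dependent constant, and T_ag(p, n/p) the all-gather time for p blocks of n/p vector components. Efficiency remains high as long as the all-gather term stays small relative to the per-node computation time, which is where the formulas of Section 3 enter.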

Conclusion

There are several conclusions that can be drawn from the results presented in the preceding sections.

Our analysis, in Section 3, of MP times of the all-gather operation in clusters of various multi-segment topologies clearly shows that the latency is significantly reduced by the 2×4net ring topologies of supernodes compared to the corresponding topologies of single nodes. See formulas , , and (8). Further reductions in all-gather latency are obtained in the 2×4net 2D torus (formulas , ) and in

For further reading

[8], [27], [29], [30], [31], [33].

References (33)

  • T.F. Smith et al., Identification of common molecular subsequences, J. Mol. Biol. (1981)
  • N. Carriero et al., How To Write Parallel Programs, A First Course (1990)
  • A. Geist et al., PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Network Parallel Programming (1994)
  • R. Butler, E. Lusk, User's Guide to the p4 Programming System, Argonne National Lab., ...
  • R.J. Harrison, Portable tools and applications for parallel computers, J. Quantum Chem. (1991)
  • M. Snir et al., MPI: The Complete Reference (1996)
  • E.K. Blum et al., Parallel execution of iterative algorithms on workstation clusters, J. Parallel and Distributed Computation (1996)
  • D.H. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo, M. Yarrow, The NAS parallel benchmarks 2.0, Report ...
  • E.K. Blum, P.K. Leung, A computer model and simulation of lower neural circuits for motor control of a simple ...
  • E.K. Blum, P.K. Leung, Modeling and simulation of human walking: Using a neuro-musculo-skeletal model to design a ...
  • W.R. Stevens, UNIX Network Programming (1990)
  • M.J. Burgard et al., DOS-UNIX Networking and Internetworking (1994)
  • MPICH: A Portable Implementation of MPI, Math. and Comp. Sci. Div., Argonne National Lab., ...
  • IBM AIX Parallel Environment: Parallel Programming Subroutine Reference, Release 2.1, December ...
  • N. Nupatiroj, L.M. Ni, Performance evaluation of some MPI implementations on workstation clusters, in: Proceedings of ...
  • C.C. Douglas, T.G. Mattson, M.H. Schultz, Parallel programming systems for workstation clusters, Research Report TR-975 ...