Elsevier

Information Sciences

Volume 163, Issue 4, 18 June 2004, Pages 293-306

Towards scalable collective communication for multicomputer interconnection networks

https://doi.org/10.1016/j.ins.2003.06.014

Abstract

A considerable number of broadcast algorithms have been proposed for the mesh over the past decade. Nonetheless, most of these algorithms do not scale well as the network size increases. As a consequence, most existing broadcast algorithms cannot efficiently support real-world parallel applications whose high computational demands require large-scale systems. Motivated by these observations, this paper proposes the Nearest Side First algorithm (NSF for short) as a new adaptive broadcast algorithm for the mesh. One of the key results is that the performance of the NSF algorithm scales well as the number of processing elements increases, a feature not demonstrated by any previous broadcast algorithm, which enables the proposed algorithm to utilise massively parallel architectures with maximum effectiveness.

Introduction

In distributed-memory paradigms, communication can only be implemented by passing messages between the processing nodes. Routing algorithms determine the routes taken by the messages exchanged through the interconnection network. Clearly, the routing algorithm is critical to the successful use of these paradigms, as it greatly affects overall system performance. Most of these systems must be scalable, i.e. they must be economically deployable in a wide range of sizes and configurations. Incremental scalability refers to the ability of a network to add small numbers of nodes without disruption or routing problems. However, incremental scalability in parallel computing does not depend only on the topology; it also depends heavily on the underlying routing algorithm [5]. As a result, the emphasis of research on routing techniques has shifted over the past few years towards providing scalable collective communication [5].

The mesh, as illustrated in Fig. 1, has been one of the most common large-scale networks for practical multicomputers due to its desirable properties, such as scalability, ease of implementation, recursive structure, and the ability to exploit the communication locality found in many parallel applications to reduce message latency. The J-machine, Caltech Mosaic, Intel Touchstone Delta, Symult 2010 and Stanford DASH are examples of practical systems that are based on the mesh topology [5], [7], [8]. Wormhole switching [12] has also promoted the use of the mesh, as it makes latency almost insensitive to the message distance, especially when the network is lightly loaded, and also simplifies router design due to its minimal buffer requirement. In wormhole switching, a message is divided into elementary units called flits, each of a few bytes, for transmission and flow control. The header flit (containing routing information) governs the route, and the remaining data flits follow it in a pipelined fashion. If a channel transmits the header of a message, it must transmit all the remaining flits of the same message before transmitting flits of another message. When the header is blocked, the data flits are blocked in situ.
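The division of a message into flits described above can be sketched as follows. This is only an illustration: the function name `to_flits`, the 4-byte flit size and the byte-string header are hypothetical choices, not details taken from the paper or from any particular router.

```python
from typing import List


def to_flits(payload: bytes, route_info: bytes, flit_size: int = 4) -> List[bytes]:
    """Split a message into a header flit followed by data flits.

    Assumption: the header flit carries only the routing information and
    every data flit holds at most `flit_size` bytes of payload.
    """
    if flit_size < 1:
        raise ValueError("flit_size must be at least 1")
    # Data flits are consecutive slices of the payload.
    data_flits = [payload[i:i + flit_size] for i in range(0, len(payload), flit_size)]
    # The header flit goes first; the data flits follow it in order,
    # mirroring the pipelined delivery of wormhole switching.
    return [route_info] + data_flits


flits = to_flits(b"abcdefghij", route_info=b"R", flit_size=4)
```

Because the data flits carry no routing information of their own, they can only follow the header, which is why a blocked header leaves the rest of the message stalled in place.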

The broadcast operation is an important collective communication operation, which refers to the delivery of the same message originating from a given source to all network nodes. It is required in many real-world parallel applications found in the areas of science and engineering [2], [3], [10], [16], [17], [18]. For instance, broadcast communication is often required in scientific computations to distribute large data arrays over system nodes in order to perform various data manipulation operations. Furthermore, it is required in control operations such as global synchronisation and to signal changes in network conditions, e.g., faults. In the distributed shared-memory paradigm, broadcast communication is usually used to support the shared-data invalidation and updating procedures required for cache coherence protocols [4]. Intuitively, a practical broadcast algorithm must be deadlock free and capable of broadcasting in a few message-passing steps. Deadlock occurs when messages cannot advance in the network because of cyclic waiting for resources (i.e., buffers and channels) [12].

The message-passing latency includes three components: start-up latency, network latency and blocking latency. The start-up latency is the time required to handle a broadcast message at both the source and destination nodes. The network latency consists of channel propagation and router latencies. Blocking latency accounts for all delays associated with contention for routing resources among the various messages in the network. In current practical machines, start-up latency has been the dominating factor in the cost of communication, i.e., it is typically in the order of several microseconds, whereas network latency is in the order of a few nanoseconds [13]. Blocking latency, on the other hand, depends on the generated traffic pattern and on the routing algorithm itself.

Because start-up latency dominates, a lot of research work has been done to design broadcast routing algorithms that minimise the number of start-ups required to implement broadcast operations [2], [3], [16], [17], [18]. In general, most existing studies [2], [3], [9], [10], [13], [17] have focused on minimising the number of message-passing steps required for collective communication, such as broadcast. However, hardly any study has considered minimising the effects of the network size on the performance of broadcast algorithms. As a result, most existing algorithms do not scale well with the network size, as the number of message-passing steps increases proportionally with the system size. In other words, the larger the underlying interconnection network is, the more severe this limitation becomes. Many parallel applications, such as real-time applications and synchronisation, in which the processors should receive the broadcast message at comparable arrival times, therefore cannot be implemented efficiently. In addition, most of the existing broadcast algorithms in the literature [2], [3], [10], [17], [18] handle broadcast with deterministic routing. Hardly any study has exploited the performance advantages of adaptive routing to develop efficient broadcast algorithms. To address this, the present study proposes a new broadcast algorithm that uses adaptive routing and maintains good performance levels for various system sizes.

Motivated by these observations, this study proposes a new broadcast algorithm for the mesh. The proposed algorithm is based on the Coded Path Routing (CPR for short) approach proposed in [1]. Owing to the properties of the CPR, the proposed broadcast algorithm requires only three message-passing steps to implement a broadcast operation, irrespective of the system size, thus considerably reducing the effects of the start-up latency. Another important feature of the new algorithm is its use of the Turn model, a partially adaptive routing scheme, to provide greater flexibility than deterministic routing in choosing a path for a broadcast message. An extensive comparative analysis presented below reveals that the new broadcast algorithm exhibits superior performance characteristics over the well-known Recursive Doubling and Extended Dominating Node algorithms of [2], [16], respectively.
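To give a feel for what a constant step count means as the mesh grows, the sketch below contrasts the three steps stated above with an idealised doubling broadcast. The doubling model (each informed node informs one new node per step, so ceil(log2 N) steps are needed for N nodes) is an assumption used here for illustration; it is not the step count reported in the paper for Recursive Doubling on the mesh, and the function names are hypothetical.

```python
NSF_STEPS = 3  # stated in the text: three message-passing steps, irrespective of system size


def doubling_broadcast_steps(num_nodes: int) -> int:
    """Steps for an idealised broadcast in which the informed set doubles each step.

    Assumption: equals ceil(log2(num_nodes)); computed with integer
    arithmetic to avoid floating-point rounding.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return (num_nodes - 1).bit_length()


# Compare the two step counts for a few cubic mesh sizes (side x side x side).
comparison = {
    side ** 3: (doubling_broadcast_steps(side ** 3), NSF_STEPS)
    for side in (4, 8, 16)
}
```

Under this model the doubling scheme needs 6, 9 and 12 steps for 64, 512 and 4096 nodes, while the constant count stays at three; since each step incurs start-up latency, the gap widens with the network size.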

Definition 1

Given a source node (Sx,Sy,Sz), a set of destination nodes D̵ such that D̵⊆V, the sending start-up latency α, the number of utilised ports of the system ψ (1⩽ψ⩽6) and the receiving start-up latency γ, we say that (Sx,Sy,Sz) is capable of delivering a message M to D̵ in a single message-passing step if and only if it requires ((ψ*α)+γ) as start-up latency, irrespective of the number of nodes traversed.
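The start-up cost in Definition 1 can be written as a small helper. This is a minimal sketch: the function name and the numeric values are hypothetical, and the time unit is left unspecified.

```python
def single_step_startup(psi: int, alpha: float, gamma: float) -> float:
    """Start-up latency of one message-passing step: (psi * alpha) + gamma.

    psi   -- number of utilised ports, 1 <= psi <= 6
    alpha -- sending start-up latency
    gamma -- receiving start-up latency
    """
    if not 1 <= psi <= 6:
        raise ValueError("psi must satisfy 1 <= psi <= 6")
    return psi * alpha + gamma


# Hypothetical values: all six ports in use, alpha = 1.5 and gamma = 1.0 time units.
cost_all_ports = single_step_startup(psi=6, alpha=1.5, gamma=1.0)
```

Note that the cost depends only on the number of ports used and the two start-up latencies, not on how many nodes the message traverses, which is the point of the definition.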

Definition 2

In the absence of contention in the network, the communication latency at the network level, τ, for a message of length L can generally be estimated as

\[ \tau_{\mathrm{Broadcast}} = M\alpha + \beta D + \beta L + C\mu + \gamma \]

where M is the number of copies of the broadcast message prepared by the source to be injected into the network, α the sending latency for each message, β the time required to transmit a flit on a channel, D the distance between the source and destination of a message, γ the receiving latency, μ the time required to change the message header and C the number of message-passing steps required to deliver the message to all network nodes.

The remainder of this paper is organised as follows. Section 2 briefly describes the CPR. Section 3 presents the system model and introduces the new broadcast algorithm. Section 4 compares the performance of the proposed algorithm to the existing Recursive Doubling and Extended Dominating Node algorithms. Finally, Section 5 concludes this study.
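As a rough illustration of the latency estimate in Definition 2, the function below evaluates Mα + βD + βL + Cμ + γ term by term. The function name and all numeric values are hypothetical and chosen only to show how the terms combine; they are not measurements from the paper.

```python
def broadcast_latency(M: int, alpha: float, beta: float, D: int,
                      L: int, C: int, mu: float, gamma: float) -> float:
    """Contention-free estimate: M*alpha + beta*D + beta*L + C*mu + gamma."""
    startup = M * alpha          # preparing M copies at the source
    distance = beta * D          # header traversing D channels
    pipeline = beta * L          # remaining flits of a length-L message
    header_updates = C * mu      # header changes over C message-passing steps
    return startup + distance + pipeline + header_updates + gamma


# Hypothetical parameters in arbitrary time units.
tau = broadcast_latency(M=6, alpha=1.5, beta=0.003, D=12, L=64, C=3, mu=0.5, gamma=1.0)
startup_share = (6 * 1.5) / tau  # fraction of the estimate due to the M*alpha term
```

With a per-flit time β much smaller than α, as in the hypothetical values above, the Mα term accounts for most of the estimate, which matches the earlier observation that start-up latency dominates.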

Section snippets

Coded path routing (CPR)

This section briefly describes the Coded Path Routing (CPR) approach, which can reduce the overhead due to the start-up latency and the effects of the network size on the performance of collective communication [1]. The CPR exploits the main features of wormhole switching, such as low buffer requirements and distance insensitivity, to overcome the limitations of the existing approaches and to efficiently support collective communications. In the CPR, the header flit has two bits that form the

The proposed “Nearest Side First” broadcast algorithm

While most previous broadcast algorithms for the mesh have been discussed in the context of deterministic routing [3], [12], [13], [16], this section introduces the “Nearest-Side-First” (or NSF for short) algorithm as a new broadcast algorithm that is based on adaptive routing. The NSF uses the turn model discussed in [6] to achieve routing adaptivity while ensuring deadlock freedom (due to space limitations, we omit the description of this routing algorithm. We refer the reader to [6] for more

Performance evaluation

This section compares the performance of the proposed algorithm to the well-known Recursive Doubling [2] and Extended Dominating Node [16] algorithms (both are based on deterministic routing). In what follows, we will use the abbreviations NSF, RD and EDN to refer to the three algorithms, respectively. Firstly, we will compare these algorithms in terms of the number of message-passing steps required. We then conduct a timing analysis to estimate the communication latency

Conclusions and future directions

This paper has proposed a new broadcast algorithm for the mesh network, which overcomes some of the severe limitations of the existing broadcast algorithms. Results obtained from extensive comparative analysis have revealed that the proposed algorithm is capable of implementing broadcast in the mesh with a high degree of scalability and parallelism. Furthermore, it exhibits superior performance characteristics over those of the well-known Recursive Doubling and Extended Dominating Node algorithms.

