An efficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors

doi:10.1016/S1383-7621(00)00007-2

Journal of Systems Architecture

Volume 46, Issue 11, September 2000, Pages 1019-1032

https://doi.org/10.1016/S1383-7621(00)00007-2

Abstract

This paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks. The mechanism is a variation of tree-based multicast with pruning to recover from deadlocks, and it is well suited for distributed shared-memory multiprocessors (DSMs) with hardware cache coherence. It does not require any preprocessing of multicast messages, which notably reduces the software overhead required to send a multicast message. Also, it allows messages to use any deadlock-free routing function. The new scheme has been evaluated by simulation using synthetic loads. It achieves multicast latency reductions of 30% on average. A comparison with other multicast mechanisms confirms its benefits. Finally, it can be easily implemented in hardware with minimal changes to existing unicast wormhole routers.

Introduction

The performance of scalable multiprocessors is often determined by how effectively they support processor communication. In many cases, processor communication is slowed down by insufficient throughput and high latency in the interconnection network. It is important, therefore, to design cost-effective techniques that enhance network throughput and reduce message latency.

Interprocessor communications can be classified into three types depending on the number of message destinations, namely one-to-one (unicast), one-to-many (multicast), and one-to-all (broadcast). Unicast and broadcast can be considered special cases of multicast. Multicast communications routinely appear in parallel programs. Typical examples include explicit distribution of data to several nodes, or invalidation and update messages in distributed shared-memory multiprocessors (DSMs) [10]. Similarly, the inverse of multicast, namely many-to-one messages, is also common. Examples include barrier synchronization and global reductions. It appears, therefore, that optimizing the multicast operation would improve the performance of scalable multiprocessors.

Multicast is supported in hardware by at least one commercial machine. Indeed, the NCube-2 multicomputer supports broadcast and, in addition, multicast messages whose destinations belong to a subcube of the hypercube [16]. This multicast scheme uses a tree-based multicast mechanism to reach nodes within a given subcube. This mechanism, however, is not deadlock-free. Currently, there is no satisfactory commercial solution to hardware multicast.

Efficient support for multicast has been the subject of much previous academic research. The earliest studies [9], [11] proposed optimal tree-based algorithms for multicast routing based on graph-modeling theory. However, aspects like topology, router design and deadlock problems were not considered. In [1], three deadlock-free multicast protocols were presented. These multicast protocols are specifically designed for virtual cut-through networks [8], but no routing algorithm was proposed.

Later, deadlock-freedom was studied for multicast communications in multicomputer networks using wormhole switching [12], [17]. The approach was to define a Hamiltonian path to route multidestination worms while avoiding deadlock. Multicast messages are propagated following one path that visits all destinations without branching at intermediate routers. This type of multicast is called path-based multicast. Routing algorithms like Dual-Path and Multi-Path were proposed for 2-D meshes.

New partially and fully adaptive path-based multicast wormhole routing algorithms called PM, FM, and LD were defined for 2-D meshes [13]. However, the design of deadlock-free adaptive multicast routing algorithms is complex. For this reason, a new theory and methodology for designing deadlock-free adaptive multicast algorithms was proposed in [4], [6]. The theory makes the design of these algorithms easy and efficient. This theory is an extension of a previously proposed theory for deadlock-free unicast routing [5]. Alternative approaches to solve the same problem were proposed in [14], [15].

Other works developed broadcast communication in wormhole networks using a spanning set of dimensional-disjoint paths (SDP) [18]. This scheme uses several phases to transport one message to all the destinations. The two general solutions proposed are the 1-port and the n-port schemes. They are deadlock-free and achieve better performance than path-based multicast.

After these studies, the base routing conformed path (BRCP) model was developed [19]. This is a path-based message passing mechanism that transports multicast and broadcast messages and is deadlock-free. It routes multicast messages using the same routing algorithm as for unicast messages, so routing algorithms like e-cube, planar-adaptive, turn-model, or fully adaptive can be used. Multicast and broadcast messages are carried toward their destinations in several sequential steps using two protocols: hierarchical leader-based (HL) and multiphase greedy (MG). Finally, multidestination messages have been used to optimize barrier synchronization and global reduction [20], [21]. All these schemes use path-based multicast based on the BRCP model.

Most of the work described above for wormhole networks uses path-based multicast. Unfortunately, path-based multicast has several inefficiencies, especially when messages are short. The first inefficiency is that each multicast message needs a message preparation phase to order the destinations. Usually, this preparation phase involves a split-and-order function with a software cost of $O(n \log n)$, where $n$ is the number of destinations. This preparation phase may take more time than the transfer itself. This is the main limitation of path-based multicast schemes when the latency of multicast messages is an important issue.
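To make the cost of the preparation phase concrete, the following is a minimal sketch of a split-and-order step in the style of dual-path multicast on a 2-D mesh. It is an illustration under stated assumptions, not the procedure of any specific cited scheme: the snake-like Hamiltonian labeling and the names `hamiltonian_label` and `split_and_order` are hypothetical. The two sorts are the source of the $O(n \log n)$ software cost.

```python
def hamiltonian_label(x, y, width):
    """Assumed snake-like Hamiltonian labeling of a mesh with `width` columns.

    Even rows are numbered left to right, odd rows right to left, so that
    consecutive labels always belong to neighboring nodes.
    """
    return y * width + (x if y % 2 == 0 else width - 1 - x)


def split_and_order(source, destinations, width):
    """Split destinations around the source label and sort each group.

    Returns (high, low): `high` holds destinations with labels above the
    source in ascending label order, `low` holds those below the source in
    descending label order. Each list is the visiting order of one worm.
    """
    src = hamiltonian_label(*source, width)

    def key(node):
        return hamiltonian_label(*node, width)

    high = sorted((d for d in destinations if key(d) > src), key=key)
    low = sorted((d for d in destinations if key(d) < src), key=key, reverse=True)
    return high, low


# Example on a 4x4 mesh: the source (1, 1) has label 6.
high, low = split_and_order(
    (1, 1), [(0, 0), (3, 0), (2, 1), (0, 2), (3, 3), (0, 1)], width=4
)
```

In this example `high` is `[(0, 1), (0, 2), (3, 3)]` and `low` is `[(2, 1), (3, 0), (0, 0)]`. For short messages, this software sorting step can dominate the total multicast latency.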

The second inefficiency is that, in most of the proposed schemes, path-based multicast does not use a minimal path for all of the destinations of a multicast message. As a result, more network resources are used and network contention increases. The third inefficiency is that, to prevent deadlocks, some path-based multicast mechanisms use a routing function that follows a Hamiltonian path. As a result, unicast routing must use the same routing function and, therefore, it cannot exploit the advantages of other unicast routing functions. This limitation has been removed by the BRCP model. Finally, path-based multicast routing requires several delivery channels at each node to avoid deadlock [19].
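The non-minimal paths mentioned above can be illustrated with a small sketch. It assumes a 2-D mesh with a snake-like Hamiltonian labeling and a label-restricted routing rule in the style of Hamiltonian-path multicast (move to the neighbor whose label is closest to the destination label without passing it). The labeling, the rule as coded here, and all function names are assumptions made for illustration only.

```python
def snake_label(node, width):
    """Assumed Hamiltonian labeling: even rows left to right, odd rows reversed."""
    x, y = node
    return y * width + (x if y % 2 == 0 else width - 1 - x)


def neighbors(node, width, height):
    """Mesh neighbors of `node` that lie inside the width x height grid."""
    x, y = node
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in candidates if 0 <= a < width and 0 <= b < height]


def next_hop(cur, dest, width, height):
    """Pick the neighbor whose label is nearest to dest's without overshooting."""
    lc, ld = snake_label(cur, width), snake_label(dest, width)
    near = neighbors(cur, width, height)
    if ld > lc:
        allowed = [n for n in near if lc < snake_label(n, width) <= ld]
        return max(allowed, key=lambda n: snake_label(n, width))
    allowed = [n for n in near if ld <= snake_label(n, width) < lc]
    return min(allowed, key=lambda n: snake_label(n, width))


def hops_along_path(source, ordered_dests, width, height):
    """Hops travelled by a single worm when it reaches each destination in turn."""
    hops, cur, count = {}, source, 0
    for dest in ordered_dests:
        while cur != dest:
            cur = next_hop(cur, dest, width, height)
            count += 1
        hops[dest] = count
    return hops


def manhattan(a, b):
    """Minimal hop count between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# 4x4 mesh, source (0, 0), destinations already sorted by ascending label.
path_hops = hops_along_path((0, 0), [(3, 0), (0, 1)], width=4, height=4)
```

Here the worm reaches (3, 0) after 3 hops and (0, 1) after 7 hops, although (0, 1) is only `manhattan((0, 0), (0, 1)) == 1` hop away from the source. The extra channels held by the worm are what increases contention.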

In this paper, we propose a new tree-based multicast mechanism that overcomes the limitations of the previously proposed mechanisms. First, the new mechanism does not require an initial ordering of the destinations, which makes the message preparation phase much faster. This does not compromise deadlock-freedom, which is guaranteed by a pruning mechanism. A second advantage is that it can use a minimal path for all the destinations of a multicast message. A third advantage is that it is able to reuse any deadlock-free routing algorithm used by unicast messages. Therefore, the routing flexibility for unicast messages is also available to multicast messages. Finally, tree-based multicast with pruning does not require several delivery channels to guarantee deadlock-freedom.

Simulation results of networks under synthetic loads show that the new scheme can significantly reduce the latency of multicast messages. Furthermore, it achieves higher performance than the other multicast mechanisms when the multicast traffic is composed of short messages, as in DSMs. Finally, the new scheme can be easily implemented in hardware with minimal changes to existing wormhole routers.

The rest of this paper is organized as follows. Section 2 describes the new multicast mechanism. Section 3 analyzes deadlock avoidance. Section 4 explains the additional hardware needed to implement tree-based multicast and its impact on the critical path of a conventional router. Section 5 evaluates the scheme and compares it with other schemes. Finally, Section 6 presents conclusions and future work.

Section snippets

Tree-based multicast with pruning

The new scheme proposed in this paper is named Tree-Based Multicast with Pruning. Tree-based multicast has traditionally been considered a good mechanism for multidestination message routing. This mechanism was successfully used to broadcast and multicast messages in store-and-forward networks. However, with the arrival of wormhole switching, it became very prone to congestion and deadlocks in the interconnection network. As a consequence, other multidestination routing mechanisms like

Deadlock recovery in tree-based multicast with pruning

When the network is lightly loaded, tree-based routing works well for multicast and broadcast messages. For example, it is able to transport multicast messages with low latencies using a minimum number of channels. However, when the network is heavily loaded, the branches generated by tree-based routing increase contention and may cause deadlocks in the interconnection network.
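The branching behavior described above can be sketched as follows. This is a minimal illustration, assuming a 2-D mesh and dimension-order (XY) routing as the base routing function; the scheme itself allows any deadlock-free routing function, and the port names and function names below are hypothetical. The sketch models only branching and does not model pruning or flow control.

```python
from collections import defaultdict

# Assumed port names and the coordinate step each one implies.
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_output_port(cur, dest):
    """Dimension-order routing: correct the X offset first, then the Y offset."""
    (cx, cy), (dx, dy) = cur, dest
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "N" if dy > cy else "S"
    return "LOCAL"


def branch(cur, destinations):
    """Group the destination set by the output port each destination requests."""
    groups = defaultdict(list)
    for dest in destinations:
        groups[xy_output_port(cur, dest)].append(dest)
    return dict(groups)


def tree_channels(source, destinations):
    """Return the set of directed channels used by the multicast tree."""
    channels = set()
    pending = [(source, list(destinations))]
    while pending:
        cur, dests = pending.pop()
        for port, group in branch(cur, dests).items():
            if port == "LOCAL":
                continue  # delivered at this node, no channel needed
            nxt = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])
            channels.add((cur, nxt))
            pending.append((nxt, group))
    return channels


dests = [(2, 0), (2, 1), (1, 2)]
used = tree_channels((0, 0), dests)
```

With source (0, 0), the message branches at node (1, 0) into an eastbound and a northbound copy, and the tree occupies 5 channels in total, whereas three separate unicast messages would traverse 2 + 3 + 3 = 8 channels. Each branch point is also a place where one blocked branch can stall the others, which is the contention problem that the pruning mechanism addresses.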

We propose resolving deadlocks and reducing contention by controlling multicast branches through a pruning mechanism.

Router design

In this section, we describe the hardware extensions required to add support for tree-based multicast routing to a unicast router. A typical unicast router consists of a routing control unit, a switch, and several input and output channels with their corresponding channel controllers. The routing control unit selects the output channel for a message as a function of its destination node, the current node, and the output channel status. In most routers, the routing control unit can only process

Evaluation

We have developed a flit-level simulator for interconnection networks that supports unicast routing, path-based multicast routing and tree-based multicast routing with pruning. Its input parameters are the switching technique (wormhole or virtual cut-through), topology, routing algorithm, message size, message distribution, number of destinations for multicast messages, network size and number of virtual channels. In our experiments, we run several simulations to analyze the behavior of

Conclusions and future work

This paper presents a fast hardware-supported multicast mechanism for wormhole networks. The advantages of the new scheme are that multicast messages do not need a pre-processing step that orders the destinations, messages reach their destinations following minimal paths if the base routing algorithm is minimal, it works for any topology, and it can reuse the routing algorithm for unicast messages. We call the new scheme tree-based multicast with branch pruning. The new scheme is deadlock-free

M.P. Malumbres received his M.S. and Ph.D. degrees in Computer Science from the Technical University of Valencia (UPV), Spain, in 1991 and 1996, respectively. He is currently an assistant professor of Computer Science at the UPV, and his research and teaching activities are related to networked multimedia and high-speed networking.

References (22)

P. Kermani et al., Virtual cut-through: a new computer communication switching technique, Comput. Networks (1979)
G.T. Byrd, N.P. Saraiya, B.A. Delagi, Multicast communication in multiprocessor systems, in: Proceedings of the...
C.M. Chiang, L.M. Ni, Multi-address encoding for multicast, in: Proceedings of the Parallel Computer Routing and...
W.J. Dally et al., Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. Comp. (1987)
J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Trans. Parallel Distrib. Syst. (1993)
J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, in: Proceedings of the Fifth IEEE...
J. Duato, A new theory of deadlock-free adaptive multicast routing in wormhole networks, IEEE Trans. Parallel Distrib. Syst. (1995)
C.J. Glass, L.M. Ni, The turn model for adaptive routing, in: Proceedings of the 19th Annual International Symposium...
Y. Lan, A.H. Esfahanian, L.M. Ni, Multicast in hypercube multiprocessors, J. Parallel Distrib. Comput. (1990)...
D. Lenoski, J. Laudon et al., The Stanford DASH multiprocessor, IEEE Comput. 25 (3) (1992)...
X. Lin, L.M. Ni, Multicast communication in multicomputer networks, in: Proceedings of the International Conference on...


Jose Duato received the M.S. and Ph.D. degrees in electrical engineering from the Technical University of Valencia, Spain, in 1981 and 1985, respectively. He is currently a Professor at the Department of Information Systems and Computer Architecture, Technical University of Valencia, and Adjunct Professor at the Department of Computer and Information Science, The Ohio State University. His current research covers multiprocessor systems, networks of workstations, interconnection networks, and multimedia systems. His theory on deadlock-free adaptive routing for wormhole networks has been used in the design of the routing algorithms for the MIT Reliable Router and the Cray T3E. He coauthored the text “Interconnection Networks: An Engineering Approach” with S. Yalamanchili and L. M. Ni (published by IEEE CS Press). Dr. Duato served as a member of the editorial board of IEEE Transactions on Parallel and Distributed Systems from 1995 to 1997. He has also served on the Program Committees of several major conferences (ICPADS, ICDCS, Europar, HPCA, ICPP, MPPOI, HiPC, PDCS, ISCA, IPPS/SPDP).

^☆: This work was supported by Spanish CICYT under Grant TIC97-0897-C04-01.
