An efficient implementation of tree-based multicast routing for distributed shared-memory multiprocessors☆
Introduction
The performance of scalable multiprocessors is often determined by how effectively they support processor communication. In many cases, processor communication is slowed down by insufficient throughput and high latency in the interconnection network. It is important, therefore, to design cost-effective techniques that enhance network throughput and reduce message latency.
Interprocessor communications can be classified into three types depending on the number of message destinations, namely one-to-one (unicast), one-to-many (multicast), and one-to-all (broadcast). Of these schemes, unicast and broadcast can be considered special cases of multicast. Multicast communications routinely appear in parallel programs. Typical examples include explicit distribution of data to several nodes or invalidation and update messages in distributed shared-memory multiprocessors (DSMs) [10]. Similarly, the inverse of multicast, namely many-to-one messages, is also common. Examples include barrier synchronization and global reductions. It appears, therefore, that optimizing the multicast operation would improve the performance of scalable multiprocessors.
Multicast is supported in hardware by at least one commercial machine. Indeed, the NCube-2 multicomputer supports broadcast and, in addition, multicast messages whose destinations belong to a subcube of the hypercube [16]. This multicast scheme uses a tree-based multicast mechanism to reach nodes within a given subcube. This mechanism, however, is not deadlock-free. Currently, there is no satisfactory commercial solution to hardware multicast.
Efficient support for multicast has been the subject of much previous academic research. The earliest studies [9], [11] proposed optimal tree-based algorithms for multicast routing based on graph-modeling theory. However, aspects like topology, router design and deadlock problems were not considered. In [1], three deadlock-free multicast protocols were presented. These multicast protocols are specially designed for virtual cut-through networks [8] but no routing algorithm was proposed.
Later, deadlock-freedom was studied for multicast communications in multicomputer networks using wormhole switching [12], [17]. The approach was to define a Hamiltonian path to route multidestination worms while avoiding deadlock. Multicast messages are propagated following one path that visits all destinations without branching at intermediate routers. This type of multicast is called path-based multicast. Routing algorithms like Dual-Path and Multi-Path were proposed for 2-D meshes.
New partially and fully adaptive path-based multicast wormhole routing algorithms called PM, FM, and LD were defined for 2-D meshes [13]. However, the design of deadlock-free adaptive multicast routing algorithms is complex. For this reason, a new theory and methodology for designing deadlock-free adaptive multicast algorithms was proposed in [4], [6]. The theory makes the design of these algorithms easy and efficient. This theory is an extension of a previously proposed theory for deadlock-free unicast routing [5]. Alternative approaches to solve the same problem were proposed in [14], [15].
Other works developed broadcast communication in wormhole networks using a spanning set of dimensional-disjoint paths (SDP) [18]. This scheme uses several phases to transport one message to all the destinations. Two general solutions were proposed, the 1-port and the n-port. They are deadlock-free and achieve better performance than path-based multicast.
After these studies, the base routing conformed path (BRCP) model was developed [19]. This is a new path-based message passing mechanism that transports multicast and broadcast messages and is deadlock-free. This mechanism routes multicast messages using the same routing algorithm as for unicast messages, so routing algorithms like e-cube, planar-adaptive, turn-model, or fully adaptive can be used. Multicast and broadcast messages are carried toward their destinations in several sequential steps using two protocols: Hierarchical leader-based (HL) and multiphase greedy (MG). Finally, multi-destination messages have been used to optimize barrier synchronization and global reduction [20], [21]. All these schemes use path-based multicast based on the BRCP model.
Most of the work described above for wormhole networks uses path-based multicast. Unfortunately, path-based multicast has several inefficiencies, especially when messages are short. The first inefficiency is that each multicast message needs a message preparation phase to order the destinations. Usually, this preparation phase involves a split-and-order function with a software cost of O(n log n), where n is the number of destinations. This preparation phase may take more time than the transfer itself. This is the main limitation of path-based multicast schemes when the latency of multicast messages is an important issue.
The second inefficiency is that, in most of the proposed schemes, path-based multicast does not use a minimal path for all of the destinations of a multicast message. As a result, more network resources are used and network contention increases. The third inefficiency is that, to prevent deadlocks, some path-based multicast mechanisms use a routing function that follows a Hamiltonian path. As a result, unicast routing must use the same routing function and, therefore, it cannot exploit the advantages of other unicast routing functions. This limitation has been removed by the BRCP model. Finally, path-based multicast routing requires several delivery channels at each node to avoid deadlock [19].
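To make the cost of the preparation phase concrete, the following is a minimal sketch of a split-and-order step of the kind used by dual-path style schemes. It assumes a snake-like Hamiltonian labeling of a 2-D mesh; the function names, the labeling and the `width` parameter are illustrative assumptions, not definitions taken from this paper. The two sorts are what make the software cost grow with the number of destinations.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) coordinates in a 2-D mesh (assumed layout)


def hamiltonian_label(node: Node, width: int) -> int:
    """Snake-like label: even rows run left to right, odd rows right to left."""
    x, y = node
    return y * width + (x if y % 2 == 0 else width - 1 - x)


def split_and_order(source: Node, dests: List[Node],
                    width: int) -> Tuple[List[Node], List[Node]]:
    """Split destinations around the source label and order each half.

    Hypothetical helper: one worm visits the 'high' list in ascending label
    order, the other visits the 'low' list in descending label order.
    """
    src = hamiltonian_label(source, width)

    def key(n: Node) -> int:
        return hamiltonian_label(n, width)

    high = sorted((d for d in dests if key(d) > src), key=key)
    low = sorted((d for d in dests if key(d) < src), key=key, reverse=True)
    return high, low


if __name__ == "__main__":
    high, low = split_and_order((1, 1), [(0, 0), (3, 0), (3, 1), (0, 2)], 4)
    print(high, low)
```

This work must be done in software at the source before the first flit is injected, which is why it weighs most heavily on short messages.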
In this paper, we propose a new tree-based multicast mechanism that overcomes the limitations of the previously proposed mechanisms. First, the new mechanism does not require an initial ordering of the destinations. This makes the message preparation phase much faster. This, however, does not affect deadlock-freedom: a pruning mechanism guarantees deadlock-freedom. A second advantage is that it can use a minimal path for all the destinations of a multicast message. A third advantage is that it is able to reuse any deadlock-free routing algorithm used by unicast messages. Therefore, the routing flexibility for unicast messages is also available to multicast messages. Finally, tree-based multicast with pruning does not require several delivery channels to guarantee deadlock-freedom.
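The idea of reusing the unicast routing function to build the multicast tree can be sketched as follows. This is only an illustration under stated assumptions: it uses dimension-order (XY) routing on a 2-D mesh as the base unicast algorithm, the port names and helper functions are hypothetical, and the pruning mechanism that the paper relies on for deadlock-freedom is not modeled here. At each router the destination set is simply partitioned by the output port that the unicast function would choose, so no prior ordering of destinations is required.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Node = Tuple[int, int]  # (x, y) coordinates in a 2-D mesh (assumed layout)


def xy_next_port(current: Node, dest: Node) -> str:
    """Dimension-order unicast routing: correct the X offset first, then Y."""
    (cx, cy), (dx, dy) = current, dest
    if dx != cx:
        return "E" if dx > cx else "W"
    if dy != cy:
        return "N" if dy > cy else "S"
    return "LOCAL"  # the message has arrived at this node


def branch(current: Node, dests: Iterable[Node]) -> Dict[str, List[Node]]:
    """Group destinations by the output port the unicast function selects.

    Each group would travel as one branch of the multicast tree.
    """
    ports: Dict[str, List[Node]] = defaultdict(list)
    for d in dests:
        ports[xy_next_port(current, d)].append(d)
    return dict(ports)


if __name__ == "__main__":
    print(branch((1, 1), [(3, 1), (1, 3), (0, 0), (1, 1), (2, 2)]))
```

Because XY routing is minimal, every destination in this sketch is reached along a minimal path; swapping in another deadlock-free unicast function would only change `xy_next_port`.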
Simulation results of networks under synthetic loads show that the new scheme can significantly reduce the latency of multicast messages. Furthermore, it outperforms the other multicast mechanisms when the multicast traffic is composed of short messages, as in DSMs. Finally, the new scheme can be easily implemented in hardware with minimal changes to existing wormhole routers.
The rest of this paper is organized as follows. Section 2 describes the new multicast mechanism. Section 3 analyzes deadlock avoidance. Section 4 explains the additional hardware needed to implement tree-based multicast and its impact on the critical path of a conventional router. Section 5 evaluates the scheme and compares it with other schemes. Finally, Section 6 presents conclusions and future work.
Section snippets
Tree-based multicast with pruning
The new scheme proposed in this paper is named Tree-Based Multicast with Pruning. Tree-based multicast has traditionally been considered a good mechanism for multidestination message routing. This mechanism was successfully used to broadcast and multicast messages in store-and-forward networks. However, with the arrival of wormhole switching, it became very prone to congestion and deadlocks in the interconnection network. As a consequence, other multidestination routing mechanisms like
Deadlock recovery in tree-based multicast with pruning
When the network is lightly loaded, tree-based routing works well for multicast and broadcast messages. For example, it is able to transport multicast messages with low latencies using a minimum number of channels. However, when the network is heavily loaded, the branches generated by tree-based routing increase contention and may cause deadlocks in the interconnection network.
We propose resolving deadlocks and reducing contention by controlling multicast branches through a pruning mechanism.
Router design
In this section, we describe the hardware extensions required to add support for tree-based multicast routing to a unicast router. A typical unicast router consists of a routing control unit, a switch, and several input and output channels with their corresponding channel controllers. The routing control unit selects the output channel for a message as a function of its destination node, the current node, and the output channel status. In most routers, the routing control unit can only process
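The selection performed by a unicast routing control unit, as a function of the destination node, the current node and the output channel status, can be sketched as below. This is a minimal software model under assumptions: a 2-D mesh, minimal adaptive candidate ports, and a dictionary standing in for the channel status signals. None of these names come from the router described in this paper.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, int]  # (x, y) coordinates in a 2-D mesh (assumed layout)


def minimal_ports(current: Node, dest: Node) -> List[str]:
    """Output ports that move the message closer to its destination."""
    (cx, cy), (dx, dy) = current, dest
    ports: List[str] = []
    if dx > cx:
        ports.append("E")
    elif dx < cx:
        ports.append("W")
    if dy > cy:
        ports.append("N")
    elif dy < cy:
        ports.append("S")
    return ports or ["LOCAL"]  # no offset left: deliver locally


def select_output(current: Node, dest: Node,
                  channel_free: Dict[str, bool]) -> Optional[str]:
    """Return the first free candidate port, or None if the header must wait."""
    for port in minimal_ports(current, dest):
        if channel_free.get(port, False):
            return port
    return None


if __name__ == "__main__":
    print(select_output((0, 0), (2, 2), {"E": False, "N": True}))
```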
Evaluation
We have developed a flit-level simulator for interconnection networks that supports unicast routing, path-based multicast routing and tree-based multicast routing with pruning. Its input parameters are the switching technique (wormhole or virtual cut-through), topology, routing algorithm, message size, message distribution, number of destinations for multicast messages, network size and number of virtual channels. In our experiments, we run several simulations to analyze the behavior of
Conclusions and future work
This paper presents a fast hardware-supported multicast mechanism for wormhole networks. The advantages of the new scheme are that multicast messages do not need a pre-processing step that orders the destinations, messages reach their destinations following minimal paths if the base routing algorithm is minimal, it works for any topology, and it can reuse the routing algorithm for unicast messages. We call the new scheme tree-based multicast with branch pruning. The new scheme is deadlock-free
M.P. Malumbres received his M.S. and Ph.D. degrees in Computer Science from the Technical University of Valencia (UPV), Spain, in 1991 and 1996, respectively. He is currently an assistant professor of Computer Science at the UPV and his research and teaching activities are related to networked multimedia and high-speed networking.
References (22)
- et al., Virtual Cut-through: a new computer communication switching technique, Comput. Networks (1979)
- G.T. Byrd, N.P. Saraiya, B.A. Delagi, Multicast communication in multiprocessor systems, in: Proceedings of the...
- C.M. Chiang, L.M. Ni, Multi-address encoding for multicast, in: Proceedings of the Parallel Computer Routing and...
- et al., Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. Comp. (1987)
- A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Trans. Parallel Distrib. Syst. (1993)
- J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, in: Proceedings of the Fifth IEEE...
- A new theory of deadlock-free adaptive multicast routing in wormhole networks, IEEE Trans. Parallel Distrib. Syst. (1995)
- C.J. Glass, L.M. Ni, The turn model for adaptive routing, in: Proceedings of the 19th Annual International Symposium...
- Y. Lan, A.H. Esfahanian, L.M. Ni, Multicast in hypercube multiprocessors, J. Parallel Distrib. Comput. (1990)...
- D. Lenoski, J. Laudon et al., The Stanford DASH multiprocessor, IEEE Comput. 25 (3) (1992)...
Jose Duato received the M.S. and Ph.D. degrees in electrical engineering from the Technical University of Valencia, Spain, in 1981 and 1985, respectively. He is currently a Professor at the Department of Information Systems and Computer Architecture, Technical University of Valencia, and Adjunct Professor at the Department of Computer and Information Science, The Ohio State University. His current research covers multiprocessor systems, networks of workstations, interconnection networks, and multimedia systems. His theory on deadlock-free adaptive routing for wormhole networks has been used in the design of the routing algorithms for the MIT Reliable Router and the Cray T3E. He coauthored the text “Interconnection Networks: An Engineering Approach” with S. Yalamanchili and L. M. Ni (published by IEEE CS Press). Dr. Duato served as a member of the editorial board of IEEE Transactions on Parallel and Distributed Systems from 1995 to 1997. He has also served on the Program Committee of several major conferences (ICPADS, ICDCS, Europar, HPCA, ICPP, MPPOI, HiPC, PDCS, ISCA, IPPS/SPDP).
☆ This work was supported by Spanish CICYT under Grant TIC97-0897-C04-01.